Whoa. UTF-8 is older now than ASCII was when UTF-8 was invented.

sikorski@mstdn.science

mxk@hachyderm.io

@vathpela @tek I would argue that in modern times this really shouldn't be an issue to be concerned about. It's not like telnet and plain serial connections are still the most central communication protocols. And if your storage is causing bit flips, you have bigger issues than readable plain text.

mo@mastodon.ml

@vathpela IMHO, redundancy and/or checksums should be implemented on a different layer, not in the text encoding

Like, there are many, many ways to keep bits from corrupting, which are applicable in different cases
And forcing one particular method into the text encoding itself is... meh

Same for compression btw. For some texts (CJK in particular) UTF-8 is sub-optimal, but even basic deflate makes it compact enough

TL;DR: UTF-8 is not perfect, but having one encoding for every text outweighs its flaws

@tek
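A rough sketch of the compression point, using Python's standard `zlib` module (which implements deflate). The sample sentence is made up for illustration, and repeating it exaggerates how well it compresses, so treat the numbers as a direction rather than a benchmark:

```python
import zlib

# Hypothetical sample: a short CJK sentence, repeated so deflate has
# something to work with. Real text will compress less dramatically.
sample = "統一碼讓所有文字共用一種編碼。" * 40

raw_utf8 = sample.encode("utf-8")       # 3 bytes per character here
raw_utf16 = sample.encode("utf-16-le")  # 2 bytes per character here

# zlib.compress applies deflate with the default compression level.
packed_utf8 = zlib.compress(raw_utf8)
packed_utf16 = zlib.compress(raw_utf16)

print("raw UTF-8:      ", len(raw_utf8))
print("raw UTF-16:     ", len(raw_utf16))
print("deflated UTF-8: ", len(packed_utf8))
print("deflated UTF-16:", len(packed_utf16))
```

Uncompressed, UTF-8 is 1.5 times the size of UTF-16 for this kind of text; after deflate the gap between the two mostly stops mattering.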

tek@freeradical.zone

@mo @vathpela Also, UTF-8 is trivially easy to synchronize. If you delete a byte out of the middle of a file, at most you’ll lose the one affected character (well, code point). The ones before and after it will be fine. That’s not true of some other Unicode encodings, like double width ones where everything after would be out of sync.
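A small sketch of that resynchronization in Python. It assumes UTF-16 is a fair stand-in for the "double width" encodings mentioned above; the sample string and the `drop_byte` helper are made up for the demo:

```python
def drop_byte(data: bytes, index: int) -> bytes:
    """Return data with the single byte at index removed."""
    return data[:index] + data[index + 1:]

text = "naïve café 日本語"

# UTF-8: "ï" is the two bytes C3 AF at offsets 2 and 3; remove the second.
utf8 = text.encode("utf-8")
damaged8 = drop_byte(utf8, 3).decode("utf-8", errors="replace")

# UTF-16-LE: "ï" starts at offset 4; remove its first byte.
utf16 = text.encode("utf-16-le")
damaged16 = drop_byte(utf16, 4).decode("utf-16-le", errors="replace")

print(damaged8)   # only the "ï" is replaced by U+FFFD
print(damaged16)  # "na" survives, everything after it is misaligned
```

The UTF-8 decoder recovers at the very next byte because lead bytes and continuation bytes use disjoint bit patterns. The UTF-16 decoder has no such marker, so every code unit after the gap is read one byte off.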

fabian@mainz.social

@tek Still, I am regularly confronted with IT systems that do not (properly) support it and display the umlaut in my name incorrectly.

madduci@mastodon.social

@tek and it is still being handled wrongly in many places

root42@chaos.social

@tek This! UTF-8 is a great encoding. Unicode can be a mess at times though.

debaer@23.social

@tek But UTF-EBCDIC is still younger than EBCDIC was when UTF-EBCDIC was invented.

djl@mastodon.mit.edu

@vathpela @tek

Nah. It stopped sucking when Unicode became variable-width even in a 32-bit encoding. Or at least it stopped being valid to point out that it sucks, since there now isn't anything that doesn't.

ahltorp@mastodon.nu

@mxk @vathpela @tek I don’t know any way to run telnet over a non-checksummed connection.

timwardcam@c.im

@tek Every now and then the Cambridge CST exam papers include a question like "explain why even experienced programmers sometimes have problems with character codes".

You could write pretty well anything you liked.

Originally what was expected was an essay about things like escape sequences on Flexowriter tapes; in my day it was about conversion between EBCDIC and ASCII; these days it might be about obscure characters in URLs.

mansr@society.oftrolls.com

@mo @vathpela @tek Variable length encoding adds a little complexity at the input and output stages, but I think the benefits outweigh that, especially the 8-bit compatibility that allows a lot of software to work (at least to some extent) unmodified.
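A minimal sketch of why byte-oriented code often keeps working, assuming a made-up path as input: every byte of a multi-byte UTF-8 sequence has its high bit set, so an ASCII delimiter such as `/` can never show up inside a non-ASCII character.

```python
# Hypothetical path mixing ASCII separators with non-ASCII names.
path = "home/jürgen/文書/résumé.txt"
raw = path.encode("utf-8")

# Split on the raw bytes, the way encoding-unaware software would.
parts = [piece.decode("utf-8") for piece in raw.split(b"/")]
print(parts)

# Collect all bytes that belong to multi-byte sequences.
multibyte = [
    byte
    for ch in path
    if len(ch.encode("utf-8")) > 1
    for byte in ch.encode("utf-8")
]
# None of them falls in the ASCII range 0x00-0x7F.
print(all(byte >= 0x80 for byte in multibyte))
```

Splitting the bytes gives the same pieces as splitting the decoded string, which is the "works unmodified, at least to some extent" part: the software never needs to know the text is not ASCII.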

jaddle@toot.community

@tek
And yet, my bank still won't let me add a contact (for etransfers) with an accent in their name.

enno@mastodon.gamedev.place

@tek @loke @vathpela there is a BOM defined for UTF-8, as pointless as that may seem, and it's screwing up that whole beautiful ASCII compatibility whenever someone uses it.
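A quick illustration in Python, using the standard `utf-8-sig` codec, which writes the BOM on encode and strips it on decode. The shebang line is just an example of ASCII-only content that something downstream inspects byte by byte:

```python
import codecs

script = "#!/bin/sh\necho hi\n"

plain = script.encode("utf-8")         # identical to the ASCII bytes
with_bom = script.encode("utf-8-sig")  # same bytes, prefixed with EF BB BF

# The BOM is three extra bytes in front of otherwise pure ASCII.
print(with_bom[:3] == codecs.BOM_UTF8)

# A byte-level check for the shebang now fails.
print(plain.startswith(b"#!"), with_bom.startswith(b"#!"))

# A plain UTF-8 decode keeps the BOM as a U+FEFF character;
# "utf-8-sig" strips it, and also accepts input without one.
as_utf8 = with_bom.decode("utf-8")
as_sig = with_bom.decode("utf-8-sig")
print(repr(as_utf8[0]), as_sig == script)
```

Reading with `utf-8-sig` is a reasonable defensive habit when input may come from tools that add the BOM; it does nothing if the BOM is absent.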

alper@rls.social

@tek MySQL will still happily mangle it.

loke@functional.cafe

@enno @tek @vathpela I'd go as far as saying it's actively harmful. There are exactly zero cases when it's useful, and it will actively mess things up in most cases.

But, of course, Windows applications tend to add them at times.

vathpela@infosec.exchange

@glent @ahltorp @mxk @tek do y'all just not believe people still have to deal with actual UARTs, or what?

mxk@hachyderm.io

@vathpela @glent @ahltorp @tek I do work with actual UARTs, but only for debugging purposes as a fallback when ssh fails.
That doesn't stop me from considering UTF-8 a net benefit.

ahltorp@mastodon.nu

@vathpela @glent @mxk But even if it’s raw UART with no layer in between, it’s no more of a problem than with ASCII or ISO 8859, if you don’t count the larger surface area of a wide character, which is sort of unavoidable.

vathpela@infosec.exchange

@mxk @glent @ahltorp @tek I agree, but I also think it could and should have improved.
