Another look at #Rust, another two disappointments:
The standard library specifies that its string type is encoded in UTF-8 (good!), but provides no way to work with grapheme clusters, and the documentation just points at crates.io… Seriously?
At least unicode-segmentation, the package that seems to be the popular way to get that functionality, has a version number indicating that it may be usable for production (that is, ≥ 1.0.0), has six owners of whom a non-zero number even manage to pass a basic vibe check, and only has dev-dependencies, making this something I’d actually consider using.
Still: This should really be part of the standard library!
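To make the gap concrete, here is everything std alone can tell you about a string that a user perceives as a single character (a minimal sketch; the counts assume exactly the code points shown):

```rust
// A Danish flag emoji is two "regional indicator" code points that form
// ONE grapheme cluster. std can count chars (code points) and bytes, but
// has no API that reports "one user-perceived character" — that is what
// the unicode-segmentation crate's graphemes() iterator is for.
fn main() {
    let flag = "\u{1F1E9}\u{1F1F0}"; // 🇩🇰 = REGIONAL INDICATOR D + K
    assert_eq!(flag.chars().count(), 2); // code points
    assert_eq!(flag.len(), 8);           // UTF-8 bytes (4 per code point)
    println!("{} chars, {} bytes, one visible character",
             flag.chars().count(), flag.len());
}
```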
Another thing I then looked at was the random number facilities, which were another disappointment: not only are they still fully experimental, they are also woefully incomplete and cannot even produce a random integer between 1 and 10. This is extremely basic and should be possible.
I’m not even talking about things like uniform floating-point distributions here, let alone stuff like a normal distribution, all of which C++, by the way, supports directly in the standard library! (Not saying C++ does it all perfectly, but it’s good enough to be useful. This isn’t!)
#Rustlang
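For reference, the one-liner being asked for is what the ecosystem’s de-facto answer, the rand crate, spells as `rng.gen_range(1..=10)` (in rand 0.8’s API). A dependency-free sketch of the same operation, using a toy LCG — modulo bias ignored, and emphatically not cryptographic:

```rust
// A toy linear congruential generator, only to illustrate the missing
// "give me an integer in 1..=10" one-liner; real code would use the
// rand crate or a proper CSPRNG.
struct Lcg(u64);

impl Lcg {
    fn next_u64(&mut self) -> u64 {
        // Multiplier/increment from Knuth's MMIX LCG.
        self.0 = self
            .0
            .wrapping_mul(6364136223846793005)
            .wrapping_add(1442695040888963407);
        self.0
    }

    /// Roughly uniform integer in [lo, hi] (modulo bias ignored here).
    fn range(&mut self, lo: u64, hi: u64) -> u64 {
        lo + self.next_u64() % (hi - lo + 1)
    }
}

fn main() {
    let mut rng = Lcg(0xDEADBEEF);
    let d = rng.range(1, 10);
    assert!((1..=10).contains(&d));
    println!("{d}");
}
```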
@Fiona What percentage of Rust’s standard-library users would need Unicode segmentation in their everyday projects? Everything that goes into the standard library needs to be maintained over time, and that’s a burden — in fact a growing one, from what I read.
@michalfita@mastodon.social Literally everyone who does basically anything whatsoever with text.
If I give you a valid unicode string with more than one codepoint in it and ask you whether the first character is an “a”, you literally cannot answer that question reliably if you don’t have support for unicode segmentation. It is that basic!
And by pointing to a third-party dependency of unknown trustworthiness you are creating a situation where people will just ignore the real complexity and ship broken software. That may be acceptable if we are talking about badly designed toy languages that everyone knows are doing insane stuff when you look at them funny, but Rust claims to do better here.
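The “first character” claim above can be demonstrated with nothing but the standard library, assuming the string arrives in decomposed (NFD) form:

```rust
// "á" can be encoded as the single code point U+00E1, or as 'a' followed
// by U+0301 (combining acute accent). Both are valid Unicode and render
// identically, but chars() sees different things:
fn main() {
    let precomposed = "\u{00E1}"; // "á", one code point
    let decomposed = "a\u{0301}"; // "á", two code points
    assert_eq!(precomposed.chars().next(), Some('á'));
    assert_eq!(decomposed.chars().next(), Some('a')); // looks like an "a"!
    // Without grapheme segmentation, no std API tells you that the
    // decomposed string's first user-perceived character is NOT "a".
}
```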
One flaw I frequently encounter in software is text searches incorrectly assuming o and ø are the same letter. When I search for one and the other letter is also included in the search results, there can be so many false positives that it renders the results entirely unusable.
I don't know exactly how such a flaw is introduced. Even the most basic search algorithm I could imagine wouldn't have that flaw. It seems somebody must have gone out of their way to make the search behave in this incorrect way.
@kasperd@westergaard.social @michalfita@mastodon.social This happens if your search uses a compatibility normalization (NFKD or NFKC), which strips out a lot of distinctions between characters.
There is actually even a somewhat valid reason to do so, since it allows you to search for characters you may not have an easy way to enter. Say, for example, the text contains a “ℕ” or an “fl”-ligature: I can enter the former on my keyboard, but I also maintain my own layout and a giant XCompose file; do you? And the latter is even more critical, since you probably really want a search for “flying” to find an occurrence of “ﬂying”; similarly, if I tell you that 𝐸 = 𝑚𝑐², you would likely want to be able to find that in a text by searching for “E = mc”, or maybe even “E = mc2”.
The problem there is where you draw the line. I can kinda agree that a search for “Musli” should not find “Müsli”, but should “Muesli”? And what if the word isn’t German, or the person searching doesn’t know that the appropriate decomposition of an umlaut is ⟨base vowel⟩+e? Should they not be able to find it? I correctly remembered that the fallback for ø is similarly “oe”, but how many other people who don’t use Nordic languages are aware of that?
All this is to say: there are good reasons for this behavior, even though I agree that it can be very annoying, and that a stricter middle ground should at least be available and maybe even be the default.
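A sketch of the kind of compatibility folding described above, using a hand-written, deliberately tiny mapping table (purely illustrative — a real implementation would use the Unicode decomposition tables, e.g. via the unicode-normalization crate):

```rust
use std::collections::HashMap;

// Toy compatibility folding: maps a few characters to ASCII fallbacks
// the way NFKD plus mark-stripping would. Note that ø has NO Unicode
// decomposition at all — conflating it with o takes an extra, more
// aggressive folding step, which is exactly the questionable line-drawing
// discussed above.
fn fold(s: &str) -> String {
    let table: HashMap<char, &str> = [
        ('ﬂ', "fl"), // U+FB02, compatibility ligature
        ('ℕ', "N"),  // U+2115, double-struck capital N
        ('²', "2"),  // U+00B2, superscript two
        ('ü', "u"),  // decomposes to u + combining diaeresis; mark stripped
        ('ø', "o"),  // not a Unicode decomposition; aggressive folding
    ]
    .into_iter()
    .collect();
    s.chars()
        .map(|c| {
            table
                .get(&c)
                .map(|r| r.to_string())
                .unwrap_or_else(|| c.to_string())
        })
        .collect()
}

fn main() {
    assert_eq!(fold("ﬂying"), "flying"); // useful
    assert_eq!(fold("Müsli"), "Musli");  // debatable
    assert_eq!(fold("sukkerrør"), "sukkerror"); // meaning destroyed
}
```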
It's not like there is a valid alternative spelling for words using the letter ø. It's common to see ø replaced with o or oe when constrained to writing in ASCII. But neither is a correct spelling of the word.
My favorite example of this is the word sukkerrør, which gets a different meaning if you replace ø with oe: sukkerrør means sugar canes and sukkerroer means sugar beets.