Another look at #Rust, another two disappointments: The standard library specifies that its string type is encoded in UTF-8 (good!), but provides no way to work with grapheme clusters and the documentation just points at crates.io… Seriously

    fiona@blahaj.zone wrote (#1):

    Another look at #Rust, another two disappointments:

    The standard library specifies that its string type is encoded in UTF-8 (good!), but provides no way to work with grapheme clusters, and the documentation just points at crates.io… Seriously?

    At least the unicode-segmentation package, which seems to be a popular way to get that functionality, has a version number that indicates it may be usable for production (that is, version ≥ 1.0.0), has six owners, of which a non-zero number even manage to pass a basic vibe check, and only has dev-dependencies, making this something I’d actually consider using.

    Still: This should really be part of the standard library!

    Another thing I then looked at was the random number facilities, which were another disappointment: Not only are they still fully experimental, they are also woefully incomplete and cannot even create a random integer between 1 and 10. This is something extremely basic that should be possible.
    I’m not even talking about things like uniform floating point distributions here, let alone stuff like a normal distribution, all of which C++, by the way, supports directly in the standard library! (Not saying C++ does it all perfectly, but it’s good enough to be useful. This isn’t!)

    #Rustlang
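For context on the "random integer between 1 and 10" complaint, here is a minimal sketch of what doing this by hand on stable Rust with only the standard library might look like. It is an illustration under assumptions, not an official randomness API: `random_u64` and `random_in_range` are hypothetical helper names, the seeding relies on the randomly generated keys of `RandomState` (a workaround, and not suitable for anything security-related), and the range mapping uses plain rejection sampling.

```rust
use std::collections::hash_map::RandomState;
use std::hash::{BuildHasher, Hasher};

/// Workaround (an assumption, not a documented RNG): each new `RandomState`
/// carries randomly generated keys, so finishing an empty hash gives a
/// value that varies from call to call.
fn random_u64() -> u64 {
    RandomState::new().build_hasher().finish()
}

/// Uniform integer in `low..=high`, using rejection sampling.
fn random_in_range(low: u64, high: u64) -> u64 {
    assert!(
        low <= high && high - low < u64::MAX,
        "range must be non-empty and must not cover all of u64"
    );
    let span = high - low + 1;
    // Largest multiple of `span` that fits in a u64. Draws at or above it
    // are rejected so that `x % span` has no modulo bias.
    let limit = u64::MAX - u64::MAX % span;
    loop {
        let x = random_u64();
        if x < limit {
            return low + x % span;
        }
    }
}

fn main() {
    let roll = random_in_range(1, 10);
    println!("random integer between 1 and 10: {roll}");
}
```

The amount of care needed here (bias, overflow, seeding) is arguably the point of the complaint above: none of it is something an application author should have to write.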

      michalfita@mastodon.social wrote (#2):

      @Fiona What would be the percentage of Rust's standard library users needing Unicode segmentation in their everyday projects? Everything that goes into the standard library needs to be maintained over time, and that's a burden. In fact, a growing one, from what I read.

        fiona@blahaj.zone wrote (#3):

        @michalfita@mastodon.social Literally everyone who does basically anything whatsoever with text.

        If I give you a valid Unicode string with more than one code point in it and ask you whether the first character is an “a”, you literally cannot answer that question reliably if you don’t have support for Unicode segmentation. It is that basic!

        And by pointing to a third-party dependency of unknown trustworthiness you are creating a situation where people will just ignore the real complexity and ship broken software. That may be acceptable if we are talking about badly designed toy languages that everyone knows are doing insane stuff when you look at them funny, but Rust claims to do better here.
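The "is the first character an a?" question can be made concrete with a small sketch using only the standard library. It is deliberately crude: `is_combining_mark` is a hypothetical helper that only knows the Combining Diacritical Marks block (U+0300 to U+036F), which is an assumption made for illustration and nowhere near full grapheme cluster segmentation.

```rust
/// Crude, hypothetical stand-in for real segmentation: only recognises the
/// "Combining Diacritical Marks" block (U+0300..=U+036F).
fn is_combining_mark(c: char) -> bool {
    ('\u{0300}'..='\u{036F}').contains(&c)
}

/// Naive check: looks only at the first code point.
fn first_code_point_is_a(s: &str) -> bool {
    s.chars().next() == Some('a')
}

/// Slightly less naive: the first code point must be 'a' and must not be
/// followed by a combining mark that would turn it into another character.
fn first_character_is_a(s: &str) -> bool {
    let mut chars = s.chars();
    chars.next() == Some('a') && !chars.next().map_or(false, is_combining_mark)
}

fn main() {
    let plain = "abc";
    let umlaut = "a\u{0308}bc"; // 'a' + COMBINING DIAERESIS, displayed as "äbc"

    println!("{}", first_code_point_is_a(plain)); // true
    println!("{}", first_code_point_is_a(umlaut)); // true, although the text starts with "ä"
    println!("{}", first_character_is_a(umlaut)); // false
}
```

The code point check gives the wrong answer for the second string, and the patched version is only right for the one block it happens to know about, which is why proper segmentation support matters.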

          kasperd@westergaard.social wrote (#4):

          One flaw I frequently encounter in software is text searches incorrectly assuming o and ø are the same letter. When I search for one and the other letter is also included in the search result there can be so many false positives that it renders the search result entirely unusable.

          I don't know exactly how such a flaw is introduced. Even the most basic search algorithm I could imagine wouldn't have that flaw. It seems somebody must have gone out of their way to make the search behave in this incorrect way.

            fiona@blahaj.zone wrote (#5):

            @kasperd@westergaard.social @michalfita@mastodon.social This happens if your search uses a compatibility normalization (NFKD or NFKC), which strips out a lot of distinctions between characters.

            There is actually even a somewhat valid reason to do so, since it allows you to search for characters you may not have an easy way to enter. Say, for example, the text contains a “ℕ” or a “ﬂ” ligature: I can enter the former on my keyboard, but I also maintain my own layout and a giant XCompose file; do you? And the latter is even more critical, since you probably really want a search for “flying” to find an occurrence of “ﬂying”. Similarly, if I tell you that 𝐸 = 𝑚𝑐², you would likely want to be able to find that in a text by searching for “E = mc”, or maybe even “E = mc2”.

            The problem there is where you draw the line. I can kinda agree that a search for “Musli” should not find “Müsli”, but should “Muesli”? And what if the word isn’t German, or the person searching doesn’t know that the appropriate decomposition of an umlaut is ⟨base vowel⟩+e? Should they not be able to find it? I correctly remembered that the fallback for ø is similarly “oe”, but how many other people who don’t use Nordic languages are aware of that?

            All this is to say: There are good reasons for this behavior, even though I agree that it can be very annoying and that a stricter middle ground should at least be available and maybe even be the default.
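The search examples above (the ﬂ ligature, ℕ, and 𝐸 = 𝑚𝑐²) can be sketched as a fold-then-compare search. The mapping table below is hand-written and hypothetical: it covers only the handful of characters mentioned in this post, and `fold` and `folded_contains` are illustrative names. A real implementation would presumably take its mappings from the Unicode compatibility decomposition data rather than a `match`.

```rust
/// Hand-written stand-in for compatibility folding (an assumption for
/// illustration; covers only the characters mentioned in the post).
fn fold(s: &str) -> String {
    let mut out = String::with_capacity(s.len());
    for c in s.chars() {
        match c {
            '\u{FB02}' => out.push_str("fl"), // LATIN SMALL LIGATURE FL
            '\u{2115}' => out.push('N'),      // DOUBLE-STRUCK CAPITAL N
            '\u{00B2}' => out.push('2'),      // SUPERSCRIPT TWO
            '\u{1D438}' => out.push('E'),     // MATHEMATICAL ITALIC CAPITAL E
            '\u{1D45A}' => out.push('m'),     // MATHEMATICAL ITALIC SMALL M
            '\u{1D450}' => out.push('c'),     // MATHEMATICAL ITALIC SMALL C
            other => out.push(other),
        }
    }
    out
}

/// Search that folds both sides before comparing.
fn folded_contains(haystack: &str, needle: &str) -> bool {
    fold(haystack).contains(&fold(needle))
}

fn main() {
    let text = "\u{FB02}ying squirrels"; // starts with the fl ligature
    let formula = "\u{1D438} = \u{1D45A}\u{1D450}\u{00B2}"; // 𝐸 = 𝑚𝑐²

    println!("{}", text.contains("flying")); // false: exact match misses the ligature
    println!("{}", folded_contains(text, "flying")); // true
    println!("{}", folded_contains(formula, "E = mc2")); // true
}
```

Where to draw the line is then a question of which entries go into the table, which is exactly the design problem described above.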

              kasperd@westergaard.social wrote (#6):

              It's not like there is a valid alternative spelling for words using the letter ø. It's common to see ø replaced with o or oe when constrained to writing in ASCII. But neither is a correct spelling of the word.

              My favorite example of this is the word sukkerrør which will get a different meaning if you replace ø with oe. Sukkerrør means sugarcanes and sukkerroer means sugar beets.
