Someone on Lobsters wondered "how a modern compiler would fare against hand-optimized asm" in reference to Abrash's TransformVector (3x3 matrix-vector multiply) hand-written x87 routine in Quake.

47 Posts 13 Posters 1 Views
  • rygorous@mastodon.gamedev.place

    @zeux @TomF @wolf480pl @pervognsen
    That said, x86's immeds are one of its weaker points. simm8 is inconsistently available and often too short, imm32 is almost always overkill.

    But if you look at binaries you start to notice just how much code size is the same boilerplate over and over and over again.

    One of A64's secret weapons re: density is STP/LDP for function prologues/epilogues, to save/restore two regs at once. (Which x86 APX is now stealing just for reg saving with PUSH2/POP2.)

    rygorous@mastodon.gamedev.place wrote (#32):

    @zeux @TomF @wolf480pl @pervognsen The other big source of chonky insns with x86 is something like 4B/5B base insn and then a complicated addressing mode. But again the complicated addressing mode would add one or more extra insns on other archs.

    It's not the same, because the calculus changes. On x86 you'll often redo variants of the same address calc 2-3 times, without those addr modes you'd calc once and share. Still, these extra bytes are not bloat, they're paying for themselves (on average).
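
To put rough numbers on the addressing-mode bytes, here is a minimal sketch of how many bytes an x86 memory operand adds on top of the opcode, assuming 32/64-bit mode and a `[base + index*scale + disp]` operand with a base register. The function name and the string register names are made up for illustration; prefixes such as REX are not counted.

```python
def x86_mem_operand_bytes(base, index=None, disp=0):
    """Bytes spent on ModRM + SIB + displacement for [base + index*scale + disp].

    Only the common "has a base register" case in 32/64-bit mode is covered.
    The opcode and any prefixes are not included.
    """
    size = 1  # a memory operand always needs a ModRM byte
    # An index register, or ESP/RSP/R12 as the base, forces a SIB byte.
    if index is not None or base in ("esp", "rsp", "r12"):
        size += 1
    # EBP/RBP/R13 as the base has no displacement-free form,
    # so a zero displacement still costs a disp8 there.
    if disp == 0 and base not in ("ebp", "rbp", "r13"):
        return size
    if -128 <= disp <= 127:
        return size + 1  # disp8
    return size + 4      # disp32
```

So `[rax]` costs 1 byte, `[rax + rcx*4 + 8]` costs 3, and `[rax + rcx*4 + 0x1000]` costs 6, which is where a 4B/5B base instruction turns into a 10-11 byte one.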

    • wolf480pl@mstdn.io

      @rygorous @zeux @TomF @pervognsen
      > x86 APX

      did Intel really reuse that brand after 40 years?

      wolf480pl@mstdn.io wrote (#33):

      @rygorous @zeux @TomF @pervognsen
      also, what happened with ARM's STM/LDM? I thought those were its superpower all the way since ARM2 and Acorn Archimedes

        rygorous@mastodon.gamedev.place wrote (#34):

        @wolf480pl m68k had those too! Anyway, they're gone in A64, replaced with LDP/STP.

        LDP/STP are a better compromise. LDM/STM is awkward in numerous ways, chiefly in that it's inherently a variable-number-of-uops flow with different memory access sizes and variable number of regs referenced, which is all a bit of a nightmare.

        LDP/STP is fixed access size, fixed number of register references. It cuts prologues/epilogues in ~half without needing tricky micro-sequencing.
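
The "~half" figure is just ceiling division, sketched below under the assumption that every saved register goes through either a single-register store or a paired store, and that each A64 instruction is 4 bytes. The helper names are hypothetical, and stack-pointer adjustment is ignored.

```python
def save_instruction_count(num_regs, regs_per_insn):
    """Number of store instructions needed to save num_regs registers."""
    return -(-num_regs // regs_per_insn)  # ceiling division


def prologue_save_bytes_a64(num_regs, paired=True):
    """Code bytes spent on register saves; A64 instructions are a fixed 4 bytes."""
    return 4 * save_instruction_count(num_regs, 2 if paired else 1)
```

Saving 12 registers takes 12 single stores (48 bytes) or 6 paired stores (24 bytes); an odd count such as 5 registers still needs 3 paired stores.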

          dougall@mastodon.social wrote (#35):

          @rygorous @wolf480pl I found it pretty mind blowing that Apple cores are able to execute them as single 128-bit loads and stores, thereby doubling the number of possible scalar register loads/stores per cycle compared to x86 cores.

          (It's very nice raw throughput, though I'm guessing some weird memory renaming tricks can maybe make up some of the difference on the x86 side?)

            • rygorous@mastodon.gamedev.place

            @TomF @zeux @wolf480pl @pervognsen There's this for example https://portal.mozz.us/gemini/arcanesciences.com/gemlog/22-07-28/? and the related ratings vary but the broad trend holds true:

            - Really compact: ARC, Thumb
            - OK: m68k, x86, moderately compressed variable-size reprs like RV32GC/RV64GC, ARM A64 (Don't know where MIPS16 lands here)
            - Bad: most 32b fixed-size encodings, zSeries

            wren6991@types.pl wrote (#36):

            @rygorous @TomF @zeux @wolf480pl @pervognsen Would be interested to see how RISC-V fared with the addition of B, Zcmp and Zcb. I find it's usually pretty close to Thumb. Zcb in particular is just "oops we forgot to put this in the C extension" and there's not much excuse not to implement it

              iximeow@haunted.computer wrote (#37):

              @rygorous oh this is a super neat post, thanks for sharing

                rygorous@mastodon.gamedev.place wrote (#38):

                @zeux @TomF @wolf480pl @pervognsen In general, the thing to keep in mind is that the part that matters for density is usually boring int code, which is the majority of it almost everywhere.

                You can have 90% of the instructions in your manual have awkwardly redundant encodings and not have it matter too much for size as long as the encodings for the 10-15 insns that really matter are good.
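
A toy calculation shows why the common instructions dominate. The instruction mixes and byte sizes below are invented purely for illustration and are not measurements of any real ISA or binary; `average_insn_bytes` is a hypothetical helper.

```python
def average_insn_bytes(mix):
    """mix maps a label to (instruction count, encoded size in bytes)."""
    total_insns = sum(count for count, _ in mix.values())
    total_bytes = sum(count * size for count, size in mix.values())
    return total_bytes / total_insns


# Invented numbers, per 100 instructions of compiled code.
compact_core = {"common int ops": (85, 2), "everything else": (15, 6)}
fixed_width = {"common int ops": (85, 4), "everything else": (15, 4)}
```

With these made-up mixes, `average_insn_bytes(compact_core)` is 2.6 and `average_insn_bytes(fixed_width)` is 4.0: the rare instructions are 50% larger in the first encoding, yet the overall code is still much smaller.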

                  rygorous@mastodon.gamedev.place wrote (#39):

                  @wren6991 @TomF @zeux @wolf480pl @pervognsen no clue, I've looked at the C extension but not the newer stuff.

                    wolf480pl@mstdn.io wrote (#40):

                    @rygorous
                    I guess STM/LDM makes way more sense when you have no instruction cache, so every extra cycle your instruction spends moving more data would've otherwise been an instruction fetch

                      wolf480pl@mstdn.io wrote (#41):

                      @rygorous
                      also rep movs

                        rygorous@mastodon.gamedev.place wrote (#42):

                        @zeux @TomF @wolf480pl @pervognsen Just to be self-contained, the stuff that really matters:

                        - load/store to (reg+small_imm)
                        - add, sub reg/reg and reg/imm, mov if 2-address
                        - sign/zero extends, as needed
                        - nearby conditional branches (whether it be compare + branch form or a branch-if-cond form) and nearby unconditional branches ("nearby" meaning small-offset region, +-4k range is most important)
                        - prologue/epilogue insns like PUSH/POP or LDP/STP if applicable
                        - CALL/branch-and-link, return
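
As one concrete instance of the "nearby branches" item, here is a sketch of x86 branch sizes in 32/64-bit mode, where the short form uses an 8-bit displacement and the near form a 32-bit one. The function name is hypothetical, and `rel` is assumed to be measured from the end of the branch instruction.

```python
def x86_branch_bytes(rel, conditional=True):
    """Encoded size of a direct x86 branch in 32/64-bit mode.

    rel is the displacement from the end of the branch instruction.
    """
    if -128 <= rel <= 127:
        return 2  # short Jcc rel8 or JMP rel8: opcode + disp8
    # Near forms: Jcc rel32 has a two-byte opcode, JMP rel32 a one-byte opcode.
    return 6 if conditional else 5
```

A conditional branch 100 bytes ahead is 2 bytes, while one 1000 bytes ahead is already 6, so most of the +-4k region is outside what the short form covers.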

                          rygorous@mastodon.gamedev.place wrote (#43):

                          @zeux @TomF @wolf480pl @pervognsen there's some more variants of this (like yeah maybe AND/bit tests too) but if you look at instruction traces (and also disassembly in general) it's funny just how much of it is just this

                            wren6991@types.pl wrote (#44):

                            @rygorous @TomF @zeux @wolf480pl @pervognsen Zcb is compressed forms of byte load/store, sign- and zero-extend, mul and not. Zcmp is compressed push/pop/return and a limited form of double-mov. They put together a nice spreadsheet showing the impact of each instruction: https://docs.google.com/spreadsheets/d/1bFMyGkuuulBXuIaMsjBINoCWoLwObr1l9h5TAWN8s7k/edit?gid=1837831327#gid=1837831327

                              rygorous@mastodon.gamedev.place wrote (#45):

                              @zeux @TomF @wolf480pl @pervognsen anyway, re: the link I posted

                              x86 is sometimes unfairly valorized as being very dense (I think that might just go back to the early 90s when the primary competition was all the classic 32-bit RISCs, in which case, yeah) and sometimes unfairly slandered as being pathologically bad, and neither is true.

                              It is thoroughly, blandly, middle-of-the-road, neither as dense as encodings that are optimized for density nor as bloated as the classic RISC encodings.

                                rygorous@mastodon.gamedev.place wrote (#46):

                                @zeux @TomF @wolf480pl @pervognsen There are many complaints to be made about x86.

                                The original sin of the x86 encoding is that you can't tell the total size of an instruction from just its first 1-2 bytes. That is an actual design mistake: it necessitates instruction-length decoder hardware that is relatively complex to build and verify, and that could be either gone entirely (fixed-size encodings) or at least way simpler than it is.
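
For contrast, here is what the "way simpler" end can look like. The post does not name a specific alternative; RISC-V is used here only as an example of a variable-length encoding where the length follows from the low bits of the first 16-bit parcel. The function name is made up, and only the 16-bit and 32-bit cases are handled.

```python
def rv_insn_length(first_halfword):
    """Length in bytes of a RISC-V instruction, from its first 16 bits only."""
    if (first_halfword & 0b11) != 0b11:
        return 2  # compressed (C extension) encoding
    if (first_halfword & 0b11100) != 0b11100:
        return 4  # standard 32-bit encoding
    raise ValueError("bit pattern reserved for encodings longer than 32 bits")
```

For example, `rv_insn_length(0x0001)` (c.nop) returns 2 and `rv_insn_length(0x0013)` (low half of `addi x0, x0, 0`) returns 4. An x86 length decoder has no equivalent shortcut, because prefixes, the opcode, ModRM and SIB all feed into the final length.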

                                  rygorous@mastodon.gamedev.place wrote (#47):

                                  @zeux @TomF @wolf480pl @pervognsen That said, while annoying and an ongoing cost for those who build x86s (they sell them by the tens of millions, they'll be fine), ILD adds maybe 1 pipeline stage more than it needs to.

                                  If the x86 encoding is kinda meh, then so are its mistakes. Yeah, if you were planning to build a lasting arch now, you certainly wouldn't do it that way. It adds extra overhead. But not in a way or to an extent that is particularly damning.
