
Someone on Lobsters wondered "how a modern compiler would fare against hand-optimized asm" in reference to Abrash's TransformVector (3x3 matrix-vector multiply) hand-written x87 routine in Quake.

47 Posts 13 Posters 1 Views
  • rygorous@mastodon.gamedev.place

    @wolf480pl @pervognsen Like, seriously.

    If you look at the actual encoding you start to realize pretty quickly that x86 is more regular than you probably thought, and, say, ARM (especially T32, but also A64) a lot less so. 🙂

    zeux@mastodon.gamedev.place
    wrote last edited by
    #21

    @rygorous @wolf480pl @pervognsen ARM64 kinda looks like Huffman coding sometimes, only without variable-length output you’re reduced to truncating inputs arbitrarily.

      tomf@mastodon.gamedev.place
      wrote last edited by
      #22

      @zeux @rygorous @wolf480pl @pervognsen x86 is also Huffman encoded, but backwards. So the instructions you rarely use are nice and short, and the instructions you use all the time have lots of wordy prefixes on them.

        rygorous@mastodon.gamedev.place
        wrote last edited by
        #23

        @TomF @zeux @wolf480pl @pervognsen You keep saying this and it's not even close to true.

        Code density of x86, x86-64 etc. vs. other ISAs has been well-studied approximately a zillion times, with the same result every damn time.

        x86's encoding is not optimal for the instruction frequencies seen in current x86 code, but it is not even remotely close to backwards, and it's denser than most alternatives, even some that explicitly go for density.

        The most frequent instructions are, in fact, 1B opcodes.
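
To make the "1B opcode" point a bit more concrete, here is a rough Python sketch over a handful of common x86-64 instructions. The byte strings are the standard encodings, but the selection of instructions and the helper name `opcode_len` are assumptions made for illustration; this is not a frequency measurement.

```python
# Hand-picked x86-64 encodings (illustrative selection, not measured frequencies).
ENCODINGS = {
    "push rbp":      bytes.fromhex("55"),
    "mov rbp, rsp":  bytes.fromhex("4889e5"),
    "mov eax, ebx":  bytes.fromhex("89d8"),
    "add eax, ebx":  bytes.fromhex("01d8"),
    "ret":           bytes.fromhex("c3"),
    "imul eax, ebx": bytes.fromhex("0fafc3"),
}


def opcode_len(code: bytes) -> int:
    """Opcode length in bytes, not counting a REX prefix (hypothetical helper).

    Only handles what the table above needs: an optional REX prefix
    (0x40-0x4F in 64-bit mode) followed by either a one-byte opcode or a
    two-byte opcode that starts with the 0x0F escape byte.
    """
    i = 1 if 0x40 <= code[0] <= 0x4F else 0
    return 2 if code[i] == 0x0F else 1


for text, code in ENCODINGS.items():
    print(f"{text:14s} {code.hex():8s} total={len(code)}B opcode={opcode_len(code)}B")
```

In this small sample only `imul` needs the two-byte `0F`-escaped opcode; the moves, adds, pushes and returns all have a single opcode byte, with ModRM and REX bytes added only when needed.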

          rygorous@mastodon.gamedev.place
          wrote last edited by
          #24

          @TomF @zeux @wolf480pl @pervognsen 32-bit x86 is usually roughly comparable with m68k, worse than Thumb-2, but usually better than RV32GC or ARM A64.

          x86-64 is roughly comparable to RV64GC (which is variable-length and compressed), and similar to or worse than the (fixed-size-ish, SVE gets a bit weird) ARM A64.

          Almost all 32-bit fixed-instruction-size RISCs are significantly worse than all these options on typical code. Something like 25% larger.

            rygorous@mastodon.gamedev.place
            wrote last edited by
            #25

            @TomF @zeux @wolf480pl @pervognsen There's this, for example: https://portal.mozz.us/gemini/arcanesciences.com/gemlog/22-07-28/? The ratings in related comparisons vary, but the broad trend holds true:

            - Really compact: ARC, Thumb
            - OK: m68k, x86, moderately compressed variable-size reprs like RV32GC/RV64GC, ARM A64 (Don't know where MIPS16 lands here)
            - Bad: most 32b fixed-size encodings, zSeries

              zeux@mastodon.gamedev.place
              wrote last edited by
              #26

              @rygorous @TomF @wolf480pl @pervognsen I think I posted stats for some random programs a little while back and it was amusing to see x64 average close to 4 bytes per instruction. Lots of 1-2 byte instructions but also lots of 5-8+ byte ones, so it all blends. Although I didn’t do instruction *counts*, which might be a little smaller vs fixed-length instruction sets.

                flux@wandering.shop
                wrote last edited by
                #27

                @rygorous In Tom's defense, they did hurt us.

                @TomF @zeux @wolf480pl @pervognsen

                  rygorous@mastodon.gamedev.place
                  wrote last edited by
                  #28

                  @TomF @zeux @wolf480pl @pervognsen The outlier here is that ARM A64 plays in the same class for code density as the "moderately density-optimized variable-size" encodings despite being 32b fixed instruction size.

                  This comes at some cost in decoding complexity: A64 is a _way_ less regular encoding than say RV32 or MIPS32 are.

                  But also several of the ideas in A64 that help code density are just objectively good ideas that are now popping up elsewhere. 🙂

                    rygorous@mastodon.gamedev.place
                    wrote last edited by
                    #29

                    @zeux @TomF @wolf480pl @pervognsen x86 insns averaging around 4B is about right, but the important thing to keep in mind is that a lot of things that are 1 instruction in x86 are multiple instructions in other encodings.

                    E.g. a lot of 6B/7B instructions are of the 2B/3B instruction + 4B imm32 variety, and those that actually need an imm32 are usually 8B and 2 insns on RISCs.
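
As a rough illustration of the "8B and 2 insns" case, here is a Python sketch of how a 32-bit constant is typically split for a RISC-V `lui` + `addi` pair. The function names `split_imm32` and `rebuild` are made up for this example; the size constants compare against x86's `mov r32, imm32` (`B8+rd id`), which is one 5-byte instruction.

```python
MASK32 = 0xFFFFFFFF


def split_imm32(value: int) -> tuple[int, int]:
    """Split a 32-bit constant into (hi20, lo12) for a LUI + ADDI pair.

    ADDI sign-extends its 12-bit immediate, so the upper part is rounded
    by adding 0x800 before shifting; the low part is returned as a signed
    12-bit value.
    """
    value &= MASK32
    hi20 = ((value + 0x800) >> 12) & 0xFFFFF
    lo12 = value & 0xFFF
    if lo12 >= 0x800:
        lo12 -= 0x1000  # reinterpret as signed 12-bit
    return hi20, lo12


def rebuild(hi20: int, lo12: int) -> int:
    """What LUI followed by ADDI leaves in a 32-bit register."""
    return ((hi20 << 12) + lo12) & MASK32


X86_MOV_IMM32_BYTES = 5       # B8+rd id: opcode + imm32, one instruction
RISCV_LUI_ADDI_BYTES = 4 + 4  # two uncompressed 32-bit instructions

hi, lo = split_imm32(0xDEADBEEF)
print(hex(hi), lo, hex(rebuild(hi, lo)))
```

The same constant costs one instruction on x86 and two on a fixed 32-bit encoding, which is why comparing average instruction length alone undersells x86.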

                      rygorous@mastodon.gamedev.place
                      wrote last edited by
                      #30

                      @zeux @TomF @wolf480pl @pervognsen
                      That said, x86's immeds are one of its weaker points. simm8 is inconsistently available and often too short, imm32 is almost always overkill.

                      But if you look at binaries you start to notice just how much code size is the same boilerplate over and over and over again.

                      One of A64's secret weapons re: density is STP/LDP for function prologues/epilogues, to save/restore two regs at once. (Which x86 APX is now stealing just for reg saving with PUSH2/POP2.)
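
The simm8/imm32 cliff can be sketched in a few lines of Python. This models only `add r32, imm` with a register destination and no prefixes; the function name `add_reg_imm_size` is an assumption for illustration, while the opcode forms in the comments are the standard ones.

```python
def add_reg_imm_size(imm: int, dest_is_eax: bool = False) -> int:
    """Encoded size in bytes of `add r32, imm` (register operand, no prefixes)."""
    if not -(1 << 31) <= imm < (1 << 32):
        raise ValueError("immediate does not fit in 32 bits")
    if -128 <= imm <= 127:
        return 3  # 83 /0 ib: opcode + ModRM + sign-extended imm8
    if dest_is_eax:
        return 5  # 05 id: short form for EAX, opcode + imm32
    return 6      # 81 /0 id: opcode + ModRM + imm32


for imm in (1, 127, 128, 1000, 100_000):
    print(imm, add_reg_imm_size(imm))
```

There is no middle step: 127 fits in 3 bytes, while 128 already pays for a full 4-byte immediate.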

                        wolf480pl@mstdn.io
                        wrote last edited by
                        #31

                        @rygorous @zeux @TomF @pervognsen
                        > x86 APX

                        did Intel really reuse that brand after 40 years?

                          rygorous@mastodon.gamedev.place
                          wrote last edited by
                          #32

                          @zeux @TomF @wolf480pl @pervognsen The other big source of chonky insns with x86 is something like a 4B/5B base insn and then a complicated addressing mode. But again, the complicated addressing mode adds one or more extra insns on other archs.

                          It's not the same, because the calculus changes. On x86 you'll often redo variants of the same address calc 2-3 times; without those addr modes you'd calc once and share. Still, these extra bytes are not bloat, they're paying for themselves (on average).
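
A small Python sketch of what one such addressing mode folds together. It assumes a load/store architecture whose loads only take register + small displacement (as in base RV64I, without scaled-index extensions); the function names are hypothetical. For reference, `mov eax, [rbx + rcx*4 + 16]` encodes as `8B 44 8B 10`, 4 bytes in one instruction.

```python
MASK64 = 0xFFFFFFFFFFFFFFFF


def x86_effective_address(base: int, index: int, scale: int, disp: int) -> int:
    """Address formed by a single x86 memory operand: base + index*scale + disp."""
    if scale not in (1, 2, 4, 8):
        raise ValueError("a SIB byte can only encode scales 1, 2, 4, 8")
    return (base + index * scale + disp) & MASK64


def risc_style_address(base: int, index: int, scale: int, disp: int) -> tuple[int, int]:
    """Same address as explicit steps; returns (address, instructions used)."""
    if scale not in (1, 2, 4, 8):
        raise ValueError("scale must be 1, 2, 4 or 8")
    shift = scale.bit_length() - 1       # 1 -> 0, 2 -> 1, 4 -> 2, 8 -> 3
    instructions = 0
    tmp = index
    if shift:
        tmp = (tmp << shift) & MASK64    # slli tmp, index, shift
        instructions += 1
    tmp = (tmp + base) & MASK64          # add  tmp, tmp, base
    instructions += 1
    address = (tmp + disp) & MASK64      # lw   rd, disp(tmp)
    instructions += 1
    return address, instructions


print(hex(x86_effective_address(0x1000, 3, 4, 16)))
print(risc_style_address(0x1000, 3, 4, 16))
```

Under these assumptions the separate-steps version is three 4-byte instructions (12 bytes) against 4 bytes on x86, though the shifted-and-added intermediate can be reused across several accesses, which is the "calc once and share" trade-off.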

                            wolf480pl@mstdn.io
                            wrote last edited by
                            #33

                            @rygorous @zeux @TomF @pervognsen
                            also, what happened with ARM's STM/LDM? I thought those were its superpower all the way since ARM2 and Acorn Archimedes

                              rygorous@mastodon.gamedev.place
                              wrote last edited by
                              #34

                              @wolf480pl m68k had those too! Anyway, they're gone in A64, replaced with LDP/STP.

                              LDP/STP are a better compromise. LDM/STM is awkward in numerous ways, chiefly in that it's inherently a variable-number-of-uops flow with different memory access sizes and variable number of regs referenced, which is all a bit of a nightmare.

                              LDP/STP is fixed access size, fixed number of register references. It cuts prologues/epilogues in ~half without needing tricky micro-sequencing.
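
The "~half" figure is simple arithmetic; a minimal Python sketch, assuming a function that saves the frame pointer, link register, and all ten callee-saved registers x19-x28 (12 registers in total). The helper name `save_restore_bytes` is hypothetical.

```python
import math

A64_INSN_BYTES = 4  # every A64 instruction is 4 bytes


def save_restore_bytes(n_regs: int, paired: bool) -> int:
    """Bytes of prologue plus epilogue code to save and restore n_regs registers."""
    per_direction = math.ceil(n_regs / 2) if paired else n_regs
    return 2 * per_direction * A64_INSN_BYTES


n = 12  # x29, x30 and x19-x28 (assumed worst case for this example)
print("STR/LDR:", save_restore_bytes(n, paired=False), "bytes")
print("STP/LDP:", save_restore_bytes(n, paired=True), "bytes")
```

With an odd register count the last register still needs its own instruction, so the saving is slightly less than half.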

                                dougall@mastodon.social
                                wrote last edited by
                                #35

                                @rygorous @wolf480pl I found it pretty mind blowing that Apple cores are able to execute them as single 128-bit loads and stores, thereby doubling the number of possible scalar register loads/stores per cycle compared to x86 cores.

                                (It's very nice raw throughput, though I'm guessing some weird memory renaming tricks can maybe make up some of the difference on the x86 side?)

                                  wren6991@types.pl
                                  wrote last edited by
                                  #36

                                  @rygorous @TomF @zeux @wolf480pl @pervognsen Would be interested to see how RISC-V fared with the addition of B, Zcmp and Zcb. I find it's usually pretty close to Thumb. Zcb in particular is just "oops we forgot to put this in the C extension" and there's not much excuse not to implement it.

                                    iximeow@haunted.computer
                                    wrote last edited by
                                    #37

                                    @rygorous oh this is a super neat post, thanks for sharing

                                      rygorous@mastodon.gamedev.place
                                      wrote last edited by
                                      #38

                                      @zeux @TomF @wolf480pl @pervognsen In general, the thing to keep in mind is that the part that matters for density is usually boring int code, which is the majority of it almost everywhere.

                                      You can have 90% of the instructions in your manual have awkwardly redundant encodings and not have it matter too much for size as long as the encodings for the 10-15 insns that really matter are good.

                                        rygorous@mastodon.gamedev.place
                                        wrote last edited by
                                        #39

                                        @wren6991 @TomF @zeux @wolf480pl @pervognsen no clue, I've looked at the C extension but not the newer stuff.

                                          wolf480pl@mstdn.io
                                          wrote last edited by
                                          #40

                                          @rygorous
                                          I guess STM/LDM makes way more sense when you have no instruction cache, so every extra cycle your instruction spends moving more data would've otherwise been an instruction fetch
