Someone on Lobsters wondered "how a modern compiler would fare against hand-optimized asm" in reference to Abrash's TransformVector (3x3 matrix-vector multiply) hand-written x87 routine in Quake.

47 Posts 13 Posters 1 Views
  • wolf480pl@mstdn.ioW wolf480pl@mstdn.io

    @rygorous @pervognsen
    So most x87 instructions have such limited addressing modes that it needs a second pipe just for moving data into a place where other instructions can use it?

    That's such a TuringComplete.game thing to do 😄

    rygorous@mastodon.gamedev.placeR This user is from outside of this forum
    rygorous@mastodon.gamedev.place
    wrote last edited by
    #8

    @wolf480pl @pervognsen nah, the addressing modes are normal, but they had very limited encoding space to use, hence the stack. Specifically one of the 3-bit "register" fields in the ModRM field of the encoding is actually part of the opcode, since the basic ESC (coprocessor) range only has 3 opcode bits in it.

    • pervognsen@mastodon.socialP pervognsen@mastodon.social

      Even if you force gcc to undust its ancient Pentium 1 scheduling model (and force it to actually schedule instructions with -fschedule-insns and -fschedule-insns2), it's hardly different from VC6 in https://fabiensanglard.net/quake_asm_optimizations/index.html and still schedules the three dot products as separate blocks. https://gcc.godbolt.org/z/YE4d53cv3

      zeux@mastodon.gamedev.placeZ This user is from outside of this forum
      zeux@mastodon.gamedev.place
      wrote last edited by
      #9

      @pervognsen A question I'm more curious about is what is the delta on a modern OOO CPU?

        rygorous@mastodon.gamedev.placeR This user is from outside of this forum
        rygorous@mastodon.gamedev.place
        wrote last edited by
        #10

        @wolf480pl @pervognsen so instead of 2-register like most x86 insns, almost everything FPU only gets to use 1 register index (or the memory reference, which is normal), the other one is implied to be the top of stack (st(0)).
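
A minimal Python sketch of the encoding point made in the two posts above. It is not from the thread: the helper names (`split_233`, `is_esc`, `decode_d8_reg_form`) are hypothetical, and only the register-register forms of the `D8` opcode are handled. It shows that the ESC range `D8..DF` leaves just 3 free bits in the opcode byte, that the ModRM "reg" field is then used as an opcode extension, and that only one explicit register index (`rm`) remains, with st(0) implied.

```python
# Sketch of the x87 (ESC) encoding, limited to register forms of opcode D8.
# Helper names are illustrative assumptions, not part of any real library.

# Operation selected by the ModRM "reg" field for opcode D8 (/0 .. /7).
D8_OPS = ["fadd", "fmul", "fcom", "fcomp", "fsub", "fsubr", "fdiv", "fdivr"]


def split_233(byte):
    """Split a byte into its 2-3-3 bit fields, MSB first."""
    return (byte >> 6) & 0b11, (byte >> 3) & 0b111, byte & 0b111


def is_esc(opcode):
    """ESC opcodes are D8..DF: a fixed 11011 prefix plus 3 free opcode bits."""
    return (opcode >> 3) == 0b11011


def decode_d8_reg_form(opcode, modrm):
    """Return (mnemonic, i) for `D8` with a register operand st(i).

    The other operand is not encoded at all: it is implied to be st(0).
    """
    if opcode != 0xD8:
        raise ValueError("only opcode D8 is handled in this sketch")
    mod, reg, rm = split_233(modrm)
    if mod != 0b11:
        raise ValueError("memory operand forms are not handled here")
    # reg is part of the opcode; rm is the single explicit register index.
    return D8_OPS[reg], rm


if __name__ == "__main__":
    print(decode_d8_reg_form(0xD8, 0xC1))  # bytes D8 C1
    print(decode_d8_reg_form(0xD8, 0xC9))  # bytes D8 C9
```

A real decoder would also need the memory forms (mod other than 3) and the other seven ESC opcodes, which reuse the same fields with different operation tables.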

          rygorous@mastodon.gamedev.placeR This user is from outside of this forum
          rygorous@mastodon.gamedev.place
          wrote last edited by
          #11

          @zeux @pervognsen funnily enough, this deal is getting worse all the time!

          I mean it still renames around it. But it used to be that there were multiple ports that could handle x87 insns and these days there usually is just one.

          So you get to choose between 2 FMAs (and sometimes even a third regular FADD) every cycle in SSE-land, or a single x87 op dispatched per cycle when using x87 🙂

            wolf480pl@mstdn.ioW This user is from outside of this forum
            wolf480pl@mstdn.io
            wrote last edited by
            #12

            @rygorous
            so cursed, I love it!
            @pervognsen

              rygorous@mastodon.gamedev.placeR This user is from outside of this forum
              rygorous@mastodon.gamedev.place
              wrote last edited by
              #13

              @wolf480pl @pervognsen the key to understanding anything x86, IMO, is to not look at the mnemonics (which have all kinds of irregularities) and look at the actual instruction encoding instead.

                rygorous@mastodon.gamedev.placeR This user is from outside of this forum
                rygorous@mastodon.gamedev.place
                wrote last edited by
                #14

                @wolf480pl @pervognsen Like, seriously.

                If you look at the actual encoding you start to realize pretty quickly that x86 is more regular than you probably thought, and, say, ARM (especially T32, but also A64) is a lot less so. 🙂

                  rygorous@mastodon.gamedev.placeR This user is from outside of this forum
                  rygorous@mastodon.gamedev.place
                  wrote last edited by
                  #15

                  @wolf480pl @pervognsen the other protip is to look at x86 instruction bytes in octal mode, not hex, because so much of it is 2-3-3 bit fields (from MSB to LSB)
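
To make the octal tip concrete, here is a small Python sketch. It is not from the thread, and the function names are made up for illustration. It prints a byte as three octal digits (the 2-3-3 split) and uses the classic ALU opcode block `00..3F` as an example: there the middle octal digit alone selects the operation.

```python
# Sketch: viewing x86 bytes as 2-3-3 bit fields, i.e. three octal digits.
# Function names are illustrative assumptions.

# Middle octal digit of the classic ALU opcodes (00..3F, low digit 0..5).
ALU_OPS = ["add", "or", "adc", "sbb", "and", "sub", "xor", "cmp"]


def octal_fields(byte):
    """Format a byte as its three octal digits, MSB first."""
    return f"{(byte >> 6) & 3}{(byte >> 3) & 7}{byte & 7}"


def classic_alu_op(opcode):
    """Return the ALU mnemonic for opcodes of the form 0 <op> <0..5>, else None."""
    top, mid, low = (opcode >> 6) & 3, (opcode >> 3) & 7, opcode & 7
    if top != 0 or low > 5:
        return None  # not part of the regular ALU block
    return ALU_OPS[mid]


if __name__ == "__main__":
    # 89 D8 is `mov eax, ebx`: the ModRM byte D8 reads as 3 3 0 in octal,
    # meaning mod=3 (register), reg=3 (ebx), rm=0 (eax).
    for b in (0x89, 0xD8):
        print(f"{b:02X} -> {octal_fields(b)}")
    # In hex, 01 / 29 / 31 / 39 look unrelated; in octal the pattern shows.
    for b in (0x01, 0x29, 0x31, 0x39):
        print(f"{b:02X} -> {octal_fields(b)} {classic_alu_op(b)}")
```

The same reading works for the register-in-opcode forms, where the low octal digit is the register number.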

                    dominikg@mastodon.gamedev.placeD This user is from outside of this forum
                    dominikg@mastodon.gamedev.place
                    wrote last edited by
                    #16

                    @rygorous @zeux @pervognsen
                    Is there still a battle between x87 fpu stack and the mmx registers? femms and all that?
                    I remember needing to reset the fpu stack after calling Direct3D functions.

                      rygorous@mastodon.gamedev.placeR This user is from outside of this forum
                      rygorous@mastodon.gamedev.place
                      wrote last edited by
                      #17

                      @dominikg @zeux @pervognsen Yes, that's architectural. It's required.

                      • pervognsen@mastodon.socialP pervognsen@mastodon.social

                        Someone on Lobsters wondered "how a modern compiler would fare against hand-optimized asm" in reference to Abrash's TransformVector (3x3 matrix-vector multiply) hand-written x87 routine in Quake. Oh my sweet summer child. Has there ever been a compiler that did an amazing job for x87 in-order dual-pipe scheduling? The entangling of register allocation, instruction selection and instruction scheduling is like nightmare difficulty mode for compiler backends.

                        archiloque@felin.socialA This user is from outside of this forum
                        archiloque@felin.social
                        wrote last edited by
                        #18

                        @pervognsen I wonder if there are resources that explain what modern compilers are good at and what they are bad at, because I feel that people’s intuitions (and I include my own) are probably wrong about this, and that leads to bad decisions (maybe @regehr knows of one?)

                          regehr@mastodon.socialR This user is from outside of this forum
                          regehr@mastodon.social
                          wrote last edited by
                          #19

                          @archiloque @pervognsen I can't think of a good resource other than "be on twitter/mastodon at the right time and follow the right people"

                            amonakov@mastodon.gamedev.placeA This user is from outside of this forum
                            amonakov@mastodon.gamedev.place
                            wrote last edited by
                            #20

                            @pervognsen that's because stores to 'out' can alias with other loads, you need a temporary or 'restrict': https://gcc.godbolt.org/z/e8hxEGxor
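
The aliasing point can be modelled in a few lines of Python. This is not the C code behind the godbolt links. It is a sketch in which lists stand in for pointers, and the function names and the sample matrix are assumptions made for illustration. `transform_in_order` stores each result before the next row is computed, as the C source order would. `transform_with_temporaries` does all loads first and all stores last, which is the interleaved schedule a compiler may only pick when it knows `out` cannot overlap the inputs (a temporary, or `restrict` in C).

```python
# Sketch: why a store to `out` pins the schedule when `out` may alias inputs.
# Python lists model pointers; names and values are illustrative assumptions.


def dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def transform_in_order(m, v, out):
    """Source order: each store happens before the next row's loads."""
    out[0] = dot(m[0], v)
    out[1] = dot(m[1], v)
    out[2] = dot(m[2], v)


def transform_with_temporaries(m, v, out):
    """All loads and arithmetic first, all stores last."""
    x = dot(m[0], v)
    y = dot(m[1], v)
    z = dot(m[2], v)
    out[0], out[1], out[2] = x, y, z


if __name__ == "__main__":
    m = [[0, 1, 0], [1, 0, 0], [0, 0, 1]]  # swaps the first two components

    # No aliasing: both orders agree, so the reordering is safe.
    o1, o2 = [0, 0, 0], [0, 0, 0]
    transform_in_order(m, [1, 2, 3], o1)
    transform_with_temporaries(m, [1, 2, 3], o2)
    print(o1, o2)

    # Aliasing (out is the input vector): the two orders disagree, so a
    # compiler that cannot rule this out must keep the stores in place.
    a, b = [1, 2, 3], [1, 2, 3]
    transform_in_order(m, a, a)
    transform_with_temporaries(m, b, b)
    print(a, b)
```

Because the two versions differ when the arguments overlap, the compiler has to treat the three dot products as separate blocks unless the source rules the overlap out.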

                              zeux@mastodon.gamedev.placeZ This user is from outside of this forum
                              zeux@mastodon.gamedev.place
                              wrote last edited by
                              #21

                              @rygorous @wolf480pl @pervognsen ARM64 kinda looks like huffman coding sometimes, only without variable length output you’re reduced to truncating inputs arbitrarily.

                                tomf@mastodon.gamedev.placeT This user is from outside of this forum
                                tomf@mastodon.gamedev.place
                                wrote last edited by
                                #22

                                @zeux @rygorous @wolf480pl @pervognsen x86 is also Huffman encoded, but backwards. So the instructions you rarely use are nice and short, and the instructions you use all the time have lots of wordy prefixes on them.

                                  rygorous@mastodon.gamedev.placeR This user is from outside of this forum
                                  rygorous@mastodon.gamedev.place
                                  wrote last edited by
                                  #23

                                  @TomF @zeux @wolf480pl @pervognsen You keep saying this and it's not even close to true.

                                  Code density of x86, x86-64 etc. vs. other ISAs has been well-studied approximately a zillion times, with the same result every damn time.

                                  x86's encoding is not optimal for the instruction frequencies seen in current x86 but it is not even remotely close to backwards, and it's denser than most alternatives, even some that explicitly go for density.

                                  The most frequent instructions are, in fact, 1B opcodes.

                                    rygorous@mastodon.gamedev.placeR This user is from outside of this forum
                                    rygorous@mastodon.gamedev.place
                                    wrote last edited by
                                    #24

                                    @TomF @zeux @wolf480pl @pervognsen 32-bit x86 is usually roughly comparable with m68k, worse than Thumb-2, but usually better than RV32GC or ARM A64.

                                    x86-64 is roughly comparable to RV64GC (which is variable-length and compressed), similar or worse than the (fixed-size-ish, SVE gets a bit weird) ARM A64.

                                    Almost all 32-bit fixed-instruction-size RISCs are significantly worse than all these options on typical code. Something like 25% larger.

                                      rygorous@mastodon.gamedev.placeR This user is from outside of this forum
                                      rygorous@mastodon.gamedev.place
                                      wrote last edited by
                                      #25

                                      @TomF @zeux @wolf480pl @pervognsen There's this for example https://portal.mozz.us/gemini/arcanesciences.com/gemlog/22-07-28/? and the related ratings vary but the broad trend holds true:

                                      - Really compact: ARC, Thumb
                                      - OK: m68k, x86, moderately compressed variable-size reprs like RV32GC/RV64GC, ARM A64 (Don't know where MIPS16 lands here)
                                      - Bad: most 32b fixed-size encodings, zSeries

                                        zeux@mastodon.gamedev.placeZ This user is from outside of this forum
                                        zeux@mastodon.gamedev.place
                                        wrote last edited by
                                        #26

                                        @rygorous @TomF @wolf480pl @pervognsen I think I posted stats for some random programs a little while back and it was amusing to see x64 average close to 4 bytes per instruction. Lots of 1-2 byte instructions but also lots of 5-8+ byte ones, so it all blends. Although I didn’t do instruction *counts*, which might be a little smaller vs fixed-length instruction sets.

                                          flux@wandering.shopF This user is from outside of this forum
                                          flux@wandering.shop
                                          wrote last edited by
                                          #27

                                          @rygorous In Tom's defense, they did hurt us.

                                          @TomF @zeux @wolf480pl @pervognsen
