Someone on Lobsters wondered "how a modern compiler would fare against hand-optimized asm" in reference to Abrash's TransformVector (3x3 matrix-vector multiply) hand-written x87 routine in Quake.

  • pervognsen@mastodon.social

    Someone on Lobsters wondered "how a modern compiler would fare against hand-optimized asm" in reference to Abrash's TransformVector (3x3 matrix-vector multiply) hand-written x87 routine in Quake. Oh my sweet summer child. Has there ever been a compiler that did an amazing job for x87 in-order dual-pipe scheduling? The entangling of register allocation, instruction selection and instruction scheduling is like nightmare difficulty mode for compiler backends.

    rygorous@mastodon.gamedev.place wrote (#3):

    @pervognsen I don't think it would be particularly hard, it just doesn't fit _at all_ with the machine models modern compilers like to work with.

    If you treat x87 like you have arbitrary reg-reg operations and then try to lower it to stack ops later, you're gonna have a bad time. But I don't think there's anything inherently difficult about scheduling for a pipelined stack machine that would be compiler-defeating.

      rygorous@mastodon.gamedev.place wrote (#4):

      @pervognsen The only part of x87 that increases the difficulty vs. a general abstract pipelined stack machine is that the max stack size is bounded (eight registers), so you need to work around that.

        rygorous@mastodon.gamedev.place wrote (#5):

        @pervognsen and on the Pentium in particular there's nothing actually dual-pipe about this.

        You have one pipeline (U pipe) that can accept FP operations. The only V-pipe pairable x87 op on Pentium is FXCH, which is table stakes in order to get _any_ decent use out of pipelined FP ops on a stack machine in FP-heavy code.

          rygorous@mastodon.gamedev.place wrote (#6):

          @pervognsen (In the sense that if they didn't allow for that, you would never be able to issue two FP ops in back-to-back cycles at all.)

            wolf480pl@mstdn.io wrote (#7):

            @rygorous @pervognsen
            So most x87 instructions have such limited addressing modes that it needs a second pipe just for moving data into the place from which other instructions can use it?

            That's such a TuringComplete.game thing to do 😄

              rygorous@mastodon.gamedev.place wrote (#8):

              @wolf480pl @pervognsen nah, the addressing modes are normal, but they had very limited encoding space to use, hence the stack. Specifically, one of the 3-bit "register" fields in the ModRM byte of the encoding is actually part of the opcode, since the basic ESC (coprocessor) range only has 3 opcode bits in it.

              • pervognsen@mastodon.social

                Even if you force gcc to dust off its ancient Pentium 1 scheduling model (and force it to actually schedule instructions with -fschedule-insns and -fschedule-insns2), the output is hardly different from the VC6 output shown in https://fabiensanglard.net/quake_asm_optimizations/index.html and it still schedules the three dot products as separate blocks. https://gcc.godbolt.org/z/YE4d53cv3

                zeux@mastodon.gamedev.place wrote (#9):

                @pervognsen A question I'm more curious about is what is the delta on a modern OOO CPU?

                  rygorous@mastodon.gamedev.place wrote (#10):

                  @wolf480pl @pervognsen so instead of 2-register like most x86 insns, almost everything FPU only gets to use 1 register index (or the memory reference, which is normal), the other one is implied to be the top of stack (st(0)).
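
A minimal sketch of the encoding point in the two posts above, assuming only the standard ModRM layout (2-bit mod, 3-bit reg, 3-bit r/m) and the ESC opcode range 0xD8..0xDF. The type and function names (`X87Fields`, `x87_split`, `x87_print`) are made up for illustration and do not come from any real decoder. It shows that the middle ModRM field carries extra opcode bits, so only one field is left to name a register.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical field names for illustration; not taken from any real decoder. */
typedef struct {
    unsigned esc; /* low 3 bits of the ESC opcode byte (0xD8..0xDF) */
    unsigned mod; /* ModRM bits 7..6; 3 selects the register form */
    unsigned ext; /* ModRM bits 5..3; opcode extension, not a register */
    unsigned rm;  /* ModRM bits 2..0; the i in st(i) when mod == 3 */
} X87Fields;

/* Splits an opcode/ModRM pair. Returns 1 for an ESC opcode, 0 otherwise. */
static int x87_split(uint8_t opcode, uint8_t modrm, X87Fields *out)
{
    assert(out != NULL);
    if ((opcode & 0xF8) != 0xD8)
        return 0; /* not in the coprocessor escape range */
    out->esc = opcode & 7u;
    out->mod = modrm >> 6;
    out->ext = (modrm >> 3) & 7u;
    out->rm  = modrm & 7u;
    return 1;
}

/* Prints the split fields for one opcode/ModRM pair (nothing if not ESC). */
static void x87_print(uint8_t opcode, uint8_t modrm)
{
    X87Fields f;
    if (x87_split(opcode, modrm, &f))
        printf("%02X %02X: esc=%u mod=%u ext=%u rm=%u\n",
               (unsigned)opcode, (unsigned)modrm, f.esc, f.mod, f.ext, f.rm);
}
```

For example, `x87_print(0xD8, 0xC1)` (fadd st(0), st(1)) and `x87_print(0xD8, 0xC9)` (fmul st(0), st(1)) differ only in `ext`, while `x87_print(0xD9, 0xC9)` (fxch st(1)) differs from the fmul only in `esc`. In all three, `rm` is the single explicit register index and st(0) is implied.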

                    rygorous@mastodon.gamedev.place wrote (#11):

                    @zeux @pervognsen funnily enough, this deal is getting worse all the time!

                    I mean it still renames around it. But it used to be that there were multiple ports that could handle x87 insns and these days there usually is just one.

                    So you get to choose between 2 FMAs (and sometimes even a third regular FADD) every cycle in SSE-land, or a single x87 op dispatched per cycle when using x87 🙂

                      wolf480pl@mstdn.io wrote (#12):

                      @rygorous
                      so cursed, I love it!
                      @pervognsen

                        rygorous@mastodon.gamedev.place wrote (#13):

                        @wolf480pl @pervognsen the key to understanding anything x86, IMO, is to not look at the mnemonics (which have all kinds of irregularities) and look at the actual instruction encoding instead.

                          rygorous@mastodon.gamedev.place wrote (#14):

                          @wolf480pl @pervognsen Like, seriously.

                          If you look at the actual encoding you start to realize pretty quick that x86 is more regular than you probably thought, and say ARM (especially T32, but also A64) a lot less. 🙂

                            rygorous@mastodon.gamedev.place wrote (#15):

                            @wolf480pl @pervognsen the other protip is to look at x86 instruction bytes in octal mode, not hex, because so much of it is 2-3-3 bit fields (from MSB to LSB)
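
To make the octal tip concrete, here is a small sketch that splits a byte into its 2-3-3 bit groups, which are exactly its three octal digits. The function names `octal_digits` and `dump_octal` are made up for this example; the byte values in the note below are standard x86 encodings.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Writes the three octal digits of `byte` (2-3-3 bits, MSB first) into
   digits[0..2]. For a ModRM byte these are mod, reg and r/m. */
static void octal_digits(uint8_t byte, unsigned digits[3])
{
    assert(digits != NULL);
    digits[0] = byte >> 6;        /* top 2 bits */
    digits[1] = (byte >> 3) & 7u; /* middle 3 bits */
    digits[2] = byte & 7u;        /* low 3 bits */
}

/* Prints each byte of a sequence in hex and in octal, one per line. */
static void dump_octal(const uint8_t *bytes, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        unsigned d[3];
        octal_digits(bytes[i], d);
        printf("%02X = %u%u%u\n", (unsigned)bytes[i], d[0], d[1], d[2]);
    }
}
```

Feeding it `89 D8` (`mov eax, ebx`) gives 211 and 330: the second byte reads directly as mod=3, reg=3 (ebx), r/m=0 (eax). The basic ALU opcodes line up the same way: hex 01, 09, 29, 31 (add, or, sub, xor) are octal 001, 011, 051, 061, so the middle digit selects the operation.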

                              dominikg@mastodon.gamedev.place wrote (#16):

                              @rygorous @zeux @pervognsen
                              Is there still a battle between x87 fpu stack and the mmx registers? femms and all that?
                              I remember needing to reset the fpu stack after calling Direct3D functions.

                                rygorous@mastodon.gamedev.place wrote (#17):

                                @dominikg @zeux @pervognsen Yes, that's architectural. It's required.

                                  archiloque@felin.social wrote (#18):

                                  @pervognsen I wonder if there are resources that explain what modern compilers are good at and what they are bad at, because I feel that people’s intuitions (and I include my own) are probably wrong about this, and that leads to bad decisions (maybe @regehr knows one?)

                                    regehr@mastodon.social wrote (#19):

                                    @archiloque @pervognsen I can't think of a good resource other than "be on twitter/mastodon at the right time and follow the right people"

                                      amonakov@mastodon.gamedev.place wrote (#20):

                                      @pervognsen that's because stores to 'out' can alias with other loads, you need a temporary or 'restrict': https://gcc.godbolt.org/z/e8hxEGxor
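
A sketch of the aliasing point, assuming a plain-C 3x3 transform with the basis passed as a row-major array (Quake's routine reads global vectors instead; the names `transform_vector_*` and the parameter `m` are made up here, and this is not the code behind the godbolt link). Without `restrict` or temporaries, the compiler has to assume the store to `out[0]` may change `in` or `m`, so it cannot hoist the later loads above that store and the three dot products stay in separate blocks.

```c
#include <assert.h>

/* Plain version: each store to out[] may alias in[] or m[], so the loads for
   row 1 cannot be moved above the store to out[0], and so on. */
static void transform_vector_plain(const float *in, const float *m, float *out)
{
    out[0] = in[0] * m[0] + in[1] * m[1] + in[2] * m[2];
    out[1] = in[0] * m[3] + in[1] * m[4] + in[2] * m[5];
    out[2] = in[0] * m[6] + in[1] * m[7] + in[2] * m[8];
}

/* restrict version: the caller promises that out does not overlap in or m,
   so the compiler is free to interleave all nine multiplies. */
static void transform_vector_restrict(const float *restrict in,
                                      const float *restrict m,
                                      float *restrict out)
{
    /* Cheap sanity check; it only catches exact aliasing, not partial overlap. */
    assert(out != in && out != m);
    out[0] = in[0] * m[0] + in[1] * m[1] + in[2] * m[2];
    out[1] = in[0] * m[3] + in[1] * m[4] + in[2] * m[5];
    out[2] = in[0] * m[6] + in[1] * m[7] + in[2] * m[8];
}

/* Temporaries version: every result is computed before the first store, so
   no load follows a store and aliasing no longer constrains the schedule. */
static void transform_vector_temps(const float *in, const float *m, float *out)
{
    const float r0 = in[0] * m[0] + in[1] * m[1] + in[2] * m[2];
    const float r1 = in[0] * m[3] + in[1] * m[4] + in[2] * m[5];
    const float r2 = in[0] * m[6] + in[1] * m[7] + in[2] * m[8];
    out[0] = r0;
    out[1] = r1;
    out[2] = r2;
}
```

Calling `transform_vector_plain(v, m, v)` with the output aliasing the input is legal, and it is exactly the case the compiler has to preserve; the same call on the `restrict` version would be undefined behavior. Whether either variant actually gets the dot products interleaved depends on the compiler and its scheduling model.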

                                        zeux@mastodon.gamedev.place wrote (#21):

                                        @rygorous @wolf480pl @pervognsen ARM64 kinda looks like huffman coding sometimes, only without variable length output you’re reduced to truncating inputs arbitrarily.

                                          tomf@mastodon.gamedev.place wrote (#22):

                                          @zeux @rygorous @wolf480pl @pervognsen x86 is also Huffman encoded, but backwards. So the instructions you rarely use are nice and short, and the instructions you use all the time have lots of wordy prefixes on them.
