Someone on Lobsters wondered "how a modern compiler would fare against hand-optimized asm" in reference to Abrash's TransformVector (3x3 matrix-vector multiply) hand-written x87 routine in Quake.
-
@zeux @pervognsen funnily enough, this deal is getting worse all the time!
I mean it still renames around it. But it used to be that there were multiple ports that could handle x87 insns and these days there usually is just one.
So you get to choose between 2 FMAs (and sometimes even a third regular FADD) every cycle in SSE-land, or a single x87 op dispatched per cycle when using x87

@rygorous @zeux @pervognsen
Is there still a battle between x87 fpu stack and the mmx registers? femms and all that?
I remember needing to reset the fpu stack after calling Direct3D functions.
-
@dominikg @zeux @pervognsen Yes, that's architectural. It's required.
-
Someone on Lobsters wondered "how a modern compiler would fare against hand-optimized asm" in reference to Abrash's TransformVector (3x3 matrix-vector multiply) hand-written x87 routine in Quake. Oh my sweet summer child. Has there ever been a compiler that did an amazing job for x87 in-order dual-pipe scheduling? The entangling of register allocation, instruction selection and instruction scheduling is like nightmare difficulty mode for compiler backends.
@pervognsen i wonder if there are resources that explain what modern compilers are good at and what they are bad at, because i feel that people’s intuitions (and I include my own) are probably wrong about this and it leads to bad decisions (maybe @regehr knows one?)
-
@archiloque @pervognsen I can't think of a good resource other than "be on twitter/mastodon at the right time and follow the right people"
-
Even if you force gcc to dust off its ancient Pentium 1 scheduling model (and force it to actually schedule instructions with -fschedule-insns and -fschedule-insns2), it's hardly different from VC6 in https://fabiensanglard.net/quake_asm_optimizations/index.html and still schedules the three dot products as separate blocks. https://gcc.godbolt.org/z/YE4d53cv3
@pervognsen that's because stores to 'out' can alias with other loads, you need a temporary or 'restrict': https://gcc.godbolt.org/z/e8hxEGxor
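To make the aliasing point concrete, here is a minimal sketch in C. The function names, the flat row-major 3x3 layout, and the signatures are assumptions for illustration; this is not Quake's actual TransformVector nor the code behind either godbolt link. It shows the naive form next to the two fixes mentioned: `restrict`, and loading everything into temporaries before the first store.

```c
#include <assert.h>

/* Naive form: out may alias in or m, so the compiler must keep each store
   ordered before the loads of the following dot product. */
void transform_naive(const float *in, const float *m, float *out) {
    assert(in && m && out);
    out[0] = in[0] * m[0] + in[1] * m[1] + in[2] * m[2];
    out[1] = in[0] * m[3] + in[1] * m[4] + in[2] * m[5];
    out[2] = in[0] * m[6] + in[1] * m[7] + in[2] * m[8];
}

/* restrict (C99): the caller promises that out overlaps neither in nor m,
   which leaves the compiler free to interleave the three dot products. */
void transform_restrict(const float *restrict in, const float *restrict m,
                        float *restrict out) {
    assert(in && m && out);
    out[0] = in[0] * m[0] + in[1] * m[1] + in[2] * m[2];
    out[1] = in[0] * m[3] + in[1] * m[4] + in[2] * m[5];
    out[2] = in[0] * m[6] + in[1] * m[7] + in[2] * m[8];
}

/* Temporaries: every load happens before any store, so no promise is needed
   and in-place calls (out == in) still give the right answer. */
void transform_temp(const float *in, const float *m, float *out) {
    assert(in && m && out);
    const float x = in[0] * m[0] + in[1] * m[1] + in[2] * m[2];
    const float y = in[0] * m[3] + in[1] * m[4] + in[2] * m[5];
    const float z = in[0] * m[6] + in[1] * m[7] + in[2] * m[8];
    out[0] = x;
    out[1] = y;
    out[2] = z;
}
```

Calling `transform_naive(v, m, v)` in place is legal C and gives a different result from `transform_temp(v, m, v)`, which is exactly why the compiler cannot reorder the naive version on its own. Calling `transform_restrict` with overlapping pointers is undefined behavior.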
-
@wolf480pl @pervognsen Like, seriously.
If you look at the actual encoding you start to realize pretty quick that x86 is more regular than you probably thought, and say ARM (especially T32, but also A64) a lot less.

@rygorous @wolf480pl @pervognsen ARM64 kinda looks like huffman coding sometimes, only without variable length output you’re reduced to truncating inputs arbitrarily.
-
@zeux @rygorous @wolf480pl @pervognsen x86 is also Huffman encoded, but backwards. So the instructions you rarely use are nice and short, and the instructions you use all the time have lots of wordy prefixes on them.
-
@TomF @zeux @wolf480pl @pervognsen You keep saying this and it's not even close to true.
Code density of x86, x86-64 etc. vs. other ISAs has been well-studied approximately a zillion times, with the same result every damn time.
x86's encoding is not optimal for the instruction frequencies seen in current x86 but it is not even remotely close to backwards, and it's denser than most alternatives, even some that explicitly go for density.
The most frequent instructions are, in fact, 1B opcodes.
-
@TomF @zeux @wolf480pl @pervognsen 32-bit x86 is usually roughly comparable with m68k, worse than Thumb-2, but usually better than RV32GC or ARM A64.
x86-64 is roughly comparable to RV64GC (which is variable-length and compressed), similar to or worse than the (fixed-size-ish, SVE gets a bit weird) ARM A64.
Almost all 32-bit fixed-instruction-size RISCs are significantly worse than all these options on typical code. Something like 25% larger.
-
@TomF @zeux @wolf480pl @pervognsen There's this for example https://portal.mozz.us/gemini/arcanesciences.com/gemlog/22-07-28/? and the related ratings vary but the broad trend holds true:
- Really compact: ARC, Thumb
- OK: m68k, x86, moderately compressed variable-size reprs like RV32GC/RV64GC, ARM A64 (Don't know where MIPS16 lands here)
- Bad: most 32b fixed-size encodings, zSeries
-
@rygorous @TomF @wolf480pl @pervognsen I think I posted stats for some random programs a little while back and it was amusing to see x64 average close to 4 bytes per instruction. Lots of 1-2 bytes instructions but also lots of 5-8+ so it all blends. Although I didn’t do instruction *counts* which might be a little smaller vs fixed length instruction sets.
-
@rygorous In Tom's defense, they did hurt us.
-
@TomF @zeux @wolf480pl @pervognsen The outlier here is that ARM A64 plays in the same class for code density as the "moderately density-optimized variable-size" encodings despite being 32b fixed instruction size.
This comes at some cost in decoding complexity: A64 is a _way_ less regular encoding than say RV32 or MIPS32 are.
But also several of the ideas in A64 that help code density are just objectively good ideas that are now popping up elsewhere.

-
@zeux @TomF @wolf480pl @pervognsen x86 insns averaging around 4B is about right but the important thing to keep in mind is that a lot of things that are 1 instruction in x86 are also multiple instructions in other encodings.
E.g. a lot of 6B/7B instructions are of the 2B/3B instruction + 4B imm32 variety, and those that actually need an imm32 are usually 8B and 2 insns on RISCs.
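As a rough worked instance of that byte count, the sketch below encodes x86 `add r32, imm32` (opcode 0x81 /0, 6 bytes) and the RV32I `lui` + `addi` pair that materializes the same 32-bit constant (2 instructions, 8 bytes, before the add itself). The helper names are hypothetical, and RV32I is only one stand-in for "RISCs"; the encodings follow the published instruction formats.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* x86: `add r32, imm32` is opcode 0x81, a ModRM byte (mod=11, /0, rm=reg),
   then the immediate in little-endian order: 2 + 4 = 6 bytes. */
size_t x86_add_r32_imm32(uint8_t out[6], unsigned reg, uint32_t imm) {
    assert(reg < 8); /* the eight legacy registers only, no REX prefix */
    out[0] = 0x81;
    out[1] = (uint8_t)(0xC0 | reg);
    for (int i = 0; i < 4; i++)
        out[2 + i] = (uint8_t)(imm >> (8 * i));
    return 6;
}

/* RV32I: materialize imm in rd with `lui rd, hi20` + `addi rd, rd, lo12`.
   addi sign-extends lo12, so hi20 is rounded up when bit 11 of imm is set. */
size_t rv32_load_imm32(uint32_t out[2], unsigned rd, uint32_t imm) {
    assert(rd < 32);
    uint32_t hi20 = ((imm + 0x800u) >> 12) & 0xFFFFFu;
    uint32_t lo12 = imm & 0xFFFu;
    out[0] = (hi20 << 12) | (rd << 7) | 0x37u;              /* lui  */
    out[1] = (lo12 << 20) | (rd << 15) | (rd << 7) | 0x13u; /* addi */
    return 2 * sizeof(uint32_t);
}
```

For `0x12345678` this gives the x86 bytes `81 C1 78 56 34 12` with `ecx` as the destination, and the RV32I words `0x12345537` and `0x67850513` with `a0` (x10) as the destination.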
-
@zeux @TomF @wolf480pl @pervognsen
That said, x86's immeds are one of its weaker points. simm8 is inconsistently available and often too short, imm32 is almost always overkill.
But if you look at binaries you start to notice just how much code size is the same boilerplate over and over and over again.
One of A64's secret weapons re: density is STP/LDP for function prologues/epilogues, to save/restore two regs at once. (Which x86 APX is now stealing just for reg saving with PUSH2/POP2.)
-
@rygorous @zeux @TomF @pervognsen
> x86 APX

did Intel really reuse that brand after 40 years?
-
@zeux @TomF @wolf480pl @pervognsen The other big source of chonky insns with x86 is something like 4B/5B base insn and then a complicated addressing mode. But again the complicated addressing mode adds one or more extra insns on other archs.
It's not the same, because the calculus changes. On x86 you'll often redo variants of the same address calc 2-3 times, without those addr modes you'd calc once and share. Still, these extra bytes are not bloat, they're paying for themselves (on average).
-
@rygorous @zeux @TomF @pervognsen
also, what happened with ARM's STM/LDM? I thought those were its superpower all the way since ARM2 and Acorn Archimedes
-
@wolf480pl m68k had those too! Anyway, they're gone in A64, replaced with LDP/STP.
LDP/STP are a better compromise. LDM/STM is awkward in numerous ways, chiefly in that it's inherently a variable-number-of-uops flow with different memory access sizes and variable number of regs referenced, which is all a bit of a nightmare.
LDP/STP is fixed access size, fixed number of register references. It cuts prologues/epilogues in ~half without needing tricky micro-sequencing.
-
@rygorous @wolf480pl I found it pretty mind blowing that Apple cores are able to execute them as single 128-bit loads and stores, thereby doubling the number of possible scalar register loads/stores per cycle compared to x86 cores.
(It's very nice raw throughput, though I'm guessing some weird memory renaming tricks can maybe make up some of the difference on the x86 side?)