rygorous@mastodon.gamedev.place

@regehr the entire interview is so good

rygorous@mastodon.gamedev.place

@david_chisnall @koakuma @regehr (I will say that during the relatively short window I contracted for Intel, this tradition of organizational dysfunction with different teams ostensibly on the same effort actively trying to sabotage each other was alive and well)

rygorous@mastodon.gamedev.place

@david_chisnall @koakuma @regehr Mind, there's no question that the 432 was still bad in many ways, but the issue was less "good compilers are impossible to write for this" and more "there were many unforced mistakes that the compiler people tried to raise awareness about and got shut down, so they stopped caring"

rygorous@mastodon.gamedev.place

@david_chisnall @koakuma @regehr incidentally, Bob Colwell was at Multiflow working on (and shipping) commercial VLIW HW and then later went to Intel, becoming Chief Architect of the PPro. (What he thought about Itanium given his background shows up later in the same PDF.)

rygorous@mastodon.gamedev.place

@david_chisnall @koakuma @regehr The iAPX 432 from what I can tell was not at all impossible to write compilers for, but as per Bob Colwell who wrote a bunch of papers dissecting the 432 for his PhD, the compiler team did not get along with the HW team and explicitly didn't want the effort to succeed. https://www.sigmicro.org/media/oralhistories/colwell.pdf p. 51

rygorous@mastodon.gamedev.place

@koakuma @regehr my favorite part of the whole saga is how while the HP and Itanium guys were high on life and VLIW, several key people in the PPro team (which were ex-Multiflow and hence had actually, you know, shipped a VLIW) were going "either you have figured out something really major and fundamental that we never got or you're fully talking out of your ass"

rygorous@mastodon.gamedev.place

@zeux @TomF @wolf480pl @pervognsen That said, while annoying and an ongoing cost for those who build x86s (they sell them by the tens of millions, they'll be fine), ILD adds maybe 1 pipeline stage more than it needs to.

If the x86 encoding is kinda meh, then so are its mistakes. Yeah, if you were planning to build a lasting arch now, you certainly wouldn't do it that way. It adds extra overhead. But not in a way or to an extent that is particularly damning.

rygorous@mastodon.gamedev.place

@zeux @TomF @wolf480pl @pervognsen There are many complaints to be made about x86.

The original sin of the x86 encoding is that you can't tell the total size of an instruction from just its first 1-2 bytes. That is an actual design mistake that necessitates relatively complex to build and verify instruction-length decoder hardware that could be either gone entirely (fixed-size encodings) or at least way simpler than it is.
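
To make the "first 1-2 bytes" point concrete, here is a minimal Python sketch, not a real decoder. It assumes RISC-V (16/32-bit forms only) as the comparison encoding, and on the x86 side it handles only an unprefixed one-byte opcode followed by ModRM, with 32-bit addressing and no immediate. The function names are made up for illustration.

```python
def riscv_length(first_halfword: int) -> int:
    """Byte length of a RISC-V instruction, read from its first 16 bits.

    Only the standard 16-bit and 32-bit forms are handled in this sketch.
    """
    if first_halfword & 0b11 != 0b11:
        return 2  # compressed (C extension) instruction
    if (first_halfword >> 2) & 0b111 != 0b111:
        return 4  # regular 32-bit instruction
    raise ValueError("longer encodings are not handled in this sketch")


def x86_modrm_length(code: bytes) -> int:
    """Byte length of `opcode ModRM ...` with no prefixes and no immediate.

    Assumes a one-byte opcode and 32-bit addressing. Note that the SIB
    byte (the third byte) can still change the answer.
    """
    modrm = code[1]
    mod, rm = modrm >> 6, modrm & 0b111
    length = 2  # opcode + ModRM
    if mod == 0b11:
        return length  # register operand, nothing follows
    if rm == 0b100:
        sib = code[2]
        length += 1  # SIB byte
        if mod == 0b00 and (sib & 0b111) == 0b101:
            length += 4  # no base register, disp32 follows
    elif mod == 0b00 and rm == 0b101:
        length += 4  # plain disp32 address
    if mod == 0b01:
        length += 1  # disp8
    elif mod == 0b10:
        length += 4  # disp32
    return length


# Same first two bytes (8B 04), different total length:
print(x86_modrm_length(bytes([0x8B, 0x04, 0x08])))              # mov eax, [eax+ecx]     -> 3
print(x86_modrm_length(bytes([0x8B, 0x04, 0x0D, 0, 0, 0, 0])))  # mov eax, [ecx*1+disp32] -> 7
```

Real x86 length decoding also has to deal with prefixes, multi-byte opcodes and immediates whose size depends on both, which is where the hardware cost comes from.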

rygorous@mastodon.gamedev.place

@zeux @TomF @wolf480pl @pervognsen anyway, re: the link I posted

x86 is sometimes unfairly valorized as being very dense (I think that might just go back to the early 90s when the primary competition was all the classic 32-bit RISCs, in which case, yeah) and sometimes unfairly slandered as being pathologically bad, and neither is true.

It is thoroughly, blandly, middle-of-the-road, neither as dense as encodings that optimized for density nor as bloated as the classic RISC encodings.

rygorous@mastodon.gamedev.place

@zeux @TomF @wolf480pl @pervognsen there's some more variants of this (like yeah maybe AND/bit tests too) but if you look at instruction traces (and also disassembly in general) it's funny just how much of it is just this

rygorous@mastodon.gamedev.place

@zeux @TomF @wolf480pl @pervognsen Just to be self-contained, the stuff that really matters:

- load/store to (reg+small_imm)
- add, sub reg/reg and reg/imm, mov if 2-address
- sign/zero extends, as needed
- nearby conditional branches (whether it be compare + branch form or a branch-if-cond form) and nearby unconditional branches ("nearby" meaning small-offset region, +-4k range is most important)
- prologue/epilogue insns like PUSH/POP or LDP/STP if applicable
- CALL/branch-and-link, return

rygorous@mastodon.gamedev.place

@wren6991 @TomF @zeux @wolf480pl @pervognsen no clue, I've looked at the C extension but not the newer stuff.

rygorous@mastodon.gamedev.place

@zeux @TomF @wolf480pl @pervognsen In general, the thing to keep in mind is that the part that matters for density is usually boring int code, which is the majority of it almost everywhere.

You can have 90% of the instructions in your manual have awkwardly redundant encodings and not have it matter too much for size as long as the encodings for the 10-15 insns that really matter are good.

rygorous@mastodon.gamedev.place

@wolf480pl m68k had those too! Anyway, they're gone in A64, replaced with LDP/STP.

LDP/STP are a better compromise. LDM/STM is awkward in numerous ways, chiefly in that it's inherently a variable-number-of-uops flow with different memory access sizes and variable number of regs referenced, which is all a bit of a nightmare.

LDP/STP is fixed access size, fixed number of register references. It cuts prologues/epilogues in ~half without needing tricky micro-sequencing.
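
A back-of-the-envelope version of the "~half" claim, as a Python sketch. It assumes a worst-case A64 function that saves all twelve of x19-x28, x29 and x30, and it ignores the separate stack-pointer adjustment (which pre/post-indexed STP/LDP can often fold in anyway). The helper name is made up.

```python
import math

INSN_BYTES = 4  # every A64 instruction is 4 bytes


def save_restore_bytes(num_regs: int, regs_per_insn: int) -> int:
    """Code bytes spent saving (prologue) plus restoring (epilogue) registers."""
    insns_each_way = math.ceil(num_regs / regs_per_insn)
    return 2 * insns_each_way * INSN_BYTES


num_regs = 12  # x19-x28 plus x29 (fp) and x30 (lr)
print(save_restore_bytes(num_regs, 1))  # single-register STR/LDR -> 96
print(save_restore_bytes(num_regs, 2))  # paired STP/LDP          -> 48
```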

rygorous@mastodon.gamedev.place

@zeux @TomF @wolf480pl @pervognsen The other big source of chonky insns with x86 is something like a 4B/5B base insn plus a complicated addressing mode. But again, that complicated addressing mode would cost one or more extra insns on other archs.

It's not the same, because the calculus changes. On x86 you'll often redo variants of the same address calc 2-3 times, without those addr modes you'd calc once and share. Still, these extra bytes are not bloat, they're paying for themselves (on average).

rygorous@mastodon.gamedev.place

@zeux @TomF @wolf480pl @pervognsen
That said, x86's immeds are one of its weaker points. simm8 is inconsistently available and often too short, imm32 is almost always overkill.

But if you look at binaries you start to notice just how much code size is the same boilerplate over and over and over again.

One of A64's secret weapons re: density is STP/LDP for function prologues/epilogues, to save/restore two regs at once. (Which x86 APX is now stealing just for reg saving with PUSH2/POP2.)

rygorous@mastodon.gamedev.place

@zeux @TomF @wolf480pl @pervognsen x86 insns averaging around 4B is about right, but the important thing to keep in mind is that a lot of things that are 1 instruction in x86 are multiple instructions in other encodings.

E.g. a lot of 6B/7B instructions are of the 2B/3B instruction + 4B imm32 variety, and those that actually need an imm32 are usually 8B and 2 insns on RISCs.
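
As an illustration of the "8B and 2 insns" case, here is a Python sketch of how a full 32-bit constant gets split on RV32. Picking RISC-V's LUI+ADDI pair as the example RISC is an assumption on my part, and the function name is made up. ADDI sign-extends its 12-bit immediate, so the upper part has to be rounded to compensate.

```python
def split_imm32(value: int) -> tuple[int, int]:
    """Split a 32-bit constant into (hi20, lo12) such that
    LUI rd, hi20 ; ADDI rd, rd, lo12 reproduces it on RV32."""
    value &= 0xFFFFFFFF
    lo = value & 0xFFF
    if lo >= 0x800:
        lo -= 0x1000  # ADDI's immediate is signed, so this becomes negative
    hi = ((value - lo) >> 12) & 0xFFFFF  # LUI supplies the upper 20 bits
    return hi, lo


def rebuild(hi: int, lo: int) -> int:
    """What the LUI+ADDI pair computes, with 32-bit wraparound."""
    return ((hi << 12) + lo) & 0xFFFFFFFF


hi, lo = split_imm32(0xDEADBEEF)
print(hex(hi), lo)            # 0xdeadc -273
print(hex(rebuild(hi, lo)))   # 0xdeadbeef
```

That is two 4B instructions just to materialize the constant, before anything is done with it, whereas an x86 insn like `81 C1 imm32` (add ecx, imm32) carries the constant and the operation in 6B.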

rygorous@mastodon.gamedev.place

@TomF @zeux @wolf480pl @pervognsen The outlier here is that ARM A64 plays in the same class for code density as the "moderately density-optimized variable-size" encodings despite being 32b fixed instruction size.

This comes at some cost in decoding complexity: A64 is a _way_ less regular encoding than say RV32 or MIPS32 are.

But also several of the ideas in A64 that help code density are just objectively good ideas that are now popping up elsewhere.

rygorous@mastodon.gamedev.place

@TomF @zeux @wolf480pl @pervognsen There's this, for example: https://portal.mozz.us/gemini/arcanesciences.com/gemlog/22-07-28/? The exact ratings vary between comparisons like this one, but the broad trend holds true:

- Really compact: ARC, Thumb
- OK: m68k, x86, moderately compressed variable-size reprs like RV32GC/RV64GC, ARM A64 (Don't know where MIPS16 lands here)
- Bad: most 32b fixed-size encodings, zSeries

rygorous@mastodon.gamedev.place

@TomF @zeux @wolf480pl @pervognsen 32-bit x86 is usually roughly comparable with m68k, worse than Thumb-2, but usually better than RV32GC or ARM A64.

x86-64 is roughly comparable to RV64GC (which is variable-length and compressed), similar to or worse than the (fixed-size-ish, SVE gets a bit weird) ARM A64.

Almost all 32-bit fixed-instruction-size RISCs are significantly worse than all these options on typical code. Something like 25% larger.
