@rygorous @wolf480pl I found it pretty mind-blowing that Apple cores are able to execute them as single 128-bit loads and stores, thereby doubling the number of possible scalar register loads/stores per cycle compared to x86 cores.
(It's very nice raw throughput, though I'm guessing some weird memory renaming tricks can maybe make up some of the difference on the x86 side?)