<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title><![CDATA[Working on some more libscopehal backend tweaks.]]></title><description><![CDATA[<p>Working on some more libscopehal backend tweaks.</p><p>Most scopes we work with take in int8 ADC sample data, which is then converted to float32 before further processing.</p><p>The original baseline implementation of this process was a simple loop on the CPU, which later got vectorized with AVX (keeping the fallback version for non-AVX platforms). This then got an OpenMP wrapper around it for converting multiple channels in parallel on different CPU cores.</p><p>The next iteration was to add a GPU shader, dependent on GL_EXT_shader_8bit_storage being available, to do sample conversion.</p><p>This shader runs one GPU thread per sample, reads an int8, writes back a float32.
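</p><p>As a rough C++ sketch (my own illustration, not the actual libscopehal code; "gain" and "offset" are hypothetical names), the baseline conversion is just a scale-and-offset on each int8:</p>

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical sketch of the baseline per-sample conversion:
// each int8 ADC code becomes a float32 via a per-channel gain
// and offset. Names are illustrative, not libscopehal's API.
std::vector<float> ConvertSamples(const int8_t* in, size_t n,
                                  float gain, float offset)
{
    std::vector<float> out(n);
    for(size_t i = 0; i < n; i++)
        out[i] = in[i] * gain + offset;
    return out;
}
```

<p>The AVX and shader paths compute the same thing, just many samples at a time.</p><p>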
It works, and is significantly faster than the CPU.</p><p>For example, on my Xeon 8362 + R9700 test system, converting ten million points takes:<br />* CPU baseline: 10.5 ms<br />* AVX2: 9.2 ms<br />* First shader: 0.89 ms</p><p>But now we have three different implementations of the same logic floating around, and lots of complex conditional logic around dispatch.</p><p>Plus, I'm planning to transition away from OpenMP: it has portability and tooling complexity issues, and its thread pools are annoying when debugging and need environment variables set on startup. It's just generally a PITA.</p>]]></description><link>https://board.circlewithadot.net/topic/2d996ce2-a19a-4fdc-8d21-87362b6e3e34/working-on-some-more-libscopehal-backend-tweaks.</link><generator>RSS for Node</generator><lastBuildDate>Fri, 15 May 2026 06:58:46 GMT</lastBuildDate><atom:link href="https://board.circlewithadot.net/topic/2d996ce2-a19a-4fdc-8d21-87362b6e3e34.rss" rel="self" type="application/rss+xml"/><pubDate>Sun, 03 May 2026 09:07:48 GMT</pubDate><ttl>60</ttl><item><title><![CDATA[Reply to Working on some more libscopehal backend tweaks. on Sun, 03 May 2026 11:09:06 GMT]]></title><description><![CDATA[<p>Added a 16-bit version of the same conversion block for use with &gt;8 bit ADCs.</p><p>Apple and llvmpipe showed huge improvements, the AMD a slight one, and nvidia almost no change.</p><p>But again, the main goal is portability, so "it runs at least as fast and doesn't depend on extensions" is a major win.
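</p><p>Assuming the 16-bit block mirrors the packed-uint32 trick of the extension-free 8-bit shader (fetch a uint32, shift, sign-extend), a hedged C++ sketch of both unpacks, purely my own illustration rather than the shader source:</p>

```cpp
#include <cstdint>

// Hedged sketch of the extension-free unpack: each GPU thread fetches
// a packed uint32 and extracts sign-extended samples with shifts.
// Shown in C++ rather than GLSL; not the actual shader code.

// Extract sample i (0..3) from a word holding four little-endian int8s:
// shift the target byte to the top, then arithmetic-shift back down so
// the sign bit is replicated.
float UnpackInt8(uint32_t word, int i)
{
    return (float)((int32_t)(word << (24 - 8*i)) >> 24);
}

// Same idea for two little-endian int16s per word.
float UnpackInt16(uint32_t word, int i)
{
    return (float)((int32_t)(word << (16 - 16*i)) >> 16);
}
```

<p>In GLSL the equivalent reads from a plain uint[] storage buffer, which needs no 8/16-bit storage extensions.</p><p>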
Bolted this into the ThunderScope driver already and will try to do others soon.</p><p>* R9700: 1.58 -&gt; 1.26 ms<br />* RTX 3070: 7.41 -&gt; 7.36 ms<br />* GTX 1630: 9.50 -&gt; 9.45 ms<br />* Apple M4: 12.88 -&gt; 1.93 ms<br />* Xeon 8362 (llvmpipe): 8.64 -&gt; 5.58 ms</p>]]></description><link>https://board.circlewithadot.net/post/https://ioc.exchange/users/azonenberg/statuses/116510329841562697</link><guid isPermaLink="true">https://board.circlewithadot.net/post/https://ioc.exchange/users/azonenberg/statuses/116510329841562697</guid><dc:creator><![CDATA[azonenberg@ioc.exchange]]></dc:creator><pubDate>Sun, 03 May 2026 11:09:06 GMT</pubDate></item><item><title><![CDATA[Reply to Working on some more libscopehal backend tweaks. on Sun, 03 May 2026 09:18:51 GMT]]></title><description><![CDATA[<p>Early benchmark results show that the new shader is also significantly faster on all tested platforms (even though the point of the refactoring was portability, not speed).</p><p>For 10M points, some selected speedups:<br />* R9700: 0.89 -&gt; 0.44 ms<br />* RTX 3070: 4.98 -&gt; 2.44 ms<br />* GTX 1630: 7.43 -&gt; 2.7 ms<br />* Apple M4: 7.69 -&gt; 1.8 ms<br />* Xeon 8362 (llvmpipe): 7.43 -&gt; 2.7 ms</p>]]></description><link>https://board.circlewithadot.net/post/https://ioc.exchange/users/azonenberg/statuses/116509896325427360</link><guid isPermaLink="true">https://board.circlewithadot.net/post/https://ioc.exchange/users/azonenberg/statuses/116509896325427360</guid><dc:creator><![CDATA[azonenberg@ioc.exchange]]></dc:creator><pubDate>Sun, 03 May 2026 09:18:51 GMT</pubDate></item><item><title><![CDATA[Reply to Working on some more libscopehal backend tweaks.
on Sun, 03 May 2026 09:12:42 GMT]]></title><description><![CDATA[<p>So what if we had a GPU implementation that did not need any extensions, and would work on *any* Vulkan 1.0-capable GPU?</p><p>After a bit of thought, I realized we could do exactly that.</p><p>The new shader, which I'm testing and benchmarking now, runs 1/4 as many threads and processes four samples per thread.</p><p>The input is fetched as a packed uint32_t[], which is available without any extensions. Then bitshifting operations are used to extract each of the four 8-bit samples, sign-extend it, and convert it to float32.</p><p>This should allow us to use this shader as the *only* implementation of this functionality... the 8-bit integer shader and AVX implementation can be deleted outright, and the unoptimized C++ version moved to a unit test as a golden reference.</p><p>And 13 of the 22 OpenMP loops in libscopehal can be replaced by this.</p>]]></description><link>https://board.circlewithadot.net/post/https://ioc.exchange/users/azonenberg/statuses/116509872104662045</link><guid isPermaLink="true">https://board.circlewithadot.net/post/https://ioc.exchange/users/azonenberg/statuses/116509872104662045</guid><dc:creator><![CDATA[azonenberg@ioc.exchange]]></dc:creator><pubDate>Sun, 03 May 2026 09:12:42 GMT</pubDate></item></channel></rss>