Working on some more libscopehal backend tweaks.
Most scopes we work with take in int8 ADC sample data, which is then converted to float32 before further processing.
The original baseline implementation of this process was a simple loop on the CPU, which later got vectorized with AVX (with the scalar fallback kept for non-AVX platforms). This then got an OpenMP wrapper around it for converting multiple channels in parallel on different CPU cores.
The next iteration added a GPU shader, dependent on GL_EXT_shader_8bit_storage being available, to do the sample conversion.
This shader runs one GPU thread per sample, reads an int8, writes back a float32. It works, and is significantly faster than the CPU.
For example, on my Xeon 8362 + R9700 test system, converting ten million points takes:
* CPU baseline: 10.5 ms
* AVX2: 9.2 ms
* First shader: 0.89 ms
But now we have three different implementations of the same logic floating around, and lots of complex conditional logic around dispatch.
Plus I'm planning to transition away from OpenMP: it has portability and tooling complexity issues, its thread pools are annoying when debugging and need environment variables set on startup, and it's just generally a PITA.
-
So what if we had a GPU implementation that did not need any extensions, and would work on *any* Vulkan 1.0 capable GPU?
After a bit of thought, I realized we could do exactly that.
The new shader, which I'm testing and benchmarking now, runs 1/4 as many threads and processes four samples per thread.
The input is fetched as a packed uint32_t[], which is available without any extensions. Then bitshifting operations are used to extract each of the 8-bit samples, sign extend, and convert to float32.
This should allow us to use this shader as the *only* implementation of this functionality... the 8-bit integer shader and AVX implementation can be deleted outright, and the unoptimized C++ version moved to a unit test as a golden reference.
And 13 of the 22 OpenMP loops in libscopehal can be replaced by this.
-
Early benchmark results show that the new shader is also significantly faster on all tested platforms (even though the point of the refactoring was portability, not speed).
For 10M points, some selected speedups:
* R9700: 0.89 -> 0.44 ms
* RTX 3070: 4.98 -> 2.44 ms
* GTX 1630: 7.43 -> 2.7 ms
* Apple M4: 7.69 -> 1.8 ms
* Xeon 8362 (llvmpipe): 7.43 -> 2.7 ms
-
Added a 16-bit version of the same conversion block for use with >8 bit ADCs.
Apple and llvmpipe showed huge improvements; the AMD card saw a slight gain, and the NVIDIA cards almost no change.
But again, the main goal is portability, so "it runs at least as fast and doesn't depend on extensions" is a major win. I've already bolted this into the ThunderScope driver and will try to do the others soon.
* R9700: 1.58 -> 1.26 ms
* RTX 3070: 7.41 -> 7.36 ms
* GTX 1630: 9.50 -> 9.45 ms
* Apple M4: 12.88 -> 1.93 ms
* Xeon 8362 (llvmpipe): 8.64 -> 5.58 ms