<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title><![CDATA[Working on some more libscopehal backend tweaks.]]></title><description><![CDATA[<p>Working on some more libscopehal backend tweaks.</p><p>Most scopes we work with take in int8 ADC sample data, which is then converted to float32 before further processing.</p><p>The original baseline implementation of this process was a simple loop on the CPU, which later got vectorized with AVX (keeping the fallback version for non-AVX platforms). This then got an OpenMP wrapper around it for converting multiple channels in parallel on different CPU cores.</p><p>The next iteration was to add a GPU shader, dependent on GL_EXT_shader_8bit_storage being available, to do sample conversion.</p><p>This shader runs one GPU thread per sample, reads an int8, writes back a float32.
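</p><p>As a rough C++ sketch (my own illustration, not the actual libscopehal code; "gain" and "offset" are hypothetical names), the baseline conversion is just a scale-and-offset on each int8:</p>

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical sketch of the baseline per-sample conversion:
// each int8 ADC code becomes a float32 via a per-channel gain
// and offset. Names are illustrative, not libscopehal's API.
std::vector<float> ConvertSamples(const int8_t* in, size_t n,
                                  float gain, float offset)
{
    std::vector<float> out(n);
    for(size_t i = 0; i < n; i++)
        out[i] = in[i] * gain + offset;
    return out;
}
```

<p>The AVX and shader paths compute the same thing, just many samples at a time.</p><p>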
It works, and is significantly faster than the CPU.</p><p>For example, on my Xeon 8362 + R9700 test system, converting ten million points takes:<br />* CPU baseline: 10.5 ms<br />* AVX2: 9.2 ms<br />* First shader: 0.89 ms</p><p>But now we have three different implementations of the same logic floating around, and lots of complex conditional logic around dispatch.</p><p>Plus, I'm planning to transition away from OpenMP: it has portability and tooling complexity issues, and its thread pools are annoying when debugging and need environment variables set on startup. It's just generally a PITA.</p>]]></description><link>https://board.circlewithadot.net/topic/2d996ce2-a19a-4fdc-8d21-87362b6e3e34/working-on-some-more-libscopehal-backend-tweaks.</link><generator>RSS for Node</generator><lastBuildDate>Fri, 15 May 2026 06:58:46 GMT</lastBuildDate><atom:link href="https://board.circlewithadot.net/topic/2d996ce2-a19a-4fdc-8d21-87362b6e3e34.rss" rel="self" type="application/rss+xml"/><pubDate>Sun, 03 May 2026 09:07:48 GMT</pubDate><ttl>60</ttl><item><title><![CDATA[Reply to Working on some more libscopehal backend tweaks. on Sun, 03 May 2026 11:09:06 GMT]]></title><description><![CDATA[<p>Added a 16-bit version of the same conversion block for use with &gt;8 bit ADCs.</p><p>Apple and llvmpipe showed huge improvements, the AMD a slight one, and nvidia almost no change.</p><p>But again, the main goal is portability, so "it runs at least as fast and doesn't depend on extensions" is a major win.
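</p><p>Assuming the 16-bit block mirrors the packed-uint32 trick of the extension-free 8-bit shader (fetch a uint32, shift, sign-extend), a hedged C++ sketch of both unpacks, purely my own illustration rather than the shader source:</p>

```cpp
#include <cstdint>

// Hedged sketch of the extension-free unpack: each GPU thread fetches
// a packed uint32 and extracts sign-extended samples with shifts.
// Shown in C++ rather than GLSL; not the actual shader code.

// Extract sample i (0..3) from a word holding four little-endian int8s:
// shift the target byte to the top, then arithmetic-shift back down so
// the sign bit is replicated.
float UnpackInt8(uint32_t word, int i)
{
    return (float)((int32_t)(word << (24 - 8*i)) >> 24);
}

// Same idea for two little-endian int16s per word.
float UnpackInt16(uint32_t word, int i)
{
    return (float)((int32_t)(word << (16 - 16*i)) >> 16);
}
```

<p>In GLSL the equivalent reads from a plain uint[] storage buffer, which needs no 8/16-bit storage extensions.</p><p>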
Bolted this into the ThunderScope driver already and will try to do others soon.</p><p>* R9700: 1.58 -&gt; 1.26 ms<br />* RTX 3070: 7.41 -&gt; 7.36 ms<br />* GTX 1630: 9.50 -&gt; 9.45 ms<br />* Apple M4: 12.88 -&gt; 1.93 ms<br />* Xeon 8362 (llvmpipe): 8.64 -&gt; 5.58 ms</p>]]></description><link>https://board.circlewithadot.net/post/https://ioc.exchange/users/azonenberg/statuses/116510329841562697</link><guid isPermaLink="true">https://board.circlewithadot.net/post/https://ioc.exchange/users/azonenberg/statuses/116510329841562697</guid><dc:creator><![CDATA[azonenberg@ioc.exchange]]></dc:creator><pubDate>Sun, 03 May 2026 11:09:06 GMT</pubDate></item><item><title><![CDATA[Reply to Working on some more libscopehal backend tweaks. on Sun, 03 May 2026 09:18:51 GMT]]></title><description><![CDATA[<p>Early benchmark results show that the new shader is also significantly faster on all tested platforms (even though the point of the refactoring was portability, not speed).</p><p>For 10M points, some selected speedups:<br />* R9700: 0.89 -&gt; 0.44 ms<br />* RTX 3070: 4.98 -&gt; 2.44 ms<br />* GTX 1630: 7.43 -&gt; 2.7 ms<br />* Apple M4: 7.69 -&gt; 1.8 ms<br />* Xeon 8362 (llvmpipe): 7.43 -&gt; 2.7 ms</p>]]></description><link>https://board.circlewithadot.net/post/https://ioc.exchange/users/azonenberg/statuses/116509896325427360</link><guid isPermaLink="true">https://board.circlewithadot.net/post/https://ioc.exchange/users/azonenberg/statuses/116509896325427360</guid><dc:creator><![CDATA[azonenberg@ioc.exchange]]></dc:creator><pubDate>Sun, 03 May 2026 09:18:51 GMT</pubDate></item><item><title><![CDATA[Reply to Working on some more libscopehal backend tweaks.
on Sun, 03 May 2026 09:12:42 GMT]]></title><description><![CDATA[<p>So what if we had a GPU implementation that did not need any extensions, and would work on *any* Vulkan 1.0-capable GPU?</p><p>After a bit of thought, I realized we could do exactly that.</p><p>The new shader, which I'm testing and benchmarking now, runs 1/4 as many threads and processes four samples per thread.</p><p>The input is fetched as a packed uint32_t[], which is available without any extensions. Then bitshifting operations are used to extract each of the four 8-bit samples, sign-extend it, and convert it to float32.</p><p>This should allow us to use this shader as the *only* implementation of this functionality... the 8-bit integer shader and AVX implementation can be deleted outright, and the unoptimized C++ version moved to a unit test as a golden reference.</p><p>And 13 of the 22 OpenMP loops in libscopehal can be replaced by this.</p>]]></description><link>https://board.circlewithadot.net/post/https://ioc.exchange/users/azonenberg/statuses/116509872104662045</link><guid isPermaLink="true">https://board.circlewithadot.net/post/https://ioc.exchange/users/azonenberg/statuses/116509872104662045</guid><dc:creator><![CDATA[azonenberg@ioc.exchange]]></dc:creator><pubDate>Sun, 03 May 2026 09:12:42 GMT</pubDate></item></channel></rss>