Working on some more libscopehal backend tweaks.

Uncategorized · 4 Posts · 1 Poster
azonenberg@ioc.exchange
#1

    Working on some more libscopehal backend tweaks.

    Most scopes we work with take in int8 ADC sample data, which is then converted to float32 before further processing.

    The original baseline implementation of this process was a simple loop on the CPU, later vectorized with AVX (with the scalar version kept as a fallback for non-AVX platforms). This then got an OpenMP wrapper around it to convert multiple channels in parallel on different CPU cores.
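    For reference, the scalar baseline is essentially this (a minimal sketch; the scale/offset parameters are hypothetical stand-ins for whatever per-channel calibration the real driver applies):

    ```cpp
    #include <cstdint>
    #include <vector>

    // Minimal sketch of the scalar baseline: convert int8 ADC codes to float32.
    // scale/offset are placeholders for the driver's per-channel calibration.
    std::vector<float> ConvertSamples(const std::vector<int8_t>& raw, float scale, float offset)
    {
        std::vector<float> out(raw.size());
        for(size_t i = 0; i < raw.size(); i++)
            out[i] = raw[i] * scale + offset;
        return out;
    }
    ```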

    The next iteration was to add a GPU shader, dependent on GL_EXT_shader_8bit_storage being available, to do sample conversion.

    This shader runs one GPU thread per sample, reads an int8, writes back a float32. It works, and is significantly faster than the CPU.

    For example, on my Xeon 8362 + R9700 test system, converting ten million points takes:
    * CPU baseline: 10.5 ms
    * AVX2: 9.2 ms
    * First shader: 0.89 ms

    But now we have three different implementations of the same logic floating around, and lots of complex conditional logic around dispatch.

    Plus I'm planning to transition away from OpenMP due to portability and tooling complexity issues: the thread pools are annoying when debugging, need environment variables set on startup, and are generally a PITA.


      azonenberg@ioc.exchange
      #2

      So what if we had a GPU implementation that did not need any extensions, and would work on *any* Vulkan 1.0 capable GPU?

      After a bit of thought, I realized we could do exactly that.

      The new shader, which I'm testing and benchmarking now, runs 1/4 as many threads and processes four samples per thread.

      The input is fetched as a packed uint32_t[], which is available without any extensions. Bitshifting operations then extract each 8-bit sample, sign-extend it, and convert it to float32.
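      The per-thread logic can be sketched in C++ like so (a sketch of the technique, not the actual shader; lane order assumes little-endian packing):

      ```cpp
      #include <cstdint>

      // One packed uint32 word holds four int8 samples (little-endian lane
      // order). Shift the lane down, truncate to int8_t to sign extend,
      // then convert to float.
      void UnpackFourSamples(uint32_t word, float out[4])
      {
          for(int lane = 0; lane < 4; lane++)
          {
              int8_t s = static_cast<int8_t>((word >> (8*lane)) & 0xff);
              out[lane] = static_cast<float>(s);
          }
      }
      ```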

      This should allow us to use this shader as the *only* implementation of this functionality... the 8-bit integer shader and AVX implementation can be deleted outright, and the unoptimized C++ version moved to a unit test as a golden reference.
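      That golden-reference check could look something like this (a sketch with hypothetical function names; assumes a little-endian host and, for brevity, a sample count that is a multiple of 4):

      ```cpp
      #include <cstdint>
      #include <cstring>
      #include <vector>

      // Unoptimized scalar loop, kept as the golden reference.
      std::vector<float> ScalarConvert(const std::vector<int8_t>& raw)
      {
          std::vector<float> out(raw.size());
          for(size_t i = 0; i < raw.size(); i++)
              out[i] = static_cast<float>(raw[i]);
          return out;
      }

      // Packed-word path, mirroring what the shader does: reinterpret the
      // sample buffer as 32-bit words and unpack four lanes per word.
      std::vector<float> PackedConvert(const std::vector<int8_t>& raw)
      {
          std::vector<float> out(raw.size());
          for(size_t i = 0; i < raw.size(); i += 4)
          {
              uint32_t word;
              std::memcpy(&word, &raw[i], sizeof(word));
              for(int lane = 0; lane < 4; lane++)
                  out[i + lane] = static_cast<float>(
                      static_cast<int8_t>((word >> (8*lane)) & 0xff));
          }
          return out;
      }
      ```

      The unit test then just asserts the two paths produce identical output over a buffer covering the full int8 range.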

      And 13 of the 22 OpenMP loops in libscopehal can be replaced by this.


        azonenberg@ioc.exchange
        #3

        Early benchmark results show that the new shader is also significantly faster on every platform tested, even though portability, not speed, was the point of the refactoring.

        For 10M points, some selected speedups:
        * R9700: 0.89 -> 0.44 ms
        * RTX 3070: 4.98 -> 2.44 ms
        * GTX 1630: 7.43 -> 2.7 ms
        * Apple M4: 7.69 -> 1.8 ms
        * Xeon 8362 (llvmpipe): 7.43 -> 2.7 ms


          azonenberg@ioc.exchange
          #4

          Added a 16-bit version of the same conversion block for use with >8-bit ADCs.
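          The 16-bit unpack follows the same pattern, two samples per 32-bit word (again a sketch, with lane order assuming little-endian packing):

          ```cpp
          #include <cstdint>

          // One packed uint32 word holds two int16 samples (little-endian
          // lane order). Truncating to int16_t sign extends.
          void UnpackTwoSamples16(uint32_t word, float out[2])
          {
              for(int lane = 0; lane < 2; lane++)
              {
                  int16_t s = static_cast<int16_t>((word >> (16*lane)) & 0xffff);
                  out[lane] = static_cast<float>(s);
              }
          }
          ```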

          Apple and llvmpipe showed huge improvements, AMD a slight one, and NVIDIA almost no change.

          But again, the main goal is portability, so "runs at least as fast and doesn't depend on extensions" is a major win. I've already bolted this into the ThunderScope driver and will try to do the others soon.

          * R9700: 1.58 -> 1.26 ms
          * RTX 3070: 7.41 -> 7.36 ms
          * GTX 1630: 9.50 -> 9.45 ms
          * Apple M4: 12.88 -> 1.93 ms
          * Xeon 8362 (llvmpipe): 8.64 -> 5.58 ms
