Thinking about decoding some of the more complicated protocols in ngscopeclient more fully on the GPU.

7 Posts 4 Posters 20 Views
azonenberg@ioc.exchange
#1
    Thinking about decoding some of the more complicated protocols in ngscopeclient more fully on the GPU.

    Let's take I2C, for example. I have a 2x 10M point capture coming off my STM32MP2 / Kintex-7 testbed that takes about 188 ms to decode on the Xeon 4310 on my lab workstation (lots of idle time with just a few packets).

    The decode can accept either sparse or uniformly sampled data; right now it's getting uniform data at 100 Msps which is overkill but the ThunderScope doesn't yet let you decimate to go any slower.

    So for the "sampling at many times the symbol rate" use case, the easiest GPU win might be to delta code the uniformly sampled data and store SDA/SCL separately as sparse waveforms (i.e., sample value, start time, duration).

    But that also involves storing and reading back from a temporary memory buffer (which will have to be as big as the waveform, since there's no way to know in advance how many I2C events there will be).

    Which brings us to the second option: try to implement the entire decode inner loop in a shader.
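    A serial sketch of that delta-coding step (the struct and function names here are illustrative, not ngscopeclient's actual types):

    ```cpp
    #include <cstdint>
    #include <vector>

    // One sample of a sparse digital waveform: level plus [start, start+duration)
    struct SparseSample
    {
        bool    value;
        int64_t start;      // in timebase units
        int64_t duration;
    };

    // Delta-code a uniformly sampled digital waveform into run-length form.
    // Each run of identical samples collapses to a single SparseSample.
    std::vector<SparseSample> DeltaCode(const std::vector<bool>& uniform)
    {
        std::vector<SparseSample> sparse;
        if(uniform.empty())
            return sparse;

        int64_t runStart = 0;
        for(size_t i = 1; i <= uniform.size(); i++)
        {
            // Emit a run when the level changes or we hit the end of the capture
            if( (i == uniform.size()) || (uniform[i] != uniform[runStart]) )
            {
                sparse.push_back({uniform[runStart], runStart, (int64_t)i - runStart});
                runStart = i;
            }
        }
        return sparse;
    }
    ```

    On an idle-heavy capture like the one above, each channel's 10M uniform samples should collapse to a few thousand runs, which is where the win would come from.
    
    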

azonenberg@ioc.exchange
#2
      The question is, is there a way to do this efficiently that isn't just delta coding followed by running the decode on a list of transitions?

      I kinda feel like it partially depends on the level of oversampling: are you expecting like one SCL edge per 5 samples, or one per 5000?

whitequark@social.treehouse.systems
#3
        @azonenberg I do feel like if we used an intermediate representation for protocol decoders (I'm partial to WebAssembly, which should map quite well to GPUs, but lots of representations will work here!) then you could write normal-looking code and still run it on the GPU.

        the other advantage of this would be that you no longer have to write a protocol decoder individually for every application that could conceivably want to decode a protocol, and could treat them more like reusable software libraries than one-off hacks one needs to finish a project.

0h00000000@ioc.exchange
#4
          @azonenberg I know that STM32s use 16x oversampling on their UART ports, and you can change this to 8x or 1x.

          Maybe you find a couple of edges and then pick a few spots in the middle of the range to sample on?

          You could run an average on those points and if it isn't 0 or 1 alert the user or make an adjustment.

          Or just run a decimation filter before it?
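          A toy version of that mid-bit check (a hypothetical helper, not tied to any real decoder): given two edges bounding a bit cell, sample a few points around the center and flag the bit as questionable if they disagree.

          ```cpp
          #include <cstddef>
          #include <optional>
          #include <vector>

          // Given a bit cell bounded by two edge positions, sample three points
          // spread around the center of the cell; return the bit if all three
          // agree, or nullopt to signal a marginal bit worth warning the user about.
          std::optional<bool> SampleBit(const std::vector<bool>& samples,
                                        size_t edgeStart, size_t edgeEnd)
          {
              size_t mid  = (edgeStart + edgeEnd) / 2;
              size_t step = (edgeEnd - edgeStart) / 8;   // spread around center

              bool a = samples[mid - step];
              bool b = samples[mid];
              bool c = samples[mid + step];

              if(a == b && b == c)
                  return a;
              return std::nullopt;   // disagreement: glitch or wrong bit timing
          }
          ```
          
          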

azonenberg@ioc.exchange
#5
            @0h00000000 The challenge is that, as a general-purpose decoder, you do not know the SCL clock rate a priori. And I don't want to force the user to specify it.

azonenberg@ioc.exchange
#6
              @whitequark well it's more a question of paradigm than implementation.

              Like, most CPU based decoders now are implemented as state machines that iterate over the list of input samples and do something based on the value.

              This is fundamentally incompatible with a data-parallel approach on the GPU where you need to be able to decode in tens of thousands of threads simultaneously. It requires a completely different approach to designing the decode.

              And sometimes you have to do multiple passes because, e.g., you generate a variable number of edges/packets/whatever in each thread, so you need one decoding pass followed by a gather/scatter pass to concatenate the results into a second buffer with no gaps.
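              That multi-pass pattern is essentially stream compaction: each thread counts its outputs, an exclusive prefix sum turns counts into write offsets, and a second pass scatters everything into a gap-free buffer. A serial CPU sketch of the idea (illustrative only; on the GPU the scan and scatter would each be a shader dispatch):

              ```cpp
              #include <algorithm>
              #include <cstddef>
              #include <vector>

              // Pass 1 output: each "thread" (chunk of the waveform) has produced
              // some variable number of events. Pass 2: an exclusive prefix sum of
              // the per-chunk counts gives each chunk a write offset, so all chunks
              // can scatter into a single gap-free buffer without collisions.
              template<typename T>
              std::vector<T> CompactChunks(const std::vector<std::vector<T>>& perChunkEvents)
              {
                  size_t n = perChunkEvents.size();

                  // Exclusive prefix sum of per-chunk counts (a scan shader on the GPU)
                  std::vector<size_t> offsets(n + 1, 0);
                  for(size_t i = 0; i < n; i++)
                      offsets[i+1] = offsets[i] + perChunkEvents[i].size();

                  // Scatter pass: every chunk writes at its precomputed offset
                  std::vector<T> out(offsets[n]);
                  for(size_t i = 0; i < n; i++)
                      std::copy(perChunkEvents[i].begin(), perChunkEvents[i].end(),
                                out.begin() + offsets[i]);
                  return out;
              }
              ```
              
              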

ignaloidas@not.acu.lt
#7
                @azonenberg@ioc.exchange I don't think you can do it (well?) with Vulkan, but it is possible to do GPU-side allocation. Not sure if it's worth it, though.
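                One common stand-in for true GPU-side allocation under Vulkan is an append buffer: a worst-case-sized output buffer plus an atomic counter that each thread bumps to claim a slot (atomicAdd on an SSBO member in GLSL). A CPU analogue of the pattern with std::atomic, as an illustrative sketch:

                ```cpp
                #include <atomic>
                #include <cstdint>
                #include <thread>
                #include <vector>

                struct Event { int64_t timestamp; uint8_t data; };

                // Append-buffer pattern: pre-size the output for the worst case,
                // then let each worker claim slots with an atomic fetch-add.
                // Returns the number of events actually written. Slot order is
                // nondeterministic, so sort by timestamp afterward if order matters.
                size_t ParallelAppend(std::vector<Event>& buffer, size_t eventsPerWorker)
                {
                    std::atomic<uint32_t> writeIndex{0};

                    auto worker = [&](int64_t base)
                    {
                        for(size_t i = 0; i < eventsPerWorker; i++)
                        {
                            uint32_t slot = writeIndex.fetch_add(1);   // claim a slot
                            buffer[slot] = { base + (int64_t)i, (uint8_t)i };
                        }
                    };

                    std::thread t1(worker, 0), t2(worker, 1000);
                    t1.join();
                    t2.join();
                    return writeIndex.load();
                }
                ```

                The counter read back at the end gives the real event count, which sidesteps the "temporary buffer as big as the waveform" problem for the payload, at the cost of still needing a worst-case bound on the output size.
                
                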
