Thinking about decoding some of the more complicated protocols in ngscopeclient more fully on the GPU.
Let's take I2C, for example. I have a 2x 10M point capture coming off my STM32MP2 / Kintex-7 testbed that takes about 188 ms to decode on the Xeon 4310 on my lab workstation (lots of idle time with just a few packets).
The decode can accept either sparse or uniformly sampled data; right now it's getting uniform data at 100 Msps which is overkill but the ThunderScope doesn't yet let you decimate to go any slower.
So for the "sampling at many times the symbol rate" use case, the easiest GPU win might be to delta code the uniformly sampled data and store SDA and SCL separately as sparse waveforms (i.e. sample value, start time, duration).
But that also involves storing and reading back from a temporary memory buffer (which will have to be as big as the waveform, since there's no way to know in advance how many I2C events there will be).
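The delta-coding step above is essentially run-length collapsing of a digital channel. A minimal CPU sketch of the idea (names are illustrative, not the actual ngscopeclient API; a GPU version would split this across threads and need the compaction pass discussed below):

```cpp
#include <cstdint>
#include <vector>

// One run in the sparse representation: (value, start, duration),
// with times in units of the uniform sample period.
struct SparseSample {
    bool     value;
    uint64_t start;
    uint64_t duration;
};

// Collapse a uniformly sampled digital channel into runs.
std::vector<SparseSample> DeltaCode(const std::vector<bool>& samples)
{
    std::vector<SparseSample> out;
    if (samples.empty())
        return out;

    uint64_t runStart = 0;
    for (uint64_t i = 1; i < samples.size(); i++) {
        // A transition ends the current run and starts a new one
        if (samples[i] != samples[runStart]) {
            out.push_back({samples[runStart], runStart, i - runStart});
            runStart = i;
        }
    }
    // Flush the final run
    out.push_back({samples[runStart], runStart, samples.size() - runStart});
    return out;
}
```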
Which brings us to the second option: try to implement the entire decode inner loop in a shader.
-
The question is, is there a way to do this efficiently that isn't just delta coding followed by running the decode on a list of transitions?
I kinda feel like it partially depends on the level of oversampling: are you expecting like one SCL edge per 5 samples, or one per 5000?
-
@azonenberg I do feel like if we used an intermediate representation for protocol decoders (I'm partial to WebAssembly, which should map quite well to GPUs, but lots of representations will work here!) then you could write normal-looking code and still run it on the GPU.
The other advantage of this would be that you'd no longer have to write a protocol decoder individually for every application that could conceivably want to decode a protocol; decoders would become reusable software libraries rather than one-off hacks someone wrote to finish a project.
-
@azonenberg I know that STM32 uses 16x oversampling on its UART ports, and you can change this to 8x or 1x.
Maybe you find a couple of edges and then pick a few spots in the middle of the range to sample on?
You could average those points, and if the result isn't a clean 0 or 1, alert the user or make an adjustment.
Or just run a decimation filter before it?
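The mid-bit majority-vote idea could look something like this. This is a hypothetical helper, and it assumes the samples-per-bit ratio is already known, which (as noted in the reply below) is exactly the part a general-purpose decoder can't assume:

```cpp
#include <cstddef>
#include <optional>
#include <vector>

// Majority-vote three points around the center of a bit, given a known
// samples-per-bit ratio. Returns nullopt when the votes disagree, so the
// caller can warn the user or re-estimate the bit timing.
std::optional<bool> SampleBit(const std::vector<bool>& samples,
                              size_t bitStart, size_t samplesPerBit)
{
    size_t center = bitStart + samplesPerBit / 2;
    int ones = 0;
    for (int off = -1; off <= 1; off++) {
        size_t idx = center + off;
        if (idx < samples.size())
            ones += samples[idx] ? 1 : 0;
    }
    if (ones == 0)
        return false;
    if (ones == 3)
        return true;
    return std::nullopt;  // ambiguous sample: flag it rather than guess
}
```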
-
@0h00000000 The challenge is more that, as a general-purpose decoder, you do not know the SCL clock rate a priori. And I don't want to force the user to specify it.
-
@whitequark well it's more a question of paradigm than implementation.
Like, most CPU-based decoders now are implemented as state machines that iterate over the list of input samples and do something based on each value.
This is fundamentally incompatible with a data-parallel approach on the GPU, where you need to be able to decode in tens of thousands of threads simultaneously. It requires a completely different approach to designing the decode.
And sometimes you have to do multiple passes, because e.g. each thread generates a variable number of edges/packets/whatever, so you need one decoding pass followed by a gather/scatter pass to concatenate the results into a second buffer with no gaps.
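That multi-pass pattern is classic stream compaction: count per chunk, exclusive prefix sum to assign offsets, then scatter. A CPU sketch of the structure (on the GPU each chunk would be a thread or workgroup, and the scan would itself be a parallel kernel; names and callback shapes here are illustrative):

```cpp
#include <cstddef>
#include <numeric>
#include <vector>

// Pass 1 counts events per chunk, an exclusive prefix sum turns counts
// into output offsets, and pass 2 scatters events into a gap-free buffer.
template <typename Event, typename CountFn, typename EmitFn>
std::vector<Event> CompactEvents(size_t numChunks, CountFn count, EmitFn emit)
{
    if (numChunks == 0)
        return {};

    // Pass 1: how many events does each chunk produce?
    std::vector<size_t> counts(numChunks);
    for (size_t i = 0; i < numChunks; i++)
        counts[i] = count(i);

    // Exclusive prefix sum -> each chunk's starting offset in the output
    std::vector<size_t> offsets(numChunks);
    std::exclusive_scan(counts.begin(), counts.end(), offsets.begin(), size_t(0));

    // Pass 2: every chunk writes its events at its own offset,
    // with no gaps between chunks
    std::vector<Event> out(offsets.back() + counts.back());
    for (size_t i = 0; i < numChunks; i++)
        emit(i, out.data() + offsets[i]);
    return out;
}
```

The key property is that pass 2 has no write conflicts: each chunk owns a disjoint output range, so all chunks can run concurrently.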
-
@azonenberg@ioc.exchange I don't think you can do it (well?) with Vulkan, but it is possible to do GPU-side allocation. Not sure if it's worth it, though.
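One common shader-side "allocation" pattern is an atomic append counter: each thread reserves a slot with a single atomic add, the way a shader would use atomicAdd() on an SSBO counter. A CPU analogue, sketched with std::atomic (illustrative names; the backing buffer still has to be preallocated to worst-case size, and output ordering becomes nondeterministic, which is why the prefix-sum approach above may still be preferable):

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>
#include <vector>

// Append-only buffer shared by many threads: one atomic add reserves a
// unique slot, so writers never conflict, at the cost of losing any
// deterministic ordering of the output.
struct AppendBuffer {
    std::vector<uint32_t> data;
    std::atomic<uint32_t> count{0};

    explicit AppendBuffer(size_t capacity) : data(capacity) {}

    // Returns false if the preallocated buffer would overflow
    bool Append(uint32_t value)
    {
        uint32_t idx = count.fetch_add(1, std::memory_order_relaxed);
        if (idx >= data.size())
            return false;
        data[idx] = value;
        return true;
    }
};
```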