ESP32-S3 - esp_async_memcpy not working with PSRAM using GDMA

NZ Gangsta · Postby **NZ Gangsta** » Mon Nov 20, 2023 4:04 am

A generic memcpy function, like implemented in the DSP lib, must be able to deal with all kinds of unaligned memory addresses as input and this necessitates some overhead which may contribute to it appearing a little slower than theoretically possible.

It's a lot slower! It does support using a zero-overhead loop to shift data when it can, so perhaps it spends too much time deciding whether to use it, or doesn't end up using it for some reason.

Code: Untitled.asm

    // Main loop arr_src aligned
    loopnez  a5, ._main_loop_aligned                // 32 bytes in one loop
        ee.vld.128.ip    q0,  a3,  16               // load 16 bytes from arr_src to q0
        ee.vld.128.ip    q1,  a3,  16               // load 16 bytes from arr_src to q1

        ee.vst.128.ip    q0,  a2,  16               // save 16 bytes to arr_dest from q0
        ee.vst.128.ip    q1,  a2,  16               // save 16 bytes to arr_dest from q1
    ._main_loop_aligned:

MicroController · Postby **MicroController** » Mon Nov 20, 2023 10:30 am

I now get 1.78 GB/s. I'm a little surprised it doesn't match the performance of your code at 1.91 GB/s as they look functionally the same.

That's just because my GB is 1000000000 bytes and yours is 1024*1024*1024. Bytes/1000 cycles are identical.
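The two figures can be reconciled with a quick unit conversion. This is only an illustrative sketch; the name `gb_to_gib` is hypothetical and does not come from the benchmark code discussed in this thread.

```cpp
// Bytes in a decimal gigabyte (GB) and in a binary gibibyte (GiB).
constexpr double kBytesPerGB  = 1000.0 * 1000.0 * 1000.0;
constexpr double kBytesPerGiB = 1024.0 * 1024.0 * 1024.0;

// Convert a throughput expressed in GB/s (10^9 bytes) to GiB/s (2^30 bytes).
// Hypothetical helper, for illustration only.
constexpr double gb_to_gib(double gb_per_s) {
    return gb_per_s * kBytesPerGB / kBytesPerGiB;
}
```

Feeding 1.91 GB/s through `gb_to_gib` gives roughly 1.78, so both benchmarks report the same number of bytes per cycle.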

MicroController · Postby **MicroController** » Mon Nov 20, 2023 3:15 pm

Funny how one's C++ experience defines what they prefer. I like the inline assembly as there is no need to understand the inner workings of the GCC compiler in order to understand what's happening. ... it also looks like it's possible to use the .S extension with the ESP-IDF and compile direct assembly code. But this is not something you even mention, so I'm guessing it's not your preferred approach.

Well, except for the single inline assembly instruction I wrap (and __attribute__((always_inline))), everything is plain standard C++20.

Using standalone assembler files (.S) is the opposite of what I want. Writing algorithms in assembler on top of manually implementing the ABI, managing register allocation, data types, optimizations &c. is unnecessarily complicated, as it makes you replicate what the compiler will do automatically for you.

Take for example

Code:

    rpt(cnt, [&src_p,&dest_p]() {
        vld_128_ip<0>(src_p); // Load 16 bytes from RAM into q0, increment src_p
        vst_128_ip<0>(dest_p); // Store 16 bytes from q0 to RAM, increment dest_p
    });

which lets the compiler do its thing so that I don't have to concern myself with how and where src_p and dest_p come from, or which registers are best to use at that location in the code; gcc can optimize that as it likes.
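To make the shape of that snippet concrete, here is a portable sketch that runs on any host compiler. It is not the actual implementation: `Q128`, `q_regs` and `copy_blocks` are hypothetical, the pointer types are assumed, `vld_128_ip` / `vst_128_ip` are emulated with `memcpy` rather than wrapping the real SIMD instructions, and `rpt` is an ordinary loop where the real one presumably maps to a zero-overhead hardware loop.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical software model of one 128-bit q register.
struct Q128 { uint8_t bytes[16]; };
static Q128 q_regs[8];  // stands in for q0..q7

// Emulated "load 16 bytes into qN, then post-increment the pointer".
template <int N>
inline void vld_128_ip(const uint8_t*& src) {
    std::memcpy(q_regs[N].bytes, src, 16);
    src += 16;
}

// Emulated "store 16 bytes from qN, then post-increment the pointer".
template <int N>
inline void vst_128_ip(uint8_t*& dest) {
    std::memcpy(dest, q_regs[N].bytes, 16);
    dest += 16;
}

// Call f() cnt times. A plain loop here; an assumption, not the real rpt.
template <typename F>
inline void rpt(uint32_t cnt, const F& f) {
    for (uint32_t i = 0; i < cnt; ++i) f();
}

// Copy cnt blocks of 16 bytes using the same lambda as in the post.
inline void copy_blocks(uint8_t* dest, const uint8_t* src, uint32_t cnt) {
    assert(dest != nullptr && src != nullptr);
    const uint8_t* src_p = src;
    uint8_t* dest_p = dest;
    rpt(cnt, [&src_p, &dest_p]() {
        vld_128_ip<0>(src_p);
        vst_128_ip<0>(dest_p);
    });
}
```

The point of the design is visible even in the emulation: the caller only names the pointers, and register allocation for them is left to the compiler.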

My current version wraps the SIMD instructions in meaningful objects (i.e. vectors) and operator overloads which map naturally to the SIMD instructions (vecA = vecB + vecC;).
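A rough idea of what such a wrapper could look like, as a host-side sketch only. `Vec8x16` is a hypothetical name, the lanes live in a plain array instead of a q register, and the addition is a scalar loop where the real library would presumably emit one SIMD instruction. Whether the hardware add wraps or saturates is not modelled here.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for a 128-bit register viewed as 8 lanes of int16_t.
struct Vec8x16 {
    std::array<int16_t, 8> lane{};

    // Lane-wise addition, so that "vecA = vecB + vecC;" reads naturally.
    friend constexpr Vec8x16 operator+(const Vec8x16& a, const Vec8x16& b) {
        Vec8x16 r{};
        for (std::size_t i = 0; i < 8; ++i) {
            r.lane[i] = static_cast<int16_t>(a.lane[i] + b.lane[i]);
        }
        return r;
    }
};
```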

NZ Gangsta · Postby **NZ Gangsta** » Mon Nov 20, 2023 9:31 pm

That's just because my GB is 1000000000 bytes and yours is 1024*1024*1024. Bytes/1000 cycles are identical.

I wondered that but hadn't done the math. Nice! Thanks for helping me figure out how to get max bandwidth from the CPU!

It would be nice if the GDMA controller had access to the IRAM with all 128 bits. I assume it doesn't because it's connected to the flash cache, which would explain its low IRAM->IRAM bandwidth. I guess if it did, it wouldn't really be a useful DMA controller, as it would starve the CPU of data much of the time.

NZ Gangsta · Postby **NZ Gangsta** » Mon Nov 20, 2023 10:22 pm

Using standalone assembler files (.S) is the opposite of what I want. Writing algorithms in assembler on top of manually implementing the ABI, managing register allocation, data types, optimizations &c. is unnecessarily complicated, as it makes you replicate what the compiler will do automatically for you.

That makes so much sense. About the only thing you lose in that approach is the syntax highlighting of the assembly code, as all the assembly is behind quotes. Everything else is a total win.

I'm still not sure why the compiler doesn't like the INT, Q0 and Q1 statements. My guess for the Q0 & Q1 registers is that you have created macros pointing to the registers inside the PIE (TIE in the Xtensa ISA). The only place I can find anything mentioning these registers is in the "esp-idf\components\xtensa\esp32s3\include\xtensa\config\tie.h" header, which makes sense, but I can't see how these are connected to Q0 & Q1.

As for the INT in your rpt function, I can only assume it is an attribute defined somewhere, but where and for what I have no idea. Please help to clarify these mysteries.

Code: Untitled.c

    template<typename F, typename...Args>
    static inline void INT rpt(const uint32_t cnt, const F& f, Args&&...args)

MicroController · Postby **MicroController** » Mon Nov 20, 2023 10:55 pm

I'm still not sure why the compiler doesn't like the INT, Q0 and Q1 statements. My guess for the Q0 & Q1 registers is that you have created macros pointing to the registers inside the PIE (TIE in the Xtensa ISA).

Q0 &c. are actually global constexpr instances of my class template QReg<N> indeed representing the q0...q7 SIMD registers, and I created and use a ton of helper types and concepts to get things from my lib cleanly usable in a type-safe way. (E.g.: multiplying a q-vector of 16 8-bit values by another q-vector containing 8 16-bit values won't work in a sensible way, and accordingly there is no SIMD instruction for this, so the types can't allow you to do so.)
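For readers wondering how register objects and type safety can fit together, here is a minimal C++17 sketch. Only the names `QReg`, `Q0` and `Q1` come from the post; `Lanes`, `as`, `mul` and `can_mul` are hypothetical, and the real library reportedly uses C++20 concepts rather than the detection idiom shown here.

```cpp
#include <cstdint>
#include <type_traits>
#include <utility>

// Tag type naming one of the eight SIMD registers q0..q7.
template <int N>
struct QReg {
    static_assert(N >= 0 && N <= 7, "only q0..q7 exist");
    static constexpr int index = N;
};

// Global constexpr instances, as described in the post.
inline constexpr QReg<0> Q0{};
inline constexpr QReg<1> Q1{};

// Hypothetical typed view: register N interpreted as lanes of type T.
template <typename T, int N>
struct Lanes {
    static constexpr int count = 16 / static_cast<int>(sizeof(T));
};

// Hypothetical helper: view a register as lanes of T, e.g. as<int16_t>(Q0).
template <typename T, int N>
constexpr Lanes<T, N> as(QReg<N>) { return {}; }

// Only declared for operands with the same lane type T.
// Returns the number of lane products (placeholder for the real operation).
template <typename T, int A, int B>
constexpr int mul(Lanes<T, A>, Lanes<T, B>) { return Lanes<T, A>::count; }

// Detection helper: true only if mul(X, Y) would compile.
template <typename X, typename Y, typename = void>
struct can_mul : std::false_type {};
template <typename X, typename Y>
struct can_mul<X, Y,
               std::void_t<decltype(mul(std::declval<X>(), std::declval<Y>()))>>
    : std::true_type {};
```

With this, `mul(as<int16_t>(Q0), as<int16_t>(Q1))` compiles, while mixing an 8-bit view with a 16-bit view is rejected at compile time because no matching overload exists.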

As for the INT in your rpt function, I can only assume it is an attribute defined somewhere, but where and for what I have no idea.

That actually is a macro:

Code:

#define INL __attribute__((always_inline))

Check out my pull request

NZ Gangsta · Postby **NZ Gangsta** » Tue Nov 21, 2023 12:07 am

Check out my pull request

Perfect! Interesting to see the 16-byte version stalling a bit more. According to your comments, this is due to the CPU's pipeline architecture.

By my math, with the CPU operating at 240 MHz moving 128 bits (16 bytes) per clock cycle it should be able to read 15 GB/s (1,000's). Reading + writing halves this to 7.5 GB/s. We are getting 1.91 GB/s, which is almost exactly 4x slower.

My guess is that the memory bus is still only capable of moving 32 bits to the PIE per clock cycle, so it is not strictly a 128-bit memory bus as stated in the ESP32-S3 datasheet ("128-bit data bus and SIMD commands"). It's still cool that the PIE can do math on these at 128 bits though.

Do you think my assumption is correct? Or is my math wrong? Or are we missing something that is causing the bandwidth to drop to 25% of the theoretical max?

MicroController · Postby **MicroController** » Tue Nov 21, 2023 12:16 am

By my math, with the CPU operating at 240 MHz moving 128 bits (16 bytes) per clock cycle it should be able to read 15 GB/s

That math ain't mathin'...

NZ Gangsta · Postby **NZ Gangsta** » Tue Nov 21, 2023 2:12 am

That math ain't mathin'...

Haha. OK let me see...

128 bits = 16 bytes
240 Million clock cycles x 16 bytes = 3,840 Million bytes per sec
Half = 1,920 Million bytes per sec.

OK, I must have been on a different planet.
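The corrected arithmetic can be written down as compile-time constants. This sketch assumes, as the thread does, one 128-bit memory access per clock cycle and one load plus one store per 16 bytes copied; the constant and function names are made up for illustration.

```cpp
#include <cstdint>

constexpr uint64_t kCpuHz          = 240'000'000;  // 240 MHz CPU clock
constexpr uint64_t kBytesPerAccess = 128 / 8;      // one 128-bit access = 16 bytes

// Upper bound if every cycle performs one 16-byte load.
constexpr uint64_t kPeakReadBytesPerSec = kCpuHz * kBytesPerAccess;

// A copy needs a load and a store per 16 bytes, so it gets half of that.
constexpr uint64_t kPeakCopyBytesPerSec = kPeakReadBytesPerSec / 2;

// Fraction of the theoretical copy peak reached by a measured throughput.
constexpr double copy_efficiency(double measured_bytes_per_sec) {
    return measured_bytes_per_sec / static_cast<double>(kPeakCopyBytesPerSec);
}
```

Under these assumptions the measured 1.91 GB/s is over 99% of the 1.92 GB/s copy ceiling, so the bus does appear to deliver the full 128 bits per cycle.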

I was going to ask where I find the documentation on the instructions used on the PIE. However, I eventually managed to find this in the ESP32-S3 Technical Reference Manual at the bottom of the PIE section.

Thanks for all the help! You solved my initial request to make the Async_memcpy work and then showed me how to move data around using the PIE. I enjoyed the journey and learnt lots more about how the ESP32 is architected and how to tame it using C++ and assembly. I look forward to seeing the library you create for the PIE.

MicroController · Postby **MicroController** » Tue Nov 21, 2023 3:17 pm

I enjoyed the journey and learnt lots more about how the ESP32 is architected and how to tame it using C++ and assembly. I look forward to seeing the library you create for the PIE.

Thanks for enticing me to explore what's possible some more, and for providing the useful benchmark of different memcpy techniques.
