ESP32-S3 - esp_async_memcpy not working with PSRAM using GDMA

NZ Gangsta · Postby **NZ Gangsta** » Thu Nov 16, 2023 4:49 am

BTW:
To get the code to compile against the master branch on my PC, I had to obtain the PSRAM cache line size with this version of the call:

Code: Untitled.c Select all


const uint32_t PSRAM_ALIGN = cache_hal_get_cache_line_size(CACHE_LL_LEVEL_EXT_MEM, CACHE_TYPE_DATA);

On my ESP32-S3-WROOM-1U-N8R8, with the main clock set to 240 MHz and the PSRAM set to octal mode at 80 MHz:

100,000 bytes IRAM->IRAM took 403,301 CPU cycles = 56.8 MB/s
100,000 bytes IRAM->EXT took 459,041 CPU cycles = 49.8 MB/s
100,000 bytes EXT->IRAM took 441,796 CPU cycles = 51.8 MB/s
1,000,000 bytes EXT->EXT took 8,536,922 CPU cycles = 26.8 MB/s
4,000,000 bytes EXT->EXT took 34,073,036 CPU cycles = 26.8 MB/s
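
The MB/s figures above appear to follow from the cycle counts as bytes / (cycles / CPU clock), expressed in units of 2^20 bytes per second. That reading is inferred from the numbers and is not stated in the post. A minimal sketch of the conversion, with `throughput_mib_per_s` as a hypothetical helper name and the 240 MHz clock taken from the setup described above:

```cpp
// Hypothetical helper: converts a byte count and a CPU cycle count into
// throughput in MiB/s (1 MiB = 1,048,576 bytes) at the given core clock.
// Assumes cycles > 0.
constexpr double throughput_mib_per_s(double bytes, double cycles,
                                      double cpu_hz = 240e6) {
    return bytes / (cycles / cpu_hz) / (1024.0 * 1024.0);
}
```

For example, `throughput_mib_per_s(100000, 403301)` comes out at roughly 56.75, in line with the 56.8 MB/s listed for IRAM->IRAM.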

To get the DMA to work with the bigger buffers I had to adjust the backlog calculation to be...

Code: Untitled.c Select all


uint32_t backlog = (size + 4091) / 4092;
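
The expression above is a ceiling division: it counts how many chunks of at most 4092 bytes are needed to cover `size` bytes. The 4092 presumably reflects the largest 4-byte-aligned payload a single DMA descriptor can carry; that is an assumption, since the post does not say where the constant comes from. A small sketch with hypothetical names:

```cpp
#include <cstdint>

// Assumed per-chunk limit, taken from the constant used in the post.
constexpr std::uint32_t kMaxBytesPerChunk = 4092;

// Number of chunks of at most kMaxBytesPerChunk bytes needed for `size`.
// Written without the `size + 4091` addition so that it cannot wrap around
// for sizes close to UINT32_MAX.
constexpr std::uint32_t chunks_needed(std::uint32_t size) {
    return size / kMaxBytesPerChunk +
           (size % kMaxBytesPerChunk != 0 ? 1u : 0u);
}

static_assert(chunks_needed(4092) == 1, "exactly one full chunk");
static_assert(chunks_needed(4093) == 2, "one byte spills into a second chunk");
```

For the buffer sizes in the measurements above, this gives the same results as `(size + 4091) / 4092`.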

ok-home · Postby **ok-home** » Thu Nov 16, 2023 6:27 am

Hi,

NZ Gangsta wrote:
uint32_t backlog = (size + 4091) / 4092;

On ESP-IDF >= 5.2, the field is declared as:

uint32_t backlog; /*!< Maximum number of transactions that can be prepared in the background */

This is the size of the queue into which you can post transfer requests without waiting for the previous data transfer to finish. It has nothing to do with the size of the DMA buffer. The default value is 8.

MicroController · Postby **MicroController** » Thu Nov 16, 2023 1:51 pm

ok-home wrote:
on ESP IDF >=5.2
uint32_t backlog; /*!< Maximum number of transactions that can be prepared in the background */
this is the size of the queue where you can send transfer requests without waiting for the end of data transfer. This has nothing to do with the size of the dma buffer. Default value is 8.

True. Apparently, v5.2 dynamically allocates the number of required descriptors for every transfer. This is more convenient compared to v5.1, but I'd rather not be wasting CPU time repeatedly allocating and deallocating descriptors from the heap.

MicroController · Postby **MicroController** » Thu Nov 16, 2023 2:17 pm

NZ Gangsta wrote:
100,000 bytes IRAM->IRAM took 403,301 CPU cycles = 56.8 MB/s

Wow, that's slow. In comparison, using the CPU's 128-bit memory bus we should get 240MHz*16/2 ~ 1.9GB/s IRAM->IRAM via the CPU. Seems like DMA really does not make much sense for IRAM->IRAM.

Edit:
With minor tweaking I actually get 7972 IRAM bytes copied per 1000 CPU cycles via the CPU, i.e. an actual 1.91 GB/s. On an MCU which costs less than €2. Amazing!

NZ Gangsta · Postby **NZ Gangsta** » Fri Nov 17, 2023 12:24 am

MicroController wrote:
With minor tweaking I actually get 7972 IRAM bytes copied per 1000 CPU cycles via the CPU, i.e. actual 1.91GB/s. On an MCU which costs less than €2 Amazing!

WOW, that's impressive and worth knowing. I didn't realize the ESP32 had a 128-bit memory bus. Too cool. I just wish it had more internal RAM. Things would be awesome if it had 1 MB or more of IRAM.

How are you achieving this? Using memcpy on 100,000 bytes of IRAM, I see 63,357 clock cycles to complete = 361.2 MB/s. Not bad, but not 1.91 GB/s. My guess is you're using the ESP32-S3's PIE SIMD extensions to read (qu) and write (qv) 128 bits of data to/from the IRAM in a loop.

MicroController · Postby **MicroController** » Fri Nov 17, 2023 8:18 am

NZ Gangsta wrote:
How are you achieving this? ... My guess is your using the ESP32-S3's PIE SIMD extensions to read (qu) and write (qv) 128bits of data to/from the IRAM in a loop.

Yes, exactly. The EE.VLD.128.IP/EE.VST.128.IP instructions load/store 16 bytes from/to RAM and increment the pointer in 1 CPU cycle each. Wrapped in Xtensa's "zero-overhead loop" things start really taking off.

Through the magic of C++, my code looks like this:

Code: Select all

const uint8_t* src = arr1;
uint8_t* dest = arr2;
rpt( sizeof(arr1)/32, [&src,&dest]() {
    src >> Q0; // a.k.a. simd::ops::vld_128_ip<0,16>(src);
    src >> Q1;
    dest << Q0; // a.k.a. simd::ops::vst_128_ip<0,16>(dest);
    dest << Q1;
    }
);

where the loop compiles down to 1+4 assembler instructions.

(Alternating between two Q registers made the CPU pipeline happy and sped things up from ~5.3 bytes per cycle to ~8.)

Btw, just the other day, and after already having rolled my own version, I unearthed the dsps_memcpy_aes3(...) function in the ESP-DSP library, which is not (yet?) mentioned in the documentation.

NZ Gangsta · Postby **NZ Gangsta** » Sat Nov 18, 2023 4:20 am

Very nice, however I can't get the code to compile as it doesn't recognize the RPT (Cadence Tensilica Extensions - repeat instruction I think?), or Q0 and Q1 (PIE registers). What header do I need to add for the C++ compiler to recognize these?

Also where is the documentation explaining how to use the PIE in C++? You appear to be privy to some knowledge I cannot find on the open internet.

I was able to get the dsps_memcpy_aes3 function to work after installing the esp-dsp component. It transferred memory IRAM->IRAM at 1.41 GBps. Fast, but still not 1.91 GBps. Looking at its implementation, it's written in assembly but does not appear to use the zero-overhead loop (rpt) instruction. I guess that explains the 25% performance hit.

Code: Untitled.c Select all

    

#include "dsps_mem.h"

// ...

void *ret = dsps_memcpy_aes3(dest, source, size);
if (!ret) {
    ESP_LOGE(TAG, "Failed to execute dsps_memcpy_aes3.");
    return ESP_FAIL;
}

MicroController · Postby **MicroController** » Sat Nov 18, 2023 4:02 pm

Sorry for causing confusion. I'm actually currently developing a C++ library for the ESP32S3 SIMD, and rpt(...) is just a simple C++ function template I made to generate ZOLs :

Code: Select all

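#include <cstdint>
#include <utility>

// Assumption: INL is not defined in the post; it is presumably a
// force-inline attribute macro along these lines.
#define INL __attribute__((always_inline))

/**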
 * @brief Uses Xtensa's zero-overhead loop to execute a given operation a number of times.
 * This function does \e not save/restore the LOOP registers, so if required these need to be 
 * saved&restored explicitly around the function call.
 * 
 * @tparam F Type of the functor to execute
 * @tparam Args Argument types of the functor
 * @param cnt Number of iterations
 * @param f The functor to invoke
 * @param args Arguments to pass to the functor
 */
template<typename F, typename...Args>
static inline void INL rpt(const uint32_t cnt, const F& f, Args&&...args) {

    bgn:
        asm goto (
            "LOOPNEZ %[cnt], %l[end]"
            : /* no output*/
            : [cnt] "r" (cnt)
            : /* no clobbers */
            : end
        );

            f(std::forward<Args>(args)...);

 
    end:
        /* Tell the compiler that the above code might execute more than once.
           The begin label must be before the initial LOOP asm because otherwise
           gcc may decide to put one-off setup code between the LOOP asm and the
           begin label, i.e. inside the loop.
        */
        asm goto ("":::: bgn);    
        ;
}

You'll find the LOOPGTZ, LOOPNEZ, or LOOP instructions being used in the ESP-DSP's assembler code too.
While GCC knows how to use these instructions, it is pretty 'conservative' about generating them, and will not use them if the loop's body contains any inline assembler instructions, which requires me to generate the loops via inline assembly too if I want the loop to contain an inline assembly SIMD instruction.
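
For readers who want to try the call shape of `rpt(...)` without an Xtensa toolchain, the `LOOPNEZ` inline assembly can be swapped for an ordinary `for` loop. The sketch below is a portable stand-in, named `rpt_portable` to make clear it is not the author's function and does not produce a zero-overhead loop. The lambda plays the role of the functor `f`, and any extra arguments are passed to it on every iteration. `copy_blocks` is a hypothetical example that mirrors the 32-bytes-per-iteration pattern from the earlier post, with `std::memcpy` standing in for the Q registers.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Portable stand-in for rpt(): same call shape, but a plain loop instead of
// Xtensa's LOOPNEZ, so it builds with any C++11 compiler.
template <typename F, typename... Args>
inline void rpt_portable(std::uint32_t cnt, const F& f, Args&&... args) {
    for (std::uint32_t i = 0; i < cnt; ++i) {
        f(args...);  // args are passed as lvalues so they can be reused each time
    }
}

// Copies `bytes` in 32-byte steps. Like the snippet it mirrors, it ignores
// any remainder that does not fill a whole 32-byte block.
inline void copy_blocks(std::uint8_t* dest, const std::uint8_t* src,
                        std::size_t bytes) {
    assert(dest != nullptr && src != nullptr);
    rpt_portable(static_cast<std::uint32_t>(bytes / 32), [&src, &dest]() {
        std::memcpy(dest, src, 32);  // stand-in for two 128-bit loads and stores
        src += 32;
        dest += 32;
    });
}
```

The lambda captures `src` and `dest` by reference, so the pointer increments persist from one iteration to the next, which is what the `[&src,&dest]` capture in the original snippet relies on as well.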

I plan to open source my SIMD library once it reaches a somewhat stable state.

NZ Gangsta wrote:
It transferred memory IRAM->IRAM at 1.41 GBps

The SIMD instructions operate only on 16-byte aligned memory, and so did my test code to copy data at the hardware's speed limit. A generic memcpy function, like implemented in the DSP lib, must be able to deal with all kinds of unaligned memory addresses as input and this necessitates some overhead which may contribute to it appearing a little slower than theoretically possible.
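
To illustrate the kind of bookkeeping a general-purpose copy has to do, the sketch below plans how a copy could be split into an unaligned head, a 16-byte-aligned body, and a tail. It is a host-side illustration with hypothetical names and makes no claim about how `dsps_memcpy_aes3` is actually implemented. It only handles the case where source and destination share the same offset within a 16-byte block, and falls back to a byte-wise copy otherwise.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kSimdAlign = 16;  // width of one 128-bit load/store

// Bytes to copy before `p` reaches the next 16-byte boundary (0 if aligned).
inline std::size_t bytes_to_alignment(const void* p) {
    const auto addr = reinterpret_cast<std::uintptr_t>(p);
    return (kSimdAlign - addr % kSimdAlign) % kSimdAlign;
}

struct CopyPlan {
    std::size_t head;  // bytes copied individually until both pointers align
    std::size_t body;  // multiple of 16, eligible for 128-bit transfers
    std::size_t tail;  // leftover bytes after the last full 16-byte block
};

// Plans a copy of n bytes. The body is non-zero only if src and dest reach
// a 16-byte boundary after the same number of head bytes.
inline CopyPlan plan_copy(const void* dest, const void* src, std::size_t n) {
    const std::size_t head = bytes_to_alignment(dest);
    CopyPlan plan{n, 0, 0};  // default: copy everything byte by byte
    if (head == bytes_to_alignment(src) && n >= head) {
        const std::size_t rest = n - head;
        plan = CopyPlan{head, rest - rest % kSimdAlign, rest % kSimdAlign};
    }
    assert(plan.head + plan.body + plan.tail == n);
    return plan;
}
```

A test that copies between two 16-byte-aligned buffers whose size is a multiple of 16 ends up with an empty head and tail, so none of this work is needed.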

MicroController · Postby **MicroController** » Sat Nov 18, 2023 4:32 pm

NZ Gangsta wrote:
explaining how to use the PIE in C++?

For me it boils down to a) inline assembler and b) ways to 'smoothly' integrate the use of inline assembly into C++ code.

C++ templates make a good basis to map assembler instructions to C++ functions:

Code: Select all

    /*
        q<R> = *(src & ~0xf);
        src += INC;
    */
    template<qreg_id R, int16_t INC = 16, typename S>
    requires (valid_qr<R> && valid_inc16<INC>)
    static inline void INL vld_128_ip(S*& src) {
        asm volatile (
            "EE.VLD.128.IP q%[reg], %[src], %[inc]" "\n"
            : [src] "+r" (src)
            : [reg] "i" (R.r),
              [inc] "i" (INC),
              "m" (cmem128(src))
        );
    }

The alternative would be 'actual' inline assembly, like

Code: Select all

const uint32_t bytes_per_iteration = 32;
const uint32_t cnt = sizeof(arr1)/bytes_per_iteration;
const uint8_t* src = arr1;
uint8_t* dest = arr2;
asm volatile (
    "LOOPNEZ %[cnt], end_loop%=" "\n"
    "   EE.VLD.128.IP q0, %[src], 16" "\n"
    "   EE.VLD.128.IP q1, %[src], 16" "\n"

    "   EE.VST.128.IP q0, %[dest], 16" "\n"
    "   EE.VST.128.IP q1, %[dest], 16" "\n"
    "end_loop%=:"
    : [src] "+&r" (src), [dest] "+&r" (dest)
    : [cnt] "r" (cnt)
    : "memory"
);

but I think that is too unwieldy for regular use.

NZ Gangsta · Postby **NZ Gangsta** » Mon Nov 20, 2023 3:53 am

MicroController wrote:
For me it boils down to a) inline assembler and b) ways to 'smoothly' integrate the use of inline assembly into C++ code.

Funny how one's C++ experience defines what they prefer. I like the inline assembly, as there is no need to understand the inner workings of the GCC compiler in order to understand what's happening. If you understand the 'asm' keyword, then it's obviously assembly, and all the parameters etc. are self-explanatory. I've looked at your templated version and I still don't fully understand how it works.

To round things out, it also looks like it's possible to use the .S extension with the ESP-IDF and compile assembly code directly. But this is not something you even mention, so I'm guessing it's not your preferred approach.

I wasn't able to get the template to compile, as the compiler doesn't know the INL in the function definition. It also complains about Q0 and Q1. I was, however, able to get the inline assembly to compile, and I now get 1.78 GB/s. I'm a little surprised it doesn't match the 1.91 GB/s of your code, as the two look functionally the same. Perhaps another config setting is slowing things down.
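
Before hunting for a config difference, it may be worth checking the units. The 1.91 GB/s figure was derived as 7972 bytes per 1000 cycles at 240 MHz, which is a decimal GB/s. The 1.78 figure is consistent with a 102,400-byte buffer copied in 13,145 cycles and reported in units of 2^20 bytes per second. Both the buffer size and the binary unit are inferred from the posted numbers and are not confirmed by either author. Comparing bytes per cycle sidesteps the question (helper names are hypothetical):

```cpp
// Bytes moved per CPU cycle; independent of MB vs MiB conventions.
constexpr double bytes_per_cycle(double bytes, double cycles) {
    return bytes / cycles;
}

// Decimal GB/s (1 GB = 1e9 bytes) at a given core clock.
constexpr double decimal_gb_per_s(double bytes, double cycles,
                                  double cpu_hz = 240e6) {
    return bytes / cycles * cpu_hz / 1e9;
}

// Figures taken from the thread. The 102,400-byte size (100 KiB) is an
// assumption chosen because it reproduces the logged throughput value.
constexpr double kTemplatedCopy = bytes_per_cycle(7972, 1000);     // 7.972
constexpr double kInlineAsmCopy = bytes_per_cycle(102400, 13145);  // about 7.79
```

Under those assumptions the two results are about 2% apart in bytes per cycle, with the inline assembly version at roughly 1.87 decimal GB/s.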

I have created a GitHub repository (https://github.com/project-x51/esp32-s3 ... /tree/main) and put the code to date there. It includes the templated copy version, commented out so that things compile. If you have the time, feel free to fix this in the repository, or perhaps just leave a link to your library once it's going.

If you feel so inclined, I would love to know how the templated code works. I understand the basics of C++ generics, as they work similarly to C#. I can see the args are variadic arguments, like those used in the printf function. But I don't understand the && (rvalue reference) or the variable f (functor), which I think is similar to a delegate in C#. Most importantly, I'm curious how this all ties together with the C++ code passed in (to the functor, I think), and why you prefer this to straight-up inline assembly.

Below is the performance data for the various ways of copying memory I have tried so far. I'm surprised the GDMA controller is not faster at copying IRAM->IRAM as well.

Code: Untitled.txt Select all


I (391) Memory Copy:

memory copy version 1.

I (401) Memory Copy: Allocating 2 x 100kb in IRAM, alignment: 32 bytes
I (421) Memory Copy: 8-bit for loop copy IRAM->IRAM took 819662 CPU cycles = 28.59 MB/s
I (421) Memory Copy: 16-bit for loop copy IRAM->IRAM took 461530 CPU cycles = 50.78 MB/s
I (431) Memory Copy: 32-bit for loop copy IRAM->IRAM took 206343 CPU cycles = 113.59 MB/s
I (441) Memory Copy: 64-bit for loop copy IRAM->IRAM took 128300 CPU cycles = 182.68 MB/s
I (451) Memory Copy: memcpy IRAM->IRAM took 64658 CPU cycles = 362.48 MB/s
I (461) Memory Copy: async_memcpy IRAM->IRAM took 412233 CPU cycles = 56.85 MB/s
I (461) Memory Copy: PIE 128-bit IRAM->IRAM took 13145 CPU cycles = 1783.00 MB/s
I (471) Memory Copy: DSP AES3 IRAM->IRAM took 17028 CPU cycles = 1376.41 MB/s

I (471) Memory Copy: Freeing 100kb from IRAM
I (481) Memory Copy: Allocating 100kb in PSRAM, alignment: 32 bytes
I (501) Memory Copy: 8-bit for loop copy IRAM->PSRAM took 1075776 CPU cycles = 21.79 MB/s
I (501) Memory Copy: 16-bit for loop copy IRAM->PSRAM took 720958 CPU cycles = 32.51 MB/s
I (511) Memory Copy: 32-bit for loop copy IRAM->PSRAM took 462534 CPU cycles = 50.67 MB/s
I (521) Memory Copy: 64-bit for loop copy IRAM->PSRAM took 422886 CPU cycles = 55.42 MB/s
I (531) Memory Copy: memcpy IRAM->PSRAM took 413299 CPU cycles = 56.71 MB/s
I (541) Memory Copy: async_memcpy IRAM->PSRAM took 478345 CPU cycles = 49.00 MB/s
I (541) Memory Copy: PIE 128-bit IRAM->PSRAM took 403536 CPU cycles = 58.08 MB/s
I (551) Memory Copy: DSP AES3 IRAM->PSRAM took 409011 CPU cycles = 57.30 MB/s

I (551) Memory Copy: Swapping source and destination buffers
I (571) Memory Copy: 8-bit for loop copy PSRAM->IRAM took 1040185 CPU cycles = 22.53 MB/s
I (581) Memory Copy: 16-bit for loop copy PSRAM->IRAM took 696069 CPU cycles = 33.67 MB/s
I (591) Memory Copy: 32-bit for loop copy PSRAM->IRAM took 615118 CPU cycles = 38.10 MB/s
I (591) Memory Copy: 64-bit for loop copy PSRAM->IRAM took 602883 CPU cycles = 38.88 MB/s
I (601) Memory Copy: memcpy PSRAM->IRAM took 607278 CPU cycles = 38.59 MB/s
I (611) Memory Copy: async_memcpy PSRAM->IRAM took 458985 CPU cycles = 51.06 MB/s
I (621) Memory Copy: PIE 128-bit PSRAM->IRAM took 605593 CPU cycles = 38.70 MB/s
I (631) Memory Copy: DSP AES3 PSRAM->IRAM took 611055 CPU cycles = 38.36 MB/s

I (631) Memory Copy: Freeing 100kb from IRAM
I (631) Memory Copy: Allocating 100kb in PSRAM, alignment: 32 bytes
I (651) Memory Copy: 8-bit for loop copy PSRAM->PSRAM took 1422563 CPU cycles = 16.48 MB/s
I (661) Memory Copy: 16-bit for loop copy PSRAM->PSRAM took 1090496 CPU cycles = 21.49 MB/s
I (681) Memory Copy: 32-bit for loop copy PSRAM->PSRAM took 1049829 CPU cycles = 22.33 MB/s
I (691) Memory Copy: 64-bit for loop copy PSRAM->PSRAM took 1049128 CPU cycles = 22.34 MB/s
I (701) Memory Copy: memcpy PSRAM->PSRAM took 1049045 CPU cycles = 22.34 MB/s
I (711) Memory Copy: async_memcpy PSRAM->PSRAM took 895774 CPU cycles = 26.16 MB/s
I (721) Memory Copy: PIE 128-bit PSRAM->PSRAM took 1054546 CPU cycles = 22.23 MB/s
I (741) Memory Copy: DSP AES3 PSRAM->PSRAM took 1059057 CPU cycles = 22.13 MB/s
I (741) main_task: Returned from app_main()
