For me it boils down to a) inline assembler and b) ways to 'smoothly' integrate the use of inline assembly into C++ code.
Funny how one's C++ experience defines what they prefer. I like inline assembly because there is no need to understand the inner workings of the GCC compiler in order to understand what's happening. If you understand the 'asm' keyword then it's obviously assembly, and the parameters etc. are self-explanatory. I've looked at your templated version and I still don't fully understand how it works.

To round things out, it also looks like it's possible to use the .S extension with the ESP-IDF and compile assembly code directly. But this is not something you even mention, so I'm guessing it's not your preferred approach.
I wasn't able to get the template to compile as it doesn't know the INT in the function definition. It also complains about Q0 and Q1. I was however able to get the inline assembly to compile and I now get 1.78 GB/s. I'm a little surprised it doesn't match the 1.91 GB/s of your code, as the two look functionally the same. Perhaps there is another config setting slowing things down.
I have created a GitHub repository (https://github.com/project-x51/esp32-s3 ... /tree/main) and put the code to date there. It includes the templated copy version commented out so things compile. Feel free, if you have the time, to fix this in the repository, or perhaps just leave a link to your library once it's going.
If you feel so inclined, I would love to know how the templated code works. I understand the basics of C++ generics as they work similarly to C#. I can see the args are variadic arguments like those used in the printf function. But I don't understand the && (rvalue reference) or the variable f (functor). I think it is similar to a delegate in C#. Most importantly, I'm curious to know how this all ties together with the C++ code passed in (to the functor, I think) and why you prefer this to straight-up inline assembly.
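To make the question concrete, here is my rough understanding of the general pattern, written as a minimal sketch. It is not your code: `call_with` and `add` are hypothetical names I made up, and I'm assuming the `&&` on a template parameter is what the C++ docs call a forwarding reference, with `std::forward` passing each argument on unchanged.

```cpp
#include <utility>

// Hypothetical helper (not the code from this thread): accepts any callable
// `f` plus any number of arguments and invokes f with them.
// F&& and Args&& are forwarding references because F and Args are deduced.
template <typename F, typename... Args>
decltype(auto) call_with(F&& f, Args&&... args) {
    // std::forward keeps lvalues as lvalues and rvalues as rvalues,
    // so nothing is copied that the caller did not ask to copy.
    return std::forward<F>(f)(std::forward<Args>(args)...);
}

// A plain function can be passed as the callable.
int add(int a, int b) { return a + b; }

// A lambda can be passed too, which seems close to a C# delegate.
int demo() {
    int total = 0;
    auto accumulate = [&total](int x) { total += x; return total; };
    call_with(accumulate, 10);     // total becomes 10
    call_with(accumulate, 5);      // total becomes 15
    return call_with(add, total, 1);
}
```

If that is roughly right, what I'm still missing is how the compiler turns the code inside the callable into the PIE instructions.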
Below is the performance data for the various ways of copying memory I have tried so far. I'm surprised the GDMA controller (async_memcpy) is not faster at copying IRAM->IRAM as well.
I (391) Memory Copy:
memory copy version 1.
I (401) Memory Copy: Allocating 2 x 100kb in IRAM, alignment: 32 bytes
I (421) Memory Copy: 8-bit for loop copy IRAM->IRAM took 819662 CPU cycles = 28.59 MB/s
I (421) Memory Copy: 16-bit for loop copy IRAM->IRAM took 461530 CPU cycles = 50.78 MB/s
I (431) Memory Copy: 32-bit for loop copy IRAM->IRAM took 206343 CPU cycles = 113.59 MB/s
I (441) Memory Copy: 64-bit for loop copy IRAM->IRAM took 128300 CPU cycles = 182.68 MB/s
I (451) Memory Copy: memcpy IRAM->IRAM took 64658 CPU cycles = 362.48 MB/s
I (461) Memory Copy: async_memcpy IRAM->IRAM took 412233 CPU cycles = 56.85 MB/s
I (461) Memory Copy: PIE 128-bit IRAM->IRAM took 13145 CPU cycles = 1783.00 MB/s
I (471) Memory Copy: DSP AES3 IRAM->IRAM took 17028 CPU cycles = 1376.41 MB/s
I (471) Memory Copy: Freeing 100kb from IRAM
I (481) Memory Copy: Allocating 100kb in PSRAM, alignment: 32 bytes
I (501) Memory Copy: 8-bit for loop copy IRAM->PSRAM took 1075776 CPU cycles = 21.79 MB/s
I (501) Memory Copy: 16-bit for loop copy IRAM->PSRAM took 720958 CPU cycles = 32.51 MB/s
I (511) Memory Copy: 32-bit for loop copy IRAM->PSRAM took 462534 CPU cycles = 50.67 MB/s
I (521) Memory Copy: 64-bit for loop copy IRAM->PSRAM took 422886 CPU cycles = 55.42 MB/s
I (531) Memory Copy: memcpy IRAM->PSRAM took 413299 CPU cycles = 56.71 MB/s
I (541) Memory Copy: async_memcpy IRAM->PSRAM took 478345 CPU cycles = 49.00 MB/s
I (541) Memory Copy: PIE 128-bit IRAM->PSRAM took 403536 CPU cycles = 58.08 MB/s
I (551) Memory Copy: DSP AES3 IRAM->PSRAM took 409011 CPU cycles = 57.30 MB/s
I (551) Memory Copy: Swapping source and destination buffers
I (571) Memory Copy: 8-bit for loop copy PSRAM->IRAM took 1040185 CPU cycles = 22.53 MB/s
I (581) Memory Copy: 16-bit for loop copy PSRAM->IRAM took 696069 CPU cycles = 33.67 MB/s
I (591) Memory Copy: 32-bit for loop copy PSRAM->IRAM took 615118 CPU cycles = 38.10 MB/s
I (591) Memory Copy: 64-bit for loop copy PSRAM->IRAM took 602883 CPU cycles = 38.88 MB/s
I (601) Memory Copy: memcpy PSRAM->IRAM took 607278 CPU cycles = 38.59 MB/s
I (611) Memory Copy: async_memcpy PSRAM->IRAM took 458985 CPU cycles = 51.06 MB/s
I (621) Memory Copy: PIE 128-bit PSRAM->IRAM took 605593 CPU cycles = 38.70 MB/s
I (631) Memory Copy: DSP AES3 PSRAM->IRAM took 611055 CPU cycles = 38.36 MB/s
I (631) Memory Copy: Freeing 100kb from IRAM
I (631) Memory Copy: Allocating 100kb in PSRAM, alignment: 32 bytes
I (651) Memory Copy: 8-bit for loop copy PSRAM->PSRAM took 1422563 CPU cycles = 16.48 MB/s
I (661) Memory Copy: 16-bit for loop copy PSRAM->PSRAM took 1090496 CPU cycles = 21.49 MB/s
I (681) Memory Copy: 32-bit for loop copy PSRAM->PSRAM took 1049829 CPU cycles = 22.33 MB/s
I (691) Memory Copy: 64-bit for loop copy PSRAM->PSRAM took 1049128 CPU cycles = 22.34 MB/s
I (701) Memory Copy: memcpy PSRAM->PSRAM took 1049045 CPU cycles = 22.34 MB/s
I (711) Memory Copy: async_memcpy PSRAM->PSRAM took 895774 CPU cycles = 26.16 MB/s
I (721) Memory Copy: PIE 128-bit PSRAM->PSRAM took 1054546 CPU cycles = 22.23 MB/s
I (741) Memory Copy: DSP AES3 PSRAM->PSRAM took 1059057 CPU cycles = 22.13 MB/s
I (741) main_task: Returned from app_main()