ESP performance and cache

Post by **PeterR** » Mon Dec 17, 2018 1:52 pm

I have run out of IRAM, so I now need to understand the performance issues relating to the cache, ROM and flash.
The manuals are not entirely clear about what can run and I have not found any performance details/estimates.

I will make a few statements in order to draw comment, please correct me as needed!

1) User application code may be run directly from internal ROM (448 KB)
If so what frequency does this work at & how does that compare to IRAM?
If so, do I use ROMFN_ATTR to locate in internal ROM & how do I check how much is free?

2) User application code may be placed and executed directly within RTC fast memory using RTC_IRAM_ATTR
and may also be called from regular functions.
If so how does RTC_IRAM_ATTR performance compare to IRAM. Is RTC_IRAM_ATTR the same speed as IRAM?

3) External Flash code memory
Is external memory fetched directly by the MMU over QSPI or is software involved?
If software is involved, what latency might I expect?
What throughput may I expect? I am guessing <20 MB/s? That gives me >500 µs for a 10 KB code block.

4) External Flash
Section 3.1.3 of the ESP32 datasheet suggests that external flash may be configured as instruction memory or read-only memory.
I have a large webserver which I intended to place within SPIFFS, but I could use another file system.
How do I ensure that the webserver memory is 'mapped into read-only data memory space' and so prevent the reduction of cache performance discussed in section 3.1.3?
I found spi_flash_mmap() but usage is not clear.

5) External Memory & Cache
I did not follow the wording in 1.3.4 of the technical reference. What does

CACHE_MUX_MODE is set to 1 or 2, PRO CPU and APP CPU cannot enable the Cache function at the same time

mean?
Is this just a comment that two updates to bits in one register may not go as expected?

Thanks

Post by **Angus** » Mon Dec 17, 2018 11:40 pm

Hi PeterR,

Probably the best place to start understanding ESP32 memory layout as used by ESP-IDF is this section in the docs, if you haven't already read it:
https://docs.espressif.com/projects/esp ... ory-layout

1) User application code may be run directly from internal ROM (448 KB)
If so what frequency does this work at & how does that compare to IRAM?
If so, do I use ROMFN_ATTR to locate in internal ROM & how do I check how much is free?

The ROM is mask ROM which is baked into the chip as part of the fabrication process. This is entirely read-only in all contexts. If you call a ROM function then it will run as fast as a function in IRAM, but the ROM functions are determined at chip fabrication time so user application code can't go here.

There is some potential for confusion here because flash mapped into the instruction address space via the flash cache (which is also read-only at chip run-time but can be reflashed) is sometimes referred to as "IROM" in ESP-IDF code, as a shorter way to say "mapped flash instruction memory".

2) User application code may be placed and executed directly within RTC fast memory using RTC_IRAM_ATTR
and may also be called from regular functions.
If so how does RTC_IRAM_ATTR performance compare to IRAM. Is RTC_IRAM_ATTR the same speed as IRAM?

The main property of this 8KB of RTC fast memory is that it persists in deep sleep so can be used for deep sleep wake stubs.

It is also possible to put some regular functions here if you are tight on regular IRAM. Note, however, that RTC FAST memory is only accessible from the PRO CPU (CPU0), so any code placed here can only be called from tasks pinned to CPU 0 (or interrupt handlers assigned to CPU 0).

I believe that when the CPU is running the RTC fast memory is as fast as other internal memory (DRAM or IRAM) but I will check this and confirm.

3) External Flash code memory
Is external memory fetched directly by the MMU over QSPI or is software involved?
If software is involved, what latency might I expect?
What throughput may I expect? I am guessing <20 MB/s? That gives me >500 µs for a 10 KB code block.

No software is involved; the flash cache layer is handled entirely by hardware and is transparent to software.

In the event of a cache miss, the cache line is automatically filled via a read from the flash chip. The cache line size is 32 bytes. With Quad SPI (QIO mode) this requires something like 72 clocks to fill the cache line (8 clocks for the command plus 64 clocks to read 32 bytes of data at 4 bits per clock). At 80MHz (max flash speed) this means roughly 900ns to read a line, approx 35MB/sec max throughput.

(EDIT: Previous version of this paragraph said cache lines are 16 bytes, but they are 32 bytes. Calculations adjusted to match.)

In the case of a cache hit, reading from flash cache is as fast as reading from IRAM.

The cache size is 32KB per CPU (each CPU's cache is independent).

Characterizing performance of cached code exactly is difficult because of the cache, CPU pipelining, and the code layout in flash as determined by compiler/linker. Individual cache misses tend to average out for "real" workloads, but micro-benchmarks can be misleading because of individual cache interactions.

Moving "performance critical" code to IRAM, especially any small piece of code which always needs to run at maximum speed, is the best way to ensure consistent highest possible performance.

4) External Flash
Section 3.1.3 of the ESP32 datasheet suggests that external flash may be configured as instruction memory or read-only memory.
I have a large webserver which I intended to place within SPIFFS, but I could use another file system.
How do I ensure that the webserver memory is 'mapped into read-only data memory space' and so prevent the reduction of cache performance discussed in section 3.1.3?
I found spi_flash_mmap() but usage is not clear.

By default, all code (.text section in the ELF file) is mapped to instruction memory via flash cache and all read-only data ("const" variables and strings, .rodata section in the ELF file) is mapped to data memory via flash cache. As you've seen, attributes can be used to override these defaults.

I don't understand what the "webserver" is? Is it code or data? How large is "large"?

If you have some large chunk of binary or text data which is specific to the firmware, another option is to embed it in the binary (in which case it will be mapped as read-only data via flash cache):
https://docs.espressif.com/projects/esp ... inary-data

5) External Memory & Cache
I did not follow the wording in 1.3.4 of the technical reference. What does
CACHE_MUX_MODE is set to 1 or 2, PRO CPU and APP CPU cannot enable the Cache function at the same time
mean?
Is this just a comment that two updates to bits in one register may not go as expected?

If either POOL0 or POOL1 is not used for a flash cache, the "pool" memory can be accessed as regular IRAM by the CPU.

However, only a single CPU can use a pool at one time. Normally POOL0 is used by CPU0 and POOL1 is used by CPU1. Setting CACHE_MUX_MODE to 1 or 2 allows both CPUs to be assigned to a single pool. This means, for example, CPU1 could use POOL0 for its cache memory or CPU0 could use POOL1. However in these modes only a single CPU can use the flash cache at a given time.

If you're using ESP-IDF then only two cache configurations are actually supported - dual CPU mode where CPU0 uses POOL0 and CPU1 uses POOL1, and single CPU mode where CPU0 uses POOL0 and POOL1 is added into the address space and can be used as regular IRAM. ESP-IDF will configure the flash cache at startup based on the choices in the project config. The other, more exotic configurations are not supported.

If I may make an observation and then a suggestion:

It seems like you're taking performance very seriously and this is commendable. However, on a complex but still resource-constrained system like an ESP32 there are a lot of overlapping factors which can make it hard to figure out all of the best performance choices in advance (as you're finding).

If you can, try implementing your firmware in ESP-IDF without worrying too much about optimisations: don't use IRAM_ATTR or similar, use the simplest way to read data from flash instead of worrying about the fastest, and so on. Once the functions are implemented, measure the performance you get. If you then find that some particular function of the firmware does not meet your performance requirements, or uses too much RAM, start by optimising that part of the code and re-measure the performance after each change. Repeat if necessary until the performance is adequate.

Post by **PeterR** » Tue Dec 18, 2018 11:09 am

Hi, Thanks for your full answer.

4) External FLASH (& webserver)
Around 3.5 MB of webpages. I have scope to reduce. Bootstrap was added but we don't really need it.
The server itself will be quite small. The usual stuff, websockets & CGI.

I thought SPIFFS as the compile & upload process seemed smooth. I found a fork which had directory support as well.
Your link seems a much more manual process however. I suppose I could script & automatically generate the URI handler etc.
I have used embedded file systems before & pinching makefs or similar would seem simpler.

I agree, write then optimize. 'Premature optimization' & all that.

I have been developing for a few years however & believe that I have a good feeling for what flies.
The 'change' is that I missed that IRAM would be pretty much used up by the IDF.
That tends to put my application code on two 16 MHz processors + whatever I can gain out of cache & tight loops.

Thanks again for your detailed answer.

Post by **Ritesh** » Tue Dec 18, 2018 1:40 pm

Hi,

You can use the link below for SPIFFS support with directories as well.

https://github.com/loboris/ESP32_spiffs_example
