
IRAM fragmentation at startup?

Posted: Mon Apr 09, 2018 7:03 am
by Alekos2313
Hello,
In my current project, developed on a WROVER KIT V3, I need a large and fast memory buffer. Testing with the buffer allocated in external RAM, the speed was not enough and the timing was not deterministic. I guess this makes sense, since the PSRAM bus is shared with the SPI flash. Since the project is about converting one stream format to another in real-time streaming, this is not a trade-off I can make. I thought it would be best to use internal RAM, and assumed that while FreeRTOS and other base services would take up some of the 520K of internal RAM, a good amount would still be available for my use.
The project I have here comes from the esp32 template with very little added. At the very early stage of app_main(), I call 'heap_caps_print_heap_info( MALLOC_CAP_INTERNAL );' to see what memory I have left. The result is not what I expected. Here I quote from the terminal:
Heap summary for capabilities 0x00000800:
At 0x3ffae7e4 len 2047 free 120 allocated 1844 min_free 120
largest_free_block 120 alloc_blocks 11 free_blocks 1 total_blocks 12
At 0x3ffae6e0 len 6432 free 4 allocated 6288 min_free 4
largest_free_block 4 alloc_blocks 26 free_blocks 1 total_blocks 27
At 0x3ffbc898 len 145256 free 130988 allocated 14184 min_free 130028
largest_free_block 130988 alloc_blocks 12 free_blocks 1 total_blocks 13

At 0x3ffe0440 len 15296 free 15260 allocated 0 min_free 15260
largest_free_block 15260 alloc_blocks 0 free_blocks 1 total_blocks 1
At 0x3ffe4350 len 113840 free 113804 allocated 0 min_free 113804
largest_free_block 113804 alloc_blocks 0 free_blocks 1 total_blocks 1

At 0x40098d0c len 29428 free 29392 allocated 0 min_free 29392
largest_free_block 29392 alloc_blocks 0 free_blocks 1 total_blocks 1
Totals:
free 289568 allocated 22316 min_free 288608 largest_free_block 130988
I would expect that, since I print the report at startup, not many malloc/free calls have run, and the reserved memory would be close to the start of IRAM, leaving one large block free at the bottom. Also, this total (289568 + 22316) is much less than 520K. I suppose part of the IRAM is taken up by IRAM_ATTR functions? Or maybe I don't understand the memory layout well... or I could configure the SDK to use more (or all) of the IRAM as heap.
Looking at the technical reference, there should be 3 blocks of internal RAM: Internal SRAM 0 (192 KB), Internal SRAM 1 (128 KB), and Internal SRAM 2 (200 KB).

In this report I can see 2 large blocks, the ones at 0x3ffbc898 and 0x3ffe4350. Also, I can't see why there are small pieces reserved in all blocks. At such an early stage, when I call the report, I would expect all reserved memory to come from the first block (SRAM0?), but here fragmentation is already evident.

Can you help me understand how to control and resolve this situation?
Please advise!
Thank you

Re: IRAM fragmentation at startup?

Posted: Mon Apr 09, 2018 2:26 pm
by vonnieda
There may be some info in this thread that helps: https://esp32.com/viewtopic.php?f=2&t=3802

Re: IRAM fragmentation at startup?

Posted: Mon Apr 09, 2018 10:26 pm
by WiFive

Re: IRAM fragmentation at startup?

Posted: Tue Apr 10, 2018 2:50 am
by ESP_Angus
Hi Alekos,

I've replied to a few points below inline. Hopefully this will help explain the situation. You can also find some supporting information in the docs here:
http://esp-idf.readthedocs.io/en/latest ... alloc.html
http://esp-idf.readthedocs.io/en/latest ... ory-layout
Alekos2313 wrote:Hello,
Testing with the buffer allocated in external RAM, the speed was not enough and the timing was not deterministic. I guess this makes sense, since the PSRAM bus is shared with the SPI flash. Since the project is about converting one stream format to another in real-time streaming, this is not a trade-off I can make.
The critical differentiator for external RAM is locality of the data accesses. If you have to sweep the entire large buffer each time you read/write, it will be slow because all operations have to go via the external RAM chip. If you access a small part of the RAM at a time, these operations can be cached in the internal memory so they are faster.
Alekos2313 wrote: I would expect that, since I print the report at startup, not many malloc/free calls have run, and the reserved memory would be close to the start of IRAM, leaving one large block free at the bottom. Also, this total (289568 + 22316) is much less than 520K. I suppose part of the IRAM is taken up by IRAM_ATTR functions? Or maybe I don't understand the memory layout well... or I could configure the SDK to use more (or all) of the IRAM as heap.
There's a few things to note here:

- IRAM is Instruction RAM, which is usually used to house executable code. But if you plan to only access that memory buffer via 32-bit operations, you can use it as general data-style memory. IRAM is always at a 0x4....... address, so it's only the last "heap" in the dump shown above.

- DRAM is Data RAM. This is all the other addresses (0x3f.......) in the dump. DRAM can be accessed in any size reads/writes. To see DRAM only, pass (MALLOC_CAP_INTERNAL | MALLOC_CAP_8BIT) to the heap functions. The "generic" (ie non-capability-aware) heap functions will always allocate from DRAM only (although they may use external DRAM for large allocations, if configured to do this.)

- Heap is dynamically allocated memory, which is all the memory that remains after statically allocated memory is taken up by the program. For IRAM, this means code which is placed in IRAM (either via IRAM_ATTR or because the entire source file is linked to IRAM). For DRAM, it means statically allocated memory (static variables, etc.). The commands "make size", "make size-components" and "make size-files" will give you different levels of breakdown of the static memory usage of your program. ESP-IDF will take up static memory in any non-trivial program.

- The fact that we have a bunch of heaps (including some small ones) is because some memory is used by the early startup code (before any heap is initialized) and some memory is used by the ROM code. Where possible we reclaim this at the end of startup and make it available as heap, but there are sometimes gaps, or regions of static memory still used by ROM functions. Hence the fragmentation, even at startup.

How large is the contiguous buffer of RAM you need?

Re: IRAM fragmentation at startup?

Posted: Fri Aug 03, 2018 8:24 am
by Alekos2313
Hello, thank you for your answer. I was busy with other things and my response is late, sorry!
So, the idea is to have a configurable system, and that configuration will determine the chunks of fast RAM that are needed.
Consider it a protocol translator, where streaming data arrives on a UDP socket in bursts of constant-length packets.
These packets need to be put into a temporary queue for processing, and then sent out as modified frames, each made up of X packets. That "queue" is currently implemented as a 2-page ring queue: page A is the one I write into, while page B holds the previous frame, which must stay untouched while the sending thread reads it. I use 2 pages because some sequences of packets may contain corrupt or missing data, so if the code decides to drop a whole frame (X packets), it is most efficient to ignore a whole page and use the next one, instead of trying to "fix" the ring-queue contents. After some light processing on the packets (stripping the first header, etc.), the payload of X packets is placed as a whole frame in an output buffer, where it will be picked up by another thread and sent out as a serial stream.
An example with numbers: I receive 320 packets of 600 bytes per second. Every 16 of these make up a frame, so they must be processed as they arrive, building up a header-stripped payload of 16 * 510 bytes that makes up a full frame. This full frame is then copied into the output buffer and sent out serially. Since the output uses the RMT module, each byte of the 16 * 510 is translated into 8 uint32 values in the format the RMT expects, i.e. 16 * 510 * 8 uint32 in total. To save RAM, this is done on the fly: as the RMT consumes 32 of these uint32 values, the next chunk is converted and placed in the RMT buffer. This happens for approximately 20 frames/sec. Even though the data rate is not high, the real-time streaming and 2-level processing make the job somewhat heavy and RAM-consuming.

If I could do raw reception directly out of the WiFi buffer and skip the lwIP stack and socket handling, I could spare one stage of buffering. In other words, if I could handle the WiFi-receive hardware interrupt directly, I would do all the header processing right there and dump only the useful payload into the second-stage buffer, ready for streaming out. So far I haven't been able to do so, and from what I read, even a raw API for lwIP is not well supported in IDF.

I hope it's now clear what kind of RAM buffers I need and what I am trying to do. Note that the first-stage buffering needs to be ~150% larger than the expected data, since the WiFi reception timing is not always stable, so I should be able to absorb fluctuations without dropping packets.
Thank you
Alex