Random boot failure depending on .flash.text size

sgiasm
Posts: 4
Joined: Mon Jul 01, 2024 11:25 am

Random boot failure depending on .flash.text size

Postby sgiasm » Mon Dec 09, 2024 3:58 pm

We've encountered a strange problem: completely randomly, after change to the source code (the actual change is irrelevant), firmware becomes "damaged" or "incorrect": it starts rebooting due to LoadProhibited or IllegalInstruction somewhere around esp_psram_init (usually it's movi.n a3, 0), so this is even before our code.
Rebuilding doesn't help, it reproduces on different devices (so, for example if someone pushes the code in crashing state - all developers, even with different host OS, will have bootloop on their devices)
Instructions in disassembly look completely correct and are the same in normal and bootlooping firmwares.
After investigation we found out that the size of the .flash.text block of the firmware matters. So, to fix this bootloop usually is enough to add a string to the source code, for example LOGI("crashfix", "crashfix"); line. This .flash.text block lies before instructions, so it moves them.

Here is some statistics of movi.n a3, 0 address:
0x4200b589, 0x4200b669, 0x4200b689 - LoadProhibited crash
0x4200b665, 0x4200b591 - works OK

Currently it's just an annoying ritual - to add/remove log line if firmware crashes, but last time it became worse – simple one-line log didn't fix it and only removing a bunch of logs helped. I will post here more info, once it reproduces again. And we haven't encountered this problem on ESP-IDF 4.*.* but that may be also due to the small firmware size at that moment.

So, currently I don't know if it's something wrong with our setup (we are not doing anything fancy), or toolchain or hardware bug. Any ideas and information are really appreciated.


Example of disassembly:

Code: Untitled.asm Select all


...
Disassembly of section .flash.text:

42000020 <_stext>:
42000020: 3fca1a24
42000024: 3c0e0120

...

s_psram_ctx.is_initialised = true;
4200b661: d4e831 l32r a3, 42000a04 <_stext+0x9e4> (3fca22c4 <s_psram_ctx>)
4200b664: 180c movi.n a8, 1
4200b666: 004382 s8i a8, a3, 0
uint32_t psram_physical_size = 0;
4200b669: 030c movi.n a3, 0 <--------------- PC
4200b66b: 0139 s32i.n a3, a1, 0
Example of logs:

Code: Select all

...
I (490) cpu_start: Multicore app
I (490) octal_psram: vendor id    : 0x0d (AP)
I (490) octal_psram: dev id       : 0x02 (generation 3)
I (493) octal_psram: density      : 0x03 (64 Mbit)
I (499) octal_psram: good-die     : 0x01 (Pass)
I (504) octal_psram: Latency      : 0x01 (Fixed)
I (509) octal_psram: VCC          : 0x01 (3V)
I (514) octal_psram: SRF          : 0x01 (Fast Refresh)
I (520) octal_psram: BurstType    : 0x01 (Hybrid Wrap)
I (526) octal_psram: BurstLen     : 0x01 (32 Byte)
I (531) octal_psram: Readlatency  : 0x02 (10 cycles@Fixed)
I (538) octal_psram: DriveStrength: 0x00 (1/1)
Guru Meditation Error: Core  0 panic'ed (LoadProhibited). Exception was unhandled.

Core  0 register dump:
0x4200b669: esp_psram_init at C:/Users/.../esp/v5.1.4/esp-idf/components/esp_psram/esp_psram.c:135
PC      : 0x4200b669  PS      : 0x00060f30  A0      : 0x80376109  A1      : 0x3fceb2a0
A2      : 0x00000000  A3      : 0x3fca22c4  A4      : 0x00000001  A5      : 0x50000000
A6      : 0x00000000  A7      : 0x00000000  A8      : 0x00000001  A9      : 0x3fceb280
A10     : 0x00000000  A11     : 0x00000000  A12     : 0x00000028  A13     : 0x3fca8294
A14     : 0x00000000  A15     : 0x00000000  SAR     : 0x00000004  EXCCAUSE: 0x0000001c
EXCVADDR: 0x80376441  LBEG    : 0x40056f08  LEND    : 0x40056f12  LCOUNT  : 0x00000000
0x40056f08: __memcpy_aux in ROM
0x40056f12: __memcpy_aux in ROM

Backtrace: 0x4200b666:0x3fceb2a0 0x40376106:0x3fceb2e0 0x403cd9b4:0x3fceb340 0x403cddf9:0x3fceb380 0x403c9971:0x3fceb4b0 0x40045c01:0x3fceb570 0x40043ab6:0x3fceb6f0 0x40034c45:0x3fceb710
0x4200b666: esp_psram_init at C:/Users/.../esp/v5.1.4/esp-idf/components/esp_psram/esp_psram.c:133
0x40376106: call_start_cpu0 at C:/Users/.../esp/v5.1.4/esp-idf/components/esp_system/port/cpu_start.c:473
0x40045c01: ets_run_flash_bootloader in ROM
0x40043ab6: main in ROM
0x40034c45: .stack_ok in ROM

Chip: ESP32-S3, rev. 0.2
Board: ESP32-S3-WROOM-1
RAM: 8MB, Octal
Flash: 16MB
ESP-IDF: 5.1.4, 5.3.1
Framework: ESP-IDF
Language: C++
Host OS: Linux, Windows

Code: Select all

Total sizes:
Used static IRAM:   99994 bytes ( 262246 remain, 27.6% used)
      .text size:   98967 bytes
   .vectors size:    1027 bytes
Used stat D/IRAM:   62772 bytes ( 283084 remain, 18.1% used)
      .data size:   21276 bytes
      .bss  size:   41496 bytes
Used Flash size : 1147203 bytes
           .text:  889375 bytes
         .rodata:  257572 bytes
Total image size: 1268473 bytes (.bin may be padded larger)

Sprite
Espressif staff
Espressif staff
Posts: 10593
Joined: Thu Nov 26, 2015 4:08 am

Re: Random boot failure depending on .flash.text size

Postby Sprite » Tue Dec 10, 2024 4:12 am

That is odd. What hardware are you trying this on?

sgiasm
Posts: 4
Joined: Mon Jul 01, 2024 11:25 am

Re: Random boot failure depending on .flash.text size

Postby sgiasm » Thu Dec 12, 2024 9:43 am

That is odd. What hardware are you trying this on?

Code: Select all

ESP32-S3-WROOM-1
N16R8 board
FCCID: 2AC7Z-ESPS3WROOM1
IC: 21086-ESPS3WROOM1
CMIITID: 2022DP2892
ESP32-S3, rev. 0.2
and

Code: Select all

ESP32-S3-WROOM-1
M0N8R8 board
IC: 21098-ESPS3WROOM1
CMIITID: 2022DP2892
ESP32-S3, rev. 0.1

Sprite
Espressif staff
Espressif staff
Posts: 10593
Joined: Thu Nov 26, 2015 4:08 am

Re: Random boot failure depending on .flash.text size

Postby Sprite » Fri Dec 13, 2024 2:51 am

That is on a custom board, I assume? Anything connected to GPIO35, 36, 37 or 45?

sgiasm
Posts: 4
Joined: Mon Jul 01, 2024 11:25 am

Re: Random boot failure depending on .flash.text size

Postby sgiasm » Fri Dec 13, 2024 9:11 am

Developers are working on ESP32-S3-WROOM-1 developer boards (like this https://www.amazon.de/-/en/gp/product/B0BMQ8F7FN). We have Waveshare E-Ink display connected to 38, 39, 40, 15, 16, 18, all other pins are free.
We have custom boards, but I didn't check if crash reproduces on them (at current stage they are only remotely available, so we don't use them often). Will check next time it appears.
If this can be pins-related, next time I will also disconnect everything.

User avatar
TheRealRevK
Posts: 14
Joined: Mon Aug 26, 2019 7:32 pm

Re: Random boot failure depending on .flash.text size

Postby TheRealRevK » Thu Dec 19, 2024 9:21 am

I have started seeing the same. It went away (probably because I was adding log lines).

I can confirm, just adding a log line fixed it. i.e. many builds, fullclean, checking sdkconfig, etc, no help, add one log line, and fine.

Code: Select all

SPIWP:0xee
mode:DIO, clock div:1
load:0x3fce2810,len:0xdc
load:0x403c8700,len:0x4
load:0x403c8704,len:0xa08
load:0x403cb700,len:0x2464
entry 0x403c8854
Guru Meditation Error: Core  0 panic'ed (IllegalInstruction). Exception was unhandled.
Memory dump at 0x42006bd8: 0000dfb1 00106403 00a0b2cc
--- 0x42006bd8: esp_chip_info at /Users/adrian/esp/5.3.1/components/esp_hw_support/port/esp32s3/chip_info.c:13

Core  0 register dump:
PC      : 0x42006bdc  PS      : 0x00060d30  A0      : 0x803cc521  A1      : 0x3fceb2e0  
--- 0x42006bdc: esp_chip_info at /Users/adrian/esp/5.3.1/components/esp_hw_support/port/esp32s3/chip_info.c:13

A2      : 0x00000001  A3      : 0x3c0c0020  A4      : 0x3fc9d347  A5      : 0x00000004  
A6      : 0x0000000c  A7      : 0x0000000c  A8      : 0x8037610f  A9      : 0x3fceb2c0  
A10     : 0x3fceb2e0  A11     : 0x00000009  A12     : 0x00100000  A13     : 0xfc000000  
A14     : 0x0000003f  A15     : 0x0000000c  SAR     : 0x00000006  EXCCAUSE: 0x00000000  
EXCVADDR: 0x00000000  LBEG    : 0x40056f5c  LEND    : 0x40056f72  LCOUNT  : 0x00000000  
--- 0x40056f5c: memcpy in ROM
0x40056f72: memcpy in ROM



Backtrace: 0x42006bd9:0x3fceb2e0 0x403cc51e:0x3fceb340 0x403cc75a:0x3fceb380 0x403c88a9:0x3fceb4b0 0x40045c01:0x3fceb570 0x40043ab6:0x3fceb6f0 0x40034c45:0x3fceb710
--- 0x42006bd9: esp_chip_info at /Users/adrian/esp/5.3.1/components/esp_hw_support/port/esp32s3/chip_info.c:13
0x40045c01: ets_run_flash_bootloader in ROM
0x40043ab6: main in ROM
0x40034c45: .stack_ok in ROM

sgiasm
Posts: 4
Joined: Mon Jul 01, 2024 11:25 am

Re: Random boot failure depending on .flash.text size

Postby sgiasm » Wed Jan 15, 2025 7:00 pm

Crash happened again.
I disconnected display, it didn't help. So it's just a dev-board with nothing connected.
I pushed my code and used it remotely on custom board and it crashed also. But! The firmware on custom board is different (due to some flags). And it crashed in different line, but still crash/no-crash state is shared across all our devices/hosts.

Here is brief statistics of current crash on two devices (S3 dev-board on Windows and custom S3 on MacOS, both on ESP-IDF 5.3.1):
Developer board:

Code: Select all

Reason           | IllegalInstruction
Flash .text size | 856090
PC               | 0x420080fc
PC asm           | entry a1, 32
C++ line         | void esp_chip_info(esp_chip_info_t *out_info)
Custom board:

Code: Select all

Reason           | StoreProhibited
Flash .text size | 950330
PC               | 0x42008be2
PC asm           | s32i.n a8, a2, 4
C++ line         | out_info->features = CHIP_FEATURE_WIFI_BGN | CHIP_FEATURE_BLE;
Here is statistics I've got so far:

Code: Select all

+--------------+------------+------------------+----------------------------------------------------------------+--------------------+--------------------+---------+
|              | PC         | PC asm           | PC C++                                                         | Reason             |   Flash .text size | ESP-IDF |
+--------------+------------+------------------+----------------------------------------------------------------+--------------------+--------------------+---------+
| crash        | 0x4200b5d2 | callx8 a8        | ret = esp_psram_impl_get_physical_size(&psram_physical_size);  | LoadProhibited     |             889023 | 5.1.4   |
| crash        | 0x4200b669 | movi.n a3, 0     | uint32_t psram_physical_size = 0;                              | LoadProhibited     |             889375 | 5.1.4   |
| crash        | 0x4200b689 | movi.n a3, 0     | uint32_t psram_physical_size = 0;                              | LoadProhibited     |             889883 | 5.1.4   |
| OK           | 0x4200b665 | movi.n a3, 0     | uint32_t psram_physical_size = 0;                              | OK                 |             889363 | 5.1.4   |
| OK           | 0x4200b591 | movi.n a3, 0     | uint32_t psram_physical_size = 0;                              | OK                 |                nan | 5.1.4   |
| crash        | 0x420080fc | entry a1, 32     | void esp_chip_info(esp_chip_info_t *out_info)                  | IllegalInstruction |             856090 | 5.3.1   |
| crash        | 0x42008be2 | s32i.n a8, a2, 4 | out_info->features = CHIP_FEATURE_WIFI_BGN | CHIP_FEATURE_BLE; | StoreProhibited    |             950330 | 5.3.1   |
| crash        | 0x4200813c | entry a1, 32     | void esp_chip_info(esp_chip_info_t *out_info)                  | IllegalInstruction |             857154 | 5.3.1   |
| OK           | 0x42008138 | entry a1, 32     | void esp_chip_info(esp_chip_info_t *out_info)                  | OK                 |             857110 | 5.3.1   |
+--------------+------------+------------------+----------------------------------------------------------------+--------------------+--------------------+---------+

chegewara
Posts: 2505
Joined: Wed Jun 14, 2017 9:00 pm

Re: Random boot failure depending on .flash.text size

Postby chegewara » Thu Jan 16, 2025 2:02 am

I have the same or similar issue from time to time.
When adding code even before that code have chance to start there is boot loop, but adding some extra log code is solving the issue.
It is very annoying, but i didnt try to debug or report it.

Johan Forslöf
Posts: 1
Joined: Wed May 28, 2025 7:48 am

Re: Random boot failure depending on .flash.text size

Postby Johan Forslöf » Wed May 28, 2025 10:46 am

We have also experienced this issue on a ESP32-S3-WROOM-1U-N8R8 boards with certain firmware builds faulting during PSRAM initialization early in the application startup sequence.

I lack the time to fully investigated this or drive an bug ticket with Espressif but believe that I have found a plausible root cause and bug fix.

Anyway, the fault creates an infinite reset loop (se below) which the device never exists.
I (587) boot: Loaded app from partition at offset 0x20000
I (587) boot: Disabling RNG early entropy source...
I (598) cpu_start: Multicore app
I (599) octal_psram: vendor id : 0x0d (AP)
I (599) octal_psram: dev id : 0x02 (generation 3)
I (602) octal_psram: density : 0x03 (64 Mbit)
I (607) octal_psram: good-die : 0x01 (Pass)
I (612) octal_psram: Latency : 0x01 (Fixed)
I (618) octal_psram: VCC : 0x01 (3V)
I (623) octal_psram: SRF : 0x01 (Fast Refresh)
I (629) octal_psram: BurstType : 0x01 (Hybrid Wrap)
I (634) octal_psram: BurstLen : 0x01 (32 Byte)
I (640) octal_psram: Readlatency : 0x02 (10 cycles@Fixed)
I (646) octal_psram: DriveStrength: 0x00 (1/1)
I (652) MSPI Timing: PSRAM timing tuning index: 5
Guru Meditation Error: Core 0 panic'ed (LoadProhibited). Exception was unhandled.
This issue occurs randomly for certain builds and appears to be affected by FLASH size/placement such that inserting or removing unrelated code prevents the fault.

The fault is deterministics with any bad build continually faulting when installed. This has been reported by multiple team members and occurs with at least ESP-IDF versions 5.1.1 and 5.1.6.

JTAG debugging shows that faults occuring as CPU attempts to execute invalid code in esp_psram_init soon after return from esp_psram_impl_enable. The exact code varies one example is a l32i.n instruction attempting to read from a register not storing an address.

This incorrect gibberish code is inconsistent with the actual content of FLASH.

Test findings:
- The incorrectly content is located within a 32-byte instruction cache line following return from esp_psram_impl_enable.
- The fault occurs when the call esp_psram_impl_enable to is placed to close to the end of the prior cache line.
- The FLASH content remains valid. Invaliding the cache by reading a larger memory block or inspecting the data cache mirror at 0x3C00000 does fetch the correct content.
- Adding a brief delay (or cache invalidation) early in esp_psram_impl_enable prevents the issue.
- The critical point is reconfiguration of the SPI1 clock divider when entering low speed mode for the timing calibration. Delay before this register write prevents the issue and while delays afterwards have no effect.
- SPI0 is used for the external FLASH while SPI1 is used for PSRAM. Both peripherals some clocking resources.
- The esp_psram_impl_enable code is placed in IRAM.
- The fault is not consistent for a particular cache line alignment of the esp_psram_impl_enable call, though so far all cases return within the last 4-bytes of the cache line.
- There ESP32-S3 errata sheet lists no relevant issues and later IDF versions do not appear to include a fix.
- Using Cache_Suspend_DCache_Autoload to disable or speculative prefetching or Cache_Config_ICache_Autoload to force aggressive prefetch does not exacerbate the issue. Admittedly I may be using thsese wrong given the absence of documentation.

My current working theory is that:
- The CPU pipeline has started fetcing the next cache line after the call to esp_psram_impl_enable when located close to the cache line boundary.
- PSRAM configuration races against the instruction line fetch from FLASH and clock divider configuration of the FLASH interface mid-transfer.
- This race ends up fetching incorrect data at the end of the cache line.
- The PSRAM initialization is missing synchronization to stop or complete any ongoing FLASH access.
- Modifying the program size changes the esp_psram_init placement in FLASH to hide the problem.
- Other confounding variable affect the issue. Perhaps the new cache line has already been fetched for other reasons, or associativity-related collisions are significant, etc.

For now I have patched esp_psram_impl_enable to add an early call to esp_rom_spiflash_wait_idle(&g_rom_flashchip) as a workaround to synchronize ongoing FLASH transactions. This fixes our known repro cases and has avoided any reoccurrence thus far.

Who is online

Users browsing this forum: PetalBot and 1 guest