The device is one piece of a custom flight simulation peripheral suite, and is fairly straightforward from a hardware standpoint. The ESP32 is the main/only MCU, and it communicates with 3x MCP23S17 GPIO expanders (SPI) as well as an MCP251863 CAN controller+transceiver (also SPI, same bus). I do not use DMA for SPI on this board, as the MCU easily handles the required duty cycle without it.
My development environment is PlatformIO and the official "espressif32" platform with "arduino" framework; I'm not using the 3rd-party "pioarduino" platform. This means I'm stuck on an older-than-ideal Arduino core v2.0.17, and I'm leveraging the Adafruit TinyUSB Library v3.3.4 for USB functionality, as this is the newest version of the library that compiles with that platform core. I have spent upwards of 25 hours trying to find a better way that would allow me to use the latest core and USB drivers, but without success.
For reference, here are the core/platform-related versions dumped by my app firmware at boot:
[ 26627][I][app.cpp:45] app_init(): [APP] ESP-IDF version: v4.4.7-dirty
[ 26637][I][app.cpp:46] app_init(): [APP] FreeRTOS version: V10.4.3
[ 26647][I][app.cpp:47] app_init(): [APP] Core TinyUSB version: 0.17.0
[ 26657][I][app.cpp:48] app_init(): [APP] Arduino framework version: 2_0_17 (5e19e086)
[ 26669][I][app.cpp:49] app_init(): [APP] Adafruit TinyUSB version: 3.3.4 (API version 30000)
I've given up on rock-solid stability at this point, for this design variant. BUT, while I can deal with an extremely infrequent board reset, I can't deal with a hang that requires user intervention to recover...but that's what's happening.
In short, the board stops executing code after running perfectly well for anywhere between 10 minutes and 24 hours, roughly. It seems to be very random. The normal application code blinks a hardware LED on/off at 1 Hz during normal operation, and when the problem occurs, the blinking stops. It's visually very obvious.
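For context, a minimal sketch of the kind of non-blocking heartbeat described above. The class name and the 500 ms half-period are assumptions for illustration (one full on/off cycle per second), not the actual firmware code.

```cpp
#include <cstdint>

// Hypothetical heartbeat helper: flips the LED state every half period.
// With the default 500 ms half period the LED completes one on/off cycle per second.
class Heartbeat {
public:
    explicit Heartbeat(uint32_t half_period_ms = 500)
        : half_period_ms_(half_period_ms) {}

    // Call on every pass of the main loop with the current uptime in
    // milliseconds (for example the value of millis()). Returns the level
    // that should be written to the LED pin.
    bool update(uint32_t now_ms) {
        // Unsigned subtraction stays correct when the millisecond counter wraps.
        if (now_ms - last_toggle_ms_ >= half_period_ms_) {
            last_toggle_ms_ = now_ms;
            led_on_ = !led_on_;
        }
        return led_on_;
    }

private:
    uint32_t half_period_ms_;
    uint32_t last_toggle_ms_ = 0;
    bool led_on_ = false;
};
```

Because the state only changes when update() is called from the main loop, the LED freezes in whatever state it was in as soon as the loop stops running, which is what makes the hang so visible.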
I've had UART serial debugging output (115200 baud) with ESP_LOGx for the entire troubleshooting effort, and system-wide INFO-level logs are enabled. (DEBUG-level logs, system-wide, would flood the interface and severely change the behavior of the overall system.) When the freeze happens, there are no helpful messages over UART immediately before.
Eventually I got an ESP-PROG debugger connected via JTAG, and confirmed that I could pause execution, set and trigger breakpoints, etc. This let me get a little closer, but ultimately the debugger just pauses with a SIGTRAP signal, sometimes (?) seeming to come from the main "loopTask" FreeRTOS task that handles the main program loop. I have enabled the task watchdog timer for the loop task, and every other watchdog timer I know how to enable, so I can't for the life of me figure out why this condition doesn't trigger a board reset. But it doesn't.
I have my build flags set to "-Og -ggdb3 -fno-inline" and my debug_extra_cmds list (in platformio.ini) set to "handle SIGTRAP stop print nopass", which I read somewhere online might help the debugger grab the problem before it gets too lost inside internal exception/panic handlers. But no dice.
For reference, this is what I get from the debug session inside PlatformIO when the SIGTRAP occurs:
[esp32s2] Target halted, PC=0x4002A562, debug_reason=00000002
Thread 5 "loopTask" received signal SIGTRAP, Trace/breakpoint trap.
[Switching to Thread 1073545132]
_xt_context_save () at /home/runner/work/esp32-arduino-lib-builder/esp32-arduino-lib-builder/esp-idf/components/freertos/port/xtensa/xtensa_context.S:145
145 /home/runner/work/esp32-arduino-lib-builder/esp32-arduino-lib-builder/esp-idf/components/freertos/port/xtensa/xtensa_context.S: No such file or directory.
Some relevant registers:
- exccause is 0x00000000
- excvaddr is also 0x00000000
- debugcause is 0x00000104
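A minimal sketch for unpacking that debugcause value, assuming the DEBUGCAUSE bit layout as I understand it from the Xtensa ISA reference (flag bits 0 to 5, with the data breakpoint number DBNUM in bits 8 to 11). The struct and function names are hypothetical and do not come from any SDK header.

```cpp
#include <cstdint>
#include <string>

// Assumed DEBUGCAUSE layout (Xtensa ISA reference, to the best of my knowledge).
constexpr uint32_t kIcount   = 1u << 0;  // single-step instruction counter expired
constexpr uint32_t kIbreak   = 1u << 1;  // hardware instruction breakpoint
constexpr uint32_t kDbreak   = 1u << 2;  // hardware data breakpoint (watchpoint)
constexpr uint32_t kBreak    = 1u << 3;  // BREAK instruction executed
constexpr uint32_t kBreakN   = 1u << 4;  // BREAK.N instruction executed
constexpr uint32_t kDebugInt = 1u << 5;  // debug interrupt (external halt request)
constexpr uint32_t kDbnumShift = 8;      // DBNUM field: which data breakpoint hit
constexpr uint32_t kDbnumMask  = 0xFu;

struct DebugCause {
    bool icount, ibreak, dbreak, brk, brk_n, debug_int;
    unsigned dbnum;
};

// Split a raw DEBUGCAUSE value into its individual fields.
DebugCause decode_debugcause(uint32_t raw) {
    DebugCause c{};
    c.icount    = (raw & kIcount) != 0;
    c.ibreak    = (raw & kIbreak) != 0;
    c.dbreak    = (raw & kDbreak) != 0;
    c.brk       = (raw & kBreak) != 0;
    c.brk_n     = (raw & kBreakN) != 0;
    c.debug_int = (raw & kDebugInt) != 0;
    c.dbnum     = (raw >> kDbnumShift) & kDbnumMask;
    return c;
}

// Human-readable summary: the set flags, plus DBNUM when DBREAK is set.
std::string describe(const DebugCause& c) {
    std::string out;
    auto add = [&out](bool set, const char* name) {
        if (!set) return;
        if (!out.empty()) out += ", ";
        out += name;
    };
    add(c.icount, "ICOUNT");
    add(c.ibreak, "IBREAK");
    add(c.dbreak, "DBREAK");
    add(c.brk, "BREAK");
    add(c.brk_n, "BREAK.N");
    add(c.debug_int, "DEBUGINT");
    if (c.dbreak) out += " (DBNUM=" + std::to_string(c.dbnum) + ")";
    if (out.empty()) return "none";
    return out;
}
```

Under that assumed layout, describe(decode_debugcause(0x00000104)) reports DBREAK with DBNUM=1, which would mean a hardware data breakpoint (watchpoint) fired rather than a BREAK instruction. Whether that watchpoint has anything to do with the hang is not established here.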
I've tried to investigate other stack frames, exception-related registers, etc. but I can't get anywhere that points me to the true source of the error/hang/instruction/memory access that actually triggers the failure mode. I'm not even sure the SIGTRAP is the same as, or even related to, the thing causing the board hang when the debugger isn't connected. But it's all I've got to work with.
I'm at a loss for what to do next. Where can I look? How do I troubleshoot this better? I'm willing to try just about anything at this point. I even entertained power supply stability as a potential cause, maybe heap corruption or stack overflow, but none of the canaries I tried ever triggered. All tasks consistently report plenty of available stack space. Heap fragmentation checks never trip.
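On the stack checks specifically, here is a minimal host-side sketch of the paint-and-scan idea behind a stack high-water mark, assuming a downward-growing stack and the 0xA5 fill byte that FreeRTOS uses. The function names are hypothetical; on the target, the equivalent number comes from uxTaskGetStackHighWaterMark().

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Fill pattern written into the whole stack before the task starts.
constexpr uint8_t kFillByte = 0xA5;

// Paint the entire stack region with the fill pattern.
void paint_stack(uint8_t* stack, size_t len) {
    std::memset(stack, kFillByte, len);
}

// Count how many bytes at the low-address end still hold the fill pattern.
// On a downward-growing stack the low addresses are the last to be used,
// so this count is the minimum free space observed so far.
size_t untouched_bytes(const uint8_t* stack, size_t len) {
    size_t n = 0;
    while (n < len && stack[n] == kFillByte) {
        ++n;
    }
    return n;
}
```

One limitation of this kind of check: it only sees writes that land inside the painted region and change the pattern. A stray write that jumps past the end of the stack, or that hits unrelated memory, leaves the count untouched, so healthy high-water marks narrow things down but do not rule out memory corruption on their own.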
Is there anything I can do to debug further? Can I share anything else relevant that might help?