Mysterious freeze/stop in the field
Posted: Tue Feb 22, 2022 11:05 am
I have an ESP32 project built with ESP-IDF, running on an M5Stack Atom board.
I have shipped hundreds of units, which are now in production use.
Basically, my application connects to a WiFi AP, polls my server for a few minutes, then connects to a Bluetooth peripheral, and loops back.
Some users see the device stop at some point. The shortest time before a stop was a few hours. Some devices have never stopped working.
I have reproduced it once, after 3 days, with monitoring enabled.
The devices always stop while polling the server, in what is basically a for loop where each iteration retrieves some information from the server. I know that they did not stop while handling a particular server response, only on the normal empty response ("nothing to do").
In all cases, the device worked correctly again after being restarted.
Hypothesis 1: device freezes
I have checked that both the interrupt watchdog and the task watchdog are enabled and working. The watchdog successfully restarts the device when it is not fed.
Hypothesis 2: device overheats
I tested adding a lot of overhead to keep both CPUs running at 100%, and the device did not overheat or stop.
When I reproduced the issue, the measured temperature was OK.
Hypothesis 3: no more RAM
When no more RAM is available, the device reboots correctly.
When I reproduced the issue, there was plenty of free RAM (I logged it).
Tried solution 1: restart every 24 hours
I added code to restart the device automatically every 24 hours. The issue still occurs.
Tried solution 2: restart when RAM is low
I added code to restart the device automatically when RAM is low. The issue still occurs.
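For reference, the two restart conditions above boil down to something like the sketch below. This is not the production code: `should_restart`, `MAX_UPTIME_US` and `MIN_FREE_HEAP` are hypothetical names, and the 20 KB low-memory threshold is an assumed value. The decision is kept as a pure function so it can be checked off-target.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* 24 hours expressed in microseconds (the restart period from solution 1). */
#define MAX_UPTIME_US  (24LL * 60 * 60 * 1000000LL)

/* Assumed low-memory threshold for solution 2; the real value is not stated. */
#define MIN_FREE_HEAP  ((size_t)20 * 1024)

/*
 * Returns true when the device should be restarted, either because it has
 * been up for 24 hours or because free heap dropped below the threshold.
 * On the target, uptime_us would typically come from esp_timer_get_time()
 * and free_heap_bytes from esp_get_free_heap_size(); the caller would then
 * invoke esp_restart().
 */
static bool should_restart(int64_t uptime_us, size_t free_heap_bytes)
{
    return uptime_us >= MAX_UPTIME_US || free_heap_bytes < MIN_FREE_HEAP;
}
```

Note that a check like this only fires while the loop that calls it is still running, so it cannot help if that loop itself is blocked.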
Does anyone have an idea what could be happening? I'm not even sure what state the device is in.
Would adding a custom watchdog with FreeRTOS help (restarting if the main loop has not run for a few minutes)?
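To make the custom watchdog idea concrete, here is a minimal sketch of what I have in mind, assuming a separate supervisor task that periodically checks a heartbeat written by the main loop. The `soft_wdt_*` names are hypothetical, and only the timeout logic is shown so that it can be tested on a PC.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: records when the main loop last reported progress. */
typedef struct {
    int64_t last_kick_us;  /* timestamp of the most recent heartbeat */
    int64_t timeout_us;    /* maximum allowed silence before a restart */
} soft_wdt_t;

/* Start the watchdog with an initial heartbeat at now_us. */
static void soft_wdt_init(soft_wdt_t *w, int64_t now_us, int64_t timeout_us)
{
    w->last_kick_us = now_us;
    w->timeout_us = timeout_us;
}

/* Called by the main loop once per iteration to signal that it is alive. */
static void soft_wdt_kick(soft_wdt_t *w, int64_t now_us)
{
    w->last_kick_us = now_us;
}

/*
 * Called by the supervisor task. Returns true when the main loop has been
 * silent for longer than the timeout. On the target, now_us would come from
 * esp_timer_get_time() and the supervisor would call esp_restart().
 * Assumption: access to last_kick_us from two tasks is protected (for
 * example with a critical section), since a 64-bit write is not atomic on
 * the ESP32.
 */
static bool soft_wdt_expired(const soft_wdt_t *w, int64_t now_us)
{
    return (now_us - w->last_kick_us) > w->timeout_us;
}
```

The supervisor would run in its own FreeRTOS task, so it keeps running even if the polling task is blocked in a call that never returns.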
Any other suggestions?
Thanks!