First of all, many thanks for your time, @ESP_Angus! Your help is greatly appreciated.
On to your suggestions:
ESP_Angus wrote: ↑
Wed May 01, 2019 11:53 pm
Any chance you could install binwalk and run "binwalk -E flash_readback.bin"? This produces an analysis of the entropy level in the file. You should see close to 1.0 entropy level for all of the encrypted flash after the first 0x1000 bytes, up until the repeating pattern at the end.
Yes, the results are: all 1 KB blocks of the bootloader have similar entropy, around 0.97; only the last one is lower, at 0.94. So I don't think this is a case of an unwarranted write.
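For anyone who wants to cross-check these per-block numbers without binwalk, here is a minimal sketch of the same kind of measurement: Shannon entropy per fixed-size block, normalized to the 0.0 to 1.0 range. The function name and the 1024-byte default are my own choices for illustration, not binwalk's internals, so results may differ slightly from its output.

```python
import math
from collections import Counter


def block_entropies(data, block_size=1024):
    """Return the normalized Shannon entropy (0.0 to 1.0) of each block.

    Hypothetical helper for illustration; block_size is an assumption.
    """
    entropies = []
    for start in range(0, len(data), block_size):
        block = data[start:start + block_size]
        n = len(block)
        counts = Counter(block)  # occurrences of each byte value
        bits = -sum((c / n) * math.log2(c / n) for c in counts.values())
        entropies.append(bits / 8.0)  # 8 bits per byte is the maximum
    return entropies
```

To use it, read the dump with `open("flash_readback.bin", "rb").read()` and pass the bytes in. Note that a block of only 1024 bytes cannot sample all 256 byte values evenly, so even well-encrypted data is expected to land a little below 1.0 at this block size.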
With your decision guide it should be case #1.
But it's not absolutely clear:
- the flash (an external Winbond flash chip) is not under harsh conditions, voltage- and temperature-wise.
- the issue is definitely not caused by OTA updates; they are infrequent, and when we do them the devices don't fail right away. They fail during normal operation, long after the update was applied and the devices have reported working;
- the cause of the issue is quite probably hardware instability (as you suggest), caused by an external noise source. The devices are near an internal combustion engine. Some of them sometimes restart unexpectedly (some have no issues, it's borderline). The restarts are rare, however; e.g. once per hour of uptime. What I'm wondering is how such hardware instability can corrupt the bootloader in particular; i.e. why the bootloader, why not something else?
We use a custom PCB; the ESP chip is not in a module. Power comes through an LDO to 3.3 V, with 1 µF + 10 µF decoupling caps. The input to the LDO is an 18650 battery, and it's kept at 80-100% SoC.
There are two other tell-tale signs that I noticed today:
1) It seems that the bootloader corruption is "gradual". The devices aren't awake most of the time; they wake up from deep sleep at periodic intervals and report data back. The device whose bootloader got corrupted normally sends e.g. 24 reports per day, but then one day it sent just 3 reports, the next day just 1, and the next day 0. Almost as if some bit of the bootloader got weak and varied randomly between 0 and 1 on each boot, gradually settling on the new / flipped value as time progressed. This is pure speculation, but it may also explain the other device which computed a different checksum each time... (the 80 MHz by itself doesn't explain it to me. The device boots and works at 80 MHz; 40 MHz is just better than 80 MHz in the noisy environment, but I was seeing this checksum variation in the lab, where it's quiet...)
2) The bootloader sometimes (rarely) decides to load the factory app, not one of the OTAs. I'm wondering why it would choose to do that. This happens "in the field", in the noisy environment.
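One way to test the "weak bit" speculation from point 1 would be to read the bootloader region back several times from the same device and compare the dumps bit by bit. Below is a minimal sketch of such a comparison; the function name is hypothetical, and it assumes both dumps cover the same address range. Because both reads come from the same device, comparing the encrypted contents directly should be fine.

```python
def flipped_bits(a, b):
    """List every bit that differs between two equal-length dumps.

    Returns tuples of (byte_offset, bit_index, bit_in_a, bit_in_b).
    Hypothetical helper for illustration only.
    """
    if len(a) != len(b):
        raise ValueError("dumps must cover the same address range")
    differences = []
    for offset, (x, y) in enumerate(zip(a, b)):
        diff = x ^ y  # set bits mark positions where the reads disagree
        for bit in range(8):
            if diff & (1 << bit):
                differences.append((offset, bit, (x >> bit) & 1, (y >> bit) & 1))
    return differences
```

If repeated reads of an untouched device return different bits at the same offset, that would point at a marginal read (or a marginal cell) rather than a stray write; identical reads that differ from a known-good dump would point the other way.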
Many thanks for your other comments. As per your suggestions, it wouldn't make sense to disable WiFi's usage of NVS: it's definitely not writing to the flash more than once, and it's not causing the issue.