Device bricked ("csum err") after two months of service

PanicanWhyasker
Posts: 37
Joined: Sun Jan 06, 2019 12:42 pm

Device bricked ("csum err") after two months of service

Postby PanicanWhyasker » Tue Apr 30, 2019 5:18 am

Hi,

I need some help to debug an issue which I presume happens when the ROM bootloader tries to load the flash bootloader:

Code: Select all

ets Jun  8 2016 00:22:57                                                                                                                                                                             
                                                                                                                                                                                                     
rst:0x10 (RTCWDT_RTC_RESET),boot:0x13 (SPI_FAST_FLASH_BOOT)                                                                                                                                          
configsip: 0, SPIWP:0xee                                                                                                                                                                             
clk_drv:0x00,q_drv:0x00,d_drv:0x00,cs0_drv:0x00,hd_drv:0x00,wp_drv:0x00                                                                                                                              
mode:DIO, clock div:2                                                                                                                                                                                
load:0x3fff0018,len:4                                                                                                                                                                                
load:0x3fff001c,len:152                                                                                                                                                                              
load:0x40078000,len:9000                                                                                                                                                                             
load:0x40080400,len:5480                                                                                                                                                                             
csum err:0x9a!=0x89                                                                                                                                                                                  
ets_main.c 371          
This repeats non-stop.
This device got returned by customers after it was used in service for two months without a hitch. It's not a strapping pin issue.
My suspicion is that some of the flash memory got corrupted, but I cannot acknowledge which part. How to interpret the "load" statements?
Which checksum fails?

WiFive
Posts: 2999
Joined: Tue Dec 01, 2015 7:35 am

Re: Device bricked ("csum err") after two months of service

Postby WiFive » Tue Apr 30, 2019 6:17 am

Code: Select all

esptool.py --chip esp32 image_info bootloader.bin

PanicanWhyasker
Posts: 37
Joined: Sun Jan 06, 2019 12:42 pm

Re: Device bricked ("csum err") after two months of service

Postby PanicanWhyasker » Tue Apr 30, 2019 6:55 am

Oh, so the on-flash bootloader is corrupted?

WiFive
Posts: 2999
Joined: Tue Dec 01, 2015 7:35 am

Re: Device bricked ("csum err") after two months of service

Postby WiFive » Tue Apr 30, 2019 7:19 am

Yes probably. Can you dump the flash, or are you using flash encryption?

PanicanWhyasker
Posts: 37
Joined: Sun Jan 06, 2019 12:42 pm

Re: Device bricked ("csum err") after two months of service

Postby PanicanWhyasker » Tue Apr 30, 2019 7:47 am

Using encryption unfortunately.

I had a similar issue with one device which didn't have encryption.
It was behaving funnier, the "csum err" was there, but the computed checksum was different each time.
However it was misconfigured at 80 MHz DIO, so it may be bitslip, so the issues may be completely unrelated.
In the current one the "csum err" is always the same.

What could possibly cause the flash to go corrupt in the bootloader?

ESP_Angus
Posts: 2249
Joined: Sun May 08, 2016 4:11 am

Re: Device bricked ("csum err") after two months of service

Postby ESP_Angus » Tue Apr 30, 2019 11:56 pm

PanicanWhyasker wrote:
Tue Apr 30, 2019 7:47 am
What could possibly cause the flash to go corrupt in the bootloader?
Hi PanicanWhyasker,

It could be that something has erased or written to the flash chip in the bootloader address range. This shouldn't happen (and by default ESP-IDF has some software protections to prevent this happening), but it's not impossible.

It could be that the flash chip is faulty, although this is unlikely.

It could be a physical problem with the flash SPI connections or the power, although as you point out this is more likely to have a symptom of randomly changing calculated values (although it's not impossible that it's failing in a consistent way each time).

Even though the flash contents is encrypted, you could try dumping the ciphertext with "esptool.py read_flash 0 0x8000 readback.bin" or similar, and look to see if there's some unlikely looking data (all 0xFF or all 0x00 or not encrypted) somewhere in the readback dump, before the end of the bootloader.bin image size. This may give some clue.

PanicanWhyasker
Posts: 37
Joined: Sun Jan 06, 2019 12:42 pm

Re: Device bricked ("csum err") after two months of service

Postby PanicanWhyasker » Wed May 01, 2019 10:26 am

@ESP_Angus,

In the first 8 sectors, the flash has 0xFFs in the ranges
0x0000-0x0fff
0x5000-0x7fff

The bootloader occupies 0x1000-0x4fff, with the last 1-2 kb of it having equal 16-byte parts (i.e. each 32-byte aligned address consists of two equal copies of 16 bytes), which I presume is an artefact of zeros being encrypted with the ESP.

As you said, the software protections (SPI_FLASH_WRITING_DANGEROUS_REGIONS) are in place.

Is there any way to prevent the IDF of writing in the flash at all (or enable it only via an explicit call, e.g. for OTA)?
I don't use NVS/FAT/...

ESP_Angus
Posts: 2249
Joined: Sun May 08, 2016 4:11 am

Re: Device bricked ("csum err") after two months of service

Postby ESP_Angus » Wed May 01, 2019 11:53 pm

Hi PanicanWhyasker,

It sounds like your flash is mostly intact. And you're right that encrypting all 0xFF (freshly erased flash) will result in that pattern.

Any chance you could install binwalk and run "binwalk -E flash_readback.bin"? This produces an analysis of the entropy level in the file. You should see close to 1.0 entropy level for all of the encrypted flash after the first 0x1000 bytes, up until the repeating pattern at the end.

If you see a region in the bootloader internals with low entropy, it might mean that a flash write has landed somewhere in the middle of the bootloader (even if writing encrypted data over existing encrypted data, if the flash isn't erased first then it's going to end up with a predictable high number of 0 bits because NOR flash can only write 1->0).

My best guess for what's happened is one of the following:
  • The flash chip is faulty and has flipped a stored bit (which is then decrypted as a 128 bit block by the AES engine, resulting in 16 randomly changed bytes). I've not heard of this happening but it's always possible. Is there any chance the flash chip spent time outside of specified operating conditions (too hot, too cold, overvoltage, undervoltage)?
  • During an OTA update (do you do these?), a flash write has had the address portion corrupted (due to electrical noise or similar issue on the board) and has latched an address bit as 0 not 1, causing some data to be written to the bootloader region when it should be written to an app region. Normally I wouldn't suspect this, but you mentioned that your flash is not working correctly at max frequency so maybe it's borderline. What's the exact hardware config on the board? Is it a module or a chip, how is it powered, how much decoupling on the power for the module, any noise sources nearby on the PCB, etc? Do you get any other random crashes during normal operation that could be caused by hardware stability issues?
The "binwalk -E" entropy result may help narrow down between the two, as the first cause won't change the overall entropy of the encrypted flash (unless it flips a lot of bits, which should be easy enough to see in the binary), but the second cause will cause a predictable percentage of zero bits in the twice-written regions.
PanicanWhyasker wrote:
Wed May 01, 2019 10:26 am
Is there any way to prevent the IDF of writing in the flash at all (or enable it only via an explicit call, e.g. for OTA)?
I don't use NVS/FAT/...
IDF won't write to flash unless a function is called to write to flash. Do you use OTA?

The Wi-Fi library will use NVS if it's initialized. This can be disabled via config. See:
https://docs.espressif.com/projects/esp ... -nvs-flash
https://docs.espressif.com/projects/esp ... vs-enabled

However, the Wi-Fi library doesn't write to NVS with any regularity.

PanicanWhyasker
Posts: 37
Joined: Sun Jan 06, 2019 12:42 pm

Re: Device bricked ("csum err") after two months of service

Postby PanicanWhyasker » Thu May 02, 2019 8:56 pm

First of all, many thanks for your time, @ESP_Angus! Your help is greatly appreciated.

On to your suggestions:
ESP_Angus wrote:
Wed May 01, 2019 11:53 pm
Any chance you could install binwalk and run "binwalk -E flash_readback.bin"? This produces an analysis of the entropy level in the file. You should see close to 1.0 entropy level for all of the encrypted flash after the first 0x1000 bytes, up until the repeating pattern at the end.
Yes, the results are: all 1kb blocks of the bootloader have similar entropy, around 0.97. Just the last one is 0.94. I think it's not the case of an unwarranted write.

With your decision guide it should be case #1.
But it's not absolutely clear:
- the flash (an external Winbond flash chip) is not under harsh conditions, voltage- and temperature-wise.
- the issue is definitely not caused by OTA updates; they are infrequent, and when we do them - the devices don't fail right away. They fail during normal work, long after the update was applied and have reported working;
- the cause of the issue is quite probably hardware instability (as you suggest), which is caused by external noise source. The devices are near an internal combustion engine. Some of them sometimes restart unexpectedly (some have no issues, it's borderline). The restart rate is rare, however; e.g. once per hour of uptime. The thing I'm wondering is how can such a hardware instability corrupt the bootloader in particular; i.e. why the bootloader, why not something else?

We use a custom PCB, the ESP chip is not in a module. The power is through a LDO to 3.3V, it has 1µF+10µF decoupling caps. The input to the LDO is a 18650 battery, and it's kept at 80-100% SoC.

There are two other tell-tale signs that I noticed today:
1) It seems that the bootloader corruption is "gradual". I mean the devices aren't awake most of the time, they wake up from deep sleep at periodic intervals and report data back. So the device whose bootloader got corrupted normally outputs e.g. 24 reports per day, but then one day it had just 3 reports, the next day just 1, and the next day 0. Almost as if some bit of the bootloader got weak and varied randomly between 0 and 1 on each boot, gradually assuming the new / flipped value as time progressed. This is a pure speculation, but may also explain the other device which computed different checksum each time... (the 80MHz by itself doesn't explain it to me. The device boots and works on 80 Mhz; 40 MHz is just better than 80 MHz in the noisy environment, but I was seeing this checksum variation in the lab, where it's quiet...)

2) The bootloader sometimes (rarely) decides to load the factory app, not one of the OTAs. I'm wondering why would it select to do that? This happens "in the field", in the noisy environment.

Many thanks for your other comments. As per your suggestions, it wouldn't make sense to disable WiFi's usage of NVS, it's definitely not writing to the flash more than once, and it's not causing the issue.

ESP_Angus
Posts: 2249
Joined: Sun May 08, 2016 4:11 am

Re: Device bricked ("csum err") after two months of service

Postby ESP_Angus » Fri May 03, 2019 1:33 am

Thanks for these additional details.

Electrical noise or signal integrity issues could explain some of the other crashing problems but I agree that they don't seem to adequately explain the bootloader failure, in the circumstances you describe.

It sounds like maybe the flash chip is failing? You could try dumping the whole flash a few times with esptool and see if the exact same binary comes out each time?

Writing to flash is a physical process, so as far as I understand it is possible for a defective flash bit to have value "0.5" (ie was written 1->0 but the writing didn't complete fully at the physical level). So the inadequately written bit could be read as 1 or 0 depending on random chance. I agree this could explain the "gradual" failure to boot.

(Although normally we find intermittent mis-reads of flash like this are signal integrity problems, FWIW.)

The flash controller in the flash chip should normally take care of writing correctly, though. Provided the chip is legit and it was flashed under correct operating specifications (ie no brownout during the factory flash, etc).

Who is online

Users browsing this forum: Google [Bot] and 15 guests