Device bricked ("csum err") after two months of service

WiFive
Posts: 3529
Joined: Tue Dec 01, 2015 7:35 am

Re: Device bricked ("csum err") after two months of service

Postby WiFive » Fri May 10, 2019 8:51 am

Hmm, yes, it has to make it through initial programming, self-encryption, and months of operation, so that is unusual. I guess there could be some kind of chip defect where erasing a certain block address triggers an unintended, out-of-spec erase in your problem block. Can you correlate it with flash erase events such as OTA, NVS, or filesystem use?

PanicanWhyasker
Posts: 45
Joined: Sun Jan 06, 2019 12:42 pm

Re: Device bricked ("csum err") after two months of service

Postby PanicanWhyasker » Fri May 10, 2019 1:04 pm

@WiFive, no, I cannot correlate it with any write operations. I don't use the NVS or the filesystem. OTA is used regularly, but the failure does not happen during the OTA. For example:

- Device X works happily all April
- An OTA update is applied on May 3rd
- The device continues to work happily, reporting the new version
- Then it suddenly gets bricked on May 8th

WiFive
Posts: 3529
Joined: Tue Dec 01, 2015 7:35 am

Re: Device bricked ("csum err") after two months of service

Postby WiFive » Sat May 11, 2019 12:10 am

I assume you are using 3.3V flash since your boot mode is 0x13. There was an issue with undervoltage on 1.8V flash in the past.
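
For anyone following along, here is a rough sketch of how the 0x13 boot-mode value can be read. It assumes the bit layout given in the esptool boot-mode documentation for the ESP32 (0x01 = GPIO5, 0x02 = MTDO, 0x04 = GPIO4, 0x08 = GPIO2, 0x10 = GPIO0, 0x20 = MTDI), and it ignores any efuse override of the flash voltage. The function names are made up for illustration.

```python
# Bit positions of the ESP32 strapping pins in the "boot:0x.." value.
# Assumption: layout as listed in the esptool boot-mode documentation.
STRAP_BITS = {
    0x01: "GPIO5",
    0x02: "MTDO (GPIO15)",
    0x04: "GPIO4",
    0x08: "GPIO2",
    0x10: "GPIO0",
    0x20: "MTDI (GPIO12)",
}


def decode_boot_mode(value):
    """Return {pin_name: True if sampled high} for a boot-mode value."""
    return {name: bool(value & mask) for mask, name in STRAP_BITS.items()}


def flash_voltage_from_strap(value):
    """MTDI sampled high selects 1.8V, low selects 3.3V (no efuse override)."""
    return "1.8V" if value & 0x20 else "3.3V"


if __name__ == "__main__":
    for name, high in decode_boot_mode(0x13).items():
        print(f"{name}: {'high' if high else 'low'}")
    print("flash voltage from strap:", flash_voltage_from_strap(0x13))
```

With 0x13 the MTDI bit is clear, which is consistent with 3.3V flash. A boot log showing 0x33 instead would mean MTDI was sampled high at that reset.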

Are you trying to reproduce the failure with a torture test (reboots and OTA updates, high ambient temperature)?

ESP_Angus
Posts: 2344
Joined: Sun May 08, 2016 4:11 am

Re: Device bricked ("csum err") after two months of service

Postby ESP_Angus » Mon May 13, 2019 7:35 am

PanicanWhyasker wrote:
Fri May 10, 2019 6:32 am
@ESP_Angus, interesting. So in this case is it possible that (due to signal integrity issues), when the ESP reboots in the field, it tries to reencrypt itself (or for the device which isn't encrypted - to try to initiate self-encryption)?
I don't think so, because the FLASH_CRYPT_CNT efuse controls both whether the bootloader detects that the device needs encrypting and whether the cache loads the bootloader from encrypted flash. It also doesn't explain the "different bytes read each time" symptom or the low entropy in your broken flash chip.

WiFive makes a good point about the flash chip possibly being strapped to 1.8V by accident sometimes. Is the MTDI pin (GPIO12) always disconnected or driven low on reset?

PanicanWhyasker
Posts: 45
Joined: Sun Jan 06, 2019 12:42 pm

Re: Device bricked ("csum err") after two months of service

Postby PanicanWhyasker » Wed May 15, 2019 9:25 pm

Yes, pin 12 is pulled down with 360k to GND. Nothing else drives this pin (it's just the ESP pin, the 360k, and a MOSFET gate). Basically this pin is always logic 0, the only time it's driven high is during lab testing.

gecko242
Posts: 18
Joined: Tue Oct 02, 2018 7:11 am

Re: Device bricked ("csum err") after two months of service

Postby gecko242 » Tue May 21, 2019 11:13 am

Could it be that 360k isn't aggressive enough of a pull-down?
One thing worth trying might be disabling the bootstrapping pins and forcing the flash voltage through the efuse, and then seeing if you still get a failure. I don't know how long you have to run these before they typically fail.
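
To make the idea concrete, here is a simplified model of how the ESP32 is understood to pick the flash supply (VDD_SDIO) voltage. It is a sketch, not Espressif's implementation: it assumes the XPD_SDIO_FORCE, XPD_SDIO_REG and XPD_SDIO_TIEH efuses behave as described in the espefuse documentation, and the function name is hypothetical. On real hardware the efuses would be burned with espefuse.py's set_flash_voltage command, which is irreversible.

```python
def vdd_sdio_voltage(mtdi_high, xpd_sdio_force=False,
                     xpd_sdio_reg=False, xpd_sdio_tieh=False):
    """Simplified model of ESP32 flash-voltage selection.

    Assumption: with XPD_SDIO_FORCE burned, the MTDI strap is ignored and
    the efuses decide; otherwise MTDI high means 1.8V and low means 3.3V.
    """
    if xpd_sdio_force:
        if not xpd_sdio_reg:
            # Internal regulator powered down; flash is supplied externally.
            return "regulator off"
        return "3.3V" if xpd_sdio_tieh else "1.8V"
    return "1.8V" if mtdi_high else "3.3V"


if __name__ == "__main__":
    # Strap only: a noise glitch on MTDI at reset changes the flash voltage.
    print(vdd_sdio_voltage(mtdi_high=False))
    print(vdd_sdio_voltage(mtdi_high=True))
    # Efuses forced to 3.3V: the MTDI level no longer matters.
    print(vdd_sdio_voltage(mtdi_high=True, xpd_sdio_force=True,
                           xpd_sdio_reg=True, xpd_sdio_tieh=True))
```

The point of forcing the voltage is the last case: once the efuses decide, a spurious high level on GPIO12 during reset can no longer put 3.3V flash on a 1.8V rail.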

You mentioned that these run near an internal combustion engine. Can you share any more details about the power supply or the general device configuration?

Sam

PanicanWhyasker
Posts: 45
Joined: Sun Jan 06, 2019 12:42 pm

Re: Device bricked ("csum err") after two months of service

Postby PanicanWhyasker » Mon May 27, 2019 3:47 pm

Hello Sam,

sorry for the late reply. I wasn't near a PC the last week.
gecko242 wrote:
Could it be that 360k isn't aggressive enough of a pull-down?
Hmm, the datasheet states that pin 12 has an internal pull-down during bootstrapping, so it's two pull-downs in parallel. I think this should be enough, as there is really nothing pulling that pin up in any case.
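
As a rough check of the "two pull-downs in parallel" argument, the sketch below computes the combined resistance and the current a noise source would have to inject to lift the pin to a logic high. The 45k internal pull-down and the 0.75 x VDD input-high threshold are assumed typical datasheet figures, not measurements from this board.

```python
def parallel(*resistors_ohm):
    """Equivalent resistance of resistors connected in parallel."""
    return 1.0 / sum(1.0 / r for r in resistors_ohm)


def current_to_reach(voltage, resistance_ohm):
    """Current (A) that must flow into the node to raise it to `voltage`."""
    return voltage / resistance_ohm


EXTERNAL_PULLDOWN_OHM = 360e3         # value given in this thread
ASSUMED_INTERNAL_PULLDOWN_OHM = 45e3  # assumption: typical datasheet value
ASSUMED_LOGIC_HIGH_V = 0.75 * 3.3     # assumption: V_IH = 0.75 * VDD

if __name__ == "__main__":
    combined = parallel(EXTERNAL_PULLDOWN_OHM, ASSUMED_INTERNAL_PULLDOWN_OHM)
    for label, r in (("external only", EXTERNAL_PULLDOWN_OHM),
                     ("external + internal", combined)):
        microamps = current_to_reach(ASSUMED_LOGIC_HIGH_V, r) * 1e6
        print(f"{label}: {r / 1e3:.0f}k, {microamps:.1f} uA to read high")
```

Under these assumptions the internal pull-down dominates, so the 360k resistor adds little on top of it; a much smaller external resistor would be needed to change the picture noticeably.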
gecko242 wrote:
You mentioned that these run near an internal combustion engine. Can you share any more details about the power supply or the general device configuration?
The device has an internal battery, it's not powered by the ICE. The signal integrity issue is on the first batch of devices, where the device and the engine's ignition system share grounds. The signal from the ignition system, after proper conditioning, is used to compute engine RPM.
On the later batches the signal from the ignition is coupled differently, and we have no problems there. Yet if there's anything software-wise that I can do to reduce the likelihood of the ESP overwriting its bootloader, I'd gladly apply it to both batches; the devices are expensive and the bricking issue is a major roadblock for us.

gecko242
Posts: 18
Joined: Tue Oct 02, 2018 7:11 am

Re: Device bricked ("csum err") after two months of service

Postby gecko242 » Wed May 29, 2019 9:44 am

No worries.

Yeah, it does seem far-fetched, but weirder things have happened. If power consumption isn't an issue, you could easily lower it.


Okay, that makes sense. To clarify, the second batch of devices never overwrite their bootloader?
Or they don't have any noise issues?

Have you put the device through any sort of EMC/FCC testing? Could it be a weird radio immunity quirk?

Sam

PanicanWhyasker
Posts: 45
Joined: Sun Jan 06, 2019 12:42 pm

Re: Device bricked ("csum err") after two months of service

Postby PanicanWhyasker » Thu May 30, 2019 8:44 pm

gecko242 wrote:
Wed May 29, 2019 9:44 am
No worries.

Yeah, it does seem far-fetched, but weirder things have happened. If power consumption isn't an issue, you could easily lower it.
Power consumption isn't an issue, I'll lower it.

gecko242 wrote:
Wed May 29, 2019 9:44 am
Okay, that makes sense. To clarify, the second batch of devices never overwrite their bootloader?
Or they don't have any noise issues?
Both - no bootloader issues as of yet, and the noise issue is gone.

gecko242 wrote:
Wed May 29, 2019 9:44 am
Have you put the device through any sort of EMC/FCC testing? Could it be a weird radio immunity quirk?
Not yet. It's possible, because the devices fail while they are in use and communicating over WiFi.

gecko242
Posts: 18
Joined: Tue Oct 02, 2018 7:11 am

Re: Device bricked ("csum err") after two months of service

Postby gecko242 » Fri May 31, 2019 3:40 pm

Okay. Sounds like it may be fixed, although it would be nice to know the root mechanism by which devices in noisy environments get bricked.

The immunity thing may be a red herring, but worth looking out for.

Can you share what you did differently between revisions with regard to GND coupling?

Sam
