heap corruption

timredfern
Posts: 31
Joined: Sun Feb 25, 2018 10:59 am

heap corruption

Postby timredfern » Thu Apr 26, 2018 1:57 pm

I enabled heap corruption detection and I'm seeing strange things.

Code: Select all

CORRUPT HEAP: Bad tail at 0x3ffe270a. Expected 0xbaad5678 got 0xbaac5678
assertion "head != NULL" failed: file "/Users/tim/ESP/esp-idf/components/heap/multi_heap_poisoning.c", line 201, function: multi_heap_free
abort() was called at PC 0x400dca43 on core 0

Backtrace: 0x4008e150:0x3ffc3a60 0x4008e2f3:0x3ffc3a80 0x400dca43:0x3ffc3aa0 0x4008de61:0x3ffc3ad0 0x4008292e:0x3ffc3af0 0x40082949:0x3ffc3b10 0x400829f2:0x3ffc3b30 0x40082df4:0x3ffc3b50 0x40082e57:0x3ffc3ba0 0x400fa29f:0x3ffc3bc0 0x400fa31d:0x3ffc3be0 0x400fea50:0x3ffc3c00 0x400feb63:0x3ffc3c30 0x400fefa3:0x3ffc3c60 0x400ff1fd:0x3ffc3ca0 0x400fb071:0x3ffc3cf0 0x400d3ebc:0x3ffc3d10

CPU halted.
Using EspStackTraceDecoder.jar I see:

Code: Select all

0x400dca43: __assert_func at /Users/ivan/e/newlib_xtensa-2.2.0-bin/newlib_xtensa-2.2.0/xtensa-esp32-elf/newlib/libc/stdlib/../../../.././newlib/libc/stdlib/assert.c:63 (discriminator 8)
0x4008e150: invoke_abort at /Users/tim/ESP/esp-idf/components/esp32/panic.c:648
0x4008e2f3: abort at /Users/tim/ESP/esp-idf/components/esp32/panic.c:648
0x400dca43: __assert_func at /Users/ivan/e/newlib_xtensa-2.2.0-bin/newlib_xtensa-2.2.0/xtensa-esp32-elf/newlib/libc/stdlib/../../../.././newlib/libc/stdlib/assert.c:63 (discriminator 8)
0x4008de61: multi_heap_free at /Users/tim/ESP/esp-idf/components/heap/multi_heap_poisoning.c:299
0x4008292e: heap_caps_free at /Users/tim/ESP/esp-idf/components/heap/heap_caps.c:123
0x40082949: heap_caps_realloc at /Users/tim/ESP/esp-idf/components/heap/heap_caps.c:123
0x400829f2: heap_caps_realloc_default at /Users/tim/ESP/esp-idf/components/heap/heap_caps.c:123
0x40082df4: trace_realloc at /Users/tim/ESP/esp-idf/components/heap/heap_trace.c:367
0x40082e57: __wrap_realloc at /Users/tim/ESP/esp-idf/components/heap/heap_trace.c:402
0x400fa29f: mbuf_resize at /Users/tim/ESP/workspace/firmware/components/Mongoose/mongoose.c:11327
0x400fa31d: mbuf_trim at /Users/tim/ESP/workspace/firmware/components/Mongoose/mongoose.c:11327
0x400fea50: mg_if_sent_cb at /Users/tim/ESP/workspace/firmware/components/Mongoose/mongoose.c:11327
0x400feb63: mg_write_to_socket at /Users/tim/ESP/workspace/firmware/components/Mongoose/mongoose.c:11327
0x400fefa3: mg_mgr_handle_conn at /Users/tim/ESP/workspace/firmware/components/Mongoose/mongoose.c:11327
0x400ff1fd: mg_socket_if_poll at /Users/tim/ESP/workspace/firmware/components/Mongoose/mongoose.c:11327
0x400fb071: mg_mgr_poll at /Users/tim/ESP/workspace/firmware/components/Mongoose/mongoose.c:11327
0x400d3ebc: net_task at /Users/tim/ESP/workspace/firmware/main/net_task.c:390 (discriminator 1)
This was triggered by sending a response to a REST request to the mongoose webserver.

The code that sends the response is this:

Code: Select all

    char *msg=json_dumps(jmsg,JSON_COMPACT);
    if (msg){
        int msglen=strlen(msg);
        mg_printf(nc, "HTTP/1.1 200 OK\r\nContent-Length: %lu\r\n\r\n%.*s",
                (unsigned long) msglen, (int) msglen, msg);

        vPortFree(msg);
    }
Now, my first thought is that maybe the buffer isn't null terminated, and the heap corruption detection has caused the end of the buffer not to be zeroed, so I put in a logging statement:

Code: Select all

    char *msg=json_dumps(jmsg,JSON_COMPACT);
    if (msg){
        int msglen=strlen(msg);
        ESP_LOGI(tag,"sending %s, length %i",msg,msglen);
        mg_printf(nc, "HTTP/1.1 200 OK\r\nContent-Length: %lu\r\n\r\n%.*s",
                (unsigned long) msglen, (int) msglen, msg);

        vPortFree(msg);
    }
Now, the code runs without triggering the heap corruption, and the logging looks fine:

Code: Select all

I (27886) Rest: sending {"acknowledged":true}, length 21
Hmm, didn't expect that. The length looks fine. Maybe adding the line of code triggered a fortunate series of events and the string is now accidentally null terminated, and the real error is that json_dumps doesn't return a null terminated string? It has to though really, as it doesn't return a length. Yep, see https://github.com/akheron/jansson/blob ... src/dump.c and https://github.com/akheron/jansson/blob ... c/memory.c

Hmm. OK, let's remove the logging again and run again to prove the problem is repeatable. No, the function now works without heap corruption. Tried it a bunch of times.

I'm now struggling to understand what could be happening. Look at the corrupt heap error: "Expected 0xbaad5678 got 0xbaac5678" Whatever this is, it's off by exactly 64K?

Adding and removing a statement "fixed" a heap error with a "bad tail" off by 64k.

I know the ESP32 uses a complex "multi-heap" with different areas. Are there any caveats to this?
Last edited by timredfern on Thu Apr 26, 2018 4:25 pm, edited 1 time in total.

timredfern
Posts: 31
Joined: Sun Feb 25, 2018 10:59 am

Re: heap corruption

Postby timredfern » Thu Apr 26, 2018 2:38 pm

I'm using esp-idf 6c44fc70bc374b29b079115a5b3551edc9f48be9 from April 13, thus I have this fix, but I see that bugs are still being found in the multi-heap as recently as 2 months ago?

https://github.com/espressif/esp-idf/co ... d703c13e59
Last edited by timredfern on Thu Apr 26, 2018 2:50 pm, edited 1 time in total.

User avatar
fly135
Posts: 606
Joined: Wed Jan 03, 2018 8:33 pm
Location: Orlando, FL

Re: heap corruption

Postby fly135 » Thu Apr 26, 2018 2:43 pm

Looks like you are writing the message to a pointer "nc" of unknown origin. That's my 1st guess as to the source of the problem.

Code: Select all

mg_printf(nc, "HTTP/1.1 200 OK\r\nContent-Length: %lu\r\n\r\n%.*s",
                (unsigned long) msglen, (int) msglen, msg);
John A

timredfern
Posts: 31
Joined: Sun Feb 25, 2018 10:59 am

Re: heap corruption

Postby timredfern » Thu Apr 26, 2018 2:51 pm

nc is defined outside of the code sample I pasted.

here's the full function

Code: Select all

void http_json_send(struct mg_connection *nc, const char *fmt, ...){
    va_list arguments;
    va_start( arguments, fmt );
    json_t *jmsg = json_vpack_ex(NULL, 0, fmt, arguments);
    char *msg=json_dumps(jmsg,JSON_COMPACT);
    if (msg){
        int msglen=strlen(msg);
        //heap corruption without this logging, disappeared again
        //ESP_LOGI(tag,"sending %s, length %i",msg,msglen);
        mg_printf(nc, "HTTP/1.1 200 OK\r\nContent-Length: %lu\r\n\r\n%.*s",
                (unsigned long) msglen, (int) msglen, msg);

        vPortFree(msg);
    }
    json_decref(jmsg);
    va_end(arguments);
}

ESP_Angus
Posts: 2344
Joined: Sun May 08, 2016 4:11 am

Re: heap corruption

Postby ESP_Angus » Fri Apr 27, 2018 12:24 am

Hi Tim,

Heap corruption bugs can be very nefarious to track down.

Given the original crash was in a different task, not the task generating the JSON response, my guess is that the extra logging is changing the timing of that task relative to netTask, so heap operations happen in a different order or timing, and this is somehow avoiding the crash.

Unless I'm misreading the way your firmware is structured and these are both happening at different times in the one task?

If you change the "magic" log line to a vTaskDelay(1), and/or add some calls to heap_caps_check_integrity_all() around the code which appears to crash, does this yield any different results? (For performance reasons, the heap checker only checks integrity for the areas it touches during a malloc/free. heap_caps_check_integrity_all() will force it to check everything.)

More details can be found here: https://esp-idf.readthedocs.io/en/lates ... debug.html
Expected 0xbaad5678 got 0xbaac5678
The heap poisoning code writes the "magic" pattern 0xbaad5678 after the final bytes of each allocation. My guess is 0xbaac5678 is either an output of bounds decrement by 0x10000, ie:

Code: Select all

*word -= 0x10000;
Or it's treating the memory as a 16-bit short word and decrementing an out of bounds area by 1:

Code: Select all

*short -= 1;
Is there anything else you can tell us about your setup? What else is happening in the system at this time? What does netTask do?

ESP_Angus
Posts: 2344
Joined: Sun May 08, 2016 4:11 am

Re: heap corruption

Postby ESP_Angus » Fri Apr 27, 2018 12:41 am

One other thing, on a hunch: I haven't used Jansson before but can you please check if either JSON_HAVE_ATOMIC_BUILTINS or JSON_HAVE_SYNC_BUILTINS are enabled in jansson_config.h, and disable them if they are?

Who is online

Users browsing this forum: Google [Bot] and 245 guests