Seeking Advice on ESP32 Software Architecture: ITC, Error Handling & Defensive Coding for a Self-Healing System

Posted: Fri May 30, 2025 7:25 pm
by yuemko
Hello everyone,

I’m working on an ESP32 project and would appreciate some insights on software architecture, particularly regarding inter-task communication (ITC) and robust error handling. I thought I might get broader architectural perspectives here, as my questions are more about general design patterns for a multi-tasking system.

I’ve started development, but the code isn’t quite in a shareable state yet, and I believe my questions are more conceptual at this stage, focusing on the overall software architecture rather than a specific bug.

Here’s an overview of the system I’m aiming for:

Goal:
To build a critical system that is as self-healing as possible.

Components & Task Structure:
The system involves several components, each running as its own FreeRTOS task:
  • Mesh Wi-Fi task
  • SoftAP with a basic HTTPS server task
  • MQTT client task
  • Buttons (input handling) task
  • Relays (output control) task
  • LCD display task
  • A dedicated error_handler task/module

Component API:
Each component task will expose public functions for its lifecycle management:

init(), start(), stop(), update(), deinitialize()

Each component will also have specific public operational functions like:

mqtt_publish_message(), relay_set_state(), lcd_display_text()
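
One way the lifecycle API above could be made uniform is a small table of function pointers per component, so a central manager can drive every component the same way. This is only a sketch: `component_t`, `comp_err_t`, `component_restart()`, and the dummy relay functions are hypothetical names, not part of ESP-IDF or of the actual project.

```c
#include <stddef.h>

/* Hypothetical error type; in ESP-IDF code this would typically be esp_err_t. */
typedef enum { COMP_OK = 0, COMP_FAIL = -1 } comp_err_t;

/* Hypothetical uniform lifecycle interface, one instance per component. */
typedef struct {
    const char *name;
    comp_err_t (*init)(void);
    comp_err_t (*start)(void);
    comp_err_t (*stop)(void);
    comp_err_t (*update)(void);
    comp_err_t (*deinitialize)(void);
} component_t;

/* Dummy relay component, present only to make the sketch complete. */
static int relay_running = 0;
static comp_err_t relay_init(void)         { return COMP_OK; }
static comp_err_t relay_start(void)        { relay_running = 1; return COMP_OK; }
static comp_err_t relay_stop(void)         { relay_running = 0; return COMP_OK; }
static comp_err_t relay_update(void)       { return relay_running ? COMP_OK : COMP_FAIL; }
static comp_err_t relay_deinitialize(void) { return COMP_OK; }

static const component_t relay_component = {
    "relay", relay_init, relay_start, relay_stop, relay_update, relay_deinitialize
};

/* Restart one component: stop it, then start it again.
 * Returns the first error encountered. */
static comp_err_t component_restart(const component_t *c)
{
    if (c == NULL || c->stop == NULL || c->start == NULL) {
        return COMP_FAIL;
    }
    comp_err_t err = c->stop();
    if (err != COMP_OK) {
        return err;
    }
    return c->start();
}
```

With a table like this, a recovery action such as "restart the MQTT component" becomes one call that does not need to know which component it is handling.
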

Error Propagation:
Each component reports errors to the calling function/module.

Error Handling and Recovery Mechanism:
Using an ERROR_CHECK macro to:
  • Capture return error codes
  • Feed them into a centralized state machine (in main_callback.c and main_polling.c)
  • Trigger recovery actions (e.g., task restart or full system reset)
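
A minimal sketch of how such a macro might look, assuming a hypothetical `app_err_t` return type and a hypothetical `error_handler_report()` entry point into the central state machine. Neither is from ESP-IDF, whose own `ESP_ERROR_CHECK` by default aborts on failure instead of reporting. The recovery policy (restart first, reset after three failures) is invented purely for illustration.

```c
#include <stdio.h>

typedef int app_err_t;                 /* hypothetical; stands in for esp_err_t */
#define APP_OK 0

typedef enum { RECOVER_NONE, RECOVER_RESTART_TASK, RECOVER_SYSTEM_RESET } recovery_t;

#define MAX_FAILURES_BEFORE_RESET 3    /* assumed policy, not from the post */

/* Single-threaded sketch: a real version would need to be task-safe,
 * for example by posting the error to the error_handler task. */
static int failure_count = 0;
static recovery_t last_recovery = RECOVER_NONE;

/* Hypothetical entry point of the central state machine. */
static void error_handler_report(app_err_t err, const char *file, int line)
{
    failure_count++;
    last_recovery = (failure_count >= MAX_FAILURES_BEFORE_RESET)
                        ? RECOVER_SYSTEM_RESET
                        : RECOVER_RESTART_TASK;
    fprintf(stderr, "error %d at %s:%d -> recovery %d\n",
            err, file, line, (int)last_recovery);
}

/* Evaluate expr once; on failure, report it and return the code to the caller. */
#define ERROR_CHECK(expr)                                     \
    do {                                                      \
        app_err_t err_ = (expr);                              \
        if (err_ != APP_OK) {                                 \
            error_handler_report(err_, __FILE__, __LINE__);   \
            return err_;                                      \
        }                                                     \
    } while (0)

/* Dummy operation that fails on request. */
static app_err_t relay_set_state_stub(int should_fail)
{
    return should_fail ? -1 : APP_OK;
}

/* Example caller: the macro propagates the first failure upward. */
static app_err_t apply_outputs(int should_fail)
{
    ERROR_CHECK(relay_set_state_stub(0));
    ERROR_CHECK(relay_set_state_stub(should_fail));
    return APP_OK;
}
```

Because the macro contains a `return`, it can only be used in functions that return the error type, and any cleanup needed before returning has to be handled separately.
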

Centralized Management:
Lifecycle functions are invoked from the central management modules main_callback.c / main_polling.c.

Resource Protection:
Component tasks use mutexes with portMAX_DELAY when accessing shared global data.

Mutex Discipline:
Strict acquire/release for all mutex-requiring functions or critical sections.

Inter-Component Data Exchange (Current thought):
Using getter/setter functions with built-in mutex protection.

Now, for my main questions:

I’m considering the following for ITC (Inter-Task Communication):
  1. Polling with Mutexes & Getters/Setters:
    Tasks poll shared state exposed via thread-safe getter/setter functions.
  2. esp_event:
    ESP-IDF’s event loop for an event-driven approach between tasks.
  3. FreeRTOS Queues:
    Using queues for structured, direct task-to-task message passing.
My Core Dilemma:
Which of these approaches (or combination) best ensures:
  • Concurrency safety
  • Liveness (no deadlocks)
  • Timeliness (meeting deadlines)
I understand FreeRTOS Queues are often the go-to, but I'm trying to balance robustness vs. over-engineering.

Seeking Your Experience:
  • What ITC mechanisms do you typically use in ESP32/FreeRTOS multi-tasking projects, and why?
  • How do you approach error handling and achieving self-healing in systems with multiple independent tasks?
  • Do you have a flowchart or conceptual diagram you use in similar architectures?
  • How rigorously do you check every return code in ESP-IDF/FreeRTOS? Any shortcuts you apply?
  • Do you trust returned data from functions like xQueueReceive if the return was successful, or validate it further?
  • Should pointers from successful function calls still be checked for NULL as a best practice?
  • How do you balance defensive coding vs. clean/readable code, especially regarding stable libraries?
I know “it depends” is often the answer, but I’d love to hear your experience-based rules of thumb and examples of best practices.

Thanks in advance for your time and insights!

Re: Seeking Advice on ESP32 Software Architecture: ITC, Error Handling & Defensive Coding for a Self-Healing System

Posted: Sat May 31, 2025 10:26 am
by MicroController
Just some quick thoughts here.
> Resource Protection:
> Component tasks use mutexes with portMAX_DELAY when accessing shared global data.
> ...
> Mutex Discipline:
> Strict acquire/release for all mutex-requiring functions or critical sections.

For single-value state, I prefer to use C/C++ atomic types/functions. Rationale: potentially more efficient if and when atomic operations are available in hardware, and no chance of 'forgetting' to release a lock/mutex.
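
As a rough illustration of this point, a single shared value can be exposed through C11 `<stdatomic.h>` getters and setters with no lock to acquire or release. The names `relay_state`, `relay_state_set()`, `relay_state_get()`, and `relay_state_set_if()` are hypothetical, and whether the operations are truly lock-free depends on the target and toolchain.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical shared single-value state, written by one task and read by others. */
static atomic_int relay_state = 0;

static void relay_state_set(int new_state)
{
    atomic_store(&relay_state, new_state);
}

static int relay_state_get(void)
{
    return atomic_load(&relay_state);
}

/* Write new_state only if the current value still equals `expected`.
 * Returns true when the write happened. */
static bool relay_state_set_if(int expected, int new_state)
{
    return atomic_compare_exchange_strong(&relay_state, &expected, new_state);
}
```

This only covers values that fit in one atomic object; several fields that must change together still need a mutex or a message.
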

> I’m considering the following for ITC (Inter-Task Communication):
> ...
> My Core Dilemma:
> Which of these approaches (or combination) best ensures:
>   • Concurrency safety
>   • Liveness (no deadlocks)
>   • Timeliness (meeting deadlines)

As to 'safety', using the provided primitives (queues, semaphores, ...) is probably a little less likely to introduce bugs in concurrency handling.
The other criteria seem to be equally addressed by all the approaches.

> I understand FreeRTOS Queues are often the go-to, but I'm trying to balance robustness vs. over-engineering.

I recommend using the appropriate tool for each use case. E.g., a task running calculations needs to read one shared value -> call a getter; conversely, if a task needs to wait for something to happen -> semaphore, queue, notification, ...

> How rigorously do you check every return code in ESP-IDF/FreeRTOS? Any shortcuts you apply?
> ...
> How do you balance defensive coding vs. clean/readable code, especially regarding stable libraries?

Exceptions are one of the reasons why I prefer to use C++. They make APIs easier/cleaner and code more readable, provide 'automatic' cleanup, and mitigate the risk of forgetting to check or pass back a return code.

Note that every task needs to be somewhat 'self-contained' w.r.t. error handling. If a task fails to operate correctly, there's not much you can do from outside of that task short of resetting the device. (xTaskAbortDelay() could sometimes be helpful, if it were not designed with an inherent race condition which makes it pretty useless.)
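
One way to read 'self-contained' is that each task retries its own failed operation a bounded number of times and only then asks for outside help. A minimal sketch under that assumption; `run_with_retry()`, `step_result_t`, `task_health_t`, and the flaky stub are hypothetical, and a real task would likely want a delay between attempts.

```c
#include <stddef.h>

typedef enum { STEP_OK, STEP_FAILED } step_result_t;
typedef enum { TASK_HEALTHY, TASK_NEEDS_ESCALATION } task_health_t;

/* Run one unit of work, retrying locally up to max_attempts times.
 * Escalation is requested only when every attempt has failed. */
static task_health_t run_with_retry(step_result_t (*step)(void *), void *ctx,
                                    int max_attempts)
{
    if (step == NULL) {
        return TASK_NEEDS_ESCALATION;
    }
    for (int attempt = 0; attempt < max_attempts; attempt++) {
        if (step(ctx) == STEP_OK) {
            return TASK_HEALTHY;
        }
    }
    return TASK_NEEDS_ESCALATION;
}

/* Stub that fails until the counter it is given reaches zero. */
static step_result_t flaky_step(void *ctx)
{
    int *remaining_failures = ctx;
    if (*remaining_failures > 0) {
        (*remaining_failures)--;
        return STEP_FAILED;
    }
    return STEP_OK;
}
```

The escalation result is what the task would hand to the central error handler, which then decides between a task restart and a device reset.
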