ESP32-C3 Instruction load latency from flash
Posted: Wed Jul 16, 2025 10:45 pm
I am doing a project using the ESP32-C3 development board.
I want to measure how many clock cycles a function call takes the first time it is called and when it is called again (basically the miss and hit cycles for an instruction load from flash). I wrote a basic function in assembly that returns the MPCER counter value. In the main code, I read the MPCER register before calling this function and then calculate the difference to get the latency of the load. I then call the function a second time and measure the latency again, which should be smaller because the instructions are now in the cache.
I repeated this experiment while adding dummy instructions to the function to increase the number of instructions between my measurement points. I got some results for the miss penalty that I cannot fully understand. See the attachment below.
The "Diff with prev" column is the miss cycles of the current row minus the miss cycles of the previous row.
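To make the column definition concrete, here is a minimal sketch in Python of how it is computed. The function name `diff_with_prev` and the input values are placeholders for illustration only; they are not the measured data from the attachment.

```python
def diff_with_prev(miss_cycles):
    """Return, for each row, miss cycles minus the previous row's miss cycles.

    The first row has no predecessor, so its entry is None.
    """
    diffs = []
    prev = None
    for cur in miss_cycles:
        diffs.append(None if prev is None else cur - prev)
        prev = cur
    return diffs


# Placeholder numbers, NOT the values from the attachment.
example_miss_cycles = [100, 132, 164, 268]
print(diff_with_prev(example_miss_cycles))  # [None, 32, 32, 104]
```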
Why do I see a 32-cycle increase for every additional instruction, and after 8 instructions, there is a bigger jump of 104 cycles?
My understanding is that this might be due to loading the instructions from flash into the cache, and that after 8 instructions another cache line needs to be filled. Is my understanding correct?
Does it take that long to load instructions from the flash?
I also repeated this experiment with some changes that caused the assembly code to be placed further away from main, and I saw an even bigger initial delay of around 400 cycles. Does the delay for loading an instruction depend on how far it is from the calling function? (I assume not.) Or does it have something to do with which cache line it is stored in?
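To illustrate the cache-line hypothesis, here is a rough sketch in Python that counts how many cache lines a run of instructions spans. It assumes a 32-byte cache line and 4-byte (uncompressed) instructions, which would give 8 instructions per line; both values are assumptions, and compressed 2-byte instructions would change the count. The function name `lines_touched` and the addresses are hypothetical, not taken from my build.

```python
LINE_SIZE = 32  # assumed cache line size in bytes
INSN_SIZE = 4   # assumed instruction size in bytes (no compressed instructions)


def lines_touched(start_addr, n_insns, line_size=LINE_SIZE, insn_size=INSN_SIZE):
    """Count the distinct cache lines covered by n_insns starting at start_addr."""
    if n_insns <= 0:
        return 0
    first_line = start_addr // line_size
    last_line = (start_addr + n_insns * insn_size - 1) // line_size
    return last_line - first_line + 1


# Hypothetical addresses, chosen only to show the effect of alignment.
print(lines_touched(0x42000000, 8))  # 1: eight instructions fit in one aligned line
print(lines_touched(0x42000000, 9))  # 2: the ninth instruction starts a new line
print(lines_touched(0x42000010, 8))  # 2: same code, but it straddles a line boundary
```

Under these assumptions, the number of line fills depends on where the function starts relative to a line boundary, not on its distance from the caller.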
Thanks in advance for the support.
Please let me know if you need further information or if something is not clear.