ESP32-C3 Instruction load latency from flash

Posted: Wed Jul 16, 2025 10:45 pm
by jithinvarghese.de
I am doing a project using the ESP32-C3 development board.

I wanted to measure how many clock cycles a function call takes the first time it is called and when it is called again (basically, the miss and hit cycles for instruction loads from flash). I wrote a basic function in assembly that returns the MPCER counter value. In the main code, I read the MPCER register before calling this function and then calculate the difference to get the latency of the load. I then call the function again and measure the latency, which should be smaller because the instructions are now in the cache.
I repeated this experiment while adding dummy instructions to the function to increase the number of instructions between my measurement points. I got some results for the miss penalty that I cannot fully understand. See the attachment below.

Attachment: Screenshot 2025-07-17 003059.png (10.61 KiB)

The "Diff with prev" column is the miss cycles of the current row minus the miss cycles of the previous row.

Why do I see a 32-cycle increase for every additional instruction, and after 8 instructions, there is a bigger jump of 104 cycles?
My understanding is that this might be due to loading the instructions from flash into the cache, and that after 8 instructions another cache line needs to be filled. Is my understanding correct?
Does it take that long to load instructions from the flash?

I also repeated this experiment with some changes that caused the assembly code to be placed farther away from main, and I saw an even bigger initial delay of around 400 cycles. Does the delay for loading an instruction depend on how far it is from the calling function? (I assume not.) Or does it have something to do with which cache line it is stored in?

Thanks in advance for the support.
Please let me know if you need further information or if something is not clear.

Re: ESP32-C3 Instruction load latency from flash

Posted: Thu Jul 17, 2025 2:47 am
by Sprite
MPCER can count a number of different things. What specifically are you counting?

Re: ESP32-C3 Instruction load latency from flash

Posted: Thu Jul 17, 2025 7:25 am
by MicroController
jithinvarghese.de wrote: "I wanted to measure the clock cycles it takes while a function is called the first time and when it is called again ...
in the main code, I measured the MPCER register before calling this function and then"
You mean MPCCR?
You can also just use esp_cpu_get_cycle_count().

Re: ESP32-C3 Instruction load latency from flash

Posted: Thu Jul 17, 2025 12:45 pm
by jithinvarghese.de
Sprite wrote: "MPCER can count a number of different things. What specifically are you counting?"
Yes, sorry. I meant that I set MPCER to 0x01 to count CPU cycles and then read the cycle count from the MPCCR register. I also set MPCER to 0x02 to count the number of instructions.

Here's the part of the code I use:

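#include <inttypes.h>
#include <stdio.h>
#include "riscv/csr.h"   // RV_READ_CSR / RV_WRITE_CSR

uint32_t instr_test(void);   // assembly routine shown further down; returns MPCCR

void app_main(void)
{
    uint32_t inst_start = 0;
    uint32_t inst_end = 0;
    uint32_t inst_delay = 0;
    RV_WRITE_CSR(0x7e0, 0x01);          // MPCER = 0x01: count CPU cycles
    inst_start = RV_READ_CSR(0x7e2);    // MPCCR: current counter value
    inst_end = instr_test();            // first call: instructions not yet cached
    inst_delay = inst_end - inst_start;
    printf("ICache miss delay = %" PRIu32 "\n", inst_delay);
    inst_start = RV_READ_CSR(0x7e2);
    inst_end = instr_test();            // second call: instructions should be cached
    inst_delay = inst_end - inst_start;
    printf("ICache hit delay = %" PRIu32 "\n", inst_delay);
    RV_WRITE_CSR(0x7e0, 0x02);          // MPCER = 0x02: count instructions
    inst_start = RV_READ_CSR(0x7e2);
    inst_end = instr_test();
    inst_delay = inst_end - inst_start;
    printf("Number of Instructions = %" PRIu32 "\n", inst_delay);
    ...
}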

The function instr_test is written in assembly:

instr_test:
    //li t0, 0
    //addi t0, t0, 1
    //addi t0, t0, 1
    csrr a0, 0x7E2
    ret

I uncomment the lines to add more instructions to my test.

Re: ESP32-C3 Instruction load latency from flash

Posted: Fri Jul 18, 2025 8:16 pm
by MicroController
jithinvarghese.de wrote: "Why do I see a 32-cycle increase for every additional instruction,"
I don't know. I believe you should see only 1 cycle for most instructions, until the code overflows into the next cache line.
Notice that the address of the function in memory can cause the code to be split and require an additional cache line if the function entry is not aligned to the cache line size. The function's alignment w.r.t. the cache line may vary with every change you make and recompile unless you explicitly specify the alignment as 32 (or greater).
Also notice that the code in your app_main() will have to be loaded into the cache too, which may affect your measurement.
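
To make the alignment point concrete, here is a minimal sketch in plain C of how the number of cache lines touched depends on where the code starts. The helper name cache_lines_spanned is hypothetical (it is not an ESP-IDF function), and the 32-byte line size is the value assumed in this thread. Keep in mind that on an RV32IMC core an instruction can be 2 or 4 bytes long, so 8 instructions are not necessarily 32 bytes.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper, not part of ESP-IDF: number of cache lines touched
 * by a code region of `size` bytes that starts at address `addr`.
 * `line_size` must be a power of two (32 bytes is assumed in this thread). */
static uint32_t cache_lines_spanned(uint32_t addr, uint32_t size, uint32_t line_size)
{
    assert(line_size != 0 && (line_size & (line_size - 1)) == 0);
    if (size == 0) {
        return 0;
    }
    /* Index of the line holding the first byte and the line holding the last byte. */
    const uint32_t first_line = addr / line_size;
    const uint32_t last_line = (addr + size - 1) / line_size;
    return last_line - first_line + 1;
}
```

For example, an 8-byte function that starts 28 bytes into a line (address ending in 0x1C) touches two lines, while the same function aligned to 32 bytes touches only one, so the first call would need one line fill less.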

Here's roughly how I usually do these measurements:

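#include <inttypes.h>
#include <stdint.h>
#include "esp_cpu.h"   // esp_cpu_get_cycle_count()
#include "esp_log.h"   // ESP_LOGI()

static const char *TAG = "bench";   // log tag; the name is arbitrary

uint32_t functionUnderTest(uint32_t arg);   // placeholder for the code being measured

// Make a value "volatile", i.e. unknown at compile time and pretend it may be different each time this function is called.
static inline uint32_t _v(uint32_t x) {
  asm volatile ("": "+r" (x));
  return x;
}

void app_main(void) {

  const uint32_t iterations = 3;

  uint32_t tMin = UINT32_MAX;
  uint32_t tMax = 0;

  for(uint32_t i = _v(iterations); i != 0; --i) {

    const uint32_t dummyArg = _v(0);

    const uint32_t tStart = esp_cpu_get_cycle_count();

    _v( functionUnderTest( dummyArg ) );

    // Unsigned subtraction stays correct even if the cycle counter wraps once.
    const uint32_t dt = esp_cpu_get_cycle_count() - tStart;

    tMin = dt < tMin ? dt : tMin;
    tMax = dt > tMax ? dt : tMax;

  }
  ESP_LOGI(TAG, "min: %" PRIu32 ", max: %" PRIu32, tMin, tMax);
}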
(The _v() "trick" is to make sure that 1) the for loop will not be unrolled by the compiler and 2) the functionUnderTest is actually called (in every iteration) in case it has no other compiler-visible side effects.)

I'm usually only interested in tMin though, i.e. the actual execution time, so I don't care about any inconsistency of tMax due to caching of app_main().