Re: Unexpectedly low floating-point performance in C
Posted: Fri Jan 13, 2017 10:50 pm
There are several things to consider in the floating-point test.
1. The ESP32 runs code out of FLASH plus 64K of cached RAM. The RAM runs at full speed.
The FLASH is clocked at 80MHz and runs in DUAL mode (2 bits per clock cycle), which works out to 200ns for a 32-bit word and 150ns for a 24-bit word.
The ESP32 has both 24-bit and 32-bit instructions.
2. The first time through a set of instructions, the CPU has to go to the FLASH to get the code.
It stores the code in the 64K cache for use next time.
The cache can get overwritten if there is a lot of code between the first read and the second use. If this happens, the code must be fetched from FLASH again.
3. There are 16 floating-point registers, and some number of these are used during a floating-point operation.
They need to be loaded and unloaded with data. If a register already contains the value, it can be used again.
4. There are 38 single-precision floating-point instructions.
These native instructions include add, subtract, multiply, and multiply with add or subtract (useful for FIR filters in DSP code).
They do not include divide or any higher-level operations like sqrt, pow, sin, etc. If these higher-level operations are needed, they are typically run from the math library.
5. The FPU must be attached to the task the first time it is needed.
6. Operations using the GPIO are generally written for general use and may involve one or more subroutine calls.
The ESP32 has special registers that allow set, clear, in, and out with just one instruction.
Since GPIOs are I/O devices, on more complex processors like the ESP32 they require multiple CPU clock cycles to operate.
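The FLASH fetch times in point 1 follow directly from the FLASH clock and bus width. A minimal sketch of that arithmetic in C; the function name flash_fetch_ns is made up for illustration, and it only covers the raw data transfer, ignoring command/address phases and cache line fills:

```c
#include <assert.h>

/* Raw transfer time, in nanoseconds, to clock `bits` bits out of a
 * serial flash that delivers `bits_per_clock` bits per clock cycle at
 * `clock_mhz` MHz. Hypothetical helper: it ignores command/address
 * phases and cache line fills. */
double flash_fetch_ns(unsigned bits, unsigned bits_per_clock, double clock_mhz)
{
    assert(bits_per_clock > 0 && clock_mhz > 0.0);
    double clocks = (double)bits / bits_per_clock; /* 32 bits / 2 = 16 clocks */
    return clocks * 1000.0 / clock_mhz;            /* 16 clocks at 80MHz = 200ns */
}
```

With the numbers from point 1, flash_fetch_ns(32, 2, 80.0) gives 200ns and flash_fetch_ns(24, 2, 80.0) gives 150ns.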
I wrote several floating-point test routines. Some are very short and only test the native FP operations. Others are more complex and use higher-level operations like DIVIDE, POW, SQRT, SIN.
All routines use loops, so the first iteration tests the operation running out of FLASH and the second iteration tests the cached instructions running out of RAM.
Note that even the first use of code may already be cached due to line fills and prefetched code.
This shows up when an instruction takes longer to execute: it gives the FLASH time to run ahead and store the next instructions in the cache before they are needed.
This is why the MUL looks like it is faster than the ADD in Aschweiz's test.
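A rough sketch of that loop structure (not the actual test code): each pass is timed separately so that pass 0, the cold one, can be compared with the later, cached passes. The names time_passes, add_op and PASSES are made up for illustration. This is a host-side sketch using the standard clock() call, which is far too coarse to resolve a single instruction; on the ESP32 the two clock() calls would be replaced by a GPIO set/clear pair or a cycle-counter read.

```c
#include <time.h>

#define PASSES 4

/* Run `op` PASSES times and record how long each pass took, in clock()
 * ticks, so pass 0 (cold) can be compared with later (warm) passes. */
void time_passes(void (*op)(void), clock_t ticks[PASSES])
{
    for (int i = 0; i < PASSES; i++) {
        clock_t start = clock();
        op();                       /* operation under test */
        ticks[i] = clock() - start; /* elapsed ticks for this pass only */
    }
}

/* Example operation under test: one single-precision add on volatiles,
 * so the compiler cannot optimise the add away. */
volatile float op_a = 1.5f, op_b = 2.25f, op_sum;

void add_op(void)
{
    op_sum = op_a + op_b;
}
```

Calling time_passes(add_op, ticks) fills ticks[0] with the first-pass time and ticks[1] onward with the repeat-pass times.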
With all that said, here are some of my results.
1. The GPIO operations to set and clear a bit on a port take about 62.4ns. This must be subtracted from the time a test takes to run, since I set and clear a bit for every FP instruction tested.
2. Starting the FP processor the first time takes about 5.6172us. This cost is only paid once on a single-task system.
3. A single-precision ADD takes 2.065us the first time through the loop.
The second time, running out of cache, it only takes 62.4ns.
4. A single-precision MUL takes 4.0656us the first time through the loop.
The second time, running out of cache, it only takes 62.4ns.
5. A single-precision MUL+ADD takes 2.09us the first time through the loop, since some of the FLASH code is already cached due to the longer time of the previous instruction.
The second time, running out of cache, it only takes 100.4ns.
All other operations use multiple FP instructions, and the code is running out of cache. Here are the single-precision results:
1. DIVIDE takes 1.158us
2. SQRT takes 8.155us
3. POW takes 55.8us
4. SIN takes 15.776us
Since there are no double-precision instructions, all double-precision operations use multiple FP instructions, and the code is running out of cache. Here are the double-precision results:
1. DADD takes 400ns
2. DMUL takes 787ns
3. DMUL+DADD takes 1.11us
4. DDIVIDE takes 4.085us
5. DSQRT takes 7.88us, about the same as single precision
6. DPOW takes 55.37us, about the same as single precision
7. DSIN takes 15.776us, the same as single precision
I was using the Arduino IDE and math libraries, and they only accept single-precision inputs to SQRT, POW, and SIN, which is why those results are the same as single precision.
From the clock tick counter it looks like I am running at 240MHz, yet judging by the shortest execution time it takes 15 clock cycles to do an ADD.
I assume this is due to the loading and unloading of the floating-point registers, since all of the variables are declared volatile.
If I remove the volatile from one input and let it be cached in one of the 16 FP registers, then only one input and the output need to be loaded/unloaded from the FP registers. The ADD and MUL drop to 6 clock cycles, or about 25ns.
Doing two ADDs, each with one volatile input and a non-volatile output, and then using those two results in a third ADD whose output is declared volatile, allows three values to be cached in the FP registers.
The last ADD uses the two results directly from the FP registers and only unloads one result, since its output is declared volatile. This reduces the ADD to 3 clock cycles, or about 12.5ns.
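The three-ADD pattern described above might look roughly like this in C. This is only a sketch of the idea, not the original test code: the names in_a, in_b, out_sum and three_adds are made up, and whether the intermediates really stay in FP registers depends on the compiler and optimisation level.

```c
/* volatile forces a real load or store on every access, so these
 * variables cannot be kept in an FP register between statements. */
volatile float in_a = 1.0f, in_b = 2.0f;
volatile float out_sum;

/* Two ADDs with one volatile input each and plain (non-volatile)
 * results, followed by a third ADD whose output is volatile. */
void three_adds(float k)
{
    float t1 = in_a + k; /* one load from memory; result may stay in a register */
    float t2 = in_b + k; /* one load from memory; result may stay in a register */
    out_sum = t1 + t2;   /* both operands may already be in registers; the
                            store is forced because out_sum is volatile */
}
```

After three_adds(0.5f), out_sum holds 4.0. Only the third ADD corresponds to the 3-cycle case: both of its inputs are plain locals, and only its result has to be written back.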
One thing about using GPIO for speed measurements: you can only change a GPIO so fast, and below that limit the GPIO will not toggle.
In that case you need to run the code with and without the test instruction and subtract the two times to determine its speed, or use multiple iterations.
In my case it looks like the lower limit is around 100ns. After subtracting the 62.4ns for the GPIO calls, this leaves 37.6ns, or 9 clock cycles.
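The subtraction and the conversion to clock cycles can be written out as a small sketch. The helper names net_ns and ns_to_cycles are made up for illustration, and the 240MHz figure is the clock rate assumed above:

```c
/* Net time of the instruction under test: the time measured with the
 * instruction between the GPIO set and clear, minus the GPIO-only time. */
double net_ns(double with_test_ns, double gpio_only_ns)
{
    return with_test_ns - gpio_only_ns;
}

/* Convert nanoseconds to CPU clock cycles at `cpu_mhz` MHz,
 * rounded to the nearest whole cycle. */
unsigned ns_to_cycles(double ns, double cpu_mhz)
{
    return (unsigned)(ns * cpu_mhz / 1000.0 + 0.5);
}
```

For example, ns_to_cycles(net_ns(100.0, 62.4), 240.0) gives 9, and ns_to_cycles(62.4, 240.0) gives the 15 cycles quoted for the fully volatile ADD.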