Integer division performance ????

Baldhead · Postby **Baldhead** » Sat Dec 07, 2019 9:00 pm

Hi,

How long does an integer division take on esp32 Tensilica Xtensa LX6 core ?

I think that 32 bit integer division should be faster in this architecture ? Or not ?
16 bits maybe ?

Inline assembly can be more faster ?
If yes.
How to do ?

Sixteen integer subtraction operations is faster ?

Example code:

Code: Select all

    static uint32_t fullBufferNum;                   
    fullBufferNum = (uint32_t) length / 4092;    // Integer division.

    static uint32_t remainderBufferNum;      
    remainderBufferNum = ( length % 4092 );

Application: software/hardware driver, then need to be faster.

Thank's for the help.

stavrosd · Postby **stavrosd** » Sun Dec 08, 2019 1:55 am

If you want to divide a positive integer by a power of 2, right sifting is faster than division.
In most architectures the shift command requires only one CPU clock.
4096 = 2^12
Instead of "X/4096", try "X>>12"
https://en.wikipedia.org/wiki/Arithmetic_shift

Postby **ESP_Sprite** » Sun Dec 08, 2019 2:53 am

stavrosd wrote: ↑
Sun Dec 08, 2019 1:55 am
If you want to divide a positive integer by a power of 2, right sifting is faster than division.

Note that if you divide a value with a constant power of 2 in C (compiled using GCC), the compiler will also note this and automagically replace the divide with a bit shift.

Baldhead · Postby **Baldhead** » Sun Dec 08, 2019 3:51 pm

Hi,

I want to initialize lldesc_t before a dma transfer. Each lldesc_t accepts a maximum of 4092 bytes (word aligned). My buffer are per example 15360 bytes but transfer are in words(32 bits big memory waste).
So,
Transfer = 15360 * 4 = 61440.
61440 / 4092 = 15 full lldesc_t with 4092 bytes or 1023 uint32_t.
More 1 lldesc_t with 60 bytes or 15 uint32_t.

My transfer has variable size and not fixed size.

It needs to be calculated quickly because there may be several sequential transfers. Ie: finish a transfer and start another transfer.

Thank's.

Postby **ESP_Sprite** » Mon Dec 09, 2019 2:21 am

How do you figure the division may be the bottle neck there? Before you try to optimize what you think is the bottleneck, it's probably better to instrument first so you know for sure. You may be surprised with what you find.

Baldhead · Postby **Baldhead** » Mon Dec 09, 2019 2:52 am

Hi ESP_Sprite ,

I already measured with i/o toggle before and after call function.

With 30720 bytes or 7680 uint32_t filling the linked list( 8 nodes ) takes 4.7 us.

And with 240 Mhz clock, i am finding it slow.

How long does an integer division take on esp32 Tensilica Xtensa LX6 core ?

How long does an shift take on esp32 Tensilica Xtensa LX6 core ?

I didn't find this information anywhere.

fullBufferNum = (uint32_t) length / 4092; assembly maybe: QUOU

remainderBufferNum = ( length % 4092 ); assembly maybe: REMU

uint32_t len = length << 2; assembly maybe: SLL, SLLi

Thank's.

Postby **ESP_Sprite** » Mon Dec 09, 2019 6:25 am

Sure, but you seemingly immediately assume the divide is slow, which I think is kind-of an odd conclusion, given the fact that a hardware divide even in the worst case isn't that slow.

I don't think you'll get much of an answer, by the way. The issue is that Cadence doesn't really give a cycles/instruction number for the Tensilica processor: the (extremely simplified) intention is that if you need something fast, you'll let their software analyze your code and it comes up with a core that does whatever you coded up as fast as possible.

You can probably measure it, however: especially the ccount register should be useful for that. I'd be highly surprised if a divide takes more than a a few tens of cycles, however.

Baldhead · Postby **Baldhead** » Mon Dec 09, 2019 7:27 pm

Hi ESP_Sprite,

How example:
Multiplication versus shift left.

I try to put inline in the function: inline int test ( uint32_t length ).

I try to put IRAM_ATTR in the function: int IRAM_ATTR test ( uint32_t length ).

I try to put inline and IRAM_ATTR in the function: inline int IRAM_ATTR test ( uint32_t length ).

And don't change nothing in time reponse.

Code: Select all

int test ( uint32_t length )
{
    static uint32_t len;  

    gpio_set_level( (gpio_num_t)25, 1 );

    len = length * 4;

    gpio_set_level( (gpio_num_t)25, 0 );
} 

int test ( uint32_t length )
{
    static uint32_t len;  

    gpio_set_level( (gpio_num_t)25, 1 );
    
    len = length << 2;

    gpio_set_level( (gpio_num_t)25, 0 );
}

Both operations take 212,3 ns.

Maybe i/o operation are slow.

How to use ccount register ?
An example code would be interesting.

Thank's.

ESP_Angus · Postby **ESP_Angus** » Mon Dec 09, 2019 10:37 pm

It's always worth checking the disassembly to verify your assumptions about what's happening. Experience has taught me many times that the compiler is usually smarter than I am.

If you write the exact C code posted above, the compiler will warn " warning: variable 'len' set but not used" and "control reaches end of non-void function". If you set it to the ignore these warnings, the compiler will happily optimise out the unused variables and operations and you're testing two no-ops that just call the gpio_set_level() functions and do nothing else.

Here's a main.c file which compiles without warnings:

Code: Select all

#include "esp_attr.h"
#include "driver/gpio.h"

int IRAM_ATTR test1 ( uint32_t length )
{
    uint32_t len;

    gpio_set_level( (gpio_num_t)25, 1 );

    len = length * 4;

    gpio_set_level( (gpio_num_t)25, 0 );

    return len;
}

int IRAM_ATTR test2 ( uint32_t length )
{
    uint32_t len;

    gpio_set_level( (gpio_num_t)25, 1 );

    len = length << 2;

    gpio_set_level( (gpio_num_t)25, 0 );

    return len;
}

void app_main(void)
{
    test1(33);
    test2(33);
}

Built with default config (ie default compiler optimisation level which is -Og (EDIT: fixed level), if you want max performance then set this to Performance). Now let's look at the disassembly:

(Note I'm using the CMake build system here so main.c compiles to main.c.obj. If using GNU Make, main.c will compile to main.o in a different directory but you can still use objdump to disassemble it.)

Code: Select all

$ xtensa-esp32-elf-objdump -d build/esp-idf/main/CMakeFiles/__idf_main.dir/main.c.obj 

build/esp-idf/main/CMakeFiles/__idf_main.dir/main.c.obj:     file format elf32-xtensa-le


Disassembly of section .iram1.0.literal:

00000000 <.iram1.0.literal>:
	...

Disassembly of section .iram1.1.literal:

00000000 <.iram1.1.literal>:
	...

Disassembly of section .literal.app_main:

00000000 <.literal.app_main>:
	...

Disassembly of section .iram1.0:

00000000 <test1>:
   0:	004136        	entry	a1, 32
   3:	1b0c      	movi.n	a11, 1
   5:	9a1c      	movi.n	a10, 25
   7:	000081        	l32r	a8, fffc0008 <test1+0xfffc0008>
   a:	0008e0        	callx8	a8
   d:	1122e0        	slli	a2, a2, 2
  10:	0b0c      	movi.n	a11, 0
  12:	9a1c      	movi.n	a10, 25
  14:	000081        	l32r	a8, fffc0014 <test1+0xfffc0014>
  17:	0008e0        	callx8	a8
  1a:	f01d      	retw.n

Disassembly of section .iram1.1:

00000000 <test2>:
   0:	004136        	entry	a1, 32
   3:	1b0c      	movi.n	a11, 1
   5:	9a1c      	movi.n	a10, 25
   7:	000081        	l32r	a8, fffc0008 <test2+0xfffc0008>
   a:	0008e0        	callx8	a8
   d:	1122e0        	slli	a2, a2, 2
  10:	0b0c      	movi.n	a11, 0
  12:	9a1c      	movi.n	a10, 25
  14:	000081        	l32r	a8, fffc0014 <test2+0xfffc0014>
  17:	0008e0        	callx8	a8
  1a:	f01d      	retw.n

Disassembly of section .text.app_main:

00000000 <app_main>:
   0:	004136        	entry	a1, 32
   3:	21a0a2        	movi	a10, 33
   6:	000081        	l32r	a8, fffc0008 <app_main+0xfffc0008>
   9:	0008e0        	callx8	a8
   c:	1a2c      	movi.n	a10, 33
   e:	000081        	l32r	a8, fffc0010 <app_main+0xfffc0010>
  11:	0008e0        	callx8	a8
  14:	f01d      	retw.n

(Note: you can use objdump -S instead of objdump -d if you want to see the source code mixed with disassembly.)

The compiler has noticed that you're multiplying by a power of two and chosen to use an slli (shift left) instruction in both functions. Clever compiler. "Use shift instead of multiple/divide by power of two" was probably useful advice with 80s and maybe 90s compilers, but it's one of the first optimisations a modern C compiler will apply. Unfortunately the "advice" still sticks around.

Regarding using CCOUNT to measure cycles, here's an example:

Code: Select all

#include <stdio.h>
#include "esp_attr.h"
#include "soc/cpu.h"

int IRAM_ATTR test1 ( uint32_t length )
{
    uint32_t start, end;
    uint32_t len;

    RSR(CCOUNT, start);

    len = length * 4;

    RSR(CCOUNT, end);

    printf("Operation executed in %u cycles\n", end - start);

    return len;
}

void app_main(void)
{
    test1(33);
}

Note that CCOUNT only works if the task is pinned to one core, if the task may migrate cores then use esp_timer_get_time() function which returns a uint64_t timestamp in microseconds.

Note also that interrupts are still on in these tests, so sometimes a context switch interrupt or other interrupt will happen during the operation and skew the results. Better to run the test a number of times and average out the outliers.

***

ESP_Sprite wrote: How do you figure the division may be the bottle neck there? Before you try to optimize what you think is the bottleneck, it's probably better to instrument first so you know for sure. You may be surprised with what you find.

This is good advice that ESP_Sprite gave you. If you can post the actual code you're trying to optimise and some performance measurements for it, someone can probably give you some useful tips.

Note that "inline IRAM_ATTR" is probably never what you want, because inline will incorporate the code into the caller - meaning the IRAM_ATTR is ignored as the code ends up in the same memory space as the caller is in (either IRAM or flash). Suggest either "static inline" if you want to incorporate into the caller, or "IRAM_ATTR" by itself if you want just this function to be in IRAM to avoid flash delays.

Baldhead · Postby **Baldhead** » Tue Dec 10, 2019 12:53 am

Hi ESP_Angus,

Thank's for your reply.

"Built with default config (ie default compiler optimisation level which is -Os, if you want max performance then set this to Performance)."
Where i can change to "The “Performance” setting will add the -O2 flag to CFLAGS".

In menuconfig -> compiler options -> Optimization Level -> Debug (-Og) or Release (-Os) options only.

Also in-O2 mode i don't know if i can use printf to debug.

Thank's.

Integer division performance ????

Integer division performance ????

Re: Integer division performance ????

Re: Integer division performance ????

Re: Integer division performance ????

Re: Integer division performance ????

Re: Integer division performance ????

Re: Integer division performance ????

Re: Integer division performance ????

Re: Integer division performance ????

Re: Integer division performance ????

Who is online

About Us

Extra

Information