Page 1 of 1

Bug in FPU Coprocessor float division?

Posted: Tue May 14, 2019 8:03 pm
by michal1
Hello,

I have noticed that using the native FPU division instructions leads to unexpected results when dividing two floating numbers of significantly different exponents.

It seems that if a division operator is used in C code, gcc inserts a call to __divsf3() which does not use the native FPU division instructions but instead relies on software implementation and gives the correct result. As I am working on computationally intensive DSP code, performance is really important to me and I would like to use the native instructions instead. Using the FPU instruction sequence for division as documented in the ISA Reference manual and implemented in divsf() : https://github.com/espressif/esp-idf/bl ... /test_fp.c however leads to wrong results (typically 0.0) when the two exponents differ a lot. The FPU division should however be IEEE compliant according to documentation.

An example code:

Code: Select all

float x = 24e9;
float y = 4;

printf("C: %.2f/%.2f = %.2f\n", x,y, x/y);
printf("ASM: %.2f/%.2f = %.2f\n", x,y, divsf(x,y));

x = 17;
y = 4;

printf("C: %.2f/%.2f = %.2f\n", x,y, x/y);
printf("ASM: %.2f/%.2f = %.2f\n", x,y, divsf(x,y));
Output:

Code: Select all

C: 24000000000.00/4.00 = 6000000000.00
ASM: 24000000000.00/4.00 = 0.00
C: 17.00/4.00 = 4.25
ASM: 17.00/4.00 = 4.25
I was wondering if anybody has come across those issues and if this is indeed a hardware bug in the FPU or if there are perhaps limitations to the native FPU?

Thank you very much!

Re: Bug in FPU Coprocessor float division?

Posted: Wed May 15, 2019 4:33 pm
by michal1
It turns out there is a bug in divsf() implementation in https://github.com/espressif/esp-idf/bl ... /test_fp.c

I believe the correct implemenation should be:

Code: Select all

float divsf(float a, float b)
{
    float result;
    asm volatile (
        "wfr f0, %1\n"
        "wfr f1, %2\n"
        "div0.s f3, f1 \n"
        "nexp01.s f4, f1 \n"
        "const.s f5, 1 \n"
        "maddn.s f5, f4, f3 \n"
        "mov.s f6, f3 \n"
        "mov.s f7, f1 \n"
        "nexp01.s f8, f0 \n"
        "maddn.s f6, f5, f3 \n"
        "const.s f5, 1 \n"
        "const.s f2, 0 \n"
        "neg.s f9, f8 \n"
        "maddn.s f5,f4,f6 \n"
        "maddn.s f2, f9, f3 \n" /* Original was "maddn.s f2, f0, f3 \n" */
        "mkdadj.s f7, f0 \n"
        "maddn.s f6,f5,f6 \n"
        "maddn.s f9,f4,f2 \n"
        "const.s f5, 1 \n"
        "maddn.s f5,f4,f6 \n"
        "maddn.s f2,f9,f6 \n"
        "neg.s f9, f8 \n"
        "maddn.s f6,f5,f6 \n"
        "maddn.s f9,f4,f2 \n"
        "addexpm.s f2, f7 \n"
        "addexp.s f6, f7 \n"
        "divn.s f2,f9,f6\n"
        "rfr %0, f2\n"
        :"=r"(result):"r"(a), "r"(b)
    );
    return result;
}
Another question is, why does gcc by default use __divsf3() when the performance of the native FPU is significantly better?

Re: Bug in FPU Coprocessor float division?

Posted: Thu May 16, 2019 2:15 am
by ESP_Angus
Thanks for pointing this out, michal.

EDIT: The libgcc __divsf3() implementation in the toolchain uses FPU registers. IDF is accidentally linking the version in ROM which does not use FPU registers. We'll fix this so that you get the FPU version when building a project that uses floating point division.

EDIT 2: Fix is in internal review now.

Re: Bug in FPU Coprocessor float division?

Posted: Sat May 09, 2020 4:13 pm
by simap2000
Hello!
This post has been most helpful! I haven't tested it thoroughly, but this division implementation is much faster than what I'm getting natively.

I poked around on the esp-idf github, and it looks like it's still using the rom version that is slow. What happened to the fix? Should I file a github issue?

Re: Bug in FPU Coprocessor float division?

Posted: Sun May 10, 2020 11:22 pm
by ESP_Angus
Hi simap2000,

Sorry this post should have been updated when the fix was merged. The fix was merged in this commit and should be included in ESP-IDF v4.0 and newer.

Can you please give some more details about what you're seeing? Specifically: Which ESP-IDF version are you using, and what is the reason for saying "it looks like it's still using the rom version that is slow"?

Thanks,

Angus

Re: Bug in FPU Coprocessor float division?

Posted: Wed Feb 28, 2024 2:59 pm
by Bryght-Richard
@ESP_Angus, this still seems to be a problem on ESP32S3, IDF5.1.2, G++12.2.0, compiler optimization -O2, same result on godbolt.org

Code: Select all

float fdiv(float a, float b)
{
    return a / b;
}

float fmul(float a, float b)
{
    return a * b;
}
Compiles to:

Code: Select all

fdiv(float, float):
        entry   sp, 32
        mov.n   a11, a3
        mov.n   a10, a2
        call8   __divsf3
        mov.n   a2, a10
        retw.n
fmul(float, float):
        entry   sp, 32
        wfr     f0, a2
        wfr     f1, a3
        mul.s   f0, f0, f1
        rfr     a2, f0
        retw.n

Re: Bug in FPU Coprocessor float division?

Posted: Fri Mar 01, 2024 6:56 am
by ESP_Sprite
I don't see the issue? Division is always done in the __divsf3 function, as division on Xtensa is not one single opcode, but a bunch of them. Do you have a reason to think the __divsf3 linked does not use the FPU?

Re: Bug in FPU Coprocessor float division?

Posted: Fri Mar 01, 2024 4:19 pm
by Bryght-Richard
@ESP_sprint, thank you, I had misunderstood how the divide instructions worked, and though they were short enough to be inlined effectively without branching to __divsf3.