The P4's undocumented SIMD instructions

MicroController
Posts: 2661
Joined: Mon Oct 17, 2022 7:38 pm
Location: Europe, Germany

Post by MicroController » Wed Apr 15, 2026 11:26 am

As of today, documentation for the P4's "PIE"/SIMD instructions is still "to be added later".
However, I just unearthed this (presumably complete!) list of the 350+ PIE instructions in Espressif's binutils/objdump version:
https://github.com/espressif/binutils-g ... xespv2p1.c

For anyone familiar with the S3's PIE, it's not hard to guess what many of the instructions do - and with a bit of squinting one can come up with pretty reasonable guesses about the new instructions too, like

esp.vadd.u16 qv, qx, qy - unsigned vector addition?
esp.vmul.s16.s8xs8 qz, qv, qx, qy - vector s8xs8->s16 multiplication?
esp.vabs.16 qv, qy - vector absolute value?
esp.vsat.s16 qz, qx, ra, rb - vector saturation/clamping?
esp.vsadds.s16 qv, qx, ra - vector-scalar addition w/ saturation?
...

Anybody feel like speculating about the fancy new instructions, or even testing them out on actual hardware?

(Btw, I also noticed that v3 of the P4 is supposed to have still more additional instructions/registers ("Updated PIE for saturation and rounding") than current/previous versions, so some instructions may be in the list but not yet work on available hardware...)
Last edited by MicroController on Tue Apr 21, 2026 6:44 am, edited 2 times in total.

Re: The P4's undocumented SIMD instructions

Post by MicroController » Wed Apr 15, 2026 10:02 pm

According to the comments in esp-dl, the following RISC-V registers cannot be used with the SIMD instructions:

Code:

t0 (x5)
t1 (x6)
t2 (x7)
a6 (x16)
a7 (x17)
s2 (x18)
s3 (x19)
s4 (x20)
s5 (x21)

Consequently, trying something like esp.vld.128.ip q0, t0, 16 fails to build with an "Error: illegal operands" from the assembler. This is also something to be aware of when using inline assembly, where the compiler may choose to provide an input or output in one of those registers. I don't expect we'll get a gcc version with P4-specific constraints/register classes, but one can use the RISC-V constraint "cr" (= x8...x15) as a workaround.
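
A minimal sketch of the "cr" workaround in C inline assembly follows. Several things here are assumptions rather than confirmed facts: copy16, pie_xreg_listed_unusable and the opt-in macro USE_P4_PIE are hypothetical names, esp.vst.128.ip is assumed to be the store counterpart of esp.vld.128.ip, the pointers are assumed to be 16-byte aligned, and whether "cr" is accepted depends on the gcc version in use. Without the macro the function falls back to memcpy, so the file also builds on a host.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* x-register numbers listed in this thread as unusable with PIE:
   t0-t2 (x5-x7) and a6, a7, s2-s5 (x16-x21). */
#define PIE_UNUSABLE_XREGS 0x003F00E0u

/* Hypothetical helper: 1 if register x<n> is on the list above, else 0. */
static int pie_xreg_listed_unusable(unsigned x)
{
    assert(x < 32);
    return (int)((PIE_UNUSABLE_XREGS >> x) & 1u);
}

/* Copy one 16-byte block through q0 (sketch, untested on hardware). */
static void copy16(void *dst, const void *src)
{
#if defined(__riscv) && defined(USE_P4_PIE) /* hypothetical opt-in macro */
    /* "+cr" asks gcc for x8...x15 and marks the operand read-write,
       because the .ip forms are assumed to post-increment the pointer.
       gcc does not know q0, so it cannot be listed as a clobber. */
    __asm__ volatile(
        "esp.vld.128.ip q0, %[s], 16\n\t"
        "esp.vst.128.ip q0, %[d], 16\n\t"
        : [s] "+cr"(src), [d] "+cr"(dst)
        :
        : "memory");
#else
    memcpy(dst, src, 16); /* portable fallback */
#endif
}
```

All of x8...x15 lie outside the listed set, which is why "cr" works as a constraint here, at the cost of ruling out registers such as a0...a5 that would presumably be fine too.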
Last edited by MicroController on Mon Apr 20, 2026 3:33 pm, edited 3 times in total.

Sprite
Espressif staff
Posts: 10593
Joined: Thu Nov 26, 2015 4:08 am

Re: The P4's undocumented SIMD instructions

Post by Sprite » Thu Apr 16, 2026 9:58 am

Potentially this is also of interest.

Re: The P4's undocumented SIMD instructions

Post by MicroController » Thu Apr 16, 2026 12:25 pm

And while we're at it, let's also give a quick shout-out to the (documented) "hardware loop", which looks like this for example.

On the P4, we can even nest two hardware loops (outer loop: esp.lp.setup 1, ... - inner loop: esp.lp.setup 0, ...). However, I expect these to provide less (performance) benefit than the "zero-overhead loop" on the Xtensas because the P4 already uses branch prediction for the regular branches.

Re: The P4's undocumented SIMD instructions

Post by MicroController » Sat Apr 18, 2026 10:43 am

So...
We actually have a "...xespv2p1.c" and a "...xespv2p2.c". I assume the "p2" describes the updated/newer version of the PIE (i.e. P4 v3), because it actually includes new "saturating" and "rounding" variants of many instructions:

esp.vmul.s16 qz, qx, qy - vector multiplication + arithmetic right-shift
esp.vmul.s16 qz, qx, qy, vs - saturating or truncating
esp.vmul.s16 qz, qx, qy, vr - rounding
esp.vmul.s16 qz, qx, qy, vs, vr - saturating/truncating + rounding

So the syntax would be
esp.vmul.s16 qz, qx, qy [, saturation][, rounding]

The rounding and saturation arguments are named here:

"rdn" - round down (floor)
"rup" - round up (ceiling)
"raz" - round away from zero
"rtz" - round toward zero
"rhaz" - round half away from zero
"rhtz" - round half toward zero
"rne" - round to nearest even
"dyn" - mystery! (Maybe determined by a value in the "cfg" or some other register?)

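To make the listed mode names concrete, here is a small reference model in C of "divide by 2^shift, then round". It only illustrates the usual mathematical meaning of each name. Whether the hardware behaves this way is not confirmed, shift_round and round_mode_t are hypothetical names, and "dyn" is left out because its meaning is unknown.

```c
#include <assert.h>
#include <stdint.h>

/* Mode names as listed above ("dyn" omitted: meaning unknown). */
typedef enum { RDN, RUP, RAZ, RTZ, RHAZ, RHTZ, RNE } round_mode_t;

/* Hypothetical helper: v / 2^shift, rounded according to mode. */
static int64_t shift_round(int64_t v, unsigned shift, round_mode_t mode)
{
    assert(shift < 62);
    const int64_t one = (int64_t)1 << shift;

    /* C division truncates toward zero; convert to floor + remainder. */
    int64_t q = v / one;
    int64_t r = v % one;
    if (r < 0) { q -= 1; r += one; }   /* now 0 <= r < 2^shift */
    if (r == 0) return q;              /* exact result, nothing to round */

    const int neg   = v < 0;
    const int above = 2 * r > one;     /* fraction greater than one half */
    const int tie   = 2 * r == one;    /* fraction exactly one half */

    switch (mode) {
    case RDN:  return q;                                    /* floor */
    case RUP:  return q + 1;                                /* ceiling */
    case RTZ:  return neg ? q + 1 : q;                      /* toward zero */
    case RAZ:  return neg ? q : q + 1;                      /* away from zero */
    case RHAZ: return (above || (tie && !neg)) ? q + 1 : q; /* ties away */
    case RHTZ: return (above || (tie && neg)) ? q + 1 : q;  /* ties toward 0 */
    case RNE:  return (above || (tie && (q & 1))) ? q + 1 : q; /* ties even */
    }
    return q;
}
```

For example, with v = -5 and shift = 1 (that is, -2.5), the model gives -3 for rdn, raz and rhaz, and -2 for rup, rtz, rhtz and rne.
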
Hence it should look like

Code:

esp.vmul.s16 q0, q1, q2, sat          // saturate
esp.vmul.s16 q0, q1, q2, trunc        // truncate
esp.vmul.s16 q0, q1, q2, rdn          // round down
esp.vmul.s16 q0, q1, q2, trunc, rtz   // truncate + round toward 0

The S3's ee.vmul... only does truncate + round-toward-0, so, for "compatibility", this may also be the default behavior on the P4.
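
To illustrate what the difference between "sat" and "trunc" would mean for a 16-bit result, here is a rough lane-by-lane model in C. It rests on assumptions: that a 128-bit q register holds eight s16 lanes, that "trunc" keeps the low 16 bits of the shifted product, and that "sat" clamps to the s16 range. vmul_s16_model, narrow_sat and narrow_trunc are hypothetical names, and the shift is passed as a plain parameter rather than read from any register.

```c
#include <assert.h>
#include <stdint.h>

#define LANES 8 /* assumption: 128-bit q register = eight 16-bit lanes */

/* Clamp a 32-bit value into the s16 range. */
static int16_t narrow_sat(int32_t v)
{
    if (v > INT16_MAX) return INT16_MAX;
    if (v < INT16_MIN) return INT16_MIN;
    return (int16_t)v;
}

/* Keep only the low 16 bits, reinterpreted as two's complement. */
static int16_t narrow_trunc(int32_t v)
{
    const uint16_t low = (uint16_t)((uint32_t)v & 0xFFFFu);
    return (int16_t)(low < 0x8000u ? (int32_t)low : (int32_t)low - 0x10000);
}

/* Hypothetical model: out[i] = narrow((a[i] * b[i]) >> shift). */
static void vmul_s16_model(int16_t out[LANES], const int16_t a[LANES],
                           const int16_t b[LANES], unsigned shift,
                           int saturate)
{
    assert(shift < 31);
    const int32_t one = (int32_t)1 << shift;
    for (int i = 0; i < LANES; ++i) {
        const int32_t p = (int32_t)a[i] * b[i]; /* fits: |p| <= 2^30 */
        int32_t q = p / one;
        if (p % one < 0) q -= 1;                /* floor, like an arithmetic shift */
        out[i] = saturate ? narrow_sat(q) : narrow_trunc(q);
    }
}
```

With a lane product of 60000 and no shift, the saturating variant yields 32767, while the truncating variant wraps around to -5536.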

Re: The P4's undocumented SIMD instructions

Post by MicroController » Thu Apr 23, 2026 12:42 pm

Some more, AI-generated, partially accurate, information, including descriptions of instructions' behavior:
github.com/nullislandspace/tanmatsu-simd-tests/blob/main/PIE_REFERENCE.md

Actually completely AI garbage. Please ignore.

vvb333007
Posts: 71
Joined: Wed Jul 31, 2024 5:53 am
Location: Thailand

Re: The P4's undocumented SIMD instructions

Post by vvb333007 » Thu May 07, 2026 10:02 am

MicroController wrote:
(Btw, I also noticed that v3 of the P4 is supposed to have still more additional instructions/registers ("Updated PIE for saturation and rounding") than current/previous versions, so some instructions may be in the list but not yet work on available hardware...)

Most of the useful PIE instructions can be found in Espressif's optimized libraries (signal processing, audio, AI acceleration...). Many of the new instructions are old ones, with ee. replaced by esp. But for the rest, I think we will have to wait a couple of years :).

I'd like to try a new approach for signal capture (logic analyzer): a tight loop on one core reads GPIO values (through PIE) and stores them to IRAM, while the second core slowly moves the data from IRAM to PSRAM. I am pretty sure there are PIE instructions which can speed up the process.
AIs are notoriously bad at ESP32-S3 and P4 assembler.
Thanks!
Slava.

Re: The P4's undocumented SIMD instructions

Post by vvb333007 » Thu May 07, 2026 10:08 am

MicroController wrote:
And while we're at it, let's also give a quick shout-out to the (documented) "hardware loop", which looks like this for example.

On the P4, we can even nest two hardware loops (outer loop: esp.lp.setup 1, ... - inner loop: esp.lp.setup 0, ...). However, I expect these to provide less (performance) benefit than the "zero-overhead loop" on the Xtensas because the P4 already uses branch prediction for the regular branches.

I can try to incorporate your findings into Ghidra. This may be useful.
Thanks!
Slava.

Re: The P4's undocumented SIMD instructions

Post by MicroController » Thu May 07, 2026 12:45 pm

vvb333007 wrote:
...storing values to IRAM. And the second core slowly storing data from IRAM to PSRAM. I am pretty sure there are PIE's which can speed up the process

You can just use memory-to-memory DMA, or let the cache handle it, possibly with a little help to reduce unnecessary transfers.

Re: The P4's undocumented SIMD instructions

Post by MicroController » Thu Jun 11, 2026 4:48 pm

Some more information (brief description of (many of?) the instructions) here:
https://github.com/espressif/esp-dl/blo ... uctions.md
