The P4's undocumented SIMD instructions
-
MicroController
- Posts: 2661
- Joined: Mon Oct 17, 2022 7:38 pm
- Location: Europe, Germany
The P4's undocumented SIMD instructions
As of today, documentation for the P4's "PIE"/SIMD instructions is still "to be added later".
However, I just unearthed this (presumably complete!) list of the 350+ PIE instructions in Espressif's binutils/objdump version:
https://github.com/espressif/binutils-g ... xespv2p1.c
For anyone familiar with the S3's PIE, it's not hard to guess what many of the instructions do - and with a bit of squinting one can come up with pretty reasonable guesses about the new instructions too, like
esp.vadd.u16 qv, qx, qy - unsigned vector addition?
esp.vmul.s16.s8xs8 qz, qv, qx, qy - vector s8xs8->s16 multiplication?
esp.vabs.16 qv, qy - vector absolute value?
esp.vsat.s16 qz, qx, ra, rb - vector saturation/clamping?
esp.vsadds.s16 qv, qx, ra - vector-scalar addition w/ saturation?
...
Anybody feel like speculating about the fancy new instructions, or even testing them out on actual hardware?
(Btw, I also noticed that v3 of the P4 is supposed to have still more additional instructions/registers ("Updated PIE for saturation and rounding") than current/previous versions, so some instructions may be in the list but not yet work on available hardware...)
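To make one of the guesses above concrete, here is a minimal Python sketch of what a "vector-scalar addition w/ saturation" on signed 16-bit lanes would compute, compared with a plain wrapping add. It only models the guessed semantics of esp.vsadds.s16; the function names are made up for illustration, and the lane count of 8 assumes 128-bit Q registers as on the S3. Nothing here is confirmed by documentation or hardware.

```python
# Hypothetical model of the *guessed* behavior of "esp.vsadds.s16 qv, qx, ra":
# add one scalar to every signed 16-bit lane and clamp to the s16 range.
# Assumption: a Q register holds 8 lanes of 16 bits (128 bits, as on the S3).

S16_MIN, S16_MAX = -32768, 32767


def saturate_s16(value):
    """Clamp an integer to the signed 16-bit range."""
    return max(S16_MIN, min(S16_MAX, value))


def wrap_s16(value):
    """Wrap an integer to signed 16 bits, as a non-saturating add would."""
    return ((value + 0x8000) & 0xFFFF) - 0x8000


def vsadds_s16_guess(lanes, scalar):
    """Lane-wise saturating add of a scalar to 8 signed 16-bit lanes."""
    if len(lanes) != 8:
        raise ValueError("expected 8 lanes of 16 bits")
    return [saturate_s16(x + scalar) for x in lanes]


lanes = [0, 1, -1, 32000, -32000, 32767, -32768, 100]
print(vsadds_s16_guess(lanes, 1000))      # saturating result
print([wrap_s16(x + 1000) for x in lanes])  # wrapping result, for comparison
```

Comparing the two printed lists against the output of real hardware would be one way to confirm or refute the guess.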
Last edited by MicroController on Tue Apr 21, 2026 6:44 am, edited 2 times in total.
-
MicroController
Re: The P4's undocumented SIMD instructions
According to the comments in esp-dl, the following RISC-V registers cannot be used with the SIMD instructions:
Code: Select all
t0 (x5)
t1 (x6)
t2 (x7)
a6 (x16)
a7 (x17)
s2 (x18)
s3 (x19)
s4 (x20)
s5 (x21)
Consequently, trying something like esp.vld.128.ip q0, t0, 16 fails to build with an "Error: illegal operands" from the assembler. This is also something to be aware of when using inline assembly, where the compiler may choose to provide an input or output in one of those registers. I don't expect we'll get a gcc version with P4-specific constraints/register classes, but one can use the RISC-V constraint "cr" (= x8...x15) as a workaround.
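As a small aid for hand-written assembly, the register list above can be turned into a quick check. The following Python sketch scans a line of assembly and reports operands from that list. The helper name and the rule "every mnemonic starting with esp. counts as a PIE instruction" are my own assumptions, and the parser only handles plain comma-separated operands.

```python
# Registers listed above as unusable with the PIE/SIMD instructions
# (taken from the post; both ABI names and x-numbers are included).
PIE_EXCLUDED = {
    "t0", "t1", "t2", "a6", "a7", "s2", "s3", "s4", "s5",
    "x5", "x6", "x7", "x16", "x17", "x18", "x19", "x20", "x21",
}


def find_excluded_operands(asm_line):
    """Return the excluded registers used as operands on an esp.* line.

    Assumption: any mnemonic starting with "esp." is treated as a PIE
    instruction. Comments introduced by "//" or "#" are ignored.
    """
    code = asm_line.split("//")[0].split("#")[0].strip()
    if not code.startswith("esp."):
        return []
    parts = code.split(None, 1)  # mnemonic, then the operand string
    if len(parts) < 2:
        return []
    operands = [op.strip() for op in parts[1].split(",")]
    return [op for op in operands if op in PIE_EXCLUDED]


print(find_excluded_operands("esp.vld.128.ip q0, t0, 16"))  # ['t0']
print(find_excluded_operands("esp.vld.128.ip q0, a0, 16"))  # []
```

A check like this only helps with hand-written assembly; for inline assembly the register is chosen by the compiler, which is where the "cr" constraint comes in.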
Last edited by MicroController on Mon Apr 20, 2026 3:33 pm, edited 3 times in total.
-
MicroController
Re: The P4's undocumented SIMD instructions
And while we're at it, let's also give a quick shout-out to the (documented) "hardware loop", which looks like this for example.
On the P4, we can even nest two hardware loops (outer loop: esp.lp.setup 1, ... - inner loop: esp.lp.setup 0, ...). However, I expect these to provide less (performance) benefit than the "zero-overhead loop" on the Xtensas because the P4 already uses branch prediction for the regular branches.
-
MicroController
Re: The P4's undocumented SIMD instructions
So...
We actually have a "...xespv2p1.c" and a "...xespv2p2.c". I assume the "p2" describes the updated/newer version of the PIE (i.e. P4 v3), because it actually includes new "saturating" and "rounding" variants of many instructions:
esp.vmul.s16 qz, qx, qy - vector multiplication + arithmetic right-shift
esp.vmul.s16 qz, qx, qy, vs - saturating or truncating
esp.vmul.s16 qz, qx, qy, vr - rounding
esp.vmul.s16 qz, qx, qy, vs, vr - saturating/truncating + rounding
So the syntax would be
esp.vmul.s16 qz, qx, qy [, saturation][, rounding]
The rounding and saturation arguments are named here:
"rdn" - round down (floor)
"rup" - round up (ceiling)
"raz" - round away from zero
"rtz" - round toward zero
"rhaz" - round half away from zero
"rhtz" - round half toward zero
"rne" - round to nearest even
"dyn" - mystery! (Maybe determined by a value in the "cfg" or some other register?)
Hence it should look like
Code: Select all
esp.vmul.s16 q0, q1, q2, sat // saturate
esp.vmul.s16 q0, q1, q2, trunc // truncate
esp.vmul.s16 q0, q1, q2, rdn // round down
esp.vmul.s16 q0, q1, q2, trunc, rtz // truncate + round toward 0
The S3's ee.vmul... only does truncate + round-toward-0, so, for "compatibility", this may also be the default behavior on the P4.
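For reference, here is a minimal Python sketch of what the named rounding modes conventionally mean when a product is right-shifted, i.e. divided by a power of two. It illustrates the arithmetic definitions only. Whether the PIE applies the modes at exactly this step is an assumption, "dyn" is left out because its meaning is unknown, and the function name is made up.

```python
MODES = ("rdn", "rup", "raz", "rtz", "rhaz", "rhtz", "rne")


def shift_right_rounded(value, shift, mode):
    """Return value / 2**shift, rounded to an integer according to mode."""
    if mode not in MODES:
        raise ValueError("unknown rounding mode: " + mode)
    if shift == 0:
        return value
    floor_q = value >> shift                # Python's >> rounds toward -infinity
    remainder = value - (floor_q << shift)  # 0 <= remainder < 2**shift
    if remainder == 0:
        return floor_q                      # exact result, nothing to round
    negative = value < 0
    if mode == "rdn":                       # round down (floor)
        return floor_q
    if mode == "rup":                       # round up (ceiling)
        return floor_q + 1
    if mode == "rtz":                       # round toward zero
        return floor_q + 1 if negative else floor_q
    if mode == "raz":                       # round away from zero
        return floor_q if negative else floor_q + 1
    # Remaining modes round to nearest and differ only on exact ties.
    half = 1 << (shift - 1)
    if remainder > half:
        return floor_q + 1
    if remainder < half:
        return floor_q
    if mode == "rhaz":                      # tie: away from zero
        return floor_q if negative else floor_q + 1
    if mode == "rhtz":                      # tie: toward zero
        return floor_q + 1 if negative else floor_q
    return floor_q + (floor_q & 1)          # "rne": tie goes to the even neighbor


# -5 / 2 = -2.5 under each mode
for m in MODES:
    print(m, shift_right_rounded(-5, 1, m))
```

Feeding tie cases like -5 >> 1 or 7 >> 1 through the real instruction would be a cheap way to tell the modes apart on hardware that supports them.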
-
MicroController
Re: The P4's undocumented SIMD instructions
Some more, AI-generated, partially accurate, information, including descriptions of instructions' behavior:
github.com/nullislandspace/tanmatsu-simd-tests/blob/main/PIE_REFERENCE.md
Actually completely AI garbage. Please ignore.
Re: The P4's undocumented SIMD instructions
Quote: (Btw, I also noticed that v3 of the P4 is supposed to have still more additional instructions/registers ("Updated PIE for saturation and rounding") than current/previous versions, so some instructions may be in the list but not yet work on available hardware...)
Most useful PIEs can be found in Espressif's optimized libraries (signal processing, audio, AI acceleration...). Many of the new instructions are old ones, with ee. replaced by esp. But for the rest we'll have to wait a couple of years, I think.
I'd like to try a new approach for signal capture (logic analyzer): a tight loop on one core, reading GPIO values (through PIE) and storing them to IRAM, while the second core slowly moves the data from IRAM to PSRAM. I am pretty sure there are PIE instructions which can speed up the process.
AIs are notoriously bad at ESP32-S3 and P4 assembler.
Thanks!
Slava.
Re: The P4's undocumented SIMD instructions
Quote: And while we're at it, let's also give a quick shout-out to the (documented) "hardware loop", which looks like this for example. On the P4, we can even nest two hardware loops (outer loop: esp.lp.setup 1, ... - inner loop: esp.lp.setup 0, ...). However, I expect these to provide less (performance) benefit than the "zero-overhead loop" on the Xtensas because the P4 already uses branch prediction for the regular branches.
I can try to incorporate your findings into Ghidra. This may be useful.
Thanks!
Slava.
-
MicroController
Re: The P4's undocumented SIMD instructions
Quote: ...storing values to IRAM. And the second core slowly storing data from IRAM to PSRAM. I am pretty sure there are PIE's which can speed up the process
You can just use memory-to-memory DMA, or let the cache handle it, possibly with a little help to reduce unnecessary transfers.
-
MicroController
Re: The P4's undocumented SIMD instructions
Some more information (brief description of (many of?) the instructions) here:
https://github.com/espressif/esp-dl/blo ... uctions.md
