The P4's undocumented SIMD instructions

MicroController · Postby **MicroController** » Wed Apr 15, 2026 11:26 am

As of today, documentation for the P4's "PIE"/SIMD instructions is still "to be added later".
However, I just unearthed this (presumably complete!) list of the 350+ PIE instructions in Espressif's binutils/objdump version:
https://github.com/espressif/binutils-g ... xespv2p1.c

For anyone familiar with the S3's PIE, it's not hard to guess what many of the instructions do - and with a bit of squinting one can come up with pretty reasonable guesses about the new instructions too, like

esp.vadd.u16 qv, qx, qy - unsigned vector addition?
esp.vmul.s16.s8xs8 qz, qv, qx, qy - vector s8xs8->s16 multiplication?
esp.vabs.16 qv, qy - vector absolute value?
esp.vsat.s16 qz, qx, ra, rb - vector saturation/clamping?
esp.vsadds.s16 qv, qx, ra - vector-scalar addition w/ saturation?
...
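
To make one of these guesses concrete, here is a minimal scalar model in C of what "vector-scalar addition w/ saturation" on signed 16-bit lanes would compute. This is a sketch of the guessed semantics only, not confirmed hardware behavior. The function names and the lane count (eight 16-bit lanes in a 128-bit q register) are assumptions for illustration.

```c
#include <stdint.h>

/* Assumption: a q register is 128 bits wide, i.e. eight 16-bit lanes. */
#define LANES_S16 8

/* Clamp a 32-bit intermediate result to the int16_t range. */
static int16_t sat_s16(int32_t v)
{
    if (v > INT16_MAX) return INT16_MAX;
    if (v < INT16_MIN) return INT16_MIN;
    return (int16_t)v;
}

/* Hypothetical reference model for the guess about
 * "esp.vsadds.s16 qv, qx, ra": add one scalar to every lane, saturating. */
void model_vsadds_s16(int16_t out[LANES_S16],
                      const int16_t x[LANES_S16],
                      int16_t scalar)
{
    for (int i = 0; i < LANES_S16; i++) {
        /* Widen to 32 bits so the sum cannot overflow before clamping. */
        out[i] = sat_s16((int32_t)x[i] + (int32_t)scalar);
    }
}
```

A model like this is handy as the "expected" side when testing on actual hardware: run the instruction on a few edge-case vectors and compare lane by lane.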

Anybody feel like speculating about the fancy new instructions, or even testing them out on actual hardware?

(Btw, I also noticed that v3 of the P4 is supposed to have still more additional instructions/registers ("Updated PIE for saturation and rounding") than current/previous versions, so some instructions may be in the list but not yet work on available hardware...)

MicroController · Postby **MicroController** » Wed Apr 15, 2026 10:02 pm

According to the comments in esp-dl, the following RISC-V registers cannot be used with the SIMD instructions:

t0 (x5)
t1 (x6)
t2 (x7)
a6 (x16)
a7 (x17)
s2 (x18)
s3 (x19)
s4 (x20)
s5 (x21)

Consequently, something like esp.vld.128.ip q0, t0, 16 fails to build with an "Error: illegal operands" from the assembler. This also matters for inline assembly, where the compiler may choose to provide an input or output in one of those registers. I don't expect we'll get a gcc version with P4-specific constraints/register classes, but one can use the RISC-V constraint "cr" (= x8...x15) as a work-around.
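
A quick way to sanity-check the "cr" work-around against the list above is a small C helper. This is only a sketch: the bitmask is transcribed from the register list in this post (which itself comes from esp-dl comments), and the helper names are made up for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit n set => xn is on the exclusion list quoted in this post. */
static const uint32_t PIE_EXCLUDED_XREGS =
    (1u << 5)  | (1u << 6)  | (1u << 7)  |              /* t0, t1, t2 */
    (1u << 16) | (1u << 17) |                           /* a6, a7     */
    (1u << 18) | (1u << 19) | (1u << 20) | (1u << 21);  /* s2..s5     */

/* Hypothetical helper: true if xN reportedly cannot be a PIE operand. */
bool pie_xreg_excluded(unsigned xreg)
{
    return xreg < 32 && ((PIE_EXCLUDED_XREGS >> xreg) & 1u) != 0;
}

/* The "cr" constraint restricts the compiler to x8..x15. */
bool xreg_in_cr_class(unsigned xreg)
{
    return xreg >= 8 && xreg <= 15;
}
```

None of x8...x15 is on the exclusion list, so an operand constrained with "cr" can never land in one of the rejected registers. The price is a smaller register pool for the compiler.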

Postby **Sprite** » Thu Apr 16, 2026 9:58 am

Potentially this is also of interest.

MicroController · Postby **MicroController** » Thu Apr 16, 2026 12:25 pm

And while we're at it, let's also give a quick shout-out to the (documented) "hardware loop", which looks like this for example.

On the P4, we can even nest two hardware loops (outer loop: esp.lp.setup 1, ... - inner loop: esp.lp.setup 0, ...). However, I expect these to provide less (performance) benefit than the "zero-overhead loop" on the Xtensas because the P4 already uses branch prediction for the regular branches.

MicroController · Postby **MicroController** » Sat Apr 18, 2026 10:43 am

So...
We actually have a "...xespv2p1.c" and a "...xespv2p2.c". I assume the "p2" describes the updated/newer version of the PIE (i.e. P4 v3), because it actually includes new "saturating" and "rounding" variants of many instructions:

esp.vmul.s16 qz, qx, qy - vector multiplication + arithmetic right-shift
esp.vmul.s16 qz, qx, qy, vs - saturating or truncating
esp.vmul.s16 qz, qx, qy, vr - rounding
esp.vmul.s16 qz, qx, qy, vs, vr - saturating/truncating + rounding

So the syntax would be
esp.vmul.s16 qz, qx, qy [, saturation][, rounding]

The rounding and saturation arguments are named here:

"rdn" - round down (floor)
"rup" - round up (ceiling)
"raz" - round away from zero
"rtz" - round toward zero
"rhaz" - round half away from zero
"rhtz" - round half toward zero
"rne" - round to nearest even
"dyn" - mystery! (Maybe determined by a value in the "cfg" or some other register?)

Hence it should look like

esp.vmul.s16 q0, q1, q2, sat          // saturate
esp.vmul.s16 q0, q1, q2, trunc        // truncate
esp.vmul.s16 q0, q1, q2, rdn          // round down
esp.vmul.s16 q0, q1, q2, trunc, rtz   // truncate + round toward 0

The S3's ee.vmul... only does truncate + round-toward-0, so, for "compatibility", this may also be the default behavior on the P4.
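
For reference, here is a scalar C sketch of what the rounding names listed above would mean when a product is scaled down by an arithmetic right shift. It only models the arithmetic meaning of each name. Whether the hardware applies the modes exactly like this is an assumption, "dyn" is left out because its meaning is unknown, and the enum and function names are made up for illustration.

```c
#include <stdint.h>

/* Names mirror the rounding arguments from the list; "dyn" is not modeled. */
typedef enum { RDN, RUP, RAZ, RTZ, RHAZ, RHTZ, RNE } round_mode;

/* Compute v / 2^shift rounded according to mode.
 * Assumption for this sketch: 0 <= shift <= 31. */
int32_t round_shift(int32_t v, unsigned shift, round_mode mode)
{
    if (shift == 0) return v;

    int64_t d = (int64_t)1 << shift;  /* divisor 2^shift */
    int64_t q = v / d;                /* C division truncates toward zero */
    int64_t r = v % d;                /* remainder, same sign as v */
    if (r == 0) return (int32_t)q;    /* exact result, nothing to round */

    int neg = v < 0;
    int64_t ar = neg ? -r : r;             /* magnitude of the remainder */
    int64_t half = d / 2;                  /* exactly one half in units of r */
    int64_t away = neg ? q - 1 : q + 1;    /* next integer away from zero */

    switch (mode) {
    case RTZ:  return (int32_t)q;
    case RAZ:  return (int32_t)away;
    case RDN:  return (int32_t)(neg ? away : q);   /* floor */
    case RUP:  return (int32_t)(neg ? q : away);   /* ceiling */
    case RHAZ: return (int32_t)(ar >= half ? away : q);
    case RHTZ: return (int32_t)(ar >  half ? away : q);
    case RNE:
        if (ar > half) return (int32_t)away;
        if (ar < half) return (int32_t)q;
        return (int32_t)((q % 2 == 0) ? q : away); /* tie: pick the even one */
    }
    return (int32_t)q;
}
```

With shift = 1, the value 5 (i.e. 2.5) gives 2 for rdn, rtz, rhtz and rne, and 3 for rup, raz and rhaz, which makes it a convenient probe value for telling the modes apart on real hardware.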

MicroController · Postby **MicroController** » Thu Apr 23, 2026 12:42 pm

Some more, AI-generated, partially accurate, information, including descriptions of instructions' behavior:
github.com/nullislandspace/tanmatsu-simd-tests/blob/main/PIE_REFERENCE.md
Actually completely AI garbage. Please ignore.

vvb333007 · Postby **vvb333007** » Thu May 07, 2026 10:02 am

MicroController wrote: (Btw, I also noticed that v3 of the P4 is supposed to have still more additional instructions/registers ("Updated PIE for saturation and rounding") than current/previous versions, so some instructions may be in the list but not yet work on available hardware...)

The most useful PIE instructions can be found in Espressif's optimized libraries (signal processing, audio, AI acceleration...). Many of the new instructions are old ones, with ee. replaced by esp. But for the rest we'll have to wait a couple of years, I think.

I'd like to try a new approach for signal capture (logic analyzer): a tight loop on one core reads GPIO values (through PIE) and stores them to IRAM, while the second core slowly moves the data from IRAM to PSRAM. I am pretty sure there are PIE instructions which can speed up the process.
AIs are notoriously bad at ESP32-S3 and P4 assembler.

vvb333007 · Postby **vvb333007** » Thu May 07, 2026 10:08 am

MicroController wrote: And while we're at it, let's also give a quick shout-out to the (documented) "hardware loop", which looks like this for example.

On the P4, we can even nest two hardware loops (outer loop: esp.lp.setup 1, ... - inner loop: esp.lp.setup 0, ...). However, I expect these to provide less (performance) benefit than the "zero-overhead loop" on the Xtensas because the P4 already uses branch prediction for the regular branches.

I can try to incorporate your findings into Ghidra. This may be useful.

MicroController · Postby **MicroController** » Thu May 07, 2026 12:45 pm

vvb333007 wrote: ...storing values to IRAM. And the second core slowly storing data from IRAM to PSRAM. I am pretty sure there are PIE's which can speed up the process

Can just use memory-to-memory DMA, or let the cache handle it, possibly with a little help to reduce unnecessary transfers.

MicroController · Postby **MicroController** » Thu Jun 11, 2026 4:48 pm

Some more information (brief description of (many of?) the instructions) here:
https://github.com/espressif/esp-dl/blo ... uctions.md
