Its a lot slower! It does support using a zero overhead loop to shift data when it can. So perhaps it spends too much time deciding to use it or doesn't end up using it for some reason.A generic memcpy function, like implemented in the DSP lib, must be able to deal with all kinds of unaligned memory addresses as input and this necessitates some overhead which may contribute to it appearing a little slower than theoretically possible.
Code: Untitled.asm Select all
// Main loop arr_src aligned
loopnez a5, ._main_loop_aligned // 32 bytes in one loop
ee.vld.128.ip q0, a3, 16 // load 16 bytes from arr_src to q0
ee.vld.128.ip q1, a3, 16 // load 16 bytes from arr_src to q1
ee.vst.128.ip q0, a2, 16 // save 16 bytes to arr_dest from q0
ee.vst.128.ip q1, a2, 16 // save 16 bytes to arr_dest from q1
._main_loop_aligned:
