Audio processing on esp32-s3 using esp-dsp library

Post by **talha94** » Tue Jul 22, 2025 6:10 am

Hello!
I need some help to improve audio vocal quality and reduce the background noise in the captured audio.
I am working on a project with an I2S MEMS microphone and a speaker, using two ESP32-S3 devkits. I am working in PlatformIO with the Arduino framework.
My mic is configured to capture mono audio at a 16000 Hz sampling rate with 16 bits per sample. My audio buffer size is 1024 bytes (512 samples).
I use an RTOS task to capture audio and send it to a queue. I use ESP-NOW to send the audio to my other ESP32-S3. There I put the received buffer in a queue and then write the audio to my speaker, which is connected through an amplifier (MAX98357A).
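To make the buffer sizes in this setup concrete, here is a small C sketch of the arithmetic. The helper names are hypothetical, and the 250-byte payload used in the note below is an assumption (the classic ESP-NOW limit; newer versions may allow more), so treat the packet count as illustrative only.

```c
#include <stddef.h>
#include <stdint.h>

/* Duration of one buffer in milliseconds (integer division, rounds down). */
uint32_t frame_duration_ms(uint32_t samples, uint32_t sample_rate_hz)
{
    return (samples * 1000u) / sample_rate_hz;
}

/* Number of packets needed to carry `bytes` when each packet
   holds at most `payload` bytes (ceiling division). */
size_t packets_needed(size_t bytes, size_t payload)
{
    return (bytes + payload - 1) / payload;
}
```

With 512 samples at 16000 Hz, `frame_duration_ms(512, 16000)` gives 32 ms per buffer, and under the assumed 250-byte payload, `packets_needed(1024, 250)` gives 5 packets per buffer.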
If I write the audio buffers without processing, the audio amplitude is very low, and I have to put my ear to the speaker to hear it. So I decided to create another task to process the audio. Now, after capturing the audio, I send it to the processing queue instead of the sending queue. The processing task takes the audio buffer from that queue, processes it, and then sends it to the sending queue.
In the processing task, I added some gain to the audio. The problem is that the background noise, which was barely audible before, is amplified as well. In complete silence there is almost no noise, but even a fan or an air conditioner introduces background noise (it sounds like TV static).
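A minimal sketch of a fixed gain stage on 16-bit samples, for illustration only. The post does not show the actual gain code, so the function name `apply_gain_q8` and the Q8 gain format (256 = unity) are assumptions.

```c
#include <stddef.h>
#include <stdint.h>

/* Multiply each sample by gain_q8 / 256 and saturate to the int16 range.
   gain_q8 = 256 is unity gain, 1024 is 4x. Assumes 0 <= gain_q8 < 65536
   so the 32-bit product cannot overflow. */
void apply_gain_q8(int16_t *samples, size_t n, int32_t gain_q8)
{
    for (size_t i = 0; i < n; i++) {
        int32_t y = ((int32_t)samples[i] * gain_q8) / 256;
        if (y > INT16_MAX) y = INT16_MAX;        /* clip instead of wrapping */
        else if (y < INT16_MIN) y = INT16_MIN;
        samples[i] = (int16_t)y;
    }
}
```

A fixed gain like this scales speech and noise by the same factor, so the signal-to-noise ratio is unchanged. That is consistent with the noise becoming audible once the signal is amplified.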
I have tried using a band-pass filter to attenuate the higher and lower frequencies to reduce the noise. A band-pass filter with a small quality factor barely reduces the background noise, and one with a higher quality factor reduces the vocal quality significantly.
In my testing, the best vocal quality is the unprocessed audio that I captured from the mic but that audio also contains any and all background noise present.
After playing with different high-pass, low-pass and band-pass filters, I have figured out that there is vocal data present even in the extreme high frequencies. If I remove any frequencies completely, the effect shows up as reduced vocal quality. This leads me to the conclusion that just targeting frequencies isn't the best approach for noise reduction. I need some method to distinguish between vocals and noise.
I have also tested a voice activity detection algorithm. It works great during silent periods and gives complete silence on the speaker, but when vocals are present the background noise is there as well.
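The post does not say which voice activity detection algorithm was tested. As a generic illustration, a simple energy-threshold gate can be sketched as follows; the function names and the threshold parameter are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Mean squared amplitude of one frame (returns 0 for an empty frame). */
uint32_t frame_energy(const int16_t *x, size_t n)
{
    if (n == 0) return 0;
    uint64_t acc = 0;
    for (size_t i = 0; i < n; i++) {
        acc += (uint64_t)((int32_t)x[i] * (int32_t)x[i]);
    }
    return (uint32_t)(acc / n);
}

/* Returns 1 and leaves the frame untouched if it is treated as speech;
   otherwise zeroes the frame and returns 0. */
int vad_gate(int16_t *x, size_t n, uint32_t threshold)
{
    if (frame_energy(x, n) >= threshold) return 1;
    for (size_t i = 0; i < n; i++) x[i] = 0;
    return 0;
}
```

A gate of this kind makes a per-frame decision: a frame classified as speech is passed through whole, including whatever noise it contains, which matches the behaviour described above.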
I tried spectral subtraction. I recorded an audio buffer during a silent period and then performed an FFT (fast Fourier transform) on my audio buffer and my noise buffer. Then I calculated the magnitudes and phases of the frequency bins and subtracted the noise magnitude from the mic audio magnitude. Then I performed an inverse FFT to get the audio signal back. It was somewhat effective, but I haven't heard purely clean, good-quality audio even once.
I want crisp audio quality and no background noise at all. Is it possible to achieve this on esp32-s3? I have been working on this on and off for almost half a year and it seems to me that I have made no progress at all. Either there is background noise or the vocals are no longer crisp.
Any help or guidance is appreciated. Thanks.

Here is my code:
I2S setup:

micConfig = {
    .mode                 = (i2s_mode_t)(I2S_MODE_MASTER | I2S_MODE_RX),
    .sample_rate          = sampleRate,
    .bits_per_sample      = bitRate,
    .channel_format       = micChannel,
    .communication_format = I2S_COMM_FORMAT_STAND_I2S,
    .intr_alloc_flags     = ESP_INTR_FLAG_LEVEL1,
    .dma_buf_count        = 8,
    .dma_buf_len          = 256,
    .use_apll             = true
};

micPins = {
    .bck_io_num           = micClockPin,
    .ws_io_num            = micWordSelectPin,
    .data_out_num         = I2S_PIN_NO_CHANGE,
    .data_in_num          = micDataPin
};

i2s_driver_install(micPort, &micConfig, 0, NULL);
i2s_set_pin(micPort, &micPins);

audio capturing task:

void audio_task(void *arg)
{
    DEBUGLN("Audio task started");
    while (true) {
        // read one buffer from the mic
        mic.read(vocalSamples, AUDIO_BUFFER_SIZE);
        // non-blocking send; if the queue is full, drop the oldest buffer and retry
        if (xQueueSend(vocal_queue, vocalSamples, 0) != pdTRUE) {
            ESP_LOGW(TAG, "Audio queue full, dropping oldest");
            xQueueReceive(vocal_queue, dropBuf, 0);
            xQueueSend(vocal_queue, vocalSamples, 0);
        }
        vTaskDelay(1);
    }
}

initializing my band-pass filter coefficients:

            const float low_hz   = 300.0f;    // lower cutoff
            const float high_hz  = 3000.0f;   // upper cutoff
            const float fs       = 16000.0f;  // sample rate

            float  f0    = sqrtf(low_hz * high_hz);
            float  Q     = 0.707f;                  
            float  normF = f0 / fs;

            if (dsps_biquad_gen_bpf0db_f32(coeffs, normF, Q) != ESP_OK) {
                return false;
            }

applying my band pass filter:

        dsps_biquad_f32_ansi(in_f, out_f, N, coeffs, state);

        for (int i = 0; i < N; i++) {
            float y = out_f[i];
            if      (y >  32767.0f) y =  32767.0f;
            else if (y < -32768.0f) y = -32768.0f;
            out_samples[i] = (int16_t)y;
        }
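To show what a biquad stage computes per sample, here is a portable direct form II sketch. It is an illustration, not the ESP-DSP implementation; the function name and the coefficient order {b0, b1, b2, a1, a2} are assumptions. The point it demonstrates is that the two-element delay line has to persist from one audio buffer to the next.

```c
#include <stddef.h>

/* Direct form II biquad. coef = {b0, b1, b2, a1, a2} with a0 = 1.
   w[2] is the delay line; keep it between calls so consecutive
   buffers are filtered as one continuous stream. */
void biquad_df2(const float *in, float *out, size_t n,
                const float coef[5], float w[2])
{
    for (size_t i = 0; i < n; i++) {
        float d0 = in[i] - coef[3] * w[0] - coef[4] * w[1];
        out[i] = coef[0] * d0 + coef[1] * w[0] + coef[2] * w[1];
        w[1] = w[0];
        w[0] = d0;
    }
}
```

If the delay line were cleared at the start of every buffer, each buffer boundary would restart the filter and could add a small discontinuity to the output.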

spectral subtraction on vocal and noise:

        // forward FFT (in place), then undo the bit-reversed ordering
        dsps_fft2r_fc32(fft_data1, N);
        dsps_bit_rev2r_fc32(fft_data1, N);

        // magnitude of each bin
        float mag1_sq = real1*real1 + imag1*imag1;
        float mag1 = sqrtf(mag1_sq);

        // phase of each bin
        float phase1 = atan2f(imag1, real1);

        // subtract the noise magnitude; clamp so the result never goes negative
        float enhanced_mag = mag1 - mag2;
        if (enhanced_mag < 0.0f) enhanced_mag = 0.0f;

        // reconstruct each bin with the original phase
        float cos_phase = cosf(phase1);
        float sin_phase = sinf(phase1);
        fft_data1[2*i] = enhanced_mag * cos_phase;
        fft_data1[2*i+1] = enhanced_mag * sin_phase;

        // conjugate the spectrum so that a forward FFT acts as the inverse
        for (int i = 0; i < N; i++) {
            fft_data1[2*i + 1] = -fft_data1[2*i + 1];
        }

        // forward FFT of the conjugate gives the audio buffer back
        // (scaled by N if the forward transform is unnormalized)
        dsps_fft2r_fc32(fft_data1, N);
        dsps_bit_rev2r_fc32(fft_data1, N);
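The conjugate trick for the inverse transform can be checked on a tiny example. The sketch below uses a hand-written, unnormalized 4-point DFT (not ESP-DSP) so that it runs anywhere; the function names are hypothetical. For an unnormalized forward transform, the full identity is: conjugate, forward transform, conjugate again, then divide by N.

```c
#include <stddef.h>

/* Unnormalized 4-point forward DFT on interleaved complex data
   [re0, im0, ..., re3, im3]. The twiddle factors are exactly 1, -j, -1, +j. */
void dft4(const float *in, float *out)
{
    static const float wr[4] = { 1.0f,  0.0f, -1.0f, 0.0f };
    static const float wi[4] = { 0.0f, -1.0f,  0.0f, 1.0f };
    for (size_t k = 0; k < 4; k++) {
        float re = 0.0f, im = 0.0f;
        for (size_t n = 0; n < 4; n++) {
            size_t t = (k * n) % 4;   /* twiddle index */
            re += in[2 * n] * wr[t] - in[2 * n + 1] * wi[t];
            im += in[2 * n] * wi[t] + in[2 * n + 1] * wr[t];
        }
        out[2 * k] = re;
        out[2 * k + 1] = im;
    }
}

/* Inverse via the forward transform: conjugate, forward DFT, conjugate, scale by 1/4. */
void idft4_via_forward(const float *spec, float *out)
{
    float tmp[8];
    for (size_t i = 0; i < 4; i++) {
        tmp[2 * i] = spec[2 * i];
        tmp[2 * i + 1] = -spec[2 * i + 1];
    }
    dft4(tmp, out);
    for (size_t i = 0; i < 4; i++) {
        out[2 * i] = out[2 * i] / 4.0f;
        out[2 * i + 1] = -out[2 * i + 1] / 4.0f;
    }
}
```

For a real-valued signal the final conjugate only affects imaginary parts that are close to zero anyway, but the 1/N scale still matters: without it the recovered samples come out N times larger than the input.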

Post by **MicroController** » Wed Jul 23, 2025 9:23 pm

Wow! Impressive work!

Not a signal processing guy, but:
Can you use a more directional sound pickup (->beamforming using 2 or more microphones)? This can reduce noise significantly, see e.g. Benn Jordan's video on 'acoustic spying'. Or do you need omni-directionality?

Btw: If you run into performance issues with audio processing, the S3 has powerful SIMD instructions that are well-suited to speed up many 16-bit signal processing tasks by a factor of ~8, and even more when compared to floating point operations.

Post by **talha94** » Thu Jul 24, 2025 7:22 am

Ok, thanks for the response.
I have played with two mics a little bit. Not for beam forming, of course. I was thinking more of using one mic to capture the noise and subtracting it from the other. I need to look into this beam forming idea and how it can be achieved.
Using multiple mics comes with its own challenges. As far as I understand, when using 2 mics, to have them completely synced you need to connect their pins together (word select, serial clock and data pins), then set one mic to left only and the other mic to right only.
Then in the firmware you read the audio as stereo, so audio from one mic will be stored in one channel and audio from the other mic will be stored in the other channel like [L, R, L, R, ..., L, R]. (Disclaimer: I haven't actually tested this because I burned all my mics trying to figure this out and testing different methods to combine mic audio. I have only one mic left, but I think this is the correct method.)
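A minimal sketch of splitting such an interleaved stereo buffer into two mono buffers. The function name is hypothetical, and which mic ends up in which slot is an assumption here, since the actual channel order depends on the driver configuration and the mic wiring.

```c
#include <stddef.h>
#include <stdint.h>

/* Split n_frames interleaved stereo frames [L, R, L, R, ...] into two mono buffers. */
void deinterleave_stereo(const int16_t *stereo, size_t n_frames,
                         int16_t *left, int16_t *right)
{
    for (size_t i = 0; i < n_frames; i++) {
        left[i]  = stereo[2 * i];       /* even indices: first slot of each frame */
        right[i] = stereo[2 * i + 1];   /* odd indices: second slot of each frame */
    }
}
```

Since both channels come from the same frame, `left[i]` and `right[i]` were sampled on the same word-select cycle, which is what keeps the two streams aligned.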
Without this, there is no way to sync the mic audio perfectly just from inside the firmware. I have tried using two separate tasks to capture audio from both mics. I have tried using event groups, and semaphores to make the tasks loop in lockstep, but there is always a little bit of slip between the buffers without the same clock.
The problem that I faced here is the same as before, that I wasn't able to get a clean noise buffer from the second mic to subtract it from the vocals. The audio in the second mic also contained vocals. And for vocal removal I again relied on the techniques mentioned previously (frequency filters).
I want a method that can distinguish between vocal and noise in the same frequencies. Sadly, I haven't even heard of anything like this except machine learning and AI. I will try them as well but with the obvious limitations of the ESP32. I have tested some Python libs for noise suppression just to test what is possible in the realm of noise reduction. And the results were pretty good. Maybe I will have to look inside those libraries to figure out what is happening inside. I took a cursory glance and most of it flew over my head.
I haven't used SIMD before, thanks for that info. I will look into it for further optimization. I haven't faced any computational delay yet and everything is in real time.

Post by **MicroController** » Fri Jul 25, 2025 9:09 am

talha94 wrote: I have played with two mics a little bit. Not for beam forming, of course. I was thinking more of using one mic to capture the noise and subtracting it from the other.

Yep, that's the most basic type of beam forming.

Two mics, one facing 'forward' and one facing in the opposite direction; subtract one from the other (add with a "phase shift of 180°") and you cancel out much of the sound coming from the sides.
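The subtraction described here can be sketched on two already-aligned 16-bit buffers as follows. The function name is hypothetical, and the sketch assumes the two streams are sample-synchronous; it does not model mic spacing or any delay compensation.

```c
#include <stddef.h>
#include <stdint.h>

/* out[i] = front[i] - back[i], saturated to the int16 range. */
void subtract_mics(const int16_t *front, const int16_t *back,
                   int16_t *out, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        int32_t d = (int32_t)front[i] - (int32_t)back[i];
        if (d > INT16_MAX) d = INT16_MAX;
        else if (d < INT16_MIN) d = INT16_MIN;
        out[i] = (int16_t)d;
    }
}
```

Sound that reaches both mics at the same time and level produces equal samples in both buffers and cancels in the difference, while sound that reaches one mic earlier or louder than the other leaves a residual.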

talha94 wrote: As far as I understand when using 2 mics, to have them completely synced you need to combine all their pins (word select, serial clock and data pins), then set one mic to only left and the other mic to only right. Then in the firmware you read the audio as stereo, so audio from one mic will be stored in one channel and audio from the other mic will be stored in the other channel like [L, R, L, R, ..., L, R].

That would be the right approach and should indeed give the best correlation between the samples of both mics. If you can synchronize two mono audio streams to within 1 sample period in some other way that should be fine too, and since the two sample rates are locked to the same master clock, once you have the streams sync'ed, they automatically stay in sync. (Syncing in software could be achieved by calculating the correlation between the two streams for offsets between 0 and n samples and using the offset which produces the highest correlation.)
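The correlation search mentioned in parentheses can be sketched like this. The function name and parameters are hypothetical, and the sketch assumes stream `b` lags stream `a` by somewhere between 0 and `max_lag` samples and that `b` holds at least `n + max_lag` samples.

```c
#include <stddef.h>
#include <stdint.h>

/* Returns the lag in 0..max_lag that maximizes sum(a[i] * b[i + lag]) over i = 0..n-1. */
size_t best_offset(const int16_t *a, const int16_t *b, size_t n, size_t max_lag)
{
    size_t best = 0;
    int64_t best_corr = INT64_MIN;
    for (size_t lag = 0; lag <= max_lag; lag++) {
        int64_t corr = 0;   /* 64-bit accumulator avoids overflow on long buffers */
        for (size_t i = 0; i < n; i++) {
            corr += (int64_t)a[i] * (int64_t)b[i + lag];
        }
        if (corr > best_corr) {
            best_corr = corr;
            best = lag;
        }
    }
    return best;
}
```

Once the offset is known, reading `b` from index `best` onwards lines it up with `a`; with both mics on the same master clock, that offset should not drift afterwards.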

talha94 wrote: The audio in the second mic also contained vocals.

Yes, it does. And by subtracting the signals you even reduce the vocals' amplitude a little; but the out-of-direction noise gets reduced much more, so the SNR improves and stays higher even after amplifying the resulting signal again.

talha94 wrote: I want a method that can distinguish between vocal and noise in the same frequencies. Sadly, I haven't even heard of anything like this except machine learning and AI.

I think there were approaches to this long before machine learning/AI in today's form were feasible.
One basic approach, which I think you already tried, is to sample the noise background and then subtract its spectral intensities from the noisy signal of interest.

talha94 wrote: I haven't used SIMD before, thanks for that info. I will look into it for further optimization.

I have a bit of experience using the S3's SIMD instructions, so I may be able to help.

They are documented in the S3's TRM and also referred to as "PIE" instructions if you want to search for resources online.
(Using the SIMD, mixing 8 16-bit stereo samples down to 8 mono samples, for example, should take about 6-7 CPU clock cycles including loading/storing from/to RAM.)
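For comparison, a plain scalar version of that stereo-to-mono mixdown is sketched below. It is a portable reference only and uses no SIMD (PIE) instructions; the function name is hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Average each interleaved [L, R] frame into one mono sample. */
void downmix_stereo_to_mono(const int16_t *stereo, size_t n_frames, int16_t *mono)
{
    for (size_t i = 0; i < n_frames; i++) {
        /* widen to 32 bits so the sum cannot overflow */
        int32_t sum = (int32_t)stereo[2 * i] + (int32_t)stereo[2 * i + 1];
        mono[i] = (int16_t)(sum / 2);   /* integer division truncates toward zero */
    }
}
```

A loop like this handles one frame per iteration, which is the baseline a vectorized version processing several 16-bit samples per instruction would be measured against.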

Post by **talha94** » Tue Jul 29, 2025 7:04 am

I am very much interested in these kinds of optimizations. I try to optimize my code as much as possible, but my current knowledge is limited in this field. I will look into the info you provided. Thanks

Post by **MicroController** » Tue Jul 29, 2025 9:39 am

For performance, try to avoid floating point operations, especially since your input and output are already signed 16-bit integers.
When using ESP-DSP use the "..._s16_..." functions where you can; only these functions can use the S3's SIMD instructions.
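As an illustration of staying in 16-bit integers, here is a sketch of a multiply in the common Q15 fixed-point convention, where an `int16_t` value v stands for v / 32768. The function name is hypothetical, and the thread does not state which scaling a given ESP-DSP "s16" function expects, so treat the format as an assumption.

```c
#include <stdint.h>

/* Multiply two Q15 values with rounding and saturation.
   Relies on >> being an arithmetic shift for negative values,
   which holds for GCC and Clang. */
int16_t q15_mul(int16_t a, int16_t b)
{
    int32_t p = (int32_t)a * (int32_t)b;   /* Q30 product */
    p = (p + (1 << 14)) >> 15;             /* round, then return to Q15 */
    if (p > INT16_MAX) p = INT16_MAX;      /* only -1.0 * -1.0 can overflow */
    return (int16_t)p;
}
```

For example, 0.5 is 16384 in Q15, and `q15_mul(16384, 16384)` returns 8192, which is 0.25.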
