Descent into madness (FFT noise reduction)
Rewind about a week. After months of great performance from my Frankenstein voice recognition setup -- headset mic --> JACK --> PulseAudio --> VirtualBox running WinXP --> Dragon NaturallySpeaking -- suddenly dictation accuracy went into the toilet. What's more, it was bad only in Linux; Windows, no problem.
At this point, most users would complain to an online forum and/or buy a new microphone, and hope for the best. But I'm... an audio professional, so that's not nearly enough. (Note: Here's where it's going to start getting a bit technical. It will get a lot more technical as we go...)
So... first, compare recordings from Windows and Linux.
This shows two problems. 1 - the signal level in Linux is a lot lower. The dictation software can't tell when I'm speaking and when I'm not. 2 - the Linux signal is not centered around 0. In technical terms, it has a negative "DC (Direct Current) offset."
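As a rough illustration of the two measurements (the sample values are invented, not taken from the recordings), the DC offset is simply the mean of the samples, and the level can be estimated as the RMS of what is left once that mean is removed. A minimal sclang sketch that runs without booting the server:

```supercollider
(
// Invented sample values standing in for a short stretch of a recording
// that sits below zero. Not real data from the recordings above.
var samples = [-0.12, -0.08, -0.11, -0.09, -0.13, -0.07];
var dcOffset = samples.mean;            // average value; 0 for a centered signal
var centered = samples - dcOffset;      // subtract the offset from every sample
var rms = centered.squared.mean.sqrt;   // level of what remains after centering
("DC offset: " ++ dcOffset).postln;
("RMS after centering: " ++ rms).postln;
)
```

A clearly negative mean together with a small RMS is the pattern described here: a quiet signal that is also shifted away from zero.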
But why stop there? I also happen to be in possession of a state-of-the-art real-time digital signal processing engine (SuperCollider), and the architecture in Linux allows me to connect it to any other audio application. So what happens if I tell SuperCollider to remove the DC offset (LeakDC unit generator), and insert SC between the microphone and VirtualBox? Not quite a miracle, but the speech recognition improved dramatically. It was short of where it was before, but nearly usable. Without the extra processing, it was impossible.
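The smallest version of that idea might look like the sketch below. It assumes the default server `s` and that hardware input 0 is the microphone; the routing from SC's output to VirtualBox is done in JACK and is not shown.

```supercollider
(
// Assumption: default server, microphone on the first hardware input.
s.waitForBoot {
    // LeakDC is a high-pass filter that removes a constant (DC) offset.
    // The cleaned signal goes to output 0, to be patched onward in JACK.
    { LeakDC.ar(SoundIn.ar(0)) }.play;
};
)
```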
LeakDC took care of problem #2. Problem #1 is not only that the signal level is lower. The voice signal is softer, but the noise level is about the same. If I use SC to amplify the signal, the noise will also be louder, and that's just as bad for dictation accuracy.
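A quick arithmetic check of why plain amplification does not help, with invented RMS levels for voice and noise. The gain multiplies both, so the ratio between them stays the same:

```supercollider
(
// Invented levels, only to show the arithmetic.
var voice = 0.05, noise = 0.01;
var gain = 8;
var before = (voice / noise).ampdb;                      // ratio in dB before gain
var after = ((voice * gain) / (noise * gain)).ampdb;     // ratio in dB after gain
[before, after].postln;   // the two values match, up to floating-point rounding
)
```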
Enter the technique used for noise removal in a variety of software: analyze a short recording of the noise for its spectral characteristics, and then subtract those characteristics from the signal to be cleaned up. Well, SuperCollider can do that (with some help from Dan Stowell's excellent FFT unit generators in the sc3-plugins package)!
- Record the noise into a buffer.
    SynthDef(\rec, {
        // Remove the DC offset, record one pass into samplebuf, then free the synth.
        RecordBuf.ar(LeakDC.ar(SoundIn.ar(0)), samplebuf, loop: 0,
            doneAction: 2);
    }).send(z);
    z.sync;
- Get the audio out of the buffer, apply a windowing envelope, and calculate the Fast Fourier Transform. ('nrfactor' only amplifies the noise level before the FFT.)
    // Fetch the recorded samples from the server and wait until they arrive.
    fork {
        samplebuf.getToFloatArray(wait: 0.05, action: { |data_in|
            data = data_in;
            cond.unhang;
        });
    };
    cond.hang;
    // Apply a Hamming window, then take the FFT of the scaled noise sample.
    hamm = Signal.hammingWindow(data.size);
    data = Signal.fill(data.size, { |i| data[i] * hamm[i] });
    fftdata = (data * nrfactor).fft(Signal.newClear(data.size),
        Signal.fftCosTable(data.size));
- Put the FFT into a new buffer. The real and imaginary parts of each bin have to be interleaved: [].flop.flat.
    // Send the interleaved FFT data to the server and wait until it is loaded.
    fork {
        fftbuf = Buffer.sendCollection(
            z,
            [fftdata.real, fftdata.imag].flop.flat,
            1, 0.05, {
                cond.unhang;
            }
        );
    };
    cond.hang;
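To see what the interleaving idiom does, here is a toy example with made-up numbers in place of real FFT data. `flop` pairs up the elements of the two arrays, and `flat` removes the nesting:

```supercollider
(
// Tiny made-up "spectrum" so the reordering is easy to see.
var re = [1, 2, 3];
var im = [10, 20, 30];
[re, im].flop.flat.postln;   // -> [ 1, 10, 2, 20, 3, 30 ]
)
```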
It's a little tricky to use the new buffer. "PV" (Phase Vocoder) unit generators in SuperCollider often overwrite the FFT frame, because they expect a new FFT analysis to be generated in each go-round. So, though it might seem redundant, it's necessary to PV_Copy from the prepared buffer into a temporary buffer for the rest of the processing. The noise reduction itself is actually fairly simple: PV_MagSubtract subtracts energy corresponding to the noise spectrum.
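    SynthDef(\nr, { |fftbuf, amp, maxLevel = 0.99|
        var sig = LeakDC.ar(SoundIn.ar(0)),
        // Live analysis of the microphone signal.
        fft = FFT(LocalBuf(BufFrames.ir(fftbuf)), sig),
        // The stored noise spectrum, and a scratch buffer to copy it into.
        fftsource = FFTTrigger(fftbuf),
        fftsub = FFTTrigger(LocalBuf(BufFrames.ir(fftbuf))),
        copy = PV_Copy(fftsource, fftsub);
        // Subtract the noise magnitudes from the live spectrum.
        fft = PV_MagSubtract(fft, fftsub, 1);
        // Resynthesize, apply gain in dB, limit, and send to both channels.
        Out.ar(0, Limiter.ar(IFFT(fft) * amp.dbamp, maxLevel) ! 2);
    }).send(z);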
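Conceptually, the subtraction works bin by bin on magnitudes. The sketch below shows that arithmetic on the language side with invented values. The clipping at zero reflects my understanding of what a nonzero third argument to PV_MagSubtract asks for; treat that detail as an assumption and check the sc3-plugins help file.

```supercollider
(
// Made-up magnitudes for four bins; not measured data.
var signalMags = [0.9, 0.2, 0.5, 0.05];
var noiseMags = [0.1, 0.1, 0.1, 0.1];
// Subtract the noise estimate per bin and clip at zero, so that no
// magnitude goes negative where the noise estimate exceeds the signal.
var cleanedMags = (signalMags - noiseMags).max(0);
cleanedMags.postln;
)
```

Bins dominated by the voice keep most of their energy, while a bin that holds little more than noise (the last one here) drops to zero.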
But that's still not enough coding! This is something I need to use daily, so it has to be convenient. I should be able to run it as a script -- sclang /path/to/micfix-script.scd -- and it should do everything: manipulate the JACK connections, run the processing, and also show a little window with a few simple controls. I won't go into all the details in text, but the attached file shows how it's done, for the brave or the just curious. (Note: The script requires a development version of SC, with the Qt GUI framework. The extension should really be '.scd', but this bloggy software doesn't like that. Ho hum -- rename it after you download. Oh... and, I hope I removed all the swear words from the comments.)
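For a rough idea of the JACK part only, one option is to shell out to the jack_connect command once the server has booted. This is a guess at the general shape, not an excerpt from the attached file, and the port names are assumptions based on common JACK defaults; jack_lsp lists the real ones on a given machine.

```supercollider
(
// Assumption: the default server `s`, and typical default JACK port names.
s.waitForBoot {
    // Patch the first hardware capture port into SuperCollider's first input.
    "jack_connect system:capture_1 SuperCollider:in_1".unixCmd;
};
)
```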
How is it working? Pretty well -- maybe better than it ought to work, for such a naïve algorithm. NaturallySpeaking still struggles more than I would like, but accuracy is around 90% at best, good enough for now. Probably I will still have to get a new microphone, but my dictation setup is decently usable again. So it was a worthy experiment.
Edit: Updated the file. There was a bug with the "new noise sample" button. Fixed now.
Attachments:
micfix-script.txt (6.1 KB)

