It's Just A Little Pitch Correction - DSP Final Project by A. Nagel

On The Surface

Pitch correction is a widely used and controversial form of music editing, probably better known as auto-tuning (though Auto-Tune is a specific audio processor licensed by Antares Audio Technologies). Whether or not pitch correction is a form of musical "cheating" has been a matter of great debate among professionals and laymen alike. It can be used to make barely noticeable changes to a singer's pitch, but it has also started an entire musical style, pioneered by Cher with her song "Believe" in 1998. It seems like a straightforward concept--just change the undesirable pitch in a musical signal to a desirable one--but how is this done?

Deeper Down

A digital music signal is simply a value that changes over time, stored as a sequence of numbers. The larger the magnitude of the numbers, the greater the amplitude, or loudness. A sequence is periodic if it is composed of a repeating sub-sequence; the length of that sub-sequence is called its period. The frequency is equal to 1/period and is measured in cycles/second (Hertz). If the frequency of the signal is high, then the pitch of the audio signal is high, and if the frequency is low, then the pitch is low. If two copies of the same signal play at different times, then the difference in time is called the phase shift.

Just as any positive integer can be made by adding ones together, any audio signal can be made by adding together sine waves of varying amplitudes, frequencies and phases. The Fourier spectrum breaks an audio wave down into its frequency components, showing the amplitude and phase that each individual sine wave of a specific frequency must have in order to add up to the original audio wave. The Fourier Transform converts a signal from the time domain to the frequency domain. What this means is that, just as the x-axis is time for a signal in the time domain (namely 0sec, 1sec, 2sec...), the x-axis is frequency for a signal in the frequency domain (namely 0Hz, 1Hz, 2Hz...). The Discrete Fourier Transform (DFT) is a Fourier Transform that operates on a finite-length, discrete signal (as opposed to a continuous signal, such as an analog signal) and is typically calculated using the Fast Fourier Transform (FFT) algorithm.
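To make the DFT concrete, here is a minimal sketch in plain Python (not the project's Matlab code). It computes a naive DFT directly from the definition, rather than with an FFT, and reads off the strongest frequency of a synthesized 5Hz sine wave. The function names and the 64 samples/second rate are assumptions chosen for illustration.

```python
import cmath
import math

def dft(x):
    """Naive O(N^2) Discrete Fourier Transform, straight from the definition."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]

def peak_frequency(x, sample_rate):
    """Frequency (Hz) of the largest-magnitude bin, skipping the 0Hz bin
    and everything at or above half the sample rate."""
    spectrum = dft(x)
    half = len(x) // 2
    k = max(range(1, half), key=lambda i: abs(spectrum[i]))
    # Bin k corresponds to k cycles over the whole sequence.
    return k * sample_rate / len(x)

sample_rate = 64  # samples per second (assumed for this example)
samples = [math.sin(2 * math.pi * 5 * t / sample_rate) for t in range(sample_rate)]

print(peak_frequency(samples, sample_rate))  # 5.0
```

The sine wave completes exactly 5 cycles in the 1 second that was sampled, so all of its energy lands in a single frequency bin. A real program would use an FFT routine, which gives the same result far faster.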

The frequency domain, while incredibly useful, offers no information on when things happen. If an audio wave is composed of a 1Hz section lasting 1 second followed by a 10Hz section lasting 5 seconds, the Fourier spectrum will just indicate that the signal contains 1Hz and 10Hz components. To find information on both time and frequency, a Short Time Fourier Transform (STFT) must be used. The STFT breaks the audio signal into short segments of time, called windows, and takes the DFT of each window. Thus, for the earlier example, the STFT with non-overlapping windows of length 1 second will consist of 6 windows. The first window will indicate that there is a 1Hz component in the signal, but since that window does not include any of the 10Hz section, it will contain no 10Hz component. Likewise, the next 5 windows will contain only a 10Hz component and no 1Hz component. The waveform in each window is usually multiplied by a function (called a window function) to reduce spectral leakage (unwanted artifacts in the Fourier spectrum that appear when the segment does not contain a whole number of periods), and the windows usually overlap (typically by 50%). The STFT is depicted with a spectrogram, which has time on the x-axis, frequency on the y-axis, and amplitude on the z-axis (represented by color).
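The 1Hz-then-10Hz example can be reproduced with a short sketch in plain Python (again, not the project's Matlab code). It assumes a 32 samples/second rate, 1 second non-overlapping windows as in the example, and a Hann window function. The names `stft_peaks` and `dft_magnitudes` are made up for illustration.

```python
import cmath
import math

def dft_magnitudes(frame):
    """Magnitudes of the DFT bins below half the sample rate (naive DFT)."""
    n = len(frame)
    return [abs(sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n)))
            for k in range(n // 2)]

def stft_peaks(signal, sample_rate, window_len, hop):
    """Strongest frequency (Hz) in each window of a Hann-windowed STFT."""
    hann = [0.5 - 0.5 * math.cos(2 * math.pi * i / window_len) for i in range(window_len)]
    peaks = []
    for start in range(0, len(signal) - window_len + 1, hop):
        frame = [signal[start + i] * hann[i] for i in range(window_len)]
        mags = dft_magnitudes(frame)
        k = max(range(len(mags)), key=lambda i: mags[i])
        peaks.append(k * sample_rate / window_len)
    return peaks

def tone(freq_hz, seconds, sample_rate):
    """A sine wave of the given frequency and duration."""
    return [math.sin(2 * math.pi * freq_hz * i / sample_rate)
            for i in range(seconds * sample_rate)]

sample_rate = 32  # samples per second (assumed; must exceed 2 * 10Hz)
signal = tone(1, 1, sample_rate) + tone(10, 5, sample_rate)

# hop == window_len means the windows do not overlap, as in the example above.
print(stft_peaks(signal, sample_rate, window_len=sample_rate, hop=sample_rate))
# [1.0, 10.0, 10.0, 10.0, 10.0, 10.0]
```

Setting `hop` to half of `window_len` would give the 50% overlap that is more typical in practice.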

That was the simple stuff. Back to pitch correction. Pitch correction has two main steps: pitch detection and pitch shifting. For a simple synthesized signal, pitch detection can simply consist of finding the maximum magnitude in each window of the STFT and reading off its corresponding frequency, but already here problems arise. If the window is too large (such as a 6 second window for the 6 second audio file), information about time will be lost again. If the window is too small (such as a 0.01 second window for a 1Hz wave), information about frequency will be lost. If a sine wave takes 100 seconds to complete 1 cycle, but you only look at it for 1 second, it will be hard to recognize it as a sine wave and even harder to tell what its frequency is. If a real audio signal is used, more problems occur. When a singer sings one note, the pitch that is heard is the fundamental frequency, but the voice also contains both harmonics (integer multiples of the fundamental frequency) and random noise. Because of this, it can be difficult to detect what pitch is being sung. Pitch shifting can be done by timescale modification, but one cannot simply change the frequency of a signal without also changing its duration. To double the frequency of an audio signal, the period must be halved, but then the total signal lasts half as long.
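The link between frequency and duration can be seen in a few lines of plain Python (a sketch, not the project's Matlab code). It raises the pitch in the crudest way possible, by keeping every second sample, with no filtering. The function names and the 1000 samples/second rate are assumptions for illustration.

```python
import math

def sine(freq_hz, seconds, sample_rate):
    """A sine wave of the given frequency and duration."""
    n = int(seconds * sample_rate)
    return [math.sin(2 * math.pi * freq_hz * i / sample_rate) for i in range(n)]

def pitch_up_by_skipping(signal, factor):
    """Keep every `factor`-th sample. Played back at the original sample rate,
    the pitch rises by `factor` and the signal gets `factor` times shorter.
    (No anti-aliasing filter is applied; this is only a demonstration.)"""
    return signal[::factor]

sample_rate = 1000                       # samples per second (assumed)
original = sine(100, 2.0, sample_rate)   # 100Hz for 2 seconds
shifted = pitch_up_by_skipping(original, 2)

print(len(original) / sample_rate)  # 2.0
print(len(shifted) / sample_rate)   # 1.0

# The shortened signal matches a 200Hz sine wave lasting 1 second.
reference = sine(200, 1.0, sample_rate)
print(all(math.isclose(a, b, abs_tol=1e-9) for a, b in zip(shifted, reference)))  # True
```

The frequency doubled from 100Hz to 200Hz, but the 2 second signal now lasts only 1 second, which is exactly the side effect a pitch corrector has to avoid.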

This Project

The goal of this project is to create a primitive pitch-correction program. Since pitch correction is most widely done on solo vocalists, the test input will be simple synthesized waves of varying frequency and a recorded vocal glissando, with and without additive noise. A pitch correction program would ideally be able to distinguish between voiced and unvoiced sounds, but this will be ignored for simplicity's sake. I worked around the window-sizing problem by creating a variable window-size spectrogram. This approach requires taking an initial low-resolution spectrogram (with large, non-overlapping windows) to estimate the approximate pitch during each window, and then using that information to create a new spectrogram in which the size of each window depends on (is inversely proportional to) the approximate pitch at that time. For the real audio signal, I also employ a Harmonic Product Spectrum, which is described here: de la Cuadra, P., Master, A., and Sapp, C., 2001, Efficient pitch detection techniques for interactive music, International Computer Music Association. This works by downsampling (removing samples from a sequence) the magnitude spectrum of the input by 2, zero-padding it to its original length (just adding zeros), and then multiplying it sample by sample with the original spectrum. The downsampled spectrum's peak from twice the fundamental frequency lands on top of the original spectrum's peak at the fundamental, so this should theoretically accentuate the peak at the fundamental frequency.
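A minimal sketch of the Harmonic Product Spectrum idea in plain Python (not the project's Matlab code) is given below, applied to the magnitude spectrum of the signal. The test signal is made up: a weak 4Hz fundamental with stronger components at 8Hz and 12Hz, so that simply picking the largest peak gives the wrong answer. The `num_harmonics` parameter is an assumption; the description above uses a single downsampling by 2, and this sketch also multiplies in a copy downsampled by 3.

```python
import cmath
import math

def magnitude_spectrum(x):
    """Magnitudes of the DFT bins below half the sample rate (naive DFT)."""
    n = len(x)
    return [abs(sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n)))
            for k in range(n // 2)]

def harmonic_product_spectrum(mags, num_harmonics=3):
    """Multiply the spectrum by copies of itself downsampled by 2, 3, ...
    Each copy is zero-padded back to the original length first."""
    hps = list(mags)
    for factor in range(2, num_harmonics + 1):
        downsampled = mags[::factor]
        padded = downsampled + [0.0] * (len(mags) - len(downsampled))
        hps = [a * b for a, b in zip(hps, padded)]
    return hps

def peak_bin(values):
    """Index of the largest value, ignoring the 0Hz bin."""
    return max(range(1, len(values)), key=lambda i: values[i])

sample_rate = 64  # samples per second (assumed)
n = 64            # 1 second of signal, so each bin is 1Hz wide
partials = {4: 0.5, 8: 1.0, 12: 0.6}  # frequency (Hz) -> amplitude, made up
signal = [sum(amp * math.sin(2 * math.pi * f * t / sample_rate)
              for f, amp in partials.items()) for t in range(n)]

mags = magnitude_spectrum(signal)
bin_hz = sample_rate / n
print(peak_bin(mags) * bin_hz)                             # 8.0
print(peak_bin(harmonic_product_spectrum(mags)) * bin_hz)  # 4.0
```

The plain spectrum peaks at 8Hz because that component is the loudest, while the product peaks at 4Hz, the only bin where the original and both downsampled copies are all large.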

My method for pitch shifting can be found here: Laroche, J., and Dolson, M., 1999, New phase-vocoder techniques for pitch-shifting, harmonizing, and other exotic effects, IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, p. 91-94. Correct pitch shifting requires, first and foremost, accurate pitch detection. After that, the signal must be frequency shifted so that the peak (assumed pitch) at the detected frequency is moved to the desired frequency. However, the whole spectrum cannot be shifted by a constant amount, because then the harmonics would no longer line up at integer multiples of the new fundamental. The component at twice the fundamental frequency must be shifted twice as far as the fundamental, the component at three times the fundamental must be shifted three times as far, etc.
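The shift needed by each harmonic is simple arithmetic, sketched here in plain Python. This only computes where each peak has to move; it is not the phase-vocoder method of the cited paper and not the project's Matlab code. The function name and the 430Hz/440Hz values are made up for illustration.

```python
def harmonic_shifts(detected_hz, target_hz, num_harmonics):
    """For the components at 1x, 2x, ... num_harmonics x the fundamental,
    return (original frequency, corrected frequency, required shift) in Hz."""
    return [(k * detected_hz, k * target_hz, k * (target_hz - detected_hz))
            for k in range(1, num_harmonics + 1)]

# A slightly flat note detected at 430Hz, to be corrected to 440Hz.
for original, corrected, shift in harmonic_shifts(430.0, 440.0, 3):
    print(original, corrected, shift)
# 430.0 440.0 10.0
# 860.0 880.0 20.0
# 1290.0 1320.0 30.0
```

Every component is multiplied by the same ratio (440/430 here), which is why the shift in Hz grows with the harmonic number instead of staying constant.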

All coding was done in Matlab.