Deep Learning 101: Lesson 23: The Basics of Audio Signal Processing with FFT

10 min read · Sep 2, 2024

This article is part of the “Deep Learning 101” series. Explore the full series for more insights and in-depth learning here.
☞ Learn with the visual tool: Audio Basics

The Fast Fourier Transform (FFT) is a fundamental algorithm in the field of digital signal processing, playing a crucial role in audio recognition. It is a method for efficiently computing the Discrete Fourier Transform (DFT), which converts a signal from its original time domain into the frequency domain. The time domain shows how a signal changes over time, while the frequency domain represents the signal in terms of its frequency content. The FFT algorithm decomposes a time-based signal into its constituent frequencies, revealing the amplitude and phase of each frequency component. This transformation is key in analyzing audio signals, as it allows for the identification and isolation of different sound elements based on their frequency characteristics.

In audio processing, FFT is used to analyze the frequency content of audio signals. For voice recognition, FFT helps in distinguishing between different phonemes (distinct units of sound) by breaking down the voice signal into its frequency components. This frequency-based analysis is critical in identifying and differentiating various sounds and speech elements. In music analysis, FFT is used to identify musical notes, chords, and rhythms by analyzing the frequency spectrum of a musical piece. It enables the detection of different instruments in a song and can even be used for audio effects like equalization, where certain frequency bands are amplified or attenuated.

FFT’s role in AI extends to various practical applications:

Speech Recognition: AI systems like virtual assistants use FFT to analyze speech patterns. By breaking down speech into frequency components, these systems can more accurately recognize words and phrases, even in noisy environments.
Sound Classification: In AI-driven sound classification, FFT is used to analyze environmental sounds or machinery noises for monitoring and diagnostic purposes. For instance, identifying the sound of a failing machine part in a factory setting for predictive maintenance.
Noise Reduction: In applications like call centers or voice-activated systems, FFT is instrumental in noise reduction algorithms. By analyzing the frequency spectrum of the audio, AI systems can filter out background noise, enhancing the clarity of the spoken words.
Audio Fingerprinting: FFT is used in creating unique fingerprints of audio tracks, which AI systems can then use to identify songs or detect copyright infringements.

Understanding FFT (Fast Fourier Transform) in Audio Signals

In audio recognition, it helps to understand both audio signals and their FFT (Fast Fourier Transform) representations. Audio signals are complex mixtures of many frequencies, and recognizing patterns in sound depends on knowing that composition. The FFT decomposes an audio signal into its constituent frequencies, presenting it in a form that neural networks can interpret more easily. This parallels image recognition, where images are converted into numerical arrays: the FFT converts an audio signal into a frequency spectrum that serves as input to a deep learning model. With this representation, neural networks can distinguish between different kinds of sound, such as music, speech, or environmental noise.

Transforming Audio into Digital Signal

The first step in audio processing is converting the analog audio signal into a digital signal by sampling the audio wave at regular intervals. The sampling rate, measured in Hz (samples per second), determines the time resolution of the digital audio and, by the Nyquist criterion, the highest frequency it can represent (half the sampling rate). CD-quality audio, for example, uses a sampling rate of 44.1 kHz. Sampling can be written as:

$$x[n] = x_a(n T_s)$$

Where x[n] is the sampled signal, x_a(t) is the analog signal, n is the sample number, and T_s is the sampling period (the reciprocal of the sampling rate).
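As a minimal sketch of the sampling relation above, the snippet below evaluates a continuous-time function at the instants t = n x Ts. The 440 Hz tone, the 10 ms duration, and the helper name `sample_signal` are assumptions chosen for illustration; only the 44.1 kHz rate comes from the text.

```python
import numpy as np

def sample_signal(analog, sample_rate_hz, duration_s):
    """Evaluate a continuous-time function at the sampling instants n * Ts."""
    ts = 1.0 / sample_rate_hz                      # sampling period Ts
    n = np.arange(int(round(duration_s * sample_rate_hz)))  # sample numbers
    return n, analog(n * ts)

# Hypothetical "analog" signal: a 440 Hz sine tone (illustrative choice).
def tone(t):
    return np.sin(2 * np.pi * 440.0 * t)

n, x = sample_signal(tone, 44_100, 0.01)           # 10 ms at the CD sampling rate
print(len(x), "samples; Ts =", 1.0 / 44_100, "s")
```

At 44.1 kHz, 10 ms of audio corresponds to 441 samples, which gives a feel for how quickly sample counts grow with duration.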

The Spectrogram

Once the audio signal is digitized, we can represent it as a spectrogram. A spectrogram is a visual representation of the spectrum of frequencies in a sound signal as they vary with time. In simpler terms, it’s a graph that shows how different frequencies present in the audio signal fluctuate over time. The x-axis represents time, the y-axis represents frequency, and the intensity of colors in the graph indicates the amplitude (or loudness) of various frequencies at different times.

FFT and Its Mathematical Foundation

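The Fast Fourier Transform (FFT) is used to convert the time-domain audio signal into a frequency-domain representation. The mathematical basis of FFT lies in the Fourier Transform, which decomposes a function of time (a signal) into its constituent frequencies. The Fourier Transform of a continuous time-domain signal x(t) is given by:

$$X(f) = \int_{-\infty}^{\infty} x(t)\, e^{-j 2\pi f t}\, dt$$

Where X(f) is the frequency-domain representation of the signal, f is the frequency in Hz, and j is the imaginary unit.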

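The FFT algorithm efficiently computes this transformation for discrete signals, making it feasible even in real-time applications. The Discrete Fourier Transform (DFT) of a discrete time-domain signal x[n] is given by:

$$X[k] = \sum_{n=0}^{N-1} x[n]\, e^{-j 2\pi k n / N}, \quad k = 0, 1, \ldots, N-1$$

Where N is the number of samples, n is the time index, and k is the frequency bin index.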

This equation translates a sequence of N complex numbers x[n] from the time domain into N complex numbers X[k] in the frequency domain. The FFT is an algorithmic strategy that produces the same results as the DFT with significantly less computational effort. While evaluating the DFT directly requires O(N²) operations, the FFT reduces this to O(N log N) by factoring the computation and reusing the results of smaller DFTs. This gain comes from recursively breaking the DFT into smaller DFTs and exploiting the periodicity and symmetry of the complex exponential. The Cooley-Tukey algorithm, a common FFT algorithm, recursively splits a DFT into two half-size DFTs, one over the even-indexed samples and one over the odd-indexed samples.
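To make the comparison concrete, here is a small sketch that computes the DFT directly from its definition and then with a recursive radix-2 Cooley-Tukey split over even and odd indices. The function names `dft_direct` and `fft_radix2` are illustrative, and this version assumes the input length is a power of two; production libraries such as `numpy.fft` use more elaborate variants.

```python
import numpy as np

def dft_direct(x):
    """O(N^2) DFT evaluated straight from the definition."""
    x = np.asarray(x, dtype=complex)
    N = len(x)
    n = np.arange(N)
    k = n.reshape(-1, 1)
    return np.exp(-2j * np.pi * k * n / N) @ x

def fft_radix2(x):
    """Recursive radix-2 Cooley-Tukey FFT; len(x) must be a power of two."""
    x = np.asarray(x, dtype=complex)
    N = len(x)
    if N == 0 or N & (N - 1):
        raise ValueError("length must be a power of two")
    if N == 1:
        return x
    even = fft_radix2(x[0::2])                 # DFT of even-indexed samples
    odd = fft_radix2(x[1::2])                  # DFT of odd-indexed samples
    twiddle = np.exp(-2j * np.pi * np.arange(N // 2) / N)
    # Combine the two half-size DFTs into the full-size result.
    return np.concatenate([even + twiddle * odd, even - twiddle * odd])

signal = np.random.default_rng(0).standard_normal(64)
print(np.allclose(dft_direct(signal), fft_radix2(signal)))
```

Both functions should agree with `np.fft.fft` up to floating-point error; the difference lies only in how much work each one does to get there.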

The frame size in FFT refers to the number of signal samples used in each FFT operation, which matters when analyzing time-varying signals. It sets the balance between frequency resolution (better with larger frames) and time resolution (better with smaller frames). A window function is applied to each frame before the FFT to minimize spectral leakage. Overlapping frames can improve the analysis by providing smoother transitions in the spectrogram. The choice of frame size is therefore a trade-off, determined by whether temporal detail or frequency detail is more critical for the analysis.
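The trade-off described above can be sketched as follows. The helper `frame_signal`, the Hann window, the 50% overlap, and the 8 kHz sampling rate are all assumptions made for this illustration; other windows and overlaps are equally valid.

```python
import numpy as np

def frame_signal(x, frame_size, hop_size):
    """Split x into overlapping frames and apply a Hann window to each."""
    x = np.asarray(x, dtype=float)
    n_frames = 1 + (len(x) - frame_size) // hop_size
    window = np.hanning(frame_size)
    frames = np.stack([x[i * hop_size : i * hop_size + frame_size]
                       for i in range(n_frames)])
    return frames * window

sample_rate = 8000                                   # Hz (assumed)
x = np.random.default_rng(0).standard_normal(sample_rate)  # 1 s of noise

for frame_size in (128, 1024):
    frames = frame_signal(x, frame_size, frame_size // 2)   # 50% overlap
    bin_spacing_hz = sample_rate / frame_size        # frequency resolution
    frame_span_s = frame_size / sample_rate          # time covered by one frame
    print(frame_size, frames.shape, bin_spacing_hz, frame_span_s)
```

The larger frame has finer bin spacing (sample rate divided by frame size) but each frame covers a longer stretch of time, so short events are smeared across it.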

In practical computations, the frame size is often chosen to be a power of two. Radix-2 FFT implementations require it, and most libraries run fastest at those lengths, although many modern implementations also accept arbitrary frame sizes. The FFT is then applied to each frame, resulting in a sequence of frequency-domain “snapshots” of the signal over time. The choice of frame size depends on the requirements of the analysis and the characteristics of the signal. For example, in speech processing, frame durations typically range from 20 ms to 40 ms, balancing the need to capture the quasi-stationary properties of speech with the ability to reflect dynamic changes.
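As a rough illustration of how a frame duration maps to a frame size, the sketch below converts milliseconds to samples and rounds up to the next power of two. The 16 kHz sampling rate and the helper name `frame_size_for` are assumptions for this example, not values from the article.

```python
def frame_size_for(duration_ms, sample_rate_hz):
    """Return (samples in the frame, next power of two >= that count)."""
    samples = int(round(duration_ms * sample_rate_hz / 1000))
    fft_size = 1 << (samples - 1).bit_length()   # next power of two
    return samples, fft_size

# Assumed sampling rate of 16 kHz, common for speech but not stated in the text.
for ms in (20, 25, 40):
    print(ms, "ms ->", frame_size_for(ms, 16_000))
```

When the FFT size exceeds the number of samples in the frame, the frame is typically zero-padded up to that size.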

Example of Fast Fourier Transform (FFT)

This section uses an example to explain the Fast Fourier Transform (FFT): a simple signal composed of three distinct frequencies. It shows how the FFT breaks a composite signal down into its constituent frequencies.

Figure 1: Audio Signal Comprising Three Frequencies

The image above depicts a signal comprising three frequencies: 100 Hz, 200 Hz, and 300 Hz, each spanning approximately 0.2 seconds. The x-axis shows time, reflecting the duration over which the signal extends, and the y-axis shows amplitude, indicating the strength or loudness of the signal at any given point in time. This representation shows how the three frequencies are combined over time to form a composite audio signal.

The frequency spectrum plot of this signal clearly shows the three distinct frequencies it contains. The x-axis represents frequency in Hertz (Hz), showing the range of frequencies contained in the signal, while the y-axis represents magnitude, indicating how prominent each frequency is within the signal. The peaks at 100 Hz, 200 Hz, and 300 Hz confirm the composition of the signal as described earlier.
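A spectrum like the one described can be reproduced with a few lines of NumPy. This is only a sketch: the 1000 Hz sampling rate and the choice to play the three tones one after another (0.2 s each) are assumptions, since the article does not state how its figure was generated.

```python
import numpy as np

sample_rate = 1000                              # Hz (assumed)
t = np.arange(200) / sample_rate                # 0.2 s per tone
# Assumption: the tones follow one another rather than sounding together.
signal = np.concatenate([np.sin(2 * np.pi * f * t) for f in (100, 200, 300)])

magnitude = np.abs(np.fft.rfft(signal))         # spectrum of the real signal
freqs = np.fft.rfftfreq(len(signal), d=1 / sample_rate)

# The three strongest bins, reported in ascending frequency order.
peak_freqs = freqs[np.sort(np.argsort(magnitude)[-3:])]
print(peak_freqs)
```

`np.fft.rfft` returns only the non-negative frequencies, which is usually what is wanted for a real-valued audio signal.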

The image above is the FFT spectrogram of the signal containing the three frequencies. The x-axis of the spectrogram represents time, as in the time-domain plot of Figure 1, while the y-axis now represents frequency. The color intensity indicates the amplitude or strength of a particular frequency at a given time. The spectrogram shows how the frequencies of the signal vary over time, giving a combined view of its temporal and frequency content. The frame size (window size) used in the FFT spectrogram computation is 256 samples. This spectral representation supports the following analysis of the frequency components within the signal:

Peaks at 100 Hz, 200 Hz, and 300 Hz: These peaks indicate the presence of these specific frequencies within the signal. Each peak corresponds to a frequency component, with its intensity representing the magnitude or strength of that frequency at a particular point in time.
Significance of Peaks: The existence of these peaks in the spectrogram is a direct confirmation of the signal’s composition. In the context of audio processing, this kind of analysis is crucial for identifying different tones, notes, or sound characteristics. It’s particularly useful in scenarios like music production, audio engineering, and signal processing research.
Temporal Resolution: Unlike the previous frequency spectrum plot which showed the aggregate presence of frequencies over the entire signal duration, the spectrogram adds another layer of insight by showing how these frequencies vary over time. This temporal resolution allows for a more nuanced understanding of the signal’s dynamics.
Color Coding: The intensity of colors in the spectrogram typically represents the amplitude of frequencies at various points in time. Brighter or more intense colors at certain frequencies and times indicate higher amplitudes of those frequencies.
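The points above can be checked numerically with a small short-time FFT. In this sketch only the 256-sample frame size comes from the article; the 1000 Hz sampling rate, the sequential arrangement of the tones, the Hann window, the 64-sample hop, and the helper name `stft_magnitude` are assumptions.

```python
import numpy as np

def stft_magnitude(x, frame_size=256, hop_size=64):
    """Magnitude spectrogram: one row per frame, one column per frequency bin."""
    window = np.hanning(frame_size)
    n_frames = 1 + (len(x) - frame_size) // hop_size
    frames = np.stack([x[i * hop_size : i * hop_size + frame_size]
                       for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames * window, axis=1))

sample_rate = 1000                              # Hz (assumed)
t = np.arange(200) / sample_rate                # 0.2 s per tone
signal = np.concatenate([np.sin(2 * np.pi * f * t) for f in (100, 200, 300)])

spec = stft_magnitude(signal)                   # shape: (frames, bins)
freqs = np.fft.rfftfreq(256, d=1 / sample_rate)
dominant = freqs[np.argmax(spec, axis=1)]       # strongest frequency per frame
print(spec.shape, dominant)
```

Each row of `spec` is one frequency "snapshot"; plotting the rows side by side with color mapped to magnitude yields the spectrogram. The dominant frequency per frame moves from roughly 100 Hz toward 300 Hz, within one bin width (sample rate divided by frame size) of the true values.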

Below is a much simpler (toy) example of an FFT computation. Here we have a sine wave signal expressed in table and graph form. The frequency of this signal is 1 Hz. The sampling period, the time interval between signal measurements, is approximately 0.083 seconds (1/12 s), which corresponds to a sampling frequency of 12 Hz.

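To calculate the FFT, we use the formula below. The magnitude of X[k] gives the strength of the discrete frequency (bin) k.

$$X[k] = \sum_{n=0}^{N-1} x[n]\, e^{-j 2\pi k n / N}$$

Where x[n] is the n-th sample of the signal, N is the number of samples (12 here), and k is the frequency bin index.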

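To keep things simple, let’s use 12 frequency bins (0 through 11), each separated by 1 Hz, so k = 0, 1, 2, …, 11. Let’s also resolve the complex exponential in the above equation into its real and imaginary parts, and then compute the magnitude using the formulas below.

$$\mathrm{Re}(X[k]) = \sum_{n=0}^{N-1} x[n] \cos\left(\frac{2\pi k n}{N}\right), \qquad \mathrm{Im}(X[k]) = -\sum_{n=0}^{N-1} x[n] \sin\left(\frac{2\pi k n}{N}\right)$$

$$|X[k]| = \sqrt{\mathrm{Re}(X[k])^2 + \mathrm{Im}(X[k])^2}$$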

Below is a sample calculation for k = 1:
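As a hedged reconstruction of what such a calculation looks like, assume the samples are those of a unit-amplitude, zero-phase sine, x[n] = sin(2πn/12) with N = 12. The amplitude and phase are assumptions; the values in the original table may differ by a scale factor. For k = 1:

$$\mathrm{Re}(X[1]) = \sum_{n=0}^{11} \sin\left(\frac{2\pi n}{12}\right) \cos\left(\frac{2\pi n}{12}\right) = \frac{1}{2} \sum_{n=0}^{11} \sin\left(\frac{4\pi n}{12}\right) = 0$$

$$\mathrm{Im}(X[1]) = -\sum_{n=0}^{11} \sin^2\left(\frac{2\pi n}{12}\right) = -\sum_{n=0}^{11} \frac{1 - \cos\left(\frac{4\pi n}{12}\right)}{2} = -6$$

$$|X[1]| = \sqrt{0^2 + (-6)^2} = 6$$

Both cosine and sine sums over whole periods vanish, which is why only the constant term N/2 = 6 survives.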

Calculating the FFT magnitude for all frequency bins (k = 0 through 11) gives the results shown in the table and bar chart below. Notice the presence of the 1 Hz frequency in both.

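This bar chart shows the FFT magnitude for each frequency bin. The x-axis represents the frequency in Hertz (Hz), and the y-axis represents the FFT magnitude. Notice the significant peak at 1 Hz, which corresponds to the frequency of the sine wave signal and confirms its presence in the original signal. Because the signal is real-valued, its spectrum is symmetric, so a mirror-image peak of the same height also appears at bin 11; only the bins up to the Nyquist frequency (6 Hz here) carry independent information.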
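The toy computation can be reproduced with the real and imaginary sums written out explicitly. This sketch assumes a unit-amplitude, zero-phase 1 Hz sine sampled 12 times at 12 Hz; the values in the original table may differ by a scale factor if a different amplitude was used.

```python
import numpy as np

N = 12                                          # number of samples and bins
sample_rate = 12.0                              # Hz
n = np.arange(N)
x = np.sin(2 * np.pi * 1.0 * n / sample_rate)   # 1 Hz sine, unit amplitude (assumed)

magnitudes = np.empty(N)
for k in range(N):
    angle = 2 * np.pi * k * n / N
    real = np.sum(x * np.cos(angle))            # Re(X[k])
    imag = -np.sum(x * np.sin(angle))           # Im(X[k])
    magnitudes[k] = np.hypot(real, imag)        # |X[k]|

for k, m in enumerate(magnitudes):
    print(f"k={k:2d} ({k} Hz): {m:.3f}")
# Bins 1 and 11 come out at about 6 (N/2); all other bins are about 0.
```

The result matches `np.abs(np.fft.fft(x))`, which computes the same quantity far more efficiently for large N.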

Summary

The process of converting audio signals from the time domain to the frequency domain using FFT provides a powerful tool for analyzing and understanding the composition of sounds. By breaking down complex audio signals into their constituent frequencies, FFT allows for the identification and isolation of different sound elements. This transformation is crucial for various applications in AI and machine learning, including speech recognition, sound classification, noise reduction, and audio fingerprinting. Understanding the mathematical foundation and practical implementation of FFT is essential for leveraging its capabilities in these domains.