Audio Classification using Wavelet Transform and Deep Learning

Aditya Dutt
Sep 5, 2021


A step-by-step implementation to classify audio signals using continuous wavelet transform (CWT) as features.



Audio classification is an important task. Over the past decade, a lot of research has gone into classifying audio using different kinds of features and neural network architectures. Real-world applications include speaker recognition, environmental sound classification, music genre classification, and bird sound classification. The most commonly used audio features are spectrograms/mel-spectrograms and the state-of-the-art MFCCs. However, researchers have explored other methods for classifying audio data as well. One such method is the wavelet transform. In this article, we will learn what the wavelet transform is and how to use it with machine learning for classification tasks.

What are Wavelets?

Firstly, we will talk about why wavelets are needed.

Usually, the Fourier Transform is used to extract frequencies from a signal. The Fourier Transform analyzes a signal using a series of sine waves of different frequencies. But it has a big drawback: picking the right window size. According to Heisenberg's uncertainty principle (as applied to signals):

  • A narrow window localizes the signal well in time but leaves significant uncertainty in frequency.
  • A wide window reduces the frequency uncertainty but increases the time uncertainty.

This is the tradeoff between time and frequency resolution.

One way to avoid this problem is Multiresolution Analysis (MRA). An example of MRA is Wavelet Transform. In MRA, a signal is analyzed at different resolution levels.

A Wavelet is a wave-like oscillation that is localized in time. Wavelets have 2 fundamental properties: scale and location. Scale defines how “stretched” or “squished” a wavelet is. Location is a position in time.

Below is the formula of the continuous wavelet transform:

X(a, b) = (1/√a) ∫ x(t) ψ*((t − b)/a) dt

where a is the scale, b is the location (time shift), x(t) is the signal, and ψ* is the complex conjugate of the mother wavelet.

The wavelet transform varies the "scale" parameter to find different frequencies in the signal along with their locations. So we know not only which frequencies exist in the signal but also where they occur. A smaller scale means the wavelet is squished, so it captures higher frequencies; a larger scale captures lower frequencies. The image below shows examples of a squished and a stretched wavelet.

(Left) Smaller-scale or squished wavelet, (Right) Larger scale or stretched wavelet

This is the advantage the wavelet transform has over the FFT: it captures spectral and temporal information simultaneously. Basically, the signal is convolved with a set of wavelets at different scales and positions. The original wavelet that gets scaled and shifted is called the "mother wavelet". There are many wavelets to choose from, and different wavelets suit different applications.

We will use Morlet wavelet in this tutorial.

The Morlet wavelet (or Gabor wavelet) is closely related to human perception, both hearing and vision. It is composed of a complex exponential multiplied by a Gaussian window. The real-valued form used by PyWavelets ("morl") is:

ψ(t) = cos(5t) · e^(−t²/2)

This is what a Morlet wavelet looks like:

Morlet wavelet

Here is a nice article on Medium that discusses wavelets in detail.

Problem Statement

Goal: We want to classify audio using continuous wavelet transform. We will use the Free Spoken Digit Dataset (FSDD). We will extract the wavelet transform of each sample and try to classify the speakers from this dataset using a deep neural network. To demonstrate this method, we will only classify 3 speakers in this tutorial.

Data Description

Free Spoken Digit Dataset (FSDD): A simple open audio/speech dataset consisting of recordings of spoken digits in .wav files at 8 kHz. The recordings are trimmed so that they have near-minimal silence at the beginnings and ends. It contains:

  • 6 speakers
  • 3,000 recordings (50 of each digit per speaker)
  • English pronunciations


I will explain all the steps. A link to the complete code on GitHub is at the end.

Step 1: Import all the libraries

Example of Wavelet Transform (optional step for demonstration purposes)

The Librosa library in Python is used for music and audio analysis. We can read audio files and extract spectrograms with it. For the wavelet transform, we will use the PyWavelets (pywt) library.

Here is the wavelet transform of a sample from the Free Spoken Digit Dataset.

In the code, we set the wavelet type to Morlet ("morl"), set the sampling rate to 8 kHz, set the scales of the wavelet, and then compute the wavelet transform.
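The embedded gist is not shown above; a sketch of the same computation, using a synthetic 8 kHz tone in place of a dataset file (the test tone and variable names are assumptions), could be:

```python
import numpy as np
import pywt

fs = 8000                       # FSDD sampling rate (Hz)
dt = 1.0 / fs

# A synthetic 1-second, 440 Hz tone stands in for an actual recording;
# with librosa you would instead use: signal, fs = librosa.load(path, sr=8000)
t = np.arange(fs) / fs
signal = np.sin(2 * np.pi * 440 * t)

wavelet = "morl"                # Morlet mother wavelet
scales = np.arange(1, 64)       # scales 1..63, as printed below

# coef has shape (n_scales, n_samples); freqs are the frequencies (in Hz)
# associated with each scale
coef, freqs = pywt.cwt(signal, scales, wavelet, sampling_period=dt)

print("Scales:", scales)
print("Frequencies:", freqs)
print("Shape of wavelet transform:", coef.shape)
```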


These are the scales that we are using: [ 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63]

These are the frequencies that are associated with the scales: [6500. 3250. 2166.66666667 1625. 1300. 1083.33333333 928.57142857 812.5 722.22222222 650. 590.90909091 541.66666667 500. 464.28571429 433.33333333 406.25 382.35294118 361.11111111 342.10526316 325. 309.52380952 295.45454545 282.60869565 270.83333333 260. 250. 240.74074074 232.14285714 224.13793103 216.66666667 209.67741935 203.125 196.96969697 191.17647059 185.71428571 180.55555556 175.67567568 171.05263158 166.66666667 162.5 158.53658537 154.76190476 151.1627907 147.72727273 144.44444444 141.30434783 138.29787234 135.41666667 132.65306122 130. 127.45098039 125. 122.64150943 120.37037037 118.18181818 116.07142857 114.03508772 112.06896552 110.16949153 108.33333333 106.55737705 104.83870968 103.17460317]

Shape of wavelet transform: (63, 6571)

Wavelet transform of a small frame of an audio sample

Now let’s read all the data and extract all their features.

Step 2: Read the audio files and split them into train/test data

We are only classifying 3 speakers here. So, we will extract samples from 3 speakers: George, Jackson, and Lucas. We will write all data to a .npz file.


Training Data class distribution: (array([0, 1, 2]), array([360, 352, 338]))

Testing Data class distribution: (array([0, 1, 2]), array([140, 148, 162]))

Step 3: Extract Wavelet Transform Features

Let’s create a function to extract the wavelet transform. After computing the wavelet features, we will divide the time series into frames of length 400. The sampling rate is 8 kHz, so a 400-sample frame corresponds to 50 milliseconds. We have also created a band-pass filter: we will only keep frequencies between 80 Hz and 1,000 Hz (1 kHz).

Human voice frequency range: The human ear can hear between 20 Hz and 20,000 Hz (20 kHz) but is most sensitive to everything between 250 and 5,000 Hz. The voiced speech of a typical adult male has a fundamental frequency from 85 to 180 Hz, and that of a typical adult female from 165 to 255 Hz. For a child’s voice, the average fundamental frequency is around 300 Hz. Consonants occupy the space between 2 kHz and 5 kHz, and vowel sounds are prominent between 500 Hz and 2 kHz.

We will write all features in a .npz file so that we can easily load the features from there later. The shape of the features is (76 x 400). It’s in the format: (Features x timesteps).

We have also saved the unique id of each sample. It is important because we are dividing each sample into multiple frames and we need to know which frame belongs to which sample. All frames from one sample will have the same id.
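The feature-extraction gist is not shown here; a sketch of such a function follows. The exact scale range that produces the article's 76 frequency bins is not given, so the range below is an assumption (it yields 75 bands), and the per-sample ids from the full code are omitted for brevity:

```python
import numpy as np
import pywt

FS = 8000
FRAME_LEN = 400                 # 400 samples = 50 ms at 8 kHz
FMIN, FMAX = 80.0, 1000.0       # band-pass: keep only 80 Hz .. 1 kHz

def wavelet_features(signal, scales=np.arange(1, 128)):
    """Compute the CWT of one signal, keep only the voice-range scales,
    and split the result into 50 ms frames.

    Returns an array of shape (num_frames, num_bands, FRAME_LEN)."""
    coef, freqs = pywt.cwt(signal, scales, "morl", sampling_period=1.0 / FS)
    band = (freqs >= FMIN) & (freqs <= FMAX)    # band-pass filter on scales
    coef = np.abs(coef[band])                   # coefficient magnitudes
    num_frames = coef.shape[1] // FRAME_LEN
    frames = [coef[:, i * FRAME_LEN:(i + 1) * FRAME_LEN]
              for i in range(num_frames)]
    return np.stack(frames)

# Example: a 0.5-second signal yields 10 frames
feats = wavelet_features(np.random.randn(4000))
print(feats.shape)   # (num_frames, num_bands, FRAME_LEN)
```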

Extract features of training data: We took random samples from the training data, and random frames from each sample, in order to reduce the amount of training data.

Extract features of testing data:

The shape of training data is: (8267 x 76 x 400).

The shape of testing data is: (12326 x 76 x 400).

The data is in the format: (Num_samples x Features x timesteps).

Step 4: Build the Deep Learning Model

We will use dilated 1D convolutions along with Batch Normalization layers. We use time-distributed 1D convolutions because this is a multivariate time series. Spectrograms and wavelet transforms are not standard images: in a standard image, the x-axis and y-axis carry the same kind of content, but in spectrograms and wavelet transforms the x-axis is time and the y-axis is frequency. That said, feel free to use 2D convolutions; many people use them with spectrograms and achieve good results.

Dilated convolutions help increase the receptive field while keeping the number of parameters the same, which is helpful because we have a multivariate time series of length 400. The Batch Normalization layer normalizes the mini-batches during training, mitigating internal covariate shift; it makes training faster and the model more robust. I highly recommend using it in your models.
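As a sketch, a dilated 1D CNN with Batch Normalization in Keras might look like the following; the layer widths, depths, and dilation rates here are assumptions, not the article's exact architecture:

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_model(timesteps=400, features=76, num_classes=3):
    """Dilated 1D CNN with Batch Normalization.
    Input shape is (timesteps, features) = (400, 76)."""
    model = models.Sequential([
        layers.Input(shape=(timesteps, features)),
        layers.Conv1D(32, 3, dilation_rate=1, padding="causal", activation="relu"),
        layers.BatchNormalization(),
        layers.Conv1D(32, 3, dilation_rate=2, padding="causal", activation="relu"),
        layers.BatchNormalization(),
        layers.Conv1D(64, 3, dilation_rate=4, padding="causal", activation="relu"),
        layers.BatchNormalization(),
        layers.GlobalAveragePooling1D(),      # collapse the time axis
        layers.Dense(64, activation="relu"),
        layers.Dense(num_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

model = build_model()
```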

Step 5: Train the network

We will train the network using 5-fold cross-validation. We need to change the data format before starting the training process. For the deep learning model, we need the data in the format: (Num_samples x Timesteps x Features).

First, we need to standardize the data using a standard scaler. We will save the mean and standard deviation of the training data; during testing, we will use them to standardize the test data before making predictions.
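A sketch of the reshaping, standardization, and 5-fold split, with dummy arrays in place of the saved features (the shapes and helper names are assumptions):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import KFold

def to_time_major(x):
    """(Num_samples, Features, Timesteps) -> (Num_samples, Timesteps, Features)."""
    return np.transpose(x, (0, 2, 1))

def fit_scaler(x_train):
    """Fit a StandardScaler on the training data, flattened over time.
    The scaler stores the training mean and standard deviation."""
    scaler = StandardScaler()
    scaler.fit(x_train.reshape(-1, x_train.shape[-1]))
    return scaler

def apply_scaler(scaler, x):
    return scaler.transform(x.reshape(-1, x.shape[-1])).reshape(x.shape)

# Dummy arrays stand in for the features loaded from the .npz file
x = to_time_major(np.random.randn(20, 76, 400))   # -> (20, 400, 76)
scaler = fit_scaler(x)
x_std = apply_scaler(scaler, x)

# 5-fold cross-validation over the standardized training data
kf = KFold(n_splits=5, shuffle=True, random_state=42)
for train_idx, val_idx in kf.split(x_std):
    pass  # here you would call model.fit on x_std[train_idx]
```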

Step 6: Test the model

During testing, we first make predictions for every frame. Since we created multiple frames from each sample, we classify all of a sample's frames and then assign the final class to the sample by majority vote.
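The majority-vote step can be sketched as follows (the function and variable names are assumptions):

```python
import numpy as np
from collections import Counter

def majority_vote(frame_preds, frame_ids):
    """Collapse per-frame class predictions into one label per sample.

    frame_preds: predicted class for each frame
    frame_ids:   id of the sample each frame came from"""
    frame_preds = np.asarray(frame_preds)
    frame_ids = np.asarray(frame_ids)
    sample_preds = {}
    for sid in np.unique(frame_ids):
        votes = frame_preds[frame_ids == sid]
        # most_common(1) gives the single most frequent class
        sample_preds[int(sid)] = int(Counter(votes).most_common(1)[0][0])
    return sample_preds

# Toy example: sample 0's frames mostly vote class 1, sample 1's vote class 0
final = majority_vote([1, 1, 2, 0, 0], [0, 0, 0, 1, 1])
```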


Classification report

We got an accuracy of 97% on the testing dataset.

Here is the link to the complete code on GitHub:


We have learned how to use the wavelet transform with machine learning for classification and achieved 97% accuracy on the testing dataset. You can use this approach for other classification problems as well, including image classification. Other neural network architectures can also be used: 2D convolutions, transfer learning models, etc.

The only drawback of this method is that there is a lot of data to process. If you saved the transforms as images and downsampled/shrank them, you would have less data. I am not sure how that would affect the information, because the timesteps would also be reduced, but it would be interesting to see how it performs.

Thanks for reading! I hope it was useful. 👋

Let me know if you have any questions.