Wavelet Based Features for Automatic Synthesizer Programming


This article consists of some notes I made during a quest I set out on — despite knowing little about signal processing and even less about neural networks — to infer the settings of a musical synthesizer from samples of its output. Any remarks or concerns are definitely welcome!

This problem has been addressed several times before — see, for example, here, here, here, and elsewhere — so we will try to do something slightly different.

  • Audio Features Pt.1 — The Continuous Wavelet Transform
  • Subtractive Synthesis 101
  • Writing a Basic Synthesizer
  • Basic CWT Features
    - Choosing Scales and the Central Frequency
    - The Basic CNN & Aggressive Downsampling
    - Logarithmic Downsampling
    - The Discrete Cosine Transform
  • Audio Features Pt.2 — MFCCs (But For Wavelets)
  • Adaptive Training & Our Last CNN
  • Summary

Audio Features Pt.1 — The Continuous Wavelet Transform

The Fourier transform is arguably one of the shiniest gems of applied mathematics. And all it does is break up a given signal (even a very complicated one) into a sum of signals of the most basic kind: basic oscillations, or ‘cisoids’.

Even better: this transformation is invertible. One application out of millions is noise reduction — we record a signal, take the Fourier transform, remove, say, the mains hum (50Hz in the UK + some harmonics), cut out the static (typically buzzy and high frequency) and then put our signal back together.

As wonderful as this is, it comes with a tradeoff. We are expressing our signal in a basis of time-free, completely periodic signals. However, we often care very much about time.

Take, as a topical example, this sound. We start with a buzzy sawtooth wave, then we filter away the buzz to nothing, then we do the reverse:

Here are the magnitudes of the frequencies involved, computed by the Fourier transform:

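A plot like this takes only a few lines to make; the filenames below are hypothetical stand-ins for the two recordings:

import numpy as np
import librosa
import matplotlib.pyplot as plt

# Compare the magnitude spectra of the two sweeps.
for name in ["sweep_down.wav", "sweep_up.wav"]:
    signal, sr = librosa.load(name, sr=24000)
    spectrum = np.abs(np.fft.rfft(signal))
    plt.plot(np.fft.rfftfreq(len(signal), d=1/sr), spectrum, label=name)
plt.legend()
plt.show()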

If you squint you can tell that they are different — and this is necessarily the case, as we said that the transform is invertible — but we would rather not force our neural network to squint later on.

There are at least two ways to consider the frequency content of your signal without forgetting about time altogether:

1) The Short Time Fourier Transform

Here we break our signal into timeframes — of width, say, 20ms each — and then take the Fourier transform of each frame. We lose a sense of time within each frame, but still keep a sense of time across frames.

Typically we bin the frequencies in each frame. This gives us a finite matrix, which we could then use as a reasonable starting point for features that describe our signal, or plot as a spectrogram, for example using Matplotlib's specgram:
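Something like this minimal sketch will do (the filename is a stand-in for whatever recording you have to hand):

import librosa
import matplotlib.pyplot as plt

# Load the clip and plot its spectrogram; 512-sample frames at 24 kHz
# are roughly the 20ms windows mentioned above.
signal, sr = librosa.load("anime_wow.wav", sr=24000)
plt.specgram(signal, NFFT=512, Fs=sr, noverlap=256)
plt.xlabel("Time (s)")
plt.ylabel("Frequency (Hz)")
plt.show()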

Spectrogram of the ‘anime wow’ sound effect. Frequency increasing along the y-axis, time on the x.

This is extremely common in signal processing and in engineering generally, but there is another way.

2) (Continuous) Wavelet Transforms

Our time issue stemmed from the fact that our basis functions, cisoids, are periodic, and aren’t really subject to time. Maybe we can avoid this issue by choosing basis functions that are timed. Enter the wavelets:


Wavelets clearly have a very strong sense of time — like us, they only appear for some fixed, finite amount of time, and so if we express our signal as a sum of scaled and translated wavelets, we get an idea not only of frequency, but also of when exactly those frequencies are most apparent.

(It’s not strictly correct for me to talk about the frequency of a wavelet. Instead, we talk about scale, which is basically the width of the wavelet. But more on this later.)

We can choose however many wavelet scales we care about. Even at just a single scale this transform is already invertible, but we typically choose several, just for a more complete image. If we plot these coefficients as a scaleogram we can see that time is articulated at a much higher resolution:

Scaleogram of the same signal. Scale decreasing along the y-axis.

Cool! With these ideas in hand we are ready to meet our independent variables.

Subtractive Synthesis 101

There are many different flavours of sound synthesis. We will only discuss one: subtractive synthesis. In this section I have emboldened the variables that we will try to infer from the sound wave that they together make.

Of course, when you make a sound, you have to choose a pitch.

In subtractive sound synthesis we start off with a waveform that is rich in harmonics and then subtract away the harmonics we don't like. Basically the waveform is the marble from which we carve our sound.

What do we mean by ‘rich in harmonics’? We just mean that the Fourier transform has a lot of support, or that you need many different frequencies to build the signal. For example, here is how you build a sawtooth wave:

import numpy as np
import matplotlib.pyplot as plt
from scipy.fft import rfft, irfft
from matplotlib.animation import FuncAnimation

fig, ax = plt.subplots()
sawwave = np.mod(np.arange(0, -3, -0.001), 1) - 0.5
sawFFT = rfft(sawwave)

def getFrame(num):
    # Keep only the first `num` frequency components, then resynthesize
    sawFFTTrunc = sawFFT.copy()
    sawFFTTrunc[num:] = 0
    ax.clear()
    ax.plot(sawwave)
    ax.plot(irfft(sawFFTTrunc))
    return []

ani = FuncAnimation(fig, getFrame, frames=[3, 10, 25, 100, 500, 750, 1000, 1250, 1445, 1500])

Notice that it takes many high frequency components to model the discontinuity (the reset) properly. We will come back to this point when we discuss the DCT.

Here are the three waveforms we will choose between, as demonstrated by my Korg Minilogue:

The sine wave is the go-to for relaxed, bell type sounds. The sawtooth wave is a good choice for string-like sounds (in fact, if you watch a slow motion video of a violin being played, you can see the battle between friction and tension playing out as a sawtooth wave). Finally, we will also have a square wave, for digital bleepy bloopy sounds.

Now that we have our base we come to the ‘subtractive’ part of subtractive synthesis. This is typically done by a (nonideal) low pass filter, and today our filter has two parameters: cutoff and resonance. Roughly speaking, the cutoff frequency is the first frequency our filter would diminish by a certain amount (3 dB), were it not for a small aberration around the cutoff frequency itself, the magnitude of which is defined by the resonance. Resonance is one of those things you need to hear to understand — a high resonance sounds ‘wet’ and ‘buggy’, especially when combined with movement in the cutoff frequency. Here is what it sounds like when filter cutoff and resonance are modulated:

Hopefully the oscilloscope looks familiar! Typically you want this kind of modulation to be automated, usually in sync with the BPM, and in our synthesizer there will be two kinds of such modulation.

The first is a ‘global’ modulation — a rise and fall. This evolution is typically done by an ADSR envelope, with the four parameters attack, decay, sustain, and release.

Attack defines the time it takes for our filter to reach full blast. Then the decay defines how long it takes for the filter to settle down to our sustain level, which is the level we want to maintain as long as the note is held down. So for a plucked sound, we would have a sustain of zero, and for an organ sound, something more like one. Finally, release defines how long it takes for our signal to die away after we release the note.
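Sketched in code (a hypothetical piecewise-linear version, not the STK envelope class we will actually use):

import numpy as np

def adsr(attack, decay, sustain, release, hold, sr=24000):
    # Times are in seconds; sustain is a level in [0, 1].
    a = np.linspace(0, 1, int(attack * sr))         # climb to full blast
    d = np.linspace(1, sustain, int(decay * sr))    # settle down to the sustain level
    s = np.full(max(int(hold * sr) - len(a) - len(d), 0), sustain)  # hold while the key is down
    r = np.linspace(sustain, 0, int(release * sr))  # die away after the key is released
    return np.concatenate([a, d, s, r])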

To demonstrate, here are two sounds — firstly a pad, with long attack, decay, and release times, plus a high sustain, vs. a lead sound, with the opposite (but still lots of sustain):

An ADSR envelope can be applied to both filter cutoff and volume (and anything else), but we will only have an envelope for cutoff. Anyways, zero cutoff implies zero volume, so this is the stronger choice.

Not only can we have our cutoff frequency rise and fall over time (much like a wavelet) but we can also have it oscillate constantly over time (much like a cisoid). This modulation is carried out by a Low Frequency Oscillator, or LFO for short. An LFO has two parameters — frequency and intensity. We will also apply one to the pitch to give us some operatic vibrato, and with this, we have listed all of our parameters. Here’s how frequency and intensity affect the sound of vibrato:
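In code terms, vibrato is nothing more than a slow sine wave modulating the oscillator's frequency. A rough standalone sketch (vibratoSine is a hypothetical helper, not part of any library):

import numpy as np

def vibratoSine(pitch, lfoFreq, lfoInt, duration, sr=24000):
    # A sine wave whose instantaneous frequency wobbles at lfoFreq Hz.
    t = np.arange(int(duration * sr)) / sr
    instFreq = pitch * (1 + lfoInt * np.sin(2 * np.pi * lfoFreq * t))
    phase = 2 * np.pi * np.cumsum(instFreq) / sr  # integrate frequency to get phase
    return np.sin(phase)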

Writing a Basic Synthesizer

This section will be short, as our synthesizer will really be quite straightforward, and there isn’t a lot to say.

We will implement the synthesizer in C++ using the lovely Synthesis ToolKit in C++. The public API of our Synth class looks like this:

class Synth
{
public:
    enum waveform {
        SINE,
        SAW,
        SQUARE
    };

    Synth(waveform waveType);

    void setPitch(float newPitch);
    void setWaveForm(waveform type);
    void setFilterParameters(float cutoff, float resonance);
    void setFilterADSR(float attackTime, float decayTime, float sustainLevel, float releaseTime);
    void setFilterLFO(float frequency, float intensity);
    void setVibrato(float frequency, float intensity);

    unique_ptr<StkFrames> synthesize();
    friend std::ostream& operator<<(std::ostream& os, const Synth& obj);

    ...
};

StkFrames is essentially just a vector of audio samples, which are themselves each basically floats. The only setter worth talking about is setWaveForm. We want to be able to set the frequency of our oscillator (obviously) — but the common class ancestor of our three waves is stk::Generator, which has no notion of frequency (for example, the ADSR envelope coming up shortly inherits from that same class). We could maybe get by with an std::variant, but we instead factor out the frequency setter into a very simple Oscillator class. Now this setter just sets the private unique_ptr<Oscillator> generator:

void Synth::setWaveForm(waveform type) {
    generator.reset();
    wave = type;
    switch (type) {
    case SINE:
        generator = make_unique<SineWave>();
        break;
    case SAW:
        generator = make_unique<BlitSaw>();
        break;
    case SQUARE:
        generator = make_unique<BlitSquare>();
        break;
    default:
        std::cerr << "Weird wave seen";
        generator = make_unique<SineWave>();
    }
}

The operator<< override encodes our patch (a.k.a. the settings of the synth) into what will become the target vector for our ML:

ostream& operator<<(std::ostream& os, const Synth& syn)
{
    function<void(float)> writeParameter = [&os](float toWrite)
    {
        os << std::to_string(toWrite) << " ";
    };

    os << syn.waveFormEncodings[syn.wave] << " ";
    writeParameter(syn.pitch);
    writeParameter(syn.filterCutoff);
    ...
    writeParameter(syn.pModInt);

    return os;
}

Here syn.waveFormEncodings just gives the one-hot encodings of our three waves:

static constexpr array<const char*, 3> waveFormEncodings{ "1 0 0", "0 1 0", "0 0 1" };

Finally we come to generate our sound. The meat here looks like:

unique_ptr<StkFrames> outputFrames = make_unique<StkFrames>(framesToGenerate, 1);

function<void(int)> tick = [&](int frame) {
    generator->setFrequency(pitch * (1 + pModInt * vibratoLFO.tick()));
    filter.setLowPass(
        filterEnv.tick() * (1 + fModInt * filterLFO.tick()) * filterCutoff,
        filterResonance
    );
    (*outputFrames)[frame] = filter.tick(
        generator->tick()
    );
};

int frame;
filterEnv.keyOn();
for (frame = 0; frame < framesToGenerate * 3 / 4; ++frame) tick(frame);
filterEnv.keyOff();
for (; frame < framesToGenerate; ++frame) tick(frame);

At every tick, we sample our modulators — the two LFOs and the filter envelope — and multiply our pitch and cutoff frequency appropriately. We hold the note for the first three quarters of the recording period, letting the attack and decay play out and the sustain hold, and then let go, letting the release follow.

We are now in the fortunate situation of being able to generate as much training data as we need. And all we do here (for now!) is generate random values for the parameters, plug them into our new synthesizer, and record the result. We choose the pitch randomly out of two octaves of twelve notes. Human perception of pitch is logarithmic (more on this shortly):

inline float notesFromA4(int note, int octave) {
return 440.0f * pow(2, octave + note / 12.0f);
}

But we don’t want to pick all of our parameters completely at random. For example, we want to have samples with literally zero sustain, or zero vibrato, etc., which is unlikely if we sample values uniformly. So we write things like

float fAttack = duration * uniformSample() / 4;
float fDecay = duration * uniformSample() / 4;
float fSustain = (rand() % 5) ? uniformSample() : 0;
float fRelease = duration * uniformSample() / 4;

where uniformSample returns a random float sampled uniformly between 0 and 1. We generate a validation set of 600 examples, which is fixed throughout the rest of the article for comparison purposes. Here is some variety in the sounds we generate:

That’s all! Let’s move on to the audio features.

Basic CWT Features

Choosing Scales and the Central Frequency

All of the features we compute in this article are based on the coefficients produced by a CWT. These coefficients form a matrix, indexed by scale and then time, each entry being a (possibly complex) number whose magnitude indicates the ‘affinity’ the signal has for that scale of wavelet centered around that time.

The PyWavelets library makes it super easy to find these coefficients, which we can plot as a scaleogram, using origin=”upper” as bigger scales have lower frequencies (and the first row is the lowest scale):

import pywt
import librosa
import numpy as np
import matplotlib.pyplot as plt

SAMPLE_RATE = 24000
WAVELET_NAME = "mexh"
WAVELET_SCALES = np.arange(0.5, 32.5, 0.5)  # 64 scales: 0.5, 1.0, ..., 32
signal, _ = librosa.load("string.wav", sr = SAMPLE_RATE)
coeffs, _ = pywt.cwt(signal, WAVELET_SCALES, WAVELET_NAME)
plt.imshow(np.abs(coeffs), cmap="viridis", aspect="auto", origin="upper")

Scaleogram of a synthesized string sound.

But how did we choose these scales? To think about this properly, we will need to complete the analogy between scale and frequency hinted at before. Take the Mexican hat wavelet, so named for obvious reasons, which is the wavelet we will use today:

Mexican hat wavelet.

Each scale of wavelet has a frequency that it is ‘like’, the so-called center frequency. This has at least two definitions that don’t always coincide — it can be defined as the ‘center of mass’ of the DFT, or, as PyWavelets does it, simply the peak frequency of the DFT. To pump the intuition we will just settle for an image:


Center frequency (schematic).

So we choose our smallest and largest scales to have center frequencies matching some reasonable upper and lower bounds, respectively, on the ‘useful’ harmonic content of our signal — bounds which we now have to pick.

The lower bound on wavelet scale is easy. We will just pick the Nyquist frequency corresponding to our sample rate. The intuition here is this — say we record, by discrete samples, a high frequency sine wave sin(Bt) where B is large, and the recording starts at t = 0, setting the first sample to 0. If that sine wave peaks before we take our second sample — that is, if its frequency is more than half of our sampling frequency — it appears to still be on its ‘way up’ at that second sample, even though it is actually on its way down. This phenomenon is known as aliasing, because the high frequency wave already on its way down and a low frequency wave still on its way up appear the same; it also explains the wagon wheel effect you notice in videos of fast-spinning wheels. The point is that the highest frequency we can meaningfully talk about is half of our sampling frequency, and we can find our smallest scale from that:

# the scale whose center frequency sits at Nyquist (frequency2scale works in
# normalized frequency, hence the multiplication by the sample rate)
pywt.frequency2scale(WAVELET_NAME, SAMPLE_RATE // 2) * SAMPLE_RATE
#0.5
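As a quick numerical check on the aliasing story above, a 23 kHz sine sampled at 24 kHz produces exactly the same samples as a 1 kHz sine travelling the opposite way:

import numpy as np

fs = 24000
t = np.arange(0, 0.01, 1 / fs)
print(np.allclose(np.sin(2 * np.pi * 23000 * t),
                  np.sin(2 * np.pi * -1000 * t)))  # True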

For the upper bound, I choose 32, because it’s a power of 2 and seems not to truncate the scaleograms we generate later. With similar (non)reasoning I choose 64 evenly spaced scales between 0.5 and 32 inclusive. And with this, we run head first into our first issue:

np.shape(coeffs)
#(64, 96000)

These coefficients are simply too numerous to act as our features out of the box! I only have a laptop, and we will need to engineer some features out of these many, many numbers.

The Basic CNN and Aggressive Linear Downsampling

Let’s start with the most basic thing that comes to mind: just downsample the image. The scaleograms displayed in this article are very heavily downsampled — from a width of 96000 down to something like 500 — but they still look okay, so I think it’s worth a try.

Let’s first plug in our synthesizer as a generator, which we will try to fit with Keras:

import os
import glob
import subprocess

def processWavs(datapath):
    patchFiles = sorted(glob.glob(os.path.join(datapath, "*.txt")))
    waveFiles = sorted(glob.glob(os.path.join(datapath, "*.wav")))
    waves = [librosa.load(waveFile, sr=SAMPLE_RATE, dtype=np.float32)[0] for waveFile in waveFiles]
    scaleograms = [pywt.cwt(wave, WAVELET_SCALES, WAVELET_NAME)[0] for wave in waves]
    rawParams = [np.loadtxt(patchFile) for patchFile in patchFiles]

    return (
        np.asarray(scaleograms),
        np.split(  # split to separate the categorisation from the rest
            [normaliseParams(params) for params in rawParams],
            [NUM_WAVES],
            axis=1
        )
    )

def generateBatch():
    print("generating new batch")
    datapath = os.path.abspath("./trainingdata")
    while True:
        subprocess.run(f"{SYNTHESIZER_PATH} {EXAMPLES_PER_BATCH} {datapath}")
        yield processWavs(datapath)

This grabs soundwaves and turns them into scaleograms, paired with the target vectors, as many as we need. Now we need to build a network to feed these scaleograms to. Downsampling is just average pooling, so here we go…

from tensorflow.keras.layers import Input, Conv2D, AveragePooling2D, Flatten, Dense, Softmax
from tensorflow.keras.models import Model
from tensorflow.keras.losses import CategoricalCrossentropy, MeanSquaredError

inputLayer = Input(shape=(len(WAVELET_SCALES), SAMPLES_PER_EXAMPLE, 1))
pool0 = AveragePooling2D(pool_size=(1,20), strides=(1,20), padding="same")(inputLayer)
conv1 = Conv2D(16, (3,8), strides=(2,6), activation="relu")(pool0)
pool1 = AveragePooling2D(pool_size=(1,8), strides=(1,7), padding="same")(conv1)
conv2 = Conv2D(32, (3,7), strides=(1,6), activation="relu")(pool1)
pool2 = AveragePooling2D(pool_size=(1,5), strides=(1,4), padding="same")(conv2)
conv3 = Conv2D(64, (2,3), strides=(2,2), activation="relu", padding="same")(pool2)
conv4 = Conv2D(128, (2,2), strides=(1,1), activation="relu", padding="same")(conv3)
# flatten and then bifurcate the network into the classification and regression parts
flat = Flatten()(conv4)
classDense1 = Dense(min(flat.shape[1] // 4, NUM_WAVES * 2), activation="relu")(flat)
classDense2 = Dense(NUM_WAVES, activation="relu")(classDense1)
classOutput = Softmax(name="classout")(classDense2)
regressionDense1 = Dense(min(flat.shape[1] // 4, NUM_OTHER_FEATURES * 4), activation="relu")(flat)
regressionOutput = Dense(NUM_OTHER_FEATURES, name="regressionout", activation="relu")(regressionDense1)

model = Model(
    inputs = [inputLayer],
    outputs = [classOutput, regressionOutput],
)

model.compile(
    optimizer="adam",
    loss={
        "classout": CategoricalCrossentropy(from_logits=False),
        "regressionout": MeanSquaredError(),
    },
    metrics=["accuracy"]
)

Excited, I run my first ever neural network, but to my horror:

The training loss curve, plateauing at a suspicious 1.0986.

This minimum must come from somewhere. Let’s ask Wolfram Alpha:

ln(3) = 1.0986…

And let’s also look at the definition of our cross-entropy loss function:

L = −Σᵢ tᵢ log(pᵢ)

where pᵢ is the probability our network assigns to the ith category, and tᵢ is an indicator variable, exactly one of which is one. So, in the best case, the network is learning something — but no more than the distribution of its dataset — or, what is more likely, it is guessing the same shape always. (Remember we picked one out of three waves uniformly at random.)
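To make the plateau concrete: a network that always guesses the uniform distribution over our three waves scores exactly ln 3, whichever wave is actually correct.

import numpy as np

p = np.array([1/3, 1/3, 1/3])  # predicted probabilities: always the uniform guess
t = np.array([0, 1, 0])        # any one-hot target
print(-np.sum(t * np.log(p)))  # 1.0986... = ln(3)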

I tried complicating the network, adding layers and making existing layers bigger, but no bueno. We will probably have to do better than the simplest thing we could have done.

Logarithmic Downsampling

All of our parameters apart from the ADSR settings are inferable from the ‘texture’ of even a short sample of our signal. But this fine-grained texture was completely destroyed by the downsampling we just tried. We would like somehow to present the network with both a short, undiluted sample of the signal and the entire signal, without introducing a boundary or any discontinuities or anything weird like that.

One way to try to achieve this is to use nonuniform blocks for our downsampling, exponentially increasing in length. The blocks start out very small, displaying the texture of the sound; then, once the CNN hopefully gets the picture, we (relatively smoothly) squash the large-scale shape of the ADSR envelope into what is left of our now much shorter image. Or, in code,

BASE = 1.01
PREFIX = 500
coeffs = np.abs(pywt.cwt(signal[PREFIX:], WAVELET_SCALES, WAVELET_NAME)[0])

# find where the exponentially growing blocks run off the end of the signal
STOPPING_POINT = 0
for i in range(1, SAMPLES_PER_EXAMPLE):
    if i + BASE ** i >= SAMPLES_PER_EXAMPLE - PREFIX:
        STOPPING_POINT = i - 1
        break

def logScaleImage(image):
    # average the image columns over blocks of exponentially increasing width
    indices = [i + int(BASE ** i) for i in range(1, STOPPING_POINT)]
    splitImage = np.split(image, indices, axis=1)
    return np.asarray([
        [np.mean(frame) for frame in imageFrame] for imageFrame in splitImage
    ]).T

fig, ax = plt.subplots(2)
ax[0].imshow(coeffs, aspect="auto")
ax[1].imshow(logScaleImage(coeffs), aspect="auto")

(I have always found it annoying that equations like x + bˣ = y don’t have nice closed form solutions for x.) We cut off a short prefix so we don’t dedicate half of our image to silence. Here’s an example of the resulting plot, pre and post scaling:

This kind of reminds me of the Medium logo! Note especially the zero we knocked off of our largest x tick. Because of this we can now afford to throw together a network much less aggressive in its downsampling:

inputLayer = Input(shape=(len(WAVELET_SCALES), STOPPING_POINT, 1))
conv1 = Conv2D(16, (3,3), strides=(2,2), activation="relu")(inputLayer)
conv2 = Conv2D(32, (3,3), strides=(2,2), activation="relu")(conv1)
conv3 = Conv2D(64, (3,3), strides=(2,2), activation="relu", padding="same")(conv2)
conv4 = Conv2D(128, (3,3), strides=(2,2), activation="relu", padding="same")(conv3)
conv5 = Conv2D(128, (3,3), strides=(2,2), activation="relu", padding="same")(conv4)
flat = Flatten()(conv5)
classDense1 = Dense(min(flat.shape[1] // 2, NUM_WAVES * 4), activation="relu")(flat)
classDense2 = Dense(NUM_WAVES, activation="relu")(classDense1)
classOutput = Softmax(name="classout")(classDense2)
regressionDense1 = Dense(min(flat.shape[1] // 4, NUM_OTHER_FEATURES * 4), activation="relu")(flat)
regressionOutput = Dense(NUM_OTHER_FEATURES, name="regressionout", activation="relu")(regressionDense1)

Great news! Our categorisation loss, amazingly, not only drops below ln(3), but plummets close to zero! Our network is finally learning. Let’s write some code that delivers some insight into how much exactly:

[scaleos, [waveTruth, continuousTruth]] = getValidationSet()
[wavePredictions, continuousPredictions] = model.predict(scaleos)
waveLosses = np.mean(- waveTruth * np.log(wavePredictions), axis=0)
continuousLosses = np.mean((continuousTruth - continuousPredictions)**2, axis=0)
loss = np.sum(waveLosses) + np.sum(continuousLosses)

print(f"overall loss: {loss / 2:.4f}")
for waveName, waveLoss in zip(waveNames, waveLosses):
    print(f"{waveName}: {waveLoss:.4f}")
for paramName, paramLoss in zip(continuousParamNames, continuousLosses):
    print(f"{paramName}: {paramLoss:.4f}")

Here’s what it says:

overall loss: 0.4608
sine: 0.0135
saw: 0.0056
square: 0.0012
pitch: 0.0277
filterCutoff: 0.0255
filterResonance: 0.0894
fAttackTime: 0.0345
fDecayTime: 0.0874
fSustainLevel: 0.1030
fReleaseTime: 0.0906
fModFreq: 0.1025
fModInt: 0.0872
pModInt: 0.1156
pModFreq: 0.1378

(We have normalised all parameters to live in [0,1].) Predictably, this model predicted the attack time well but found the other envelope times hardest. This makes sense: the attack is contained in the only part of the image not exponentially downsampled. Let’s try to use some features that treat our columns more fairly.

The Discrete Cosine Transform

If you refer back to the GIF above of the construction of the sawtooth wave, you can see that we already have another way to reduce dimensions — take the DFT, throw away some number of coefficients, and hope that the network can discover the meaning of what’s left. So why are we about to introduce something else? The answer lies in that same GIF. It’s to do with discontinuities at the boundaries.

Let’s take a realistic signal, signal = np.tanh(np.linspace(-3, 3, N)), where, say, N = 1000:

When we take the DFT of this signal, or of any other finite length signal, we are implicitly defining an infinite extension of the signal from both ends. To wit, in this case:

from scipy.fft import fft

tanhFFT = fft(signal)
plt.plot(
    np.linspace(-3, 9, 2 * N),
    np.real([  # the imaginary parts are just rounding noise
        np.sum(
            [
                (1/N) * tanhFFT[freq] * np.exp(1j * index * freq * 2 * np.pi / N)
                for freq in range(len(tanhFFT))
            ]
        )
        for index in range(int(-N/2), int(3 * N/2))
    ])
)
plt.plot(np.linspace(0, 6, N), signal)

(The np.exp implements the inverse DFT, which will be explained in a minute.) In general the DFT secretly imposes a periodic boundary constraint, and so if your signal is not periodic, that manifests as a discontinuity which is then necessarily implemented by a smattering of high frequency content.

But given that we don’t really care how exactly the signal is extended (from our POV inside the recording period we can’t tell anyway), we are free to add any extension we want before taking the DFT. So why not extend our signal in a way that makes it periodic? Why not just mirror it around 0?

signalMirrored = np.concatenate([np.flip(signal)[:-1], signal])  # length 2N - 1
plt.plot(np.linspace(-6, 6, len(signalMirrored)), signalMirrored)

Now see that the periodic ‘DFT super-extension’ of this mirrored extension has no discontinuities:

Less discontinuity should mean less magnitude in the high-frequency end of the DFT. We can check this out by comparing the (normalised) spectra:

spectrum = np.abs(np.fft.rfft(signal))
spectrumMirrored = np.abs(np.fft.rfft(signalMirrored))

freqs = np.fft.rfftfreq(len(signal))
freqsMirrored = np.fft.rfftfreq(len(signalMirrored))

plt.plot(freqs, spectrum / np.sum(spectrum), label="DFT")
plt.plot(freqsMirrored, spectrumMirrored / np.sum(spectrumMirrored), label="Mirrored DFT")
plt.legend()

Energy compaction.

The DCT is nothing more than this mirrored DFT. It’s invertible too: we can, for example, invert the DFT and then read off the part to the right of x = 0. So, two questions:

a) What’s the point?

The point is energy compaction. We are trying to reduce dimensions by picking some small subset of the transform coefficients per row, and moreover, we want to do this in a way that doesn’t depend on the data. (For example, with a singular spectrum analysis, you can’t properly interpret your singular values apart from their corresponding elementary matrices, which depend very intimately on the data they came from, so these won’t do here, at least without some work.) With the DCT we can comfortably lean towards just taking the lower frequency coefficients, because, at least heuristically, more of the signal energy went that way. And the DCT basis vectors don’t depend closely on the data — in fact, we are about to see that they depend only on the number of samples, which is fixed or otherwise cheaply encoded.

This is one reason why the DCT is at the heart of many compression algorithms we love and use daily, such as JPEG, MPEG, and MP3.
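To see energy compaction doing some work, here is a small sketch with SciPy's dct, keeping (say) the first 50 of 1000 coefficients of our tanh signal from before:

import numpy as np
from scipy.fft import dct, idct

# Zero out all but the first K DCT coefficients, then reconstruct. Because
# most of the energy sits in the low frequencies, the error stays small.
N, K = 1000, 50
row = np.tanh(np.linspace(-3, 3, N))
coeffs = dct(row, norm="ortho")
coeffs[K:] = 0
approx = idct(coeffs, norm="ortho")
print(np.max(np.abs(row - approx)))  # a small reconstruction error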

b) Why is it called the Discrete Cosine Transform?

To answer this question we will first need to more formally define the DFT, which we have avoided doing so far. If you have a discrete signal (vector) x of length L, the DFT of that signal, y, is also a signal, with the kth component measuring by an inner product how much your signal is like a cisoid with angular velocity the kth harmonic of the fundamental frequency 2π/L (which is the frequency of one cycle per recording window):

yₖ = Σₙ xₙ e^(−2πjkn/L)

Here we have reluctantly used j as the imaginary unit to be consistent with Numpy, and have extended x with zeroes. We also have Euler’s formula:

e^(jθ) = cos(θ) + j sin(θ)

We can already briefly answer the question now in a handwavey sense. By mirroring our signal around 0 we have made it even, that is, xₙ = x₋ₙ. We are taking an inner product of an even signal with a sum of an even (cos) and an odd (sin) signal, and so the odd part will cancel out. But just for completeness:

yₖ = Σₙ xₙ e^(−2πjkn/L)
   = x₀ + Σₙ₌₁ xₙ e^(−2πjkn/L) + Σₙ₌₁ x₋ₙ e^(2πjkn/L)
   = x₀ + Σₙ₌₁ xₙ e^(−2πjkn/L) + Σₙ₌₁ xₙ e^(2πjkn/L)
   = x₀ + Σₙ₌₁ xₙ (e^(2πjkn/L) + e^(−2πjkn/L))

where in the third equality we used the evenness of x. From Euler’s formula above

e^(jθ) + e^(−jθ) = 2 cos(θ)

so

e^(2πjkn/L) + e^(−2πjkn/L) = 2 cos(2πkn/L)

and finally

yₖ = x₀ + 2 Σₙ₌₁ xₙ cos(2πkn/L)

hence the name! Notice, also, that this transform is already real so long as x is real — yet another reason it’s liked in applications.
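We can also check this numerically: the DFT of an even extension is real, and (with the extension below, which drops both endpoints of the reflection to get an exact period) coincides with SciPy's type-I DCT:

import numpy as np
from scipy.fft import fft, dct

x = np.random.rand(8)
# Even extension with period 2L - 2
mirrored = np.concatenate([x, np.flip(x)[1:-1]])
viaFFT = fft(mirrored).real[:len(x)]         # the imaginary parts all vanish
print(np.allclose(viaFFT, dct(x, type=1)))   # True: this is exactly the DCT-I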

There are different types of DCT, obtained by different types of mirroring, but the intuition just given is common to all of them. Let’s put it to work.