
nanogpt-Audio

This is an experimental fork of Andrej Karpathy's nanoGPT that learns to speak Shakespeare by modeling EnCodec audio tokens.

Project Description

In this project, I generated audio from Shakespeare's works using a text-to-speech model, then trained a transformer solely on that audio data. The output you hear is sound the model has learned to generate from scratch, capturing the patterns and structures of the training audio.

Sample Output

Listen to a sample aligned output generated by the model:

aligned_sample.mp4

How to run this

Getting this to work is pretty straightforward. I developed it on my Mac M3, and the default settings are tuned for that hardware. If you have a real GPU, feel free to crank things up!

1. Data Prep

First, we need to give the model something to listen to.

  • prepare_input_audio.py: This script reads input.txt (which contains some Shakespeare) and uses edge_tts to turn it into speech, saving the result as shakespeare.wav (a sketch of this step follows the list).
  • prepare_tokens.py: Once we have the audio, this script runs it through EnCodec to turn the waveform into discrete tokens the transformer can learn from. The output is train.bin (see the tokenization sketch below).
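
A minimal sketch of the TTS step, assuming the edge_tts Python API (the voice name is an assumption, and edge_tts emits MP3, so a conversion step to WAV is assumed as well):

    import asyncio
    import edge_tts

    VOICE = "en-GB-RyanNeural"  # assumption: the repo may use a different voice

    async def main() -> None:
        # Read the Shakespeare text the model will eventually "speak".
        with open("input.txt", "r", encoding="utf-8") as f:
            text = f.read()
        # Synthesize speech; edge_tts streams MP3 from Microsoft's TTS service.
        communicate = edge_tts.Communicate(text, VOICE)
        await communicate.save("shakespeare.mp3")

    asyncio.run(main())

Converting the MP3 into the WAV the next script expects can then be done with, e.g., ffmpeg -i shakespeare.mp3 shakespeare.wav.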
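
And a minimal sketch of the tokenization step, assuming the 24 kHz EnCodec model and a simple per-timestep interleaving of the codebooks into a flat uint16 stream (the repo's actual serialization for train.bin may differ):

    import numpy as np
    import torch
    import torchaudio
    from encodec import EncodecModel
    from encodec.utils import convert_audio

    model = EncodecModel.encodec_model_24khz()
    model.set_target_bandwidth(6.0)  # assumption: 6 kbps -> 8 codebooks per frame

    wav, sr = torchaudio.load("shakespeare.wav")
    wav = convert_audio(wav, sr, model.sample_rate, model.channels)

    with torch.no_grad():
        frames = model.encode(wav.unsqueeze(0))  # list of (codes, scale) tuples
    codes = torch.cat([c for c, _ in frames], dim=-1)  # [1, n_q, T]

    # Interleave the codebooks per time step and dump to disk, nanoGPT-style.
    tokens = codes.squeeze(0).t().flatten().numpy().astype(np.uint16)
    tokens.tofile("train.bin")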

2. Training

  • train.py:
    • Mac M3 Note: I used the config at config/train_shakespeare_audio.py, which is set up for MPS (Metal Performance Shaders) acceleration. It keeps the batch sizes manageable so your Mac doesn't melt (an illustrative config sketch follows these steps).
    • To start training, just run:
      python train.py config/train_shakespeare_audio.py
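
nanoGPT configs are plain Python files of variable assignments; an illustrative version of an MPS-friendly config might look like this (the values are assumptions, not the repo's actual numbers):

    # config/train_shakespeare_audio.py -- illustrative values only
    out_dir = 'out-shakespeare-audio'
    device = 'mps'        # Metal Performance Shaders backend on Apple silicon
    compile = False       # torch.compile is not supported on MPS
    dtype = 'float32'     # MPS has limited low-precision support
    batch_size = 8        # small enough that an M3 stays cool
    block_size = 256      # context length, measured in EnCodec tokens
    vocab_size = 1024     # each EnCodec codebook has 1024 entries (assumption)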

3. Making Noise (Generation)

  • generate_sample_audio.py: After training, run this to generate a new sample (aligned_sample.wav). It loads the checkpoint, samples new audio tokens, and tries to speak kinda like Shakespeare! A sketch of the token-to-waveform decode step follows.
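
The decode step is the mirror image of tokenization: reshape the sampled token stream back into EnCodec codebook indices and let EnCodec synthesize a waveform. A hedged sketch, assuming the same 8-codebook interleaving as above (the token file name is hypothetical):

    import torch
    import torchaudio
    from encodec import EncodecModel

    model = EncodecModel.encodec_model_24khz()
    model.set_target_bandwidth(6.0)
    n_q = 8  # must match the bandwidth used during tokenization

    # `sampled` is assumed: a 1-D LongTensor of tokens produced by the trained GPT.
    sampled = torch.load("sampled_tokens.pt")  # hypothetical file name
    codes = sampled[: len(sampled) // n_q * n_q].view(-1, n_q).t()  # [n_q, T]

    with torch.no_grad():
        wav = model.decode([(codes.unsqueeze(0), None)])  # [1, channels, samples]
    torchaudio.save("aligned_sample.wav", wav.squeeze(0), model.sample_rate)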