# nanogpt-Audio

This project is inspired by Andrej Karpathy's nanoGPT.

## Project Description
In this project, I generated audio from Shakespeare's works using a text-to-speech model. Then, I trained a transformer model solely on this audio data. The output you hear is sound that the model has learned to generate from scratch, capturing the patterns and structures of the training audio.
## Sample Output
Listen to a sample aligned output generated by the model:
aligned_sample.mp4
## How to run this

Getting this to work is pretty straightforward. I did this on my Mac M3, so the default settings are tuned for that. If you have a real GPU, feel free to crank things up!
### 1. Data Prep
First, we need to give the model something to listen to.
- `prepare_input_audio.py`: This script reads `input.txt` (which has some Shakespeare in it) and uses `edge_tts` to turn it into speech. It saves this as `shakespeare.wav`.
- `prepare_tokens.py`: Once we have the audio, this script runs it through EnCodec to turn the sound waves into tokens that the transformer can learn from. The output is `train.bin`. (Both steps are sketched below.)
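If you want to see what those two steps boil down to, here's a minimal sketch using the `edge-tts` and EnCodec APIs. The voice name, the 6 kbps bandwidth, and the exact way the codebooks get flattened into `train.bin` are my assumptions for illustration; the repo's scripts may differ.

```python
# A minimal sketch of both prep steps, not the repo's exact code.
# Assumptions: the voice name, the 6 kbps bandwidth, and flattening the
# codebooks frame-by-frame into train.bin are illustrative choices.
import asyncio

import numpy as np
import torch
import torchaudio
import edge_tts
from encodec import EncodecModel
from encodec.utils import convert_audio


async def text_to_speech(text: str, out_path: str = "shakespeare.mp3") -> None:
    # edge-tts emits mp3 by default; the repo's script presumably
    # converts this to shakespeare.wav.
    await edge_tts.Communicate(text, "en-GB-RyanNeural").save(out_path)


def audio_to_tokens(audio_path: str, out_path: str = "train.bin") -> None:
    model = EncodecModel.encodec_model_24khz()
    model.set_target_bandwidth(6.0)  # 6 kbps -> 8 codebooks per frame

    # Loading mp3 needs an ffmpeg-backed torchaudio; WAV always works.
    wav, sr = torchaudio.load(audio_path)
    wav = convert_audio(wav, sr, model.sample_rate, model.channels)
    with torch.no_grad():
        frames = model.encode(wav.unsqueeze(0))  # list of (codes, scale)
    codes = torch.cat([c for c, _ in frames], dim=-1)  # [1, n_q, T]

    # Flatten to a 1-D uint16 stream, nanoGPT train.bin style:
    # all 8 codebook entries for frame 0, then frame 1, and so on.
    codes.squeeze(0).t().reshape(-1).numpy().astype(np.uint16).tofile(out_path)


if __name__ == "__main__":
    asyncio.run(text_to_speech(open("input.txt").read()))
    audio_to_tokens("shakespeare.mp3")
```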
### 2. Training

- `train.py`:
  - Mac M3 Note: I used the config at `config/train_shakespeare_audio.py`, which is set up for MPS (Metal Performance Shaders) acceleration. It keeps the batch sizes manageable so your Mac doesn't melt. (A hypothetical example config is sketched below.)
  - To start training, just run:

    ```sh
    python train.py config/train_shakespeare_audio.py
    ```
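For context, nanoGPT-style configs are just Python files of variable overrides that `train.py` executes on top of its defaults. Here's a hypothetical illustration of what `config/train_shakespeare_audio.py` might look like; every value is a guess, so check the actual file in the repo.

```python
# Hypothetical config in the nanoGPT style -- every value here is
# illustrative; see config/train_shakespeare_audio.py for the real ones.
out_dir = 'out-shakespeare-audio'
eval_interval = 250
eval_iters = 100

dataset = 'shakespeare_audio'
batch_size = 8               # small enough for an M3's unified memory
block_size = 512             # context length, in EnCodec tokens
gradient_accumulation_steps = 4

n_layer = 6
n_head = 6
n_embd = 384
dropout = 0.1

learning_rate = 1e-3
max_iters = 5000

device = 'mps'               # Metal Performance Shaders on Apple Silicon
compile = False              # torch.compile support on MPS is spotty
```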
### 3. Making Noise (Generation)

- `generate_sample_audio.py`: After training, run this to generate a new sample (`aligned_sample.wav`). It loads the checkpoint and tries to speak kinda like Shakespeare! The decode path is sketched below.
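The decode path is essentially the prep pipeline in reverse: sample a token stream from the trained transformer, un-flatten it back into EnCodec codebooks, and decode to a waveform. A minimal sketch, assuming nanoGPT-style checkpoints (`model_args`, a `generate` method on the `GPT` class) and the same 8-codebook layout as the prep sketch; `generate_sample_audio.py` may differ in the details.

```python
# Decode-side sketch. Assumes nanoGPT-style checkpoints and the same
# 8-codebook flattening as the prep sketch; the repo's script may differ.
import torch
import torchaudio
from encodec import EncodecModel
from model import GPT, GPTConfig  # nanoGPT's model.py

device = 'mps'
ckpt = torch.load('out-shakespeare-audio/ckpt.pt', map_location=device)
model = GPT(GPTConfig(**ckpt['model_args']))
model.load_state_dict(ckpt['model'])
model.to(device).eval()

# Sample 600 frames' worth of tokens (8 codebook entries per frame),
# seeded with a single zero token.
n_q = 8
start = torch.zeros((1, 1), dtype=torch.long, device=device)
with torch.no_grad():
    tokens = model.generate(start, max_new_tokens=n_q * 600,
                            temperature=0.9, top_k=100)

# Drop the seed token, un-flatten to [1, n_q, T], and decode to audio.
codes = tokens[0, 1:].reshape(-1, n_q).t().contiguous().unsqueeze(0)
codec = EncodecModel.encodec_model_24khz()
codec.set_target_bandwidth(6.0)
with torch.no_grad():
    wav = codec.decode([(codes.cpu(), None)])
torchaudio.save('aligned_sample.wav', wav.squeeze(0).cpu(), codec.sample_rate)
```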