On making music with the machine


I first tried making music with the machine in 2017. I had been writing (prose, poetry, nonsense) for as long as I could remember, and - for just as long - no one had shown much interest in reading my work. Then, after coming across this article, I thought: perhaps I should use one of them new-fangled neural-nets to turn my words into their primitive... into song?

My first attempt was naively under-powered and hilariously under-coded. But it was fun tinkering nonetheless, even if all it taught me was the hard limit of my ability. Then fast-forward to 2020, when an organization more suited to the purpose (and shamelessly devoted to the open-sourcing of machine models) released Jukebox. I have a flash-bulb memory of when I spun up the script for the first time, and when (after the usual hiccups and parameter-fixes) I managed to run it through to completion.

“I was bored with my beard so I tugged it. It came off as I feared.”

I had been born with a tin ear and soggy rhythm that no amount of violin lessons had been able to correct. I had never finished a karaoke song in the key I started it in. But behold: here were my words suddenly set to music... what power had I suddenly gained? Information had learned to process itself in intricate ways and the promise of the cybernetic revolution seemed to be at hand. I disappeared into the many-tiered sinkhole of Google Colab, and emerged two months later with what I could call an album. An album that - to my ear - sounded astonishing.

“To my ear” being the key phrase here, because the astonishment came from being forty at the time. I thought I was old enough to know what I would never be capable of. Yet here I was making music - with and due to the machine, yes - but I was an instigator nevertheless, an author and a cause. I had failed at this problem quite recently, so I had a sense of how difficult it must have been to solve. And it had been solved by a program that could conjure just the strange sounds that appealed to my aesthetics. No wonder I was pleased with the state I found myself in.

There was also the measured, painful slowness of the process by which the music was made. Depending on the server Google dealt me, Jukebox could take up to six hours to generate a batch of three songs. I’d leave the computer running, and the terminal window would scroll through the “% complete” and “time remaining” updates, and the anticipation would build. Like with Napster in a dorm room two decades ago... except these dispatches coming down from the ether were “mine”, supposedly. Then I’d sift through 20 or 30 such batches before I could stitch together a 2-minute song. Eventually, I heard in these Jukebox clips what others could not simply because the music had begun to mean more to me... I was invested. Which is also why it took me a while to admit that the songs were just not very... listenable.

You see, Jukebox had the ability to bang out these jangling, howling numbers that could get stuck onto the soul. In terms of compositional “interestingness”, I think it remains better than even today’s state-of-the-art. But as a production tool it had some crude limitations. The experience could be compared to herding together a band of uber-talented musicians, who could indeed play just about anything you directed them to. But these musicians also behaved like they had been force-fed a bottle of bourbon before being recorded in a studio last used by Robert Johnson. Very rock-n-roll - true! - but sampling from the enormously-dimensioned latent-space tended to produce a lot of randomness. The music could be weird and wonderful for a moment, then flag and drag and disintegrate into the void. And even after a hundred takes, even when I thought I’d found just the right composition... the de-noising algo just couldn’t do enough.

“I was rolled into a joint on which the fate of the whole world hinged.”

Look, I love Jukebox, and I’m glad it was let out into the world before the walls went up around all such things. I am glad that it exists out in the git, in its entirety, weights and code and all. But being a v1 technology, it suffered from v1 problems that never got sorted out in a v2. I thought that was the very charm of it, and why I spent hours lost in it, but my excitement didn’t translate to the average listener. When I sent my songs around, most were amazed - astonished! - that “the machine could now do this.” But as a parlour trick and nothing more.

(After a point, I suppose it doesn’t really matter how you make something, what matters is what is made, and how that thing conveys the emotion and the essence.)

And then there were the vocals. To get the model to sing properly (after all, words were what had brought me there), I had trawled through the message boards to find the set of lyrics it had been trained on. Given the length of their catalogues, the two artists topping the corpus looked to be Cash and Dylan. My one small contribution to the Jukebox project was to flesh out a feature that had been mentioned in the paper, but not released with the code itself – to “fuse” two voices together. I found this to be a good way to hack the model into singing a large block of lyrics (a task that it often had trouble completing). I decided to mind-meld the top-two on the leaderboard into one hoarse and mournful and more-effective-than-average singing voice... and that’s how I got to the sound of Bobby John.
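The idea behind the voice-fusing hack can be sketched in a few lines. To be clear, this is a hypothetical illustration, not Jukebox’s actual API: the variable names, the embedding size, and the `fuse_voices` helper are all stand-ins for the general technique of linearly interpolating two artist-conditioning vectors.

```python
import numpy as np

# Stand-in embeddings for two singing voices. In a model like Jukebox,
# artist identity conditions generation through a learned embedding;
# the shapes and names here are illustrative only.
rng = np.random.default_rng(42)
emb_cash = rng.normal(size=64)   # pretend conditioning vector, voice A
emb_dylan = rng.normal(size=64)  # pretend conditioning vector, voice B

def fuse_voices(a, b, alpha=0.5):
    """Blend two conditioning vectors.

    alpha=0.0 returns `a` unchanged, alpha=1.0 returns `b`;
    values in between mix the two voices.
    """
    return (1.0 - alpha) * np.asarray(a) + alpha * np.asarray(b)

# An even split of the two voices - the "Bobby John" blend.
bobby_john = fuse_voices(emb_cash, emb_dylan, alpha=0.5)
```

The appeal of interpolating in embedding space is that the blend stays on (or near) the manifold of voices the model already knows, so the result tends to sound like a plausible third singer rather than two voices layered on top of each other.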

“There’s no need to learn guitar when you’re wicked with the mandolin.”

This is also how, for a few months, I became Bobby John. The process of making music with the machine hasn’t really changed since the Jukebox days: seeding, prompting, generating, chopping and changing, listening, resampling, rinsing and repeating... iterating, resampling, prompting, listening, iterating... till you’ve crawled through the latent-space in just the right direction, and nudged the machine towards the vision that sits inside your head. But if you commit to this process fully, to the volume of work, to the hours and hours of listening, it becomes impossible for your mind to not become the sound. And so I came to associate the voice of Bobby John, over which I seemed to have so much agency, with the thing that was “me”. After all, it was singing “my” songs, was it not?

Which was of course preposterous - I was neither Dylan nor Cash, I was simply stealing their life-work... and in a very different way than those two had stolen from the Hanks and the Van Ronks that had come before them. Or... maybe not preposterous at all? Because the argument on the other extreme states that in a deterministic universe no art can be attributed to an artist, because “art” and “artist” and “talent” and “effort” are all phenomena emerging from the physics and the math, and so a song cannot be written, it can at best be received.

I reject both these arguments. The second more strongly than the first because it is not an interesting take, even if it is technically correct. The first because I believe the sin of theft should be balanced against the hope this sort of tech gives to those who would otherwise not have the ability to make. In my opinion, it is better to accept that I am stealing, and be grateful for the men and the women and the machines that allow me to do so. It is better to promise that - in return - I will not take the responsibility lightly, that I will contribute and attribute, that I will avoid low-effort workslop, and - above all - always try to make from this theft something that is strange and beautiful and true.

I moved on from Jukebox because there was only so far you could go with it. A v2 was never released, and so the sound could never escape “AM radio with bad signal.” I figured if I waited long enough, the inexorable push of venture capital would bring something better my way. And sure enough, in August 2024, a friend introduced me to Suno, and that’s what I’ve been playing with since. I started off with two “proof-of-concept” albums: one that returned to the Jukebox sound, and one that experimented with a wider variety of genres.

Borges walks into a disco - Jukebox version.

Jorge’s disco - Suno version. [Spotify]

There’s a lot to love about Suno, mostly because the team has been devoted to the product. The sound is polished (enough to have a sound engineer agree to master its output), and the capabilities are getting better with every version. It takes a minute to pop out two songs, not hours and hours. It’s easy to use, and if you work with it, it does what you want it to do. There’s a lot of “how-to-Suno” out there on the internet, and I’ll add a little to that in the next section.

But for all its polish, sometimes I can’t help but wax nostalgic. It’s a push-the-button app, not a Python script. I can’t hack around in it. It takes a minute to pop out two songs, not hours and hours. And - most importantly - it’s harder to get to the madness, to the foot-stomping bacchanal that just came naturally to Jukebox. So for my third album, I decided to change my approach to making music: moving on from the uncanny valley of Bobby John, and into the electronic, into the thump and whine of the machine.

In April 2020, I was introduced to Mitsuko Uchida playing Schubert’s D-960/Sonata in B-Flat. I hadn’t listened much to Schubert before that, and I didn’t know Dame Uchida at all. But it was the right time in life to get welded to a brand new thing. The sonata made it into my work-at-home rotation, then it became a soundtrack, and finally, five years later (almost to the day!) I got to see her play it live in Carnegie Hall.

Until then, my songs had been built around the written word. I wrote the lyrics that I pushed into the algo, which generated the music next. Yes I’d learned to shape that music through prompts and hacks and seeds and such, but the word had always come first. There were a lot of people on the how-to’s who were going in the other direction, but I had never trusted myself to do that. Until that concert in the Hall, when both seed and story came to mind.

For the seed, I picked two short clips from the first movement of the sonata.

For story, I went to Franz Peter Schubert himself, dead at 31, but madly productive during his final years. D-960 was composed a few months before he died, but he had long known his end was coming. Years earlier he had written to a friend:

“I feel myself to be the most unhappy and wretched creature in the world. Imagine a man whose health will never be right again, a man whose most brilliant hopes have perished, to whom love and friendship have nothing to offer but pain, whose enthusiasm for all things beautiful is gone, and I ask you, is he not a miserable, unhappy being? Each night, on retiring to bed, I hope I may not wake again, and each morning but recalls yesterday’s grief.”

I imagined a sequence for a man I called “Peter Franz”: grappling with the shortness of his allotted time, despairing with his doctor, debating the point of art in the face of the red, descending into pathos while refusing to accept his fate, accepting his fate while dictating his last testament, being swallowed into the void and having his mind and his music shat into bits and beeps by some punk like me, demanding to live within the machine, hallucinating, then coming to his escape.

To begin, I shaped the second clip into a melody that could run through the first song of a rock-opera.

Then, I added the first words.

“There was a madness to his music that could not be explained by his math.”

Then I added a paraphrase of the letter quoted above, and went through a few thousand iterations, and six months later, I had this song (and eight others).

The full album is out on various channels including Spotify. It is less than 30 minutes long, but if you want to try out a smaller bite first, here is an EP of the 3 tracks in the middle.

(Mastered by the awesome Francesco Scauzillo)

I was watching Nouvelle Vague last week, which is littered with quotes like this one. Let’s suppose you spent the 9:13 it takes to listen to my EP, and you found nothing revolutionary in there. Instead, let’s suppose you were infuriated with my description of process: how I took 30 seconds of a god-touching sonata, mashed some buttons, and came out with a hash of synth and bass.

It’s getting too easy these days. At least with Jukebox it was harder.

Other links:

read an earlier version and learn how to pronounce SONGXYTR

spotify | youtube | apple | soundcloud | codebase