
Cartoon Studio

Create your own 2D animated cartoon show.

An open-source desktop studio — script in, MP4 out. Bring your own API keys; no telemetry, no cloud.


Quick start · Three screens · How it works · Built on · API keys

Cartoon Studio preview

Quick start

Heads up — no pre-built binaries yet. Code signing and distribution are still being sorted out. For now, run from source. Works on macOS, Windows, and Linux.

You only need Node ≥ 24 (the repo pins 24 via .nvmrc — run nvm use to match CI). ffmpeg and Chrome are bundled via postinstall.

```shell
git clone https://github.com/Jellypod-Inc/cartoon-studio.git
cd cartoon-studio
npm install      # also downloads chrome-headless-shell (~190 MB)
npm start
```

npm start boots a Vite dev server for the renderer and an Electron process for main + preload, with HMR on the renderer. The postinstall script downloads chrome-headless-shell once for your host platform; rerun npm run setup:chrome if it gets stale or you switch architectures.

Three screens, three jobs

  • Stage — drag SVG characters onto a scene. Library on the left, 1920×1080 canvas on the right. Move, resize from any corner, Delete to remove.
  • Dialogue — write the script. Per-line speaker, voice (87 voices across 8 providers), and text. Press ▶ to generate a fresh take and play it inline. Press ✦ Write with AI to author the whole script from a premise.
  • Play — final-cut preview inside a film cabinet. Scrub, play, restart, render to MP4.

How it works

```mermaid
flowchart TD
    A([Script]) --> B[TTS · word timestamps]
    D([SVG character · 9-shape mouth rig]) --> E
    B --> C[Mouth cues from vowels]
    C --> E[Composition HTML]
    E --> F[HyperFrames render]
    F --> G([MP4])
```

What runs through an LLM and what doesn't:

  • LLM — script authoring (OpenAI), vision-based mouth detection on uploaded SVGs (OpenAI vision).
  • Not LLM — TTS, word-level alignment, mouth-cue generation (deterministic vowel heuristic), placement, composition, render. Output is reproducible from the same script + voices + characters.
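The deterministic vowel heuristic above can be sketched in a few lines. Everything here is illustrative — the type and function names (`WordTiming`, `MouthCue`, `cuesFromWords`) and the exact vowel-to-shape table are hypothetical, not the app's actual API:

```typescript
// Hypothetical sketch of a vowel-per-word mouth-cue heuristic.
// One open-mouth cue at each word's onset, a closed mouth in the gap after it.

type MouthShape = "X" | "A" | "B" | "C" | "D" | "E" | "F" | "G" | "H";

interface WordTiming { text: string; start: number; end: number } // seconds
interface MouthCue { shape: MouthShape; t: number }

// Assumed mapping from a word's first vowel to an open Preston Blair shape.
const VOWEL_SHAPES: Record<string, MouthShape> = {
  a: "A", e: "B", i: "C", o: "D", u: "E",
};

function cuesFromWords(words: WordTiming[]): MouthCue[] {
  const cues: MouthCue[] = [];
  for (const w of words) {
    const vowel = w.text.toLowerCase().match(/[aeiou]/)?.[0];
    const open = (vowel && VOWEL_SHAPES[vowel]) || "C"; // neutral open fallback
    cues.push({ shape: open, t: w.start }); // open at word onset
    cues.push({ shape: "X", t: w.end });    // close after the word
  }
  return cues;
}
```

Because this only consumes word timestamps, editing a line of dialogue regenerates cues instantly — no audio re-analysis needed, which is the trade-off the table below describes versus true phoneme-driven sync.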

Built on

| Layer | Project | What it gives us |
| --- | --- | --- |
| Speech | Speech SDK | One `generateSpeech()` call across 13 providers with native or Whisper-derived word timestamps. Built and maintained by Jellypod. The reason Cartoon Studio can swap voice providers from a dropdown. |
| Render | HyperFrames | HTML composition → MP4 via headless Chrome + GSAP timelines + ffmpeg. We hand it a paused timeline with `.set()` calls per mouth cue; HyperFrames seeks frame by frame and muxes the audio. |
| SVG art | Recraft V4 | Text → flat construction-paper SVG characters and scenes. We hint a magenta background plate so it can be reliably stripped, then auto-rig the mouth via a geometric heuristic (or a vision-LLM fallback for unusual faces). |
| Lip-sync catalog | Preston Blair · Rhubarb | The 9 mouth shapes (X closed · A–H open variants) Rhubarb popularized. We don't bundle Rhubarb — cues come from word timestamps via a vowel-per-word heuristic, which is faster and regenerates instantly when the text changes. Rhubarb itself is excellent for true phoneme-driven sync. |
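The HyperFrames row mentions handing over a paused timeline with `.set()` calls per mouth cue. A minimal sketch of that composition step, with a plain data structure standing in for a GSAP timeline (all names here — `SetCall`, `buildTimeline`, the `#<id>-mouth` selector convention — are assumptions for illustration):

```typescript
// Illustrative only: turn mouth cues into ordered timeline "set" entries
// that a frame-by-frame renderer could replay. Each SetCall mirrors what
// a GSAP tl.set(target, vars, position) call would carry.

interface MouthCue { shape: string; t: number } // seconds
interface SetCall { target: string; shape: string; at: number }

function buildTimeline(characterId: string, cues: MouthCue[]): SetCall[] {
  return cues
    .slice()
    .sort((a, b) => a.t - b.t) // deterministic order for seeking
    .map((c) => ({ target: `#${characterId}-mouth`, shape: c.shape, at: c.t }));
}

// What the renderer resolves when it seeks the paused timeline to a given time.
function shapeAt(calls: SetCall[], time: number): string {
  let current = "X"; // mouth closed before any cue fires
  for (const c of calls) if (c.at <= time) current = c.shape;
  return current;
}
```

Because the timeline is paused and purely declarative, seeking to frame `n / fps` is reproducible — the same script, voices, and characters yield the same MP4, as noted above.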

Bring your own API keys

Every key lives encrypted in your OS keyring — macOS Keychain · Windows DPAPI · Linux GNOME Libsecret / KWallet — via Electron's safeStorage. The settings file holds only ciphertext, with file mode 0600 on Unix. The renderer process never sees plaintext keys, only an "is set" boolean map. Audit all of this in Settings → SC.SEC · Key Vault.
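The "is set" boolean map is the only key-related data that crosses the IPC boundary. A sketch of that boundary, assuming a hypothetical vault of ciphertext entries (in the real app the ciphertext comes from Electron's `safeStorage.encryptString`; here a `Buffer` stands in, and `isSetMap` is an invented name):

```typescript
// Hypothetical sketch: the main process keeps ciphertext only and exposes
// a boolean map to the renderer — never plaintext.

type KeyName = "OPENAI_API_KEY" | "FAL_API_KEY" | "ELEVENLABS_API_KEY";

interface VaultEntry { ciphertext: Buffer } // safeStorage.encryptString output in the real app

function isSetMap(vault: Partial<Record<KeyName, VaultEntry>>): Record<KeyName, boolean> {
  const keys: KeyName[] = ["OPENAI_API_KEY", "FAL_API_KEY", "ELEVENLABS_API_KEY"];
  const map = {} as Record<KeyName, boolean>;
  for (const k of keys) map[k] = (vault[k]?.ciphertext.length ?? 0) > 0;
  return map; // safe to send over IPC: contains no secret material
}
```

Decryption happens only in the main process, at the moment a TTS or LLM request actually needs the key.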

The app boots fine with no keys. Every key is optional; nothing is gated. If a key isn't set when an action needs it, you get a single editorial toast with a one-click "Open Settings" shortcut.

Dialogue & scenes

| Key | Used for |
| --- | --- |
| `OPENAI_API_KEY` | Script authoring · vision-based mouth detection on uploaded SVGs · OpenAI TTS · Whisper word alignment for any provider without native timestamps. |
| `FAL_API_KEY` | Recraft V4 character + scene generation. |

Text-to-speech

| Key | Provider · what it's good for |
| --- | --- |
| `ELEVENLABS_API_KEY` | ElevenLabs — best voice quality, native word-level timestamps. |
| `GOOGLE_API_KEY` | Google Gemini 3.1 Flash TTS — 30 expressive voices. |
| `CARTESIA_API_KEY` | Cartesia Sonic-3 — ultra-low latency, expressive character voices. |
| `DEEPGRAM_API_KEY` | Deepgram Aura-2 — natural conversational voices. |
| `HUME_API_KEY` | Hume Octave-2 — emotionally intelligent, prompt-steerable. |
| `FISH_AUDIO_API_KEY` | Fish Audio S2-Pro — paste reference IDs from fish.audio. |
| `INWORLD_API_KEY` | Inworld TTS-1.5-Max — character voices built for game/agent NPCs. |

Build a binary for your machine

Build with Electron Forge's make command (typically `npm run make`): a Squirrel installer on Windows · a `.zip` containing the `.app` on macOS · `.deb` / `.rpm` on Linux. Output lands in `out/`.

Signing. Builds are unsigned — on macOS you'll hit Gatekeeper ("damaged / can't be opened") and possibly Keychain access issues since the Fuses plugin invalidates the ad-hoc signature. On Windows you'll see SmartScreen. These go away once we ship signed builds via an Apple Developer ID + Authenticode certificate.

Project layout

```
src/
  main/        Electron main — settings, IPC, TTS, render, composition
  preload/     contextBridge exposing IpcApi to the renderer
  renderer/    Vanilla TS + Tailwind 4 — screens, modals, design system
    screens/   Stage / Dialogue / Play
  shared/      Cross-process types
resources/
  defaults/    Bundled characters (Bill · Ted · Jane · Max) + scenes
scripts/       CLI tools (e.g. `npm run rig` for SVG mouth-rigging)
docs/          Repo media
```

Forge is configured to:

  • Bundle resources/defaults/ as extraResource so default characters & scenes ship with the app.
  • Unpack node_modules/hyperframes/** from the asar so the render child process can resolve the CLI entry.
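A hedged sketch of what those two settings could look like in a `forge.config.ts` — the exact shape depends on the repo, but `packagerConfig.extraResource` and `packagerConfig.asar.unpack` are the standard Electron Forge / `@electron/packager` options for this:

```typescript
// Sketch only — field names follow Electron Forge's packagerConfig,
// which forwards these options to @electron/packager.
const config = {
  packagerConfig: {
    extraResource: ["resources/defaults"],               // ship default characters & scenes
    asar: { unpack: "**/node_modules/hyperframes/**" },  // keep the CLI entry on disk
  },
};
// In a real forge.config.ts this would be `export default config;`
```

Unpacking matters because a child process spawned by the main process cannot execute a binary or CLI entry that lives inside the asar archive.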