GitHub - akdeb/ElatoAI: Realtime Voice AI with 100+ Models on Arduino ESP32 with Secure Websockets and Edge Functions for AI Toys, Companions, and Devices

News

Apr 17 2026: Create a Global Devices/Toys network with Cloudflare Voice Agents and Durable Objects. Cloudflare's Workers AI provides Deepgram STT/TTS natively so all you need to bring is an LLM API Key to create a scalable, low-latency voice AI pipeline.
Apr 15 2026: You can now launch 100+ STT, LLM, and TTS voice pipelines with a FastAPI server built on Pipecat!
Mar 14 2026: Elato just launched Local AI Toys on Pi Day.🎉🎉 Your ESP32 devices can now support local AI models and voice generation with frontier Local LLMs and TTS models like Qwen, Mistral, and more with MLX. Check it out here.

Realtime AI Speech powered by 100+ Voice AI models on ESP32, with Secure WebSockets & Edge Functions for >20-minute uninterrupted conversations globally.

📽️ Demo Video

Video links: OpenAI Demo | Gemini Demo | Eleven Labs Demo | Hume AI EVI-4 Demo

🧠 Models

Deno Edge

OpenAI Realtime API
Gemini Live API
xAI Grok Voice Agent API
Eleven Labs Conversational AI Agents
Hume AI EVI-4
More speech to speech models coming soon!

Cloudflare Workers

LLM - 80+ models: OpenAI, Gemini, xAI, and more.
TTS - 10+ models: Deepgram, MeloTTS, and more.
STT - 5 models: Whisper, Deepgram, and more.

FastAPI

Select from 100+ models from the list above and use the FastAPI server to create your own voice pipeline. Check out the FastAPI server for more details.

👷‍♀️ DIY Hardware Design

📱 App Design

Control your ESP32 AI device from your phone with the ElatoAI webapp.

🌟 Full feature list

Realtime Speech-to-Speech: Instant speech conversion powered by OpenAI's Realtime API, Gemini's Live API, xAI's Grok Voice Agent API, Eleven Labs Conversational AI Agents, and Hume AI EVI-4.
Create Custom AI Agents: Create custom agents with different personalities and voices.
Customizable Voices: Choose from a variety of voices and personalities.
Secure WebSockets: Reliable, encrypted WebSocket communication.
Server VAD Turn Detection: Intelligent conversation flow handling for smooth interactions.
Opus Audio Compression: High-quality audio streaming with minimal bandwidth.
Global Edge Performance: Low latency Deno Edge Functions ensuring seamless global conversations.
ESP32 Arduino Framework: Optimized and easy-to-use hardware integration.
Conversation History: View your conversation history.
Device Management and Authentication: Register and manage your devices.
User Authentication: Secure user authentication and authorization.
Conversations with WebRTC and WebSockets: Talk to your AI over WebRTC in the NextJS webapp and over WebSockets on the ESP32.
Volume Control: Control the volume of the ESP32 speaker from the NextJS webapp.
Realtime Transcripts: The realtime transcripts of your conversations are stored in the Supabase DB.
OTA Updates: Over the Air Updates for the ESP32 firmware.
Wi-Fi Management with captive portal: Connect the ESP32 device to your Wi-Fi network or hotspot.
Factory Reset: Factory reset the ESP32 device from the NextJS webapp.
Button and Touch Support: Use the button OR touch sensor to control the ESP32 device.
No PSRAM Required: The ESP32 device does not require PSRAM to run the speech to speech AI.
OAuth for Web client: OAuth for your users to manage their AI characters and devices.
Pitch Factor: Control the pitch of the AI's voice from the NextJS webapp to create cartoon-like voices.
Tool calling: Call tools and functions from the ESP32 device to the edge Functions for a complete voice AI agent.
Tap to turn on: Tap the touchpad to turn it on from sleep.
Deploy on Cloudflare: Connect to any LLM, TTS, STT service with Cloudflare Voice Agents and Durable Objects

Project Architecture

ElatoAI consists of three main components:

Frontend Client (Next.js hosted on Vercel) - to create and talk to your AI agents and 'send' it to your ESP32 device
Edge Server Functions (Deno Edge or Cloudflare Workers) - to handle the websocket connections from the ESP32 device and the LLM Provider API calls
ESP32 IoT Client (PlatformIO/Arduino) - to open the websocket connection to the Edge Server Functions and exchange audio with the LLM Provider via the Deno edge server or Cloudflare Durable Objects.
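
The relay role of the Edge Server Functions can be sketched in a few lines of Python (the language of the project's FastAPI option). This is a minimal illustration under stated assumptions, not ElatoAI's actual code: in-memory `asyncio` queues stand in for the WebSocket connections, and `pump`, `fake_provider`, and `run_session` are hypothetical names. The fake provider simply prefixes each frame with `ai:` so the round trip is visible.

```python
import asyncio

END = None  # sentinel marking the end of a stream


async def pump(source: asyncio.Queue, sink: asyncio.Queue) -> int:
    """Forward frames from source to sink until END; return the frame count."""
    forwarded = 0
    while (frame := await source.get()) is not END:
        await sink.put(frame)
        forwarded += 1
    await sink.put(END)  # propagate end-of-stream downstream
    return forwarded


async def fake_provider(inbox: asyncio.Queue, outbox: asyncio.Queue) -> None:
    """Hypothetical stand-in for a speech-to-speech provider API."""
    while (frame := await inbox.get()) is not END:
        await outbox.put(b"ai:" + frame)
    await outbox.put(END)


async def run_session(device_frames: list[bytes]) -> list[bytes]:
    """Simulate device -> edge -> provider -> edge -> device for one session."""
    device_to_edge: asyncio.Queue = asyncio.Queue()
    edge_to_provider: asyncio.Queue = asyncio.Queue()
    provider_to_edge: asyncio.Queue = asyncio.Queue()
    edge_to_device: asyncio.Queue = asyncio.Queue()

    # The device's microphone frames, followed by end-of-stream.
    for frame in device_frames:
        device_to_edge.put_nowait(frame)
    device_to_edge.put_nowait(END)

    # The edge runs both directions of the relay concurrently.
    await asyncio.gather(
        pump(device_to_edge, edge_to_provider),
        fake_provider(edge_to_provider, provider_to_edge),
        pump(provider_to_edge, edge_to_device),
    )

    # Collect what the device would play back.
    received = []
    while (frame := edge_to_device.get_nowait()) is not END:
        received.append(frame)
    return received


if __name__ == "__main__":
    print(asyncio.run(run_session([b"hello", b"world"])))
```

In a real deployment each queue would be a network connection, and the edge function would also handle authentication and provider-specific message formats.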

🛠 Tech Stack

Component	Technology Used
Frontend	Next.js, Vercel
Backend	Supabase DB
Edge Functions	Deno Edge or Cloudflare Workers
IoT Client	PlatformIO, Arduino Framework, ESP32-S3
Audio Codec	Opus
Communication	Secure WebSockets
Libraries	ArduinoJson, WebSockets, AsyncWebServer, ESP32_Button, Arduino Audio Tools, ArduinoLibOpus

High-Level Flowchart

flowchart TD
  subgraph UserLayer
    UserInput[User Speech Input]
    UserOutput[AI Generated Speech Output]
  end
  
  UserInput --> ESP32
  ESP32[ESP32 Device] -->|WebSocket| Edge[Deno Edge / Cloudflare Workers]
  Edge -->|OpenAI API| OpenAI[OpenAI Realtime API]
  Edge -->|Gemini API| Gemini[Gemini Live API]
  Edge -->|xAI API| xAI[xAI Grok Voice Agent API]
  Edge -->|ElevenLabs API| ElevenLabs[ElevenLabs AI Agents]
  Edge -->|Hume API| Hume[Hume AI EVI-4]
  OpenAI --> Edge
  Gemini --> Edge
  xAI --> Edge
  ElevenLabs --> Edge
  Hume --> Edge
  Edge -->|WebSocket| ESP32
  ESP32 --> UserOutput

Project Structure

graph TD
  repo[ElatoAI]
  repo --> frontend[Frontend Vercel NextJS]
  repo --> server[Deno Edge Function / Cloudflare Workers]
  repo --> esp32[ESP32 Arduino Client]
  server --> supabase[Supabase DB]

  frontend --> supabase
  esp32 --> websockets[Secure WebSockets]
  esp32 --> opus[Opus Codec]
  esp32 --> audio_tools[arduino-audio-tools]
  esp32 --> libopus[arduino-libopus]
  esp32 --> ESPAsyncWebServer[ESPAsyncWebServer]

📊 Important Stats

Latency: <2-3s round-trip globally
Audio Quality: Opus codec at 12 kbps (high clarity), 24 kHz sampling rate
Uninterrupted Conversations: Up to 20 minutes continuous conversations globally
Global Availability: Optimized with edge / monolith servers
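
To put the audio figures above in perspective, the following back-of-the-envelope sketch works out what 12 kbps Opus means per frame and per conversation. The 20 ms frame duration, 16-bit mono PCM baseline, and constant bitrate are assumptions made for the arithmetic, not values stated by the project, and the function names are illustrative.

```python
OPUS_BITRATE_BPS = 12_000   # from the stats above
SAMPLE_RATE_HZ = 24_000     # from the stats above
FRAME_MS = 20               # assumption: a common Opus frame duration
PCM_BITS_PER_SAMPLE = 16    # assumption: 16-bit PCM baseline
CHANNELS = 1                # assumption: mono audio


def opus_bytes_per_frame(bitrate_bps: int = OPUS_BITRATE_BPS,
                         frame_ms: int = FRAME_MS) -> float:
    """Average encoded payload per frame at a constant bitrate."""
    return bitrate_bps * frame_ms / 1000 / 8


def pcm_bitrate_bps(sample_rate_hz: int = SAMPLE_RATE_HZ,
                    bits_per_sample: int = PCM_BITS_PER_SAMPLE,
                    channels: int = CHANNELS) -> int:
    """Bitrate of the equivalent uncompressed PCM stream."""
    return sample_rate_hz * bits_per_sample * channels


def compression_ratio() -> float:
    """How many times smaller the Opus stream is than raw PCM."""
    return pcm_bitrate_bps() / OPUS_BITRATE_BPS


def session_payload_bytes(minutes: float,
                          bitrate_bps: int = OPUS_BITRATE_BPS) -> float:
    """Audio payload for one direction of continuous audio."""
    return bitrate_bps * minutes * 60 / 8


if __name__ == "__main__":
    print(opus_bytes_per_frame())      # bytes per 20 ms frame
    print(compression_ratio())         # PCM bitrate / Opus bitrate
    print(session_payload_bytes(20))   # bytes for a 20-minute session
```

Under these assumptions each 20 ms frame carries 30 bytes, the stream is 32 times smaller than raw PCM (384 kbps), and 20 minutes of continuous audio in one direction comes to about 1.8 MB, excluding WebSocket and TLS overhead.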

🛡 Security

Secure WebSockets (WSS) for encrypted data transfers
Optional: API Key encryption with 256-bit AES (decrypted on server)
Supabase DB for secure authentication
Postgres RLS for all tables

🚫 Limitations

2-3s cold start time while connecting to edge server
Tested with up to 17 minutes of uninterrupted conversations
Edge server stops when wall clock time is exceeded
No speech interruption detection on ESP32
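
One way a client could cope with the wall clock limit is to track elapsed session time and reconnect shortly before the limit is reached. The sketch below only illustrates that idea: `SessionBudget`, its parameters, and the 60-second margin are hypothetical, and the actual limit depends on the edge platform in use.

```python
import time
from typing import Callable


class SessionBudget:
    """Tracks how much wall clock time remains in a session."""

    def __init__(self, limit_s: float, margin_s: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._limit_s = limit_s      # hypothetical platform limit
        self._margin_s = margin_s    # reconnect this long before the limit
        self._clock = clock          # injectable for testing
        self._start = clock()

    def remaining(self) -> float:
        """Seconds left before the wall clock limit."""
        return self._limit_s - (self._clock() - self._start)

    def should_reconnect(self) -> bool:
        """True once the session is within the safety margin."""
        return self.remaining() <= self._margin_s


if __name__ == "__main__":
    # Example: a 20-minute budget with a 60-second safety margin.
    budget = SessionBudget(limit_s=20 * 60, margin_s=60)
    print(budget.should_reconnect())
```

A device loop could poll `should_reconnect()` between audio frames and open a fresh connection during a pause in the conversation.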

🙌 Contributing

We value your contributions! Here are some ideas to get you started:

Speech Interruption on ESP32 (works with OpenAI)
~~Adding Arduino IDE support~~
~~Add Hume API client for emotion detection~~
Add MCP support on Deno Edge
~~Plug in Eleven Labs API for voice generation~~
Add Azure OpenAI Support (easy pickings) - in review
Add Cartesia Support
Add Amazon Nova Support
Add Deepgram
~~Add Cloudflare Workers support~~

License

This project is licensed under the MIT License - see the LICENSE file for details.

Check out our hardware offerings at ElatoAI Products. If you find this project interesting or useful, support us by starring this project on GitHub. ⭐