GitHub - robert-mcdermott/arietta-voice: A local-first framework for building customizable AI powered voice assistants on Apple Silicon Macs with local speech, knowledge & tools.



Arietta Voice is a local-first framework for building a wake-word-driven voice assistant on Apple Silicon Macs. It combines local speech-to-text, text-to-speech, turn detection, tool routing, markdown knowledge retrieval, and a web admin console into a customizable assistant runtime. It's like your own private, custom, locally hosted Alexa/Echo, but better.

The project is meant to become your assistant, not a finished general-purpose assistant out of the box. As shipped, it is mostly a working runtime and admin surface; it becomes useful when you provide the assistant persona in SOUL.md, add knowledge articles for your environment, configure the wake word, and add deterministic or model-called tools for the things you want it to do.

The default assistant persona is named Bridget, and the default wake word is also Bridget. Both are easy to change in config/arietta_voice.toml.

For a detailed customization walkthrough, see GETTING_STARTED.md.

What This Project Includes

  • local speech-to-text with Moonshine
  • local text-to-speech with Kokoro
  • local VAD and turn-end detection with Silero VAD and Smart Turn
  • default local chat with Gemma on MLX
  • optional AWS Bedrock chat integration through the same chat-model seam
  • deterministic tool routing
  • model-requested follow-up tools
  • local markdown knowledge retrieval
  • editable SOUL.md and optional memory file
  • a wake-word-first runtime designed for an always-on Mac mini
  • authenticated HTTP API and admin console for inspection, chat testing, logs, diagnostics, tools, and knowledge

Design Scope

The project is intentionally focused on a voice-first assistant runtime with a small, editable surface area:

  • the typed config system
  • the knowledge indexing and retrieval pipeline
  • deterministic tools and model-requested tools
  • the local/Bedrock LLM adapter seam
  • STT, TTS, VAD, Smart Turn, and audio barge-in handling
  • prompt assembly and optional memory consolidation
  • a local admin/API surface for operational visibility and configuration workflows

It intentionally does not include:

  • webcam presence detection
  • camera and vision tools

Setup

Install the project environment:

Recommended on macOS for Kokoro:

First run downloads model assets as needed. Depending on your config, that can include Moonshine, Kokoro, Smart Turn, and the local MLX chat model.
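The setup commands above were lost in this page extraction. Assuming the project's standard uv workflow (the repo's other commands all use `uv run`), the steps likely resemble the following; `uv sync` and the Homebrew `espeak-ng` install are assumptions, not confirmed commands from the source:

```shell
# Assumed: create the project environment with uv
uv sync

# Assumed macOS extra for Kokoro: its phonemizer commonly relies on espeak-ng
brew install espeak-ng
```

If the repository documents different setup steps, prefer those.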

Commands

Run the assistant:

uv run arietta-voice
uv run arietta-voice run

Useful variations:

uv run arietta-voice --no-wake
uv run arietta-voice --wake-word Bridget
uv run arietta-voice --wake-word Bridget --wake-word Arietta
uv run arietta-voice --session-timeout 30
uv run arietta-voice --voice bf_emma
uv run arietta-voice --chat-provider bedrock --model us.anthropic.claude-sonnet-4-20250514-v1:0
uv run arietta-voice --record
uv run arietta-voice --record tmp/session.wav

Inspect configuration:

uv run arietta-voice config

Run diagnostics:

uv run arietta-voice doctor

Run the HTTP API and admin console:

uv run arietta-voice serve

The admin console is available at http://127.0.0.1:8765/admin. The initial local credentials are admin / password; override them with ARIETTA_ADMIN_USERNAME and ARIETTA_ADMIN_PASSWORD.
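Using the environment variables named above, an override at launch might look like this (the username and password values are placeholders):

```shell
# Override the default admin credentials for the web console
ARIETTA_ADMIN_USERNAME=alice \
ARIETTA_ADMIN_PASSWORD='choose-a-real-secret' \
uv run arietta-voice serve
```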

Screenshot of the Arietta admin console's tool management panel:

The voice runtime and web console can run as separate processes. For example, you can start the voice runtime in one terminal and the admin console in another:

uv run arietta-voice
uv run arietta-voice serve

In this mode, the admin console still provides observability and editing: health, status heartbeat, logs, diagnostics, chat testing, knowledge editing, tool validation, tool source editing, and config-backed tool enablement. The voice runtime publishes a heartbeat to logs/runtime_status.json, which lets the admin console show whether the separate voice process is running, stopped, or stale.
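The staleness check can be sketched as follows. This is illustrative only: the actual schema of logs/runtime_status.json is not documented here, so the `timestamp` field name and the 30-second threshold are assumptions.

```python
import json
import time
from pathlib import Path


def runtime_state(path: str = "logs/runtime_status.json",
                  stale_after: float = 30.0) -> str:
    """Classify the voice runtime from its heartbeat file (schema assumed)."""
    p = Path(path)
    if not p.exists():
        return "stopped"
    data = json.loads(p.read_text())
    # Assumed field: a Unix timestamp written on each heartbeat.
    age = time.time() - data.get("timestamp", 0)
    return "running" if age <= stale_after else "stale"
```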

The main limitation is process ownership. If the voice runtime was started manually with uv run arietta-voice, the admin console can observe it but cannot stop or restart that terminal process. After changing tools, knowledge settings, or config used by the voice runtime, restart the manually started runtime yourself so it picks up those changes.

The admin console can also start, stop, and restart an API-managed voice runtime. Use the admin console's Start button when you want the web API to own the child voice process and make the Start, Stop, and Restart buttons fully effective. Managed runtime stdout and stderr are written to logs/managed_runtime.stdout.log and logs/managed_runtime.stderr.log.

If you use a custom config path, pass the same config to both processes:

uv run arietta-voice --config config/arietta_voice.toml
uv run arietta-voice serve --config config/arietta_voice.toml

For a headless Mac mini or other always-on Mac, use macOS launchd to start the voice runtime and admin console automatically after reboot/login. Because the voice runtime uses microphone/audio access, run it as a user LaunchAgent rather than a root LaunchDaemon. See Run Automatically on macOS With launchd for a step-by-step setup.

The Knowledge page can list, create, edit, delete, search, and re-index local knowledge files inside the configured knowledge directory.

List audio devices:

uv run arietta-voice devices

Work with the knowledge base:

uv run arietta-voice knowledge-search "what can you help with"
uv run arietta-voice knowledge-index

Run tests:
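The test command was dropped from this extraction; for a uv-managed Python project it is typically (an assumption, not confirmed by the source):

```shell
# Assumed: run the test suite through uv
uv run pytest
```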

Wake Word Behavior

The initial wake-word implementation is local and transcript-based:

  • Arietta Voice listens continuously with VAD.
  • When speech ends, it transcribes the utterance locally.
  • If the transcript starts with a configured wake phrase like Bridget, the assistant opens a short active session.
  • If the wake utterance also contains a command, like Bridget what time is it, the assistant handles that in the same turn.
  • If you only say Bridget, the assistant answers with the configured acknowledgement and keeps listening for the follow-up.

This approach keeps the system simple and fully local while reusing the existing STT/VAD pipeline. It is also deliberately extensible: the wake-word logic lives in src/arietta_voice/wake.py, so a future acoustic hotword backend can be added without rewriting the runtime.
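The matching behavior described above can be sketched in a few lines. The function name and signature here are illustrative; the real logic lives in src/arietta_voice/wake.py and may differ:

```python
def match_wake(transcript: str, phrases: list[str]) -> tuple[bool, str]:
    """Return (woke, remainder), where remainder is any command spoken
    after the wake phrase in the same utterance."""
    text = transcript.strip().lower()
    for phrase in phrases:
        p = phrase.lower()
        if text == p:
            # Bare wake word: acknowledge and keep listening.
            return True, ""
        if text.startswith(p):
            # Wake word plus a command, handled in the same turn.
            rest = text[len(p):].lstrip(" ,.!?")
            return True, rest
    return False, ""
```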

Configuration

The main config file is config/arietta_voice.toml.

The most important editable files are:

  • config/SOUL.md: the system prompt/personality
  • config/MEMORY.md: optional long-lived memory file, created automatically when memory is enabled
  • knowledge: your editable local knowledge articles

The main config sections are:

  • [assistant]: persona name, identity text, short greetings, goodbye phrases, history length
  • [wake_word]: wake phrases, acknowledgement, backend, and session timeout
  • [audio]: TTS, Smart Turn, AEC, chime, and Kokoro settings
  • [models]: local vs Bedrock chat provider, model ids, AWS settings, STT language, generation settings
  • [prompts]: SOUL.md and memory file locations
  • [memory]: optional memory behavior
  • [logging]: runtime log locations
  • [knowledge]: knowledge directory, index directory, backend, and retrieval scoring knobs
  • [tools]: deterministic tools checked before the model
  • [model_tools]: model-requested follow-up tools

Relative paths in the TOML are resolved relative to the config file, not the current shell directory.
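That resolution rule amounts to the following (an illustrative sketch, not the runtime's actual code):

```python
from pathlib import Path


def resolve_relative(config_path: str, value: str) -> Path:
    """Resolve a TOML path value against the config file's directory,
    leaving absolute paths untouched."""
    p = Path(value)
    if p.is_absolute():
        return p
    return (Path(config_path).parent / p).resolve()
```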

Customizing The Assistant

Change the assistant name and wake word

Edit config/arietta_voice.toml:

[assistant]
name = "Arietta"

[wake_word]
phrases = ["arietta"]

Change the system prompt

Edit config/SOUL.md. Keep it focused and durable. Project-specific facts should usually live in knowledge/*.md, not in the soul file.

Add knowledge

Add markdown files to knowledge. A template is included at knowledge/_template.md.

Add a deterministic tool

Copy src/arietta_voice/tools/tool_template.py to a new module in the same directory, implement maybe_handle(...), then add the module name to [tools].enabled.

Use deterministic tools when:

  • the answer should be exact
  • routing should be explicit
  • the tool itself should produce the final answer

The built-in local_time tool is the canonical example.

Add a model-requested tool

Copy src/arietta_voice/model_tools/tool_template.py to a new module in the same directory, implement invoke(...), then add the module name to [model_tools].enabled.

Use model-requested tools when:

  • the model should decide if the lookup is necessary
  • the tool gathers supporting facts instead of speaking directly
  • you want one grounded follow-up answer after tool execution

This is the natural path for future home-automation actions and richer environment lookups.
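As a sketch of the invoke(...) shape (the real contract in model_tools/tool_template.py may differ; the argument names and sensor data here are hypothetical):

```python
def invoke(arguments: dict) -> dict:
    """Gather supporting facts for the model; the model, not the tool,
    phrases the final grounded answer."""
    room = arguments.get("room", "office")
    # A real tool would query a sensor or home-automation API here.
    readings = {"office": 21.5, "kitchen": 23.0}
    return {"room": room, "temperature_c": readings.get(room)}
```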

Bedrock Integration

Local chat is the default:

[models]
chat_provider = "local"
chat_model = "mlx-community/gemma-4-E4B-it-4bit"

To switch chat turns to Bedrock:

[models]
chat_provider = "bedrock"
chat_model = "us.anthropic.claude-sonnet-4-20250514-v1:0"
bedrock_region = "us-west-2"
bedrock_profile = "default"

The rest of the runtime stays the same. STT, TTS, tools, wake word handling, and knowledge retrieval remain local.

Recommended Dedicated Mac Mini Setup

For an always-on voice assistant box, the intended target is a dedicated Apple Silicon Mac mini with:

  • a reliable USB microphone or speakerphone
  • a good near-field speaker
  • a quiet location with stable power
  • local knowledge and tool configuration committed alongside the project

The current repo is a strong foundation for a home assistant or office assistant. Home automation should be added through tools, not by hardcoding behaviors into the runtime.

Project Layout

.
├── config/                     # Configuration files, SOUL definitions, and memory examples
├── knowledge/                  # User-editable Markdown knowledge base
├── src/
│   └── arietta_voice/
│       ├── tools/              # Deterministic tools (directly invoked)
│       ├── model_tools/        # Tools invoked by model reasoning
│       └── ...                 # Core runtime, models, audio, and knowledge handling
├── tests/                      # Unit tests for config, tools, knowledge, wake word, and model tooling

Notes

  • Licensed under the Apache License 2.0.
  • The default wake-word backend is transcript-based, not an acoustic hotword model.
  • The local chat path is optimized for Apple Silicon and MLX.
  • knowledge-index is only required when you use the semantic or hybrid backend.
  • backend = "keyword" is the simplest starting point for a user-customized deployment.
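Putting the last two notes together, a starting [knowledge] fragment might look like this; the `backend` key is named in the source, but any other keys you add there are whatever the config schema actually defines:

```toml
[knowledge]
# Simplest starting point: no knowledge-index step required.
backend = "keyword"
# Switch to "semantic" or "hybrid" later, then run:
#   uv run arietta-voice knowledge-index
```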