When a visitor clicks the microphone button on an Aurrus-powered website, a pipeline fires that involves three distinct AI systems, each doing a specialised job. The total round-trip — audio in to audio out — typically completes in under two seconds on a standard server connection. Here is exactly what happens at each stage.
Stage 1: Audio Capture and WebSocket Transport
The Aurrus widget uses the browser's MediaRecorder API to capture audio from the visitor's microphone. When the visitor stops speaking (or manually ends the recording), the widget sends the audio chunk to the Aurrus server over a WebSocket connection. WebSocket is used rather than a standard HTTP POST because it supports bidirectional, low-latency communication — the server can stream the synthesised audio response back through the same connection without establishing a new TCP handshake.
The raw audio arrives at the server as a binary blob — typically in WebM or Opus format depending on the browser. Before Whisper can process it, it needs to be converted to a 16kHz mono WAV file, which is Whisper's expected input format. This conversion is handled by ffmpeg, which runs as a child process on the Aurrus server and completes in milliseconds for typical voice clips.
Stage 2: Whisper Speech-to-Text
OpenAI's Whisper is a transformer-based speech recognition model that runs locally on the Aurrus server using the whisper.cpp implementation. Aurrus uses the ggml-base.en model by default — a 140MB model that achieves high accuracy on clear English speech with a transcription latency under 300ms for typical voice inputs. The model runs entirely on-device, which means visitor audio is never sent to a third-party transcription service. This is a deliberate architectural choice for privacy and latency.
Whisper returns a text transcript of what the visitor said. The transcript is cleaned to remove timestamps and filler markers before being passed to the next stage. For very short or very unclear audio, Whisper may return an empty string — the pipeline handles this gracefully by prompting the visitor to try again rather than passing an empty string to Claude.
Stage 3: Claude Response Generation via the Bridge
The cleaned transcript is passed to the Aurrus claude-cli-bridge — a local proxy service that manages communication with the Anthropic API. The bridge constructs a full prompt that includes:
- The system prompt configured in the widget owner's dashboard (business context, persona, scope)
- The conversation history for the current WebSocket session (prior turns in this visit)
- The visitor's transcribed message
Claude processes the full context and returns a text response. The bridge handles token limits, error retries, and model versioning — when Anthropic releases a new Claude version, only the bridge configuration changes; the rest of the pipeline is unaffected. For subscribed widget owners, all Claude API costs are covered by the Aurrus subscription tier. The widget owner does not need their own Anthropic API key.
Stage 4: Piper Text-to-Speech
Claude's text response is passed to Piper, an open-source neural TTS system that runs locally on the Aurrus server. Piper accepts text input on stdin and writes a WAV file to the specified output path. Aurrus uses the en_US-amy-medium voice model by default — a 63MB model that produces natural-sounding American English with low latency. The synthesis step for a typical one-to-three sentence response takes under 200ms.
The resulting WAV file is read from disk and sent back to the browser over the open WebSocket connection as a binary message. The widget's JavaScript receives the audio data, decodes it into an AudioBuffer using the Web Audio API, and plays it through the browser's audio output. The visitor hears Claude's response as natural speech within two seconds of finishing their own sentence.
Why Local Models for STT and TTS
Running Whisper and Piper locally rather than using cloud API services for transcription and synthesis was a deliberate choice driven by three factors: latency, privacy, and cost predictability. Cloud STT and TTS APIs add network round-trips that increase total pipeline latency. Local models keep the audio processing entirely on the Aurrus server, which means visitor voice data is never transmitted to a third-party service. And local models have no per-request cost, which makes the per-conversation cost of the voice widget entirely predictable regardless of volume.
Premium Voice with ElevenLabs
For widget owners who want higher-fidelity voice output, Aurrus supports ElevenLabs as an optional TTS provider configured from the dashboard. ElevenLabs voices are perceptibly more natural than Piper on extended speech but add a cloud API round-trip and per-character cost that Aurrus passes through in the premium subscription tier. The pipeline architecture is identical — only the TTS stage changes — so switching between Piper and ElevenLabs is a dashboard toggle, not a code change.