Providers

Every input, STT, LLM, TTS, and output provider — supported models, transport, features, and configuration.

CompositeVoice uses five pipeline roles — input (audio capture), STT (speech-to-text), LLM (large language model), TTS (text-to-speech), and output (audio playback). Mix and match any combination to build your voice pipeline. Some providers cover multiple roles (e.g., NativeSTT handles both input and stt).
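
This mix-and-match rule can be sketched as a small helper: each provider declares which of the five slots it fills, and a pipeline is complete when every slot is covered. The `Role` type and `missingRoles` function below are illustrative, not part of the library's API:

```typescript
type Role = 'input' | 'stt' | 'llm' | 'tts' | 'output';

interface ProviderLike {
  name: string;
  roles: Role[];
}

// Return the pipeline slots not yet covered by the given providers.
function missingRoles(providers: ProviderLike[]): Role[] {
  const all: Role[] = ['input', 'stt', 'llm', 'tts', 'output'];
  const covered = new Set(providers.flatMap((p) => p.roles));
  return all.filter((role) => !covered.has(role));
}

// NativeSTT covers both input and stt, so three slots remain.
const nativeStt: ProviderLike = { name: 'NativeSTT', roles: ['input', 'stt'] };
missingRoles([nativeStt]); // → ['llm', 'tts', 'output']
```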

Audio Input

| Provider | Environment | Roles | Description |
| --- | --- | --- | --- |
| MicrophoneInput | Browser | input | Wraps getUserMedia + AudioContext for microphone capture |
| BufferInput | Node/Bun/Deno | input | Accepts pushed ArrayBuffer data for server-side pipelines |
| NativeSTT | Browser | input + stt | Browser’s Web Speech API manages its own microphone internally |

MicrophoneInput

Captures audio from the browser’s microphone via getUserMedia and AudioContext. Use this when pairing with a WebSocket-based STT provider like DeepgramSTT or AssemblyAISTT.

import { MicrophoneInput } from '@lukeocodes/composite-voice';

const input = new MicrophoneInput({
  sampleRate: 16000,        // audio sample rate in Hz
});
  • Buffers audio frames in the input queue during STT connection — no audio is ever lost
  • Works in all modern browsers that support getUserMedia
  • Requires HTTPS or localhost

BufferInput

Accepts audio data pushed programmatically. Use this for server-side pipelines (Node.js, Bun, Deno) where there is no microphone.

import { BufferInput } from '@lukeocodes/composite-voice';

const input = new BufferInput({
  sampleRate: 16000,
  encoding: 'linear16',
  channels: 1,
  bitDepth: 16,
});

// Push audio from any source (file, stream, WebSocket, etc.)
input.push(audioBuffer);
  • Zero browser dependencies — no navigator, window, or AudioContext
  • Works in Node.js, Bun, and Deno
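
The settings above fix the byte rate of the pushed audio, which is handy when slicing a file or stream into frames before calling `push`. A small sketch of the arithmetic (the helper is not part of the library):

```typescript
// Bytes needed for one frame of `frameMs` milliseconds of raw PCM,
// given sample rate, channel count, and bit depth (linear16 = 2 bytes
// per sample per channel).
function frameBytes(
  sampleRate: number,
  channels: number,
  bitDepth: number,
  frameMs: number
): number {
  return (sampleRate * channels * (bitDepth / 8) * frameMs) / 1000;
}

// The BufferInput config above (16 kHz, mono, 16-bit):
frameBytes(16000, 1, 16, 20); // → 640 bytes per 20 ms frame
```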

Speech-to-Text (STT)

| Provider | Transport | Models | Interim Results | Preflight |
| --- | --- | --- | --- | --- |
| NativeSTT | Browser API | Browser default | Yes | No |
| DeepgramSTT | WebSocket | V1: nova-3, nova-2 | Yes | No |
| DeepgramFlux | WebSocket | V2: flux-general-en | Yes | Yes |
| AssemblyAISTT | WebSocket | Default model | Yes | No |
| ElevenLabsSTT | WebSocket | scribe_v2_realtime | Yes | No |

NativeSTT

Uses the browser’s built-in Web Speech API. Zero API keys required. Best for prototyping and demos.

import { NativeSTT } from '@lukeocodes/composite-voice';

const stt = new NativeSTT({
  language: 'en-US',        // BCP 47 language tag
  continuous: true,          // keep listening after each result
  interimResults: true,      // emit partial transcripts
  maxAlternatives: 1,        // number of recognition alternatives
});
  • No API key needed
  • Works offline
  • Supports 50+ languages via the browser
  • Managed audio — the browser controls the microphone directly
  • Does not work in de-Googled browsers (Ungoogled Chromium, Brave) — the Web Speech API requires Google’s speech servers
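
Because the Web Speech API is unevenly supported, it can help to probe for it before constructing NativeSTT and fall back to a cloud STT provider otherwise. A minimal check (the helper name is illustrative; note that mere presence of the API does not guarantee it works, as the de-Googled-browser caveat above shows):

```typescript
// True when the given global object exposes the Web Speech API that
// NativeSTT wraps ('webkit'-prefixed in Chromium-based browsers).
function supportsNativeSTT(g: object = globalThis): boolean {
  return 'SpeechRecognition' in g || 'webkitSpeechRecognition' in g;
}

// In a browser, call supportsNativeSTT() before `new NativeSTT(...)`.
// In Node/Bun/Deno it returns false, so pick a WebSocket provider.
```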

API reference

DeepgramSTT

Production-grade real-time speech recognition via WebSocket using Deepgram’s V1 (Nova) API. Best accuracy across the widest range of languages.

import { DeepgramSTT } from '@lukeocodes/composite-voice';

const stt = new DeepgramSTT({
  proxyUrl: '/api/proxy/deepgram',   // server proxy (recommended)
  // OR: apiKey: 'dg-...',           // direct API key (dev only)
  language: 'en',
  interimResults: true,
  options: {
    model: 'nova-3',          // nova-3 (recommended), nova-2, nova-3-medical
    smartFormat: true,         // auto-punctuation and formatting
    punctuation: true,
    profanityFilter: false,
    diarize: false,            // speaker identification
    endpointing: 300,          // ms of silence before end-of-speech
    utteranceEndMs: 1000,      // ms before utterance boundary
  },
});
  • nova-3 (highest accuracy, recommended default), nova-2 (wider language support)
  • Word-level confidence and timestamps
  • Smart formatting and auto-punctuation
  • Profanity filtering
  • Speaker diarization
  • VAD events

Does not support preflight/eager end-of-turn signals. For the eager LLM pipeline, use DeepgramFlux.

API reference

DeepgramFlux

Low-latency real-time speech recognition via WebSocket using Deepgram’s V2 (Flux) API. Supports eager end-of-turn signals for the eager LLM pipeline.

import { DeepgramFlux } from '@lukeocodes/composite-voice';

const stt = new DeepgramFlux({
  proxyUrl: '/api/proxy/deepgram',   // server proxy (recommended)
  // OR: apiKey: 'dg-...',           // direct API key (dev only)
  options: {
    model: 'flux-general-en',
    eagerEotThreshold: 0.5,    // enables eager end-of-turn signals
    eotThreshold: 0.7,
  },
});
  • Turn-based transcription via TurnInfo events
  • Eager end-of-turn signals (EagerEndOfTurn with isPreflight: true)
  • Configurable end-of-turn confidence thresholds
  • Keyterm boosting for domain vocabulary
  • Only STT provider that supports the eager LLM pipeline
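
The two thresholds in the config above form a band: confidence at or above eotThreshold commits the turn, while confidence between eagerEotThreshold and eotThreshold triggers the eager (preflight) signal. A conceptual sketch of that mapping, not the provider's internal code:

```typescript
type TurnSignal = 'continue' | 'eager-end-of-turn' | 'end-of-turn';

// Map Flux's end-of-turn confidence onto the two configured thresholds.
function classifyTurn(
  confidence: number,
  eagerEotThreshold: number, // e.g. 0.5: start the LLM speculatively
  eotThreshold: number       // e.g. 0.7: commit the turn
): TurnSignal {
  if (confidence >= eotThreshold) return 'end-of-turn';
  if (confidence >= eagerEotThreshold) return 'eager-end-of-turn';
  return 'continue';
}

classifyTurn(0.6, 0.5, 0.7); // → 'eager-end-of-turn'
```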

API reference

AssemblyAISTT

Real-time speech recognition via WebSocket with word boosting for domain-specific vocabulary.

import { AssemblyAISTT } from '@lukeocodes/composite-voice';

const stt = new AssemblyAISTT({
  proxyUrl: '/api/proxy/assemblyai',
  // OR: apiKey: '...',
  sampleRate: 16000,
  language: 'en',
  wordBoost: ['CompositeVoice', 'WebSocket'],  // boost domain terms
});
  • Word boosting for domain vocabulary
  • Word-level timestamps and confidence
  • Automatic reconnection

API reference

ElevenLabsSTT

Real-time speech recognition via WebSocket using ElevenLabs Scribe V2 with ~150ms latency and 90+ language support.

import { ElevenLabsSTT } from '@lukeocodes/composite-voice';

const stt = new ElevenLabsSTT({
  proxyUrl: '/api/proxy/elevenlabs',
  // OR: apiKey: '...',
  // OR: token: '...',             // single-use token
  model: 'scribe_v2_realtime',
  audioFormat: 'pcm_16000',
  language: 'en',                  // BCP 47, ISO 639-1, or ISO 639-3
  commitStrategy: 'vad',           // 'vad' (default) or 'manual'
  includeTimestamps: true,         // word-level timestamps
});
  • VAD and manual commit strategies
  • 90+ languages with auto-detection
  • Word-level timestamps and confidence
  • Three auth methods (API key, proxy, single-use token)
  • Shares proxy config with ElevenLabsTTS

API reference


Large Language Models (LLM)

| Provider | Base | Default Model | Streaming |
| --- | --- | --- | --- |
| AnthropicLLM | Custom | claude-haiku-4-5 | Yes |
| OpenAILLM | OpenAI-compatible | (required) | Yes |
| GroqLLM | OpenAI-compatible | llama-3.3-70b-versatile | Yes |
| MistralLLM | OpenAI-compatible | mistral-small-latest | Yes |
| GeminiLLM | OpenAI-compatible | gemini-2.0-flash | Yes |
| WebLLMLLM | Custom | (required) | Yes |
| OpenAICompatibleLLM | OpenAI-compatible | (required) | Yes |

AnthropicLLM

Claude models via the Anthropic API. Uses a dedicated SDK (not OpenAI-compatible).

import { AnthropicLLM } from '@lukeocodes/composite-voice';

const llm = new AnthropicLLM({
  proxyUrl: '/api/proxy/anthropic',
  model: 'claude-haiku-4-5',    // claude-haiku-4-5, claude-sonnet-4-6, claude-opus-4-6
  maxTokens: 1024,               // required (default: 1024)
});
  • System prompts at top level (Anthropic API convention)
  • Streaming via SSE
  • AbortSignal cancellation for the eager pipeline
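
The cancellation hook is the standard AbortController pattern: give each assistant turn its own controller and abort the previous one when the user barges in. A minimal sketch, assuming only that the provider accepts an AbortSignal:

```typescript
// One controller per in-flight response; starting a new turn
// cancels the previous one.
let current: AbortController | null = null;

function startTurn(): AbortSignal {
  current?.abort();              // cancel any stale streaming response
  current = new AbortController();
  return current.signal;         // hand this signal to the LLM request
}

const firstTurn = startTurn();
const secondTurn = startTurn(); // user barged in mid-response
// firstTurn.aborted is now true; secondTurn.aborted is false.
```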

API reference

OpenAILLM

GPT models via the OpenAI API.

import { OpenAILLM } from '@lukeocodes/composite-voice';

const llm = new OpenAILLM({
  proxyUrl: '/api/proxy/openai',
  model: 'gpt-4o-mini',
  // organizationId: 'org-...',  // for multi-org accounts
});

API reference

GroqLLM

Ultra-fast inference on Groq’s LPU hardware. Supports open-source models.

import { GroqLLM } from '@lukeocodes/composite-voice';

const llm = new GroqLLM({
  proxyUrl: '/api/proxy/groq',
  model: 'llama-3.3-70b-versatile',  // or mixtral-8x7b-32768, gemma2-9b-it
});
  • Lowest latency of any cloud LLM provider
  • Wide range of open-source models

API reference

MistralLLM

Mistral models with strong multilingual support.

import { MistralLLM } from '@lukeocodes/composite-voice';

const llm = new MistralLLM({
  proxyUrl: '/api/proxy/mistral',
  model: 'mistral-small-latest',  // or mistral-medium-latest, mistral-large-latest
});

API reference

GeminiLLM

Google Gemini models via their OpenAI-compatible endpoint.

import { GeminiLLM } from '@lukeocodes/composite-voice';

const llm = new GeminiLLM({
  proxyUrl: '/api/proxy/gemini',
  model: 'gemini-2.0-flash',  // or gemini-1.5-pro, gemini-1.5-flash
});

API reference

WebLLMLLM

Run LLMs entirely in the browser via WebGPU. No API keys, no network, full privacy.

import { WebLLMLLM } from '@lukeocodes/composite-voice';

const llm = new WebLLMLLM({
  model: 'Llama-3.2-1B-Instruct-q4f16_1-MLC',
  onLoadProgress: (progress) => {
    console.log(`Loading: ${(progress.progress * 100).toFixed(0)}%`);
  },
});
  • All data stays in the browser
  • Works offline after initial model download
  • Requires a WebGPU-capable browser
  • First load downloads model weights (100+ MB)
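
Since the first load downloads large model weights, it is worth probing for WebGPU before constructing the provider. A minimal capability check (the helper is illustrative; it accepts the global object so it can be tested anywhere):

```typescript
// True when the runtime exposes WebGPU, which WebLLMLLM requires.
function hasWebGPU(g: { navigator?: object } = globalThis): boolean {
  return g.navigator !== undefined && 'gpu' in g.navigator;
}

// If this returns false, fall back to a cloud LLM provider instead
// of paying for a model download that cannot run.
```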

API reference

OpenAICompatibleLLM

Base class for any service that speaks the OpenAI chat completions format. Use this to connect custom or self-hosted models.

import { OpenAICompatibleLLM } from '@lukeocodes/composite-voice';

const llm = new OpenAICompatibleLLM({
  endpoint: 'https://my-model-server.example.com/v1',
  model: 'my-custom-model',
  apiKey: '...',
});

API reference


Text-to-Speech (TTS)

| Provider | Transport | Voices | Streaming | Audio Format |
| --- | --- | --- | --- | --- |
| NativeTTS | Browser API | System voices | No (managed) | N/A |
| DeepgramTTS | WebSocket | Aura 2 (7 voices) | Yes | linear16, mulaw, alaw |
| OpenAITTS | REST | 6 voices | No | mp3, opus, aac, flac, wav |
| ElevenLabsTTS | WebSocket | Custom voice IDs | Yes | pcm, mp3, ulaw |
| CartesiaTTS | WebSocket | Custom voice IDs | Yes | pcm (s16le, f32le, mulaw, alaw) |

NativeTTS

Uses the browser’s built-in SpeechSynthesis API. Zero API keys required.

import { NativeTTS } from '@lukeocodes/composite-voice';

const tts = new NativeTTS({
  voiceName: 'Samantha',    // partial match against available voices
  voiceLang: 'en-US',       // BCP 47 fallback filter
  rate: 1.0,                // speech rate
  pitch: 0,                 // semitones (-20 to 20)
});
  • No API key needed
  • Works offline
  • Managed audio — the browser plays directly
  • Supports pause, resume, and cancel
  • Voice enumeration via getAvailableVoices()
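
The partial-match behavior of voiceName can be sketched as follows. This is an illustrative reimplementation of the selection rule, not the provider's actual code:

```typescript
interface Voice {
  name: string;
  lang: string;
}

// Prefer a partial name match; fall back to the first voice with a
// matching language, then to the first available voice.
function pickVoice(
  voices: Voice[],
  voiceName?: string,
  voiceLang?: string
): Voice | undefined {
  if (voiceName) {
    const byName = voices.find((v) => v.name.includes(voiceName));
    if (byName) return byName;
  }
  if (voiceLang) {
    return voices.find((v) => v.lang === voiceLang);
  }
  return voices[0];
}
```

In a browser, the voices array would come from `getAvailableVoices()`.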

API reference

DeepgramTTS

Low-latency real-time streaming TTS via WebSocket with Aura 2 voices.

import { DeepgramTTS } from '@lukeocodes/composite-voice';

const tts = new DeepgramTTS({
  proxyUrl: '/api/proxy/deepgram',
  voice: 'aura-2-thalia-en',    // thalia, andromeda, janus, proteus, orion, luna, arcas
  sampleRate: 24000,
  outputFormat: 'linear16',
});
  • Lowest latency streaming TTS
  • Word-level timing metadata
  • Aura 2 voice models

API reference

OpenAITTS

OpenAI text-to-speech via REST. Returns complete audio in one request.

import { OpenAITTS } from '@lukeocodes/composite-voice';

const tts = new OpenAITTS({
  proxyUrl: '/api/proxy/openai',
  model: 'tts-1',          // tts-1 (fast) or tts-1-hd (quality)
  voice: 'nova',           // alloy, echo, fable, onyx, nova, shimmer
  responseFormat: 'mp3',   // mp3, opus, aac, flac, wav
  speed: 1.0,              // 0.25 to 4.0
});
  • Six distinct voices
  • Quality/speed tradeoff via model selection
  • 4096 character limit per request
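
The 4096-character limit means long responses must be split across requests before synthesis. A sketch of a splitter that prefers sentence boundaries (the helper is not part of the library):

```typescript
// Split text into chunks no longer than `limit` characters,
// breaking at the last sentence end inside each window when possible.
function chunkForTTS(text: string, limit = 4096): string[] {
  const chunks: string[] = [];
  let rest = text.trim();
  while (rest.length > limit) {
    const window = rest.slice(0, limit);
    const cut = Math.max(
      window.lastIndexOf('. '),
      window.lastIndexOf('! '),
      window.lastIndexOf('? ')
    );
    const at = cut > 0 ? cut + 1 : limit; // no boundary: hard cut
    chunks.push(rest.slice(0, at).trim());
    rest = rest.slice(at).trim();
  }
  if (rest) chunks.push(rest);
  return chunks;
}
```

Each chunk can then be sent as a separate OpenAITTS request and played back in order.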

API reference

ElevenLabsTTS

High-quality voice cloning and synthesis via WebSocket streaming.

import { ElevenLabsTTS } from '@lukeocodes/composite-voice';

const tts = new ElevenLabsTTS({
  proxyUrl: '/api/proxy/elevenlabs',
  voiceId: 'your-voice-id',           // from ElevenLabs dashboard
  modelId: 'eleven_turbo_v2_5',       // turbo_v2_5, turbo_v2, multilingual_v2
  stability: 0.5,                      // voice consistency (0-1)
  similarityBoost: 0.75,              // voice fidelity (0-1)
  outputFormat: 'pcm_16000',          // pcm_16000, pcm_22050, pcm_24000, mp3_44100_128
});
  • Voice cloning
  • Multilingual models
  • Stability and similarity controls
  • Multiple output formats

API reference

CartesiaTTS

Ultra-low-latency streaming TTS with emotion controls.

import { CartesiaTTS } from '@lukeocodes/composite-voice';

const tts = new CartesiaTTS({
  proxyUrl: '/api/proxy/cartesia',
  voiceId: 'your-voice-id',
  modelId: 'sonic-2',           // sonic-2 (latest), sonic, sonic-multilingual
  language: 'en',
  outputEncoding: 'pcm_s16le',
  outputSampleRate: 16000,
  speed: 'normal',              // or 'slow', 'fast'
  emotion: ['positivity:high'], // emotion tags
});
  • Context-based streaming links chunks into coherent utterances
  • Emotion controls
  • Word-level timestamps
  • sonic-2 model delivers the lowest latency

API reference


Audio Output

| Provider | Environment | Roles | Description |
| --- | --- | --- | --- |
| BrowserAudioOutput | Browser | output | Wraps AudioContext for speaker playback |
| NullOutput | Node/Bun/Deno | output | Silently discards audio for server-side pipelines |
| NativeTTS | Browser | tts + output | Browser’s SpeechSynthesis API manages its own speaker output |

BrowserAudioOutput

Plays audio through the browser’s AudioContext and speakers. Use this when pairing with a WebSocket-based or REST-based TTS provider like DeepgramTTS, ElevenLabsTTS, or OpenAITTS.

import { BrowserAudioOutput } from '@lukeocodes/composite-voice';

const output = new BrowserAudioOutput();
  • Handles AudioContext resumption after user gestures
  • Buffers audio frames in the output queue during setup — no audio is ever lost

NullOutput

Silently discards all audio. Use this for server-side pipelines where there are no speakers.

import { NullOutput } from '@lukeocodes/composite-voice';

const output = new NullOutput();
  • Zero browser dependencies — no navigator, window, or AudioContext
  • Works in Node.js, Bun, and Deno

Choosing providers

For prototyping: NativeSTT + any LLM + NativeTTS — no API keys except the LLM.

For production: DeepgramSTT + AnthropicLLM + DeepgramTTS — best accuracy, lowest latency, streaming throughout.

For privacy: NativeSTT + WebLLMLLM + NativeTTS — everything runs in the browser. No data leaves the device.

For lowest latency: DeepgramFlux + GroqLLM + DeepgramTTS — eager end-of-turn signals, fastest LLM inference, low-latency streaming TTS.

© 2026 CompositeVoice. All rights reserved.
