Providers

Every input, STT, LLM, TTS, and output provider — supported models, transport, features, and configuration.

CompositeVoice uses five pipeline roles — input (audio capture), STT (speech-to-text), LLM (large language model), TTS (text-to-speech), and output (audio playback). Mix and match any combination to build your voice pipeline. Some providers cover multiple roles (e.g., NativeSTT handles both input and stt).
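
This mix-and-match rule can be sketched as a small helper: each provider declares which of the five slots it fills, and a pipeline is complete when every slot is covered. The `Role` type and `missingRoles` function below are illustrative, not part of the library's API:

```typescript
type Role = 'input' | 'stt' | 'llm' | 'tts' | 'output';

interface ProviderLike {
  name: string;
  roles: Role[];
}

// Return the pipeline slots not yet covered by the given providers.
function missingRoles(providers: ProviderLike[]): Role[] {
  const all: Role[] = ['input', 'stt', 'llm', 'tts', 'output'];
  const covered = new Set(providers.flatMap((p) => p.roles));
  return all.filter((role) => !covered.has(role));
}

// NativeSTT covers both input and stt, so three slots remain.
const nativeStt: ProviderLike = { name: 'NativeSTT', roles: ['input', 'stt'] };
missingRoles([nativeStt]); // → ['llm', 'tts', 'output']
```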

Audio Input

| Provider | Environment | Roles | Description |
| --- | --- | --- | --- |
| MicrophoneInput | Browser | input | Wraps getUserMedia + AudioContext for microphone capture |
| BufferInput | Node/Bun/Deno | input | Accepts pushed ArrayBuffer data for server-side pipelines |
| NativeSTT | Browser | input + stt | Browser’s Web Speech API manages its own microphone internally |

MicrophoneInput

Captures audio from the browser’s microphone via getUserMedia and AudioContext. Use this when pairing with a WebSocket-based STT provider like DeepgramSTT or AssemblyAISTT.

import { MicrophoneInput } from '@lukeocodes/composite-voice';

const input = new MicrophoneInput({
  sampleRate: 16000,        // audio sample rate in Hz
});
  • Buffers audio frames in the input queue during STT connection — no audio is ever lost
  • Works in all modern browsers that support getUserMedia
  • Requires HTTPS or localhost

BufferInput

Accepts audio data pushed programmatically. Use this for server-side pipelines (Node.js, Bun, Deno) where there is no microphone.

import { BufferInput } from '@lukeocodes/composite-voice';

const input = new BufferInput({
  sampleRate: 16000,
  encoding: 'linear16',
  channels: 1,
  bitDepth: 16,
});

// Push audio from any source (file, stream, WebSocket, etc.)
input.push(audioBuffer);
  • Zero browser dependencies — no navigator, window, or AudioContext
  • Works in Node.js, Bun, and Deno
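
The settings above fix the byte rate of the pushed audio, which is handy when slicing a file or stream into frames before calling `push`. A small sketch of the arithmetic (the helper is not part of the library):

```typescript
// Bytes needed for one frame of `frameMs` milliseconds of raw PCM,
// given sample rate, channel count, and bit depth (linear16 = 2 bytes
// per sample per channel).
function frameBytes(
  sampleRate: number,
  channels: number,
  bitDepth: number,
  frameMs: number
): number {
  return (sampleRate * channels * (bitDepth / 8) * frameMs) / 1000;
}

// The BufferInput config above (16 kHz, mono, 16-bit):
frameBytes(16000, 1, 16, 20); // → 640 bytes per 20 ms frame
```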

Speech-to-Text (STT)

| Provider | Transport | Models | Interim Results | Preflight |
| --- | --- | --- | --- | --- |
| NativeSTT | Browser API | Browser default | Yes | No |
| DeepgramSTT | WebSocket | V1: nova-3, nova-2 | Yes | No |
| DeepgramFlux | WebSocket | V2: flux-general-en | Yes | Yes |
| AssemblyAISTT | WebSocket | Default model | Yes | No |
| ElevenLabsSTT | WebSocket | scribe_v2_realtime | Yes | No |

NativeSTT

Uses the browser’s built-in Web Speech API. Zero API keys required. Best for prototyping and demos.

import { NativeSTT } from '@lukeocodes/composite-voice';

const stt = new NativeSTT({
  language: 'en-US',        // BCP 47 language tag
  continuous: true,          // keep listening after each result
  interimResults: true,      // emit partial transcripts
  maxAlternatives: 1,        // number of recognition alternatives
});
  • No API key needed
  • Works offline
  • Supports 50+ languages via the browser
  • Managed audio — the browser controls the microphone directly
  • Does not work in de-Googled browsers (Ungoogled Chromium, Brave) — the Web Speech API requires Google’s speech servers
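
Because the Web Speech API is unevenly supported, it can help to probe for it before constructing NativeSTT and fall back to a cloud STT provider otherwise. A minimal check (the helper name is illustrative; note that mere presence of the API does not guarantee it works, as the de-Googled-browser caveat above shows):

```typescript
// True when the given global object exposes the Web Speech API that
// NativeSTT wraps ('webkit'-prefixed in Chromium-based browsers).
function supportsNativeSTT(g: object = globalThis): boolean {
  return 'SpeechRecognition' in g || 'webkitSpeechRecognition' in g;
}

// In a browser, call supportsNativeSTT() before `new NativeSTT(...)`.
// In Node/Bun/Deno it returns false, so pick a WebSocket provider.
```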

API reference

DeepgramSTT

Production-grade real-time speech recognition via WebSocket using Deepgram’s V1 (Nova) API. Best accuracy across the widest range of languages.

import { DeepgramSTT } from '@lukeocodes/composite-voice';

const stt = new DeepgramSTT({
  proxyUrl: '/api/proxy/deepgram',   // server proxy (recommended)
  // OR: apiKey: 'dg-...',           // direct API key (dev only)
  language: 'en',
  interimResults: true,
  options: {
    model: 'nova-3',          // nova-3 (recommended), nova-2, nova-3-medical
    smartFormat: true,         // auto-punctuation and formatting
    punctuation: true,
    profanityFilter: false,
    diarize: false,            // speaker identification
    endpointing: 300,          // ms of silence before end-of-speech
    utteranceEndMs: 1000,      // ms before utterance boundary
  },
});
  • nova-3 (highest accuracy, recommended default), nova-2 (wider language support)
  • Word-level confidence and timestamps
  • Smart formatting and auto-punctuation
  • Profanity filtering
  • Speaker diarization
  • VAD events

Does not support preflight/eager end-of-turn signals. For the eager LLM pipeline, use DeepgramFlux.

API reference

DeepgramFlux

Low-latency real-time speech recognition via WebSocket using Deepgram’s V2 (Flux) API. Supports eager end-of-turn signals for the eager LLM pipeline.

import { DeepgramFlux } from '@lukeocodes/composite-voice';

const stt = new DeepgramFlux({
  proxyUrl: '/api/proxy/deepgram',   // server proxy (recommended)
  // OR: apiKey: 'dg-...',           // direct API key (dev only)
  options: {
    model: 'flux-general-en',
    eagerEotThreshold: 0.5,    // enables eager end-of-turn signals
    eotThreshold: 0.7,
  },
});
  • Turn-based transcription via TurnInfo events
  • Eager end-of-turn signals (EagerEndOfTurn with isPreflight: true)
  • Configurable end-of-turn confidence thresholds
  • Keyterm boosting for domain vocabulary
  • Only STT provider that supports the eager LLM pipeline
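
The two thresholds in the config above form a band: confidence at or above eotThreshold commits the turn, while confidence between eagerEotThreshold and eotThreshold triggers the eager (preflight) signal. A conceptual sketch of that mapping, not the provider's internal code:

```typescript
type TurnSignal = 'continue' | 'eager-end-of-turn' | 'end-of-turn';

// Map Flux's end-of-turn confidence onto the two configured thresholds.
function classifyTurn(
  confidence: number,
  eagerEotThreshold: number, // e.g. 0.5: start the LLM speculatively
  eotThreshold: number       // e.g. 0.7: commit the turn
): TurnSignal {
  if (confidence >= eotThreshold) return 'end-of-turn';
  if (confidence >= eagerEotThreshold) return 'eager-end-of-turn';
  return 'continue';
}

classifyTurn(0.6, 0.5, 0.7); // → 'eager-end-of-turn'
```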

API reference

AssemblyAISTT

Real-time speech recognition via WebSocket with word boosting for domain-specific vocabulary.

import { AssemblyAISTT } from '@lukeocodes/composite-voice';

const stt = new AssemblyAISTT({
  proxyUrl: '/api/proxy/assemblyai',
  // OR: apiKey: '...',
  sampleRate: 16000,
  language: 'en',
  wordBoost: ['CompositeVoice', 'WebSocket'],  // boost domain terms
});
  • Word boosting for domain vocabulary
  • Word-level timestamps and confidence
  • Automatic reconnection

API reference

ElevenLabsSTT

Real-time speech recognition via WebSocket using ElevenLabs Scribe V2 with ~150ms latency and 90+ language support.

import { ElevenLabsSTT } from '@lukeocodes/composite-voice';

const stt = new ElevenLabsSTT({
  proxyUrl: '/api/proxy/elevenlabs',
  // OR: apiKey: '...',
  // OR: token: '...',             // single-use token
  model: 'scribe_v2_realtime',
  audioFormat: 'pcm_16000',
  language: 'en',                  // BCP 47, ISO 639-1, or ISO 639-3
  commitStrategy: 'vad',           // 'vad' (default) or 'manual'
  includeTimestamps: true,         // word-level timestamps
});
  • VAD and manual commit strategies
  • 90+ languages with auto-detection
  • Word-level timestamps and confidence
  • Three auth methods (API key, proxy, single-use token)
  • Shares proxy config with ElevenLabsTTS

API reference


Large Language Models (LLM)

| Provider | Base | Default Model | Streaming |
| --- | --- | --- | --- |
| AnthropicLLM | Custom | claude-haiku-4-5 | Yes |
| OpenAILLM | OpenAI-compatible | (required) | Yes |
| GroqLLM | OpenAI-compatible | llama-3.3-70b-versatile | Yes |
| MistralLLM | OpenAI-compatible | mistral-small-latest | Yes |
| GeminiLLM | OpenAI-compatible | gemini-2.0-flash | Yes |
| WebLLMLLM | Custom | (required) | Yes |
| OpenAICompatibleLLM | OpenAI-compatible | (required) | Yes |

AnthropicLLM

Claude models via the Anthropic API. Uses a dedicated SDK (not OpenAI-compatible).

import { AnthropicLLM } from '@lukeocodes/composite-voice';

const llm = new AnthropicLLM({
  proxyUrl: '/api/proxy/anthropic',
  model: 'claude-haiku-4-5',    // claude-haiku-4-5, claude-sonnet-4-6, claude-opus-4-6
  maxTokens: 1024,               // required (default: 1024)
});
  • System prompts at top level (Anthropic API convention)
  • Streaming via SSE
  • AbortSignal cancellation for the eager pipeline
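
The cancellation hook is the standard AbortController pattern: give each assistant turn its own controller and abort the previous one when the user barges in. A minimal sketch, assuming only that the provider accepts an AbortSignal:

```typescript
// One controller per in-flight response; starting a new turn
// cancels the previous one.
let current: AbortController | null = null;

function startTurn(): AbortSignal {
  current?.abort();              // cancel any stale streaming response
  current = new AbortController();
  return current.signal;         // hand this signal to the LLM request
}

const firstTurn = startTurn();
const secondTurn = startTurn(); // user barged in mid-response
// firstTurn.aborted is now true; secondTurn.aborted is false.
```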

API reference

OpenAILLM

GPT models via the OpenAI API.

import { OpenAILLM } from '@lukeocodes/composite-voice';

const llm = new OpenAILLM({
  proxyUrl: '/api/proxy/openai',
  model: 'gpt-4o-mini',
  // organizationId: 'org-...',  // for multi-org accounts
});

API reference

GroqLLM

Ultra-fast inference on Groq’s LPU hardware. Supports open-source models.

import { GroqLLM } from '@lukeocodes/composite-voice';

const llm = new GroqLLM({
  proxyUrl: '/api/proxy/groq',
  model: 'llama-3.3-70b-versatile',  // or mixtral-8x7b-32768, gemma2-9b-it
});
  • Lowest latency of any cloud LLM provider
  • Wide range of open-source models

API reference

MistralLLM

Mistral models with strong multilingual support.

import { MistralLLM } from '@lukeocodes/composite-voice';

const llm = new MistralLLM({
  proxyUrl: '/api/proxy/mistral',
  model: 'mistral-small-latest',  // or mistral-medium-latest, mistral-large-latest
});

API reference

GeminiLLM

Google Gemini models via their OpenAI-compatible endpoint.

import { GeminiLLM } from '@lukeocodes/composite-voice';

const llm = new GeminiLLM({
  proxyUrl: '/api/proxy/gemini',
  model: 'gemini-2.0-flash',  // or gemini-1.5-pro, gemini-1.5-flash
});

API reference

WebLLMLLM

Run LLMs entirely in the browser via WebGPU. No API keys, no network, full privacy.

import { WebLLMLLM } from '@lukeocodes/composite-voice';

const llm = new WebLLMLLM({
  model: 'Llama-3.2-1B-Instruct-q4f16_1-MLC',
  onLoadProgress: (progress) => {
    console.log(`Loading: ${(progress.progress * 100).toFixed(0)}%`);
  },
});
  • All data stays in the browser
  • Works offline after initial model download
  • Requires a WebGPU-capable browser
  • First load downloads model weights (100+ MB)
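
Since the first load downloads large model weights, it is worth probing for WebGPU before constructing the provider. A minimal capability check (the helper is illustrative; it accepts the global object so it can be tested anywhere):

```typescript
// True when the runtime exposes WebGPU, which WebLLMLLM requires.
function hasWebGPU(g: { navigator?: object } = globalThis): boolean {
  return g.navigator !== undefined && 'gpu' in g.navigator;
}

// If this returns false, fall back to a cloud LLM provider instead
// of paying for a model download that cannot run.
```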

API reference

OpenAICompatibleLLM

Base class for any service that speaks the OpenAI chat completions format. Use this to connect custom or self-hosted models.

import { OpenAICompatibleLLM } from '@lukeocodes/composite-voice';

const llm = new OpenAICompatibleLLM({
  endpoint: 'https://my-model-server.example.com/v1',
  model: 'my-custom-model',
  apiKey: '...',
});

API reference


Text-to-Speech (TTS)

| Provider | Transport | Voices | Streaming | Audio Format |
| --- | --- | --- | --- | --- |
| NativeTTS | Browser API | System voices | No (managed) | N/A |
| DeepgramTTS | WebSocket | Aura 2 (7 voices) | Yes | linear16, mulaw, alaw |
| OpenAITTS | REST | 6 voices | No | mp3, opus, aac, flac, wav |
| ElevenLabsTTS | WebSocket | Custom voice IDs | Yes | pcm, mp3, ulaw |
| CartesiaTTS | WebSocket | Custom voice IDs | Yes | pcm (s16le, f32le, mulaw, alaw) |

NativeTTS

Uses the browser’s built-in SpeechSynthesis API. Zero API keys required.

import { NativeTTS } from '@lukeocodes/composite-voice';

const tts = new NativeTTS({
  voiceName: 'Samantha',    // partial match against available voices
  voiceLang: 'en-US',       // BCP 47 fallback filter
  rate: 1.0,                // speech rate
  pitch: 0,                 // semitones (-20 to 20)
});
  • No API key needed
  • Works offline
  • Managed audio — the browser plays directly
  • Supports pause, resume, and cancel
  • Voice enumeration via getAvailableVoices()
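
The partial-match behavior of voiceName can be sketched as follows. This is an illustrative reimplementation of the selection rule, not the provider's actual code:

```typescript
interface Voice {
  name: string;
  lang: string;
}

// Prefer a partial name match; fall back to the first voice with a
// matching language, then to the first available voice.
function pickVoice(
  voices: Voice[],
  voiceName?: string,
  voiceLang?: string
): Voice | undefined {
  if (voiceName) {
    const byName = voices.find((v) => v.name.includes(voiceName));
    if (byName) return byName;
  }
  if (voiceLang) {
    return voices.find((v) => v.lang === voiceLang);
  }
  return voices[0];
}
```

In a browser, the voices array would come from `getAvailableVoices()`.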

API reference

DeepgramTTS

Low-latency real-time streaming TTS via WebSocket with Aura 2 voices.

import { DeepgramTTS } from '@lukeocodes/composite-voice';

const tts = new DeepgramTTS({
  proxyUrl: '/api/proxy/deepgram',
  voice: 'aura-2-thalia-en',    // thalia, andromeda, janus, proteus, orion, luna, arcas
  sampleRate: 24000,
  outputFormat: 'linear16',
});
  • Lowest latency streaming TTS
  • Word-level timing metadata
  • Aura 2 voice models

API reference

OpenAITTS

OpenAI text-to-speech via REST. Returns complete audio in one request.

import { OpenAITTS } from '@lukeocodes/composite-voice';

const tts = new OpenAITTS({
  proxyUrl: '/api/proxy/openai',
  model: 'tts-1',          // tts-1 (fast) or tts-1-hd (quality)
  voice: 'nova',           // alloy, echo, fable, onyx, nova, shimmer
  responseFormat: 'mp3',   // mp3, opus, aac, flac, wav
  speed: 1.0,              // 0.25 to 4.0
});
  • Six distinct voices
  • Quality/speed tradeoff via model selection
  • 4096 character limit per request
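
The 4096-character limit means long responses must be split across requests before synthesis. A sketch of a splitter that prefers sentence boundaries (the helper is not part of the library):

```typescript
// Split text into chunks no longer than `limit` characters,
// breaking at the last sentence end inside each window when possible.
function chunkForTTS(text: string, limit = 4096): string[] {
  const chunks: string[] = [];
  let rest = text.trim();
  while (rest.length > limit) {
    const window = rest.slice(0, limit);
    const cut = Math.max(
      window.lastIndexOf('. '),
      window.lastIndexOf('! '),
      window.lastIndexOf('? ')
    );
    const at = cut > 0 ? cut + 1 : limit; // no boundary: hard cut
    chunks.push(rest.slice(0, at).trim());
    rest = rest.slice(at).trim();
  }
  if (rest) chunks.push(rest);
  return chunks;
}
```

Each chunk can then be sent as a separate OpenAITTS request and played back in order.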

API reference

ElevenLabsTTS

High-quality voice cloning and synthesis via WebSocket streaming.

import { ElevenLabsTTS } from '@lukeocodes/composite-voice';

const tts = new ElevenLabsTTS({
  proxyUrl: '/api/proxy/elevenlabs',
  voiceId: 'your-voice-id',           // from ElevenLabs dashboard
  modelId: 'eleven_turbo_v2_5',       // turbo_v2_5, turbo_v2, multilingual_v2
  stability: 0.5,                      // voice consistency (0-1)
  similarityBoost: 0.75,              // voice fidelity (0-1)
  outputFormat: 'pcm_16000',          // pcm_16000, pcm_22050, pcm_24000, mp3_44100_128
});
  • Voice cloning
  • Multilingual models
  • Stability and similarity controls
  • Multiple output formats

API reference

CartesiaTTS

Ultra-low-latency streaming TTS with emotion controls.

import { CartesiaTTS } from '@lukeocodes/composite-voice';

const tts = new CartesiaTTS({
  proxyUrl: '/api/proxy/cartesia',
  voiceId: 'your-voice-id',
  modelId: 'sonic-2',           // sonic-2 (latest), sonic, sonic-multilingual
  language: 'en',
  outputEncoding: 'pcm_s16le',
  outputSampleRate: 16000,
  speed: 'normal',              // or 'slow', 'fast'
  emotion: ['positivity:high'], // emotion tags
});
  • Context-based streaming links chunks into coherent utterances
  • Emotion controls
  • Word-level timestamps
  • sonic-2 model delivers the lowest latency

API reference


Audio Output

| Provider | Environment | Roles | Description |
| --- | --- | --- | --- |
| BrowserAudioOutput | Browser | output | Wraps AudioContext for speaker playback |
| NullOutput | Node/Bun/Deno | output | Silently discards audio for server-side pipelines |
| NativeTTS | Browser | tts + output | Browser’s SpeechSynthesis API manages its own speaker output |

BrowserAudioOutput

Plays audio through the browser’s AudioContext and speakers. Use this when pairing with a WebSocket-based or REST-based TTS provider like DeepgramTTS, ElevenLabsTTS, or OpenAITTS.

import { BrowserAudioOutput } from '@lukeocodes/composite-voice';

const output = new BrowserAudioOutput();
  • Handles AudioContext resumption after user gestures
  • Buffers audio frames in the output queue during setup — no audio is ever lost

NullOutput

Silently discards all audio. Use this for server-side pipelines where there are no speakers.

import { NullOutput } from '@lukeocodes/composite-voice';

const output = new NullOutput();
  • Zero browser dependencies — no navigator, window, or AudioContext
  • Works in Node.js, Bun, and Deno

Choosing providers

For prototyping: NativeSTT + any LLM + NativeTTS — no API keys except the LLM.

For production: DeepgramSTT + AnthropicLLM + DeepgramTTS — best accuracy, lowest latency, streaming throughout.

For privacy: NativeSTT + WebLLMLLM + NativeTTS — everything runs in the browser. No data leaves the device.

For lowest latency: DeepgramFlux + GroqLLM + DeepgramTTS — eager end-of-turn signals, fastest LLM inference, low-latency streaming TTS.

© 2026 CompositeVoice. All rights reserved.
