Audio Configuration
Configure microphone capture and audio playback — sample rates, buffering, noise suppression, and more.
How audio flows through the SDK
CompositeVoice manages audio through input and output providers that sit at opposite ends of the voice pipeline:
- Input provider (e.g., MicrophoneInput) — captures audio and delivers PCM chunks to the STT provider.
- Output provider (e.g., BrowserAudioOutput) — receives audio chunks from the TTS provider and plays them through the speakers.
Configure audio settings directly on the input and output providers, not on the top-level CompositeVoice config:
```typescript
import { CompositeVoice, MicrophoneInput, BrowserAudioOutput } from '@lukeocodes/composite-voice';

const agent = new CompositeVoice({
  providers: [
    new MicrophoneInput({ /* AudioInputConfig */ }),
    // ...your STT, LLM, and TTS providers
    new BrowserAudioOutput({ /* AudioOutputConfig */ }),
  ],
});
```
Any option you omit falls back to a sensible default.
Audio capture (microphone input)
MicrophoneInput wraps the browser’s getUserMedia and Web Audio API into a start/stop interface. When capture starts, the following pipeline is assembled:
```
getUserMedia (MediaStream)
        |
MediaStreamAudioSourceNode
        |
AudioWorkletNode (preferred, off-main-thread)
  or ScriptProcessorNode (fallback for older browsers)
        |
Downsample if hardware rate differs from config
        |
Float32 -> Int16 PCM conversion
        |
ArrayBuffer delivered to STT provider via sendAudio()
```
The SDK requests microphone access with the constraints you specify (sample rate, channel count, echo cancellation, noise suppression, automatic gain control). It creates an AudioContext at the configured sample rate and uses an AudioWorkletNode for off-main-thread audio processing. If AudioWorklet is unavailable (older browsers, restricted environments), the SDK falls back to the deprecated ScriptProcessorNode. Both paths produce identical PCM output.
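The worklet-or-fallback decision reduces to checking whether the AudioContext exposes an `audioWorklet` property. A minimal sketch of that check (`chooseProcessorPath` is illustrative, not an SDK export; it accepts anything shaped like an AudioContext so the logic can run outside the browser):

```typescript
// Illustrative sketch (not an SDK export): pick the audio processing path
// based on AudioWorklet availability.
type ProcessorPath = 'audio-worklet' | 'script-processor';

function chooseProcessorPath(ctx: { audioWorklet?: unknown }): ProcessorPath {
  // Supporting browsers expose `audioWorklet` on the AudioContext; older
  // engines leave it undefined, forcing the ScriptProcessorNode fallback.
  return ctx.audioWorklet ? 'audio-worklet' : 'script-processor';
}
```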
If the hardware sample rate differs from your configured rate (e.g., the microphone captures at 48kHz but you configured 16kHz), the SDK automatically downsamples using sample-window averaging before converting to 16-bit PCM.
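The two conversion steps described above can be sketched as pure functions. This is an illustration of the technique (window averaging, then Float32 to Int16), not the SDK's internal code:

```typescript
// Downsample by averaging each window of input samples. For 48kHz -> 16kHz
// the ratio is 3, so every 3 input samples become 1 output sample.
function downsample(input: Float32Array, inputRate: number, outputRate: number): Float32Array {
  if (inputRate === outputRate) return input;
  const ratio = inputRate / outputRate;
  const outLength = Math.floor(input.length / ratio);
  const output = new Float32Array(outLength);
  for (let i = 0; i < outLength; i++) {
    const start = Math.floor(i * ratio);
    const end = Math.min(Math.floor((i + 1) * ratio), input.length);
    let sum = 0;
    for (let j = start; j < end; j++) sum += input[j];
    output[i] = sum / (end - start);
  }
  return output;
}

// Convert [-1, 1] float samples to 16-bit signed PCM.
function floatTo16BitPCM(input: Float32Array): Int16Array {
  const output = new Int16Array(input.length);
  for (let i = 0; i < input.length; i++) {
    const s = Math.max(-1, Math.min(1, input[i])); // clamp out-of-range samples
    output[i] = s < 0 ? s * 0x8000 : s * 0x7fff;
  }
  return output;
}
```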
AudioInputConfig reference
| Option | Type | Default | Description |
|---|---|---|---|
| sampleRate | number | 16000 | Sample rate in Hz. Most STT providers work best at 16000. |
| format | string | 'pcm' | Audio format. Currently 'pcm' (16-bit linear) is fully implemented. |
| channels | number | 1 | Channel count. Use 1 (mono) for speech. |
| chunkDuration | number | 100 | Duration of each audio chunk in milliseconds. |
| echoCancellation | boolean | true | Enable browser echo cancellation. Prevents TTS audio from being re-transcribed. |
| noiseSuppression | boolean | true | Enable browser noise suppression. Reduces background noise for cleaner transcription. |
| autoGainControl | boolean | true | Enable automatic gain control. Normalizes volume when users move relative to the mic. |
The defaults are exported as DEFAULT_AUDIO_INPUT_CONFIG:
```typescript
{
  sampleRate: 16000,
  format: 'pcm',
  channels: 1,
  chunkDuration: 100,
  echoCancellation: true,
  noiseSuppression: true,
  autoGainControl: true,
}
```
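The fallback behavior can be sketched as a shallow merge of your partial config over the exported defaults. Assuming simple spread-merge semantics (`resolveInputConfig` is illustrative, not an SDK export):

```typescript
// Defaults as documented for DEFAULT_AUDIO_INPUT_CONFIG.
const DEFAULT_AUDIO_INPUT_CONFIG = {
  sampleRate: 16000,
  format: 'pcm',
  channels: 1,
  chunkDuration: 100,
  echoCancellation: true,
  noiseSuppression: true,
  autoGainControl: true,
};

type AudioInputConfig = typeof DEFAULT_AUDIO_INPUT_CONFIG;

// Illustrative merge: any omitted option keeps its default value.
function resolveInputConfig(overrides: Partial<AudioInputConfig> = {}): AudioInputConfig {
  return { ...DEFAULT_AUDIO_INPUT_CONFIG, ...overrides };
}

const config = resolveInputConfig({ chunkDuration: 50 });
// config.chunkDuration is 50; every other option keeps its default.
```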
Audio playback (speaker output)
BrowserAudioOutput uses the Web Audio API to play TTS audio through the speakers. It supports two modes:
- Complete playback — play a single audio Blob via play() (used by REST-based TTS providers like OpenAITTS).
- Streaming playback — queue individual AudioChunk objects via addChunk(), buffered and played sequentially (used by WebSocket-based TTS providers like DeepgramTTS, ElevenLabsTTS, CartesiaTTS).
For streaming playback, the player implements a buffering strategy:
- Chunks arrive from the TTS provider and are pushed into an internal queue.
- The player waits until the buffered duration meets minBufferDuration before starting playback.
- Each chunk is decoded into an AudioBuffer (via decodeAudioData, with a raw-PCM fallback using AudioMetadata).
- Chunks are played sequentially through AudioBufferSourceNode instances connected to the AudioContext destination.
- When enableSmoothing is active, crossfading is applied between adjacent chunks to eliminate clicks at boundaries.
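The buffering gate in the steps above can be sketched as a small queue that tracks buffered duration. This is an illustration under stated assumptions, not the SDK's internal class (`StreamingQueue` and its shape are hypothetical):

```typescript
// Hypothetical chunk shape: decoded sample-frame count plus sample rate,
// enough to compute the chunk's duration.
interface QueuedChunk {
  samples: number;
  sampleRate: number;
}

// Illustrative buffering gate: playback begins only once enough audio
// has accumulated to cover minBufferDuration.
class StreamingQueue {
  private chunks: QueuedChunk[] = [];
  constructor(private minBufferDuration: number) {} // threshold in ms

  add(chunk: QueuedChunk): void {
    this.chunks.push(chunk);
  }

  // Total buffered audio in milliseconds.
  bufferedMs(): number {
    return this.chunks.reduce((ms, c) => ms + (c.samples / c.sampleRate) * 1000, 0);
  }

  canStartPlayback(): boolean {
    return this.bufferedMs() >= this.minBufferDuration;
  }
}
```

With the default minBufferDuration of 200, a single 100ms chunk (2400 frames at 24kHz) would not start playback, but two such chunks would.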
AudioOutputConfig reference
| Option | Type | Default | Description |
|---|---|---|---|
| bufferSize | number | 4096 | Buffer size in samples for audio processing. |
| minBufferDuration | number | 200 | Minimum buffered audio (ms) before playback starts. Prevents choppy output. |
| sampleRate | number | auto | AudioContext sample rate. Defaults to the TTS provider’s metadata or browser default. |
| enableSmoothing | boolean | true | Apply crossfading between chunks to eliminate clicks and pops at boundaries. |
The defaults are exported as DEFAULT_AUDIO_OUTPUT_CONFIG:
```typescript
{
  bufferSize: 4096,
  minBufferDuration: 200,
  enableSmoothing: true,
}
```
Managed audio vs. raw audio providers
Not all providers use the SDK’s MicrophoneInput and BrowserAudioOutput. The distinction matters when deciding which audio settings apply:
Managed audio providers handle their own audio I/O through browser APIs, bypassing the input/output providers entirely:
- NativeSTT uses the Web Speech API (SpeechRecognition), which captures microphone audio internally. Your MicrophoneInput settings (sample rate, noise suppression, etc.) have no effect on NativeSTT.
- NativeTTS uses the SpeechSynthesis API, which plays audio directly through the browser’s built-in speech engine. Your BrowserAudioOutput settings (buffer size, smoothing, etc.) have no effect on NativeTTS.
Raw audio providers stream audio data through the SDK, and your configuration applies fully:
- DeepgramSTT, AssemblyAISTT — receive PCM chunks from MicrophoneInput. All input config options take effect.
- DeepgramTTS, ElevenLabsTTS, CartesiaTTS — send audio chunks to BrowserAudioOutput. All output config options take effect.
- OpenAITTS — sends a complete audio blob to BrowserAudioOutput for one-shot playback.
If you use NativeSTT or NativeTTS, you typically do not need to include MicrophoneInput or BrowserAudioOutput in your providers array — those providers manage their own audio I/O.
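A minimal managed-audio setup might look like this. The constructor options for NativeSTT and NativeTTS are not documented in this section, so they are shown with no arguments as an assumption:

```typescript
import { CompositeVoice, NativeSTT, NativeTTS } from '@lukeocodes/composite-voice';

// With managed-audio providers there is no MicrophoneInput or
// BrowserAudioOutput in the array: the browser's speech engines own the
// audio I/O, so none of the AudioInputConfig / AudioOutputConfig options
// would apply here.
const agent = new CompositeVoice({
  providers: [
    new NativeSTT(),
    // ...your LLM provider
    new NativeTTS(),
  ],
});
```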
When to adjust audio settings
Mobile devices — mobile browsers often run at 48kHz natively. The SDK downsamples to your configured rate automatically, but you can reduce chunkDuration to 50ms for lower latency on fast connections, or increase it to 200ms to reduce processing overhead on slower devices:
```typescript
const agent = new CompositeVoice({
  providers: [
    new MicrophoneInput({
      sampleRate: 16000,
      chunkDuration: 200, // less frequent chunks, less CPU on mobile
    }),
    // ...your STT, LLM, and TTS providers
    new BrowserAudioOutput(),
  ],
});
```
Noisy environments — all three browser audio processing features are enabled by default. If you find that noise suppression interferes with speech detection (rare), you can disable it selectively:
```typescript
new MicrophoneInput({
  echoCancellation: true,
  noiseSuppression: false, // disable if it clips speech in your environment
  autoGainControl: true,
})
```
Low-latency needs — reduce minBufferDuration to start playback sooner. This risks audio glitches on slow networks, so test thoroughly:
```typescript
new BrowserAudioOutput({
  minBufferDuration: 50, // start playing after just 50ms of buffered audio
  bufferSize: 2048,      // smaller processing buffer
})
```
High-quality audio — if your TTS provider outputs 24kHz or 48kHz audio, match the output sample rate to avoid unnecessary resampling:
```typescript
new BrowserAudioOutput({
  sampleRate: 24000, // match Deepgram Aura 2 output
})
```
Full configuration example
```typescript
import {
  CompositeVoice,
  MicrophoneInput,
  DeepgramSTT,
  AnthropicLLM,
  DeepgramTTS,
  BrowserAudioOutput,
} from '@lukeocodes/composite-voice';

const agent = new CompositeVoice({
  providers: [
    new MicrophoneInput({
      sampleRate: 16000,
      format: 'pcm',
      channels: 1,
      chunkDuration: 100,
      echoCancellation: true,
      noiseSuppression: true,
      autoGainControl: true,
    }),
    new DeepgramSTT({
      proxyUrl: '/api/proxy/deepgram',
      interimResults: true,
      options: { model: 'nova-3', endpointing: 300 },
    }),
    new AnthropicLLM({
      proxyUrl: '/api/proxy/anthropic',
      model: 'claude-haiku-4-5',
      systemPrompt: 'You are a helpful voice assistant.',
      maxTokens: 200,
    }),
    new DeepgramTTS({
      proxyUrl: '/api/proxy/deepgram',
      options: { model: 'aura-2-thalia-en', encoding: 'linear16', sampleRate: 24000 },
    }),
    new BrowserAudioOutput({
      bufferSize: 4096,
      minBufferDuration: 200,
      sampleRate: 24000,
      enableSmoothing: true,
    }),
  ],
});

await agent.initialize();
await agent.startListening();
```