
DeepgramSTT

Add production-grade real-time speech recognition to your voice pipeline with Deepgram's WebSocket API.

Use DeepgramSTT for production voice pipelines that need high accuracy, word-level timestamps, and wide language/model support via Deepgram’s V1 (Nova) streaming API.

Looking for eager end-of-turn / preflight signals? Use DeepgramFlux instead — it connects to Deepgram’s V2 API and supports the eager LLM pipeline.

Prerequisites

  • A Deepgram API key
  • No additional peer dependencies required

DeepgramSTT connects through a raw native WebSocket that it manages directly.

For production, set up a proxy server so your API key stays server-side.
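The connection decision can be sketched as follows. This is illustrative, not the provider's actual internals: with a proxyUrl the browser connects to your server and sends no credentials; a direct apiKey connection authenticates via the ['token', apiKey] WebSocket subprotocol and should be limited to development.

```typescript
// Hedged sketch: decide where the WebSocket connects and how it authenticates.
// The endpoint and subprotocol shown match Deepgram's documented V1 streaming
// auth, but the function name and shape here are illustrative.
function resolveConnection(config: { proxyUrl?: string; apiKey?: string }): {
  url: string;
  protocols?: string[];
} {
  if (config.proxyUrl) {
    // Proxy case: the browser never sees the API key; the proxy attaches it server-side.
    return { url: config.proxyUrl };
  }
  if (config.apiKey) {
    // Development-only direct connection, authenticated via WebSocket subprotocols.
    return { url: 'wss://api.deepgram.com/v1/listen', protocols: ['token', config.apiKey] };
  }
  throw new Error('Provide proxyUrl (recommended) or apiKey (development only)');
}
```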

Basic setup

import { CompositeVoice, MicrophoneInput, DeepgramSTT, AnthropicLLM, NativeTTS } from '@lukeocodes/composite-voice';

const agent = new CompositeVoice({
  providers: [
    new MicrophoneInput(),
    new DeepgramSTT({
      proxyUrl: '/api/proxy/deepgram',
      options: {
        model: 'nova-3',
        smartFormat: true,
      },
    }),
    new AnthropicLLM({
      proxyUrl: '/api/proxy/anthropic',
      model: 'claude-haiku-4-5',
      systemPrompt: 'You are a helpful voice assistant. Keep responses brief.',
    }),
    new NativeTTS(),
  ],
});

await agent.initialize();
await agent.startListening();

Configuration options

| Option | Type | Default | Description |
| --- | --- | --- | --- |
| proxyUrl | string | — | URL of your CompositeVoice proxy endpoint (recommended) |
| apiKey | string | — | Deepgram API key (development only) |
| authType | 'token' \| 'bearer' | 'token' | Controls WebSocket auth. 'token' sends the subprotocol ['token', apiKey]; set to 'bearer' for OAuth tokens |
| language | string | 'en-US' | Language code |
| interimResults | boolean | true | Emit partial transcripts while the user speaks |
| options.model | string | 'nova-3' | Transcription model (see the model table below) |
| options.smartFormat | boolean | true | Automatic punctuation and formatting |
| options.punctuation | boolean | true | Add punctuation to results |
| options.endpointing | boolean \| number | false | Milliseconds of silence before end-of-speech (false to disable) |
| options.diarize | boolean | false | Speaker identification (V1 only) |
| options.keywords | string[] | — | Boost recognition of specific terms, with optional weight (e.g. 'Deepgram:2') |
| options.vadEvents | boolean | false | Emit SpeechStarted events (V1 only) |
| options.detectEntities | boolean | false | Detect entities in the transcript (V1 only) |
| options.numerals | boolean | false | Convert spoken numbers to digits (V1 only) |
| options.redact | string[] | — | Redact sensitive info: 'pci', 'ssn', 'numbers' (V1 only) |
| options.multichannel | boolean | false | Transcribe each audio channel independently (V1 only) |
| options.utterances | boolean | false | Enable utterance segmentation (V1 only) |

See the API reference for the full list.
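Deepgram's streaming endpoint expects snake_case query parameters (smart_format, interim_results, and so on), while the options above are camelCase. As a rough sketch of that mapping — not the provider's actual internals — the translation could look like:

```typescript
// Illustrative sketch: convert camelCase option names to Deepgram's snake_case
// query parameters when building the streaming URL. Array options (keywords,
// redact) are repeated as multiple parameters.
function buildQuery(
  options: Record<string, string | number | boolean | string[]>,
): string {
  const params = new URLSearchParams();
  for (const [key, value] of Object.entries(options)) {
    // smartFormat -> smart_format, vadEvents -> vad_events, ...
    const snake = key.replace(/[A-Z]/g, (c) => `_${c.toLowerCase()}`);
    for (const v of Array.isArray(value) ? value : [value]) {
      params.append(snake, String(v));
    }
  }
  return params.toString();
}
```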

Models

DeepgramSTT uses Deepgram’s V1 (Nova) model family:

| Model | Description |
| --- | --- |
| nova-3 | Latest model, highest accuracy, recommended default |
| nova-3-medical | Optimized for medical terminology |
| nova-2 | Previous generation; use it if you need a language not yet in Nova-3 |
| nova-2-* | Domain variants: meeting, finance, conversationalai, voicemail, medical, drivethru, automotive |
| nova | Legacy; not recommended for new projects |

V1 uses an event-streaming model with Results events containing is_final and speech_final flags. Nova-3 delivers the best accuracy across the widest range of languages. Use Nova-2 variants for domain-specific vocabulary.
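A V1 Results message carries the transcript alongside the is_final and speech_final flags. As a hedged sketch — the field names below follow Deepgram's documented Results schema, but the helper itself is illustrative — extracting them from a raw WebSocket message looks roughly like:

```typescript
// Assumed shape of a Deepgram V1 "Results" message (simplified to the fields
// discussed above; the real payload carries more).
interface DeepgramResults {
  type: 'Results';
  is_final: boolean;
  speech_final: boolean;
  channel: { alternatives: { transcript: string; confidence: number }[] };
}

// Pull the best transcript out of a raw message, ignoring non-Results events
// such as Metadata or SpeechStarted.
function extractTranscript(
  raw: string,
): { text: string; isFinal: boolean; speechFinal: boolean } | null {
  const msg = JSON.parse(raw) as Partial<DeepgramResults>;
  if (msg.type !== 'Results') return null;
  const text = msg.channel?.alternatives?.[0]?.transcript ?? '';
  return { text, isFinal: Boolean(msg.is_final), speechFinal: Boolean(msg.speech_final) };
}
```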

For Flux models (e.g., flux-general-en) with turn-based transcription and eager end-of-turn signals, use the DeepgramFlux provider instead.

Complete example

import { CompositeVoice, MicrophoneInput, DeepgramSTT, AnthropicLLM, DeepgramTTS, BrowserAudioOutput } from '@lukeocodes/composite-voice';

const agent = new CompositeVoice({
  providers: [
    new MicrophoneInput(),
    new DeepgramSTT({
      proxyUrl: '/api/proxy/deepgram',
      language: 'en',
      interimResults: true,
      options: {
        model: 'nova-3',
        smartFormat: true,
        punctuation: true,
        endpointing: 300,
        keywords: ['CompositeVoice'],
      },
    }),
    new AnthropicLLM({
      proxyUrl: '/api/proxy/anthropic',
      model: 'claude-haiku-4-5',
      maxTokens: 256,
      systemPrompt: 'You are a helpful voice assistant. Keep responses under two sentences.',
    }),
    new DeepgramTTS({
      proxyUrl: '/api/proxy/deepgram',
      voice: 'aura-2-thalia-en',
    }),
    new BrowserAudioOutput(),
  ],
  // eagerLLM requires DeepgramFlux — see the DeepgramFlux guide for eager pipeline setup
  conversationHistory: { enabled: true, maxTurns: 10 },
  logging: { enabled: true, level: 'info' },
});

agent.on('transcription.final', (event) => {
  console.log('User said:', event.text);
});

await agent.initialize();
await agent.startListening();

How utterance completion works

DeepgramSTT buffers is_final segments from the Deepgram WebSocket and emits the complete utterance text when speech_final arrives. Internally, this sets utteranceComplete: true on the TranscriptionResult, which is the flag CompositeVoice checks to trigger LLM processing. The older speechFinal field is still present on transcription events for display purposes but is deprecated for pipeline triggering — utteranceComplete is now the canonical signal.
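The buffering described above can be sketched like this. The class name and method shape are illustrative, not the library's internals; only the is_final/speech_final semantics come from the behavior documented here.

```typescript
// Hedged sketch: collect is_final segments and flush the joined utterance
// when speech_final arrives, mirroring the buffering described above.
class UtteranceBuffer {
  private segments: string[] = [];

  // Returns the complete utterance when speech_final closes it, otherwise null.
  push(text: string, isFinal: boolean, speechFinal: boolean): string | null {
    if (isFinal && text.trim()) this.segments.push(text.trim());
    if (!speechFinal) return null;
    const utterance = this.segments.join(' ');
    this.segments = [];
    return utterance || null;
  }
}
```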

Tips and gotchas

  • Always use a proxy in production. Pass proxyUrl instead of apiKey so your Deepgram key never reaches the browser. The SDK converts http(s) to ws(s) automatically.
  • No peer dependencies. DeepgramSTT uses a raw native WebSocket, not the @deepgram/sdk. No extra packages to install.
  • Utterance buffering. Deepgram may split one utterance into multiple is_final segments before emitting speech_final. DeepgramSTT buffers these segments and delivers the complete utterance text when utteranceComplete: true.
  • No preflight signals. DeepgramSTT (V1/Nova) does not emit preflight/eager end-of-turn events. For the eager LLM pipeline, use DeepgramFlux instead.
  • Connection timeout. The WebSocket connection defaults to a 10-second timeout. Adjust with timeout in the config if your network is slow.
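The http(s) to ws(s) conversion mentioned in the first tip amounts to a scheme swap that preserves TLS. A minimal sketch (the SDK's actual implementation may differ):

```typescript
// Hedged sketch: convert an absolute http(s) proxy URL to the ws(s) scheme,
// keeping https -> wss so TLS is preserved.
function toWebSocketUrl(url: string): string {
  return url.replace(/^http(s?):/, 'ws$1:');
}
```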

© 2026 CompositeVoice. All rights reserved.
