Owning the Signal: Why We Built Our Own Voice Activity Detector
Real-time voice systems depend on one critical decision: Is this speech? We built SonusVAD to own that layer.
High Level (For Execs)
If you don't need to know what a Hann window is, this section is for you.
Real-time voice systems depend on one critical decision: "Is this speech?" Most companies outsource that decision to opaque libraries or heavy third-party models. That works until latency, privacy, cost, or dependency risk becomes a product constraint.
We chose to own that layer.
SonusVAD is the real-time, browser-native voice detection core behind Skipflo's conversational infrastructure. It runs fully client-side. No external AI services. No model downloads. No black boxes.
We're open-sourcing it because very few companies share foundational signal infrastructure anymore. The core layers are usually hidden, wrapped, or monetized. We believe infrastructure credibility comes from showing your work.
If you're evaluating Skipflo: this is how we build.
The Details (For Engineers)
Real-time voice systems live or die at the signal layer. Before transcription. Before LLMs. Before orchestration. There's a single gating question:
Is this speech or not?
Most implementations fall into two categories:
- A black-box C/C++ implementation (e.g., WebRTC VAD)
- A deep neural model bundled via TensorFlow, ONNX, or TFLite
Both work. Both are widely used. But both introduce tradeoffs: opacity, dependency chains, model downloads, or limited tunability.
When building Skipflo's real-time conversational infrastructure, we wanted deterministic control over detection behavior. So we built SonusVAD.
Architecture
SonusVAD is a browser-native VAD that runs entirely client-side using the Web Audio API.
Mic -> Hann window -> Custom radix-2 FFT -> 10 engineered spectral features
-> 2-layer MLP (10 -> 8 ReLU -> 1 sigmoid) -> Exponential smoothing
-> Adaptive thresholding -> 4-state speech state machine
Frame size defaults to 256 samples at 16 kHz (~16 ms resolution). No external models. No WebAssembly. No ML frameworks. No runtime downloads.
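The front of that pipeline can be sketched in a few lines. This is a minimal illustration of the framing and Hann-windowing stage, assuming the defaults above (256-sample frames at 16 kHz); the constant and function names are ours, not SonusVAD's API.

```javascript
// Illustrative constants mirroring the stated defaults.
const FRAME_SIZE = 256;
const SAMPLE_RATE = 16000; // 256 / 16000 = 16 ms per frame

// Precompute a Hann window once; it tapers frame edges toward zero,
// reducing spectral leakage before the FFT.
const hannWindow = Float32Array.from(
  { length: FRAME_SIZE },
  (_, n) => 0.5 * (1 - Math.cos((2 * Math.PI * n) / (FRAME_SIZE - 1)))
);

// Apply the window to one frame of PCM samples.
function windowFrame(frame) {
  const out = new Float32Array(FRAME_SIZE);
  for (let n = 0; n < FRAME_SIZE; n++) out[n] = frame[n] * hannWindow[n];
  return out;
}
```

The windowed frame then feeds the radix-2 FFT, whose magnitude spectrum drives the feature extraction below.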
The neural layer is compact and fully inspectable:
- Input: 10 normalized features
- Hidden: 8 ReLU units
- Output: 1 sigmoid probability
Weights are readable in source. Inference is deterministic.
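A network this small needs no framework; the full forward pass is two loops. Here is a sketch of that inference step with placeholder weights (the real values live in SonusVAD's source):

```javascript
function relu(x) { return Math.max(0, x); }
function sigmoid(x) { return 1 / (1 + Math.exp(-x)); }

// Forward pass for a 10 -> 8 ReLU -> 1 sigmoid MLP.
// W1 is 8x10, b1 has 8 entries, W2 has 8 entries, b2 is a scalar.
function mlpForward(features, W1, b1, W2, b2) {
  const hidden = new Float32Array(8);
  for (let i = 0; i < 8; i++) {
    let sum = b1[i];
    for (let j = 0; j < 10; j++) sum += W1[i][j] * features[j];
    hidden[i] = relu(sum);
  }
  let out = b2;
  for (let i = 0; i < 8; i++) out += W2[i] * hidden[i];
  return sigmoid(out); // speech probability in (0, 1)
}
```

No matrix library, no graph runtime: the same inputs always produce the same output, which is what makes the detector auditable end to end.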
Feature Engineering
Each frame extracts interpretable, speech-relevant features:
- Log energy (RMS)
- Zero-crossing rate
- Spectral centroid
- Spectral flatness
- Speech-band ratio (300-3400 Hz)
- Low-band ratio (85-300 Hz)
- High-band ratio (3400-8000 Hz)
- F1 band ratio (300-1000 Hz)
- F2 band ratio (1000-2500 Hz)
- Previous frame probability (temporal persistence)
This hybrid approach combines classical DSP with a shallow classifier layer. The model is small by design. The intelligence comes from structured features, not depth.
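To make the list concrete, here is a sketch of three of those features, assuming a time-domain frame and its magnitude spectrum are already available. Bin resolution follows from the stated defaults (16 kHz / 256-point FFT); names are illustrative.

```javascript
const SAMPLE_RATE = 16000;
const FFT_SIZE = 256;
const BIN_HZ = SAMPLE_RATE / FFT_SIZE; // 62.5 Hz per spectral bin

// Log energy from the RMS of the time-domain frame.
function logEnergy(frame) {
  let sum = 0;
  for (const s of frame) sum += s * s;
  return Math.log(Math.sqrt(sum / frame.length) + 1e-10);
}

// Zero-crossing rate: fraction of adjacent sample pairs with a sign change.
// High for fricatives and noise, low for voiced speech.
function zeroCrossingRate(frame) {
  let crossings = 0;
  for (let n = 1; n < frame.length; n++) {
    if ((frame[n] >= 0) !== (frame[n - 1] >= 0)) crossings++;
  }
  return crossings / (frame.length - 1);
}

// Fraction of spectral energy inside [loHz, hiHz], e.g. the
// 300-3400 Hz speech band or the 300-1000 Hz F1 band.
function bandRatio(mag, loHz, hiHz) {
  let band = 0, total = 0;
  for (let k = 0; k < mag.length; k++) {
    const e = mag[k] * mag[k];
    total += e;
    const f = k * BIN_HZ;
    if (f >= loHz && f <= hiHz) band += e;
  }
  return total > 0 ? band / total : 0;
}
```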
Production Behavior
Raw probability alone is not enough. SonusVAD layers in stability logic:
- Exponential moving average smoothing
- 2.5-second ambient calibration at startup
- Adaptive thresholding above measured baseline
- 4-state detection machine (idle, listening, speech, silence_pending)
- Configurable silence hysteresis (default 150 ms)
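The smoothing and thresholding steps compose like this. The sketch below is our reading of the mechanism, not SonusVAD's actual implementation; the smoothing factor and margin values are illustrative.

```javascript
// Exponential smoothing of the raw probability plus a threshold set
// relative to a calibrated ambient baseline.
class StabilityFilter {
  constructor(alpha = 0.2, margin = 0.15) {
    this.alpha = alpha;    // EMA weight given to the newest raw probability
    this.margin = margin;  // how far above the baseline counts as speech
    this.smoothed = 0;
    this.baseline = 0;
  }

  // Run during the startup calibration window (the text describes ~2.5 s
  // of ambient audio) to measure the noise-floor probability.
  calibrate(ambientProbs) {
    this.baseline =
      ambientProbs.reduce((a, b) => a + b, 0) / ambientProbs.length;
  }

  // Smooth one raw frame probability, then compare against the
  // adaptive threshold (baseline + margin).
  update(rawProb) {
    this.smoothed = this.alpha * rawProb + (1 - this.alpha) * this.smoothed;
    return this.smoothed > this.baseline + this.margin;
  }
}
```

With these numbers, a single-frame spike to probability 1.0 only lifts the smoothed value to 0.2 and is rejected, while sustained speech crosses the threshold on the next frame, which is exactly the spike-vs-speech distinction the list below relies on.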
This prevents:
- Single-frame spikes
- Word-gap clipping
- HVAC noise triggers
- Flickering state transitions
The result is low-latency, stable speech detection suitable for real-time UX and API gating.
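The state machine is what turns those stabilized frame decisions into clean speech segments. A plausible sketch of the four states and the silence hysteresis, using the 16 ms frames and 150 ms default from above (the exact transition rules are in SonusVAD's source; this is our simplified reading):

```javascript
const SILENCE_HOLD_MS = 150; // default hysteresis from the text
const FRAME_MS = 16;         // one 256-sample frame at 16 kHz

// Advance the detector one frame. `isSpeech` is the thresholded
// per-frame decision; returns the next state.
function step(state, isSpeech) {
  switch (state.name) {
    case "idle":
      // Simplified: arm the detector immediately (e.g. once calibration ends).
      return { name: "listening", silenceMs: 0 };
    case "listening":
      return isSpeech ? { name: "speech", silenceMs: 0 } : state;
    case "speech":
      // Don't end the segment on the first quiet frame; start the hold timer.
      return isSpeech ? state : { name: "silence_pending", silenceMs: 0 };
    case "silence_pending": {
      if (isSpeech) return { name: "speech", silenceMs: 0 }; // word gap: resume
      const silenceMs = state.silenceMs + FRAME_MS;
      // Only after the full hysteresis window does the segment end.
      return silenceMs >= SILENCE_HOLD_MS
        ? { name: "listening", silenceMs: 0 }
        : { name: "silence_pending", silenceMs };
    }
  }
}
```

The `silence_pending` state is what prevents word-gap clipping: brief pauses between words fall inside the 150 ms hold and snap back to `speech` instead of ending the segment.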
Why This Matters
Voice detection is infrastructure. It affects:
- Latency
- Privacy
- Cost (when gating API calls)
- UX responsiveness
- System reliability
By owning this layer, we remove external inference dependencies and maintain full control over detection logic.
For many real-time systems, choosing transparency over model complexity is an intentional tradeoff.
SonusVAD began as the signal-processing backbone of Skipflo. We're open-sourcing it because the signal layer shouldn't be a black box.
Own the signal.