Owning the Signal: Why We Built Our Own Voice Activity Detector
Real-time voice systems depend on one critical decision: Is this speech? We built SonusVAD to own that layer.
High Level (For Execs)
If you don't need to know what a Hann window is, this section is for you.
Real-time voice systems depend on one critical decision: "Is this speech?" Most companies outsource that decision to opaque libraries or heavy third-party models. That works until latency, privacy, cost, or dependency risk becomes a product constraint.
We chose to own that layer.
SonusVAD is the real-time, browser-native voice detection core behind Skipflo's conversational infrastructure. It runs fully client-side. No external AI services. No model downloads. No black boxes.
We're open-sourcing it because very few companies share foundational signal infrastructure anymore. The core layers are usually hidden, wrapped, or monetized. We believe infrastructure credibility comes from showing your work.
If you're evaluating Skipflo: this is how we build.
The Details (For Engineers)
Real-time voice systems live or die at the signal layer. Before transcription. Before LLMs. Before orchestration. There's a single gating question:
Is this speech or not?
Most implementations fall into two categories:
- A black-box C/C++ implementation (e.g., WebRTC VAD)
- A deep neural model bundled via TensorFlow, ONNX, or TFLite
Both work. Both are widely used. But both introduce tradeoffs: opacity, dependency chains, model downloads, or limited tunability.
When building Skipflo's real-time conversational infrastructure, we wanted deterministic control over detection behavior. So we built SonusVAD.
Architecture
SonusVAD is a browser-native VAD that runs entirely client-side using the Web Audio API.
Mic -> Hann window -> Custom radix-2 FFT -> 10 engineered spectral features
-> 2-layer MLP (10 -> 8 ReLU -> 1 sigmoid) -> Exponential smoothing
-> Adaptive thresholding -> 4-state speech state machine
Frame size defaults to 256 samples at 16 kHz (~16 ms resolution). No external models. No WebAssembly. No ML frameworks. No runtime downloads.
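The front of that pipeline can be sketched in a few lines. This is a minimal illustration of the framing and Hann-windowing stage, assuming the defaults above (256-sample frames at 16 kHz); the constant and function names are ours, not SonusVAD's API.

```javascript
// Illustrative constants mirroring the stated defaults.
const FRAME_SIZE = 256;
const SAMPLE_RATE = 16000; // 256 / 16000 = 16 ms per frame

// Precompute a Hann window once; it tapers frame edges toward zero,
// reducing spectral leakage before the FFT.
const hannWindow = Float32Array.from(
  { length: FRAME_SIZE },
  (_, n) => 0.5 * (1 - Math.cos((2 * Math.PI * n) / (FRAME_SIZE - 1)))
);

// Apply the window to one frame of PCM samples.
function windowFrame(frame) {
  const out = new Float32Array(FRAME_SIZE);
  for (let n = 0; n < FRAME_SIZE; n++) out[n] = frame[n] * hannWindow[n];
  return out;
}
```

The windowed frame then feeds the radix-2 FFT, whose magnitude spectrum drives the feature extraction below.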
The neural layer is compact and fully inspectable:
- Input: 10 normalized features
- Hidden: 8 ReLU units
- Output: 1 sigmoid probability
Weights are readable in source. Inference is deterministic.
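A network this small needs no framework; the full forward pass is two loops. Here is a sketch of that inference step with placeholder weights (the real values live in SonusVAD's source):

```javascript
function relu(x) { return Math.max(0, x); }
function sigmoid(x) { return 1 / (1 + Math.exp(-x)); }

// Forward pass for a 10 -> 8 ReLU -> 1 sigmoid MLP.
// W1 is 8x10, b1 has 8 entries, W2 has 8 entries, b2 is a scalar.
function mlpForward(features, W1, b1, W2, b2) {
  const hidden = new Float32Array(8);
  for (let i = 0; i < 8; i++) {
    let sum = b1[i];
    for (let j = 0; j < 10; j++) sum += W1[i][j] * features[j];
    hidden[i] = relu(sum);
  }
  let out = b2;
  for (let i = 0; i < 8; i++) out += W2[i] * hidden[i];
  return sigmoid(out); // speech probability in (0, 1)
}
```

No matrix library, no graph runtime: the same inputs always produce the same output, which is what makes the detector auditable end to end.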
Feature Engineering
Each frame extracts interpretable, speech-relevant features:
- Log energy (RMS)
- Zero-crossing rate
- Spectral centroid
- Spectral flatness
- Speech-band ratio (300-3400 Hz)
- Low-band ratio (85-300 Hz)
- High-band ratio (3400-8000 Hz)
- F1 band ratio (300-1000 Hz)
- F2 band ratio (1000-2500 Hz)
- Previous frame probability (temporal persistence)
This hybrid approach combines classical DSP with a shallow classifier layer. The model is small by design. The intelligence comes from structured features, not depth.
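To make the list concrete, here is a sketch of three of those features, assuming a time-domain frame and its magnitude spectrum are already available. Bin resolution follows from the stated defaults (16 kHz / 256-point FFT); names are illustrative.

```javascript
const SAMPLE_RATE = 16000;
const FFT_SIZE = 256;
const BIN_HZ = SAMPLE_RATE / FFT_SIZE; // 62.5 Hz per spectral bin

// Log energy from the RMS of the time-domain frame.
function logEnergy(frame) {
  let sum = 0;
  for (const s of frame) sum += s * s;
  return Math.log(Math.sqrt(sum / frame.length) + 1e-10);
}

// Zero-crossing rate: fraction of adjacent sample pairs with a sign change.
// High for fricatives and noise, low for voiced speech.
function zeroCrossingRate(frame) {
  let crossings = 0;
  for (let n = 1; n < frame.length; n++) {
    if ((frame[n] >= 0) !== (frame[n - 1] >= 0)) crossings++;
  }
  return crossings / (frame.length - 1);
}

// Fraction of spectral energy inside [loHz, hiHz], e.g. the
// 300-3400 Hz speech band or the 300-1000 Hz F1 band.
function bandRatio(mag, loHz, hiHz) {
  let band = 0, total = 0;
  for (let k = 0; k < mag.length; k++) {
    const e = mag[k] * mag[k];
    total += e;
    const f = k * BIN_HZ;
    if (f >= loHz && f <= hiHz) band += e;
  }
  return total > 0 ? band / total : 0;
}
```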
Production Behavior
Raw probability alone is not enough. SonusVAD layers in stability logic:
- Exponential moving average smoothing
- 2.5-second ambient calibration at startup
- Adaptive thresholding above measured baseline
- 4-state detection machine (idle, listening, speech, silence_pending)
- Configurable silence hysteresis (default 150 ms)
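The smoothing and thresholding steps compose like this. The sketch below is our reading of the mechanism, not SonusVAD's actual implementation; the smoothing factor and margin values are illustrative.

```javascript
// Exponential smoothing of the raw probability plus a threshold set
// relative to a calibrated ambient baseline.
class StabilityFilter {
  constructor(alpha = 0.2, margin = 0.15) {
    this.alpha = alpha;    // EMA weight given to the newest raw probability
    this.margin = margin;  // how far above the baseline counts as speech
    this.smoothed = 0;
    this.baseline = 0;
  }

  // Run during the startup calibration window (the text describes ~2.5 s
  // of ambient audio) to measure the noise-floor probability.
  calibrate(ambientProbs) {
    this.baseline =
      ambientProbs.reduce((a, b) => a + b, 0) / ambientProbs.length;
  }

  // Smooth one raw frame probability, then compare against the
  // adaptive threshold (baseline + margin).
  update(rawProb) {
    this.smoothed = this.alpha * rawProb + (1 - this.alpha) * this.smoothed;
    return this.smoothed > this.baseline + this.margin;
  }
}
```

With these numbers, a single-frame spike to probability 1.0 only lifts the smoothed value to 0.2 and is rejected, while sustained speech crosses the threshold on the next frame, which is exactly the spike-vs-speech distinction the list below relies on.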
This prevents:
- Single-frame spikes
- Word-gap clipping
- HVAC noise triggers
- Flickering state transitions
The result is low-latency, stable speech detection suitable for real-time UX and API gating.
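The state machine is what turns those stabilized frame decisions into clean speech segments. A plausible sketch of the four states and the silence hysteresis, using the 16 ms frames and 150 ms default from above (the exact transition rules are in SonusVAD's source; this is our simplified reading):

```javascript
const SILENCE_HOLD_MS = 150; // default hysteresis from the text
const FRAME_MS = 16;         // one 256-sample frame at 16 kHz

// Advance the detector one frame. `isSpeech` is the thresholded
// per-frame decision; returns the next state.
function step(state, isSpeech) {
  switch (state.name) {
    case "idle":
      // Simplified: arm the detector immediately (e.g. once calibration ends).
      return { name: "listening", silenceMs: 0 };
    case "listening":
      return isSpeech ? { name: "speech", silenceMs: 0 } : state;
    case "speech":
      // Don't end the segment on the first quiet frame; start the hold timer.
      return isSpeech ? state : { name: "silence_pending", silenceMs: 0 };
    case "silence_pending": {
      if (isSpeech) return { name: "speech", silenceMs: 0 }; // word gap: resume
      const silenceMs = state.silenceMs + FRAME_MS;
      // Only after the full hysteresis window does the segment end.
      return silenceMs >= SILENCE_HOLD_MS
        ? { name: "listening", silenceMs: 0 }
        : { name: "silence_pending", silenceMs };
    }
  }
}
```

The `silence_pending` state is what prevents word-gap clipping: brief pauses between words fall inside the 150 ms hold and snap back to `speech` instead of ending the segment.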
Why This Matters
Voice detection is infrastructure. It affects:
- Latency
- Privacy
- Cost (when gating API calls)
- UX responsiveness
- System reliability
By owning this layer, we remove external inference dependencies and maintain full control over detection logic.
For many real-time systems, choosing transparency over model complexity is an intentional tradeoff.
SonusVAD began as the signal-processing backbone of Skipflo. We're open-sourcing it because the signal layer shouldn't be a black box.
Own the signal.