---
title: VAD Demo - Real-time Speech Detection
emoji: 🎤
colorFrom: blue
colorTo: green
sdk: gradio
sdk_version: 4.42.0
app_file: app.py
pinned: false
license: mit
---

# 🎤 VAD Demo: Real-time Speech Detection Framework

[🤗 Hugging Face Spaces](https://huggingface.co/spaces/gbibbo/vad_demo) · WASPAA 2025

Real-time multi-model voice activity detection with interactive visualization, optimized for CPU and free Hugging Face Spaces.

This demo showcases a speech removal framework for privacy-preserving audio recordings, featuring three state-of-the-art AI models with real-time processing and interactive visualization.

## 🎯 Live Demo Features

### 🤖 Multi-Model Support

Compare three different AI models side by side:

| Model | Parameters | Speed | Accuracy | Best For |
|---|---|---|---|---|
| Silero-VAD | 1.8M | ⚡⚡⚡ | ⭐⭐⭐⭐ | General purpose |
| WebRTC-VAD | <0.1M | ⚡⚡⚡⚡ | ⭐⭐⭐ | Ultra-fast processing |
| E-PANNs | 22M | ⚡⚡ | ⭐⭐⭐⭐ | Efficient AI (73% parameter reduction) |
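As a rough sketch (not the exact code in `app.py`), the two off-the-shelf detectors can be loaded as shown below, assuming the `torch` and `webrtcvad` packages from `requirements.txt`; the E-PANNs loader is repo-specific and omitted here.

```python
# Sketch: loading the two off-the-shelf detectors from the table above.
import torch
import webrtcvad

# Silero-VAD: official model distributed through torch.hub
silero_model, silero_utils = torch.hub.load("snakers4/silero-vad", model="silero_vad")

# WebRTC-VAD: aggressiveness mode 0 (least) to 3 (most aggressive)
webrtc_vad = webrtcvad.Vad(2)
```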

### 📊 Real-time Visualization

- **Dual Analysis**: Compare two models simultaneously
- **Waveform Display**: Live audio visualization
- **Probability Charts**: Real-time speech detection confidence (see the plotting sketch below)
- **Performance Metrics**: Processing-time comparison across models
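For illustration, here is a minimal Plotly probability chart; the demo's dashboard is richer, and `times`/`probs` below are placeholder data rather than real model output.

```python
# Sketch: a minimal Plotly chart of per-frame speech probabilities.
import numpy as np
import plotly.graph_objects as go

times = np.linspace(0, 4, 100)   # seconds within one 4 s window
probs = np.random.rand(100)      # placeholder speech probabilities

fig = go.Figure()
fig.add_trace(go.Scatter(x=times, y=probs, mode="lines", name="Model A"))
fig.add_hline(y=0.5, line_dash="dash", annotation_text="threshold")
fig.update_layout(xaxis_title="Time (s)", yaxis_title="Speech probability")
fig.show()
```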

### 🔒 Privacy-Preserving Applications

- **Smart Home Audio**: Remove personal conversations while preserving environmental sounds
- **GDPR Compliance**: Privacy-aware audio dataset processing
- **Real-time Processing**: Continuous 4-second chunk analysis at 16 kHz (see the chunking sketch below)
- **CPU Optimized**: Runs efficiently on standard hardware
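A minimal chunking sketch matching the 4-second / 16 kHz scheme above; the function name is illustrative, not taken from `app.py`.

```python
# Sketch: splitting a 16 kHz signal into 4-second analysis chunks.
import numpy as np

SAMPLE_RATE = 16_000
CHUNK_SECONDS = 4
CHUNK_SAMPLES = SAMPLE_RATE * CHUNK_SECONDS

def iter_chunks(audio: np.ndarray):
    """Yield consecutive 4 s windows; the final partial window is zero-padded."""
    for start in range(0, len(audio), CHUNK_SAMPLES):
        chunk = audio[start:start + CHUNK_SAMPLES]
        if len(chunk) < CHUNK_SAMPLES:
            chunk = np.pad(chunk, (0, CHUNK_SAMPLES - len(chunk)))
        yield chunk

# Example: 10 s of silence -> three chunks (4 s, 4 s, and 2 s padded to 4 s)
for i, chunk in enumerate(iter_chunks(np.zeros(10 * SAMPLE_RATE))):
    print(i, chunk.shape)
```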

## 🚀 Quick Start

### Option 1: Use the Live Demo (Recommended)

Click the Hugging Face Spaces badge above to try the demo instantly!

### Option 2: Run Locally

```bash
git clone https://huggingface.co/spaces/gbibbo/vad_demo
cd vad_demo
pip install -r requirements.txt
python app.py
```

## 🎛️ How to Use

1. **🎤 Record Audio**: Click the microphone and record 2-4 seconds of speech
2. **🔧 Select Models**: Choose different models for Model A and Model B comparison
3. **⚙️ Adjust Threshold**: Lower values give more sensitive detection (range 0.0-1.0; see the sketch below)
4. **🎯 Process**: Click "Process Audio" to analyze
5. **📊 View Results**: Inspect the probability charts and detailed analysis
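Conceptually, the threshold step is just a comparison against each chunk's speech probability, roughly as follows (variable names are illustrative).

```python
# Sketch: turning a speech probability into a binary decision.
def is_speech(probability: float, threshold: float = 0.5) -> bool:
    """Lower thresholds flag more chunks as speech (more sensitive)."""
    return probability >= threshold

print(is_speech(0.62))                  # True with the default threshold
print(is_speech(0.62, threshold=0.8))   # False with a stricter threshold
```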

## 🏗️ Technical Architecture

### CPU Optimization Strategies

- **Lazy Loading**: Models load only when first needed (see the sketch below)
- **Efficient Processing**: Optimized audio chunk processing
- **Memory Management**: Smart buffer management for continuous operation
- **Fallback Systems**: Graceful degradation when a model is unavailable
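One possible lazy-loading pattern is sketched below; this is illustrative and not necessarily the mechanism used in `app.py`.

```python
# Sketch: construct a model only on first use and cache it afterwards.
from functools import lru_cache

import torch

@lru_cache(maxsize=None)
def get_silero_model():
    """Load Silero-VAD once via torch.hub and reuse it on later calls."""
    model, utils = torch.hub.load("snakers4/silero-vad", model="silero_vad")
    return model, utils

# The download/initialization cost is paid only on the first call.
model, utils = get_silero_model()
```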

### Audio Processing Pipeline

```text
Audio Input (Microphone)
    ↓
Preprocessing (Normalization, Resampling)
    ↓
Feature Extraction (Spectrograms, MFCCs)
    ↓
Multi-Model Inference (Parallel Processing)
    ↓
Visualization (Interactive Plotly Dashboard)
```
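A skeleton of these stages is sketched below; every function here is a placeholder standing in for the real implementation.

```python
# Sketch: the pipeline stages above as a minimal skeleton.
import numpy as np

def preprocess(audio: np.ndarray) -> np.ndarray:
    """Peak-normalize to [-1, 1]; resampling to 16 kHz is omitted here."""
    peak = float(np.max(np.abs(audio)))
    return audio / peak if peak > 0 else audio

def run_models(audio: np.ndarray, models: dict) -> dict:
    """Run each detector on the same chunk and collect speech probabilities."""
    return {name: detect(audio) for name, detect in models.items()}

# Usage with dummy detectors standing in for Model A and Model B:
models = {"Model A": lambda a: 0.9, "Model B": lambda a: 0.8}
chunk = preprocess(np.random.randn(4 * 16_000))
print(run_models(chunk, models))
```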

### Model Implementation Details

#### Silero-VAD (Production Ready)

- **Source**: Official Silero model via torch.hub
- **Optimization**: Direct PyTorch inference
- **Memory**: ~50 MB RAM usage
- **Latency**: ~30 ms processing time
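A minimal per-chunk inference sketch with the torch.hub model; recent Silero releases score fixed 512-sample windows at 16 kHz, which is assumed here.

```python
# Sketch: scoring one short window with Silero-VAD.
import torch

model, utils = torch.hub.load("snakers4/silero-vad", model="silero_vad")

chunk = torch.zeros(512)                  # 32 ms of silence at 16 kHz
probability = model(chunk, 16000).item()  # speech probability in [0, 1]
print(f"speech probability: {probability:.3f}")
```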

#### WebRTC-VAD (Ultra-Fast)

- **Source**: Google WebRTC project
- **Fallback**: Energy-based VAD when WebRTC is unavailable
- **Latency**: <5 ms processing time
- **Memory**: ~10 MB RAM usage
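A per-frame sketch using the `webrtcvad` package with a simple energy-based fallback of the kind described above; the energy threshold is illustrative, not a value from the demo.

```python
# Sketch: WebRTC-VAD on one 30 ms frame of 16-bit mono PCM, with fallback.
import numpy as np

def frame_is_speech(frame: np.ndarray, sample_rate: int = 16_000) -> bool:
    """`frame` must cover 10, 20, or 30 ms of int16 mono audio."""
    try:
        import webrtcvad
        vad = webrtcvad.Vad(2)                # aggressiveness 0-3
        return vad.is_speech(frame.tobytes(), sample_rate)
    except ImportError:
        # Fallback: compare RMS energy against a fixed (illustrative) threshold.
        rms = np.sqrt(np.mean(frame.astype(np.float64) ** 2))
        return rms > 500

frame = np.zeros(480, dtype=np.int16)         # 30 ms of silence at 16 kHz
print(frame_is_speech(frame))
```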

#### E-PANNs (Efficient Deep Learning)

- **Features**: Mel-spectrogram + MFCC analysis
- **Optimization**: Simplified neural architecture
- **Speed**: 2-3x faster than full PANNs
- **Memory**: ~150 MB RAM usage
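The feature front end can be sketched with `librosa`; the frame parameters below are typical defaults, not necessarily those used by the E-PANNs model in this demo.

```python
# Sketch: mel-spectrogram and MFCC features for one 4 s chunk.
import librosa
import numpy as np

audio = np.random.randn(4 * 16_000).astype(np.float32)  # 4 s at 16 kHz

mel = librosa.feature.melspectrogram(y=audio, sr=16_000, n_mels=64)
log_mel = librosa.power_to_db(mel)
mfcc = librosa.feature.mfcc(y=audio, sr=16_000, n_mfcc=20)

print(log_mel.shape, mfcc.shape)  # (n_mels, frames), (n_mfcc, frames)
```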

## 📈 Performance Benchmarks

Evaluated on the CHiME-Home dataset (adapted for CPU):

| Model | F1-Score | RTF (CPU) | Memory | Use Case |
|---|---|---|---|---|
| Silero-VAD | 0.806 | 0.065 | 50 MB | Lightweight |
| WebRTC-VAD | 0.708 | 0.003 | 10 MB | Ultra-fast |
| E-PANNs | 0.847 | 0.180 | 150 MB | Balanced |

*RTF: Real-Time Factor (lower is better; <1.0 means real-time capable).*
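RTF can be measured as processing time divided by audio duration, roughly as below; the detector here is a placeholder.

```python
# Sketch: measuring a real-time factor (RTF) for one detector.
import time

import numpy as np

def measure_rtf(detect, audio: np.ndarray, sample_rate: int = 16_000) -> float:
    """RTF = processing time / audio duration; < 1.0 means real-time capable."""
    start = time.perf_counter()
    detect(audio)
    elapsed = time.perf_counter() - start
    return elapsed / (len(audio) / sample_rate)

dummy_detect = lambda a: float(np.mean(np.abs(a)))   # placeholder detector
print(measure_rtf(dummy_detect, np.zeros(4 * 16_000)))
```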

## 🔬 Research Applications

### Privacy-Preserving Audio Processing

- **Domestic Recordings**: Remove personal conversations
- **Smart Speakers**: Privacy-aware voice assistants
- **Audio Datasets**: GDPR-compliant data collection
- **Surveillance Systems**: Selective audio monitoring

### Speech Technology Research

- **Model Comparison**: Benchmark different VAD approaches
- **Real-time Systems**: Low-latency speech detection
- **Edge Computing**: CPU-efficient processing
- **Hybrid Systems**: Combine multiple detection methods

## 📊 Technical Specifications

### System Requirements

- **CPU**: 2+ cores (4+ recommended)
- **RAM**: 1 GB minimum (2 GB recommended)
- **Python**: 3.8+ (3.10+ recommended)
- **Browser**: Chrome/Firefox with microphone support

### Hugging Face Spaces Optimization

- **Memory Limit**: Designed for the 16 GB Spaces limit
- **CPU Cores**: Optimized for an 8-core allocation
- **Storage**: <500 MB model storage requirement
- **Networking**: Minimal external dependencies

### Audio Specifications

- **Input Format**: 16-bit PCM, mono/stereo
- **Sample Rates**: 8 kHz, 16 kHz, 32 kHz, 48 kHz (auto-converted; see the sketch below)
- **Chunk Size**: 4-second processing windows
- **Latency**: <200 ms processing delay
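The auto-conversion step can be sketched with `librosa`; this is illustrative, not the exact code in `app.py`.

```python
# Sketch: downmix to mono and resample any supported input rate to 16 kHz.
import librosa
import numpy as np

def to_mono_16k(audio: np.ndarray, sr: int) -> np.ndarray:
    """Return mono float32 audio at the demo's 16 kHz processing rate."""
    if audio.ndim == 2:                        # (channels, samples) -> mono
        audio = np.mean(audio, axis=0)
    if sr != 16_000:
        audio = librosa.resample(audio, orig_sr=sr, target_sr=16_000)
    return audio.astype(np.float32)

stereo_48k = np.random.randn(2, 48_000).astype(np.float32)  # 1 s stereo at 48 kHz
print(to_mono_16k(stereo_48k, 48_000).shape)                 # (16000,)
```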

## 📚 Research Citation

If you use this demo in your research, please cite:

```bibtex
@inproceedings{bibbo2025speech,
    title={Speech Removal Framework for Privacy-Preserving Audio Recordings},
    author={[Authors omitted for review]},
    booktitle={2025 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA)},
    year={2025},
    organization={IEEE}
}
```

## 🤝 Contributing

We welcome contributions! Areas for improvement:

- **New Models**: Add state-of-the-art VAD models
- **Optimization**: Further CPU/memory optimizations
- **Features**: Additional visualization and analysis tools
- **Documentation**: Improved tutorials and examples

## 📞 Support

## 📄 License

This project is licensed under the MIT License.

## 🙏 Acknowledgments

- **Silero-VAD**: Silero Team
- **WebRTC-VAD**: Google WebRTC Project
- **E-PANNs**: Efficient PANNs implementation
- **Hugging Face**: Free Spaces hosting
- **Funding**: AI4S, University of Surrey, EPSRC, CVSSP

**🎯 Ready for WASPAA 2025 Demo | ⚡ CPU Optimized | 🆓 Free to Use | 🤗 Hugging Face Spaces**