---
title: VAD Demo - Real-time Speech Detection
emoji: 🎤
colorFrom: blue
colorTo: green
sdk: gradio
sdk_version: 4.42.0
app_file: app.py
pinned: false
license: mit
---
# 🎤 VAD Demo: Real-time Speech Detection Framework

*Real-time multi-model voice activity detection with interactive visualization, optimized for CPU and free Hugging Face Spaces.*

This demo showcases a speech removal framework for privacy-preserving audio recordings, featuring three state-of-the-art AI models with real-time processing and interactive visualization.
## 🎯 Live Demo Features

### 🤖 Multi-Model Support

Compare three AI models side-by-side:

| Model | Parameters | Speed | Accuracy | Best For |
|---|---|---|---|---|
| Silero-VAD | 1.8M | ⚡⚡⚡ | ⭐⭐⭐⭐ | General purpose |
| WebRTC-VAD | <0.1M | ⚡⚡⚡⚡ | ⭐⭐⭐ | Ultra-fast processing |
| E-PANNs | 22M | ⚡⚡ | ⭐⭐⭐⭐ | Efficient AI (73% parameter reduction) |
### 📊 Real-time Visualization

- **Dual Analysis**: Compare two models simultaneously
- **Waveform Display**: Live audio visualization
- **Probability Charts**: Real-time speech detection confidence
- **Performance Metrics**: Processing-time comparison across models
### 🔒 Privacy-Preserving Applications

- **Smart Home Audio**: Remove personal conversations while preserving environmental sounds
- **GDPR Compliance**: Privacy-aware audio dataset processing
- **Real-time Processing**: Continuous 4-second chunk analysis at 16kHz
- **CPU Optimized**: Runs efficiently on standard hardware
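The continuous 4-second chunking at 16 kHz mentioned above can be sketched as follows. This is a minimal illustration; the function and variable names are ours, not taken from `app.py`:

```python
import numpy as np

SAMPLE_RATE = 16_000           # 16 kHz working rate, as in the demo
CHUNK_SECONDS = 4              # 4-second processing windows
CHUNK_SAMPLES = SAMPLE_RATE * CHUNK_SECONDS

def split_into_chunks(audio: np.ndarray) -> list[np.ndarray]:
    """Split a mono signal into consecutive 4-second chunks,
    zero-padding the final partial chunk so every window is full-length."""
    n_chunks = int(np.ceil(len(audio) / CHUNK_SAMPLES))
    padded = np.zeros(n_chunks * CHUNK_SAMPLES, dtype=audio.dtype)
    padded[: len(audio)] = audio
    return [padded[i * CHUNK_SAMPLES : (i + 1) * CHUNK_SAMPLES]
            for i in range(n_chunks)]

# Ten seconds of audio become three 4-second windows (last one zero-padded).
chunks = split_into_chunks(np.zeros(10 * SAMPLE_RATE, dtype=np.float32))
print(len(chunks), len(chunks[0]))  # 3 64000
```

A real streaming loop would feed each chunk to the selected models as it fills, rather than collecting the whole recording first.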
## 🚀 Quick Start

### Option 1: Use the Live Demo (Recommended)

Click the Hugging Face Spaces badge above to try the demo instantly!

### Option 2: Run Locally

```bash
git clone https://huggingface.co/spaces/gbibbo/vad_demo
cd vad_demo
pip install -r requirements.txt
python app.py
```
## 🎛️ How to Use

1. **🎤 Record Audio**: Click the microphone and record 2-4 seconds of speech
2. **🔧 Select Models**: Choose different models for the Model A vs. Model B comparison
3. **⚙️ Adjust Threshold**: Lower values give more sensitive detection (range 0.0-1.0)
4. **🎯 Process**: Click "Process Audio" to analyze
5. **📊 View Results**: Inspect the probability charts and detailed analysis
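The threshold step above boils down to comparing each model's per-frame speech probability against the slider value. A minimal sketch (the function name is ours, for illustration only):

```python
def is_speech(probability: float, threshold: float = 0.5) -> bool:
    """Lower thresholds flag speech more readily (more sensitive);
    higher thresholds require stronger model confidence."""
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("threshold must be in [0.0, 1.0]")
    return probability >= threshold

# The same 0.4-confidence frame flips decision as the threshold moves.
print(is_speech(0.4, threshold=0.3))  # True: sensitive setting
print(is_speech(0.4, threshold=0.6))  # False: conservative setting
```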
## 🏗️ Technical Architecture

### CPU Optimization Strategies

- **Lazy Loading**: Models load only when needed
- **Efficient Processing**: Optimized audio chunk processing
- **Memory Management**: Smart buffer management for continuous operation
- **Fallback Systems**: Graceful degradation when models are unavailable
### Audio Processing Pipeline

```
Audio Input (Microphone)
        ↓
Preprocessing (Normalization, Resampling)
        ↓
Feature Extraction (Spectrograms, MFCCs)
        ↓
Multi-Model Inference (Parallel Processing)
        ↓
Visualization (Interactive Plotly Dashboard)
```
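The normalization step of the preprocessing stage can be sketched as simple peak normalization. This is one common choice, not necessarily the exact scheme used in `app.py`:

```python
import numpy as np

def normalize_peak(audio: np.ndarray, target_peak: float = 0.95) -> np.ndarray:
    """Scale the signal so its largest absolute sample reaches target_peak.
    Silent input is returned unchanged to avoid division by zero."""
    peak = float(np.max(np.abs(audio)))
    if peak == 0.0:
        return audio
    return (audio / peak * target_peak).astype(np.float32)

x = np.array([0.1, -0.5, 0.25], dtype=np.float32)
y = normalize_peak(x)
print(np.max(np.abs(y)))  # ~0.95
```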
### Model Implementation Details

#### Silero-VAD (Production Ready)

- **Source**: Official Silero model via `torch.hub`
- **Optimization**: Direct PyTorch inference
- **Memory**: ~50MB RAM usage
- **Latency**: ~30ms processing time
#### WebRTC-VAD (Ultra-Fast)

- **Source**: Google WebRTC project
- **Fallback**: Energy-based VAD when WebRTC is unavailable
- **Latency**: <5ms processing time
- **Memory**: ~10MB RAM usage
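The energy-based fallback mentioned above can be approximated like this. It is a sketch under our own assumptions about frame size and threshold; the actual fallback in `app.py` may differ:

```python
import numpy as np

def energy_vad(frame: np.ndarray, threshold: float = 1e-3) -> bool:
    """Classify a frame as speech when its mean energy (mean squared
    amplitude) exceeds a fixed threshold. Crude but nearly free:
    no model weights, just one pass over the samples."""
    energy = float(np.mean(frame.astype(np.float64) ** 2))
    return energy > threshold

rng = np.random.default_rng(0)
loud = 0.3 * rng.standard_normal(16_000)     # speech-like energy
quiet = 0.001 * rng.standard_normal(16_000)  # near-silence
print(energy_vad(loud), energy_vad(quiet))   # True False
```

This is why the fallback is so fast: it replaces a neural network forward pass with a single mean over the frame.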
#### E-PANNs (Efficient Deep Learning)

- **Features**: Mel-spectrogram + MFCC analysis
- **Optimization**: Simplified neural architecture
- **Speed**: 2-3x faster than full PANNs
- **Memory**: ~150MB RAM usage
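The mel-spectrogram/MFCC front end is typically built on a short-time Fourier transform. A numpy-only magnitude-spectrogram sketch of that first stage (the frame and hop sizes are our assumptions, and the real pipeline adds mel filtering and a DCT on top):

```python
import numpy as np

def magnitude_spectrogram(audio: np.ndarray,
                          n_fft: int = 512,
                          hop: int = 256) -> np.ndarray:
    """Frame the signal, apply a Hann window, and take |rFFT| per frame.
    Returns an array of shape (n_frames, n_fft // 2 + 1)."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(audio) - n_fft) // hop
    frames = np.stack([audio[i * hop : i * hop + n_fft] * window
                       for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1))

# One second at 16 kHz -> 61 frames of 257 frequency bins.
spec = magnitude_spectrogram(np.zeros(16_000, dtype=np.float32))
print(spec.shape)  # (61, 257)
```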
## 📈 Performance Benchmarks

Evaluated on the CHiME-Home dataset (adapted for CPU):

| Model | F1-Score | RTF (CPU) | Memory | Use Case |
|---|---|---|---|---|
| Silero-VAD | 0.806 | 0.065 | 50MB | Lightweight |
| WebRTC-VAD | 0.708 | 0.003 | 10MB | Ultra-fast |
| E-PANNs | 0.847 | 0.180 | 150MB | Balanced |

*RTF: Real-Time Factor (lower is better; <1.0 = real-time capable)*
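The RTF figures above follow directly from this definition, which is standard terminology rather than anything specific to this demo:

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """RTF = time spent processing / duration of the audio processed.
    Values below 1.0 mean the system keeps up with the incoming stream."""
    return processing_seconds / audio_seconds

# A 4-second chunk processed in 0.26 s gives RTF 0.065,
# matching the Silero-VAD row in the table above.
rtf = real_time_factor(0.26, 4.0)
print(rtf, rtf < 1.0)
```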
## 🔬 Research Applications

### Privacy-Preserving Audio Processing

- **Domestic Recordings**: Remove personal conversations
- **Smart Speakers**: Privacy-aware voice assistants
- **Audio Datasets**: GDPR-compliant data collection
- **Surveillance Systems**: Selective audio monitoring

### Speech Technology Research

- **Model Comparison**: Benchmark different VAD approaches
- **Real-time Systems**: Low-latency speech detection
- **Edge Computing**: CPU-efficient processing
- **Hybrid Systems**: Combine multiple detection methods
## 📊 Technical Specifications

### System Requirements

- **CPU**: 2+ cores (4+ recommended)
- **RAM**: 1GB minimum (2GB recommended)
- **Python**: 3.8+ (3.10+ recommended)
- **Browser**: Chrome/Firefox with microphone support

### Hugging Face Spaces Optimization

- **Memory Limit**: Designed for the 16GB Spaces limit
- **CPU Cores**: Optimized for an 8-core allocation
- **Storage**: <500MB model storage requirement
- **Networking**: Minimal external dependencies

### Audio Specifications

- **Input Format**: 16-bit PCM, mono/stereo
- **Sample Rates**: 8kHz, 16kHz, 32kHz, 48kHz (auto-conversion)
- **Chunk Size**: 4-second processing windows
- **Latency**: <200ms processing delay
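The sample-rate auto-conversion listed above can be sketched with simple linear interpolation. This is a naive version for illustration; a production path would use a polyphase resampler with anti-aliasing (e.g. `scipy.signal.resample_poly`):

```python
import numpy as np

def resample_linear(audio: np.ndarray, sr_in: int,
                    sr_out: int = 16_000) -> np.ndarray:
    """Linearly interpolate a mono signal from sr_in to sr_out.
    No anti-aliasing filter, so this is only a rough sketch for
    bringing speech to the demo's 16 kHz working rate."""
    duration = len(audio) / sr_in
    n_out = int(round(duration * sr_out))
    t_in = np.arange(len(audio)) / sr_in
    t_out = np.arange(n_out) / sr_out
    return np.interp(t_out, t_in, audio).astype(np.float32)

# One second at 48 kHz becomes one second at 16 kHz.
y = resample_linear(np.zeros(48_000, dtype=np.float32), sr_in=48_000)
print(len(y))  # 16000
```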
## 📚 Research Citation

If you use this demo in your research, please cite:

```bibtex
@inproceedings{bibbo2025speech,
  title={Speech Removal Framework for Privacy-Preserving Audio Recordings},
  author={[Authors omitted for review]},
  booktitle={2025 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA)},
  year={2025},
  organization={IEEE}
}
```
## 🤝 Contributing

We welcome contributions! Areas for improvement:

- **New Models**: Add state-of-the-art VAD models
- **Optimization**: Further CPU/memory optimizations
- **Features**: Additional visualization and analysis tools
- **Documentation**: Improved tutorials and examples
## 📞 Support

- **Issues**: GitHub Issues
- **Discussions**: Hugging Face Discussions
- **WASPAA 2025**: Visit our paper presentation
## 📄 License

This project is licensed under the MIT License.
## 🙏 Acknowledgments

- **Silero-VAD**: Silero Team
- **WebRTC-VAD**: Google WebRTC Project
- **E-PANNs**: Efficient PANNs implementation
- **Hugging Face**: Free Spaces hosting
- **Funding**: AI4S, University of Surrey, EPSRC, CVSSP
**🎯 Ready for WASPAA 2025 Demo** | **⚡ CPU Optimized** | **🆓 Free to Use** | **🤗 Hugging Face Spaces**