# Release: v1.0 — Multi-Modal Biometric Authentication System
## Overview
A real-time access control system that authenticates users by fusing three independent verification channels — face recognition, speaker verification, and a spoken password — into a single deep learning decision. Built as a Deep Learning capstone project.
The system implements a multi-gate security pipeline with defense in depth: an attacker must simultaneously defeat all channels to gain access. Failure at any gate terminates the attempt early.
## Architecture
```
User → Gate 1: Voice Password (Google STT + fuzzy match)
         ↓ pass
       Gate 2: Biometric Capture
         ├── Face  → MTCNN → FaceNet (VGGFace2) → 512-dim embedding
         └── Voice → ECAPA-TDNN (VoxCeleb)      → 192-dim embedding
         ↓
       Liveness Check (blink detection via EAR + head pose via solvePnP)
         ↓ pass
       Gate 3: Late Fusion MLP (704-dim → 256 → 128 → 4 classes)
         ↓
       3-Tier Confidence Decision → ACCESS GRANTED / DENIED
```
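The Gate 3 fusion head can be sketched as a small PyTorch module. Layer sizes (704 → 256 → 128 → 4) come from the diagram and the leading BatchNorm reflects the variance-equalization point made under Key Technical Highlights; the dropout rates and class ordering here are illustrative assumptions, not the project's exact settings.

```python
import torch
import torch.nn as nn

class FusionMLP(nn.Module):
    """Late-fusion head: concatenated face (512-d) + voice (192-d)
    embeddings -> 4 classes (3 enrolled users + "unknown")."""
    def __init__(self, in_dim=704, n_classes=4):
        super().__init__()
        self.net = nn.Sequential(
            nn.BatchNorm1d(in_dim),     # equalizes face/voice variance mismatch
            nn.Linear(in_dim, 256),
            nn.ReLU(),
            nn.Dropout(0.3),            # illustrative rate
            nn.Linear(256, 128),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(128, n_classes),
        )

    def forward(self, face_emb, voice_emb):
        # Concatenate the two modality embeddings along the feature axis
        return self.net(torch.cat([face_emb, voice_emb], dim=1))
```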
Decision logic:
- ≥ 85% confidence → Immediate access
- 50–85% → Gray area: cosine similarity fallback against enrolled profiles (both face ≥ 0.4 and voice ≥ 0.4 must pass)
- < 50% or "unknown" → Access denied
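The three tiers above can be expressed as a single decision function. This is a minimal sketch: the function and parameter names (`decide`, `enrolled`, `sim_thresh`) are illustrative, and only the thresholds (85%, 50%, 0.4) are taken from the rules listed here.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def decide(probs, class_names, face_emb, voice_emb, enrolled,
           hi=0.85, lo=0.50, sim_thresh=0.4):
    """3-tier decision. `enrolled` maps name -> (face_profile, voice_profile)."""
    idx = int(np.argmax(probs))
    name, conf = class_names[idx], float(probs[idx])
    if name == "unknown" or conf < lo:
        return "DENIED", name                # tier 3: reject outright
    if conf >= hi:
        return "GRANTED", name               # tier 1: immediate access
    # Tier 2 (gray area): both modalities must match the enrolled profiles
    face_p, voice_p = enrolled[name]
    if (cosine(face_emb, face_p) >= sim_thresh
            and cosine(voice_emb, voice_p) >= sim_thresh):
        return "GRANTED", name
    return "DENIED", name
```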
## Results
| Metric | Value |
|---|---|
| Test Accuracy | 99.5% (376/378 samples) |
| Per-class accuracy (David) | 100% |
| Per-class accuracy (Itzhak) | 100% |
| Per-class accuracy (Yossi) | 100% |
| Unknown Detection | 98% recall, 100% precision |
## Key Technical Highlights
- Transfer learning — Frozen pretrained backbones (InceptionResnetV1 on VGGFace2 with 3.3M faces, ECAPA-TDNN on VoxCeleb with 7000+ speakers) used as feature extractors. No fine-tuning required.
- Leakage-free data splitting — Train/val/test splits happen per-person, per-modality, before creating face-voice pairs. No embedding appears in more than one set.
- Cross-modal spoofing detection — Mismatched face+voice pairs (e.g., David's face + Yossi's voice) are trained as the "unknown" class, teaching the model that face-voice correlation matters.
- BatchNorm in fusion — Essential for handling the ~800x variance mismatch between voice (192-d, std ~16) and face (512-d, std ~0.02) embeddings. Without it, voice dimensions dominate.
- Liveness detection — Blink detection (Eye Aspect Ratio via MediaPipe) and head pose variation (solvePnP) block photo prints and video replay attacks. OR logic: either check passing is sufficient.
- Audio reuse — The password recording is reused for voice embedding extraction. The user speaks only once.
- Class-weighted loss — Inverse-frequency weighting handles unequal sample counts across users.
- Full reproducibility — All RNG seeds fixed (Python, NumPy, PyTorch CPU/CUDA), CuDNN deterministic mode enabled.
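The blink half of the liveness check can be sketched in a few lines. The EAR formula follows the standard six-landmark definition (horizontal corners p1/p4, vertical pairs p2/p6 and p3/p5); extracting those landmarks via MediaPipe FaceLandmarker is assumed done upstream, and the 0.21 threshold and 2-frame minimum are illustrative values, not the project's exact settings.

```python
import numpy as np

def eye_aspect_ratio(eye):
    """EAR from six (x, y) eye landmarks ordered p1..p6:
    p1/p4 are the horizontal corners, (p2, p6) and (p3, p5) the vertical pairs."""
    p1, p2, p3, p4, p5, p6 = np.asarray(eye, dtype=float)
    vertical = np.linalg.norm(p2 - p6) + np.linalg.norm(p3 - p5)
    horizontal = np.linalg.norm(p1 - p4)
    return vertical / (2.0 * horizontal)

def blink_detected(ear_series, thresh=0.21, min_frames=2):
    """A blink = EAR dipping below `thresh` for at least `min_frames`
    consecutive frames (a static photo never dips)."""
    run = 0
    for ear in ear_series:
        run = run + 1 if ear < thresh else 0
        if run >= min_frames:
            return True
    return False
```

Head-pose variation via solvePnP is the other arm of the OR logic; either signal passing is enough to clear the liveness gate.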
## Data Pipeline
- Face: ~30 raw photos/person → MTCNN preprocessing (detection, alignment, 160x160 crop) → 15x augmentation (flip, rotation, brightness, blur, cutout, noise, perspective) → ~480 images/person → FaceNet 512-d embeddings
- Voice: ~30 raw clips/person → 16kHz resampling, silence trim, normalization → 3x augmentation (noise, pitch shift, speed perturbation) → ~120 clips/person → ECAPA-TDNN 192-d embeddings
- Enrollment: Mean embedding per person per modality, L2-normalized onto the unit hypersphere
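The enrollment step reduces each user's per-modality embeddings to one unit-norm profile vector. A minimal sketch (the function name `enroll` is illustrative):

```python
import numpy as np

def enroll(embeddings):
    """Enrollment profile: mean of one user's embeddings for one modality,
    L2-normalized onto the unit hypersphere so cosine similarity
    reduces to a dot product."""
    profile = np.mean(np.asarray(embeddings, dtype=float), axis=0)
    norm = np.linalg.norm(profile)
    return profile / norm if norm > 0 else profile
```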
## What's Included
- Complete data preparation pipeline (preprocessing, augmentation, embedding extraction, enrollment)
- Late Fusion MLP training with early stopping, LR scheduling, and class-weighted loss
- 7 evaluation visualizations at 300 DPI (t-SNE clusters, training curves, confusion matrix, similarity distributions, per-class performance, architecture diagram, dashboard)
- Live authentication application with real-time camera preview, multi-frame quality scoring, and liveness detection
- Centralized configuration system (15 class-based namespaces in utils/config.py)
- JSONL authentication logging with admin correction support for active learning
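The JSONL logging idea is simple to sketch: one JSON object per attempt, appended to a log file, with a correction field an admin can fill in later to generate labels for active learning. The field names and path below are illustrative assumptions, not the project's actual schema.

```python
import json
import time
from pathlib import Path

def log_attempt(user, decision, confidence, path=Path("logs/auth_log.jsonl")):
    """Append one authentication attempt as a single JSON line.
    `admin_corrected_label` starts empty and can be set later to turn
    misclassified attempts into new training labels."""
    path.parent.mkdir(parents=True, exist_ok=True)
    record = {
        "ts": time.time(),
        "user": user,
        "decision": decision,
        "confidence": confidence,
        "admin_corrected_label": None,
    }
    with path.open("a") as f:
        f.write(json.dumps(record) + "\n")
```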
## Tech Stack
| Component | Technology |
|---|---|
| Face Detection & Embedding | MTCNN + InceptionResnetV1 / FaceNet (facenet-pytorch) |
| Voice Embedding | ECAPA-TDNN (SpeechBrain, pretrained on VoxCeleb) |
| Speech-to-Text | Google Speech Recognition API |
| Liveness Detection | MediaPipe FaceLandmarker (EAR + solvePnP) |
| Fusion Model | Custom MLP (PyTorch) |
| Camera / Audio | OpenCV, sounddevice, scipy |