# Release: v1.0 — Multi-Modal Biometric Authentication System
## Overview
A real-time access control system that authenticates users by fusing three independent verification channels — face recognition, speaker verification, and a spoken password — into a single deep learning decision. Built as a Deep Learning capstone project.
The system implements a multi-gate security pipeline with defense in depth: an attacker must simultaneously defeat all channels to gain access. Failure at any gate terminates the attempt early.
## Architecture
```
User → Gate 1: Voice Password (Google STT + fuzzy match)
         ↓ pass
       Gate 2: Biometric Capture
         ├── Face  → MTCNN → FaceNet (VGGFace2) → 512-dim embedding
         └── Voice → ECAPA-TDNN (VoxCeleb)      → 192-dim embedding
         ↓
       Liveness Check (blink detection via EAR + head pose via solvePnP)
         ↓ pass
       Gate 3: Late Fusion MLP (704-dim → 256 → 128 → 4 classes)
         ↓
       3-Tier Confidence Decision → ACCESS GRANTED / DENIED
```
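The Gate 3 fusion head can be sketched as a small PyTorch module. Layer sizes (704 → 256 → 128 → 4) come from the diagram and the leading BatchNorm reflects the variance-equalization point made under Key Technical Highlights; the dropout rates and class ordering here are illustrative assumptions, not the project's exact settings.

```python
import torch
import torch.nn as nn

class FusionMLP(nn.Module):
    """Late-fusion head: concatenated face (512-d) + voice (192-d)
    embeddings -> 4 classes (3 enrolled users + "unknown")."""
    def __init__(self, in_dim=704, n_classes=4):
        super().__init__()
        self.net = nn.Sequential(
            nn.BatchNorm1d(in_dim),     # equalizes face/voice variance mismatch
            nn.Linear(in_dim, 256),
            nn.ReLU(),
            nn.Dropout(0.3),            # illustrative rate
            nn.Linear(256, 128),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(128, n_classes),
        )

    def forward(self, face_emb, voice_emb):
        # Concatenate the two modality embeddings along the feature axis
        return self.net(torch.cat([face_emb, voice_emb], dim=1))
```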
Decision logic:
- ≥ 85% confidence → Immediate access
- 50–85% → Gray area: cosine similarity fallback against enrolled profiles (both face ≥ 0.4 and voice ≥ 0.4 must pass)
- < 50% or "unknown" → Access denied
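The three tiers above can be expressed as a single decision function. This is a minimal sketch: the function and parameter names (`decide`, `enrolled`, `sim_thresh`) are illustrative, and only the thresholds (85%, 50%, 0.4) are taken from the rules listed here.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def decide(probs, class_names, face_emb, voice_emb, enrolled,
           hi=0.85, lo=0.50, sim_thresh=0.4):
    """3-tier decision. `enrolled` maps name -> (face_profile, voice_profile)."""
    idx = int(np.argmax(probs))
    name, conf = class_names[idx], float(probs[idx])
    if name == "unknown" or conf < lo:
        return "DENIED", name                # tier 3: reject outright
    if conf >= hi:
        return "GRANTED", name               # tier 1: immediate access
    # Tier 2 (gray area): both modalities must match the enrolled profiles
    face_p, voice_p = enrolled[name]
    if (cosine(face_emb, face_p) >= sim_thresh
            and cosine(voice_emb, voice_p) >= sim_thresh):
        return "GRANTED", name
    return "DENIED", name
```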
## Results
| Metric | Value |
|---|---|
| Test Accuracy | 99.5% (376/378 samples) |
| Per-class accuracy (David) | 100% |
| Per-class accuracy (Itzhak) | 100% |
| Per-class accuracy (Yossi) | 100% |
| Unknown Detection | 98% recall, 100% precision |
## Key Technical Highlights
- Transfer learning — Frozen pretrained backbones (InceptionResnetV1 on VGGFace2 with 3.3M faces, ECAPA-TDNN on VoxCeleb with 7000+ speakers) used as feature extractors. No fine-tuning required.
- Leakage-free data splitting — Train/val/test splits happen per-person, per-modality, before creating face-voice pairs. No embedding appears in more than one set.
- Cross-modal spoofing detection — Mismatched face+voice pairs (e.g., David's face + Yossi's voice) are trained as the "unknown" class, teaching the model that face-voice correlation matters.
- BatchNorm in fusion — Essential for handling the ~800x variance mismatch between voice (192-d, std ~16) and face (512-d, std ~0.02) embeddings. Without it, voice dimensions dominate.
- Liveness detection — Blink detection (Eye Aspect Ratio via MediaPipe) and head pose variation (solvePnP) block photo prints and video replay attacks. OR logic: either check passing is sufficient.
- Audio reuse — The password recording is reused for voice embedding extraction. The user speaks only once.
- Class-weighted loss — Inverse-frequency weighting handles unequal sample counts across users.
- Full reproducibility — All RNG seeds fixed (Python, NumPy, PyTorch CPU/CUDA), CuDNN deterministic mode enabled.
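The blink half of the liveness check can be sketched in a few lines. The EAR formula follows the standard six-landmark definition (horizontal corners p1/p4, vertical pairs p2/p6 and p3/p5); extracting those landmarks via MediaPipe FaceLandmarker is assumed done upstream, and the 0.21 threshold and 2-frame minimum are illustrative values, not the project's exact settings.

```python
import numpy as np

def eye_aspect_ratio(eye):
    """EAR from six (x, y) eye landmarks ordered p1..p6:
    p1/p4 are the horizontal corners, (p2, p6) and (p3, p5) the vertical pairs."""
    p1, p2, p3, p4, p5, p6 = np.asarray(eye, dtype=float)
    vertical = np.linalg.norm(p2 - p6) + np.linalg.norm(p3 - p5)
    horizontal = np.linalg.norm(p1 - p4)
    return vertical / (2.0 * horizontal)

def blink_detected(ear_series, thresh=0.21, min_frames=2):
    """A blink = EAR dipping below `thresh` for at least `min_frames`
    consecutive frames (a static photo never dips)."""
    run = 0
    for ear in ear_series:
        run = run + 1 if ear < thresh else 0
        if run >= min_frames:
            return True
    return False
```

Head-pose variation via solvePnP is the other arm of the OR logic; either signal passing is enough to clear the liveness gate.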
## Data Pipeline
- Face: ~30 raw photos/person → MTCNN preprocessing (detection, alignment, 160x160 crop) → 15x augmentation (flip, rotation, brightness, blur, cutout, noise, perspective) → ~480 images/person → FaceNet 512-d embeddings
- Voice: ~30 raw clips/person → 16kHz resampling, silence trim, normalization → 3x augmentation (noise, pitch shift, speed perturbation) → ~120 clips/person → ECAPA-TDNN 192-d embeddings
- Enrollment: Mean embedding per person per modality, L2-normalized onto the unit hypersphere
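The enrollment step reduces each user's per-modality embeddings to one unit-norm profile vector. A minimal sketch (the function name `enroll` is illustrative):

```python
import numpy as np

def enroll(embeddings):
    """Enrollment profile: mean of one user's embeddings for one modality,
    L2-normalized onto the unit hypersphere so cosine similarity
    reduces to a dot product."""
    profile = np.mean(np.asarray(embeddings, dtype=float), axis=0)
    norm = np.linalg.norm(profile)
    return profile / norm if norm > 0 else profile
```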
## What's Included
- Complete data preparation pipeline (preprocessing, augmentation, embedding extraction, enrollment)
- Late Fusion MLP training with early stopping, LR scheduling, and class-weighted loss
- 7 evaluation visualizations at 300 DPI (t-SNE clusters, training curves, confusion matrix, similarity distributions, per-class performance, architecture diagram, dashboard)
- Live authentication application with real-time camera preview, multi-frame quality scoring, and liveness detection
- Centralized configuration system (15 class-based namespaces in utils/config.py)
- JSONL authentication logging with admin correction support for active learning
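The JSONL logging idea is simple to sketch: one JSON object per attempt, appended to a log file, with a correction field an admin can fill in later to generate labels for active learning. The field names and path below are illustrative assumptions, not the project's actual schema.

```python
import json
import time
from pathlib import Path

def log_attempt(user, decision, confidence, path=Path("logs/auth_log.jsonl")):
    """Append one authentication attempt as a single JSON line.
    `admin_corrected_label` starts empty and can be set later to turn
    misclassified attempts into new training labels."""
    path.parent.mkdir(parents=True, exist_ok=True)
    record = {
        "ts": time.time(),
        "user": user,
        "decision": decision,
        "confidence": confidence,
        "admin_corrected_label": None,
    }
    with path.open("a") as f:
        f.write(json.dumps(record) + "\n")
```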
## Tech Stack
| Component | Technology |
|---|---|
| Face Detection & Embedding | MTCNN + InceptionResnetV1 / FaceNet (facenet-pytorch) |
| Voice Embedding | ECAPA-TDNN (SpeechBrain, pretrained on VoxCeleb) |
| Speech-to-Text | Google Speech Recognition API |
| Liveness Detection | MediaPipe FaceLandmarker (EAR + solvePnP) |
| Fusion Model | Custom MLP (PyTorch) |
| Camera / Audio | OpenCV, sounddevice, scipy |