@David-Khutsishvili released this 22 Mar 03:56 · 2 commits to main since this release · bada9c3

Release: v1.0 — Multi-Modal Biometric Authentication System

Overview

A real-time access control system that authenticates users by fusing three independent verification channels — face recognition, speaker verification, and a spoken password — into a single deep learning decision. Built as a Deep Learning capstone project.

The system implements a multi-gate security pipeline with defense in depth: an attacker must simultaneously defeat all channels to gain access. Failure at any gate terminates the attempt early.

Architecture

User → Gate 1: Voice Password (Google STT + fuzzy match)
         ↓ pass
       Gate 2: Biometric Capture
         ├── Face  → MTCNN → FaceNet (VGGFace2) → 512-dim embedding
         └── Voice → ECAPA-TDNN (VoxCeleb)      → 192-dim embedding
         ↓
       Liveness Check (blink detection via EAR + head pose via solvePnP)
         ↓ pass
       Gate 3: Late Fusion MLP (704-dim → 256 → 128 → 4 classes)
         ↓
       3-Tier Confidence Decision → ACCESS GRANTED / DENIED
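
The Gate 3 fusion head above can be sketched in PyTorch. The layer widths (704 → 256 → 128 → 4) come from the diagram; the activation, dropout rate, and placing BatchNorm on the concatenated input are assumptions, not stated details:

```python
import torch
import torch.nn as nn

class LateFusionMLP(nn.Module):
    """Sketch of the late-fusion head: 704-dim concat -> 256 -> 128 -> 4 classes.
    ReLU/dropout choices and BatchNorm placement are assumptions."""

    def __init__(self, face_dim=512, voice_dim=192, num_classes=4):
        super().__init__()
        self.net = nn.Sequential(
            # BatchNorm first equalizes the face/voice scale mismatch (see Highlights)
            nn.BatchNorm1d(face_dim + voice_dim),
            nn.Linear(face_dim + voice_dim, 256),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(256, 128),
            nn.ReLU(),
            nn.Linear(128, num_classes),
        )

    def forward(self, face_emb, voice_emb):
        # Late fusion: concatenate the two modality embeddings, then classify
        return self.net(torch.cat([face_emb, voice_emb], dim=1))
```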

Decision logic:

  • ≥ 85% confidence → Immediate access
  • 50% to < 85% → Gray area: cosine-similarity fallback against enrolled profiles (both face ≥ 0.4 and voice ≥ 0.4 must pass)
  • < 50%, or predicted class "unknown" → Access denied
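
The three tiers translate directly into a small decision function. Thresholds are the ones listed above; the `profile` dictionary of enrolled per-person embeddings is an assumed shape, not a documented interface:

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def decide(probs, class_names, face_emb, voice_emb, profile):
    """Three-tier decision sketch. `profile[name]` is assumed to hold the
    enrolled {"face": ..., "voice": ...} embeddings for each user."""
    conf = float(np.max(probs))
    name = class_names[int(np.argmax(probs))]
    if name == "unknown" or conf < 0.50:
        return "DENIED"                      # Tier 3: low confidence or unknown
    if conf >= 0.85:
        return f"GRANTED:{name}"             # Tier 1: immediate access
    # Tier 2 (gray area): both modalities must match the enrolled profile
    face_ok = cosine(face_emb, profile[name]["face"]) >= 0.4
    voice_ok = cosine(voice_emb, profile[name]["voice"]) >= 0.4
    return f"GRANTED:{name}" if face_ok and voice_ok else "DENIED"
```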

Results

Metric                 Value
Test accuracy          99.5% (376/378 samples)
David (per-class)      100%
Itzhak (per-class)     100%
Yossi (per-class)      100%
Unknown detection      98% recall, 100% precision

Key Technical Highlights

  • Transfer learning — Frozen pretrained backbones (InceptionResnetV1 on VGGFace2 with 3.3M faces, ECAPA-TDNN on VoxCeleb with 7000+ speakers) used as feature extractors. No fine-tuning required.
  • Leakage-free data splitting — Train/val/test splits happen per-person, per-modality, before creating face-voice pairs. No embedding appears in more than one set.
  • Cross-modal spoofing detection — Mismatched face+voice pairs (e.g., David's face + Yossi's voice) are trained as the "unknown" class, teaching the model that face-voice correlation matters.
  • BatchNorm in fusion — Essential for handling the ~800x scale mismatch between voice (192-d, std ~16) and face (512-d, std ~0.02) embeddings. Without normalization, the voice dimensions dominate the fused representation.
  • Liveness detection — Blink detection (Eye Aspect Ratio via MediaPipe) and head pose variation (solvePnP) block photo prints and video replay attacks. OR logic: either check passing is sufficient.
  • Audio reuse — The password recording is reused for voice embedding extraction. The user speaks only once.
  • Class-weighted loss — Inverse-frequency weighting handles unequal sample counts across users.
  • Full reproducibility — All RNG seeds fixed (Python, NumPy, PyTorch CPU/CUDA), CuDNN deterministic mode enabled.
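
The cross-modal spoofing idea above can be sketched as a pair-generation step: matched face-voice pairs keep the person's label, while cross-person mismatches are labelled "unknown". The exact sampling strategy (how many mismatched pairs per person pair) is an assumption:

```python
import itertools
import random

def make_pairs(face_embs, voice_embs):
    """Sketch: build (face, voice, label) training pairs.
    face_embs / voice_embs map person name -> list of embeddings.
    One mismatched pair per ordered person pair is an assumption."""
    pairs = []
    for person in face_embs:
        # Genuine pairs: same person's face and voice, labelled with their name
        for f, v in zip(face_embs[person], voice_embs[person]):
            pairs.append((f, v, person))
    for a, b in itertools.permutations(face_embs, 2):
        # Spoof pairs: person a's face with person b's voice -> "unknown"
        f = random.choice(face_embs[a])
        v = random.choice(voice_embs[b])
        pairs.append((f, v, "unknown"))
    return pairs
```

This teaches the fusion model that the two modalities must agree on identity, rather than letting either channel carry the decision alone.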

Data Pipeline

  • Face: ~30 raw photos/person → MTCNN preprocessing (detection, alignment, 160x160 crop) → 15x augmentation (flip, rotation, brightness, blur, cutout, noise, perspective) → ~480 images/person → FaceNet 512-d embeddings
  • Voice: ~30 raw clips/person → 16kHz resampling, silence trim, normalization → 3x augmentation (noise, pitch shift, speed perturbation) → ~120 clips/person → ECAPA-TDNN 192-d embeddings
  • Enrollment: Mean embedding per person per modality, L2-normalized onto the unit hypersphere
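
The enrollment step is small enough to show directly: average one person's embeddings for a modality, then L2-normalize onto the unit hypersphere, as described above:

```python
import numpy as np

def enroll(embeddings):
    """Enrollment sketch: mean embedding per person per modality,
    L2-normalized so cosine similarity reduces to a dot product."""
    mean = np.mean(np.asarray(embeddings, dtype=np.float64), axis=0)
    return mean / np.linalg.norm(mean)
```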

What's Included

  • Complete data preparation pipeline (preprocessing, augmentation, embedding extraction, enrollment)
  • Late Fusion MLP training with early stopping, LR scheduling, and class-weighted loss
  • 7 evaluation visualizations at 300 DPI (t-SNE clusters, training curves, confusion matrix, similarity distributions, per-class performance, architecture diagram, dashboard)
  • Live authentication application with real-time camera preview, multi-frame quality scoring, and liveness detection
  • Centralized configuration system (15 class-based namespaces in utils/config.py)
  • JSONL authentication logging with admin correction support for active learning
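
A JSONL authentication log like the one listed above is append-only, one JSON object per line. The field names here (including the `corrected_label` slot for admin corrections) are assumptions, not the project's actual schema:

```python
import json
import time
from pathlib import Path

def log_attempt(path, user, decision, confidence, corrected_label=None):
    """Sketch of append-only JSONL auth logging; field names are assumed.
    `corrected_label` lets an admin relabel an attempt for active learning."""
    entry = {
        "ts": time.time(),
        "user": user,
        "decision": decision,
        "confidence": confidence,
        "corrected_label": corrected_label,
    }
    with Path(path).open("a", encoding="utf-8") as f:
        f.write(json.dumps(entry) + "\n")
```

Because each attempt is a self-contained line, corrected entries can later be filtered out and fed back into training without parsing the whole file into memory.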

Tech Stack

Component                    Technology
Face detection & embedding   MTCNN + InceptionResnetV1 / FaceNet (facenet-pytorch)
Voice embedding              ECAPA-TDNN (SpeechBrain, pretrained on VoxCeleb)
Speech-to-text               Google Speech Recognition API
Liveness detection           MediaPipe FaceLandmarker (EAR + solvePnP)
Fusion model                 Custom MLP (PyTorch)
Camera / audio               OpenCV, sounddevice, scipy