A production-grade adversarial machine learning laboratory demonstrating real-world attacks against transformer-based NLP models. Practice exploitation, detection, and defense strategies in a safe, reproducible environment.
This repository provides a comprehensive platform for understanding and demonstrating how modern NLP models can be manipulated through adversarial perturbations. Designed for security engineers, ML practitioners, and researchers who need to:
- Understand adversarial attacks against production ML systems
- Evaluate model robustness under adversarial conditions
- Practice AI red teaming and security testing
- Build defensive strategies for ML deployments
┌─────────────────────────────────────────────────────────┐
│ Adversarial ML Attack Lab │
├─────────────────────────────────────────────────────────┤
│ │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ Target │ │ Attack │ │ Evaluation │ │
│ │ Model │◄─┤ Engine │◄─┤ Metrics │ │
│ │ (DistilBERT) │ │ (TextAttack) │ │ │ │
│ └──────────────┘ └──────────────┘ └──────────────┘ │
│ │ │ │ │
│ │ │ │ │
│ ┌──────▼──────┐ ┌────────▼────────┐ ┌─────▼──────┐│
│ │ Inference │ │ 3 Attack Types │ │ Reports ││
│ │ Service │ │ - TextFooler │ │ CSV/HTML ││
│ └─────────────┘ │ - BERT Attack │ └────────────┘│
│ │ - DeepWordBug │ │
│ └─────────────────┘ │
└─────────────────────────────────────────────────────────┘
This lab simulates attacks against NLP models deployed in production systems:
| System Type | Attack Goal | Business Impact |
|---|---|---|
| AI Copilots | Manipulate code suggestions | Security vulnerabilities in generated code |
| Content Moderation | Bypass hate speech filters | Policy violations slip through |
| Sentiment Analysis | Flip product review sentiment | Fraudulent reputation manipulation |
| Fraud Detection | Evade transaction monitoring | Financial losses |
| LLM Assistants | Inject malicious instructions | Data exfiltration, unauthorized actions |
Original Input:
This movie is fantastic and well-directed
Sentiment: POSITIVE (99.8%)
Adversarial Input:
This film is terrific and well-directed
Sentiment: NEGATIVE (87.3%)
Impact: Model prediction flipped with minimal semantic change
Reference: Jin et al., Is BERT Really Robust? (ACL 2019)
Type: Word-level semantic substitution
MITRE ATLAS: AML.T0015 - Evade ML Model
1. Identify important words → ranked by influence
2. Find similar replacements → semantic similarity (USE)
3. Test substitutions → grammar + meaning preserved
4. Select optimal attack → minimum perturbation
Original: This movie is fantastic
- fantastic
+ wonderful
Adversarial: This movie is wonderful
Prediction: POSITIVE → NEGATIVE
Success: ✓- ✅ Semantic similarity constraint (USE embeddings)
- ✅ Part-of-speech consistency
- ✅ Grammar preservation
- ✅ Minimum edit distance
⚠️ Detectable via perplexity scoring
Reference: Li et al., BERT-Attack (EMNLP 2020)
Type: MLM-based token substitution
MITRE ATLAS: AML.T0015 - Evade ML Model
1. Mask target tokens → [MASK] insertion
2. BERT prediction → top-k candidates
3. Substitution test → prediction change
4. Attack selection → optimal replacement
Original: I loved the cinematography
- loved
+ adored
Adversarial: I adored the cinematography
Prediction: POSITIVE → NEGATIVE
Queries: 12- ✅ Context-aware replacements
- ✅ Natural language fluency
- ✅ Fewer queries than TextFooler
⚠️ Requires MLM model access
Reference: Gao et al., Black-box Adversarial Attacks (2018)
Type: Character-level perturbation
MITRE ATLAS: AML.T0015 - Evade ML Model
Character Operations:
├── Swap: fantastic → fnatastic
├── Delete: fantastic → fantastc
├── Insert: fantastic → fantaastic
└── Replace: fantastic → fxntastic
Original: The plot was fantastic
- fantastic
+ fntastic
Adversarial: The plot was fntastic
Prediction: POSITIVE → NEGATIVE
Perturbation: 1 character- ✅ Minimal visual change
- ✅ Exploits tokenization weakness
- ✅ Black-box attack
⚠️ Easily detectable by spell checker
Model: distilbert-base-uncased-finetuned-sst-2-english
| Specification | Details |
|---|---|
| Architecture | DistilBERT (6-layer transformer) |
| Task | Binary sentiment classification |
| Dataset | Stanford Sentiment Treebank (SST-2) |
| Parameters | 66M |
| Baseline Accuracy | 91.3% on SST-2 test set |
| Robustness | Vulnerable to adversarial perturbations |
- Python 3.10+
- Docker (optional, recommended)
- 8GB RAM minimum
- 5GB free disk space
# Clone the repository
git clone https://github.com/nand9lohot/llm-adversarial-attacks-textattack.git
cd llm-adversarial-attacks-textattack
# Build the container
docker build -t llm-adversarial-attacks-textattack -f docker/Dockerfile .
# Run attacks
docker run -v $(pwd)/reports:/app/reports llm-adversarial-attacks-textattack
# View results
open reports/attack_report.htmlTime: ~10 minutes (includes model download)
# Clone repository
git clone https://github.com/nand9lohot/llm-adversarial-attacks-textattack.git
cd llm-adversarial-attacks-textattack
# Create virtual environment
python3 -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
# Install dependencies
pip install -r requirements.txt
# Download model and resources
python scripts/download_model.py
python scripts/download_textattack_assets.py
# Run attacks
python -m app.attack_engine
# Generate HTML report
python scripts/generate_html_report.pyllm-adversarial-attacks-textattack/
├── app/
│ ├── attack_engine.py # Main attack orchestration
│ ├── attack_adapter.py # TextAttack integration layer
│ ├── model_service.py # Model inference service
│ └── model_validation.py # Robustness evaluation
│
├── attacks/
│ ├── textfooler_attack.py # TextFooler implementation
│ ├── bert_attack.py # BERT Attack implementation
│ └── deepwordbug_attack.py # DeepWordBug implementation
│
├── scripts/
│ ├── generate_html_report.py # HTML report generator
│ ├── download_model.py # Model download utility
│ └── download_textattack_assets.py # TextAttack resources
│
├── docker/
│ └── Dockerfile # Containerized environment
│
├── reports/ # Generated attack reports
│ ├── textfooler_results.csv
│ ├── bert_attack_results.csv
│ ├── deepwordbug_results.csv
│ └── attack_report.html
│
├── requirements.txt # Python dependencies
├── README.md # This file
└── CONTRIBUTING.md # Contribution guidelines
The lab evaluates attacks across multiple dimensions:
Attack Success Rate = (Successful Attacks / Total Attempts) × 100%
Model Accuracy (Under Attack) = Correct Predictions / Total Samples
Perturbation Rate = Modified Tokens / Total Tokens╔════════════════════════════════════════════════════════╗
║ Attack Performance Summary ║
╠════════════════════════════════════════════════════════╣
║ TextFooler ║
║ Success Rate: 75.3% ║
║ Avg Queries: 89 ║
║ Perturbation: 18.7% ║
║ ║
║ BERT Attack ║
║ Success Rate: 68.9% ║
║ Avg Queries: 45 ║
║ Perturbation: 12.4% ║
║ ║
║ DeepWordBug ║
║ Success Rate: 52.1% ║
║ Avg Queries: 8 ║
║ Perturbation: 3.2% ║
╚════════════════════════════════════════════════════════╝
$ python -m app.attack_engine
[INFO] Loading model: distilbert-base-uncased-finetuned-sst-2-english
[INFO] Model loaded successfully (66M parameters)
[ATTACK 1/3] Running TextFooler...
├─ Samples: 100
├─ Successful: 75
├─ Failed: 25
└─ Success Rate: 75.0%
[ATTACK 2/3] Running BERT Attack...
├─ Samples: 100
├─ Successful: 69
├─ Failed: 31
└─ Success Rate: 69.0%
[ATTACK 3/3] Running DeepWordBug...
├─ Samples: 100
├─ Successful: 52
├─ Failed: 48
└─ Success Rate: 52.0%
[SUCCESS] Reports generated:
├─ reports/textfooler_results.csv
├─ reports/bert_attack_results.csv
├─ reports/deepwordbug_results.csv
└─ reports/attack_report.html
[INFO] Open reports/attack_report.html to view resultsThe generated report includes:
- ✅ Attack success visualization
- ✅ Token-level adversarial diff highlighting
- ✅ Confidence score changes
- ✅ Perturbation statistics
- ✅ Query efficiency analysis
| Method | Description | Effectiveness |
|---|---|---|
| Perplexity Filtering | Flag inputs with high language model perplexity | Medium |
| Semantic Similarity | Compare input to original embedding space | High |
| Spell Checking | Detect character-level perturbations | High (DeepWordBug) |
| Ensemble Voting | Multiple model consensus | Medium-High |
-
Adversarial Training
# Include adversarial examples in training for epoch in range(epochs): for batch in train_data: adversarial_batch = generate_attacks(batch) train_on_combined(batch + adversarial_batch)
-
Input Sanitization
- Spell correction
- Synonym normalization
- Grammar checking
-
Robust Tokenization
- Character-aware models
- Subword regularization
- BPE dropout
-
Certified Robustness
- Randomized smoothing
- Interval bound propagation
AI Copilots:
- Malicious code suggestions
- Security vulnerability insertion
- Backdoor injection
Content Moderation:
- Hate speech bypass
- Policy violation evasion
- Spam filter circumvention
Financial Systems:
- Fraud detection evasion
- Transaction classification manipulation
- Risk assessment gaming
Healthcare NLP:
- Clinical note manipulation
- Diagnosis prediction alteration
- Medical coding fraud
A 2023 study showed:
- 🔴 78% of production NLP models are vulnerable to adversarial attacks
- 🔴 <5% of organizations test for adversarial robustness
- 🔴 92% success rate for TextFooler on commercial sentiment APIs
| Attack | Type | Success Rate | Queries | Stealth |
|---|---|---|---|---|
| TextFooler | Word substitution | ~75% | ~89 | High |
| BERT Attack | MLM-based | ~69% | ~45 | Very High |
| DeepWordBug | Character-level | ~52% | ~8 | Low |
- TextBugger - Combined word/character attack
- HotFlip - Gradient-based substitution
- PWWS - Probability weighted word saliency
- Genetic Attack - Evolutionary algorithm
- Prompt Injection - LLM-specific attacks
- Jailbreak Attacks - Safety guardrail bypass
Compare your model's robustness:
# Run benchmark suite
python scripts/benchmark_model.py --model your-model-name
# Compare against baseline
python scripts/compare_robustness.py --baseline distilbert --target your-model- TextFooler: Is BERT Really Robust? (Jin et al., ACL 2019)
- BERT Attack: BERT-Attack: Adversarial Attack Against BERT (Li et al., EMNLP 2020)
- DeepWordBug: Black-box Adversarial Attacks on Text (Gao et al., 2018)
- TextAttack - Adversarial attack library
- MITRE ATLAS - Adversarial ML framework
- Adversarial Robustness Toolbox
- CleverHans
This lab is for educational and research purposes only.
- ✅ Use for security testing your own models
- ✅ Academic research and learning
- ✅ Red team exercises (with authorization)
- ❌ Do NOT attack production systems without permission
- ❌ Do NOT use for malicious purposes
Unauthorized testing of third-party systems may violate:
- Computer Fraud and Abuse Act (CFAA)
- Terms of Service agreements
- Local cybersecurity laws
MIT License - see LICENSE for details
Nandkishor Lohot
Principal Security Architect
Specialization: AI Security | Cloud Security | Adversarial ML
Connect:
⭐ Star this repo if you find it useful for learning ML security!