Reference-based phishing detection without a predefined reference list — powered by Vision-Language Models.
An extension of our USENIX Security 2024 work "Less Defined Knowledge and More True Alarms: Reference-based Phishing Detection without a Pre-defined Reference List."
• Read our Paper • • Visit our Website • • Download our Datasets • • Cite our Paper •
- Introduction
- Framework
- Repository Structure
- Setup
- Prepare a Dataset
- Run PhishVLM
- Understand the Output
- Configuration
- Troubleshooting
- Citation
Existing reference-based phishing detection:
- ❌ Relies on a predefined reference list, which lacks comprehensiveness and incurs a high maintenance cost.
- ❌ Does not fully exploit the textual semantics present on the webpage.
PhishVLM builds a reference-based phishing detector that is:
- ✅ Free of a predefined reference list — modern VLMs have encoded far more extensive brand–domain knowledge than any hand-curated list.
- ✅ Chain-of-thought credential-taking prediction — the credential-taking status is reasoned step-by-step directly from the screenshot.
Input: a URL and its screenshot | Output: Phish / Benign and the phishing target brand.
-
Step 1 — Brand recognition. Input: cropped logo screenshot. Output: the VLM's predicted brand domain.
-
Step 2 — Credential-Requiring-Page (CRP) classification. Input: the webpage screenshot. The VLM chooses A. Credential-taking page or B. Non-credential-taking page. If
A, go to Step 4; ifB, go to Step 3. -
Step 3 — CRP transition (only when Step 2 returns
B). Input: screenshots of all clickable UI elements. The most likely login UI is clicked, and the pipeline returns to Step 1 with the updated webpage and URL (bounded byrank.depth_limit). -
Step 4 — Decision. A page is flagged as phishing when all of the following hold:
- the predicted brand's domain is inconsistent with the webpage's own domain; and
- brand validation passes — by default the on-page logo is matched against Google Image search results for the predicted brand (
brand_valid.activate: True); if validation is disabled, the predicted brand domain is instead required to be alive; and - the page is classified as a credential-taking page (Step 2 returns
A).
If the predicted brand is itself a web-hosting / cloud provider (see
datasets/hosting_blacklists.txt), the page is treated as benign. Otherwise the page is reported as benign.
PhishVLM/
├── param_dict.yaml # Pipeline hyper-parameters
├── requirements.txt
├── prompts/ # VLM prompts (system + few-shot examples)
│ ├── brand_recog_prompt.json
│ ├── crp_pred_prompt.json
│ └── crp_trans_prompt.json
├── datasets/
│ ├── hosting_blacklists.txt # Web-hosting / cloud-provider domains
│ ├── test_sites/ # Bundled demo site (www.baidu.com)
│ ├── openai_key.txt # (you create) OpenAI API key
│ └── google_api_key.txt # (you create) Google Search key + engine id
├── figures/
└── scripts/
├── infer/
│ └── run.py # Entry point (inference loop)
├── pipeline/
│ └── phishvlm.py # PhishVLM class — the 4-step pipeline
├── utils/ # Web interaction, drawing, logging helpers
│ ├── web_utils.py
│ ├── draw_utils.py
│ ├── logger_utils.py
│ └── PhishIntentionWrapper.py
└── phishintention/ # Logo detector / siamese / OCR backbones
├── model_config.py # load_config(): builds the vision models
├── configs/ # *.yaml model configs
├── modules/ # detector + logo matching
├── ocr_lib/ # OCR-aided siamese encoder
├── utils/
└── setup.sh # Downloads pretrained weights
Tested on Ubuntu with an NVIDIA GPU and CUDA 11. A CPU-only run is possible but slow; the vision backbones fall back to CPU automatically.
A new conda environment named phishvlm is created in this step.
conda create -n phishvlm python=3.10 -y
conda activate phishvlm
# Python dependencies
pip install -r requirements.txt
# PyTorch (must match your CUDA version — example below is CUDA 11.3)
pip install torch==1.11.0+cu113 torchvision==0.12.0+cu113 torchaudio==0.11.0 \
--extra-index-url https://download.pytorch.org/whl/cu113
# detectron2 (used by the logo / layout detector)
pip install --no-build-isolation git+https://github.com/facebookresearch/detectron2.git
# Download the pretrained vision-model weights into scripts/phishintention/models/
cd scripts/phishintention
chmod +x setup.sh
./setup.sh
cd ../..setup.sh downloads the layout/logo detector, the OCR-aided siamese encoder and
the supporting reference files via gdown and places them under
scripts/phishintention/models/. Re-running it is safe — existing files are skipped.
PhishVLM drives a headless Chrome through Selenium. Install Chrome and a matching
driver (the driver is fetched automatically at runtime by webdriver-manager):
wget https://dl.google.com/linux/direct/google-chrome-stable_current_amd64.deb
sudo apt install -y ./google-chrome-stable_current_amd64.deb-
🔑 OpenAI API key — tutorial. Paste the key into
./datasets/openai_key.txt:echo "sk-your-openai-key" > ./datasets/openai_key.txt
-
🔑 Google Programmable Search API key — tutorial. Put the API key on the first line and the Search Engine ID on the second line of
./datasets/google_api_key.txt:[API_KEY] [SEARCH_ENGINE_ID]
Both key files live under
datasets/and are git-ignored, so your secrets are never committed.
Organize the sites you want to test as one subfolder per website:
testing_dir/
├── aaa.com/
│ ├── shot.png # webpage screenshot
│ ├── info.txt # webpage URL
│ └── html.txt # webpage HTML source
├── bbb.com/
│ ├── shot.png
│ ├── info.txt
│ └── html.txt
└── ...
A ready-to-run example is provided in datasets/test_sites/.
Run from the project root as a module (this guarantees the scripts package is
importable):
conda activate phishvlm
python -m scripts.infer.run --folder ./datasets/test_sitesOptional arguments:
| Argument | Default | Description |
|---|---|---|
--folder |
./datasets/test_sites |
Folder of websites to test. |
--config |
./param_dict.yaml |
Pipeline hyper-parameter file. |
-
The console prints a live log, e.g.:
Expand to see a sample log
[PhishLLMLogger][DEBUG] Folder ./datasets/field_study/2023-09-01/device-...remotewd.com [PhishLLMLogger][DEBUG] Time taken for LLM brand prediction: 0.97 Detected brand: sonicwall.com [PhishLLMLogger][DEBUG] Domain sonicwall.com is valid and alive [PhishLLMLogger][DEBUG] Time taken for LLM CRP classification: 2.92 CRP prediction: A. This is a credential-requiring page. [❗️] Phishing discovered, phishing target is sonicwall.com -
A results file named
[today's date]_phishllm.txtis written to the working directory (tab-separated). When a site is flagged as phishing, an annotatedpredict.pngis saved inside that site's folder. Columns:Column Meaning folderWebsite subfolder name. phish_predictionphishorbenign.target_predictionPredicted target brand domain (e.g. paypal.com).brand_recog_timeTime spent on brand recognition + validation (s). crp_prediction_timeTime spent on CRP prediction (s). crp_transition_timeTime spent on CRP transition / ranking (s).
All pipeline knobs live in param_dict.yaml, including:
VLM_model— the OpenAI vision model used (defaultgpt-4o-mini-2024-07-18).brand_recog,crp_pred,rank— temperature, token limits and sleep/timeouts per step.brand_valid— whether to validate the predicted brand via logo matching, and the top-k/ similarity threshold to use.rank.depth_limit— maximum number of CRP transitions (clicks) before giving up.
Model weights and detection thresholds are configured in
scripts/phishintention/configs/configs.yaml.
ModuleNotFoundError: No module named 'scripts'— run the pipeline from the project root with the module form:python -m scripts.infer.run(notpython scripts/infer/run.py).FileNotFoundError: openai_key.txt/google_api_key.txt— create the key files underdatasets/as described in Step 3.- Chrome / driver errors — make sure Google Chrome is installed;
webdriver-managerdownloads the matching ChromeDriver on first run (requires network access). - CUDA / detectron2 build errors — verify that your installed PyTorch CUDA build matches the CUDA toolkit on your machine before installing detectron2.
@inproceedings{299838,
author = {Ruofan Liu and Yun Lin and Xiwen Teoh and Gongshen Liu and Zhiyong Huang and Jin Song Dong},
title = {Less Defined Knowledge and More True Alarms: Reference-based Phishing Detection without a Pre-defined Reference List},
booktitle = {33rd USENIX Security Symposium (USENIX Security 24)},
year = {2024},
isbn = {978-1-939133-44-1},
address = {Philadelphia, PA},
pages = {523--540},
url = {https://www.usenix.org/conference/usenixsecurity24/presentation/liu-ruofan},
publisher = {USENIX Association},
month = aug
}If you have any issues running our code, please open a GitHub issue or email us: liu.ruofan16@u.nus.edu, lin_yun@sjtu.edu.cn, dcsdjs@nus.edu.sg.
