Web Source#4848

Open
kashifkhan0771 wants to merge 21 commits into trufflesecurity:main from kashifkhan0771:feature/web-source

Conversation

Contributor

@kashifkhan0771 kashifkhan0771 commented Mar 30, 2026

Description:

Adds a new web source that crawls and scans websites for exposed secrets. The source uses the Colly framework to fetch pages starting from one or more seed URLs, with configurable crawl depth, per-domain request delay, and a per-URL timeout. Link following is opt-in via --crawl, robots.txt is respected by default, and linked JavaScript files are enqueued alongside HTML pages since they are a common location for hardcoded credentials. Each scanned page produces a chunk carrying the page title, URL, content type, crawl depth, and a UTC timestamp in the metadata.

Checklist:

  • Tests passing (make test-community)?
  • Lint passing (make lint; requires golangci-lint)?

Note

Medium Risk
Introduces a new network-facing crawler source with configurable crawling/robots behavior and several new dependencies, which can affect runtime load, timeouts, and output volume.

Overview
Adds a new web scan mode (trufflehog web) that fetches one or more seed URLs and optionally crawls in-scope links to produce scan chunks from HTTP responses.

Implements the web source using Colly with controls for crawl enablement, max depth, per-domain delay, overall timeout, custom User-Agent, and optional robots.txt bypass, and attaches new Web metadata (URL/title/content-type/depth/timestamp) to emitted chunks.

Updates protobufs (sources.proto, source_metadata.proto) and generated code to include SOURCE_TYPE_WEB plus corresponding config and metadata messages, and adds Prometheus metric web_urls_scanned along with a comprehensive test suite and documentation.

Reviewed by Cursor Bugbot for commit 51abd9d.

@kashifkhan0771 kashifkhan0771 requested a review from a team March 30, 2026 11:40
@kashifkhan0771 kashifkhan0771 requested review from a team as code owners March 30, 2026 11:40
Contributor

@MuneebUllahKhan222 MuneebUllahKhan222 left a comment


Just need to address a couple of small changes.

ctx.Logger().Error(err, "Visit failed")
}
collector.Wait() // blocks until all requests finish
close(done)
Contributor


It should be defer close(done), outside the goroutine.

Contributor Author


Outside the goroutine? Any reason why?

Contributor


It is generally good practice that the owner of a channel should be the one to close it.


@cursor cursor bot left a comment


Cursor Bugbot has reviewed your changes and found 1 potential issue.



// Apply depth limit only when crawling is enabled and a positive depth is set.
if s.conn.GetCrawl() && s.conn.GetDepth() > 0 {
collector.MaxDepth = int(s.conn.GetDepth())
}


Default depth makes --crawl flag ineffective

High Severity

The --depth flag defaults to 1, and Colly counts the seed page as depth 1. So with collector.MaxDepth = 1, only the seed page is visited — no discovered links are followed. This means running trufflehog web --url https://example.com --crawl has the exact same behavior as omitting --crawl entirely. Users must always manually specify --depth 2 or higher for crawling to actually work, which is unintuitive and contradicts the purpose of the --crawl flag.
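Given the semantics described (Colly counts the seed page as depth 1), one possible guard is to raise the effective MaxDepth whenever crawling is enabled, so the default depth of 1 still follows links. EffectiveMaxDepth below is a hypothetical helper sketching that fix, not the PR's actual code:

```go
package main

import "fmt"

// EffectiveMaxDepth translates a user-facing depth setting into
// Colly's MaxDepth, assuming Colly counts the seed page as depth 1.
// With crawling enabled, a depth of 1 would visit only the seed, so
// this sketch enforces a minimum of 2 (seed plus linked pages).
func EffectiveMaxDepth(crawl bool, depth int) int {
	if !crawl {
		return 1 // no crawling: seed page only
	}
	if depth < 2 {
		return 2 // ensure --crawl actually follows links
	}
	return depth
}

func main() {
	fmt.Println(EffectiveMaxDepth(true, 1))  // prints 2
	fmt.Println(EffectiveMaxDepth(false, 5)) // prints 1
}
```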


