Web Source#4848

Open
kashifkhan0771 wants to merge 21 commits into trufflesecurity:main from kashifkhan0771:feature/web-source

Conversation

Contributor

@kashifkhan0771 kashifkhan0771 commented Mar 30, 2026

Description:

Adds a new web source that crawls and scans websites for exposed secrets. The source uses the Colly framework to fetch pages starting from one or more seed URLs, with configurable crawl depth, per-domain request delay, and a per-URL timeout. Link following is opt-in via --crawl, robots.txt is respected by default, and linked JavaScript files are enqueued alongside HTML pages since they are a common location for hardcoded credentials. Each scanned page produces a chunk carrying the page title, URL, content type, crawl depth, and a UTC timestamp in the metadata.

Checklist:

  • Tests passing (make test-community)?
  • Lint passing (make lint; requires golangci-lint)?

Note

Medium Risk
Introduces a new network-facing crawler source with configurable crawling/robots behavior and several new dependencies, which can affect runtime load, timeouts, and output volume.

Overview
Adds a new web scan mode (trufflehog web) that fetches one or more seed URLs and optionally crawls in-scope links to produce scan chunks from HTTP responses.

Implements the web source using Colly with controls for crawl enablement, max depth, per-domain delay, overall timeout, custom User-Agent, and optional robots.txt bypass, and attaches new Web metadata (URL/title/content-type/depth/timestamp) to emitted chunks.

Updates protobufs (sources.proto, source_metadata.proto) and generated code to include SOURCE_TYPE_WEB plus corresponding config and metadata messages, and adds Prometheus metric web_urls_scanned along with a comprehensive test suite and documentation.

Reviewed by Cursor Bugbot for commit 51abd9d.

@kashifkhan0771 kashifkhan0771 requested a review from a team March 30, 2026 11:40
@kashifkhan0771 kashifkhan0771 requested review from a team as code owners March 30, 2026 11:40
Contributor

@MuneebUllahKhan222 MuneebUllahKhan222 left a comment


Just need to address a couple of small changes.

ctx.Logger().Error(err, "Visit failed")
}
collector.Wait() // blocks until all requests finish
close(done)
Contributor


It should be defer close(done), outside the goroutine.

Contributor Author


Outside the goroutine? Any reason why?

Contributor


It is generally good practice that the owner of a channel should be the one to close it.


@cursor cursor bot left a comment


Cursor Bugbot has reviewed your changes and found 1 potential issue.



// Apply depth limit only when crawling is enabled and a positive depth is set.
if s.conn.GetCrawl() && s.conn.GetDepth() > 0 {
collector.MaxDepth = int(s.conn.GetDepth())
}


Default depth makes --crawl flag ineffective

High Severity

The --depth flag defaults to 1, and Colly counts the seed page as depth 1. So with collector.MaxDepth = 1, only the seed page is visited — no discovered links are followed. This means running trufflehog web --url https://example.com --crawl has the exact same behavior as omitting --crawl entirely. Users must always manually specify --depth 2 or higher for crawling to actually work, which is unintuitive and contradicts the purpose of the --crawl flag.
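Given the semantics described (Colly counts the seed page as depth 1), one possible guard is to raise the effective MaxDepth whenever crawling is enabled, so the default depth of 1 still follows links. EffectiveMaxDepth below is a hypothetical helper sketching that fix, not the PR's actual code:

```go
package main

import "fmt"

// EffectiveMaxDepth translates a user-facing depth setting into
// Colly's MaxDepth, assuming Colly counts the seed page as depth 1.
// With crawling enabled, a depth of 1 would visit only the seed, so
// this sketch enforces a minimum of 2 (seed plus linked pages).
func EffectiveMaxDepth(crawl bool, depth int) int {
	if !crawl {
		return 1 // no crawling: seed page only
	}
	if depth < 2 {
		return 2 // ensure --crawl actually follows links
	}
	return depth
}

func main() {
	fmt.Println(EffectiveMaxDepth(true, 1))  // prints 2
	fmt.Println(EffectiveMaxDepth(false, 5)) // prints 1
}
```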


