Conversation
MuneebUllahKhan222
left a comment
Just need to address a couple of small changes.
pkg/sources/web/web.go
Outdated
```go
	ctx.Logger().Error(err, "Visit failed")
}
collector.Wait() // blocks until all requests finish
close(done)
```
It should be `defer close(done)`, outside the goroutine.
Outside the goroutine? Any reason why?
It is general practice that the owner of the channel should be the one to close it.
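A minimal sketch of the ownership convention being described, with hypothetical names (`crawl` and the string payloads stand in for the collector goroutine and channel in the PR; this is not the PR's actual code):

```go
package main

import "fmt"

// crawl returns a receive-only channel. The spawned goroutine is the
// channel's owner: it is the only sender, so it is the one that closes
// it, and deferring the close guarantees it runs even on an early return
// or panic partway through the work.
func crawl(urls []string) <-chan string {
	done := make(chan string)
	go func() {
		defer close(done) // owner closes; receivers only range over it
		for _, u := range urls {
			done <- "visited " + u
		}
	}()
	return done
}

func main() {
	for msg := range crawl([]string{"https://example.com"}) {
		fmt.Println(msg) // prints: visited https://example.com
	}
}
```

Because the sender owns and closes the channel, the receiver's `range` loop terminates cleanly with no risk of a send on a closed channel.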
Force-pushed 77a5cc2 to aade78b.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Reviewed by Cursor Bugbot for commit 51abd9d.
```go
// Apply depth limit only when crawling is enabled and a positive depth is set.
if s.conn.GetCrawl() && s.conn.GetDepth() > 0 {
	collector.MaxDepth = int(s.conn.GetDepth())
}
```
Default depth makes --crawl flag ineffective
High Severity
The --depth flag defaults to 1, and Colly counts the seed page as depth 1, so with collector.MaxDepth = 1 only the seed page is visited and no discovered links are followed. Running trufflehog web --url https://example.com --crawl therefore behaves exactly like omitting --crawl. Users must always specify --depth 2 or higher for crawling to actually happen, which is unintuitive and defeats the purpose of the --crawl flag.
Additional Locations (1)
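One way the off-by-one could be handled, shown as a sketch rather than the fix actually adopted in the PR (the +1 translation is an assumption about the intended semantics of `--depth`; the helper name is hypothetical):

```go
package main

import "fmt"

// maxDepthFor translates the user-facing --depth value into Colly's
// MaxDepth. Colly counts the seed page itself as depth 1, so treating
// --depth as "levels of links beyond the seed" and adding 1 keeps
// --crawl effective at the default depth of 1. Returning 0 leaves
// Colly unlimited (and with crawling off, no links are followed anyway).
func maxDepthFor(crawl bool, depth int32) int {
	if !crawl || depth <= 0 {
		return 0
	}
	return int(depth) + 1 // seed page + depth levels of discovered links
}

func main() {
	fmt.Println(maxDepthFor(true, 1))  // default --depth with --crawl: seed + 1 level
	fmt.Println(maxDepthFor(false, 5)) // crawling disabled: no limit applied
}
```

An alternative with the same effect would be to raise the default of `--depth` to 2 when `--crawl` is set; either way the default invocation should follow at least one level of links.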
Description:
Adds a new `web` source that crawls and scans websites for exposed secrets. The source uses the Colly framework to fetch pages starting from one or more seed URLs, with configurable crawl depth, per-domain request delay, and a per-URL timeout. Link following is opt-in via `--crawl`, robots.txt is respected by default, and linked JavaScript files are enqueued alongside HTML pages since they are a common location for hardcoded credentials. Each scanned page produces a chunk carrying the page title, URL, content type, crawl depth, and a UTC timestamp in the metadata.

Checklist:
- Tests passing (`make test-community`)?
- Lint passing (`make lint`; this requires golangci-lint)?

Note
Medium Risk
Introduces a new network-facing crawler source with configurable crawling/robots behavior and several new dependencies, which can affect runtime load, timeouts, and output volume.
Overview
Adds a new `web` scan mode (`trufflehog web`) that fetches one or more seed URLs and optionally crawls in-scope links to produce scan chunks from HTTP responses.

Implements the `web` source using Colly with controls for crawl enablement, max depth, per-domain delay, overall timeout, custom User-Agent, and optional `robots.txt` bypass, and attaches new `Web` metadata (URL/title/content-type/depth/timestamp) to emitted chunks.

Updates protobufs (`sources.proto`, `source_metadata.proto`) and generated code to include `SOURCE_TYPE_WEB` plus corresponding config and metadata messages, and adds the Prometheus metric `web_urls_scanned` along with a comprehensive test suite and documentation.