Conversation
📝 Walkthrough

This PR introduces CVPilot Job Scraper, a production-ready FastAPI microservice with a plugin-based source registry, a 5-stage relevance-filtering pipeline, async job fetching with resilience, and structured JSON logging.

Changes: Complete Job Scraper Microservice
Sequence Diagram

sequenceDiagram
participant Client as Client
participant API as FastAPI<br/>Ingest Route
participant Loader as Config<br/>Loader
participant Registry as Source<br/>Registry
participant Greenhouse as Greenhouse<br/>Source
participant Pipeline as Filtering<br/>Pipeline
participant Response as Response
Client->>API: POST /internal/ingest<br/>(sources?, companies?, limit_per_company?)
API->>Loader: load_companies()
Loader-->>API: {greenhouse: [stripe, ...]}
API->>Registry: get("greenhouse")
Registry-->>API: GreenhouseSource()
API->>Greenhouse: fetch_jobs(company=stripe, ...)
Greenhouse->>Greenhouse: list + cache<br/>(get cached job list)
Greenhouse->>Greenhouse: score_rank<br/>(filter/sort candidates)
Greenhouse->>Greenhouse: fetch details<br/>(parallel async requests)
Greenhouse-->>API: [{raw_job...}, ...]
API->>Greenhouse: normalize_job(raw_job)<br/>(per job)
Greenhouse-->>API: JobData{title, company, ...}
API->>Pipeline: filter_and_rank_jobs<br/>(jobs, user_context?, limit)
Pipeline->>Pipeline: Stage 1: cheap_filter
Pipeline->>Pipeline: Stage 2: score_job
Pipeline->>Pipeline: Stage 3: threshold_filter
Pipeline->>Pipeline: Stage 4: sort by score desc
Pipeline->>Pipeline: Stage 5: top-K slice
Pipeline-->>API: {total, jobs[], pipeline_summary}
API-->>Client: IngestionResponse{total, jobs[]}
Estimated Code Review Effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks: ✅ 4 passed | ❌ 1 failed (inconclusive)
🦈 PullShark AI Analysis: Risk Level 🔴 High
Actionable comments posted: 21
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@IMPLEMENTATION_SUMMARY.md`:
- Around line 6-7: The Test Coverage headline in IMPLEMENTATION_SUMMARY.md
("Test Coverage: 46/46 tests PASSING") is inconsistent with the listed
file-level totals; update the document so the headline total matches the sum of
the individual file totals (or vice versa) by recalculating and editing the
"Test Coverage" line and the per-file counts, and also reconcile the same
mismatch noted around lines 1037-1042; ensure the final numeric total and the
numerator/denominator values are consistent across the summary.
In `@scrapper/api/routes.py`:
- Around line 99-100: The handler collects an errors list but never uses it to
determine response status, so calls like the route function that populate errors
and jobs (look for the variables errors and jobs in the route handler) can
return 200 even if all source fetches failed; fix by checking errors after
processing: if len(errors) == len(companies) (i.e., no company fetch succeeded)
return an error status (e.g., 500 or 400) with the errors payload; if some
succeeded and some failed return a 207 Multi-Status (or 200 with a clear
partial-success payload) including both jobs and errors; update all similar
blocks that populate errors/jobs (the other occurrences you flagged) to
implement this logic and replace the unconditional return of jobs with a
conditional return that sets the appropriate status and includes errors.
- Around line 165-179: The code is treating request.limit_per_company as a
global cap on merged results; instead preserve it as a per-company cap and
enforce it during per-company selection/merging. Remove the global application
of result_limit before calling filtering_service.filter_and_rank_jobs and
instead pass the original request.limit_per_company (validated/capped by
MAX_RESULT_LIMIT if needed) as a named per_company_limit argument to
filtering_service.filter_and_rank_jobs (or update filter_and_rank_jobs to accept
per_company_limit), so filtering logic (inside filter_and_rank_jobs) can limit
each company's entries individually before merging into the final result set;
keep MAX_RESULT_LIMIT enforcement only as validation of the requested
per-company value (use request.limit_per_company -> per_company_limit variable,
cap that at MAX_RESULT_LIMIT, log if capped, then pass per_company_limit into
filter_and_rank_jobs).
- Around line 208-213: The except Exception as e block in scrapper/api/routes.py
currently returns the raw exception to clients; instead, log the full error
details (use logger.exception or logger.error with exc_info=True) inside that
except block to preserve stacktrace, and raise HTTPException(status_code=500,
detail="Internal server error") (or another generic message) so clients do not
receive internal exception text; keep references to the existing logger and
HTTPException symbols and replace the current f"Job ingestion failed: {str(e)}"
detail with the generic message while retaining the detailed error in the logs.
In `@scrapper/companies.json`:
- Around line 2-198: The companies list in scrapper/companies.json contains
duplicate slugs (e.g., "stripe", "hashicorp", "datadog", "pinterest", "asana",
"notion", etc.) which will produce duplicate scrapes; remove duplicate entries
from the JSON source so each company slug appears only once, and also harden
load_companies() to dedupe on ingestion (e.g., pass the loaded list through
dict.fromkeys() to drop repeats while preserving order) to prevent future duplicates slipping in.
In `@scrapper/config/loader.py`:
- Around line 6-7: The imports in scrapper/config/loader.py are using top-level
module paths and will fail when scrapper.config is imported; update the import
statements that reference ConfigException and get_logger to use package-relative
imports (e.g., from ..utils.exceptions import ConfigException and from
..utils.logger import get_logger) or fully-qualified package paths
(scrapper.utils.exceptions.ConfigException, scrapper.utils.logger.get_logger) so
the symbols ConfigException and get_logger are resolved correctly when loader.py
is loaded.
- Around line 66-80: load_config currently converts env strings to int/float and
will raise raw ValueError on malformed values; wrap the conversions for
"HTTP_TIMEOUT", "MAX_RETRIES", "RETRY_BACKOFF_FACTOR", and "REQUESTS_PER_SECOND"
in a try/except that catches ValueError and raises ConfigException with a clear
message (include the env var name and original error) so startup fails with a
ConfigException instead of an unhandled ValueError; keep using getenv for
defaults and ensure the boolean DEBUG parsing remains unchanged.
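For the loader.py items above, a minimal sketch of both fixes together, written as it would sit inside scrapper/config/loader.py; the default values and the single-argument ConfigException signature are assumptions, since neither appears in the review:

```python
import os

# Package-relative imports so the symbols resolve when scrapper.config is imported.
from ..utils.exceptions import ConfigException
from ..utils.logger import get_logger

logger = get_logger(__name__)


def _env_number(name: str, default: str, cast):
    """Read an env var and cast it, raising ConfigException on malformed values."""
    raw = os.getenv(name, default)
    try:
        return cast(raw)
    except ValueError as exc:
        raise ConfigException(f"Invalid value for {name}: {raw!r} ({exc})") from exc


def load_config() -> dict:
    return {
        "http_timeout": _env_number("HTTP_TIMEOUT", "10", float),
        "max_retries": _env_number("MAX_RETRIES", "3", int),
        "retry_backoff_factor": _env_number("RETRY_BACKOFF_FACTOR", "0.5", float),
        "requests_per_second": _env_number("REQUESTS_PER_SECOND", "5", float),
        "debug": os.getenv("DEBUG", "false").lower() == "true",  # boolean parsing unchanged
    }
```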
In `@scrapper/main.py`:
- Around line 51-58: The CORS config currently uses app.add_middleware with
CORSMiddleware and allow_origins=["*"] plus allow_credentials=True; change this
to read allowed origins from configuration/env (e.g., a FRONTEND_ORIGINS or
CORS_ALLOWED_ORIGINS setting) and set allow_origins to that enumerated list,
ensuring allow_credentials remains true only when origins are explicit; update
the code path that constructs the CORSMiddleware (the app.add_middleware call)
to parse a comma-separated env var or config list and use it instead of ["*"],
and add fallback to a safe default (empty list or explicit localhost) for
non-production.
- Around line 79-86: The global_exception_handler currently returns the
exception text to clients; change it to return only a generic error payload
(e.g., {"error":"Internal server error"}) and stop including str(exc) in the
JSONResponse, and instead log the full traceback server-side using
logger.exception(...) or traceback.format_exc() before returning the response so
the details are captured in logs but not echoed to clients; update the
global_exception_handler function accordingly.
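A sketch combining both main.py suggestions; the CORS_ALLOWED_ORIGINS variable name and the localhost fallback are illustrative choices, not names taken from the code under review:

```python
import logging
import os

from fastapi import FastAPI, Request
from fastapi.middleware.cors import CORSMiddleware
from fastapi.responses import JSONResponse

logger = logging.getLogger(__name__)
app = FastAPI()

# Enumerate origins from configuration instead of "*"; fall back to localhost for dev.
allowed_origins = [
    origin.strip()
    for origin in os.getenv("CORS_ALLOWED_ORIGINS", "http://localhost:3000").split(",")
    if origin.strip()
]

app.add_middleware(
    CORSMiddleware,
    allow_origins=allowed_origins,
    allow_credentials=bool(allowed_origins),  # credentials only with explicit origins
    allow_methods=["*"],
    allow_headers=["*"],
)


@app.exception_handler(Exception)
async def global_exception_handler(request: Request, exc: Exception):
    # Full traceback stays in the logs; clients only see a generic payload.
    logger.exception("Unhandled error while processing %s", request.url.path)
    return JSONResponse(status_code=500, content={"error": "Internal server error"})
```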
In `@scrapper/models/job_schema.py`:
- Around line 74-77: The docs/examples disagree with the declared max for
limit_per_company (description says max 12 but examples send 50) — fix by
enforcing the contract and keeping docs in sync: change the type of
limit_per_company to use a constrained int (e.g., conint(le=12) or a pydantic
validator in the Job schema) so values above 12 raise validation errors, and
update the Field description/examples mentioned at lines ~88-89 to reflect the
same max (or vice-versa if intended max is 50 — then update the Field
description to "max: 50" consistently). Reference: limit_per_company (and the
other similar field noted at 88-89) so both validation and description/examples
match.
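A sketch of the stricter contract using a Pydantic field constraint; the request-model name is assumed, and the cap of 12 is only one of the two options the comment describes (the alternative is raising the bound and documenting 50 instead):

```python
from typing import Optional

from pydantic import BaseModel, Field


class IngestionRequest(BaseModel):  # model name assumed for illustration
    # Values above the cap now fail validation instead of silently passing through.
    limit_per_company: Optional[int] = Field(
        default=None,
        ge=1,
        le=12,
        description="Maximum jobs to return per company (max: 12)",
    )
```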
In `@scrapper/service/job_filter.py`:
- Around line 138-141: The current truthy check treats limit=0 as False and
returns all jobs; change the condition in the block that assigns top_jobs (using
variables sorted_jobs, top_jobs, limit) to explicitly check for None (e.g., if
limit is not None) so that an integer 0 correctly results in an empty list slice
(top_jobs = sorted_jobs[:limit]) and does not fall back to returning the full
sorted_jobs; optionally validate that limit is an int >= 0 before slicing to
avoid negative or non-integer behavior.
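A small sketch of the None-vs-zero distinction (the helper name is illustrative; the real code assigns top_jobs inline):

```python
def apply_limit(sorted_jobs: list, limit) -> list:
    """Return the top-K slice; None means no cap, 0 means an empty result."""
    if limit is None:
        return sorted_jobs
    if not isinstance(limit, int) or limit < 0:
        raise ValueError(f"limit must be a non-negative int, got {limit!r}")
    return sorted_jobs[:limit]


assert apply_limit(["a", "b", "c"], None) == ["a", "b", "c"]
assert apply_limit(["a", "b", "c"], 0) == []   # no longer falls back to the full list
assert apply_limit(["a", "b", "c"], 2) == ["a", "b"]
```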
In `@scrapper/service/scoring.py`:
- Around line 129-131: The current substring checks using keywords and
text_lower are too permissive; replace them with whole-word matching (e.g., use
regex word-boundaries or tokenization) so you only add a keyword to matched when
it appears as a distinct token. Concretely: for the loops that iterate over
keywords and check "if keyword in text_lower" (the matched set update using
variables keywords, text_lower, matched), change to either compile a
case-insensitive pattern using word boundaries around re.escape(keyword) and
test with re.search, or tokenize text_lower into words and check membership;
also treat very short keywords (<=2 chars) more strictly (require boundaries or
skip/validate) to avoid false positives. Apply the same change to the other
occurrences referenced (the checks at the other two locations).
- Around line 327-330: The "flexible remote" branch uses integer floor division
so remote_match_weight=1 yields zero; in the elif that checks user_remote_only
and job_remote (the block updating score and
job_score.breakdown["remote_flexible"]), change the calculation to ensure a
non-zero bonus for odd weights — e.g., use a proper half with rounding
(math.ceil(config.remote_match_weight / 2) or (config.remote_match_weight + 1)
// 2) or compute as float then cast — and apply that value both to score and
job_score.breakdown["remote_flexible"] instead of config.remote_match_weight //
2.
- Around line 271-273: Guard against None before constructing sets: replace the
direct conversions in scoring.py so user_skills and user_roles use a safe
default (e.g. user_skills = set(user_context.get("skills") or []) and user_roles
= set(user_context.get("preferred_roles") or [])); keep role_keywords logic
(role_keywords = user_roles if user_roles else config.role_keywords) so that an
empty or None input falls back to config.role_keywords.
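For the substring-matching finding above, a sketch of whole-word matching; the helper name and the decision to skip one-character keywords are illustrative:

```python
import re


def matched_keywords(keywords, text_lower: str) -> set:
    """Match keywords as whole tokens rather than raw substrings."""
    matched = set()
    for keyword in keywords:
        kw = keyword.lower().strip()
        if not kw or len(kw) < 2:
            continue  # single characters are almost always noise; skip them
        if re.search(rf"\b{re.escape(kw)}\b", text_lower):
            matched.add(keyword)
    return matched


# "go" no longer matches inside "golang"; "java" no longer matches "javascript".
assert matched_keywords(["go", "java"], "golang and javascript developer") == set()
assert matched_keywords(["go", "java"], "we use go and java daily") == {"go", "java"}
```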
In `@scrapper/sources/greenhouse.py`:
- Around line 43-46: The constructor currently creates a new HttpClient per
instance (in GreenhouseSource.__init__ via self.http_client = HttpClient()),
which can leak connection pools; change the constructor to accept a HttpClient
instance (e.g., http_client) and assign self.http_client = http_client (falling
back to the lifecycle-managed/shared client if none provided) so callers reuse
the shared, lifecycle-managed HttpClient instead of allocating a new one per
GreenhouseSource.
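A sketch of the constructor change only; the fallback to a privately created HttpClient is an assumption made to keep existing call sites working, and the rest of the class is omitted:

```python
from typing import Optional

from scrapper.utils.http_client import HttpClient


class GreenhouseSource:
    """Greenhouse source that reuses a lifecycle-managed HttpClient when provided."""

    def __init__(self, http_client: Optional[HttpClient] = None) -> None:
        # Prefer the shared client owned by the app lifecycle; only create a
        # private one as a fallback so existing call sites keep working.
        self.http_client = http_client or HttpClient()
```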
In `@scrapper/sources/README.md`:
- Around line 52-54: The README example raises SourceException with a single
positional argument but the SourceException constructor requires both source and
message; update the except block to call SourceException with both parameters
(e.g., SourceException(source="Workable", message=f"Failed to fetch from
Workable: {e}") or SourceException("Workable", f"Failed to fetch from Workable:
{e}")) so using the SourceException class signature matches its implementation.
In `@scrapper/tests/test_api.py`:
- Around line 34-71: The tests in scrapper/tests/test_api.py
(test_ingest_endpoint_no_request_body,
test_ingest_endpoint_response_schema_structure, and related ingestion tests) are
non-deterministic because they hit live downstream fetchers; replace that by
stubbing/mocking the downstream client/source used by the ingest endpoint (the
module/class/function the route calls to fetch companies/jobs) via the test
client/pytest monkeypatch or a fixture so the endpoint receives a deterministic
payload (e.g., a known companies/jobs response) and always returns 200; then
tighten assertions to require a 200 and validate the exact expected JSON schema
(presence and types for "total" and "jobs") instead of allowing transient
429/408/504—keep test_ingest_endpoint_invalid_source asserting 400 for unknown
sources.
In `@scrapper/tests/test_filtering.py`:
- Around line 233-244: The test test_score_combined is too permissive — replace
the loose assertion "assert scored.score >= 10" with an exact equality check to
catch weighting regressions (e.g. assert scored.score == 10); update the
assertion in scrapper/tests/test_filtering.py where score_job(test_job,
user_context) is invoked and the resulting scored.score is checked so the test
validates the precise combined total from score_job rather than allowing
accidental extra points.
In `@scrapper/utils/http_client.py`:
- Around line 95-97: The rate limiter is not concurrency-safe because
last_request_time is mutated without synchronization; add an asyncio.Lock (e.g.
self._rate_lock) in the HttpClient initializer and wrap the logic inside
_enforce_rate_limit with that lock: acquire the lock, compute now and required
wait = interval - (now - self.last_request_time), if wait > 0 await
asyncio.sleep(wait), then update self.last_request_time = time.monotonic() (or
now + wait) and release the lock so only one coroutine updates the timestamp at
a time; apply the same locking change wherever _enforce_rate_limit is called
(and ensure the lock is created and used consistently in the HttpClient class).
- Around line 45-46: The constructor currently computes
self.min_request_interval = 1.0 / requests_per_second which raises
ZeroDivisionError for requests_per_second <= 0; update the HttpClient __init__
to validate requests_per_second > 0 up front (e.g., if requests_per_second <= 0:
raise ValueError("requests_per_second must be > 0")) before computing
self.min_request_interval and setting self.last_request_time so initialization
fails fast with a clear error message.
In `@scrapper/utils/logger.py`:
- Around line 13-36: The JsonFormatter in scrapper/utils/logger.py currently
only serializes source/company/status/duration_ms/job_count and is dropping the
additional metadata emitted by log_ingest_start(), log_source_fetch(), and
log_ingest_complete() (ingest counters and error payloads); update JsonFormatter
to check for and include the ingest-related fields (e.g., any record attributes
like ingest_counters, ingest_stats, ingest_count, ingest_errors, error or
error_payload) and serialize them into log_data when present, or alternatively
adjust the helper functions
(log_ingest_start/log_source_fetch/log_ingest_complete) to use the existing
field names the formatter expects so the structured metadata is preserved.
Ensure you reference JsonFormatter in scrapper/utils/logger.py and the helper
functions log_ingest_start, log_source_fetch, and log_ingest_complete when
making the change.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: bf8965c8-ac6f-41f5-8116-cd19f1f98d04
📒 Files selected for processing (31)
IMPLEMENTATION_SUMMARY.md, scrapper/.gitignore, scrapper/README.md, scrapper/__init__.py, scrapper/api/__init__.py, scrapper/api/routes.py, scrapper/companies.json, scrapper/config/__init__.py, scrapper/config/loader.py, scrapper/conftest.py, scrapper/main.py, scrapper/models/__init__.py, scrapper/models/job_schema.py, scrapper/requirements.txt, scrapper/service/__init__.py, scrapper/service/job_filter.py, scrapper/service/scoring.py, scrapper/sources/README.md, scrapper/sources/__init__.py, scrapper/sources/base.py, scrapper/sources/greenhouse.py, scrapper/tests/__init__.py, scrapper/tests/conftest.py, scrapper/tests/test_api.py, scrapper/tests/test_filtering.py, scrapper/tests/test_greenhouse.py, scrapper/tests/test_sources.py, scrapper/utils/__init__.py, scrapper/utils/exceptions.py, scrapper/utils/http_client.py, scrapper/utils/logger.py
**Test Coverage**: 46/46 tests PASSING
**Date**: April 2026
Test-count section is internally inconsistent.
The headline says 46/46, but the listed file-level totals exceed that. Please reconcile these numbers so the summary remains reliable.
Also applies to: 1037-1042
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@IMPLEMENTATION_SUMMARY.md` around lines 6 - 7, The Test Coverage headline in
IMPLEMENTATION_SUMMARY.md ("Test Coverage: 46/46 tests PASSING") is inconsistent
with the listed file-level totals; update the document so the headline total
matches the sum of the individual file totals (or vice versa) by recalculating
and editing the "Test Coverage" line and the per-file counts, and also reconcile
the same mismatch noted around lines 1037-1042; ensure the final numeric total
and the numerator/denominator values are consistent across the summary.
errors: List[str] = []
Ingestion can return success even when all source fetches fail.
errors are collected, but never influence response status. If all companies fail, this still returns 200 with empty jobs.
Suggested fix
logger.info(
"Fetch and normalization completed",
extra={
"total_jobs_fetched": len(all_jobs),
"fetch_duration_ms": fetch_duration_ms
}
)
+
+ if errors and not all_jobs:
+ raise HTTPException(
+ status_code=502,
+ detail="All source fetches failed"
+ )

Also applies to: 127-143, 201-204
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@scrapper/api/routes.py` around lines 99 - 100, The handler collects an errors
list but never uses it to determine response status, so calls like the route
function that populate errors and jobs (look for the variables errors and jobs
in the route handler) can return 200 even if all source fetches failed; fix by
checking errors after processing: if len(errors) == len(companies) (i.e.,
no successful jobs) return an error status (e.g., 500 or 400) with the errors
payload; if some succeeded and some failed return a 207 Multi-Status (or 200
with a clear partial-success payload) including both jobs and errors; update all
similar blocks that populate errors/jobs (the other occurrences you flagged) to
implement this logic and replace the unconditional return of jobs with a
conditional return that sets the appropriate status and includes errors.
result_limit = request.limit_per_company
if result_limit is None:
    result_limit = DEFAULT_RESULT_LIMIT
elif result_limit > MAX_RESULT_LIMIT:
    result_limit = MAX_RESULT_LIMIT
    logger.info(
        f"Result limit capped at {MAX_RESULT_LIMIT} (requested: {request.limit_per_company})",
        extra={"requested": request.limit_per_company, "capped_at": MAX_RESULT_LIMIT}
    )

filter_result = filtering_service.filter_and_rank_jobs(
    all_jobs,
    user_context=user_context_dict,
    limit=result_limit
)
limit_per_company is applied as a global cap, not per company.
Current behavior limits final merged results, so requests spanning multiple companies can return far fewer jobs than implied by the field name/contract.
🧰 Tools
🪛 Ruff (0.15.12)
[warning] 171-171: Logging statement uses f-string
(G004)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@scrapper/api/routes.py` around lines 165 - 179, The code is treating
request.limit_per_company as a global cap on merged results; instead preserve it
as a per-company cap and enforce it during per-company selection/merging. Remove
the global application of result_limit before calling
filtering_service.filter_and_rank_jobs and instead pass the original
request.limit_per_company (validated/capped by MAX_RESULT_LIMIT if needed) as a
named per_company_limit argument to filtering_service.filter_and_rank_jobs (or
update filter_and_rank_jobs to accept per_company_limit), so filtering logic
(inside filter_and_rank_jobs) can limit each company's entries individually
before merging into the final result set; keep MAX_RESULT_LIMIT enforcement only
as validation of the requested per-company value (use request.limit_per_company
-> per_company_limit variable, cap that at MAX_RESULT_LIMIT, log if capped, then
pass per_company_limit into filter_and_rank_jobs).
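A sketch of what per-company capping inside filter_and_rank_jobs could look like; the job.company attribute and the assumption that each company's jobs are already relevance-ordered are not confirmed by the code shown here:

```python
from collections import defaultdict
from typing import Iterable, List, Optional


def cap_per_company(jobs: Iterable, per_company_limit: Optional[int]) -> List:
    """Keep at most per_company_limit jobs for each company before merging."""
    if per_company_limit is None:
        return list(jobs)
    kept: List = []
    counts: dict = defaultdict(int)
    for job in jobs:  # assumes each company's jobs arrive ordered by relevance
        if counts[job.company] < per_company_limit:
            kept.append(job)
            counts[job.company] += 1
    return kept
```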
except Exception as e:
    logger.error(
        f"Job ingestion failed: {str(e)}",
        extra={"error": str(e)}
    )
    raise HTTPException(status_code=500, detail=f"Job ingestion failed: {str(e)}")
Raw exception text is returned to clients in 500 responses.
This leaks internal failure details and can expose sensitive internals. Return a generic message and keep specifics in logs.
Suggested fix
- except Exception as e:
- logger.error(
- f"Job ingestion failed: {str(e)}",
- extra={"error": str(e)}
- )
- raise HTTPException(status_code=500, detail=f"Job ingestion failed: {str(e)}")
+ except Exception as e:
+ logger.exception("Job ingestion failed", extra={"error": str(e)})
+ raise HTTPException(status_code=500, detail="Job ingestion failed")

🧰 Tools
🪛 Ruff (0.15.12)
[warning] 208-208: Do not catch blind exception: Exception
(BLE001)
[warning] 209-212: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
[warning] 210-210: Logging statement uses f-string
(G004)
[warning] 210-210: Use explicit conversion flag
Replace with conversion flag
(RUF010)
[warning] 213-213: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling
(B904)
[warning] 213-213: Use explicit conversion flag
Replace with conversion flag
(RUF010)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@scrapper/api/routes.py` around lines 208 - 213, The except Exception as e
block in scrapper/api/routes.py currently returns the raw exception to clients;
instead, log the full error details (use logger.exception or logger.error with
exc_info=True) inside that except block to preserve stacktrace, and raise
HTTPException(status_code=500, detail="Internal server error") (or another
generic message) so clients do not receive internal exception text; keep
references to the existing logger and HTTPException symbols and replace the
current f"Job ingestion failed: {str(e)}" detail with the generic message while
retaining the detailed error in the logs.
| "greenhouse": [ | ||
| "stripe", | ||
| "notion", | ||
| "figma", | ||
| "airbnb", | ||
| "robinhood", | ||
| "coinbase", | ||
| "discord", | ||
| "dropbox", | ||
| "instacart", | ||
| "databricks", | ||
| "scaleai", | ||
| "brex", | ||
| "gusto", | ||
| "rippling", | ||
| "benchling", | ||
| "plaid", | ||
| "asana", | ||
| "intercom", | ||
| "zapier", | ||
| "segment", | ||
| "cloudflare", | ||
| "hashicorp", | ||
| "snowflake", | ||
| "datadog", | ||
| "mongodb", | ||
| "elastic", | ||
| "fastly", | ||
| "canva", | ||
| "wise", | ||
| "revolut", | ||
| "klarna", | ||
| "n26", | ||
| "razorpay", | ||
| "cred", | ||
| "meesho", | ||
| "groww", | ||
| "zerodha", | ||
| "druva", | ||
| "digicert", | ||
| "stabilityai", | ||
| "freshworks", | ||
| "chargebee", | ||
| "browserstack", | ||
| "postman", | ||
| "inmobi", | ||
| "unacademy", | ||
| "sharechat", | ||
| "spinny", | ||
| "urbancompany", | ||
| "github", | ||
| "gitlab", | ||
| "slack", | ||
| "twilio", | ||
| "stripe", | ||
| "square", | ||
| "shopify", | ||
| "hashicorp", | ||
| "terraform", | ||
| "datadog", | ||
| "newrelic", | ||
| "splunk", | ||
| "salesforce", | ||
| "hubspot", | ||
| "zendesk", | ||
| "okta", | ||
| "auth0", | ||
| "twitch", | ||
| "reddit", | ||
| "pinterest", | ||
| "medium", | ||
| "substack", | ||
| "patreon", | ||
| "kickstarter", | ||
| "indiegogo", | ||
| "pebble", | ||
| "fitbit", | ||
| "garmin", | ||
| "sonos", | ||
| "oculus", | ||
| "htc", | ||
| "samsung", | ||
| "apple", | ||
| "google", | ||
| "microsoft", | ||
| "amazon", | ||
| "meta", | ||
| "netflix", | ||
| "disney", | ||
| "hulu", | ||
| "paramount", | ||
| "peacock", | ||
| "cbs", | ||
| "hbo", | ||
| "showtime", | ||
| "starz", | ||
| "apple-tv", | ||
| "youtube", | ||
| "twitch", | ||
| "dailymotion", | ||
| "vimeo", | ||
| "flickr", | ||
| "imgur", | ||
| "giphy", | ||
| "tenor", | ||
| "pinterest", | ||
| "tumblr", | ||
| "wix", | ||
| "squarespace", | ||
| "weebly", | ||
| "godaddy", | ||
| "bluehost", | ||
| "hostgator", | ||
| "namecheap", | ||
| "domain-com", | ||
| "aws", | ||
| "azure", | ||
| "gcp", | ||
| "digitalocean", | ||
| "heroku", | ||
| "vercel", | ||
| "netlify", | ||
| "render", | ||
| "fly-io", | ||
| "railway", | ||
| "dokku", | ||
| "linode", | ||
| "vultr", | ||
| "lightsail", | ||
| "rackspace", | ||
| "openstack", | ||
| "kubernetes", | ||
| "docker", | ||
| "jenkins", | ||
| "gitlab-ci", | ||
| "github-actions", | ||
| "circleci", | ||
| "travis-ci", | ||
| "appveyor", | ||
| "buildkite", | ||
| "codefresh", | ||
| "drone", | ||
| "harness", | ||
| "atlassian", | ||
| "jira", | ||
| "confluence", | ||
| "bitbucket", | ||
| "trello", | ||
| "asana", | ||
| "monday", | ||
| "notion", | ||
| "clickup", | ||
| "meistertask", | ||
| "wrike", | ||
| "smartsheet", | ||
| "airtable", | ||
| "typeform", | ||
| "typebot", | ||
| "jotform", | ||
| "formstack", | ||
| "wufoo", | ||
| "surveysparrow", | ||
| "qualtrics", | ||
| "alchemer", | ||
| "calendly", | ||
| "acuityscheduling", | ||
| "vcita", | ||
| "booksy", | ||
| "mindbody", | ||
| "maroochy", | ||
| "zoho", | ||
| "pipedrive", | ||
| "copper", | ||
| "freshsales", | ||
| "agilecrm", | ||
| "insightly", | ||
| "zohocrm", | ||
| "dynamic365", | ||
| "salesforcecrm", | ||
| "mailchimp", | ||
| "constantcontact", | ||
| "convertkit", | ||
| "activecampaign", | ||
| "klaviyo", | ||
| "braze", | ||
| "iterable", | ||
| "customer-io", | ||
| "amplitude", | ||
| "mixpanel", | ||
| "heap", | ||
| "fullstory", | ||
| "logrocket", | ||
| "sentry", | ||
| "rollbar", | ||
| "bugsnag", | ||
| "appinsights" | ||
| ] |
Remove the repeated company slugs.
Because load_companies() preserves this list as-is, the repeated entries here will be scraped multiple times and can produce duplicate jobs plus extra network traffic. If the duplicates are intentional, please document the weighting explicitly; otherwise dedupe the source list before ingestion.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@scrapper/companies.json` around lines 2 - 198, The companies list in
scrapper/companies.json contains duplicate slugs (e.g., "stripe", "hashicorp",
"datadog", "pinterest", "asana", "notion", etc.) which will produce duplicate
scrapes; remove duplicate entries from the JSON source so each company slug
appears only once, and also harden load_companies() to dedupe on ingestion
(e.g., pass the loaded list through dict.fromkeys() to drop repeats while preserving order) to
prevent future duplicates slipping in.
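In Python terms, the ingestion-side hardening could look like this sketch (the file path and the exact shape of companies.json beyond a source-to-slug-list mapping are assumptions):

```python
import json
from pathlib import Path


def load_companies(path: str = "scrapper/companies.json") -> dict:
    """Load the source -> company-slug mapping, dropping repeated slugs per source."""
    data = json.loads(Path(path).read_text())
    # dict.fromkeys keeps the first occurrence of each slug and preserves order,
    # unlike a plain set(), which would shuffle the list.
    return {source: list(dict.fromkeys(slugs)) for source, slugs in data.items()}
```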
def test_ingest_endpoint_no_request_body(client):
    """Test ingest endpoint with no request body - should work with live companies.json."""
    # This test uses the real companies.json file
    response = client.post("/internal/ingest")

    # Should succeed (200) even if it gets 0 jobs
    # OR might get rate limited (429) or timeout, but not 500
    assert response.status_code in [200, 429, 408, 504]
    if response.status_code == 200:
        data = response.json()
        assert "total" in data
        assert "jobs" in data


def test_ingest_endpoint_invalid_source(client):
    """Test ingest with invalid source."""
    response = client.post(
        "/internal/ingest",
        json={"sources": ["invalid_source"]}
    )

    assert response.status_code == 400
    assert "Unknown source" in response.json()["detail"]


def test_ingest_endpoint_response_schema_structure(client):
    """Test that response structure is correct."""
    response = client.post("/internal/ingest", json={"companies": ["stripe"]})

    # Either success or expected error (not 500)
    if response.status_code == 200:
        data = response.json()

        # Check response schema
        assert "total" in data
        assert "jobs" in data
        assert isinstance(data["total"], int)
        assert isinstance(data["jobs"], list)
Make the ingest tests deterministic.
Both ingest checks still depend on live downstream fetching and allow transient failures to count as success, so CI can go green even when /internal/ingest is timing out or returning no usable payload. Stub the downstream client/source here and fail unless the endpoint returns the expected 200 payload and schema.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@scrapper/tests/test_api.py` around lines 34 - 71, The tests in
scrapper/tests/test_api.py (test_ingest_endpoint_no_request_body,
test_ingest_endpoint_response_schema_structure, and related ingestion tests) are
non-deterministic because they hit live downstream fetchers; replace that by
stubbing/mocking the downstream client/source used by the ingest endpoint (the
module/class/function the route calls to fetch companies/jobs) via the test
client/pytest monkeypatch or a fixture so the endpoint receives a deterministic
payload (e.g., a known companies/jobs response) and always returns 200; then
tighten assertions to require a 200 and validate the exact expected JSON schema
(presence and types for "total" and "jobs") instead of allowing transient
429/408/504—keep test_ingest_endpoint_invalid_source asserting 400 for unknown
sources.
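A sketch of how the stubbing could look with pytest's monkeypatch; the GreenhouseSource.fetch_jobs patch point is an assumption about how the route resolves its sources, and the empty payload keeps the example independent of the normalization model:

```python
import pytest


@pytest.fixture
def stub_greenhouse(monkeypatch):
    """Force the Greenhouse source to return a fixed payload instead of hitting the network."""
    from scrapper.sources.greenhouse import GreenhouseSource

    async def fake_fetch_jobs(self, company, *args, **kwargs):
        return []  # deterministic: no downstream HTTP, no rate limits, no timeouts

    monkeypatch.setattr(GreenhouseSource, "fetch_jobs", fake_fetch_jobs)


def test_ingest_endpoint_response_schema_structure(client, stub_greenhouse):
    response = client.post("/internal/ingest", json={"companies": ["stripe"]})

    assert response.status_code == 200          # no transient 429/408/504 escape hatch
    data = response.json()
    assert isinstance(data["total"], int)
    assert isinstance(data["jobs"], list)
```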
def test_score_combined(self, test_job):
    """Test combined scoring."""
    user_context = {
        "preferred_roles": ["backend"],
        "skills": ["python", "go"],
        "preferred_location": "San Francisco",
        "remote_only": True
    }
    scored = score_job(test_job, user_context)

    # Should have: +3 (title) + +2 (desc) + +3 (skills) + +1 (location) + +1 (remote) = 10
    assert scored.score >= 10
Assert the exact combined score here.
>= 10 lets accidental bonus points or double-counting slip through without failing the test, so this won't catch regressions in the scoring weights.
✅ Suggested fix
- assert scored.score >= 10
+ assert scored.score == 10
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@scrapper/tests/test_filtering.py` around lines 233 - 244, The test
test_score_combined is too permissive — replace the loose assertion "assert
scored.score >= 10" with an exact equality check to catch weighting regressions
(e.g. assert scored.score == 10); update the assertion in
scrapper/tests/test_filtering.py where score_job(test_job, user_context) is
invoked and the resulting scored.score is checked so the test validates the
precise combined total from score_job rather than allowing accidental extra
points.
self.min_request_interval = 1.0 / requests_per_second
self.last_request_time = None
Validate requests_per_second before interval math
requests_per_second <= 0 currently crashes with a ZeroDivisionError during initialization instead of a clear config error. Fail fast with an explicit validation exception.
Suggested fix
self.retry_backoff_factor = retry_backoff_factor
self.requests_per_second = requests_per_second
+ if self.requests_per_second <= 0:
+ raise ValueError("requests_per_second must be > 0")
# Rate limiting
self.min_request_interval = 1.0 / requests_per_second

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@scrapper/utils/http_client.py` around lines 45 - 46, The constructor
currently computes self.min_request_interval = 1.0 / requests_per_second which
raises ZeroDivisionError for requests_per_second <= 0; update the HttpClient
__init__ to validate requests_per_second > 0 up front (e.g., if
requests_per_second <= 0: raise ValueError("requests_per_second must be > 0"))
before computing self.min_request_interval and setting self.last_request_time so
initialization fails fast with a clear error message.
# Rate limiting: enforce minimum interval between requests
await self._enforce_rate_limit()
Rate limiter is not concurrency-safe
The current rate-limit state (last_request_time) is mutated without a lock, so concurrent requests can pass the check together and exceed the configured RPS. This breaks throttling guarantees under load.
Suggested fix
class HttpClient:
@@
- self.last_request_time = None
+ self.last_request_time = None
+ self._rate_limit_lock = asyncio.Lock()
@@
async def _enforce_rate_limit(self):
"""Enforce rate limiting by enforcing minimum interval between requests."""
- if self.last_request_time is not None:
- elapsed = (datetime.now() - self.last_request_time).total_seconds()
- if elapsed < self.min_request_interval:
- await asyncio.sleep(self.min_request_interval - elapsed)
-
- self.last_request_time = datetime.now()
+ async with self._rate_limit_lock:
+ if self.last_request_time is not None:
+ elapsed = (datetime.now() - self.last_request_time).total_seconds()
+ if elapsed < self.min_request_interval:
+ await asyncio.sleep(self.min_request_interval - elapsed)
+ self.last_request_time = datetime.now()

Also applies to: 148-155
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@scrapper/utils/http_client.py` around lines 95 - 97, The rate limiter is not
concurrency-safe because last_request_time is mutated without synchronization;
add an asyncio.Lock (e.g. self._rate_lock) in the HttpClient initializer and
wrap the logic inside _enforce_rate_limit with that lock: acquire the lock,
compute now and required wait = interval - (now - self.last_request_time), if
wait > 0 await asyncio.sleep(wait), then update self.last_request_time =
time.monotonic() (or now + wait) and release the lock so only one coroutine
updates the timestamp at a time; apply the same locking change wherever
_enforce_rate_limit is called (and ensure the lock is created and used
consistently in the HttpClient class).
log_data = {
    "timestamp": datetime.utcnow().isoformat(),
    "level": record.levelname,
    "module": record.name,
    "message": record.getMessage(),
}

# Add extra fields if present
if hasattr(record, "source"):
    log_data["source"] = record.source
if hasattr(record, "company"):
    log_data["company"] = record.company
if hasattr(record, "status"):
    log_data["status"] = record.status
if hasattr(record, "duration_ms"):
    log_data["duration_ms"] = record.duration_ms
if hasattr(record, "job_count"):
    log_data["job_count"] = record.job_count

# Add exception info if present
if record.exc_info:
    log_data["exception"] = self.formatException(record.exc_info)

return json.dumps(log_data)
Preserve the structured fields emitted by the helpers.
JsonFormatter only serializes source/company/status/duration_ms/job_count, so the new ingest counters and error payloads added by log_ingest_start(), log_source_fetch(), and log_ingest_complete() are silently dropped. Expand the formatter or make the helper payloads match the formatter so the JSON logs actually contain the metadata they advertise.
🔧 Suggested fix
- # Add extra fields if present
- if hasattr(record, "source"):
- log_data["source"] = record.source
- if hasattr(record, "company"):
- log_data["company"] = record.company
- if hasattr(record, "status"):
- log_data["status"] = record.status
- if hasattr(record, "duration_ms"):
- log_data["duration_ms"] = record.duration_ms
- if hasattr(record, "job_count"):
- log_data["job_count"] = record.job_count
+ for field in (
+ "source",
+ "company",
+ "status",
+ "duration_ms",
+ "job_count",
+ "sources",
+ "companies",
+ "total_jobs",
+ "errors",
+ "error",
+ ):
+ if hasattr(record, field):
+ log_data[field] = getattr(record, field)

Also applies to: 62-112
🧰 Tools
🪛 Ruff (0.15.12)
[warning] 14-14: datetime.datetime.utcnow() used
(DTZ003)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@scrapper/utils/logger.py` around lines 13 - 36, The JsonFormatter in
scrapper/utils/logger.py currently only serializes
source/company/status/duration_ms/job_count and is dropping the additional
metadata emitted by log_ingest_start(), log_source_fetch(), and
log_ingest_complete() (ingest counters and error payloads); update JsonFormatter
to check for and include the ingest-related fields (e.g., any record attributes
like ingest_counters, ingest_stats, ingest_count, ingest_errors, error or
error_payload) and serialize them into log_data when present, or alternatively
adjust the helper functions
(log_ingest_start/log_source_fetch/log_ingest_complete) to use the existing
field names the formatter expects so the structured metadata is preserved.
Ensure you reference JsonFormatter in scrapper/utils/logger.py and the helper
functions log_ingest_start, log_source_fetch, and log_ingest_complete when
making the change.
TODOS
Summary by CodeRabbit
Release Notes
New Features
- /health endpoint to monitor service status and available job sources
- /internal/ingest endpoint to fetch, normalize, and rank jobs by relevance, with optional filtering by sources, companies, and limits

Documentation