Normalize OWASP cheat sheet references#865
Conversation
|
Requesting kind reviews and feedback for this feature from : @northdpole , @Pa04rth , @robvanderveer |
ef1baee to
901c9a0
Compare
10585d9 to
38051c9
Compare
386a106 to
20b0aab
Compare
bb7b759 to
b08e8e5
Compare
b08e8e5 to
2b18fb2
Compare
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yml Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (7)
✅ Files skipped from review due to trivial changes (1)
🚧 Files skipped from review as they are similar to previous changes (4)
Summary by CodeRabbit
WalkthroughAdds bundled JSON datasets and parser implementations for multiple OWASP standards (AISVS, API Top10 2023, Kubernetes Top10 2022/2025, LLM Top10 2025, Top10 2025), enhances the Cheatsheets parser with supplemental data, updates unit tests for each parser, and adds CLI flags to import the datasets. ChangesOWASP External Project Parser Expansion
🎯 3 (Moderate) | ⏱️ ~25 minutes 🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 10
🧹 Nitpick comments (2)
application/utils/external_project_parsers/parsers/owasp_llm_top10_2025.py (1)
19-47: 🏗️ Heavy liftConsider extracting shared parse logic into a base class.
This
parseimplementation is identical toowasp_aisvs.pyand the other standard parsers (API Top 10, Kubernetes 2022, Top 10 2025) — onlynameanddata_filediffer. A small base class holding the load-JSON / build-Standard/ link-CRE loop, with subclasses supplying justname/data_file, would remove ~5 copies and keep behavior consistent as the logic evolves.♻️ Sketch of a shared base parser
class _JsonStandardParser(ParserInterface): name: str data_file: Path def parse(self, cache: db.Node_collection, ph: prompt_client.PromptHandler): with self.data_file.open("r", encoding="utf-8") as handle: raw_entries = json.load(handle) entries = [] for entry in raw_entries: standard = defs.Standard( name=self.name, sectionID=entry["section_id"], section=entry["section"], hyperlink=entry["hyperlink"], ) for cre_id in entry.get("cre_ids", []): cres = cache.get_CREs(external_id=cre_id) if not cres: continue standard.add_link( defs.Link( ltype=defs.LinkTypes.LinkedTo, document=cres[0].shallow_copy(), ) ) entries.append(standard) return ParseResult( results={self.name: entries}, calculate_gap_analysis=False, calculate_embeddings=False, )🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@application/utils/external_project_parsers/parsers/owasp_llm_top10_2025.py` around lines 19 - 47, Create a shared base parser class (e.g., _JsonStandardParser) that implements the parse method and encapsulates the JSON load + build-Standard + link-CRE loop currently duplicated in owasp_llm_top10_2025.py: move the logic that opens self.data_file, iterates raw_entries, constructs defs.Standard (using name, section_id, section, hyperlink), calls cache.get_CREs and standard.add_link with defs.Link(defs.LinkTypes.LinkedTo, cres[0].shallow_copy()), and returns the ParseResult into the base class; then have the existing parser classes simply subclass _JsonStandardParser and define their name and data_file attributes so their parse methods can be removed.application/utils/external_project_parsers/parsers/owasp_top10_2025.py (1)
13-47: 🏗️ Heavy liftConsider consolidating the duplicated JSON-driven parsers.
This module is essentially identical to
owasp_api_top10_2023.py(and, per the PR stack, the AISVS / Kubernetes 2022 / LLM 2025 parsers) — onlynameanddata_filediffer. A single data-driven base parser parameterized byname+data_filewould remove ~5x duplication and keep linking/ParseResultbehavior in one place.♻️ Sketch of a shared base parser
class _JsonStandardParser(ParserInterface): name: str data_file: Path def parse(self, cache: db.Node_collection, ph: prompt_client.PromptHandler): with self.data_file.open("r", encoding="utf-8") as handle: raw_entries = json.load(handle) entries = [] for entry in raw_entries: standard = defs.Standard( name=self.name, sectionID=entry["section_id"], section=entry["section"], hyperlink=entry["hyperlink"], ) for cre_id in entry.get("cre_ids", []): cres = cache.get_CREs(external_id=cre_id) if not cres: continue standard.add_link( defs.Link(ltype=defs.LinkTypes.LinkedTo, document=cres[0].shallow_copy()) ) entries.append(standard) return ParseResult( results={self.name: entries}, calculate_gap_analysis=False, calculate_embeddings=False, ) class OwaspTop10_2025(_JsonStandardParser): name = "OWASP Top 10 2025" data_file = Path(__file__).resolve().parent.parent / "data" / "owasp_top10_2025.json"🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@application/utils/external_project_parsers/parsers/owasp_top10_2025.py` around lines 13 - 47, Consolidate duplicated JSON-driven parsers by introducing a shared base class (e.g. _JsonStandardParser inheriting ParserInterface) that implements parse(...) using self.name and self.data_file, move the JSON loading, Standard creation, CRE linking (using cache.get_CREs and defs.Link/defs.LinkTypes.LinkedTo) and ParseResult construction into that base, then change OwaspTop10_2025 to subclass the base and only set name = "OWASP Top 10 2025" and data_file = Path(...)/"owasp_top10_2025.json"; ensure existing symbols used in callers (OwaspTop10_2025.parse behavior, defs.Standard, ParseResult) remain unchanged.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@application/tests/owasp_aisvs_parser_test.py`:
- Around line 51-62: The list comprehensions use the ambiguous loop variable
name `l` causing a linter warning; update the comprehensions in the test (the
expressions `[l.document.id for l in entries[0].links]` and `[l.document.id for
l in entries[-1].links]`) to a clearer name like `link` (e.g. `[link.document.id
for link in entries[0].links]`) so the variables remain descriptive while
preserving the same behavior and references to `entries`, `.links`, and
`.document.id`.
In `@application/tests/owasp_api_top10_2023_parser_test.py`:
- Around line 39-43: The test uses the ambiguous loop variable name `l` in two
list comprehensions ([l.document.id for l in entries[0].links] and
[l.document.id for l in entries[-1].links]) which triggers Ruff E741; rename `l`
to a clearer name like `link` in both occurrences to match the sibling test
(owasp_top10_2025_parser_test.py) and satisfy the linter while keeping the
expressions behavior unchanged.
In `@application/tests/owasp_kubernetes_top10_2022_parser_test.py`:
- Around line 41-45: The list comprehensions in the test use the ambiguous loop
variable name "l" which Ruff flags; update both comprehensions to use a
descriptive name like "link" (e.g., [link.document.id for link in
entries[0].links] and [link.document.id for link in entries[-1].links]) so the
variables referenced around entries and links/document.id are clear and the lint
error (Ruff E741) is resolved.
In `@application/tests/owasp_kubernetes_top10_2025_parser_test.py`:
- Around line 45-52: The test uses an ambiguous single-letter loop variable `l`
in list comprehensions ([l.document.id for l in entries[0].links] and
[l.document.id for l in entries[-1].links]); rename `l` to a clearer name like
`link` in both comprehensions so they become [link.document.id for link in
entries[0].links] and [link.document.id for link in entries[-1].links] (update
occurrences in owasp_kubernetes_top10_2025_parser_test.py where `entries` and
`.links` are inspected).
In `@application/tests/owasp_llm_top10_2025_parser_test.py`:
- Around line 40-45: Rename the ambiguous loop variable `l` to `link` in the
three list comprehensions that build document id lists from entries[*].links
(i.e., change `[l.document.id for l in entries[0].links]`, `[l.document.id for l
in entries[4].links]`, and `[l.document.id for l in entries[-1].links]` to use
`link` instead of `l`) so the comprehensions read `[link.document.id for link in
entries[...] .links]`, preserving the existing assertions and behavior.
In `@application/utils/external_project_parsers/data/owasp_aisvs_1_0.json`:
- Line 5: The hyperlink value for the JSON key "hyperlink" uses a /tree/ URL;
update that string to the canonical /blob/ form (use
/blob/main/0x10-C01-Training-Data-Governance.md) so the OWASP markdown file link
does not rely on GitHub redirects and matches the project's OWASP reference
normalization; locate the "hyperlink" entry in owasp_aisvs_1_0.json and change
the path accordingly.
In
`@application/utils/external_project_parsers/data/owasp_kubernetes_top10_2022.json`:
- Around line 50-55: The entry with "section_id": "K09" currently has the same
cre_ids as "K01" which is likely incorrect; locate the JSON object where
"section_id": "K09" and either replace the duplicated cre_ids value ["233-748",
"486-813"] with the correct CRE id array for K09 or, if the duplication was
intentional, add a clarifying comment in your PR message confirming that K09
intentionally maps to the same CREs as K01; ensure you update the "cre_ids"
field for the K09 object (or document the intent) so the mapping is unambiguous.
In `@application/utils/external_project_parsers/parsers/cheatsheets_parser.py`:
- Around line 132-136: The deduplicate_entries method currently replaces earlier
defs.Standard objects when a (entry.section, entry.hyperlink) key repeats;
change it to merge duplicates instead: in deduplicate_entries iterate entries,
use the (section, hyperlink) key to find existing entries and, when found, union
the relevant list/collection fields (at minimum entry.links) into the existing
defs.Standard object (avoiding duplicates), and merge any other supplemental
fields that should be combined (e.g., tags or metadata) rather than overwriting;
return the values of this merged map. Ensure you keep the original defs.Standard
objects for the first-seen entry and update them in-place to preserve other
fields.
- Around line 116-129: The loop currently swallows all exceptions from
cs.add_link (for cre in cres) which hides broken CRE payloads or Link contract
changes; change the bare except to catch exceptions but log them with context
(include cre_id and cre identifiers and the exception stack/str) and set a local
failure flag when any add_link fails; after processing cres, only append cs to
standard_entries if cs.links is non-empty and no add_link failures occurred (use
symbols: cs.add_link, cache.get_CREs, defs.Link,
defs.LinkTypes.AutomaticallyLinkedTo) so parse failures are visible and faulty
cheatsheets aren’t treated as successful.
- Around line 54-67: The try currently wraps both git.clone(...) and
register_cheatsheets(...), so move the try/except to only cover git.clone
(git.clone), logging the clone-specific warning via self.logger.warning on
failure, and ensure that register_cheatsheets(repo=..., cache=...,
cheatsheets_path=..., repo_path=...) is called outside that try block only when
clone succeeds; also ensure cheatsheets is initialized (e.g., empty list) before
extending with self.register_supplemental_cheatsheets(cache=...) so that a
parsing/runtime error inside register_cheatsheets is not swallowed and will
surface rather than silently falling back to supplemental data.
---
Nitpick comments:
In `@application/utils/external_project_parsers/parsers/owasp_llm_top10_2025.py`:
- Around line 19-47: Create a shared base parser class (e.g.,
_JsonStandardParser) that implements the parse method and encapsulates the JSON
load + build-Standard + link-CRE loop currently duplicated in
owasp_llm_top10_2025.py: move the logic that opens self.data_file, iterates
raw_entries, constructs defs.Standard (using name, section_id, section,
hyperlink), calls cache.get_CREs and standard.add_link with
defs.Link(defs.LinkTypes.LinkedTo, cres[0].shallow_copy()), and returns the
ParseResult into the base class; then have the existing parser classes simply
subclass _JsonStandardParser and define their name and data_file attributes so
their parse methods can be removed.
In `@application/utils/external_project_parsers/parsers/owasp_top10_2025.py`:
- Around line 13-47: Consolidate duplicated JSON-driven parsers by introducing a
shared base class (e.g. _JsonStandardParser inheriting ParserInterface) that
implements parse(...) using self.name and self.data_file, move the JSON loading,
Standard creation, CRE linking (using cache.get_CREs and
defs.Link/defs.LinkTypes.LinkedTo) and ParseResult construction into that base,
then change OwaspTop10_2025 to subclass the base and only set name = "OWASP Top
10 2025" and data_file = Path(...)/"owasp_top10_2025.json"; ensure existing
symbols used in callers (OwaspTop10_2025.parse behavior, defs.Standard,
ParseResult) remain unchanged.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yml
Review profile: CHILL
Plan: Pro
Run ID: 1cbdca8b-5833-42a4-97b8-1f4c5c4727a6
📒 Files selected for processing (22)
application/tests/cheatsheets_parser_test.pyapplication/tests/owasp_aisvs_parser_test.pyapplication/tests/owasp_api_top10_2023_parser_test.pyapplication/tests/owasp_kubernetes_top10_2022_parser_test.pyapplication/tests/owasp_kubernetes_top10_2025_parser_test.pyapplication/tests/owasp_llm_top10_2025_parser_test.pyapplication/tests/owasp_top10_2025_parser_test.pyapplication/utils/external_project_parsers/data/owasp_aisvs_1_0.jsonapplication/utils/external_project_parsers/data/owasp_api_top10_2023.jsonapplication/utils/external_project_parsers/data/owasp_cheatsheets_supplement.jsonapplication/utils/external_project_parsers/data/owasp_kubernetes_top10_2022.jsonapplication/utils/external_project_parsers/data/owasp_kubernetes_top10_2025.jsonapplication/utils/external_project_parsers/data/owasp_llm_top10_2025.jsonapplication/utils/external_project_parsers/data/owasp_top10_2025.jsonapplication/utils/external_project_parsers/parsers/cheatsheets_parser.pyapplication/utils/external_project_parsers/parsers/owasp_aisvs.pyapplication/utils/external_project_parsers/parsers/owasp_api_top10_2023.pyapplication/utils/external_project_parsers/parsers/owasp_kubernetes_top10_2022.pyapplication/utils/external_project_parsers/parsers/owasp_kubernetes_top10_2025.pyapplication/utils/external_project_parsers/parsers/owasp_llm_top10_2025.pyapplication/utils/external_project_parsers/parsers/owasp_top10_2025.pycre.py


Summary
This PR addresses part of issue #471 by normalizing OWASP cheat sheet references used by the OWASP resource mapping flow.
This is the third upstream PR in the stacked #471 review series.
Problem Fixed
Cheat sheet references were inconsistent, which made later mapping and analysis less reliable.
Solution
Normalized cheat sheet references and updated parser expectations/tests.
Tests
Why this is split out
The full #471 work is too large to review effectively as one PR.
This PR isolates one OWASP resource family so the parser/data model can be reviewed independently before the later Kubernetes, cheat sheet, backend analysis, and frontend changes.