Summary
Add a composable, chainable query API on top of the existing CLDK analysis façade — the equivalent of what DataFrame.groupby(...).filter(...).agg(...) is to pandas. Today, answering a security/code-analysis question with CLDK requires manually composing the results of multiple get_* calls in user code; this proposal moves that composition into the library so the queries themselves become the unit of citation.
The dream
(pa.callables()
    .with_decorator("http.route", auth="public")
    .reachable_to(sink_pattern("request.env[*].sudo().*"))
    .without_passing_through(sanitizers=["check_access", "has_group"]))
This single chain expresses: "every callable that is decorated as a public HTTP route, from which there exists a call/dataflow path to a sink matching this attribute-chain pattern, where no callable on the path is one of these named sanitizers."
For framework-shaped audits (web, RPC, CLI, message handlers), one such chain replaces dozens of hand-written analyses. The result is a citable fact: the chain itself is the evidence, not the prose around it.
Motivation (from a real audit)
I recently used CLDK to build a Proof-of-Exploitability report over 12 alerts in an Odoo addon, following a sources.json → reachability → taint → verdict pipeline (see the poe-with-cldk skill methodology). The most valuable thing CLDK gave me was authoritative decorator capture via pa.get_method(...).decorators — that single fact (auth='public' on a route) carried the entire severity of the top finding.
The bottleneck was composition: I had to manually walk get_methods(), filter by decorator predicate by reading the decorators list myself, then for each match run get_callers/get_callees to test reachability, then re-read the source body to check for sanitizers. That composition is the analysis. It belongs in the library, not in every user's notebook.
What the proposal contains
1. A Query / Selector chain over CLDK's existing graphs
Methods to start a query:
pa.callables() — every PyCallable in the analysis
pa.classes(), pa.fields(), pa.modules() — analogous starting points
pa.callsites(pattern) — every call site matching a structural pattern (sidesteps the dynamic-dispatch problem; see §3 below)
Methods to filter (return a narrowed query):
.with_decorator(name, **kwarg_predicates) — match decorator by name and named-argument value (auth="public")
.in_module(glob), .in_class(name), .in_directory(path)
.matching(name_pattern) — name/signature regex
.modified_since(commit_ref) — git-blame integration (later)
.not_in_tests() — convention-based test exclusion
.where(predicate) — escape hatch for arbitrary user predicates
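To make the filter semantics concrete, here is a minimal sketch of how such a chain could behave. It runs over toy records, not CLDK's real models: `ToyCallable`, `Decorator`, and this `Query` class are illustrative stand-ins, and only three of the proposed filters are shown.

```python
from dataclasses import dataclass, field
from fnmatch import fnmatch
from typing import Dict, List


@dataclass
class Decorator:
    name: str
    kwargs: Dict[str, object] = field(default_factory=dict)


@dataclass
class ToyCallable:
    """Hypothetical stand-in for a CLDK callable record."""
    signature: str
    module: str
    decorators: List[Decorator] = field(default_factory=list)


class Query:
    """Each filter returns a new Query; nothing runs until a terminator."""

    def __init__(self, items, predicates=()):
        self._items = list(items)
        self._predicates = tuple(predicates)

    def where(self, predicate):
        return Query(self._items, self._predicates + (predicate,))

    def with_decorator(self, name, **kwarg_predicates):
        def has_decorator(c):
            return any(
                d.name == name
                and all(d.kwargs.get(k) == v for k, v in kwarg_predicates.items())
                for d in c.decorators
            )
        return self.where(has_decorator)

    def in_module(self, glob):
        return self.where(lambda c: fnmatch(c.module, glob))

    def signatures(self):
        # Terminator: apply every accumulated predicate in order.
        return [c.signature for c in self._items
                if all(p(c) for p in self._predicates)]


callables = [
    ToyCallable("notes.controllers.show", "notes.controllers",
                [Decorator("http.route", {"auth": "public"})]),
    ToyCallable("notes.controllers.edit", "notes.controllers",
                [Decorator("http.route", {"auth": "user"})]),
    ToyCallable("notes.models.helper", "notes.models"),
]

public_routes = (
    Query(callables)
    .with_decorator("http.route", auth="public")
    .in_module("notes.*")
    .signatures()
)
print(public_routes)
```

Because `.where(...)` is the primitive that the other filters build on, the escape hatch and the named filters compose in the same way.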
Methods to relate (return a query over a related set):
.callers(), .callees() — one hop in the call graph
.transitive_callers(), .transitive_callees() — closure
.reachable_to(other_query, *, via=["call", "dataflow"], sanitizers=None) — path existence between two queries with optional sanitizer-mask
.without_passing_through(sanitizers=[...]) — narrow a reachability set by removing paths that touch any sanitizer
.subclasses(), .superclasses(), .implementers()
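The sanitizer mask can be read as "search the call graph, but never step through a sanitizer". The sketch below assumes a call graph given as a plain dict and models sanitizers as nodes on the path; `find_path` and `CALL_GRAPH` are hypothetical names, and a real implementation would also need to handle dataflow edges.

```python
from collections import deque


def find_path(graph, source, sinks, sanitizers=()):
    """Breadth-first search for a shortest call path from source to any sink
    that never passes through a sanitizer. Returns the path or None."""
    blocked = set(sanitizers)
    if source in blocked:
        return None
    queue = deque([[source]])
    seen = {source}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node in sinks:
            return path
        for callee in graph.get(node, ()):
            if callee in seen or callee in blocked:
                continue  # already explored, or masked out by a sanitizer
            seen.add(callee)
            queue.append(path + [callee])
    return None


# Toy call graph: caller -> callees (illustrative names only).
CALL_GRAPH = {
    "show": ["load_note"],
    "edit": ["check_access"],
    "check_access": ["load_note"],
    "load_note": ["sudo_browse"],
}

unguarded = find_path(CALL_GRAPH, "show", {"sudo_browse"}, ["check_access"])
guarded = find_path(CALL_GRAPH, "edit", {"sudo_browse"}, ["check_access"])
print(unguarded)
print(guarded)
```

Here `show` still reaches the sink, while `edit` does not, because its only route goes through `check_access`.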
Methods to terminate (return concrete results):
.signatures() — list of FQ names
.objects() — list of PyCallable / PyClass / etc.
.paths() — list of resolved paths (for reachability queries), each carrying a confidence/visibility label
.count()
.explain() — the most important terminator: returns the queries CLDK ran, the backends that resolved them, and the unresolved-edge set. This is what makes the result citable.
2. Joins between graphs
The current get_* methods return one slice at a time. Useful queries need to join those slices and collect the results, much as Java streams or Rust iterator chains do.
(pa.callables()
    .with_decorator("http.route")
    .calling(pa.callables().in_class("CrmNoteController"))
    .where(lambda c: c.modified_since("HEAD~10"))
    .signatures())
Call graph ⋈ inheritance ⋈ decorator-set ⋈ git-blame is the highest-leverage join.
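As a rough illustration of what a join means here, the sketch below combines three toy relations (call edges, decorator sets, and class membership) in one pass. The relation shapes and the `routes_calling_class` helper are assumptions made for this example; inheritance and git-blame are left out to keep it short.

```python
# Toy relations; all names are illustrative.
CALLS = {
    ("show", "CrmNoteController.read"),
    ("edit", "Helper.fmt"),
    ("ping", "CrmNoteController.read"),
}
DECORATORS = {"show": {"http.route"}, "edit": {"http.route"}}
MEMBER_OF = {
    "CrmNoteController.read": "CrmNoteController",
    "Helper.fmt": "Helper",
}


def routes_calling_class(calls, decorators, member_of, decorator, class_name):
    """Join call edges with the decorator and class-membership relations."""
    return sorted({
        caller
        for caller, callee in calls
        if decorator in decorators.get(caller, ())
        and member_of.get(callee) == class_name
    })


joined = routes_calling_class(
    CALLS, DECORATORS, MEMBER_OF, "http.route", "CrmNoteController"
)
print(joined)
```

`ping` calls into the controller but is not a route, and `edit` is a route but calls elsewhere, so only `show` survives the join.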
3. Pattern-based sink matching on attribute chains
For dynamic-dispatch-heavy code (Odoo ORM, Django ORM, SQLAlchemy, message-bus dispatch), the resolved call graph cannot see the sinks. A sink_pattern("request.env[*].sudo().*") matcher operating at the AST / token level on attribute chains — usable inside .reachable_to(sink_pattern(...)) — is the escape hatch. It also bridges Bandit-/Semgrep-style pattern rules into CLDK without re-implementing them.
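One possible shape for such a matcher, sketched with only the standard `ast` and `re` modules: render each call's receiver as a normalized attribute chain (every subscript becomes `[*]`, every call becomes `()`), then match it against the pattern. The pattern grammar is an assumption here: `[*]` means "any subscript" and a bare `*` means "anything"; `find_sinks` and `render_chain` are hypothetical helper names.

```python
import ast
import re


def compile_sink_pattern(pattern):
    """'[*]' matches a normalized subscript literally; a bare '*' matches anything."""
    parts = pattern.split("[*]")
    escaped = [re.escape(part).replace(r"\*", ".*") for part in parts]
    return re.compile(r"\[\*\]".join(escaped))


def render_chain(node):
    """Render Name/Attribute/Call/Subscript nodes as a dotted chain, else None."""
    if isinstance(node, ast.Name):
        return node.id
    if isinstance(node, ast.Attribute):
        base = render_chain(node.value)
        return None if base is None else f"{base}.{node.attr}"
    if isinstance(node, ast.Call):
        base = render_chain(node.func)
        return None if base is None else f"{base}()"
    if isinstance(node, ast.Subscript):
        base = render_chain(node.value)
        return None if base is None else f"{base}[*]"
    return None


def find_sinks(source, pattern):
    """Return (line, chain) for every call whose callee chain matches the pattern."""
    regex = compile_sink_pattern(pattern)
    hits = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Call):
            chain = render_chain(node.func)
            if chain is not None and regex.fullmatch(chain):
                hits.append((node.lineno, chain))
    return sorted(hits)


SOURCE = '''
def show(note_id):
    note = request.env["crm.note"].sudo().browse(note_id)
    return note.read()
'''

sinks = find_sinks(SOURCE, "request.env[*].sudo().*")
print(sinks)
```

This works purely on syntax, so it finds the `.sudo().browse(...)` call without resolving what `request.env[...]` returns, which is exactly the case the resolved call graph misses.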
4. Provenance, lazily evaluated, cached
- Lazy: a chain is a query plan, not a list of results. Evaluation happens at the terminator.
- Cached: re-running the same chain hits the cache (the analysis backend is already cache-backed by analysis_cache.json; queries should follow).
- Provenance: every result carries the chain + backend resolutions that produced it. `.explain()` surfaces this; for security reports it becomes the citation.
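The three properties above fit together naturally if a chain is stored as a tuple of labelled steps. The sketch below is one way to do that, under the assumption that step labels uniquely identify their predicates (so the label tuple can serve as a cache key); `LazyQuery` and its in-memory dict cache are illustrative, not a proposal for the actual storage format.

```python
class LazyQuery:
    """A chain is a plan (tuple of labelled steps); work happens in objects()."""

    def __init__(self, items, steps=(), cache=None):
        self._items = items
        self._steps = steps  # tuple of (label, predicate)
        self._cache = {} if cache is None else cache

    def where(self, label, predicate):
        # Extending the plan does not evaluate anything.
        return LazyQuery(self._items, self._steps + ((label, predicate),), self._cache)

    def plan(self):
        return tuple(label for label, _ in self._steps)

    def objects(self):
        key = self.plan()
        if key not in self._cache:
            self._cache[key] = [
                x for x in self._items if all(p(x) for _, p in self._steps)
            ]
        return list(self._cache[key])

    def explain(self):
        # Provenance: which steps produced the result, and whether it was cached.
        return {"plan": list(self.plan()), "cached": self.plan() in self._cache}


calls = []  # records every predicate evaluation, to make laziness visible


def is_even(n):
    calls.append(n)
    return n % 2 == 0


q = LazyQuery(range(6)).where("even", is_even).where("gt_1", lambda n: n > 1)
calls_before_terminator = len(calls)  # building the chain ran no predicates
first = q.objects()                   # evaluates the plan
calls_after_first = len(calls)
second = q.objects()                  # served from the cache
print(first, q.explain())
```

In a real implementation the cache key would also need to cover the analysis snapshot, so that a changed codebase invalidates old results.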
5. Honest visibility labels
Every reachability / dataflow result should carry a label distinguishing:
- resolved — CodeQL/Jedi confirmed the edge
- structural — pattern matched but not resolved through dispatch
- unresolved — analyzer can't see this edge, surfaced anyway
This is the single most valuable property for security work: it lets the consumer set their confidence tier based on the result, instead of being silently misled by dropped edges.
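A small sketch of how these labels might be carried on paths. The `Visibility` enum mirrors the three labels above; the rule that a path takes the label of its weakest edge is an assumption of this example, not something the existing backends define.

```python
from enum import IntEnum


class Visibility(IntEnum):
    """Ordered from least to most trustworthy."""
    UNRESOLVED = 0   # analyzer cannot see this edge, surfaced anyway
    STRUCTURAL = 1   # pattern matched, not resolved through dispatch
    RESOLVED = 2     # confirmed by the analysis backend


def path_visibility(edge_labels):
    """Assumed rule: a path is only as trustworthy as its weakest edge."""
    labels = list(edge_labels)
    if not labels:
        raise ValueError("a path needs at least one edge")
    return min(labels)


# Each edge is (caller, callee, label); names are illustrative.
path = [
    ("show", "load_note", Visibility.RESOLVED),
    ("load_note", "env[*].sudo().browse", Visibility.STRUCTURAL),
]

label = path_visibility(lbl for _, _, lbl in path)
print(label.name.lower())
```

A consumer can then filter with something like `label >= Visibility.STRUCTURAL` to choose how much uncertainty a report tolerates.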
Concrete example end-to-end
The 12-folder audit I did by hand would collapse to roughly:
public_unauth_sudo = (
    pa.callables()
    .with_decorator("http.route", auth="public")
    .reachable_to(pa.callsites(sink_pattern("*.sudo().browse")))
    .without_passing_through(sanitizers=["check_access", "has_group"])
)
inapp_sudo_no_gate = (
    pa.callables()
    .with_decorator("http.route", auth="user")
    .reachable_to(pa.callsites(sink_pattern("*.sudo().*")))
    .without_passing_through(sanitizers=[
        "check_access", "has_group",
        "owner_user_id.id == request.env.user.id",  # value-predicate sanitizers
    ])
)
for c in public_unauth_sudo.objects():
    print(c.signature, "-> CONFIRMED unauth sudo path")
for c in inapp_sudo_no_gate.objects():
    print(c.signature, "-> CONDITIONALLY CONFIRMED (in_app_role)")
Twelve handwritten findings become two queries, each .explain()-able into the same evidence I produced by hand.
What this is not
- Not a new analysis backend. It is a query layer over the existing Jedi + CodeQL results plus AST/token access for pattern matching.
- Not a scanner. CLDK should remain the analyst-side tool that audits scanner alerts; this proposal makes that audit composable and citable, not automatic.
- Not an opinionated security framework. Sanitizer/source/sink dictionaries are user-supplied (with the same justification discipline the sources.json validator already enforces); the library returns facts and provenance.
Relationship to existing stubs
Two stubs in the current SDK become natural starting points for the chain API once implemented:
get_methods_with_decorators(...) → pa.callables().with_decorator(...)
get_entry_point_methods(...) → pa.callables().is_entry_point() (with framework recipes deciding what counts)
get_calling_lines and get_call_targets would underlie .callers() / .callees().
Why this is the highest-leverage change
Pandas is my reference: it is great because it makes composing common slicing operations cheap and citable. CLDK already has the underlying analysis facts; what's missing is the surface that turns them into composable, citable queries.
For framework-shaped security audits specifically (web, RPC, CLI, async messaging — i.e., most real-world Python codebases), the marginal value of this change is larger than the marginal value of any analysis-engine improvement. A composable query layer with pattern-based escape hatches and honest visibility labels would be awesome.
Out of scope (suggested separate issues)
- Polyglot graphs that follow HTTP/gRPC/message-queue edges across languages
- Differential queries between commits / branches
- Bridge to coverage / dynamic instrumentation
- A framework-recipe registry (Flask, Django, FastAPI, Odoo, Express, Rails, Spring) shipping as data