Composable, chainable query API: pa.callables().with_decorator(...).reachable_to(...).without_passing_through(...) #155

@rahlk

Summary

Add a composable, chainable query API on top of the existing CLDK analysis façade — the equivalent of what DataFrame.groupby(...).filter(...).agg(...) is to pandas. Today, answering a security/code-analysis question with CLDK requires manually composing the results of multiple get_* calls in user code; this proposal moves that composition into the library so the queries themselves become the unit of citation.

The dream

pa.callables()
  .with_decorator("http.route", auth="public")
  .reachable_to(sink_pattern("request.env[*].sudo().*"))
  .without_passing_through(sanitizers=["check_access", "has_group"])

This single chain expresses: "every callable that is decorated as a public HTTP route, from which there exists a call/dataflow path to a sink matching this attribute-chain pattern, where no callable on the path is one of these named sanitizers."

For framework-shaped audits (web, RPC, CLI, message handlers), one such chain replaces dozens of hand-written analyses. The result is a citable fact: the chain itself is the evidence, not the prose around it.

Motivation (from a real audit)

I recently used CLDK to build a Proof-of-Exploitability report over 12 alerts in an Odoo addon, following a sources.json → reachability → taint → verdict pipeline (see the poe-with-cldk skill methodology). The most valuable thing CLDK gave me was authoritative decorator capture via pa.get_method(...).decorators — that single fact (auth='public' on a route) carried the entire severity of the top finding.

The bottleneck was composition: I had to walk get_methods() by hand, filter on a decorator predicate by reading each decorators list myself, run get_callers/get_callees on every match to test reachability, and then re-read the source body to check for sanitizers. That composition is the analysis. It belongs in the library, not in every user's notebook.

What the proposal contains

1. A Query / Selector chain over CLDK's existing graphs

Methods to start a query:

  • pa.callables() — every PyCallable in the analysis
  • pa.classes(), pa.fields(), pa.modules() — analogous starting points
  • pa.callsites(pattern) — every call site matching a structural pattern (sidesteps the dynamic-dispatch problem; see §3 below)

Methods to filter (return a narrowed query):

  • .with_decorator(name, **kwarg_predicates) — match decorator by name and named-argument value (auth="public")
  • .in_module(glob), .in_class(name), .in_directory(path)
  • .matching(name_pattern) — name/signature regex
  • .modified_since(commit_ref) — git-blame integration (later)
  • .not_in_tests() — convention-based test exclusion
  • .where(predicate) — escape hatch for arbitrary user predicates
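To make the decorator filter concrete, here is a minimal sketch of the matching step that .with_decorator(name, **kwarg_predicates) might perform. It assumes each decorator is available as its source text (for example "@http.route('/crm/note', auth='public')"); the helper name decorator_matches is hypothetical and not part of CLDK.

```python
import ast


def decorator_matches(decorator_src, name, **kwarg_predicates):
    """Hypothetical helper: match a decorator's source text by name and literal kwargs."""
    expr = ast.parse(decorator_src.lstrip("@").strip(), mode="eval").body
    call = expr if isinstance(expr, ast.Call) else None
    func = call.func if call is not None else expr
    dotted = ast.unparse(func)  # e.g. "http.route" (requires Python 3.9+)
    # Accept an exact match or a longer qualified name ending in the requested one.
    if dotted != name and not dotted.endswith("." + name):
        return False
    if not kwarg_predicates:
        return True
    if call is None:
        return False  # bare decorator carries no keyword arguments
    actual = {}
    for kw in call.keywords:
        if kw.arg is None:
            continue  # skip **splat arguments
        try:
            actual[kw.arg] = ast.literal_eval(kw.value)
        except (ValueError, TypeError):
            pass  # non-literal value: cannot be confirmed statically
    return all(k in actual and actual[k] == v for k, v in kwarg_predicates.items())


print(decorator_matches("@http.route('/crm/note', auth='public')", "http.route", auth="public"))
```

Non-literal keyword values (such as auth=AUTH) are treated as non-matching here; a real implementation might instead surface them as unresolved rather than drop them.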

Methods to relate (return a query over a related set):

  • .callers(), .callees() — one hop in the call graph
  • .transitive_callers(), .transitive_callees() — closure
  • .reachable_to(other_query, *, via=["call", "dataflow"], sanitizers=None) — path existence between two queries with optional sanitizer-mask
  • .without_passing_through(sanitizers=[...]) — narrow a reachability set by removing paths that touch any sanitizer
  • .subclasses(), .superclasses(), .implementers()
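The sanitizer-masked reachability described above can be sketched as a breadth-first search that refuses to step onto sanitizer nodes. This is an illustration only: the call graph is assumed to be a plain dict of caller to callees, and find_unsanitized_path is a hypothetical name, not an existing CLDK function.

```python
from collections import deque


def find_unsanitized_path(call_graph, source, sinks, sanitizers=()):
    """Return one call path from source to any sink that avoids every sanitizer, or None."""
    blocked = set(sanitizers)
    sinks = set(sinks)
    if source in blocked:
        return None
    parent = {source: None}  # also serves as the visited set
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node in sinks:
            path = []
            while node is not None:  # walk parents back to the source
                path.append(node)
                node = parent[node]
            return path[::-1]
        for callee in call_graph.get(node, ()):
            if callee not in parent and callee not in blocked:
                parent[callee] = node
                queue.append(callee)
    return None


# Toy graph with made-up callable names.
call_graph = {
    "public_route": ["helper", "check_access"],
    "helper": ["sudo_browse"],
    "check_access": ["sudo_browse"],
    "guarded_route": ["check_access"],
}
print(find_unsanitized_path(call_graph, "public_route", {"sudo_browse"}, ["check_access"]))
```

Note the existential reading: a callable is kept if at least one sanitizer-free path exists, which matches the "no callable on the path is a sanitizer" wording of the chain.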

Methods to terminate (return concrete results):

  • .signatures() — list of FQ names
  • .objects() — list of PyCallable / PyClass / etc.
  • .paths() — list of resolved paths (for reachability queries), each carrying a confidence/visibility label
  • .count()
  • .explain(): the most important terminator. It returns the queries CLDK ran, the backends that resolved them, and the unresolved-edge set. This is what makes the result citable.

2. Joins between graphs

The current get_* methods return one slice at a time. Useful queries need joins and collect-style terminal operations, much as stream APIs in Java or iterator chains in Rust provide.

(pa.callables()
   .with_decorator("http.route")
   .calling(pa.callables().in_class("CrmNoteController"))
   .where(lambda c: c.modified_since("HEAD~10"))
   .signatures())

Call graph ⋈ inheritance ⋈ decorator-set ⋈ git-blame is the highest-leverage join.

3. Pattern-based sink matching on attribute chains

For dynamic-dispatch-heavy code (Odoo ORM, Django ORM, SQLAlchemy, message-bus dispatch), the resolved call graph cannot see the sinks. A sink_pattern("request.env[*].sudo().*") matcher operating at the AST / token level on attribute chains — usable inside .reachable_to(sink_pattern(...)) — is the escape hatch. It also bridges Bandit-/Semgrep-style pattern rules into CLDK without re-implementing them.
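One way such a matcher could work is to render each attribute/call/subscript chain in the AST to a canonical string and compare it against the pattern. The sketch below uses only the standard ast and re modules. The pattern semantics are an assumption for illustration ("[*]" means any subscript, "()" means any call, and any other "*" matches one or more characters), and the names chain_text and find_sink_lines are hypothetical; sink_pattern here is a stand-in for the proposed matcher, not an existing API.

```python
import ast
import re


def chain_text(node):
    """Render a Name/Attribute/Call/Subscript chain canonically, or None if it is not one."""
    if isinstance(node, ast.Name):
        return node.id
    if isinstance(node, ast.Attribute):
        base = chain_text(node.value)
        return None if base is None else f"{base}.{node.attr}"
    if isinstance(node, ast.Call):
        base = chain_text(node.func)
        return None if base is None else f"{base}()"  # arguments are ignored
    if isinstance(node, ast.Subscript):
        base = chain_text(node.value)
        return None if base is None else f"{base}[*]"  # any index or key
    return None


def sink_pattern(pattern):
    """Compile a chain pattern; "[*]" stays literal, every other "*" is a wildcard."""
    parts = []
    for piece in pattern.split("[*]"):
        parts.append(".+".join(re.escape(p) for p in piece.split("*")))
    return re.compile(re.escape("[*]").join(parts))


def find_sink_lines(source, pattern):
    """Return the sorted line numbers on which some chain fully matches the pattern."""
    regex = sink_pattern(pattern)
    lines = set()
    for node in ast.walk(ast.parse(source)):
        text = chain_text(node)
        if text is not None and regex.fullmatch(text):
            lines.add(node.lineno)
    return sorted(lines)


src = '''
def public_note(note_id):
    note = request.env["crm.note"].sudo().browse(note_id)
    return note.name

def safe(note_id):
    return request.env["crm.note"].browse(note_id)
'''
print(find_sink_lines(src, "request.env[*].sudo().*"))
```

Because this works purely on syntax, every hit would carry the structural label rather than resolved: it says the chain is written this way, not that dispatch lands on a particular method.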

4. Provenance, lazily evaluated, cached

  • Lazy: a chain is a query plan, not a list of results. Evaluation happens at the terminator.
  • Cached: re-running the same chain hits the cache (the analysis backend is already cache-backed by analysis_cache.json; queries should follow).
  • Provenance: every result carries the chain + backend resolutions that produced it. `.explain()` surfaces this; for security reports it becomes the citation.
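The three properties above can be illustrated with a toy query object: each filter only appends to a plan, the terminator evaluates it, and results are cached by plan. This is a sketch under stated assumptions, not CLDK code: the Query class, its where(label, predicate) signature, and the label-keyed cache are all hypothetical, and the cache assumes a label uniquely identifies its predicate.

```python
class Query:
    """Toy lazy query: a plan of labelled predicates over a fixed item set."""

    def __init__(self, items, steps=(), cache=None):
        self._items = items
        self._steps = steps
        self._cache = {} if cache is None else cache  # shared by all derived queries

    def where(self, label, predicate):
        # Lazy: record the step, run nothing.
        return Query(self._items, self._steps + ((label, predicate),), self._cache)

    def _plan(self):
        return tuple(label for label, _ in self._steps)

    def objects(self):
        key = self._plan()
        if key not in self._cache:  # evaluate only on a cache miss
            result = list(self._items)
            for _, predicate in self._steps:
                result = [item for item in result if predicate(item)]
            self._cache[key] = result
        return list(self._cache[key])

    def explain(self):
        # Provenance: the plan that produced (or would produce) the result.
        key = self._plan()
        return {"plan": list(key), "cached": key in self._cache}


calls = []

def is_public(name):
    calls.append(name)  # record evaluations to make laziness observable
    return name.startswith("public_")

root = Query(["public_note", "user_note", "public_feed"])
query = root.where("with_decorator(auth='public')", is_public)
print(query.explain())   # nothing evaluated yet
print(query.objects())
print(query.explain())
```

A real .explain() would also need to report which backend resolved each step and the unresolved-edge set; the dictionary here only shows where that information would live.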

5. Honest visibility labels

Every reachability / dataflow result should carry a label distinguishing:

  • resolved — CodeQL/Jedi confirmed the edge
  • structural — pattern matched but not resolved through dispatch
  • unresolved — analyzer can't see this edge, surfaced anyway

This is the single most valuable property for security work: it lets the consumer set their confidence tier based on the result, instead of being silently misled by dropped edges.
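As a sketch of how these labels might propagate from edges to whole paths, the snippet below orders the three labels and gives a path the label of its weakest edge. The "weakest edge wins" policy and the names Visibility and path_visibility are assumptions made for illustration; the proposal does not fix a combination rule.

```python
from enum import IntEnum


class Visibility(IntEnum):
    """Edge labels ordered from least to most trustworthy."""
    UNRESOLVED = 0   # analyzer cannot see this edge, surfaced anyway
    STRUCTURAL = 1   # pattern matched but not resolved through dispatch
    RESOLVED = 2     # backend confirmed the edge


def path_visibility(edge_labels):
    """Assumed policy: a path is only as trustworthy as its weakest edge."""
    return min(edge_labels, default=Visibility.RESOLVED)


path = [Visibility.RESOLVED, Visibility.STRUCTURAL, Visibility.RESOLVED]
print(path_visibility(path).name)
```

A consumer could then filter .paths() by a minimum tier, for example keeping only paths whose label is at least STRUCTURAL.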

Concrete example end-to-end

The 12-folder audit I did by hand would collapse to roughly:

public_unauth_sudo = (
    pa.callables()
      .with_decorator("http.route", auth="public")
      .reachable_to(pa.callsites(sink_pattern("*.sudo().browse")))
      .without_passing_through(sanitizers=["check_access", "has_group"])
)

inapp_sudo_no_gate = (
    pa.callables()
      .with_decorator("http.route", auth="user")
      .reachable_to(pa.callsites(sink_pattern("*.sudo().*")))
      .without_passing_through(sanitizers=[
          "check_access", "has_group",
          "owner_user_id.id == request.env.user.id",   # value-predicate sanitizers
      ])
)

for c in public_unauth_sudo.objects():
    print(c.signature, "-> CONFIRMED unauth sudo path")
for c in inapp_sudo_no_gate.objects():
    print(c.signature, "-> CONDITIONALLY CONFIRMED (in_app_role)")

Twelve handwritten findings become two queries, each .explain()-able into the same evidence I produced by hand.

What this is not

  • Not a new analysis backend. It is a query layer over the existing Jedi + CodeQL results plus AST/token access for pattern matching.
  • Not a scanner. CLDK should remain the analyst-side tool that audits scanner alerts; this proposal makes that audit composable and citable, not automatic.
  • Not an opinionated security framework. Sanitizer/source/sink dictionaries are user-supplied (with the same justification discipline the sources.json validator already enforces); the library returns facts and provenance.

Relationship to existing stubs

Two stubs in the current SDK become natural starting points for the chain API once implemented:

  • get_methods_with_decorators(...) → pa.callables().with_decorator(...)
  • get_entry_point_methods(...) → pa.callables().is_entry_point() (with framework recipes deciding what counts)

get_calling_lines and get_call_targets would underlie .callers() / .callees().

Why this is the highest-leverage change

Pandas is my reference: it is great because it makes composing common slicing operations cheap and citable. CLDK already has the underlying analysis facts; what's missing is the surface that turns them into composable, citable queries.

For framework-shaped security audits specifically (web, RPC, CLI, async messaging — i.e., most real-world Python codebases), the marginal value of this change is larger than the marginal value of any analysis-engine improvement. A composable query layer with pattern-based escape hatches and honest visibility labels would be awesome.

Out of scope (suggested separate issues)

  • Polyglot graphs that follow HTTP/gRPC/message-queue edges across languages
  • Differential queries between commits / branches
  • Bridge to coverage / dynamic instrumentation
  • A framework-recipe registry (Flask, Django, FastAPI, Odoo, Express, Rails, Spring) shipping as data
