Composable, chainable query API: `pa.callables().with_decorator(...).reachable_to(...).without_passing_through(...)`

## Summary

Add a **composable, chainable query API** on top of the existing CLDK analysis façade — the equivalent of what `DataFrame.groupby(...).filter(...).agg(...)` is to pandas. Today, answering a security/code-analysis question with CLDK requires manually composing the results of multiple `get_*` calls in user code; this proposal moves that composition into the library so the queries themselves become the unit of citation.

## The dream

```python
pa.callables()
  .with_decorator("http.route", auth="public")
  .reachable_to(sink_pattern("request.env[*].sudo().*"))
  .without_passing_through(sanitizers=["check_access", "has_group"])
```

This single chain expresses: *"every callable that is decorated as a public HTTP route, from which there exists a call/dataflow path to a sink matching this attribute-chain pattern, where no callable on the path is one of these named sanitizers."*

For framework-shaped audits (web, RPC, CLI, message handlers), one such chain replaces dozens of hand-written analyses. The result is a **citable** fact: the chain itself is the evidence, not the prose around it.

## Motivation (from a real audit)

I recently used CLDK to build a Proof-of-Exploitability report over 12 alerts in an Odoo addon, following a `sources.json` → reachability → taint → verdict pipeline (see the `poe-with-cldk` skill methodology). The most valuable thing CLDK gave me was authoritative decorator capture via `pa.get_method(...).decorators` — that single fact (`auth='public'` on a route) carried the entire severity of the top finding.

The bottleneck was *composition*: I had to manually walk `get_methods()`, filter by decorator predicate by reading the `decorators` list myself, then for each match run `get_callers`/`get_callees` to test reachability, then re-read the source body to check for sanitizers. That composition is the analysis. It belongs in the library, not in every user's notebook.

## What the proposal contains

### 1. A `Query` / `Selector` chain over CLDK's existing graphs

Methods to start a query:
- `pa.callables()` — every PyCallable in the analysis
- `pa.classes()`, `pa.fields()`, `pa.modules()` — analogous starting points
- `pa.callsites(pattern)` — every call site matching a structural pattern (sidesteps the dynamic-dispatch problem; see §3 below)

Methods to filter (return a narrowed query):
- `.with_decorator(name, **kwarg_predicates)` — match decorator by name and named-argument value (`auth="public"`)
- `.in_module(glob)`, `.in_class(name)`, `.in_directory(path)`
- `.matching(name_pattern)` — name/signature regex
- `.modified_since(commit_ref)` — git-blame integration (later)
- `.not_in_tests()` — convention-based test exclusion
- `.where(predicate)` — escape hatch for arbitrary user predicates

Methods to relate (return a query over a related set):
- `.callers()`, `.callees()` — one hop in the call graph
- `.transitive_callers()`, `.transitive_callees()` — closure
- `.reachable_to(other_query, *, via=["call", "dataflow"], sanitizers=None)` — path existence between two queries with optional sanitizer-mask
- `.without_passing_through(sanitizers=[...])` — narrow a reachability set by removing paths that touch any sanitizer
- `.subclasses()`, `.superclasses()`, `.implementers()`
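
As a rough illustration of what `.reachable_to(...)` combined with `.without_passing_through(...)` could compute, here is a minimal sketch over a toy call graph. The function name `reachable_without`, the dictionary-based graph, and all callable names are hypothetical stand-ins, not CLDK APIs; the sketch assumes the sanitizer mask simply removes sanitizer nodes before searching for a path.

```python
from collections import deque

def reachable_without(graph, source, sinks, sanitizers=frozenset()):
    """Return one call path from source to any sink that avoids every
    sanitizer, or None if every path is masked (toy semantics)."""
    if source in sanitizers:
        return None
    queue = deque([[source]])
    seen = {source}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node in sinks:
            return path
        for callee in graph.get(node, ()):
            # Skip nodes already visited and nodes masked as sanitizers.
            if callee in seen or callee in sanitizers:
                continue
            seen.add(callee)
            queue.append(path + [callee])
    return None

# Hypothetical call graph: caller -> list of callees.
CALL_GRAPH = {
    "public_route": ["load_note"],
    "user_route": ["check_access"],
    "check_access": ["load_note"],
    "load_note": ["sudo_browse"],
}

print(reachable_without(CALL_GRAPH, "public_route", {"sudo_browse"}, {"check_access"}))
print(reachable_without(CALL_GRAPH, "user_route", {"sudo_browse"}, {"check_access"}))
```

The first call finds a sanitizer-free path; the second finds none, because the only route to the sink goes through `check_access`.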

Methods to terminate (return concrete results):
- `.signatures()` — list of FQ names
- `.objects()` — list of PyCallable / PyClass / etc.
- `.paths()` — list of resolved paths (for reachability queries), each carrying a confidence/visibility label
- `.count()`
- `.explain()` — *the most important terminator*: returns the queries CLDK ran, the backends that resolved them, and the unresolved-edge set. This is what makes the result citable.
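
To make the start / filter / terminate split concrete, the following is a minimal sketch of how such a chain might be structured. Everything here is an assumption for illustration: `ToyCallable`, `Decorator`, and this `Query` class are hypothetical and do not reflect CLDK's actual data model. Each filter returns a new query and records a labelled step; nothing is evaluated until a terminator is called.

```python
from dataclasses import dataclass, field

@dataclass
class Decorator:          # hypothetical stand-in for a captured decorator
    name: str
    kwargs: dict = field(default_factory=dict)

@dataclass
class ToyCallable:        # hypothetical stand-in for a PyCallable
    signature: str
    decorators: list

class Query:
    def __init__(self, source, steps=()):
        self._source = source        # zero-arg function producing the start set
        self._steps = tuple(steps)   # (label, predicate) pairs; nothing runs yet

    def _then(self, label, predicate):
        return Query(self._source, self._steps + ((label, predicate),))

    def with_decorator(self, name, **kwarg_predicates):
        def pred(c):
            return any(
                d.name == name
                and all(d.kwargs.get(k) == v for k, v in kwarg_predicates.items())
                for d in c.decorators
            )
        return self._then(f"with_decorator({name!r}, {kwarg_predicates})", pred)

    def where(self, predicate):
        return self._then("where(<predicate>)", predicate)

    # Terminators: the only place the plan is actually evaluated.
    def objects(self):
        items = list(self._source())
        for _, pred in self._steps:
            items = [c for c in items if pred(c)]
        return items

    def signatures(self):
        return [c.signature for c in self.objects()]

    def count(self):
        return len(self.objects())

    def explain(self):
        return [label for label, _ in self._steps]

SAMPLE = [
    ToyCallable("web.public_note", [Decorator("http.route", {"auth": "public"})]),
    ToyCallable("web.user_note", [Decorator("http.route", {"auth": "user"})]),
    ToyCallable("util.helper", []),
]

public_routes = Query(lambda: SAMPLE).with_decorator("http.route", auth="public")
print(public_routes.signatures())
print(public_routes.explain())
```

A real `.explain()` would also report backends and unresolved edges; this sketch only records the step labels.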

### 2. Joins between graphs

The current `get_*` methods return one slice at a time. Useful queries need joins and a final collect step, in the way Java streams or Rust iterators chain operations and then collect the result.

```python
(pa.callables()
   .with_decorator("http.route")
   .calling(pa.callables().in_class("CrmNoteController"))
   .where(lambda c: c.modified_since("HEAD~10"))
   .signatures())
```

Call graph ⋈ inheritance ⋈ decorator-set ⋈ git-blame is the highest-leverage join.
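
Conceptually, such a join is an intersection of facts keyed by callable. A minimal sketch with plain Python collections follows; the tables `CALLS`, `DECORATED`, `OWNER_CLASS`, and `RECENTLY_MODIFIED` and their contents are invented for illustration and assume each fact source has already been extracted.

```python
# Hypothetical, pre-extracted fact tables.
CALLS = {                                   # call-graph edges (caller, callee)
    ("route_a", "CrmNoteController.read"),
    ("route_b", "Other.run"),
    ("helper", "CrmNoteController.read"),
}
DECORATED = {"route_a": ["http.route"], "route_b": ["http.route"]}
OWNER_CLASS = {"CrmNoteController.read": "CrmNoteController", "Other.run": "Other"}
RECENTLY_MODIFIED = {"route_a", "helper"}   # e.g. derived from git history

def routes_calling_class(class_name):
    """Join call edges with decorator, class-ownership and modification facts."""
    return sorted(
        caller
        for caller, callee in CALLS
        if "http.route" in DECORATED.get(caller, [])
        and OWNER_CLASS.get(callee) == class_name
        and caller in RECENTLY_MODIFIED
    )

print(routes_calling_class("CrmNoteController"))
```

Only `route_a` satisfies all three conditions: `helper` is not a route, and `route_b` calls into a different class.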

### 3. Pattern-based sink matching on attribute chains

For dynamic-dispatch-heavy code (Odoo ORM, Django ORM, SQLAlchemy, message-bus dispatch), the resolved call graph cannot see the sinks. A `sink_pattern("request.env[*].sudo().*")` matcher operating at the AST / token level on attribute chains — usable inside `.reachable_to(sink_pattern(...))` — is the escape hatch. It also bridges Bandit-/Semgrep-style pattern rules into CLDK without re-implementing them.
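
One possible way to prototype such a matcher is with the standard-library `ast` module (Python 3.9+ for `ast.unparse`). This is a sketch under simplifying assumptions, not a proposed implementation: `[*]` is taken to mean "any subscript" and `*` to mean "exactly one identifier", and `find_sinks` is a hypothetical helper name.

```python
import ast
import re

def sink_pattern(pattern):
    """Compile an attribute-chain pattern into a regex (toy semantics)."""
    regex = re.escape(pattern)
    regex = regex.replace(r"\[\*\]", r"\[[^\]]*\]")  # [*] -> any subscript
    regex = regex.replace(r"\*", r"\w+")             # *   -> one identifier
    return re.compile(regex)

def find_sinks(source, pattern):
    """Return (line, callee expression) for every call whose callee matches."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Call):
            callee = ast.unparse(node.func)
            if pattern.fullmatch(callee):
                hits.append((node.lineno, callee))
    return sorted(hits)

SAMPLE = '''
def show(note_id):
    note = request.env['crm.note'].sudo().browse(note_id)
    safe = request.env['crm.note'].browse(note_id)
    return note
'''

for line, callee in find_sinks(SAMPLE, sink_pattern("request.env[*].sudo().*")):
    print(line, callee)
```

Only the `sudo().browse` call on line 3 of the sample is reported; the unprivileged `browse` on the next line is not. Matching on unparsed text is deliberately crude, and a real matcher would likely walk the attribute chain structurally.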

### 4. Provenance, lazily evaluated, cached

- Lazy: a chain is a query plan, not a list of results. Evaluation happens at the terminator.
- Cached: re-running the same chain hits the cache (the analysis backend is already cache-backed by `analysis_cache.json`; queries should follow).
- Provenance: every result carries the chain + backend resolutions that produced it. `.explain()` surfaces this; for security reports it becomes the citation.
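
The three properties can be sketched together in a few lines. The `Plan` and `Engine` classes below, the `with_tag` step, and the in-memory dictionary cache are all hypothetical; the sketch assumes a plan is an immutable, hashable description of steps, so that equal plans share one cache entry and each result can carry the plan that produced it.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Plan:
    """Immutable query plan: building one performs no analysis."""
    steps: tuple = ()

    def then(self, op, *args):
        return Plan(self.steps + ((op, args),))

class Engine:
    def __init__(self, facts):
        self.facts = facts        # signature -> set of tags (toy analysis facts)
        self.cache = {}           # plan -> results
        self.evaluations = 0      # counts real evaluations, for demonstration

    def run(self, plan):
        if plan not in self.cache:
            self.evaluations += 1
            names = sorted(self.facts)
            for op, args in plan.steps:
                if op == "with_tag":
                    names = [n for n in names if args[0] in self.facts[n]]
                else:
                    raise ValueError(f"unknown step: {op}")
            # Provenance: each result is paired with the steps that produced it.
            self.cache[plan] = [(n, plan.steps) for n in names]
        return self.cache[plan]

engine = Engine({"a": {"route"}, "b": set(), "c": {"route"}})
plan = Plan().then("with_tag", "route")      # lazy: nothing evaluated yet
first = engine.run(plan)
second = engine.run(Plan().then("with_tag", "route"))  # equal plan, cache hit
print(first, engine.evaluations)
```

Because `Plan` is a frozen dataclass over tuples, two independently built but identical chains hash to the same cache key.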

### 5. Honest visibility labels

Every reachability / dataflow result should carry a label distinguishing:
- **resolved** — CodeQL/Jedi confirmed the edge
- **structural** — pattern matched but not resolved through dispatch
- **unresolved** — analyzer can't see this edge, surfaced anyway

This is the single most valuable property for security work: it lets the *consumer* set their confidence tier based on the result, instead of being silently misled by dropped edges.
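
A minimal sketch of how such labels might be carried and consumed is shown below. The `Visibility` enum, the `Edge` type, and in particular the rule that a path is only as visible as its weakest edge are assumptions made for illustration; the proposal itself does not specify how edge labels combine.

```python
from dataclasses import dataclass
from enum import IntEnum

class Visibility(IntEnum):
    UNRESOLVED = 0   # analyzer cannot see this edge, surfaced anyway
    STRUCTURAL = 1   # pattern matched but not resolved through dispatch
    RESOLVED = 2     # backend confirmed the edge

@dataclass(frozen=True)
class Edge:
    caller: str
    callee: str
    visibility: Visibility

def path_visibility(edges):
    """Assumed rule: a path is only as trustworthy as its weakest edge."""
    return min(e.visibility for e in edges)

def at_least(paths, tier):
    """Let the consumer pick a confidence tier instead of dropping edges silently."""
    return [p for p in paths if path_visibility(p) >= tier]

resolved_path = [Edge("route", "load", Visibility.RESOLVED),
                 Edge("load", "sink", Visibility.RESOLVED)]
mixed_path = [Edge("route", "dispatch", Visibility.RESOLVED),
              Edge("dispatch", "sink", Visibility.STRUCTURAL)]
blind_path = [Edge("route", "plugin", Visibility.UNRESOLVED)]

for p in (resolved_path, mixed_path, blind_path):
    print(path_visibility(p).name)
```

A report could then cite only `RESOLVED` paths as confirmed while still listing `STRUCTURAL` and `UNRESOLVED` ones as leads for manual review.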

## Concrete example end-to-end

The 12-folder audit I did by hand would collapse to roughly:

```python
public_unauth_sudo = (
    pa.callables()
      .with_decorator("http.route", auth="public")
      .reachable_to(pa.callsites(sink_pattern("*.sudo().browse")))
      .without_passing_through(sanitizers=["check_access", "has_group"])
)

inapp_sudo_no_gate = (
    pa.callables()
      .with_decorator("http.route", auth="user")
      .reachable_to(pa.callsites(sink_pattern("*.sudo().*")))
      .without_passing_through(sanitizers=[
          "check_access", "has_group",
          "owner_user_id.id == request.env.user.id",   # value-predicate sanitizers
      ])
)

for c in public_unauth_sudo.objects():
    print(c.signature, "-> CONFIRMED unauth sudo path")
for c in inapp_sudo_no_gate.objects():
    print(c.signature, "-> CONDITIONALLY CONFIRMED (in_app_role)")
```

Twelve handwritten findings become two queries, each `.explain()`-able into the same evidence I produced by hand.

## What this is *not*

- Not a new analysis backend. It is a query layer over the existing Jedi + CodeQL results plus AST/token access for pattern matching.
- Not a scanner. CLDK should remain the analyst-side tool that audits scanner alerts; this proposal makes that audit composable and citable, not automatic.
- Not an opinionated security framework. Sanitizer/source/sink dictionaries are user-supplied (with the same justification discipline the `sources.json` validator already enforces); the library returns facts and provenance.

## Relationship to existing stubs

Two stubs in the current SDK become natural starting points for the chain API once implemented:

- `get_methods_with_decorators(...)` → `pa.callables().with_decorator(...)`
- `get_entry_point_methods(...)` → `pa.callables().is_entry_point()` (with framework recipes deciding what counts)

`get_calling_lines` and `get_call_targets` would underlie `.callers()` / `.callees()`.

## Why this is the highest-leverage change

Pandas is my reference: it is great because it makes composing common slicing operations cheap and citable. CLDK already has the underlying analysis facts; what's missing is the surface that turns them into composable, citable queries.

For framework-shaped security audits specifically (web, RPC, CLI, async messaging — i.e., most real-world Python codebases), the marginal value of this change is larger than the marginal value of any analysis-engine improvement. A composable query layer with pattern-based escape hatches and honest visibility labels would be awesome.

## Out of scope (suggested separate issues)

- Polyglot graphs that follow HTTP/gRPC/message-queue edges across languages
- Differential queries between commits / branches
- Bridge to coverage / dynamic instrumentation
- A framework-recipe registry (Flask, Django, FastAPI, Odoo, Express, Rails, Spring) shipping as data
