Commit f0a79ff

tausbn and Copilot committed
yeast: Add yeast documentation
Covers architecture, query language, template language (tree!/trees!/rule!), capture semantics, fresh identifiers, and extractor integration.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
1 parent e71b633 commit f0a79ff

1 file changed

Lines changed: 329 additions & 0 deletions

shared/yeast/doc/yeast.md
@@ -0,0 +1,329 @@
# YEAST: YEAST Elaborates Abstract Syntax Trees

YEAST is a framework for transforming tree-sitter parse trees before they are
extracted into a CodeQL database. It sits between the tree-sitter parser and
the TRAP extractor, rewriting parts of the AST according to declarative rules.

## Motivation

Tree-sitter grammars describe the **concrete syntax** of a language: every
keyword, operator, and punctuation token appears in the parse tree. CodeQL
analyses often prefer a **simplified abstract syntax** where syntactic sugar
has been removed. YEAST bridges this gap by desugaring the tree-sitter output
into a cleaner form before extraction.

For example, Ruby's `for x in list do ... end` is syntactic sugar for
`list.each { |x| ... }`. A YEAST rule can rewrite the former into the latter
so that CodeQL queries only need to reason about the `.each` form.

## Architecture

```
  Source code
       │
       ▼
┌──────────────┐
│ tree-sitter  │  Parse source into a concrete syntax tree
│    parser    │
└──────┬───────┘
       │ tree_sitter::Tree
       ▼
┌──────────────┐
│    YEAST     │  Apply desugaring rules, producing a new AST
│    Runner    │
└──────┬───────┘
       │ yeast::Ast
       ▼
┌──────────────┐
│     TRAP     │  Walk the (possibly rewritten) AST and emit TRAP tuples
│  extractor   │
└──────────────┘
```

The entry point is `extract()` in the shared tree-sitter extractor. When
called with a non-empty `rules` vector, the parsed tree is run through the
YEAST `Runner` before TRAP extraction; with an empty `rules` vector the
tree is extracted unchanged.

## How desugaring works

A YEAST `Rule` has two parts:

1. A **query** that matches nodes in the AST using a tree-sitter-inspired
   pattern language.
2. A **transform** that produces replacement nodes from the match captures.

The `Runner` applies rules by walking the tree top-down. At each node, it
tries each rule in order. If a rule's query matches, the node is replaced by
the transform's output, and the rules are re-applied to the result. If no
rule matches, the node is kept and its children are processed recursively.

A rule can replace one node with zero nodes (deletion), one node (rewriting),
or multiple nodes (expansion).
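
The traversal described above can be sketched on a toy tree. This is only an illustration of the top-down, first-match, re-apply strategy: `Node`, `Rule`, `rewrite`, and the three sample rules are hypothetical names invented for the sketch and are not the yeast crate's API.

```rust
// Toy model of the Runner's traversal; all names here are assumptions.
#[derive(Debug, Clone, PartialEq)]
struct Node {
    kind: String,
    children: Vec<Node>,
}

fn node(kind: &str, children: Vec<Node>) -> Node {
    Node { kind: kind.to_string(), children }
}

// A rule returns `Some(replacements)` when it matches and `None` otherwise.
type Rule = fn(&Node) -> Option<Vec<Node>>;

// Deletion: a comment is replaced by zero nodes.
fn drop_comment(n: &Node) -> Option<Vec<Node>> {
    if n.kind == "comment" { Some(vec![]) } else { None }
}

// Rewriting: `unless c ...` becomes `if (not c) ...`, one node for one node.
fn unless_to_if(n: &Node) -> Option<Vec<Node>> {
    if n.kind != "unless" || n.children.is_empty() {
        return None;
    }
    let mut children = n.children.clone();
    let cond = children.remove(0);
    children.insert(0, node("not", vec![cond]));
    Some(vec![node("if", children)])
}

// Expansion: a `group` is replaced by all of its children.
fn flatten_group(n: &Node) -> Option<Vec<Node>> {
    if n.kind == "group" { Some(n.children.clone()) } else { None }
}

fn rewrite(n: Node, rules: &[Rule]) -> Vec<Node> {
    // Try each rule in order; the first match wins.
    for rule in rules {
        if let Some(replacements) = rule(&n) {
            // Re-apply the rules to whatever the transform produced.
            return replacements.into_iter().flat_map(|r| rewrite(r, rules)).collect();
        }
    }
    // No rule matched: keep the node and process its children recursively.
    let Node { kind, children } = n;
    let children: Vec<Node> = children.into_iter().flat_map(|c| rewrite(c, rules)).collect();
    vec![Node { kind, children }]
}

fn main() {
    let rules: [Rule; 3] = [drop_comment, unless_to_if, flatten_group];
    let input = node("program", vec![
        node("comment", vec![]),
        node("unless", vec![node("cond", vec![]), node("call", vec![])]),
        node("group", vec![node("a", vec![]), node("b", vec![])]),
    ]);
    println!("{:#?}", rewrite(input, &rules));
}
```

Because the output of a matching rule is fed back through the rules, a rule whose output matches its own query would loop forever in this sketch; rule sets need to make progress towards a form that no rule matches.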
63+
64+
## Query language
65+
66+
Queries use a syntax inspired by
67+
[tree-sitter queries](https://tree-sitter.github.io/tree-sitter/using-parsers/queries/index.html),
68+
written inside the `yeast::query!()` proc macro.
69+
70+
### Node patterns
71+
72+
```rust
73+
// Match any named node
74+
(_)
75+
76+
// Match a node of a specific kind
77+
(assignment)
78+
79+
// Match an unnamed token by its text
80+
("end")
81+
```
82+
83+
### Fields
84+
85+
```rust
86+
// Match a node with specific fields
87+
(assignment
88+
left: (identifier) @lhs
89+
right: (_) @rhs
90+
)
91+
```
92+
93+
Fields are matched by name. Unmentioned fields are ignored — the pattern
94+
`(assignment left: (_) @x)` matches any `assignment` node regardless of
95+
what's in `right`.
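
As a rough illustration of "unmentioned fields are ignored", the sketch below matches a toy pattern against a toy node by checking only the fields the pattern names. `Node`, `Pattern`, and `matches` are hypothetical types written for this example, and treating a mentioned-but-absent field as a non-match is an assumption, not something taken from the yeast sources.

```rust
use std::collections::HashMap;

// Toy AST node: a kind plus named fields (one child per field, for brevity).
struct Node {
    kind: &'static str,
    fields: HashMap<&'static str, Node>,
}

// Toy pattern: `kind: None` plays the role of the `(_)` wildcard.
struct Pattern {
    kind: Option<&'static str>,
    fields: Vec<(&'static str, Pattern)>,
}

fn leaf(kind: &'static str) -> Node {
    Node { kind, fields: HashMap::new() }
}

fn matches(p: &Pattern, n: &Node) -> bool {
    if let Some(kind) = p.kind {
        if kind != n.kind {
            return false;
        }
    }
    // Only fields named in the pattern are inspected; any other field on the
    // node is ignored. A field the pattern names but the node lacks fails
    // the match (an assumption of this sketch).
    p.fields.iter().all(|(name, sub)| match n.fields.get(name) {
        Some(child) => matches(sub, child),
        None => false,
    })
}

fn main() {
    let assignment = Node {
        kind: "assignment",
        fields: HashMap::from([("left", leaf("identifier")), ("right", leaf("integer"))]),
    };
    // Corresponds to `(assignment left: (_))`: `right` is never looked at.
    let pattern = Pattern {
        kind: Some("assignment"),
        fields: vec![("left", Pattern { kind: None, fields: vec![] })],
    };
    println!("matches: {}", matches(&pattern, &assignment));
}
```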

### Captures

Captures bind matched nodes to names for use in the transform. A capture
`@name` always follows the pattern it captures:

```rust
(identifier) @name    // capture an identifier node
(_) @value            // capture any named node
(identifier)* @items  // capture each repeated match
```

### Unnamed children

Patterns that appear after all named fields are matched against the node's
unnamed (positional) children. Named node patterns like `(_)` automatically
skip unnamed tokens (keywords, operators, punctuation), matching tree-sitter
semantics:

```rust
(for
    pattern: (_) @pat        // named field
    value: (in (_) @val)     // "in" token is skipped automatically
    body: (do (_)* @body)    // "do" and "end" tokens skipped
)
```
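
The token-skipping behaviour can be pictured as a filter over a node's children. The sketch below is a simplified model, assuming each child carries a `named` flag in the way tree-sitter nodes do; `Child` and `named_children` are hypothetical names for this example.

```rust
// Toy child record; `named` mirrors tree-sitter's named/anonymous distinction.
#[derive(Debug, Clone, Copy, PartialEq)]
struct Child {
    kind: &'static str,
    named: bool,
}

// The children a named wildcard such as `(_)*` may bind to: anonymous tokens
// like "do" and "end" are skipped rather than captured.
fn named_children(children: &[Child]) -> Vec<Child> {
    children.iter().copied().filter(|c| c.named).collect()
}

fn main() {
    // Children of a Ruby `do ... end` body, in source order.
    let do_block = [
        Child { kind: "do", named: false },
        Child { kind: "call", named: true },
        Child { kind: "assignment", named: true },
        Child { kind: "end", named: false },
    ];
    for child in named_children(&do_block) {
        println!("captured: {}", child.kind);
    }
}
```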

### Repetitions

```rust
(_)*                  // zero or more
(_)+                  // one or more
(_)?                  // zero or one
(identifier)* @names  // capture each repeated match
```

## Template language

Templates construct new AST nodes using the `tree!` and `trees!` macros.
All children in a template must be in named fields: output AST nodes are
always fully fielded.

When used inside a `rule!` macro, the context is implicit, so no explicit
`BuildCtx` argument is needed. When used standalone, they take a `BuildCtx`
as the first argument:

```rust
// Inside rule!: implicit context, captures are Rust variables
yeast::rule!(
    (assignment left: (_) @left right: (_) @right)
    =>
    (assignment left: {right} right: {left})
);

// Standalone: explicit context
let fresh = yeast::tree_builder::FreshScope::new();
let mut ctx = BuildCtx::new(ast, &captures, &fresh);
let id = yeast::tree!(ctx,
    (assignment
        left: {ctx.capture("lhs")}
        right: {ctx.capture("rhs")}
    )
);
```

### `tree!`: build a single node

`tree!(...)` returns a single node `Id`:

```rust
yeast::tree!(ctx,
    (assignment
        left: {ctx.capture("lhs")}
        right: {ctx.capture("rhs")}
    )
)
```

### `trees!`: build multiple nodes

`trees!(...)` returns `Vec<Id>`:

```rust
yeast::trees!(ctx,
    (assignment left: {tmp} right: {right})
    {..body}
)
```

### Literal nodes

`(kind "text")` creates a leaf node with fixed text content:

```rust
(identifier "each")  // an identifier node whose text is "each"
```

### Computed literals

`(kind #{expr})` creates a leaf node whose content is `expr.to_string()`:

```rust
(integer #{i})        // an integer node with the value of i
(identifier #{name})  // an identifier from a Rust variable
```

### Fresh identifiers

`(kind $name)` creates a leaf node with an auto-generated unique name. All
occurrences of the same `$name` within one `BuildCtx` share the same value:

```rust
(block
    parameters: (block_parameters
        (identifier $tmp)          // generates e.g. "$tmp-0"
    )
    body: (block_body
        (assignment
            left: {pat}
            right: (identifier $tmp)  // same "$tmp-0" value
        )
    )
)
```
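
One way to picture the sharing rule is a per-context name table backed by a counter that outlives the context. The sketch below is a guess at the general shape, not the crate's implementation: `Scope` and `Ctx` are hypothetical stand-ins for `FreshScope` and `BuildCtx`, and both the shared counter and the exact `$name-N` format are assumptions based on the example comment above.

```rust
use std::cell::Cell;
use std::collections::HashMap;

// Stand-in for a scope shared by several build contexts (assumption).
struct Scope {
    next: Cell<usize>,
}

impl Scope {
    fn new() -> Self {
        Scope { next: Cell::new(0) }
    }
}

// Stand-in for one build context: remembers which name each `$name` received.
struct Ctx<'a> {
    scope: &'a Scope,
    names: HashMap<String, String>,
}

impl<'a> Ctx<'a> {
    fn new(scope: &'a Scope) -> Self {
        Ctx { scope, names: HashMap::new() }
    }

    // Returns the generated name for `$name`, creating it on first use.
    fn fresh(&mut self, name: &str) -> String {
        if let Some(existing) = self.names.get(name) {
            return existing.clone();
        }
        let n = self.scope.next.get();
        self.scope.next.set(n + 1);
        let generated = format!("${}-{}", name, n);
        self.names.insert(name.to_string(), generated.clone());
        generated
    }
}

fn main() {
    let scope = Scope::new();
    let mut first = Ctx::new(&scope);
    let mut second = Ctx::new(&scope);
    // Both uses of `$tmp` in one context resolve to the same identifier.
    println!("{} {}", first.fresh("tmp"), first.fresh("tmp"));
    // A different context gets a different identifier for `$tmp`.
    println!("{}", second.fresh("tmp"));
}
```

Keeping the counter outside the context is what would stop two applications of the same rule from producing clashing temporaries, while the per-context table keeps the block parameter and the assignment inside a single application in sync.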

### Embedded Rust expressions

`{expr}` embeds a Rust expression that returns a single node `Id`:

```rust
(assignment
    left: {some_node_id}  // insert a pre-built node
    right: {rhs}          // insert a captured value (inside rule!)
)
```

`{..expr}` splices a `Vec<Id>` (or any iterable of `Id`):

```rust
yeast::trees!(ctx,
    (assignment left: {tmp} right: {right})
    {..extra_nodes}  // splice a Vec<Id>
)
```

Inside `rule!`, captures are Rust variables, so `{name}` inserts a
single capture (`Id`) and `{..name}` splices a repeated capture
(`Vec<Id>`).

## Complete example: for-loop desugaring

This rule rewrites Ruby's `for pat in val do body end` into
`val.each { |tmp| pat = tmp; body }`:

```rust
let for_rule = yeast::rule!(
    (for
        pattern: (_) @pat
        value: (in (_) @val)
        body: (do (_)* @body)
    )
    =>
    (call
        receiver: {val}
        method: (identifier "each")
        block: (block
            parameters: (block_parameters
                (identifier $tmp)
            )
            body: (block_body
                (assignment
                    left: {pat}
                    right: (identifier $tmp)
                )
                {..body}
            )
        )
    )
);
```

Captures from the query (`@pat`, `@val`, `@body`) become Rust variables
automatically: single captures bind as `Id`, repeated captures (after
`*` or `+`) as `Vec<Id>`, and optional captures (after `?`) as
`Option<Id>`.
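
The three binding shapes can be modelled with a small capture table. This is a sketch under stated assumptions: `Id` is taken to be a plain index, and `Captures` with its `single`, `repeated`, and `optional` accessors is a hypothetical type invented here to show how one underlying list of matches could surface as `Id`, `Vec<Id>`, or `Option<Id>`.

```rust
use std::collections::HashMap;

// Assumption: node ids are plain indices in this sketch.
type Id = usize;

// Hypothetical capture table: every capture name maps to the ids it matched.
struct Captures(HashMap<String, Vec<Id>>);

impl Captures {
    // Plain `@name`: exactly one match is expected (panics if it is absent).
    fn single(&self, name: &str) -> Id {
        self.0[name][0]
    }

    // `@name` after `*` or `+`: all matches, possibly none.
    fn repeated(&self, name: &str) -> Vec<Id> {
        self.0.get(name).cloned().unwrap_or_default()
    }

    // `@name` after `?`: the match if there was one.
    fn optional(&self, name: &str) -> Option<Id> {
        self.0.get(name).and_then(|ids| ids.first().copied())
    }
}

fn main() {
    // Captures as they might look for `for pat in val do stmt1; stmt2 end`.
    let caps = Captures(HashMap::from([
        ("pat".to_string(), vec![1]),
        ("val".to_string(), vec![2]),
        ("body".to_string(), vec![3, 4]),
    ]));
    let pat: Id = caps.single("pat");
    let body: Vec<Id> = caps.repeated("body");
    let rescue: Option<Id> = caps.optional("rescue");
    println!("pat={pat} body={body:?} rescue={rescue:?}");
}
```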

## The `rule!` macro

`rule!` combines a query and a transform into a single declaration:

```rust
// Full template form
yeast::rule!(
    (query_pattern field: (_) @capture)
    =>
    (output_template field: {capture})
)

// Shorthand form: captures become fields on the output node
yeast::rule!(
    (query_pattern field: (_) @capture)
    => output_kind
)
```

The shorthand `=> kind` form auto-generates the template, mapping each
capture name to a field of the same name on the output node.

## Integration with the extractor

A YEAST desugaring pass is configured with a `DesugaringConfig`, which
carries the rules and an optional output node-types schema (in YAML
format). Attach it to a language spec to enable rewriting:

```rust
let desugar = yeast::DesugaringConfig::new(my_rules)
    .with_output_node_types_yaml(include_str!("output-node-types.yml"));

let lang = simple::LanguageSpec {
    prefix: "ruby",
    ts_language: tree_sitter_ruby::LANGUAGE.into(),
    node_types: tree_sitter_ruby::NODE_TYPES,
    desugar: Some(desugar),
    file_globs: vec!["*.rb".into()],
};
```

The same YAML node-types schema is used both for the runtime yeast `Schema`
(so rules can refer to output-only kinds and fields) and for TRAP validation
(it is converted to JSON internally).

For the dbscheme/QL code generator, set `Language::desugar` to a
`DesugaringConfig` carrying the same YAML; the generator converts it to
JSON for downstream code generation. The `rules` field of the config is
unused at code-generation time.
