Commit 5eb6cfd

tausbn and Copilot committed
yeast: Add yeast documentation
Covers architecture, query language, template language (tree!/trees!/rule!), capture semantics, fresh identifiers, and extractor integration. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
1 parent e6ebb9d commit 5eb6cfd

1 file changed

Lines changed: 314 additions & 0 deletions

shared/yeast/doc/yeast.md

@@ -0,0 +1,314 @@
# YEAST: YEAST Elaborates Abstract Syntax Trees

YEAST is a framework for transforming tree-sitter parse trees before they are
extracted into a CodeQL database. It sits between the tree-sitter parser and
the TRAP extractor, rewriting parts of the AST according to declarative rules.

## Motivation

Tree-sitter grammars describe the **concrete syntax** of a language: every
keyword, operator, and punctuation token appears in the parse tree. CodeQL
analyses often prefer a **simplified abstract syntax** where syntactic sugar
has been removed. YEAST bridges this gap by desugaring the tree-sitter output
into a cleaner form before extraction.

For example, Ruby's `for x in list do ... end` is syntactic sugar for
`list.each { |x| ... }`. A YEAST rule can rewrite the former into the latter
so that CodeQL queries only need to reason about the `.each` form.

## Architecture

```
Source code
       │
       ▼
┌──────────────┐
│ tree-sitter  │  Parse source into a concrete syntax tree
│   parser     │
└──────┬───────┘
       │ tree_sitter::Tree
       ▼
┌──────────────┐
│    YEAST     │  Apply desugaring rules, producing a new AST
│    Runner    │
└──────┬───────┘
       │ yeast::Ast
       ▼
┌──────────────┐
│    TRAP      │  Walk the (possibly rewritten) AST and emit TRAP tuples
│  extractor   │
└──────────────┘
```

The entry point is `extract_and_desugar()` in the shared tree-sitter
extractor, which passes a set of rules to the YEAST `Runner`. The original
`extract()` function passes empty rules, leaving the tree unchanged.

## How desugaring works

A YEAST `Rule` has two parts:

1. A **query** that matches nodes in the AST using a tree-sitter-inspired
   pattern language.
2. A **transform** that produces replacement nodes from the match captures.

The `Runner` applies rules by walking the tree top-down. At each node, it
tries each rule in order. If a rule's query matches, the node is replaced by
the transform's output, and the rules are re-applied to the result. If no
rule matches, the node is kept and its children are processed recursively.

A rule can replace one node with zero nodes (deletion), one node (rewriting),
or multiple nodes (expansion).

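The traversal described above can be sketched with a toy model. Everything in this block (`Node`, `Rule`, `rewrite`, and the two example rules) is a hypothetical stand-in written for illustration, not the actual `yeast::Runner` or `yeast::Rule` API. It only assumes what the text states: rules are tried in order, a match replaces the node and the rules are re-applied to the result, and a non-matching node keeps its place while its children are processed.

```rust
// Toy model of top-down rewriting; not the real yeast types.
#[derive(Debug, Clone, PartialEq)]
struct Node {
    kind: String,
    children: Vec<Node>,
}

fn node(kind: &str, children: Vec<Node>) -> Node {
    Node { kind: kind.to_string(), children }
}

// A rule returns None when it does not match, or Some(replacements):
// zero nodes (deletion), one node (rewriting), or several (expansion).
type Rule = fn(&Node) -> Option<Vec<Node>>;

fn rewrite(n: Node, rules: &[Rule]) -> Vec<Node> {
    // Try each rule in order at the current node.
    for rule in rules {
        if let Some(replacements) = rule(&n) {
            // A rule matched: re-apply the rules to every replacement.
            return replacements
                .into_iter()
                .flat_map(|r| rewrite(r, rules))
                .collect();
        }
    }
    // No rule matched: keep the node, process its children recursively.
    let children = n
        .children
        .into_iter()
        .flat_map(|c| rewrite(c, rules))
        .collect();
    vec![Node { kind: n.kind, children }]
}

// Deletion: drop every `comment` node.
fn drop_comments(n: &Node) -> Option<Vec<Node>> {
    if n.kind == "comment" { Some(vec![]) } else { None }
}

// Expansion: replace a `pair` node with its children.
fn flatten_pairs(n: &Node) -> Option<Vec<Node>> {
    if n.kind == "pair" { Some(n.children.clone()) } else { None }
}

// Rewrites program(comment, pair(a, comment, b)) with both rules.
fn demo() -> Vec<Node> {
    let tree = node("program", vec![
        node("comment", vec![]),
        node("pair", vec![
            node("a", vec![]),
            node("comment", vec![]),
            node("b", vec![]),
        ]),
    ]);
    let rules: [Rule; 2] = [drop_comments, flatten_pairs];
    rewrite(tree, &rules)
}
```

In this sketch, `demo()` yields a single `program` node whose children are `a` and `b`. One consequence of re-applying rules to their own output is that, in this toy model, a rule whose output matches its own query would recurse forever; the source does not say how the real `Runner` handles that case.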
## Query language

Queries use a syntax inspired by
[tree-sitter queries](https://tree-sitter.github.io/tree-sitter/using-parsers/queries/index.html),
written inside the `yeast::query!()` proc macro.

### Node patterns

```rust
// Match any named node
(_)

// Match a node of a specific kind
(assignment)

// Match an unnamed token by its text
("end")
```

### Fields

```rust
// Match a node with specific fields
(assignment
  left: (identifier) @lhs
  right: (_) @rhs
)
```

Fields are matched by name. Unmentioned fields are ignored: the pattern
`(assignment left: (_) @x)` matches any `assignment` node regardless of
what's in `right`.

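The rule that unmentioned fields are ignored can be illustrated with a small matcher. This is a sketch under assumptions: `Node`, `Pattern`, and `matches` are hypothetical toy types, each field holds at most one child, and nothing here reflects how YEAST actually represents or evaluates queries.

```rust
use std::collections::HashMap;

// Toy stand-ins for illustration; these are not the real yeast types.
struct Node {
    kind: &'static str,
    fields: HashMap<&'static str, Node>,
}

struct Pattern {
    kind: Option<&'static str>, // None plays the role of the wildcard `(_)`
    fields: Vec<(&'static str, Pattern)>,
    capture: Option<&'static str>,
}

// Returns true if `p` matches `n`, recording captures along the way.
// Only fields named in the pattern are looked up; others are never inspected.
fn matches<'a>(
    p: &Pattern,
    n: &'a Node,
    caps: &mut HashMap<&'static str, &'a Node>,
) -> bool {
    if let Some(kind) = p.kind {
        if kind != n.kind {
            return false;
        }
    }
    for (name, sub) in &p.fields {
        match n.fields.get(name) {
            Some(child) => {
                if !matches(sub, child, caps) {
                    return false;
                }
            }
            None => return false,
        }
    }
    if let Some(name) = p.capture {
        caps.insert(name, n);
    }
    true
}

fn leaf(kind: &'static str) -> Node {
    Node { kind, fields: HashMap::new() }
}

// Checks `(assignment left: (_) @x)` against an assignment that also has a
// `right` field, and returns the kind of the node captured as `x`.
fn demo() -> Option<&'static str> {
    let mut fields = HashMap::new();
    fields.insert("left", leaf("identifier"));
    fields.insert("right", leaf("integer"));
    let assignment = Node { kind: "assignment", fields };

    let pattern = Pattern {
        kind: Some("assignment"),
        fields: vec![(
            "left",
            Pattern { kind: None, fields: vec![], capture: Some("x") },
        )],
        capture: None,
    };

    let mut caps = HashMap::new();
    if matches(&pattern, &assignment, &mut caps) {
        caps.get("x").map(|n| n.kind)
    } else {
        None
    }
}
```

Here the pattern never mentions `right`, so the `integer` node stored there plays no part in the match, and `x` is bound to the `identifier` node in `left`.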
### Captures

Captures bind matched nodes to names for use in the transform. A capture
`@name` always follows the pattern it captures:

```rust
(identifier) @name      // capture an identifier node
(_) @value              // capture any named node
(identifier)* @items    // capture each repeated match
```

### Unnamed children

Patterns that appear after all named fields match unnamed (positional)
children. Named node patterns like `(_)` automatically skip unnamed tokens
(keywords, operators, punctuation), matching tree-sitter semantics:

```rust
(for
  pattern: (_) @pat       // named field
  value: (in (_) @val)    // "in" token is skipped automatically
  body: (do (_)* @body)   // "do" and "end" tokens skipped
)
```

### Repetitions

```rust
(_)*                    // zero or more
(_)+                    // one or more
(_)?                    // zero or one
(identifier)* @names    // capture each repeated match
```

## Template language

Templates construct new AST nodes using the `tree!` and `trees!` macros.
All children in a template must be in named fields; output AST nodes are
always fully fielded.

When used inside a `rule!` macro, the context is implicit and no explicit
`BuildCtx` argument is needed. When used standalone, they take a `BuildCtx`
as the first argument:

```rust
// Inside rule!: implicit context, captures are Rust variables
yeast::rule!(
    (assignment left: (_) @left right: (_) @right)
    =>
    (assignment left: {right} right: {left})
);

// Standalone: explicit context
let fresh = yeast::tree_builder::FreshScope::new();
let mut ctx = BuildCtx::new(ast, &captures, &fresh);
let id = yeast::tree!(ctx,
    (assignment
        left: {ctx.capture("lhs")}
        right: {ctx.capture("rhs")}
    )
);
```

### `tree!`: build a single node

`tree!(...)` returns a single node `Id`:

```rust
yeast::tree!(ctx,
    (assignment
        left: {ctx.capture("lhs")}
        right: {ctx.capture("rhs")}
    )
)
```

### `trees!`: build multiple nodes

`trees!(...)` returns `Vec<Id>`:

```rust
yeast::trees!(ctx,
    (assignment left: {tmp} right: {right})
    {..body}
)
```

### Literal nodes

`(kind "text")` creates a leaf node with fixed text content:

```rust
(identifier "each")    // an identifier node whose text is "each"
```

### Computed literals

`(kind #{expr})` creates a leaf node whose content is `expr.to_string()`:

```rust
(integer #{i})          // an integer node with the value of i
(identifier #{name})    // an identifier from a Rust variable
```

### Fresh identifiers

`(kind $name)` creates a leaf node with an auto-generated unique name. All
occurrences of the same `$name` within one `BuildCtx` share the same value:

```rust
(block
  parameters: (block_parameters
    (identifier $tmp)              // generates e.g. "$tmp-0"
  )
  body: (block_body
    (assignment
      left: {pat}
      right: (identifier $tmp)     // same "$tmp-0" value
    )
  )
)
```

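The sharing behaviour of `$name` can be modelled with a small name table. This is only a sketch: `ToyFreshScope` is a hypothetical type, not the real `yeast::tree_builder::FreshScope`, and the `$base-N` numbering scheme is an assumption extrapolated from the `"$tmp-0"` example above. It models a single build context only.

```rust
use std::collections::HashMap;

// Hypothetical stand-in for a fresh-name table; not the real FreshScope.
#[derive(Default)]
struct ToyFreshScope {
    counter: usize,
    names: HashMap<String, String>,
}

impl ToyFreshScope {
    // Return the generated name for `base`, creating it on first use.
    // Later requests for the same `base` return the same generated name.
    fn get(&mut self, base: &str) -> String {
        if let Some(existing) = self.names.get(base) {
            return existing.clone();
        }
        // Assumed naming scheme: "$<base>-<counter>".
        let generated = format!("${}-{}", base, self.counter);
        self.counter += 1;
        self.names.insert(base.to_string(), generated.clone());
        generated
    }
}
```

With this toy table, asking for `tmp` twice returns the same string both times, while a different base such as `other` gets the next counter value.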
### Embedded Rust expressions

`{expr}` embeds a Rust expression that returns a single node `Id`:

```rust
(assignment
  left: {some_node_id}    // insert a pre-built node
  right: {rhs}            // insert a captured value (inside rule!)
)
```

`{..expr}` splices a `Vec<Id>` (or any iterable of `Id`):

```rust
yeast::trees!(ctx,
    (assignment left: {tmp} right: {right})
    {..extra_nodes}    // splice a Vec<Id>
)
```

Inside `rule!`, captures are Rust variables, so `{name}` inserts a
single capture (`Id`) and `{..name}` splices a repeated capture
(`Vec<Id>`).

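The difference between `{expr}` and `{..expr}` comes down to how many nodes each contributes to the output list. The following sketch models that with plain Rust; `Id`, `Item`, and `expand` are hypothetical names chosen for illustration and say nothing about how the macros are actually expanded.

```rust
// Assumed stand-in for a node handle; the real `Id` type may differ.
type Id = usize;

// One template item: `{expr}` contributes a single node,
// `{..expr}` contributes every node of a sequence, in order.
enum Item {
    Single(Id),
    Splice(Vec<Id>),
}

// Flatten a list of template items into the resulting node list.
fn expand(items: Vec<Item>) -> Vec<Id> {
    let mut out = Vec::new();
    for item in items {
        match item {
            Item::Single(id) => out.push(id),
            Item::Splice(ids) => out.extend(ids),
        }
    }
    out
}
```

In this model, splicing an empty sequence adds nothing, which is the behaviour one would expect when a `(_)*` capture matched zero nodes.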
## Complete example: for-loop desugaring

This rule rewrites Ruby's `for pat in val do body end` into
`val.each { |tmp| pat = tmp; body }`:

```rust
let for_rule = yeast::rule!(
    (for
        pattern: (_) @pat
        value: (in (_) @val)
        body: (do (_)* @body)
    )
    =>
    (call
        receiver: {val}
        method: (identifier "each")
        block: (block
            parameters: (block_parameters
                (identifier $tmp)
            )
            body: (block_body
                (assignment
                    left: {pat}
                    right: (identifier $tmp)
                )
                {..body}
            )
        )
    )
);
```

Captures from the query (`@pat`, `@val`, `@body`) become Rust variables
automatically: single captures bind as `Id`, repeated captures (after
`*` or `+`) as `Vec<Id>`, and optional captures (after `?`) as
`Option<Id>`.

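The mapping from quantifier to capture type can be written out explicitly. The sketch below is a toy model: `Id`, `Quantifier`, `Captured`, and `capture` are hypothetical names, and the real macro binds separate Rust variables of different types instead of returning an enum. It only encodes the three cases named in the paragraph above.

```rust
// Assumed stand-in for a node handle; the real `Id` type may differ.
type Id = usize;

enum Quantifier {
    One,        // plain pattern
    ZeroOrMore, // `*`
    OneOrMore,  // `+`
    ZeroOrOne,  // `?`
}

#[derive(Debug, PartialEq)]
enum Captured {
    Single(Id),           // plain capture: Id
    Repeated(Vec<Id>),    // after `*` or `+`: Vec<Id>
    Optional(Option<Id>), // after `?`: Option<Id>
}

// Given the nodes a quantified pattern matched, produce the capture value,
// or None if the number of matches violates the quantifier.
fn capture(q: Quantifier, matched: &[Id]) -> Option<Captured> {
    match (q, matched) {
        (Quantifier::One, [id]) => Some(Captured::Single(*id)),
        (Quantifier::ZeroOrMore, ids) => Some(Captured::Repeated(ids.to_vec())),
        (Quantifier::OneOrMore, ids) if !ids.is_empty() => {
            Some(Captured::Repeated(ids.to_vec()))
        }
        (Quantifier::ZeroOrOne, []) => Some(Captured::Optional(None)),
        (Quantifier::ZeroOrOne, [id]) => Some(Captured::Optional(Some(*id))),
        _ => None,
    }
}
```

In the for-loop rule, `@pat` and `@val` would correspond to the `Single` case and `@body` to the `Repeated` case, which is why the template uses `{pat}` and `{val}` but `{..body}`.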
## The `rule!` macro

`rule!` combines a query and a transform into a single declaration:

```rust
// Full template form
yeast::rule!(
    (query_pattern field: (_) @capture)
    =>
    (output_template field: {capture})
)

// Shorthand form: captures become fields on the output node
yeast::rule!(
    (query_pattern field: (_) @capture)
    => output_kind
)
```

The shorthand `=> kind` form auto-generates the template, mapping each
capture name to a field of the same name on the output node.

## Integration with the extractor

YEAST integrates with the shared tree-sitter extractor via two mechanisms:

1. **`extract_and_desugar()`**: like `extract()`, but takes a
   `Vec<yeast::Rule>` to apply before TRAP extraction.

2. **`LanguageSpec::output_node_types`**: when desugaring produces an AST
   with different node types than the tree-sitter grammar, this field points
   to a separate `node-types.json` describing the output schema.

Languages that don't use desugaring simply call `extract()`, which passes
empty rules internally.
