Skip to content

Switch pod-scheduler configs from JSON to YAML; add JSON Schema#2170

Open
LoopedBard3 wants to merge 8 commits into
aspnet:mainfrom
LoopedBard3:loopedbard3/pod-config-format-options
Open

Switch pod-scheduler configs from JSON to YAML; add JSON Schema#2170
LoopedBard3 wants to merge 8 commits into
aspnet:mainfrom
LoopedBard3:loopedbard3/pod-config-format-options

Conversation

@LoopedBard3
Copy link
Copy Markdown
Contributor

@LoopedBard3 LoopedBard3 commented May 7, 2026

Why

Follow-up to #2167. The pod-scheduler configs introduced there were JSON, which makes them awkward to read and edit — no inline comments, lots of quote/comma noise, and type: 3 is a magic integer you have to grep models.py to understand. The named-key machines: {sut: ..., load: ..., db: ...} shape also turned out to be more friction than information.

This PR switches the source-of-truth configs to YAML, ships a JSON Schema, and simplifies the pod definition along the way.

What

Format

  • Move build/benchmarks_ci{,_azure,_cobalt}_pods.json.yml. Each opens with:
    # yaml-language-server: $schema=../scripts/pod-scheduler/pod-config.schema.json
    so VS Code / Cursor / the Red Hat YAML extension provide autocomplete, hover docs, and inline validation while editing.
  • Block style throughout; inline comments document the things that used to live only in PR review threads (e.g., that gold-lin and gold-win share gold-db).

Pod shape

machines and profiles are now parallel arrays of 1–3 unique non-empty strings. machines[i] runs profiles[i], and the role of slot i is implicit by length (1 = SUT, 2 = SUT + load, 3 = SUT + load + db). No sut/load/db keys anywhere.

- name: gold-lin
  machines:
  - gold-lin
  - gold-load
  - gold-db
  profiles:
  - gold-lin-app
  - gold-load-load
  - gold-db-db

The loader enforces, per pod:

  • len(machines) == len(profiles) — each machine pairs with exactly one profile,
  • both lists are unique within the pod,
  • bools (the YAML yes/no trap) and other non-strings are rejected.

Global machine → one profile is intentionally not enforced — azure2-db legitimately serves the db role in azure2-amd64 and the load role in idna-amd-win with different profiles in each. A regression test pins that down.

Self-documenting scenario type

type: 1 | 2 | 3type: single | dual | triple (case-insensitive). Integer form is still accepted for back-compat; bools and unknown strings raise ConfigError.

Loader

  • yaml.safe_load for any path. YAML is a strict superset of JSON, so existing .json configs continue to parse transparently — no extension dispatch.
  • Numeric fields (timeout, target_yaml_count, schedule_offset_hours, estimated_runtime) go through helpers that reject bools and enforce range minimums so a timeout: yes can't quietly resolve to 1.

Schema

  • New scripts/pod-scheduler/pod-config.schema.json (Draft 2020-12). Schema descriptions double as grounding for LLM-driven edits.
  • Validated locally with the jsonschema package; all three configs validate clean.
  • additionalProperties: false on pods and scenarios catches misspelled keys at edit time (e.g. mahcines: instead of machines:).

Model

  • Scenario.type is a plain int (1/2/3 with SCENARIO_TYPE_* named constants). The previous ScenarioType IntEnum is gone.
  • Pod stores machines: List[str] and profiles: List[str] directly — no per-role fields. machines_for_type(t) and profiles_for_type(t) are just slices.
  • Adding a 4th role later only requires bumping the schema's maxItems, appending to ROLE_NAMES, and adding one alias entry. The rest of the code derives everything from len(pod.machines).

Generated pipelines

  • benchmarks-ci-01.yml, -02.yml, -azure.yml, -cobalt.yml regenerated. The only delta vs HEAD is the embedded regen-command line in each file's header (.json.yml); scheduling output is byte-identical, enforced by the snapshot test.

Dependencies and tooling

  • New scripts/pod-scheduler/requirements.txt declares the runtime PyYAML dependency. The README gets a Prerequisites subsection pointing at it so a clean checkout has a documented pip install -r ... step.

Upstream sync

Tests

  • 68 unit tests pass. New coverage includes:
    • All three string aliases for type (case variations) + integer back-compat.
    • Bool rejection on every numeric field (type, timeout, estimated_runtime, target_yaml_count, schedule_offset_hours) and on entries inside machines/profiles.
    • Pod construction invariants: length mismatch, empty list, too-long list, duplicate entries.
    • Same physical machine playing different roles across pods (global non-1:1 invariant).
    • .json back-compat via the YAML parser.

Docs

  • Top-level README.md, build/README.md, and scripts/pod-scheduler/README.md rewritten with YAML examples, the schema-directive convention, and the parallel-array pod shape.

Verification

python -m pip install -r scripts/pod-scheduler/requirements.txt
cd scripts/pod-scheduler
python -m unittest discover tests
# Ran 68 tests in ~0.3s — OK

The snapshot test enforces that regenerating from the new YAML configs produces byte-identical pipeline YAML, so this is a config/format refactor with no behavioral change.

Notes for reviewers

  • _format_source_path keeps the repo-relative URL behavior; it now expects .yml paths to match how main.py is invoked from CI.
  • _parse_scenario_type explicitly rejects bool because yaml.safe_load turns yes/no/true/false into Python bools, which would otherwise quietly resolve to 1/0 via the integer alias. The same bool-guard runs on every numeric field.
  • The schema's maxItems: 3 is the single remaining hardcoded role-count cap; everything else in the code path derives from len(pod.machines).

The three pod configs in build/ were JSON, which made them hard to read
and didn't allow inline comments. This converts them to YAML, ships a
JSON Schema describing the shape, and changes scenario.type from a
magic 1/2/3 integer to a self-documenting single/dual/triple string.

* build/benchmarks_ci_pods.yml, benchmarks_ci_azure_pods.yml, and
  benchmarks_ci_cobalt_pods.yml replace their .json predecessors. Each
  opens with a yaml-language-server # yaml-language-server: schema=...
  directive so VS Code / Cursor / the YAML LSP provide autocomplete,
  hover docs, and inline validation while editing.
* scripts/pod-scheduler/pod-config.schema.json (new) is the schema. The
  schema descriptions also serve as reliable grounding for LLM-driven
  edits.
* config_loader.py now dispatches on file extension and accepts both
  YAML and JSON. scenario.type accepts single|dual|triple
  (case-insensitive) plus the legacy integer 1|2|3 for back-compat;
  bools (which YAML parsers happily produce from yes/no) are rejected
  explicitly. Unknown extensions and unknown type strings raise
  ConfigError so typos can't silently drop scenarios.
* Generated benchmarks-ci-*.yml files only change the embedded regen
  command in the file header (.json -> .yml). Schedule data is
  byte-identical, verified by the existing snapshot test.
* Tests: 5 new cases covering YAML loading, type-string aliases, type
  back-compat, bool/invalid rejection, and unknown-extension rejection.
  Default fixture writer now produces YAML so the YAML path is
  exercised on every test.
* Docs: README.md, build/README.md, and scripts/pod-scheduler/README.md
  are updated with YAML examples and the schema-directive convention.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR migrates the pod-scheduler’s source-of-truth configs in build/ from JSON to YAML to improve readability (comments) and editor support, and introduces a JSON Schema to validate/configure those YAML files while keeping JSON loading for backward compatibility.

Changes:

  • Converted the three build/benchmarks_ci_*_pods.json configs to .yml equivalents and updated references/docs accordingly.
  • Added scripts/pod-scheduler/pod-config.schema.json and adopted yaml-language-server $schema directives in shipped configs.
  • Updated the loader to dispatch by extension (YAML/JSON) and to accept scenario.type as single|dual|triple (case-insensitive) while still allowing legacy 1|2|3.

Reviewed changes

Copilot reviewed 22 out of 22 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
scripts/pod-scheduler/config_loader.py Load YAML/JSON by extension; add scenario type aliasing and validation.
scripts/pod-scheduler/pod-config.schema.json New JSON Schema describing the YAML/JSON config shape for LSP validation.
scripts/pod-scheduler/models.py Update docs/comments to reflect YAML/JSON config sources.
scripts/pod-scheduler/generator.py Update generated header regen command to reference .yml.
scripts/pod-scheduler/scheduler.py Docstring wording updated from JSON-specific to config-generic.
scripts/pod-scheduler/main.py CLI help/usage updated to YAML/JSON.
scripts/pod-scheduler/README.md Docs updated with YAML examples and schema-driven editing guidance.
scripts/pod-scheduler/tests/test_config_loader.py Add YAML-path tests + scenario.type alias/back-compat coverage.
scripts/pod-scheduler/tests/test_generator.py Update expected filenames/paths from .json to .yml.
scripts/pod-scheduler/tests/test_snapshots.py Snapshot inputs updated to new .yml configs.
build/benchmarks_ci_pods.yml New YAML config replacing JSON for on-prem CI pods.
build/benchmarks_ci_azure_pods.yml New YAML config replacing JSON for Azure CI pods.
build/benchmarks_ci_cobalt_pods.yml New YAML config replacing JSON for Cobalt CI pods.
build/benchmarks_ci_pods.json Removed legacy JSON config (replaced by YAML).
build/benchmarks_ci_azure_pods.json Removed legacy JSON config (replaced by YAML).
build/benchmarks_ci_cobalt_pods.json Removed legacy JSON config (replaced by YAML).
build/benchmarks-ci-01.yml Regenerated header updated to .yml config path.
build/benchmarks-ci-02.yml Regenerated header updated to .yml config path.
build/benchmarks-ci-azure.yml Regenerated header updated to .yml config path.
build/benchmarks-ci-cobalt.yml Regenerated header updated to .yml config path.
build/README.md Docs updated to reference YAML configs and schema directive.
README.md Top-level doc updated to say pipelines are generated from YAML configs.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread scripts/pod-scheduler/config_loader.py
Comment thread scripts/pod-scheduler/config_loader.py Outdated
Comment thread scripts/pod-scheduler/README.md
Comment thread build/benchmarks_ci_azure_pods.yml Outdated
LoopedBard3 and others added 2 commits May 7, 2026 14:45
Three of the review comments on aspnet#2170 flagged the same
class of issue around YAML/Python edge cases. Addressing them all here:

* scripts/pod-scheduler/requirements.txt (new) declares the runtime
  PyYAML dependency that landed with the YAML migration. Without this,
  a clean checkout fails with ModuleNotFoundError.
* scripts/pod-scheduler/README.md gains a Prerequisites subsection
  pointing at requirements.txt so contributors know what to install.
* config_loader._reject_bool / _coerce_int / _coerce_float (new
  helpers) explicitly reject Python bools in numeric fields. Previously
  `timeout: yes` would slip through `int(timeout)` and resolve to
  1 because bool is a subclass of int. The new helpers also enforce
  range minimums (target_yaml_count >= 1, schedule_offset_hours >= 0,
  timeout >= 1, estimated_runtime >= 0) so out-of-band values fail
  loudly at load time.
* test_config_loader.py adds 6 cases covering bool rejection on each
  numeric field plus negative-runtime and below-minimum target-count.
  All 55 tests pass.

The schema already used "type": "integer" for these fields (which in
JSON Schema 2020-12 explicitly excludes booleans), so editor-side
validation was already correct; this commit closes the runtime gap.

Generated pipeline YAML still regenerates byte-identical to HEAD.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Per @sebastienros's review feedback on aspnet#2170, the
named-key form for `machines` / `profiles` was visually noisy when
almost every pod follows the same SUT-load-DB convention. Both blocks
now accept either:

* a positional list of 1-3 entries (`[sut]` / `[sut, load]` /
  `[sut, load, db]`), preferred for new pods, or
* the existing named-dict form (`{sut: ..., load: ..., db: ...}`),
  still supported.

The two blocks for a single pod must use the same form and declare the
same set of roles; mixing list and dict, or differing lengths, raises
`ConfigError` at load time so the role mapping can't drift silently.
Schema (pod-config.schema.json) gets a `oneOf` for both blocks; the
YAML LSP catches shape errors at edit time.

* `config_loader._normalize_roles` resolves either form to a canonical
  ordered dict and rejects unknown roles, bools, non-strings, and empty
  values. The pod loader then enforces same-shape and same-role-set
  between machines and profiles before constructing the Pod.
* All three committed configs (`benchmarks_ci_pods.yml`,
  `benchmarks_ci_azure_pods.yml`, `benchmarks_ci_cobalt_pods.yml`)
  switch to the shorthand form. Each gets a header comment above its
  pods block explaining the convention. Pods that re-use a machine
  across roles (e.g. `cobalt-cloud-lin-azl3-dual` puts the db box in
  the `load` slot) keep the inline comment that flags the reuse.
* `scripts/pod-scheduler/README.md` gains a Pod Definition section
  showing both forms and recommending the list form for new pods.
* test_config_loader.py adds 10 cases covering shorthand happy paths
  (1/2/3 entries), mixed-shape rejection, length mismatch, empty/too-long
  lists, bool entries, and unknown-role keys. All 65 tests pass.

Generated pipeline YAMLs are byte-identical to HEAD: this is purely an
input-format change, the schedule data is unchanged.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Comment thread build/benchmarks_ci_pods.yml Outdated
Comment thread build/benchmarks_ci_pods.yml Outdated
LoopedBard3 and others added 5 commits May 7, 2026 15:48
Two simplifications from @sebastienros's round 3 review on
aspnet#2170:

* Drop the per-extension dispatch in config_loader._load_raw. YAML is a
  strict superset of JSON, so yaml.safe_load handles both. Old .json
  configs still load via the YAML parser, and the loader no longer cares
  about the file extension. _load_raw is a one-liner now.
* Reshape pod definitions into parallel arrays. Both `machines` and
  `profiles` are flat 1-3 entry lists; machines[i] runs profiles[i].
  The role of slot i is implicit by length (1=SUT, 2=SUT+load,
  3=SUT+load+db) so neither list carries explicit `sut`/`load`/`db`
  keys. The named-dict form is no longer accepted.

The loader now enforces, per pod:

* len(machines) == len(profiles), so each machine pairs with exactly one
  profile.
* machines and profiles are both lists of 1-3 unique non-empty strings.
* bools (yes/no/true/false from YAML) are rejected explicitly.

Per-pod uniqueness is enforced; global `machine -> single profile`
isn't, because real configs legitimately reuse a physical box across
roles in different pods (e.g. azure2-db is db for one pod and load for
another). Test `test_global_machine_can_have_different_profiles_across_pods`
documents that as intended.

Configs:

* All three .yml configs reformatted from inline flow style
  `[a, b, c]` to block style and converted to the parallel-array
  form. Each retains the existing inline comment for pods that re-use
  machines across roles. The header comment over each file's pods block
  now describes the parallel-array semantics.
* Schema (`pod-config.schema.json`) tightens machines/profiles to
  `array of unique strings, minItems 1, maxItems 3`; the old oneOf
  with the dict variant is gone.
* README's Pod Definition section rewritten around the new shape.

Tests:

* Drop tests for removed behavior (extension rejection, dict form,
  named-role rejection, mixed-shape rejection).
* Add `TestPodRoles` covering single/dual/triple happy paths,
  length-mismatch, machines uniqueness, profiles uniqueness, empty/too-
  long lists, dict form rejection, bool entries, and the global
  same-machine/different-profile case.
* 66 tests pass (was 65).

Generated pipeline YAMLs are byte-identical to HEAD; this is purely an
input-format simplification.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Future-proofing tweak following the round-3 review on
aspnet#2170. Removes three of the four `maxItems: 3`
hardcoding sites so adding a fourth role later (e.g. a second load
machine) only requires a schema bump.

* models.ScenarioType IntEnum is gone. `Scenario.type` is now a plain
  `int` (1/2/3 == single/dual/triple). `SCENARIO_TYPE_SINGLE` /
  `DUAL` / `TRIPLE` module constants stay for readable references.
  `DEFAULT_RUNTIMES` is keyed by int.
* `Pod` no longer has `sut` / `load` / `db` / `*_profile`
  fields. The dataclass stores `machines: List[str]` and
  `profiles: List[str]` (parallel arrays). Backward-compat properties
  (`pod.sut`, `pod.load`, etc.) read from the lists, so call sites
  that need a named slot (the conflicts diagnostic) keep working
  unchanged.
* `pod.machines_for_type(t)` and `profiles_for_type(t)` are slices
  (`self.machines[:t]`); `validate(t)` is just `t > len(machines)`.
  No more 2/3 magic numbers branching on which fields are populated.
* `__post_init__` enforces `len(machines) == len(profiles)` and
  non-empty machines, so an invalid Pod can't be constructed.
* `main.print_pod_conflicts` iterates `pod.machines` against
  `ROLE_NAMES` (a new module-level constant) for the diagnostic
  output, so adding a 4th role just needs a name appended to
  `ROLE_NAMES` -- everything else falls out of `len(pod.machines)`.
* All call sites and tests updated. `test_models` gains 4 cases
  exercising the new constructor invariants and the backward-compat
  property accessors. 70 tests pass (was 66).

The schema's `maxItems: 3` is now the single remaining `3`
hardcoding -- that's the user-facing constraint, which is the right
place for it.

Generated pipeline YAMLs are byte-identical to HEAD; this is purely an
internal model refactor.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…t#2169)

Upstream PR aspnet#2169 removed the four pods whose SUT machines were
end-of-lifed (intel-lin, intel-win, intel-perflin, amd-lin2), leaving
only gold-lin and gold-win on the on-prem CI fleet. This merge applies
the same prune to our YAML-format pod config and regenerates the
pipeline YAMLs.

Conflict resolution:

* build/benchmarks_ci_pods.json -- upstream modified the (now-deleted)
  JSON config; we kept our deletion. The equivalent edits are applied
  directly to build/benchmarks_ci_pods.yml: pods is trimmed to
  [gold-lin, gold-win] and every scenario's pods: list is pruned to
  match upstream's post-aspnet#2169 contents.
* build/benchmarks-ci-01.yml / -02.yml -- regenerated from the updated
  YAML config. Output matches upstream's regenerated pipelines exactly,
  modulo the embedded regen-command header which (correctly) now
  points at the .yml config instead of the deleted .json.

The Azure and Cobalt pod configs are unaffected by aspnet#2169 -- they don't
reference any of the EOL'd machines -- and their pipeline YAMLs are
unchanged.

All 70 pod-scheduler unit tests still pass.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
These were added in 5b0bb62 as `backward-compat single-role accessors`
for callers that `need a named slot without caring how many roles a pod
has`. In the same commit I refactored the one real caller
(`main.print_pod_conflicts`) to iterate `pod.machines` against
`ROLE_NAMES` instead, so the safety net never had anything to catch.

Grep confirmed the only remaining callers were tests asserting that the
properties returned what we'd expect -- circular coverage. Removing the
properties and updating the tests to read `pod.machines` /
`pod.profiles` directly. The new `test_role_count_reflects_machines_length`
covers the only computed property worth keeping (`role_count`).

69 tests pass (was 70 -- replaced two named-accessor tests with one
role_count test). Generated pipeline YAMLs unchanged.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Audit follow-up to baa02f7. Four small items found:

* models.Pod.role_count -- another property that only existed to be
  tested. Production code uses len(pod.machines) directly. Removed
  along with its test (test_role_count_reflects_machines_length).
* config_loader._coerce_role_list docstring referenced _ROLE_ORDER,
  a constant that was removed in 9828898 when the named-dict form
  was dropped. Updated to point at models.ROLE_NAMES instead.
* config_loader._reject_bool docstring said it covered
  `numeric/string fields`, but it's only called from _coerce_int
  and _coerce_float. The string-list bool check in _coerce_role_list
  is inline. Trimmed the docstring to `numeric fields`.
* generator._render_yaml / generate_yamls -- `source_config` was
  Optional[str] = None with an else-branch that emitted a placeholder
  `<config>` regen command. Every caller (main.py and
  test_snapshots.py) always passes a concrete value, so the
  placeholder branch was unreachable. Tightened to `str` with no
  fallback rendering.

68 tests pass (was 69 with role_count removal). Generated pipeline
YAMLs byte-identical to HEAD.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@LoopedBard3 LoopedBard3 requested a review from DrewScoggins May 14, 2026 17:52
@LoopedBard3 LoopedBard3 marked this pull request as ready for review May 14, 2026 17:53
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants