feat: add Custom Engine with local transport#24
Conversation
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 477d2a1a82
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 6e1950d882
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 469dd61d12
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 67ba0076b3
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: bc1421f0dc
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: ef5174722c
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: f583aa05df
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 189a278edf
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 0974c1d479
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 7347ec23ba
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 899560be8e
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 515054cafa
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 201dd2fbbc
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 978997c1bd
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: c6d3972627
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 546ca961e4
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: e282bffb2b
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
zpzjzj
left a comment
There was a problem hiding this comment.
代码评审报告(自动)
| 项目 | 内容 |
|---|---|
| 评审人 | @zpzjzj (via Claude Code github-code-review skill) |
| 评审分支 | feat/custom-engine-local → main |
| HEAD | 8984f3a6 |
| 规模 | 27 files, +3925 / -31 |
问题汇总
- 🔴 Blocker: 1
- 🟠 Major: 14
- 🔵 Info: 11
整体评价
❌ 不建议合并 —— 存在 1 个 Blocker(output_file / input_file 路径穿越,可让框架在 clearStaleOutputFile 阶段对 host 任意文件 os.Remove)。需修复后才能进入下一轮评审。
CI 维度评审时 8 个 CI run / 3 个 CodeQL run 仍处 QUEUED,未给出失败信号;待 CI 完成后请复核。license/cla 已通过。
关键改进方向
- 安全护栏:完成
workspacePath的边界校验是合并的硬门槛;同时建议把containsAPIKey扩展为「所有已知凭据字符串」的 mask 器,覆盖 OTel span / stderr 日志 / 进程列表三条泄漏路径。 - transport 抽象前置:在 Phase 1 落地
customTransport接口,避免 http transport PR 大改CustomAgent主体;同时把sessionInput升级为公开契约SessionInput。 - OOM 收口:output_file 读取链路(custom.go:340)和
NoneRuntime.Exec的bytes.Buffer(runtime/none.go:196)都缺上限;并发评测时是真实风险,建议同步引入MaxStdout/StderrBytes。 - 架构债:
CustomAgent嵌入CLIAgent、builtin engine 常量在 config/agent 复制、ArtifactFile「exactly one of」无校验、Check永远返回 nil —— 趁 Phase 1 一起收掉成本最低。 - 用户向文档:README / README.zh / CHANGELOG / docs/guide/writing-evals.md / docs/index.md 完全没动,这能力对外发现度近 0。
已具备的防护(无需更改)
shellQuote单引号转义到位- 配置加载期 strict 模式拒绝在
local.command/args/cwd/input_file/output_file中引用 secret-name env 或${api_key}/${kwargs}聚合变量 - 迭代式 env 展开防止 wrapper-var 走私
containsAPIKey在 args/command 中拦截Cfg.APIKey字面量clearStaleOutputFile只在显式配置output_file时触发- http transport 在 validator 阶段被拒绝
- 进程组 + WaitDelay 让 timeout 能真正生效
行评论无法挂载的 Major(PR 未触碰原行)
- 🟠 OTel span 泄漏 + bash -c 进程列表暴露 —
internal/runtime/none.go:174,189。问题是已存在代码,但 CustomAgent 把"用户提供的任意命令/env"变成主要入口,secret 泄漏面被显著放大。建议把 OTelprocess.command字符串与 stderr 落日志(none.go:215)一并过 mask 处理。 - 🟠 NoneRuntime.Exec stdout/stderr 无上限 —
internal/runtime/none.go:196-205。同样旧代码,但 CustomAgent 让脚本完全用户可控,OOM 风险首次具备触发条件。
Major(已发行评论)
| # | 文件:行 | 问题 |
|---|---|---|
| 1 | custom.go:41 |
嵌入 CLIAgent 但 override 大多方法,脆弱继承 |
| 2 | custom.go:67 |
Check 永远返回 nil,validate 裸奔 |
| 3 | custom.go:79 |
transport 分发用 switch 而非策略接口 |
| 4 | custom.go:113 |
sessionInput 被双重 JSON 序列化 |
| 5 | custom.go:223 |
finishLocal 控制流分裂、参数臃肿 |
| 6 | custom.go:340 |
output_file 全量加载 + 多余落盘往返 |
| 7 | custom.go:588 |
sessionInput 应升级为公开契约 |
| 8 | agent.go:52 |
ArtifactFile 「exactly one of」无运行时校验 |
| 9 | validator.go:32 |
builtin engine 常量在 config / agent 重复 |
| 10 | custom.go:692 |
🔴 Blocker:workspacePath 路径穿越 |
Info 级建议(汇总)
代码质量:
containsAPIKey用裸子串匹配,短 API key 易误报;建议加最小长度门槛(custom.go:564)readRawResult中outputFile == ""是死分支(custom.go:341);DownloadFile失败 vsReadFile失败的produced语义不一致- 常量名
customResponseSessionJSON值是"session_result",命名不符(custom.go:25),建议改customResponseSessionResult - 模板解析(
${X} / ${X:-default} / ${X?msg})在custom.go:756-818与customengine.go:348-429几乎重复实现;建议抽internal/template子包
安全:
renderTemplate在 vars 命中但值为空时落回os.Getenv(custom.go:801),绕过 strict 模式;建议同样对命中名做isSensitiveEnvName校验或直接去掉 fallback- stderr 原文以 WARN 落日志(none.go:215);建议长度截断 + 凭据 mask,或仅 ExitCode≠0 时打印
custom.Envkey 未做危险变量过滤(custom.go:212);建议 deny-listLD_PRELOAD/DYLD_INSERT_LIBRARIES/BASH_ENV/PYTHONPATH
架构:
ResolveRunnerInitParams取整个EngineConfig过宽(credential/agent_init.go:74),内部只读 Model + Custom;建议改显式参数- http transport「not yet implemented」文案在
custom.go:83和validator.go:221双源;建议抽常量 - 设计文档第 96 行处建议补一句「未配置
output_file时结果取自 stdout,默认路径不被读取」
文档:
CHANGELOG.md完全未更新(Unreleased / Added 段应记 Custom Engine + local transport)README.md/README.zh.mdFeatures 区仍只列三个内置引擎docs/guide/writing-evals.mdSection 5 没有engine.custom示例 / linkdocs/index.mdfeatures 区未露出 Custom Engine- 设计文档字段表(
docs/design/custom-engine.md:196-200)未在custom.http.*行内标注「Phase 2; rejected today」 custom.http.method描述「First version only supports POST」措辞陈旧- 缺 quickstart / minimal 完整示例的用户向文档(仅
e2e/testdata/custom-engine/fixture,文档未引用) CustomEngineConfig字段级 godoc 较薄(schema.go:106-140)- 设计文档错误分类(728-740)未列对应 sentinel 常量名
说明:本评审通过 6 个并行子 Agent 完成代码质量、安全性、性能、CI、架构、文档六个维度的检查。如对某条结论有异议或想深入讨论,欢迎在对应行评论下回复。
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 1fb11ba4ce
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 406660c075
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: aba832cb71
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: de0a540919
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
Adds a Custom Engine mechanism so users can plug in their own agent
executors purely through `engine.custom` in `eval.yaml`, without
modifying `skill-up`. Previously any `engine.name` not matching a
built-in agent (`claude_code`, `codex`, `qodercli`) failed with
`unsupported agent`.
This is Phase 1 — the `local` transport: `skill-eval` runs a
user-provided command inside the current runtime via `runtime.Exec`,
hands it a standard `SessionInput`, and parses a standard
`SessionResult`. The `http` transport is fully designed (see
`docs/design/custom-engine.md`) and its config schema is
parsed/validated, but selecting it returns an explicit "not yet
implemented" error — HTTP lands in a follow-up PR.
Changes:
- docs: `docs/design/custom-engine.md` — full design (local + HTTP),
config schema, `SessionInput`/`SessionResult` contracts, env/template
variables, artifact rules.
- config: `CustomEngineConfig` schema (`engine.custom`); load-time
resolution of `${VAR}` / `${VAR:-default}` / `${VAR?msg}` env
references over the `engine.custom` tree (built-in template vars
left intact for run time); validation of transport and required
fields; strict mode rejects secret-shaped env names and credential
literals from reaching a command line.
- agent: `CustomAgent` — builds `SessionInput`, renders template
variables, executes the command via `runtime.Exec`, parses
`SessionResult` from stdout or `output_file`, supports
`session_result` and `text` response formats, and maps structured
artifacts back. `SessionArtifacts` gains a structured `files` field.
Path I/O is confined to the runtime workspace through `workspacePath`
which resolves symlinks on every call to close the TOCTOU window;
inline artifact size is capped per-file (50 MiB) and per-result
(200 MiB).
- factory: non-built-in engine names with an `engine.custom` block
dispatch to `CustomAgent`; otherwise the error is
`unsupported agent "x": missing engine.custom`. Built-in engine
names are listed in `internal/agentkind` as a single source of truth
shared by the agent factory and config validator.
- wiring: the custom config and per-case metadata (`case_id`,
`variant`, `max_turns`, timeout) are threaded through `credential`
→ `cli` → `runtime.ExecOptions` → `evaluator` into the agent.
`ResolveRunnerInitParams` now takes `config.EngineConfig` instead of
`config.ModelConfig` so it can carry the custom block.
- runtime/none: process-group + `WaitDelay` so a timed-out command's
descendants are reliably killed and a clean exit whose pipes
outlive the parent is not mis-reported as a failure.
Security hardening: API key and `engine.custom.env` values are masked
in `FinalMessage`, `Stderr`, and `Transcript` content before they
enter reports or judges; the rendered command line is rejected if it
embeds the configured API key; the custom timeout is clamped to the
outer case deadline so the value advertised to the agent matches the
real wall-clock budget.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…sport
Two narrow follow-ups surfaced by a fresh self-review against the
squashed PR. None are field-reported bugs; both are defense-in-depth
on the masking and API-misuse boundaries introduced by this PR.
1. internal/agent/custom.go — mask parsed.Engine / parsed.Model.
`parseSessionResult` was already passing the other engine-controlled
string fields (FinalMessage, Stderr, Transcript content via
maskTranscript) through maskAPIKey before they reach reports,
judges, and OTel span attributes. The Engine and Model fields were
the only ones taken verbatim:
Engine: firstNonEmpty(parsed.Engine, a.Name()),
Model: firstNonEmpty(parsed.Model, ...),
A custom engine that echoes a credential into its self-identification
fields (e.g. a misconfigured wrapper that builds the engine name
from a banner string, or a hostile engine returning
`Model: "anthropic/sk-realkey..."`) would have bypassed the masking
discipline applied elsewhere in the same struct literal. Route both
through maskAPIKey to close the gap.
2. internal/config/validator.go — doc Validator.ValidateAll's split
with ResolveCustomEngineConfig.
The split exists for a real reason (the final engine name is only
known after CLI overrides like --engine settle, and engine.custom
must only be enforced for non-built-in engines), but the function
name `ValidateAll` invites future API consumers to assume it
validates everything. Today cli/run.go and cli/validate.go both
call ResolveCustomEngineConfig after ValidateAll; document that
contract on ValidateAll itself so a new caller does not silently
skip engine.custom validation.
A third self-review finding (re-order ResolveCustomEngineConfig so a
`transport: http` config fails fast with "not yet implemented" rather
than a confusing env-missing diagnostic on http.url) was deliberately
not applied: removing the resolveHTTPEnv call site leaves the
function unused (lint failure under revive's strict rules), and
keeping it alive with //nolint trades one minor UX wart for code
churn that Phase 2 will need to undo. Leaving as documented Info.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
9e9ccfa to
989c115
Compare
# Conflicts: # docs/guide/writing-evals.md # internal/agent/agent.go # internal/agent/factory.go # internal/config/schema.go
Summary
Adds a Custom Engine mechanism so users can plug in their own agent executors purely through
engine.customineval.yaml, without modifyingskill-up. Previously anyengine.namenot matching a built-in agent (claude_code,codex,qodercli) failed withunsupported agent.This is Phase 1 — the
localtransport:skill-evalruns a user-provided command inside the current runtime viaruntime.Exec, hands it a standardSessionInput, and parses a standardSessionResult. Thehttptransport is fully designed (seedocs/design/custom-engine.md) and its config schema is parsed/validated, but selecting it returns an explicit "not yet implemented" error — HTTP lands in a follow-up PR.Related issues
Changes
docs/design/custom-engine.md— full design (local + HTTP), config schema,SessionInput/SessionResultcontracts, env/template variables, artifact rules.CustomEngineConfigschema (engine.custom); load-time resolution of${VAR}/${VAR:-default}/${VAR?msg}env references over theengine.customtree (built-in template vars left intact for run time); validation of transport and required fields.CustomAgent— buildsSessionInput, renders template variables, executes the command viaruntime.Exec, parsesSessionResultfrom stdout oroutput_file, supportssession_resultandtextresponse formats, and maps structured artifacts back.SessionArtifactsgains a structuredfilesfield.engine.customblock dispatch toCustomAgent; otherwise the error isunsupported agent "x": missing engine.custom.case_id,variant,max_turns, timeout) are threaded throughcredential→cli→runtime.ExecOptions→evaluatorinto the agent.ResolveRunnerInitParamsnow takesconfig.EngineConfiginstead ofconfig.ModelConfigso it can carry the custom block.Test plan
make testpasses (go test ./...— all 13 packages green)make verifypasses (fmt + vet +golangci-lint— 0 issues)SessionResult) throughskill-eval run— the case executed and passed,final_messageflowed into the report.${VAR}env reference fails at config load with a clear error.output_file/text/non-zero-exit/unparseable/missing-exit_code, HTTP not-implemented, template rendering, and factory dispatch.Notes for reviewers
internal/config/validator.go(with an inline comment) to avoid aconfig→agentimport cycle, sinceagentnow importsconfigforCustomEngineConfig.transport: httpis intentionally a no-op stub in this PR — config is parsed and validated, butRunreturns "not yet implemented".TestResolveJudgeInitParams_FallsBackToRunnerBaseURLBeforeCredentialFallbackis environment-sensitive (fails whenANTHROPIC_BASE_URLis set in the ambient env); unrelated to this change.🤖 Generated with Claude Code