Releases: justi/ruby_llm-contract
0.8.0 — Contracts + Evals for RubyLLM
Narrative repositioning + small API additions. Internal architecture unchanged: no Step::Base refactor, no breaking changes to existing DSL.
Added
- `thinking(effort:, budget:)` class macro on `Step::Base` — mirrors the `RubyLLM::Agent.thinking` signature exactly. Stored as a `{ effort:, budget: }` hash; the reader returns the hash; supports `:default` reset semantics and superclass inheritance like `model`/`temperature`. The convenience alias `reasoning_effort(:low)` is implemented as `thinking(effort: :low)` — single normalized state, not a separate ivar.
- Adapter wiring for `with_thinking` — when `thinking` is set on the Step class, OR when `reasoning_effort:` is passed through context, OR when an attempt config in `retry_policy escalate(...)` carries `reasoning_effort:`, the RubyLLM adapter resolves the effective `{ effort:, budget: }` hash and forwards it via `chat.with_thinking(**)` — provider-agnostic (supports OpenAI `reasoning_effort` AND Anthropic extended-thinking budget). Precedence: per-attempt / context `reasoning_effort` overrides class-level `thinking[:effort]`; budget is taken from class-level `thinking[:budget]`. Behavioural change vs 0.7.x: `reasoning_effort` is now forwarded via `with_thinking` instead of `with_params`. Same wire-level OpenAI parameter; provider-agnostic Anthropic support is now automatic.
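The precedence rules above can be pictured with a stripped-down resolver — a hypothetical sketch, not the adapter's actual code; `resolve_thinking` and its keyword arguments are illustrative names:

```ruby
# Hypothetical re-implementation of the precedence rule: per-attempt or
# context effort wins over the class-level thinking[:effort]; budget only
# ever comes from the class-level thinking[:budget].
def resolve_thinking(class_thinking, context_effort: nil, attempt_effort: nil)
  effort = attempt_effort || context_effort || class_thinking&.dig(:effort)
  budget = class_thinking&.dig(:budget)
  { effort: effort, budget: budget }.compact
end

resolve_thinking({ effort: :low, budget: 2_048 }, attempt_effort: :high)
# => { effort: :high, budget: 2048 }
resolve_thinking({ effort: :low }, context_effort: :medium)
# => { effort: :medium }
```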
Dependencies
- `ruby_llm` constraint bumped from `~> 1.0` to `~> 1.12` — `Chat#with_thinking` is the canonical path for reasoning effort + extended thinking; it shipped in RubyLLM 1.12. Adopters on `ruby_llm < 1.12` need to bump RubyLLM before upgrading this gem to 0.8.0.
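In Gemfile terms (assuming the gem is declared as `ruby_llm-contract`; adjust to your setup), the bump looks like:

```ruby
# Gemfile — ruby_llm >= 1.12 is required for Chat#with_thinking
gem "ruby_llm", "~> 1.12"
gem "ruby_llm-contract", "~> 0.8.0"
```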
Changed
- Tagline + README opening — repositioned around "Contracts + Evals for RubyLLM". New "Relation to RubyLLM::Agent" section explicitly frames Step as a sibling abstraction (same niche as Agent, wider contract), not an alternative or foundation. The README does not claim "Step uses Agent under the hood" — the current call path is `Step → Runner → Adapters::RubyLLM → RubyLLM.chat` directly.
- `TokenEstimator` documented as heuristic — module docstring expanded with explicit "±30% accuracy" framing. Refusal messages from `LimitChecker` now include a `(heuristic ±30%)` suffix so adopters know the pre-flight number is estimated, not measured. RubyLLM 1.14 also has no pre-flight tokenizer; `RubyLLM::Tokens` is post-hoc only.
- `CostCalculator` repositioned in docs — module narrative reframed from "cost calculator" to "fine-tune pricing registry + lookup with fallback chain". Math methods (`compute_cost`, `token_cost`, etc.) were already private; this release makes the docs match. Public API surface unchanged: `register_model`, `unregister_model`, `reset_custom_models!`, `calculate`.
- `output_schema` reframed in docs — described as a "wrapper around `RubyLLM::Schema` + client-side validation step", not a standalone feature. The schema language is identical to what `RubyLLM::Agent.schema` accepts; the difference is what wraps it.
- README retry framing — `retry_policy escalate(...)` (model escalation on validation failure) is the marketed default. `retry_policy attempts: N` (same-model retry) stays in the API for backward compat and niche cases (subjective criteria, multi-step pipelines, weaker models) but is no longer marketed as a recommended default. Empirical basis: four small experiments across PDF quiz generation, GSM8K math (n=30 + n=120), and multi-constraint schedule generation found no useful lift for nano-class models on tasks with clear correctness criteria.
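The escalate-over-attempts preference can be illustrated with a stripped-down loop — purely a conceptual sketch, not the gem's Runner; `run_with_escalation` and the model names are illustrative:

```ruby
# Conceptual sketch: try each model in order, stopping at the first output
# that passes validation. `attempt` stands in for one LLM call + validate.
def run_with_escalation(models, &attempt)
  last = nil
  models.each do |model|
    last = attempt.call(model)
    return last if last[:valid]
  end
  last # all models exhausted; caller sees the final failed result
end

result = run_with_escalation(["gpt-4.1-nano", "gpt-4.1-mini"]) do |model|
  # pretend the nano model fails validation and the mini model passes
  { model: model, valid: model.include?("mini") }
end
result  # => { model: "gpt-4.1-mini", valid: true }
```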
Documentation
- New disambiguation paragraphs in `prompt_ast.md` (`Step.input_type` vs `RubyLLM::Agent.inputs`; the `Prompt::Builder` multi-role DSL vs Agent's ERB single-string template loader), `testing.md` (`Step.observe` vs `Chat#on_end_message`/`on_tool_call`), `output_schema.md` (relation to `Agent.schema`), and `optimizing_retry_policy.md` (orthogonal model + thinking dimensions).
- `getting_started.md` refusal message example updated to include the new `(heuristic ±30%)` suffix.
Issues closed
- #11 (Optimizer is blind to same-model attempts) — closed after empirical experiments. `attempts: N` retry stays in the API; not marketed as a default.
- #6 (Production cost reporting) — already implemented in 0.7.x; close confirmed.
Not in this release (deferred)
- `output_schema` Proc form for runtime-input-aware schemas (parity with the `Agent.schema` Proc form). Additive, low-risk; deferred to 0.9 to keep the 0.8 scope tight.
- H4 (Step composing `RubyLLM::Agent` internally as a config holder) — verified feasible, but ROI insufficient for the current adopter base; trigger-based revisit, no calendar commitment.
0.7.3 (2026-04-24) — Adoption-friction release (docs + examples consolidation)
Adoption-friction release. No runtime behavior changes — every delta is in docs/, examples/, or spec/integration/ (plus the version.rb / Gemfile.lock bumps). Upgrading from 0.7.2 picks up the expanded guide set, the new runnable showcases, and one extra integration spec.
Documentation
- New guide: `docs/guide/why.md` — four production failure modes the gem exists for (schema-valid but logically wrong output, silent prompt regression, sampling variance on fixed-temperature models, runaway cost). Opens from a concrete incident each time; designed for readers who have not yet felt the pain the gem relieves.
- New guide: `docs/guide/rails_integration.md` — seven Rails-specific FAQs with runnable snippets: where step classes live (`app/contracts/`), initializer setup, background jobs, `around_call` observability, RSpec/Minitest stubs, error handling in controllers, CI gate wiring.
- README adoption-friction pass — added a short "Do I need this?" block after Install, a reading-order hint (README → why.md → getting_started.md), and outcome-based labels in the docs index ("Prevent silent prompt regressions" instead of "Eval-First", etc.).
- TL;DR box at the top of every guide — single-sentence orientation for readers who land via search; a "Skip if" clause added where real confusion exists (`eval_first.md`, `testing.md`, `migration.md`).
- API coverage gaps closed — `estimate_cost`/`estimate_eval_cost`, `max_cost` with `on_unknown_pricing: :warn`, `run_eval(..., concurrency:)`, and `around_call` testing patterns now documented in `getting_started.md`, `eval_first.md`, `testing.md`.
- Industry-standard terminology — `temperature-locked` → `fixed-temperature`, `variance-induced` → `sampling variance`, `severity signals` → `severity keywords`, `takeaway drift` → `tone/takeaways mismatch`.
- `docs/architecture.md` refresh — diagram now reflects the current class layout: added `Step::RetryPolicy`, `Pipeline::Result`, `Eval::AggregatedReport`, `Eval::BaselineDiff`, `Eval::PromptDiffComparator`, `Eval::EvalHistory`, `Eval::RetryOptimizer`, `OptimizeRakeTask`. Replaced the outdated `Eval::TraitEvaluator` entry with `Eval::ExpectationEvaluator`.
- Business framing added to guides — every guide opens with a concrete production scenario or "why it matters" hook before the API reference.
Examples — consolidated on SummarizeArticle, renumbered 00-06
The previous 12-file set mixed a private Reddit promo planner, customer support, meetings, keyword extraction, and translation. The new set is seven runnable files, each answering one adopter question on the README's SummarizeArticle case.
| # | File | Answers |
|---|---|---|
| 00 | `00_basics.rb` | How do I start? (seven incremental layers + real-LLM pointer) |
| 01 | `01_fallback_showcase.rb` | Show me the gem in 30 seconds (zero API keys) |
| 02 | `02_real_llm_minimal.rb` | How do I plug in a real LLM? (~30 lines) |
| 03 | `03_summarize_with_keywords.rb` | How does the contract evolve? (growing prompt) |
| 04 | `04_summarize_and_translate.rb` | Pipeline composition + pipeline-level `run_eval` |
| 05 | `05_eval_dataset.rb` | How do I stop silent prompt regressions? |
| 06 | `06_retry_variants.rb` | `attempts: 3`, `reasoning_effort` escalation, cross-provider (Ollama → Anthropic → OpenAI) |
Every file carries an "Expected output" block in its header so readers see the result without running the script. The `docs/ideas/` directory is now fully untracked (already in `.gitignore`; one stray file removed from version control).
Examples — bug fixes carried along
- Schema pitfall fixed in 5 files — `array :x do; string :y; ...; end` silently produces `items: string` and drops every declaration after the first, matching the documented pitfall in `spec/ruby_llm/contract/nested_schema_spec.rb:71`. Every affected array block is now wrapped in `object do ... end`.
- `examples/05_eval_dataset.rb` (pre-renumber: `09_eval_dataset.rb`): `result[:passed]` → `result.passed?` — the previous code called `[]` on an `Eval::CaseResult` and raised `NoMethodError` at runtime.
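The pitfall and its fix, sketched as schema-DSL fragments (field names are illustrative; these are fragments, not runnable on their own):

```ruby
# Pitfall — inside `array`, bare field declarations silently collapse to
# `items: string` and everything after the first declaration is dropped:
array :keywords do
  string :word
  integer :count   # silently lost
end

# Fix — wrap the item shape in an explicit object block:
array :keywords do
  object do
    string :word
    integer :count
  end
end
```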
Testing
- New `spec/integration/pipeline_eval_spec.rb` — three cases guaranteeing pipeline-level `run_eval` stays functional: happy path, final-step mismatch, and fail-fast propagation when an intermediate `validate` rejects. Closes the "09 STEP 5 pipeline evaluation" known issue flagged in the 0.7.2 release. The fail-fast case asserts `step_status == :validation_failed` and the validate's label in `details`, so a regression that short-circuits on schema instead of validate would fail loudly.
Deleted (private-project cleanup)
- `examples/01_classify_threads.rb`, `02_generate_comment.rb`, `03_target_audience.rb`, `10_reddit_full_showcase.rb`, `spec/integration/reddit_pipeline_spec.rb` — the Reddit Promo Planner was a separate private project; its examples do not belong in the gem's public repo.
- `examples/02_output_schema.rb` — fully covered by `docs/guide/output_schema.md`; deleting it avoids duplication.
0.7.1 — Narrow `run_once` `ArgumentError` rescue
Behavioral change (follow-up to v0.7.0)
Closes the known limitation called out in the v0.7.0 CHANGELOG.
Before: `Step::Base#run_once` wrapped the entire Runner chain in `rescue ArgumentError` to convert DSL misconfiguration (e.g. `prompt has not been set`) into `:input_error`. Side effect: any `ArgumentError` raised from adapter code during `Runner#call` — wrong arity, bad config arg, any programmer bug — was silently coerced into `:input_error` and re-tried as if the user had supplied bad input.
After: the rescue is scoped to the Runner-construction phase only. DSL configuration errors still produce `:input_error` (the `prompt has not been set` case is regression-tested). `ArgumentError` raised during `Runner#call` propagates to the caller.
Input-type validation failures continue to produce `:input_error` via `InputValidator`'s own scoped rescue (`Dry::Types::CoercionError`, `TypeError`, `ArgumentError` around the type-check boundary) — unchanged.
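The scoping change can be sketched with stand-in lambdas — `build_runner` and `adapter_call` are hypothetical placeholders, not the gem's internals:

```ruby
# Before: one rescue around the whole chain, so an adapter bug raising
# ArgumentError was indistinguishable from DSL misconfiguration.
def run_once_before(build_runner, adapter_call)
  build_runner.call        # DSL misconfiguration raises ArgumentError here...
  adapter_call.call        # ...but an adapter bug raising ArgumentError here
rescue ArgumentError       # was also swallowed into :input_error
  :input_error
end

# After: only the construction phase is rescued; adapter-phase
# ArgumentError propagates to the caller.
def run_once_after(build_runner, adapter_call)
  begin
    build_runner.call
  rescue ArgumentError
    return :input_error
  end
  adapter_call.call
end

ok          = -> {}
dsl_bug     = -> { raise ArgumentError, "prompt has not been set" }
adapter_bug = -> { raise ArgumentError, "wrong arity in adapter code" }

run_once_after(dsl_bug, ok)       # => :input_error (unchanged)
run_once_before(ok, adapter_bug)  # => :input_error (the old backdoor)
# run_once_after(ok, adapter_bug) now raises ArgumentError instead
```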
Why it matters
v0.7.0's narrative was "programmer errors propagate, provider errors become `:adapter_error`". `AdapterCaller` already respected that (narrowed to `RubyLLM::Error` + `Faraday::Error`). But `run_once`'s broader `rescue ArgumentError` was a backdoor that let adapter-code `ArgumentError` bugs slip back into `:input_error` and become retry targets.
This release closes that backdoor. Programmer bugs raised during an adapter call now surface loudly instead of being disguised as "user gave bad input".
Compatibility
Technically a behavioral change — callers previously relying on adapter-code `ArgumentError` to produce `:input_error` results will now see the exception propagate. If your adapter deliberately raises `ArgumentError` for expected validation flows, wrap it in `RubyLLM::Error` (becomes `:adapter_error`, respected by retry) or add explicit handling at the call site.
Test plan
- `bundle exec rspec` — 1341 examples, 0 failures, 8 pending (all pending are API-key-gated live LLM tests).
- New regression specs in `retry_integration_spec.rb`: "propagates ArgumentError from adapter code (programmer bug, not bad input)" — an adapter raising `ArgumentError` now propagates; "still converts DSL misconfiguration ArgumentError to :input_error (prompt missing)" — `prompt has not been set` still becomes `:input_error`.
- Existing BUG 48 adversarial spec (step without prompt → `:input_error`) continues to pass.
0.7.0 — Remove `:adapter_error` default retry, narrow `AdapterCaller` rescue
Breaking changes
Both changes target redundancy between ruby_llm-contract and upstream `ruby_llm` 1.14.x.
1. `:adapter_error` removed from `DEFAULT_RETRY_ON`
New default: `[:validation_failed, :parse_error]`.
`ruby_llm`'s Faraday middleware already retries transport errors (`RateLimitError`, `ServerError`, `ServiceUnavailableError`, `OverloadedError`, timeouts) with backoff. Retrying on `:adapter_error` against the same model re-ran what transport had already retried — retry × retry with no change in context.
`:adapter_error` remains available as explicit opt-in. It is meaningful primarily paired with `escalate "model_a", "model_b"` — a different model/provider can bypass what transport retry could not.
2. `AdapterCaller` narrows rescue from `StandardError` to `RubyLLM::Error` + `Faraday::Error`
Provider errors (the `RubyLLM::Error` hierarchy) and transport errors that escape `ruby_llm`'s Faraday retry middleware after exhaustion (`Faraday::TimeoutError`, `Faraday::ConnectionFailed`) still produce `:adapter_error` as before.
Programmer errors that are neither — `NoMethodError`, adapter-code bugs — now propagate instead of being silently converted to `:adapter_error` and retried. Bugs should be fixed, not retried.
Known limitation: adapter code raising `ArgumentError` is still coerced into `:input_error` by `Step::Base#run_once` (which rescues `ArgumentError` for input-type validation). Disambiguating adapter `ArgumentError` vs input-validation `ArgumentError` requires a `run_once` refactor; tracked as a follow-up.
Migration
Restore pre-0.7 behavior:
```ruby
retry_policy do
  attempts 3
  retry_on :validation_failed, :parse_error, :adapter_error
end
```

Preferred — pair with a model fallback chain:

```ruby
retry_policy do
  escalate "gpt-4.1-nano", "gpt-4.1-mini"
  retry_on :validation_failed, :parse_error, :adapter_error
end
```

Why the narrative matters
Post-0.7, `DEFAULT_RETRY_ON = [:validation_failed, :parse_error]` reads cleanly as the gem's core value proposition: retry in ruby_llm-contract is against LLM output variance (malformed JSON, business-rule violations), not against transport or infrastructure. Transport concerns live in `ruby_llm`/Faraday where they belong; programmer bugs propagate for quick detection.
v0.6.4 — `production_mode`: retry-aware cost
Highlights
- `production_mode: { fallback: "..." }` on `compare_models`/`optimize_retry_policy` — measures retry-aware, end-to-end cost per successful output. Each candidate runs with a runtime-injected `[candidate, fallback]` retry chain.
- New metrics: `escalation_rate`, `single_shot_cost`, `effective_cost`, `single_shot_latency_ms`, `effective_latency_ms`, `latency_percentiles` — on both `Report` and `AggregatedReport` (averaged across `runs:`).
- Extended `ModelComparison#table`: Chain / single-shot / escalation / effective cost / latency / score. Edge case `candidate == fallback` → em-dash (not `0%`), retry injection skipped so `effective == single_shot` by construction.
- `context[:retry_policy_override]` — new context key for transient per-call retry-policy overrides without mutating the step class.
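One plausible way such a retry-aware figure could be combined — illustrative arithmetic only, not the gem's actual formula:

```ruby
# With a [candidate, fallback] chain, a retry-aware cost per successful
# output might add the fallback's cost weighted by how often escalation
# actually happens. Purely a sketch; the gem measures this end-to-end.
def effective_cost(single_shot_cost, fallback_cost, escalation_rate)
  single_shot_cost + escalation_rate * fallback_cost
end

effective_cost(0.0010, 0.0040, 0.25)  # ≈ 0.002
```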
Scope
- Single-fallback (2-tier) chains only.
- Step-only: raises `ArgumentError` if used on `Pipeline::Base` subclasses (pipeline-wide fallback semantics are a separate design question).