
Releases: justi/ruby_llm-contract

0.8.0 — Contracts + Evals for RubyLLM

26 Apr 16:47
3ada33a


Narrative repositioning + small API additions. Internal architecture unchanged: no Step::Base refactor, no breaking changes to existing DSL.

Added

  • thinking(effort:, budget:) class macro on Step::Base — mirrors RubyLLM::Agent.thinking signature exactly. Stored as { effort:, budget: } hash; reader returns the hash; supports :default reset semantics; superclass inheritance like model/temperature. The convenience alias reasoning_effort(:low) is implemented as thinking(effort: :low) — single normalized state, not separate ivar.
  • Adapter wiring for with_thinking — when thinking is set on the Step class, OR when reasoning_effort: is passed through context, OR when an attempt config in retry_policy escalate(...) carries reasoning_effort:, the RubyLLM adapter resolves the effective { effort:, budget: } hash and forwards it via chat.with_thinking(**) — provider-agnostic (supports OpenAI reasoning_effort AND Anthropic extended-thinking budget). Precedence: per-attempt / context reasoning_effort overrides class-level thinking[:effort]; budget is taken from class-level thinking[:budget]. Behavioural change vs 0.7.x: reasoning_effort is now forwarded via with_thinking instead of with_params. Same wire-level OpenAI parameter; provider-agnostic Anthropic support is now automatic.
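Read together, the class macro and the adapter wiring resolve a single effective hash. A minimal stand-alone model of the stated precedence (illustrative only; `resolve_thinking` is not the gem's API, and the per-attempt/context ordering follows my reading of the notes above):

```ruby
# Illustrative model of the precedence described above, not the gem's code:
# per-attempt / context reasoning_effort wins over class-level thinking[:effort];
# budget only ever comes from class-level thinking[:budget].
def resolve_thinking(class_thinking, context_effort: nil, attempt_effort: nil)
  effort = attempt_effort || context_effort || class_thinking&.dig(:effort)
  budget = class_thinking&.dig(:budget)
  { effort: effort, budget: budget }.compact
end

resolve_thinking({ effort: :medium, budget: 2048 }, context_effort: :high)
# => { effort: :high, budget: 2048 }
```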

Dependencies

  • ruby_llm constraint bumped from ~> 1.0 to ~> 1.12 — Chat#with_thinking is the canonical path for reasoning effort + extended thinking; it shipped in RubyLLM 1.12. Adopters on ruby_llm < 1.12 need to bump RubyLLM before upgrading this gem to 0.8.0.

Changed

  • Tagline + README opening — repositioned around "Contracts + Evals for RubyLLM". New "Relation to RubyLLM::Agent" section explicitly frames Step as a sibling abstraction (same niche as Agent, wider contract), not an alternative or foundation. README does not claim "Step uses Agent under the hood" — current call path is Step → Runner → Adapters::RubyLLM → RubyLLM.chat directly.
  • TokenEstimator documented as heuristic — module docstring expanded with explicit "±30% accuracy" framing. Refusal messages from LimitChecker now include (heuristic ±30%) suffix so adopters know the pre-flight number is estimated, not measured. RubyLLM 1.14 also has no pre-flight tokenizer; RubyLLM::Tokens is post-hoc only.
  • CostCalculator repositioned in docs — module narrative reframed from "cost calculator" to "fine-tune pricing registry + lookup with fallback chain". Math methods (compute_cost, token_cost, etc.) were already private; this release makes the docs match. Public API surface unchanged: register_model, unregister_model, reset_custom_models!, calculate.
  • output_schema reframed in docs — described as "wrapper around RubyLLM::Schema + client-side validation step", not a standalone feature. The schema language is identical to what RubyLLM::Agent.schema accepts; the difference is what wraps it.
  • README retry framing — retry_policy escalate(...) (model escalation on validation failure) is the marketed default. retry_policy attempts: N (same-model retry) stays in the API for backward compat and niche cases (subjective criteria, multi-step pipelines, weaker models) but is no longer marketed as a recommended default. Empirical basis: four small experiments across PDF quiz generation, GSM8K math (n=30 + n=120), and multi-constraint schedule generation found no useful lift for nano-class models on tasks with clear correctness criteria.
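The marketed default can be pictured with a small stand-alone sketch (`escalate_through` is an illustrative name, not the gem's internals): one attempt per model in the chain, stopping at the first output that passes validation.

```ruby
# Toy model of escalate(...) semantics: one attempt per model, first valid wins.
def escalate_through(models, validator)
  models.each do |model|
    output = yield(model)
    return { model: model, output: output } if validator.call(output)
  end
  nil # the whole chain failed validation
end

result = escalate_through(%w[gpt-4.1-nano gpt-4.1-mini], ->(o) { o.size > 10 }) do |model|
  model == "gpt-4.1-nano" ? "too short" : "a sufficiently long answer"
end
result[:model] # => "gpt-4.1-mini"
```

The point of the reframing: escalation changes the context (a stronger model), whereas attempts: N re-rolls the same dice.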

Documentation

  • New disambiguation paragraphs in prompt_ast.md (Step.input_type vs RubyLLM::Agent.inputs; Prompt::Builder multi-role DSL vs Agent ERB single-string template loader), testing.md (Step.observe vs Chat#on_end_message / on_tool_call), output_schema.md (relation to Agent.schema), and optimizing_retry_policy.md (orthogonal model + thinking dimensions).
  • getting_started.md refusal message example updated to include the new (heuristic ±30%) suffix.
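For orientation, the classic chars/4 rule of thumb illustrates why a pre-flight estimate carries roughly ±30% error; this is a generic heuristic, not necessarily TokenEstimator's exact formula.

```ruby
# Rough pre-flight token estimate (rule of thumb: ~4 characters per token for
# English text). Real tokenizers routinely deviate by tens of percent from
# this, hence the "(heuristic ±30%)" suffix on refusal messages.
def rough_token_estimate(text)
  (text.length / 4.0).ceil
end

rough_token_estimate("Summarize the article below in three sentences.")
```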

Issues closed

  • #11 (Optimizer is blind to same-model attempts) — closed after empirical experiments. attempts: N retry stays in API; not marketed as a default.
  • #6 (Production cost reporting) — already implemented in 0.7.x; close confirmed.

Not in this release (deferred)

  • output_schema Proc form for runtime-input-aware schemas (parity with Agent.schema Proc form). Additive, low-risk; deferred to 0.9 to keep 0.8 scope tight.
  • H4 (Step composing RubyLLM::Agent internally as config holder) — verified feasible but ROI insufficient for current adopter base; trigger-based revisit, no calendar commitment.

0.7.3 — Adoption-friction release (docs + examples consolidation)

24 Apr 04:59
c101cfa


0.7.3 (2026-04-24)

Adoption-friction release. No runtime behavior changes — every delta is in docs/, examples/, or spec/integration/ (plus the version.rb / Gemfile.lock bumps). Upgrading from 0.7.2 picks up the expanded guide set, the new runnable showcases, and one extra integration spec.

Documentation

  • New guide: docs/guide/why.md — four production failure modes the gem exists for (schema-valid logically wrong, silent prompt regression, sampling variance on fixed-temperature models, runaway cost). Opens from a concrete incident each time; designed for readers who have not yet felt the pain the gem relieves.
  • New guide: docs/guide/rails_integration.md — seven Rails-specific FAQs with runnable snippets: where step classes live (app/contracts/), initializer setup, background jobs, around_call observability, RSpec/Minitest stubs, error handling in controllers, CI gate wiring.
  • README adoption-friction pass — added a short "Do I need this?" block after Install, a reading-order hint (README → why.md → getting_started.md), and outcome-based labels in the docs index ("Prevent silent prompt regressions" instead of "Eval-First", etc.).
  • TL;DR box at the top of every guide — single-sentence orientation for readers who land via search; "Skip if" clause added where real confusion exists (eval_first.md, testing.md, migration.md).
  • API coverage gaps closed — estimate_cost / estimate_eval_cost, max_cost, on_unknown_pricing: :warn, run_eval(..., concurrency:), and around_call testing patterns are now documented in getting_started.md, eval_first.md, testing.md.
  • Industry-standard terminology — "temperature-locked" → "fixed-temperature", "variance-induced" → "sampling variance", "severity signals" → "severity keywords", "takeaway drift" → "tone/takeaways mismatch".
  • docs/architecture.md refresh — diagram now reflects the current class layout: added Step::RetryPolicy, Pipeline::Result, Eval::AggregatedReport, Eval::BaselineDiff, Eval::PromptDiffComparator, Eval::EvalHistory, Eval::RetryOptimizer, OptimizeRakeTask. Replaced the outdated Eval::TraitEvaluator entry with Eval::ExpectationEvaluator.
  • Business framing added to guides — every guide opens with a concrete production scenario or "why it matters" hook before the API reference.

Examples — consolidated on SummarizeArticle, renumbered 00-06

The previous 12-file set mixed a private Reddit promo planner, customer support, meetings, keyword extraction, and translation. The new set is seven runnable files, each answering one adopter question on the README's SummarizeArticle case.

| # | File | Answers |
|---|------|---------|
| 00 | 00_basics.rb | How do I start? (seven incremental layers + real-LLM pointer) |
| 01 | 01_fallback_showcase.rb | Show me the gem in 30 seconds (zero API keys) |
| 02 | 02_real_llm_minimal.rb | How do I plug in a real LLM? (~30 lines) |
| 03 | 03_summarize_with_keywords.rb | How does the contract evolve? (growing prompt) |
| 04 | 04_summarize_and_translate.rb | Pipeline composition + pipeline-level run_eval |
| 05 | 05_eval_dataset.rb | How do I stop silent prompt regressions? |
| 06 | 06_retry_variants.rb | attempts: 3, reasoning_effort escalation, cross-provider (Ollama → Anthropic → OpenAI) |

Every file carries an "Expected output" block in its header so readers see the result without running the script. The docs/ideas/ directory is now fully untracked (already in .gitignore; one stray file removed from version control).

Examples — bug fixes carried along

  • Schema pitfall fixed in 5 files — array :x do; string :y; ...; end silently produces items: string and drops every declaration after the first, matching the documented pitfall in spec/ruby_llm/contract/nested_schema_spec.rb:71. Every affected array block is now wrapped in object do...end.
  • examples/05_eval_dataset.rb (pre-renumber: 09_eval_dataset.rb) — result[:passed] → result.passed?; the previous code called [] on an Eval::CaseResult and raised NoMethodError at runtime.
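In the schema DSL the pitfall and its fix look like this (the second field, string :z, is added here purely to illustrate the dropped-declaration symptom):

```ruby
# Buggy form: produces items: string and silently drops every
# declaration after the first.
array :x do
  string :y
  string :z # lost
end

# Fixed form now used in the examples: item fields wrapped in object do...end.
array :x do
  object do
    string :y
    string :z
  end
end
```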

Testing

  • New spec/integration/pipeline_eval_spec.rb — three cases guaranteeing pipeline-level run_eval stays functional: happy path, final-step mismatch, and fail-fast propagation when an intermediate validate rejects. Closes the "09 STEP 5 pipeline evaluation" known issue flagged in the 0.7.2 release. The fail-fast case asserts step_status == :validation_failed and the validate's label in details, so a regression that short-circuits on schema instead of validate would fail loudly.

Deleted (private-project cleanup)

  • examples/01_classify_threads.rb, 02_generate_comment.rb, 03_target_audience.rb, 10_reddit_full_showcase.rb, spec/integration/reddit_pipeline_spec.rb — Reddit Promo Planner was a separate private project; its examples do not belong in the gem's public repo.
  • examples/02_output_schema.rb — fully covered by docs/guide/output_schema.md; deleting avoids duplication.

0.7.1 — Narrow run_once ArgumentError rescue

22 Apr 09:13
3fe3c86


Behavioral change (follow-up to v0.7.0)

Closes the known limitation called out in the v0.7.0 CHANGELOG.

Before: Step::Base#run_once wrapped the entire Runner chain in rescue ArgumentError to convert DSL misconfiguration (e.g. prompt has not been set) into :input_error. Side effect: any ArgumentError raised from adapter code during Runner#call — wrong arity, bad config arg, any programmer bug — was silently coerced into :input_error and re-tried as if the user had supplied bad input.

After: the rescue is scoped to the Runner-construction phase only. DSL configuration errors still produce :input_error (the prompt has not been set case is regression-tested). ArgumentError raised during Runner#call propagates to the caller.

Input-type validation failures continue to produce :input_error via InputValidator's own scoped rescue (Dry::Types::CoercionError, TypeError, ArgumentError around the type-check boundary) — unchanged.

Why it matters

v0.7.0's narrative was "programmer errors propagate, provider errors become :adapter_error". AdapterCaller already respected that (narrowed to RubyLLM::Error + Faraday::Error). But run_once's broader rescue ArgumentError was a backdoor that let adapter-code ArgumentError bugs slip back into :input_error and become retry targets.

This release closes that backdoor. Programmer bugs raised during an adapter call now surface loudly instead of being disguised as "user gave bad input".

Compatibility

Technically a behavioral change — callers previously relying on adapter-code ArgumentError to produce :input_error results will now see the exception propagate. If your adapter deliberately raises ArgumentError for expected validation flows, wrap that in RubyLLM::Error (becomes :adapter_error, respected by retry) or add explicit handling at the call site.
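A self-contained sketch of that migration advice (the RubyLLM::Error stand-in is defined locally so the snippet runs on its own; in real code it comes from ruby_llm, and adapter_call is an illustrative name):

```ruby
# Local stand-in so the sketch is runnable; real adapters use
# ruby_llm's RubyLLM::Error hierarchy.
module RubyLLM; class Error < StandardError; end; end

def adapter_call(config)
  # Deliberate, expected validation failure inside adapter code:
  raise ArgumentError, "model is required" unless config[:model]
  :ok
rescue ArgumentError => e
  # Re-raise as a provider error so the step records :adapter_error
  # (respected by retry) instead of letting ArgumentError propagate.
  raise RubyLLM::Error, e.message
end

begin
  adapter_call({})
rescue RubyLLM::Error => e
  e.message # => "model is required"
end
```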

Test plan

  • bundle exec rspec — 1341 examples, 0 failures, 8 pending (all pending are API-key-gated live LLM tests).
  • New regression specs in retry_integration_spec.rb:
    • propagates ArgumentError from adapter code (programmer bug, not bad input) — adapter raising ArgumentError now propagates.
    • still converts DSL misconfiguration ArgumentError to :input_error (prompt missing) — prompt has not been set still becomes :input_error.
  • Existing BUG 48 adversarial spec (step without prompt → :input_error) continues to pass.

0.7.0 — Remove :adapter_error default retry, narrow AdapterCaller rescue

21 Apr 13:53
0d6ed4b


Breaking changes

Both changes target redundancy between ruby_llm-contract and upstream ruby_llm 1.14.x.

1. :adapter_error removed from DEFAULT_RETRY_ON

New default: [:validation_failed, :parse_error].

ruby_llm's Faraday middleware already retries transport errors (RateLimitError, ServerError, ServiceUnavailableError, OverloadedError, timeouts) with backoff. Retrying on :adapter_error against the same model re-ran what transport had already retried — retry × retry with no change in context.

:adapter_error remains available as an explicit opt-in. It is meaningful primarily when paired with escalate "model_a", "model_b" — a different model or provider can succeed where same-model transport retry could not.

2. AdapterCaller narrows rescue from StandardError to RubyLLM::Error + Faraday::Error

Provider errors (the RubyLLM::Error hierarchy) and transport errors that escape ruby_llm's Faraday retry middleware after exhaustion (Faraday::TimeoutError, Faraday::ConnectionFailed) still produce :adapter_error as before.

Programmer errors that are neither — NoMethodError, adapter-code bugs — now propagate instead of being silently converted to :adapter_error and retried. Bugs should be fixed, not retried.

Known limitation: adapter code raising ArgumentError is still coerced into :input_error by Step::Base#run_once (which rescues ArgumentError for input-type validation). Disambiguating adapter-ArgumentError vs input-validation-ArgumentError requires a run_once refactor; tracked as a follow-up.

Migration

Restore pre-0.7 behavior:

retry_policy do
  attempts 3
  retry_on :validation_failed, :parse_error, :adapter_error
end

Preferred — pair with a model fallback chain:

retry_policy do
  escalate "gpt-4.1-nano", "gpt-4.1-mini"
  retry_on :validation_failed, :parse_error, :adapter_error
end

Why the narrative matters

Post-0.7, DEFAULT_RETRY_ON = [:validation_failed, :parse_error] reads cleanly as the gem's core value proposition: retry in ruby_llm-contract is against LLM output variance (malformed JSON, business-rule violations), not against transport or infrastructure. Transport concerns live in ruby_llm/Faraday where they belong; programmer bugs propagate for quick detection.

v0.6.4 — production_mode: retry-aware cost

19 Apr 19:11
34a4697


Highlights

  • production_mode: { fallback: "..." } on compare_models / optimize_retry_policy — measures retry-aware, end-to-end cost per successful output. Each candidate runs with a runtime-injected [candidate, fallback] retry chain.
  • New metrics: escalation_rate, single_shot_cost, effective_cost, single_shot_latency_ms, effective_latency_ms, latency_percentiles — on both Report and AggregatedReport (averaged across runs:).
  • Extended ModelComparison#table: Chain / single-shot / escalation / effective cost / latency / score. Edge case candidate == fallback → em-dash (not 0%), retry injection skipped so effective == single-shot by construction.
  • context[:retry_policy_override] — new context key for transient per-call retry-policy overrides without mutating the step class.
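One way to read the effective-cost metric (my interpretation of the notes above, not necessarily the gem's exact formula): with a 2-tier [candidate, fallback] chain, every run pays the candidate's single-shot cost, and the escalated fraction of runs additionally pays the fallback.

```ruby
# Illustrative retry-aware cost arithmetic for a 2-tier chain; costs in USD.
def effective_cost(single_shot_cost:, fallback_cost:, escalation_rate:)
  single_shot_cost + escalation_rate * fallback_cost
end

effective_cost(single_shot_cost: 0.002, fallback_cost: 0.010, escalation_rate: 0.15)
# ≈ 0.0035
```

This also makes the candidate == fallback edge case intuitive: retry injection is skipped, the escalation term vanishes, and effective == single-shot by construction.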

Scope

  • Single-fallback (2-tier) chains only.
  • Step-only: raises ArgumentError if used on Pipeline::Base subclasses (pipeline-wide fallback semantics are a separate design question).

Docs