Elevate VLM action with four-signal grounding and multipass pipeline #23

Merged
BillJr99 merged 7 commits into main from claude/vlm-prompt-optimization-PDPzf
May 17, 2026

@BillJr99
Owner

Send the VLM every signal the project already computes on the same step —
the screenshot plus the accessibility tree, OCR text, ASCII sketch (with
tab-index badges and legend), and focused-element hint — instead of the
screenshot alone. Each block is wrapped in its own <X>...</X> envelope,
independently size-capped, gated by a ground_with_* config flag, and
silently omitted on failure so the prompt remains valid screenshot-only.
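
A minimal sketch of how such gated, capped context blocks could be assembled. The function name `build_context`, the block names, the cap values, and the default-on behaviour of the flags are assumptions made for illustration; only the `ground_with_*` prefix, the per-block envelope, the independent caps, and the silent omission on failure come from the description above.

```python
from typing import Callable, Dict, Optional

# Hypothetical per-block character caps; the real limits are not stated in the PR.
DEFAULT_CAPS = {"a11y_tree": 4000, "ocr_text": 2000, "ascii_sketch": 3000, "focused": 300}


def build_context(
    providers: Dict[str, Callable[[], Optional[str]]],
    config: Dict[str, object],
    caps: Optional[Dict[str, int]] = None,
) -> str:
    """Assemble tagged grounding blocks, skipping any that are gated off or fail."""
    caps = DEFAULT_CAPS if caps is None else caps
    blocks = []
    for name, provider in providers.items():
        # Gate: ground_with_<name> set to False disables the block
        # (defaulting to enabled is an assumption of this sketch).
        if not config.get(f"ground_with_{name}", True):
            continue
        try:
            text = provider()
        except Exception:
            # Silent omission: the prompt stays valid without this block.
            continue
        if not text:
            continue
        text = text[: caps.get(name, 2000)]  # independent size cap per block
        tag = name.upper()
        blocks.append(f"<{tag}>\n{text}\n</{tag}>")
    return "\n\n".join(blocks)


def broken_ocr() -> str:
    raise RuntimeError("tesseract missing")


prompt_context = build_context(
    {
        "a11y_tree": lambda: "button 'OK'",
        "ocr_text": broken_ocr,            # fails, so it is dropped
        "focused": lambda: "OK button",    # gated off below
    },
    {"ground_with_focused": False},
)
print(prompt_context)
```

With the OCR provider failing and the focused-element block gated off, only the accessibility-tree block survives, and an empty result would simply leave the prompt screenshot-only.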

Adds a multipass mode (scene → controls → next-actions, with optional
verify pass) that emits a strict JSON envelope with structured fields
(summary, app, screen_type, focused, controls, next_actions, modal_open,
sensitive_regions, confidence, discrepancies, per-pass timings).
Tolerant JSON parsing handles fenced/garbled responses gracefully; a
failed pass leaves null fields rather than aborting the call.
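
One way the tolerant parsing and failed-pass behaviour could look, as a sketch only. The helper names `parse_json_tolerant` and `merge_pass` are hypothetical, and the recovery strategy shown (try a fenced block, then the raw reply, then the outermost brace span) is an assumption about what "tolerant" means here rather than the PR's actual implementation.

```python
import json
import re
from typing import Any, Dict, List, Optional

_TICKS = "`" * 3  # Markdown code-fence marker, built here to keep this listing fence-safe
_FENCE = re.compile(_TICKS + r"(?:json)?\s*(.*?)" + _TICKS, re.DOTALL | re.IGNORECASE)


def parse_json_tolerant(raw: str) -> Optional[Dict[str, Any]]:
    """Return the first JSON object recoverable from a model reply, or None."""
    candidates = [raw]
    fenced = _FENCE.search(raw)
    if fenced:
        candidates.insert(0, fenced.group(1))
    # Fall back to the outermost {...} span to skip chatter around the object.
    start, end = raw.find("{"), raw.rfind("}")
    if 0 <= start < end:
        candidates.append(raw[start : end + 1])
    for text in candidates:
        try:
            value = json.loads(text.strip())
        except json.JSONDecodeError:
            continue
        if isinstance(value, dict):
            return value
    return None


def merge_pass(envelope: Dict[str, Any], fields: List[str], raw: str) -> None:
    """Copy one pass's fields into the envelope; a failed pass leaves them None."""
    parsed = parse_json_tolerant(raw) or {}
    for field in fields:
        envelope[field] = parsed.get(field)


envelope: Dict[str, Any] = {}
fenced_reply = "Sure!\n" + _TICKS + 'json\n{"summary": "Login form", "app": "Firefox"}\n' + _TICKS
merge_pass(envelope, ["summary", "app"], fenced_reply)
merge_pass(envelope, ["controls"], "I cannot see the image")  # unparseable pass
print(envelope)
```

The second pass returns prose instead of JSON, so `controls` is recorded as `None` and the call as a whole still produces an envelope.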

Recommends Ollama models tuned for a 24 GB 4090 + 128 GB RAM workstation
in the config and README: qwen2.5vl:7b primary, qwen2.5vl:3b for the
fast pass and per-widget crop labels, with notes on the qwen2.5vl:32b
premium and llama3.2-vision:11b verify configurations.

The vlm_setup model picker now optionally prompts for model_fast,
model_actions, and model_verify when mode=multipass; save_model_to_config
takes a key= parameter to persist the auxiliary slots atomically. The
web inspector renders the JSON envelope as a definition list. The legacy
prompt key is honoured as prompt_single for back-compat.
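
For illustration, an atomic save with a `key=` parameter might be sketched as follows. Only the function name and the `key=` parameter come from the description; the positional arguments, the `vlm` section layout, and the temp-file approach are assumptions of this sketch.

```python
import json
import os
import tempfile


def save_model_to_config(path: str, model_id: str, key: str = "model") -> None:
    """Set vlm[key] in the JSON config and swap the file in atomically."""
    with open(path, "r", encoding="utf-8") as fh:
        config = json.load(fh)
    config.setdefault("vlm", {})[key] = model_id

    # Write to a sibling temp file, then rename it over the original so a
    # crash mid-write never leaves a truncated config behind.
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as fh:
            json.dump(config, fh, indent=2)
            fh.write("\n")
        os.replace(tmp_path, path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

Calling `save_model_to_config("config.json", "qwen2.5vl:3b", key="model_fast")` would then update only the fast-pass slot and leave the primary `model` entry untouched.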

Adds tests/test_description_vlm.py covering JSON tolerance, image
downscaling, context-block assembly with each ground_with_* gate, and
multipass pass ordering / image reuse / failed-pass tolerance using a
mocked urllib opener. Extends tests/test_vlm_setup.py for the new key=
parameter.

https://claude.ai/code/session_01VhYzhCbZ5qvmBThCH8cxLD

claude added 2 commits May 17, 2026 03:56
Adds ollama_setup.py with three responsibilities wired into the
--mode inspect startup in main.py:

1. Runner detection: asks the user (once) how to invoke the Ollama CLI
   — native 'ollama', 'docker exec <container> ollama', or any custom
   prefix. Running Docker containers with 'ollama' in their name are
   auto-discovered and offered as quick-pick options. The chosen prefix
   is saved as vlm.ollama_runner in config.json so subsequent launches
   skip the question.

2. Model-role summary: prints which of the four VLM model slots are
   configured and what each slot is used for:
     model         — primary (Pass 2 / single-shot)
     model_fast    — Pass 1 scene tag + per-widget crop labels
     model_actions — Pass 3 next-action candidates (text-only LLM OK)
     model_verify  — optional verify pass (second model family)

3. Auto-pull: runs '<runner> list', compares against configured model
   IDs, and pulls any missing Ollama models while streaming pull
   progress to stderr. Cloud-namespaced IDs (anthropic/, openai/, etc.)
   are skipped. Duplicate model IDs across slots are de-duplicated.
   Pull failures print a warning but do not abort startup.
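
The pull-targeting logic in step 3 can be sketched roughly as below. The function names, the assumption that the model name is the first column of the list output after a header row, and the exact cloud prefix tuple are illustrative; the commit names only `anthropic/` and `openai/` explicitly. The sample listing is made-up data.

```python
import shlex
from typing import Dict, Iterable, List, Set

MODEL_SLOTS = ("model", "model_fast", "model_actions", "model_verify")
# Assumed prefix list; extend it to match whatever the project treats as cloud IDs.
CLOUD_PREFIXES = ("anthropic/", "openai/")


def parse_list_output(stdout: str) -> Set[str]:
    """Extract model names from the first column of '<runner> list' output."""
    names = set()
    for line in stdout.splitlines()[1:]:  # skip the header row
        parts = line.split()
        if parts:
            names.add(parts[0])
    return names


def models_to_pull(vlm_cfg: Dict[str, str], installed: Iterable[str]) -> List[str]:
    """Configured local models that are not installed, in slot order, de-duplicated."""
    have = set(installed)
    wanted: List[str] = []
    for slot in MODEL_SLOTS:
        model = vlm_cfg.get(slot)
        if not model or model.startswith(CLOUD_PREFIXES):
            continue  # unset slot or cloud-hosted model: nothing to pull
        if model not in have and model not in wanted:
            wanted.append(model)
    return wanted


def pull_command(runner: str, model: str) -> List[str]:
    """Build the argv for a pull using the saved runner prefix."""
    return shlex.split(runner) + ["pull", model]


# Illustrative listing, not real output.
listing = (
    "NAME            ID              SIZE      MODIFIED\n"
    "qwen2.5vl:7b    0123456789ab    6.0 GB    2 days ago\n"
)
cfg = {
    "model": "qwen2.5vl:7b",
    "model_fast": "qwen2.5vl:3b",
    "model_actions": "qwen2.5vl:3b",       # duplicate of model_fast
    "model_verify": "anthropic/some-model",  # hypothetical cloud ID, skipped
}
missing = models_to_pull(cfg, parse_list_output(listing))
print(missing)
print(pull_command("docker exec ollama ollama", missing[0]))
```

Keeping the runner as a single saved string and splitting it with `shlex` lets the same code path serve the native CLI, a `docker exec` prefix, or any custom prefix.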

Adds vlm.ollama_runner to config.json.example with documentation.
Adds tests/test_ollama_setup.py covering all deterministic paths
(runner detection, model collection, cloud-model skipping, list
parsing, deduplication, pull targeting, disabled/empty runner bypass).

https://claude.ai/code/session_01VhYzhCbZ5qvmBThCH8cxLD
Copilot AI review requested due to automatic review settings May 17, 2026 04:04

Copilot AI left a comment

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

claude and others added 5 commits May 17, 2026 10:43
Flips the argparse default from 'both' to 'inspect' so first-run users
get the interactive VLM model picker and Ollama auto-pull. In 'mcp' and
'both' modes stdin is owned by the MCP framing channel, so the picker
silently disables VLM if no model is configured — the new default ensures
the setup paths added in the prior commits actually fire.

vlm.enabled in config.json.example is already true, so out-of-the-box a
fresh checkout runs: load config → ask for the Ollama runner → pull
configured models → start the web inspector on http://127.0.0.1:5001.

Updates the docstring, argparse help, and README examples to match.
The Claude Desktop MCP integration block in README still shows
--mode both since that's the only mode that makes sense in that context.

https://claude.ai/code/session_01VhYzhCbZ5qvmBThCH8cxLD
Each launcher detects missing dependencies, prompts before installing,
sets up a .venv, installs requirements.txt, and starts the server in the
default mode (inspect — interactive VLM setup runs only in this mode).

Dependencies checked per platform:
  Linux:   python3 (+ python3-venv/pip on apt), tesseract-ocr, wmctrl,
           ollama (via official install.sh). Uses apt/dnf/pacman/zypper
           as appropriate.
  macOS:   Homebrew (offers to install if missing), python3, tesseract,
           ollama. Prints the Accessibility/Screen Recording permission
           note that AX adapter + screenshot capture both require.
  Windows: Python 3.12, Tesseract (UB-Mannheim build), Ollama via winget
           on Windows 10 1809+ / Windows 11. Falls back to printing the
           download URLs on older systems.

Each prompt defaults to Yes (Enter to accept) so the happy path needs no
typing. Skipping any prompt continues without that capability — for
example skipping Ollama is fine if the user plans to point vlm.base_url
at a remote endpoint or run with vlm.enabled=false.

Adds a "Quick start — automated launchers" section to README pointing
at the three scripts, ahead of the existing manual-install instructions
which remain for users who prefer to manage their own environment.

https://claude.ai/code/session_01VhYzhCbZ5qvmBThCH8cxLD
Previously the choice between 'inspect' (web UI + interactive VLM setup,
no MCP) and 'both' (MCP + web, no interactive setup) had to be made by
the user — and each had a silent failure mode in the wrong environment:

  • 'both' from a TTY: the MCP server blocks reading framing bytes from
    a keyboard nobody types on, and the VLM picker is suppressed because
    stdin "belongs" to MCP.
  • 'inspect' from a pipe (Claude Desktop launches us): an MCP client
    waiting on stdout never sees a framed message.

The new 'auto' mode (now the default) resolves this at startup by
checking sys.stdin.isatty():

  • TTY     → runs as 'inspect'. Interactive VLM setup + Ollama auto-pull
              prompts work; we don't start MCP because it would just
              block. This is the graceful fallback for the missing
              capability (MCP framing).
  • Non-TTY → runs as 'both'. MCP framing has a real client on stdio;
              interactive setup is skipped (config-driven model picks
              still apply). The web inspector also comes up on :5001.

Either branch logs a one-line stderr notice naming the resolved mode and
the reason. The explicit modes (mcp/inspect/both) still work for users
who want to force one regardless of the environment.
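
The resolution rule described above is small enough to sketch directly. The function name `resolve_mode` and the wording of the log line are hypothetical; the TTY check via `sys.stdin.isatty()` and the mapping to 'inspect' or 'both' follow the commit message.

```python
import sys
from typing import Tuple


def resolve_mode(requested: str, stdin_is_tty: bool) -> Tuple[str, str]:
    """Map 'auto' to a concrete mode; explicit modes pass through unchanged."""
    if requested != "auto":
        return requested, "explicitly requested"
    if stdin_is_tty:
        # A human is at the keyboard: interactive setup works, MCP would block.
        return "inspect", "stdin is a TTY, so no MCP client is attached"
    # Piped stdin: assume an MCP client owns stdio and skip interactive prompts.
    return "both", "stdin is not a TTY, so an MCP client is assumed on stdio"


if __name__ == "__main__":
    mode, reason = resolve_mode("auto", sys.stdin.isatty())
    print(f"[startup] mode={mode} ({reason})", file=sys.stderr)
```

Passing the TTY state in as an argument, rather than reading it inside the function, keeps both branches testable without a real terminal.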

https://claude.ai/code/session_01VhYzhCbZ5qvmBThCH8cxLD
…ipts

Adds setup_config.py — a small one-shot helper that each launcher now
runs after the tesseract install step:

  1. If config.json does not exist, copies it from config.json.example so
     the user has a real file (gitignored) to edit, rather than relying
     on main.py's lazier bootstrap which fires only when load_config
     reads the missing file.

  2. Checks ocr.tesseract_cmd. The bundled example ships the Windows
     path 'c:\Program Files\Tesseract-OCR\tesseract.exe', which on
     Linux/macOS is definitely wrong. If the configured path doesn't
     exist (or is unset AND tesseract is not on PATH), searches PATH
     plus common install locations ('/usr/bin', '/usr/local/bin',
     '/opt/homebrew/bin', the Windows Program Files variants) and
     offers to update the config to point at the discovered binary.

  Y-by-default prompt, atomic write via tempfile+rename, and no-ops
  silently when stdin is not a TTY or when tesseract is truly missing
  (the launcher's install step has already warned about that case).
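
The discovery order in step 2 might be sketched like this. The function name `find_tesseract` and the exact directory tuple are assumptions; the source lists the Unix directories and mentions "Windows Program Files variants" without enumerating them, so only one such variant appears here.

```python
import os
import shutil
from typing import Iterable, Optional

# Directories named in the commit message, plus one assumed Windows variant.
COMMON_DIRS = (
    "/usr/bin",
    "/usr/local/bin",
    "/opt/homebrew/bin",
    r"C:\Program Files\Tesseract-OCR",
)


def find_tesseract(
    configured: Optional[str], extra_dirs: Iterable[str] = COMMON_DIRS
) -> Optional[str]:
    """Return a usable tesseract path, preferring the configured one."""
    if configured and os.path.isfile(configured):
        return configured  # the configured path is fine, nothing to fix
    on_path = shutil.which("tesseract")
    if on_path:
        return on_path
    for directory in extra_dirs:
        for name in ("tesseract", "tesseract.exe"):
            candidate = os.path.join(directory, name)
            if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
                return candidate
    return None  # truly missing: the caller stays silent
```

A caller would compare the result with the configured `ocr.tesseract_cmd` and offer the update only when the two differ and the result is not `None`.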

start.sh, start-mac.sh, and start.bat all call 'python setup_config.py'
right before launching main.py. Adds tests/test_setup_config.py covering
all six branches (configured-and-exists, unset-but-on-PATH, broken,
user-declined, totally-missing, missing-ocr-section) plus a check that
the atomic save leaves no .tmp leftovers.

On the related question of whether --mode should default to 'both'
instead: no. 'auto' (the current default) already picks 'both' whenever
stdin is not a TTY — exactly the environment in which 'both' makes
sense. Forcing 'both' as the default would re-introduce the silent
TTY-blocked-on-stdin failure that 'auto' was added to avoid.

https://claude.ai/code/session_01VhYzhCbZ5qvmBThCH8cxLD
mss.mss is deprecated and will be removed in a future release.
All six call sites in observer.py updated.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@BillJr99 BillJr99 merged commit 7200066 into main May 17, 2026
4 checks passed

3 participants