Elevate VLM action with four-signal grounding and multipass pipeline #23
Merged
Send the VLM every signal the project already computes on the same step
(the screenshot plus the accessibility tree, OCR text, ASCII sketch with
tab-index badges and legend, and focused-element hint) instead of the
screenshot alone. Each block is wrapped in its own <X>...</X> envelope,
independently size-capped, gated by a ground_with_* config flag, and
silently omitted on failure so the prompt remains valid screenshot-only.

Adds a multipass mode (scene → controls → next-actions, with optional
verify pass) that emits a strict JSON envelope with structured fields
(summary, app, screen_type, focused, controls, next_actions, modal_open,
sensitive_regions, confidence, discrepancies, per-pass timings).
Tolerant JSON parsing handles fenced/garbled responses gracefully; a
failed pass leaves null fields rather than aborting the call.

Recommends Ollama models tuned for a 24 GB 4090 + 128 GB RAM workstation
in the config and README: qwen2.5vl:7b primary, qwen2.5vl:3b for the
fast pass and per-widget crop labels, with notes on the qwen2.5vl:32b
premium and llama3.2-vision:11b verify configurations.

The vlm_setup model picker now optionally prompts for model_fast,
model_actions, and model_verify when mode=multipass; save_model_to_config
takes a key= parameter to persist the auxiliary slots atomically. The
web inspector renders the JSON envelope as a definition list. The legacy
prompt key is honoured as prompt_single for back-compat.

Adds tests/test_description_vlm.py covering JSON tolerance, image
downscaling, context-block assembly with each ground_with_* gate, and
multipass pass ordering / image reuse / failed-pass tolerance using a
mocked urllib opener. Extends tests/test_vlm_setup.py for the new key=
parameter.
https://claude.ai/code/session_01VhYzhCbZ5qvmBThCH8cxLD
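The grounding and parsing behaviour described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code from this PR: the function names build_context and parse_tolerant_json, the tag names, and the size caps are all hypothetical; only the ground_with_* flag prefix, the per-block envelope, and the "omit on failure" and "tolerate fenced replies" behaviours come from the description.

```python
import json
import re

# Hypothetical per-block character caps; the project's real limits may differ.
DEFAULT_CAPS = {"a11y": 4000, "ocr": 2000, "sketch": 3000, "focus": 300}
FENCE = "`" * 3  # Markdown code-fence marker, built indirectly for readability


def build_context(signals, config, caps=DEFAULT_CAPS):
    """Wrap each available signal in its own <tag>...</tag> envelope.

    `signals` maps a (hypothetical) tag name to a zero-argument callable
    that returns text. A block is skipped when its ground_with_<tag> flag
    is false, when the callable raises, or when it returns nothing, so the
    prompt degrades to screenshot-only instead of failing.
    """
    blocks = []
    for tag, produce in signals.items():
        if not config.get(f"ground_with_{tag}", True):
            continue
        try:
            text = produce()
        except Exception:
            continue  # silently omit a failed signal
        if not text:
            continue
        capped = text[: caps.get(tag, 2000)]
        blocks.append(f"<{tag}>\n{capped}\n</{tag}>")
    return "\n".join(blocks)


def parse_tolerant_json(raw):
    """Best-effort extraction of one JSON object from a model reply."""
    # Prefer the body of a Markdown fence if the model added one.
    fenced = re.search(FENCE + r"(?:json)?\s*(.*?)" + FENCE, raw, re.DOTALL)
    candidate = fenced.group(1) if fenced else raw
    # Keep only the outermost {...} span to skip surrounding chatter.
    start, end = candidate.find("{"), candidate.rfind("}")
    if start == -1 or end <= start:
        return None
    try:
        parsed = json.loads(candidate[start : end + 1])
    except json.JSONDecodeError:
        return None
    return parsed if isinstance(parsed, dict) else None
```

Returning None from the parser, rather than raising, is what lets a caller leave the corresponding envelope fields null and carry on with the remaining passes.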
Adds ollama_setup.py with three responsibilities wired into the
--mode inspect startup in main.py:
1. Runner detection: asks the user (once) how to invoke the Ollama CLI
— native 'ollama', 'docker exec <container> ollama', or any custom
prefix. Running Docker containers with 'ollama' in their name are
auto-discovered and offered as quick-pick options. The chosen prefix
is saved as vlm.ollama_runner in config.json so subsequent launches
skip the question.
2. Model-role summary: prints which of the four VLM model slots are
configured and what each slot is used for:
model — primary (Pass 2 / single-shot)
model_fast — Pass 1 scene tag + per-widget crop labels
model_actions — Pass 3 next-action candidates (text-only LLM OK)
model_verify — optional verify pass (second model family)
3. Auto-pull: runs '<runner> list', compares against configured model
IDs, and pulls any missing Ollama models while streaming pull
progress to stderr. Cloud-namespaced IDs (anthropic/, openai/, etc.)
are skipped. Duplicate model IDs across slots are de-duplicated.
Pull failures print a warning but do not abort startup.
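As a rough illustration of the auto-pull selection in item 3, the sketch below decides which models still need pulling. It assumes that '<runner> list' prints a header row followed by one model per line with the name in the first column; the function names, the slot tuple, and the two-entry cloud prefix list are hypothetical (the commit message lists the prefixes only as "anthropic/, openai/, etc.").

```python
# Illustrative subset; the full list of cloud namespaces is not given here.
CLOUD_PREFIXES = ("anthropic/", "openai/")
MODEL_SLOTS = ("model", "model_fast", "model_actions", "model_verify")


def installed_models(list_output):
    """Collect model names from `ollama list` style output (header + rows)."""
    names = set()
    for line in list_output.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if fields:
            names.add(fields[0])
    return names


def models_to_pull(vlm_config, list_output):
    """Return configured local models that are missing, without duplicates."""
    have = installed_models(list_output)
    wanted = []
    for slot in MODEL_SLOTS:
        model_id = vlm_config.get(slot)
        if not model_id or model_id.startswith(CLOUD_PREFIXES):
            continue  # unset slot or cloud-hosted model: nothing to pull
        if model_id in have or model_id in wanted:
            continue  # already installed, or already queued by another slot
        wanted.append(model_id)
    return wanted
```

One caveat of this simplified comparison: an ID configured without a tag will not match a listed name that carries one (for example a ":latest" suffix), so a real implementation would probably normalise tags before comparing.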
Adds vlm.ollama_runner to config.json.example with documentation.
Adds tests/test_ollama_setup.py covering all deterministic paths
(runner detection, model collection, cloud-model skipping, list
parsing, deduplication, pull targeting, disabled/empty runner bypass).
https://claude.ai/code/session_01VhYzhCbZ5qvmBThCH8cxLD
Flips the argparse default from 'both' to 'inspect' so first-run users get the interactive VLM model picker and Ollama auto-pull. In 'mcp' and 'both' modes stdin is owned by the MCP framing channel, so the picker silently disables VLM if no model is configured — the new default ensures the setup paths added in the prior commits actually fire. vlm.enabled in config.json.example is already true, so out-of-the-box a fresh checkout runs: load config → ask for the Ollama runner → pull configured models → start the web inspector on http://127.0.0.1:5001. Updates the docstring, argparse help, and README examples to match. The Claude Desktop MCP integration block in README still shows --mode both since that's the only mode that makes sense in that context. https://claude.ai/code/session_01VhYzhCbZ5qvmBThCH8cxLD
Each launcher detects missing dependencies, prompts before installing,
sets up a .venv, installs requirements.txt, and starts the server in the
default mode (inspect — interactive VLM setup runs only in this mode).
Dependencies checked per platform:
Linux: python3 (+ python3-venv/pip on apt), tesseract-ocr, wmctrl,
ollama (via official install.sh). Uses apt/dnf/pacman/zypper
as appropriate.
macOS: Homebrew (offers to install if missing), python3, tesseract,
ollama. Prints the Accessibility/Screen Recording permission
note that AX adapter + screenshot capture both require.
Windows: Python 3.12, Tesseract (UB-Mannheim build), Ollama via winget
on Windows 10 1809+ / Windows 11. Falls back to printing the
download URLs on older systems.
Each prompt defaults to Yes (Enter to accept) so the happy path needs no
typing. Skipping any prompt continues without that capability — for
example skipping Ollama is fine if the user plans to point vlm.base_url
at a remote endpoint or run with vlm.enabled=false.
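The launchers themselves are shell and batch scripts, so the following Python is only a sketch of the logic they are described as following: detect which tools are missing, then ask with a Yes default. The names missing_tools and confirm are hypothetical, and the injected which and reply_reader parameters exist only to make the sketch testable.

```python
import shutil


def missing_tools(tools, which=shutil.which):
    """Return the tools that cannot be found on PATH."""
    return [tool for tool in tools if which(tool) is None]


def confirm(prompt, reply_reader=input):
    """Return True unless the user explicitly answers no (Enter means yes)."""
    reply = reply_reader(f"{prompt} [Y/n] ").strip().lower()
    return reply not in ("n", "no")
```

A launcher following this shape would loop over missing_tools(...) and install only the entries for which confirm(...) returns True, continuing without the capability otherwise.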
Adds a "Quick start — automated launchers" section to README pointing
at the three scripts, ahead of the existing manual-install instructions
which remain for users who prefer to manage their own environment.
https://claude.ai/code/session_01VhYzhCbZ5qvmBThCH8cxLD
Previously the choice between 'inspect' (web UI + interactive VLM setup,
no MCP) and 'both' (MCP + web, no interactive setup) had to be made by
the user — and each had a silent failure mode in the wrong environment:
• 'both' from a TTY: the MCP server blocks reading framing bytes from
a keyboard nobody types on, and the VLM picker is suppressed because
stdin "belongs" to MCP.
• 'inspect' from a pipe (Claude Desktop launches us): an MCP client
waiting on stdout never sees a framed message.
The new 'auto' mode (now the default) resolves this at startup by
checking sys.stdin.isatty():
• TTY → runs as 'inspect'. Interactive VLM setup + Ollama auto-pull
prompts work; we don't start MCP because it would just
block. This is the graceful fallback for the missing
capability (MCP framing).
• Non-TTY → runs as 'both'. MCP framing has a real client on stdio;
interactive setup is skipped (config-driven model picks
still apply). The web inspector also comes up on :5001.
Either branch logs a one-line stderr notice naming the resolved mode and
the reason. The explicit modes (mcp/inspect/both) still work for users
who want to force one regardless of the environment.
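A minimal sketch of the resolution rule, assuming a helper named resolve_mode (hypothetical) that receives the result of sys.stdin.isatty() from the caller; the wording of the logged reason is also illustrative.

```python
import sys


def resolve_mode(requested, stdin_is_tty):
    """Map 'auto' to a concrete mode; pass explicit modes through unchanged."""
    if requested != "auto":
        return requested, "explicitly requested"
    if stdin_is_tty:
        # A human is at the keyboard: interactive setup works, MCP would block.
        return "inspect", "stdin is a TTY"
    # Something is piping stdio to us, most likely an MCP client.
    return "both", "stdin is not a TTY"


def announce(mode, reason, stream=sys.stderr):
    """Emit the one-line notice on stderr so stdout stays free for MCP framing."""
    print(f"mode resolved to '{mode}' ({reason})", file=stream)
```

Typical use at startup would be `mode, reason = resolve_mode(args.mode, sys.stdin.isatty())` followed by `announce(mode, reason)`, where args.mode is assumed to come from argparse.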
https://claude.ai/code/session_01VhYzhCbZ5qvmBThCH8cxLD
…ipts
Adds setup_config.py — a small one-shot helper that each launcher now
runs after the tesseract install step:
1. If config.json does not exist, copies it from config.json.example so
the user has a real file (gitignored) to edit, rather than relying
on main.py's lazier bootstrap which fires only when load_config
reads the missing file.
2. Checks ocr.tesseract_cmd. The bundled example ships the Windows
path 'c:\Program Files\Tesseract-OCR\tesseract.exe', which on
Linux/macOS is definitely wrong. If the configured path doesn't
exist (or is unset AND tesseract is not on PATH), searches PATH
plus common install locations ('/usr/bin', '/usr/local/bin',
'/opt/homebrew/bin', the Windows Program Files variants) and
offers to update the config to point at the discovered binary.
Y-by-default prompt, atomic write via tempfile+rename, and no-ops
silently when stdin is not a TTY or when tesseract is truly missing
(the launcher's install step has already warned about that case).
start.sh, start-mac.sh, and start.bat all call 'python setup_config.py'
right before launching main.py. Adds tests/test_setup_config.py covering
all six branches (configured-and-exists, unset-but-on-PATH, broken,
user-declined, totally-missing, missing-ocr-section) plus a check that
the atomic save leaves no .tmp leftovers.
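The discovery and atomic-save steps could look roughly like the sketch below. It is a simplified illustration rather than the contents of setup_config.py: the function names are hypothetical, the directory list covers only the Unix locations named above (the Windows Program Files variants are omitted), and the prompt and the unset-but-on-PATH no-op are left out.

```python
import json
import os
import shutil
import tempfile

# Unix locations named in the commit message; Windows variants omitted here.
COMMON_DIRS = ["/usr/bin", "/usr/local/bin", "/opt/homebrew/bin"]


def find_tesseract(configured, extra_dirs=COMMON_DIRS):
    """Return a usable tesseract path, or None if none can be found."""
    if configured and os.path.isfile(configured):
        return configured  # the configured path is fine as-is
    on_path = shutil.which("tesseract")
    if on_path:
        return on_path
    for directory in extra_dirs:
        for name in ("tesseract", "tesseract.exe"):
            candidate = os.path.join(directory, name)
            if os.path.isfile(candidate):
                return candidate
    return None


def save_config_atomically(path, config):
    """Write JSON to a temp file in the same directory, then rename over path."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(config, handle, indent=2)
        os.replace(tmp_path, path)  # atomic when both are on one filesystem
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)  # leave no .tmp leftovers on failure
        raise
```

Creating the temporary file next to the target keeps the rename on a single filesystem, which is what makes os.replace an atomic swap rather than a copy.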
On the related question of whether --mode should default to 'both'
instead: no. 'auto' (the current default) already picks 'both' whenever
stdin is not a TTY — exactly the environment in which 'both' makes
sense. Forcing 'both' as the default would re-introduce the silent
TTY-blocked-on-stdin failure that 'auto' was added to avoid.
https://claude.ai/code/session_01VhYzhCbZ5qvmBThCH8cxLD
mss.mss is deprecated and will be removed in a future release. All six
call sites in observer.py updated.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>