Elevate VLM action with four-signal grounding and multipass pipeline by BillJr99 · Pull Request #23 · BillJr99/OSScreenObserver

BillJr99 · 2026-05-17T04:04:35Z

Send the VLM every signal the project already computes on the same step —
the screenshot plus the accessibility tree, OCR text, ASCII sketch (with
tab-index badges and legend), and focused-element hint — instead of the
screenshot alone. Each block is wrapped in its own <X>...</X> envelope,
independently size-capped, gated by a ground_with_* config flag, and
silently omitted on failure so the prompt remains valid screenshot-only.
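
A minimal sketch of how such context blocks might be assembled. This is not the code from this PR: the function names, the per-block caps, and the exact ground_with_* suffixes are assumptions made for illustration only.

```python
from typing import Callable, Dict, Optional

# Hypothetical per-block character caps; the PR text does not state the real values.
DEFAULT_CAPS = {"a11y_tree": 6000, "ocr_text": 3000, "ascii_sketch": 4000, "focused": 500}


def wrap_block(tag: str, body: str, cap: int) -> str:
    """Wrap one signal in its own <tag>...</tag> envelope, truncated to `cap` characters."""
    if len(body) > cap:
        body = body[:cap] + "\n[truncated]"
    return f"<{tag}>\n{body}\n</{tag}>"


def build_context(
    producers: Dict[str, Callable[[], Optional[str]]],
    config: Dict[str, bool],
    caps: Dict[str, int] = DEFAULT_CAPS,
) -> str:
    """Assemble the text context that accompanies the screenshot.

    A block is skipped when its (assumed) ground_with_<tag> flag is false,
    when its producer raises, or when it yields nothing. An empty result
    means the prompt falls back to screenshot-only.
    """
    blocks = []
    for tag, produce in producers.items():
        if not config.get(f"ground_with_{tag}", True):
            continue  # gated off in config
        try:
            body = produce()
        except Exception:
            continue  # a failing signal is silently omitted
        if not body:
            continue
        blocks.append(wrap_block(tag, body, caps.get(tag, 2000)))
    return "\n\n".join(blocks)
```

Capping each block separately keeps one oversized signal (a very deep accessibility tree, for instance) from crowding the others out of the prompt.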

Adds a multipass mode (scene → controls → next-actions, with optional
verify pass) that emits a strict JSON envelope with structured fields
(summary, app, screen_type, focused, controls, next_actions, modal_open,
sensitive_regions, confidence, discrepancies, per-pass timings).
Tolerant JSON parsing handles fenced/garbled responses gracefully; a
failed pass leaves null fields rather than aborting the call.
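
The tolerant parsing and failed-pass behaviour described above could look roughly like the sketch below. The helper names (parse_json_tolerant, run_passes) and the way passes are represented are hypothetical. The real parser may handle more kinds of garbling than the two shown here.

```python
import json
import re
from typing import Any, Callable, Dict, List, Optional, Tuple

FENCE = "`" * 3  # Markdown code-fence marker, built indirectly to keep this listing readable
FENCE_RE = re.compile(FENCE + r"(?:json)?\s*(.*?)" + FENCE, re.DOTALL | re.IGNORECASE)


def parse_json_tolerant(text: str) -> Optional[Dict[str, Any]]:
    """Best-effort extraction of one JSON object from a model reply; None on failure."""
    candidates = [text]
    fenced = FENCE_RE.search(text)
    if fenced:
        candidates.insert(0, fenced.group(1))  # prefer the fenced payload
    start, end = text.find("{"), text.rfind("}")
    if 0 <= start < end:
        candidates.append(text[start : end + 1])  # outermost {...} span as a last resort
    for candidate in candidates:
        try:
            value = json.loads(candidate)
        except (json.JSONDecodeError, TypeError):
            continue
        if isinstance(value, dict):
            return value
    return None


Pass = Tuple[str, List[str], Callable[[], str]]  # (pass name, fields it fills, model call)


def run_passes(passes: List[Pass]) -> Dict[str, Any]:
    """Run passes in order; a failed pass leaves its fields as None instead of aborting."""
    envelope: Dict[str, Any] = {}
    for _name, field_names, call in passes:
        try:
            parsed = parse_json_tolerant(call()) or {}
        except Exception:
            parsed = {}  # model or network error: continue with the next pass
        for field in field_names:
            envelope[field] = parsed.get(field)
    return envelope
```

With this shape, a garbled "controls" reply yields "controls": null in the final envelope while the fields from the scene pass are still returned.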

Recommends Ollama models tuned for a 24 GB 4090 + 128 GB RAM workstation
in the config and README: qwen2.5vl:7b primary, qwen2.5vl:3b for the
fast pass and per-widget crop labels, with notes on the qwen2.5vl:32b
premium and llama3.2-vision:11b verify configurations.

The vlm_setup model picker now optionally prompts for model_fast,
model_actions, and model_verify when mode=multipass; save_model_to_config
takes a key= parameter to persist the auxiliary slots atomically. The
web inspector renders the JSON envelope as a definition list. The legacy
prompt key is honoured as prompt_single for back-compat.
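
For illustration, an atomic per-slot save and the legacy-prompt fallback might be written as below. The real signature of save_model_to_config is not shown in this description, so the parameters here (config_path, model_id) and the resolve_single_prompt helper are assumptions. The fallback order is likewise a guess.

```python
import json
import os
import tempfile


def save_model_to_config(config_path: str, model_id: str, key: str = "model") -> None:
    """Set vlm[key] = model_id and rewrite the config file atomically (sketch)."""
    with open(config_path, "r", encoding="utf-8") as fh:
        config = json.load(fh)
    config.setdefault("vlm", {})[key] = model_id

    # Write to a temp file in the same directory, then swap it into place.
    directory = os.path.dirname(os.path.abspath(config_path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as fh:
            json.dump(config, fh, indent=2)
        os.replace(tmp_path, config_path)  # atomic when both paths share a filesystem
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)  # leave no .tmp file behind on failure
        raise


def resolve_single_prompt(vlm_cfg: dict, default: str) -> str:
    """Prefer prompt_single; fall back to the legacy 'prompt' key, then a default."""
    return vlm_cfg.get("prompt_single") or vlm_cfg.get("prompt") or default
```

Calling save_model_to_config(path, "qwen2.5vl:3b", key="model_fast") would update only the model_fast slot and leave the other slots untouched.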

Adds tests/test_description_vlm.py covering JSON tolerance, image
downscaling, context-block assembly with each ground_with_* gate, and
multipass pass ordering / image reuse / failed-pass tolerance using a
mocked urllib opener. Extends tests/test_vlm_setup.py for the new key=
parameter.
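
One way to exercise HTTP-backed passes without a running model is to patch urllib, in the spirit of the tests described above. The sketch below patches urllib.request.urlopen; the actual tests may mock a different layer (an opener object, for example). call_vlm, run_with_mock, and the endpoint URL are hypothetical stand-ins.

```python
import io
import json
import urllib.request
from unittest import mock


def call_vlm(url: str, payload: dict) -> dict:
    """Hypothetical helper: POST a JSON payload and decode the JSON reply."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.loads(resp.read().decode("utf-8"))


def run_with_mock(replies):
    """Call call_vlm once per canned reply, recording each request body in order."""
    sent = []

    def fake_urlopen(req, timeout=None):
        sent.append(json.loads(req.data.decode("utf-8")))
        # io.BytesIO supports both read() and the context-manager protocol.
        return io.BytesIO(json.dumps(replies[len(sent) - 1]).encode("utf-8"))

    with mock.patch("urllib.request.urlopen", side_effect=fake_urlopen):
        results = [
            call_vlm("http://localhost:11434/api/chat", {"pass": i})
            for i in range(len(replies))
        ]
    return sent, results
```

Recording the request bodies lets a test assert pass ordering, and returning canned replies (including deliberately broken ones) covers failed-pass tolerance.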

https://claude.ai/code/session_01VhYzhCbZ5qvmBThCH8cxLD


Adds ollama_setup.py with three responsibilities wired into the --mode inspect startup in main.py:

1. Runner detection: asks the user (once) how to invoke the Ollama CLI: native 'ollama', 'docker exec <container> ollama', or any custom prefix. Running Docker containers with 'ollama' in their name are auto-discovered and offered as quick-pick options. The chosen prefix is saved as vlm.ollama_runner in config.json so subsequent launches skip the question.

2. Model-role summary: prints which of the four VLM model slots are configured and what each slot is used for:

     model          primary (Pass 2 / single-shot)
     model_fast     Pass 1 scene tag + per-widget crop labels
     model_actions  Pass 3 next-action candidates (text-only LLM OK)
     model_verify   optional verify pass (second model family)

3. Auto-pull: runs '<runner> list', compares against configured model IDs, and pulls any missing Ollama models while streaming pull progress to stderr. Cloud-namespaced IDs (anthropic/, openai/, etc.) are skipped. Duplicate model IDs across slots are de-duplicated. Pull failures print a warning but do not abort startup.

Adds vlm.ollama_runner to config.json.example with documentation.

Adds tests/test_ollama_setup.py covering all deterministic paths (runner detection, model collection, cloud-model skipping, list parsing, deduplication, pull targeting, disabled/empty runner bypass).

https://claude.ai/code/session_01VhYzhCbZ5qvmBThCH8cxLD
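
The pull-targeting logic of the auto-pull step can be sketched as follows. The function names and the CLOUD_PREFIXES tuple are assumptions. The commit names only anthropic/ and openai/ explicitly, and the real module may normalise tags such as ':latest' differently.

```python
from typing import Dict, List, Set

MODEL_SLOTS = ("model", "model_fast", "model_actions", "model_verify")
CLOUD_PREFIXES = ("anthropic/", "openai/")  # illustrative; the source says "etc."


def collect_models(vlm_cfg: Dict[str, str]) -> List[str]:
    """Configured local model IDs in slot order, de-duplicated, cloud IDs skipped."""
    seen: List[str] = []
    for slot in MODEL_SLOTS:
        model_id = (vlm_cfg.get(slot) or "").strip()
        if not model_id or model_id.startswith(CLOUD_PREFIXES) or model_id in seen:
            continue
        seen.append(model_id)
    return seen


def parse_list_output(text: str) -> Set[str]:
    """Model names from '<runner> list' output: first column, header row skipped."""
    names: Set[str] = set()
    for line in text.splitlines()[1:]:
        parts = line.split()
        if parts:
            names.add(parts[0])
    return names


def missing_models(vlm_cfg: Dict[str, str], list_output: str) -> List[str]:
    """Models that are configured but absent from the runner's list output."""
    installed = parse_list_output(list_output)
    return [m for m in collect_models(vlm_cfg) if m not in installed]
```

Each ID returned by missing_models would then be passed to '<runner> pull <id>', with a failed pull reported as a warning so that startup continues.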

Copilot

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

Flips the argparse default from 'both' to 'inspect' so first-run users get the interactive VLM model picker and Ollama auto-pull. In 'mcp' and 'both' modes stdin is owned by the MCP framing channel, so the picker silently disables VLM if no model is configured; the new default ensures the setup paths added in the prior commits actually fire.

vlm.enabled in config.json.example is already true, so out-of-the-box a fresh checkout runs: load config → ask for the Ollama runner → pull configured models → start the web inspector on http://127.0.0.1:5001.

Updates the docstring, argparse help, and README examples to match. The Claude Desktop MCP integration block in README still shows --mode both since that's the only mode that makes sense in that context.

https://claude.ai/code/session_01VhYzhCbZ5qvmBThCH8cxLD

Each launcher detects missing dependencies, prompts before installing, sets up a .venv, installs requirements.txt, and starts the server in the default mode (inspect; interactive VLM setup runs only in this mode).

Dependencies checked per platform:

  Linux:   python3 (+ python3-venv/pip on apt), tesseract-ocr, wmctrl, ollama (via official install.sh). Uses apt/dnf/pacman/zypper as appropriate.

  macOS:   Homebrew (offers to install if missing), python3, tesseract, ollama. Prints the Accessibility/Screen Recording permission note that AX adapter + screenshot capture both require.

  Windows: Python 3.12, Tesseract (UB-Mannheim build), Ollama via winget on Windows 10 1809+ / Windows 11. Falls back to printing the download URLs on older systems.

Each prompt defaults to Yes (Enter to accept) so the happy path needs no typing. Skipping any prompt continues without that capability; for example, skipping Ollama is fine if the user plans to point vlm.base_url at a remote endpoint or run with vlm.enabled=false.

Adds a "Quick start — automated launchers" section to README pointing at the three scripts, ahead of the existing manual-install instructions, which remain for users who prefer to manage their own environment.

https://claude.ai/code/session_01VhYzhCbZ5qvmBThCH8cxLD

Previously the choice between 'inspect' (web UI + interactive VLM setup, no MCP) and 'both' (MCP + web, no interactive setup) had to be made by the user, and each had a silent failure mode in the wrong environment:

  • 'both' from a TTY: the MCP server blocks reading framing bytes from a keyboard nobody types on, and the VLM picker is suppressed because stdin "belongs" to MCP.

  • 'inspect' from a pipe (Claude Desktop launches us): an MCP client waiting on stdout never sees a framed message.

The new 'auto' mode (now the default) resolves this at startup by checking sys.stdin.isatty():

  • TTY → runs as 'inspect'. Interactive VLM setup + Ollama auto-pull prompts work; we don't start MCP because it would just block. This is the graceful fallback for the missing capability (MCP framing).

  • Non-TTY → runs as 'both'. MCP framing has a real client on stdio; interactive setup is skipped (config-driven model picks still apply). The web inspector also comes up on :5001.

Either branch logs a one-line stderr notice naming the resolved mode and the reason. The explicit modes (mcp/inspect/both) still work for users who want to force one regardless of the environment.

https://claude.ai/code/session_01VhYzhCbZ5qvmBThCH8cxLD
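
The 'auto' resolution described above reduces to a small pure function plus an isatty() check. The sketch below uses hypothetical names (resolve_mode, build_parser, main) and a made-up log format; only the mode names and the sys.stdin.isatty() test come from the commit message.

```python
import argparse
import sys


def resolve_mode(requested: str, stdin_is_tty: bool) -> str:
    """Map 'auto' to a concrete mode; explicit modes pass through unchanged."""
    if requested != "auto":
        return requested
    return "inspect" if stdin_is_tty else "both"


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--mode",
        choices=("auto", "mcp", "inspect", "both"),
        default="auto",
        help="auto picks 'inspect' on a TTY and 'both' when stdin is a pipe",
    )
    return parser


def main(argv=None) -> str:
    args = build_parser().parse_args(argv)
    is_tty = bool(sys.stdin and sys.stdin.isatty())
    mode = resolve_mode(args.mode, is_tty)
    if args.mode == "auto":
        reason = "stdin is a TTY" if is_tty else "stdin is not a TTY"
        print(f"[mode] auto resolved to '{mode}' ({reason})", file=sys.stderr)
    return mode
```

Keeping the decision in a function that takes the TTY state as an argument makes both branches testable without a real terminal.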

…ipts

Adds setup_config.py, a small one-shot helper that each launcher now runs after the tesseract install step:

1. If config.json does not exist, copies it from config.json.example so the user has a real file (gitignored) to edit, rather than relying on main.py's lazier bootstrap which fires only when load_config reads the missing file.

2. Checks ocr.tesseract_cmd. The bundled example ships the Windows path 'c:\Program Files\Tesseract-OCR\tesseract.exe', which on Linux/macOS is definitely wrong. If the configured path doesn't exist (or is unset AND tesseract is not on PATH), searches PATH plus common install locations ('/usr/bin', '/usr/local/bin', '/opt/homebrew/bin', the Windows Program Files variants) and offers to update the config to point at the discovered binary. Y-by-default prompt, atomic write via tempfile+rename, and no-ops silently when stdin is not a TTY or when tesseract is truly missing (the launcher's install step has already warned about that case).

start.sh, start-mac.sh, and start.bat all call 'python setup_config.py' right before launching main.py.

Adds tests/test_setup_config.py covering all six branches (configured-and-exists, unset-but-on-PATH, broken, user-declined, totally-missing, missing-ocr-section) plus a check that the atomic save leaves no .tmp leftovers.

On the related question of whether --mode should default to 'both' instead: no. 'auto' (the current default) already picks 'both' whenever stdin is not a TTY, which is exactly the environment in which 'both' makes sense. Forcing 'both' as the default would re-introduce the silent TTY-blocked-on-stdin failure that 'auto' was added to avoid.

https://claude.ai/code/session_01VhYzhCbZ5qvmBThCH8cxLD
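
A simplified sketch of the tesseract lookup described in step 2. find_tesseract and COMMON_DIRS are hypothetical names, and the sketch collapses the six branches of the real helper into a single search order. The Windows Program Files variants and the interactive prompt are omitted.

```python
import os
import shutil
from typing import Callable, Iterable, Optional

# POSIX locations named in the commit message; Windows variants left out of this sketch.
COMMON_DIRS = ("/usr/bin", "/usr/local/bin", "/opt/homebrew/bin")


def find_tesseract(
    configured: Optional[str],
    extra_dirs: Iterable[str] = COMMON_DIRS,
    which: Callable[[str], Optional[str]] = shutil.which,
) -> Optional[str]:
    """Return a usable tesseract path, or None when nothing can be found.

    Order: the configured path if it exists, then PATH, then common directories.
    """
    if configured and os.path.isfile(configured):
        return configured
    on_path = which("tesseract")
    if on_path:
        return on_path
    for directory in extra_dirs:
        for name in ("tesseract", "tesseract.exe"):
            candidate = os.path.join(directory, name)
            if os.path.isfile(candidate):
                return candidate
    return None
```

Passing `which` as a parameter is a testing convenience: a test can substitute a stub and point extra_dirs at a temporary directory instead of depending on what is installed on the machine.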

mss.mss is deprecated and will be removed in a future release. All six call sites in observer.py updated. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

claude added 2 commits May 17, 2026 03:56

Copilot AI review requested due to automatic review settings May 17, 2026 04:04

Copilot started reviewing on behalf of BillJr99 May 17, 2026 04:05

Copilot AI reviewed May 17, 2026


claude and others added 5 commits May 17, 2026 10:43

Replace deprecated mss.mss() with mss.MSS()

7f3afe5


BillJr99 merged commit 7200066 into main May 17, 2026
4 checks passed

BillJr99 merged 7 commits into main from claude/vlm-prompt-optimization-PDPzf


3 participants
