BillJr99 · May 17, 2026
diff --git a/README.md b/README.md
@@ -44,9 +44,11 @@ OSScreenObserver exposes a full REST API on port `5001` (configurable). Most `/a
 ### Startup modes
 
 ```bash
-python main.py --mode inspect          # HTTP server only (web UI + REST API)
-python main.py --mode both             # REST API + MCP stdio simultaneously
-python main.py --mock --mode inspect   # Mock mode with synthetic data (no OS access)
+python main.py                         # Default: auto — TTY → inspect (web UI + interactive setup); piped (Claude Desktop) → both
+python main.py --mode both             # Force REST API + MCP stdio simultaneously
+python main.py --mode inspect          # Force HTTP server only (web UI + REST API)
+python main.py --mode mcp              # Force MCP stdio only
+python main.py --mock                  # Mock mode with synthetic data (no OS access)
 python main.py --mock --scenario scenarios_examples/login.yaml  # Scenario-driven mock
 ```
 
@@ -125,6 +127,31 @@ The REST API endpoints map directly to the `SCREEN_TOOLS` OpenAI/OpenWebUI funct
 
 ## Installation
 
+### Quick start — automated launchers
+
+The fastest path is the platform launcher, which detects missing
+dependencies (Python, Tesseract, Ollama, wmctrl on Linux), asks before
+installing each one, sets up a `.venv/`, installs `requirements.txt`, and
+starts the server:
+
+```bash
+# Linux
+./start.sh
+
+# macOS
+./start-mac.sh
+
+# Windows  (Command Prompt or PowerShell)
+start.bat
+```
+
+The scripts use `winget` on Windows, Homebrew on macOS, and the native
+package manager on Linux (apt / dnf / pacman / zypper). Decline any prompt
+to install that dependency manually later; the launcher still starts the
+server with whatever is already installed.
+
+For a manual install, follow the steps below.
+
 ### 1. Python environment
 
 ```bash
@@ -211,42 +238,74 @@ chat-completions endpoint. Two common setups:
 | Setup | `base_url` | notes |
 |-------|-----------|-------|
 | **OpenWebUI** | `http://localhost:3000` | fronts Ollama, Anthropic, OpenAI, etc. |
-| **Ollama direct** | `http://localhost:11434` | use a vision model such as `llava` or `llama3.2-vision` |
+| **Ollama direct** | `http://localhost:11434` | use a vision model such as `qwen2.5vl:7b`, `llama3.2-vision`, or `minicpm-v` |
 | **OpenAI / LiteLLM / other** | your endpoint URL | standard `/v1` path |
 
 OSScreenObserver automatically probes `/api/v1/models` first (OpenWebUI
 convention) and falls back to `/v1/models` (Ollama / OpenAI convention),
 so pointing `base_url` straight at Ollama works without any extra
 configuration.
 
-```jsonc
-// config.json — OpenWebUI example
-"vlm": {
-  "enabled":  true,
-  "base_url": "http://localhost:3000",   // OpenWebUI URL
-  "api_key":  null,                       // or set $OWUI_API_KEY
-  "model":    null,                       // null → pick interactively on first launch
-  "max_tokens": 1500
-}
-```
+The VLM channel has two operating modes:
+
+* **`single`** — one screenshot + one prompt, optionally grounded with the
+  accessibility tree, OCR text, ASCII sketch, and focused-element hint as
+  `<X>...</X>` envelopes appended to the prompt. Cheap (one HTTP call) and
+  backward-compatible with prior versions.
+* **`multipass`** — a three-pass pipeline (scene → controls → next-actions)
+  with an optional verify pass. Returns a structured JSON envelope with
+  `summary`, `app`, `screen_type`, `focused`, `controls`, `next_actions`,
+  `modal_open`, `sensitive_regions`, and per-pass timings. The envelope
+  travels in the legacy `description` field as pretty-printed JSON and is
+  also exposed parsed under the new `vlm_structured` field for callers that
+  prefer not to re-parse.
 
 ```jsonc
-// config.json — Ollama direct example
+// config.json — Ollama direct, recommended starting configuration
 "vlm": {
   "enabled":  true,
-  "base_url": "http://localhost:11434",  // Ollama's native API port
+  "base_url": "http://localhost:11434",
   "api_key":  null,
-  "model":    "llama3.2-vision:11b",      // any vision-capable model pulled in Ollama
-  "max_tokens": 1500
+
+  "model":         "qwen2.5vl:7b",       // primary (Pass 2 / single-shot)
+  "model_fast":    "qwen2.5vl:3b",       // Pass 1 + per-widget crop labels
+  "model_actions": null,                 // Pass 3 (no image); falls back to primary
+  "model_verify":  null,                 // optional second opinion
+
+  "mode":          "multipass",          // or "single" for legacy one-shot
+  "output_format": "json",
+  "max_tokens":    2000,
+  "temperature":   0.1,
+
+  "ground_with_tree":   true,            // inject <ACCESSIBILITY_TREE>
+  "ground_with_ocr":    true,            // inject <OCR_TEXT>
+  "ground_with_sketch": true,            // inject <ASCII_SKETCH> with tab badges
+  "ground_with_focus":  true             // inject <FOCUSED_ELEMENT>
 }
 ```
 
+**Recommended Ollama models (24 GB RTX 4090, 128 GB RAM):**
+
+| Role | Model | Tag | ~VRAM | Notes |
+|---|---|---|---|---|
+| Primary, best overall | Qwen2.5-VL 7B | `qwen2.5vl:7b` | ~8 GB | Strong small open VLM for UI/document tasks; handles small fonts well. |
+| Primary, premium | Qwen2.5-VL 32B (Q4_K_M) | `qwen2.5vl:32b` | ~20 GB | Top-tier reasoning; fits 24 GB at Q4; slower per image. |
+| Different family (verify) | Llama 3.2 Vision 11B | `llama3.2-vision:11b` | ~9 GB | Good `model_verify` pair for a genuine second opinion. |
+| OCR-heavy screens | MiniCPM-V 2.6 | `minicpm-v:8b` | ~7 GB | Excellent on dense text and forms. |
+| Pass 1 / crop labels | Qwen2.5-VL 3B | `qwen2.5vl:3b` | ~4 GB | Cheap scene tagging; reused for the ASCII renderer's crop labeller. |
+| Pass 3 (text-only) | Qwen2.5 14B | `qwen2.5:14b` | ~9 GB | Pass 3 has no image; a text-only LLM is faster than a VLM. |
+
+Set `OLLAMA_KEEP_ALIVE=30m` and `OLLAMA_MAX_LOADED_MODELS=2` in the Ollama server's
+environment so the primary and fast models stay resident across multipass calls.
+
 The first time you run `python main.py --mode inspect` with
 `vlm.enabled=true` and `vlm.model=null`, OSScreenObserver fetches the
 model list from the endpoint, shows a paginated picker, and saves your
-choice back to `config.json`. In `mcp`/`both` mode the picker is
-suppressed (stdin is owned by the MCP framing channel); set `vlm.model`
-directly in `config.json` for non-interactive use.
+choice back to `config.json`. When `mode="multipass"`, the picker also
+prompts for the optional `model_fast`, `model_actions`, and `model_verify`
+slots (skip any of them to reuse the primary). In `mcp`/`both` mode the
+picker is suppressed (stdin is owned by the MCP framing channel); set the
+model keys directly in `config.json` for non-interactive use.
 
 ---
 

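Review note on the README change: the `multipass` envelope is described as arriving twice, as pretty-printed JSON in the legacy `description` field and already parsed under `vlm_structured`. A client that has to work against both older and newer servers might read it roughly as sketched below. The helper name `read_envelope` and the assumption that the response is a plain dict are illustrative only and are not part of this diff.

```python
import json
from typing import Any, Optional


def read_envelope(response: dict[str, Any]) -> Optional[dict[str, Any]]:
    """Return the structured multipass envelope, or None if there is none.

    Hypothetical helper: assumes `response` is a dict carrying the
    `description` and (optionally) `vlm_structured` fields named in the README.
    """
    # Prefer the new field, which the README says is already parsed.
    structured = response.get("vlm_structured")
    if isinstance(structured, dict):
        return structured

    # Fall back to the legacy field, which may hold JSON text or plain prose.
    description = response.get("description")
    if not isinstance(description, str):
        return None
    try:
        parsed = json.loads(description)
    except json.JSONDecodeError:
        return None  # single mode: free-form text, not an envelope
    return parsed if isinstance(parsed, dict) else None
```

Callers that only need free-form text can keep reading `description` unchanged; the helper returns `None` for a `single`-mode prose response, so the two paths stay easy to tell apart.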
diff --git a/ascii_renderer.py b/ascii_renderer.py
@@ -467,16 +467,23 @@ def _phash(crop: "Image.Image") -> str:
 
 def _vlm_describe_crop(crop: "Image.Image", vlm_cfg: dict) -> str:
     """Single-line natural-language description from an OpenWebUI-compatible
-    chat-completions endpoint. Returns '' on any failure."""
-    if not vlm_cfg or not vlm_cfg.get("enabled") or not vlm_cfg.get("model"):
+    chat-completions endpoint. Returns '' on any failure.
+
+    Prefers ``vlm.model_fast`` when set (a small/cheap VLM is plenty for
+    per-widget labelling); falls back to the primary ``vlm.model``.
+    """
+    if not vlm_cfg or not vlm_cfg.get("enabled"):
+        return ""
+    model = vlm_cfg.get("model_fast") or vlm_cfg.get("model")
+    if not model:
         return ""
     try:
         import base64 as _b64
         buf = io.BytesIO()
         crop.save(buf, format="PNG")
         b64 = _b64.b64encode(buf.getvalue()).decode()
         payload = {
-            "model": vlm_cfg["model"],
+            "model": model,
             "max_tokens": 60,
             "messages": [{
                 "role": "user",

diff --git a/config.json.example b/config.json.example
@@ -27,14 +27,44 @@
     "backend": "tesseract"
   },
 
-  "_vlm": "Vision-LLM modality reached through any OpenAI-compatible chat-completions endpoint. Common values for 'base_url': 'http://localhost:3000' (OpenWebUI), 'http://localhost:11434' (Ollama direct — use a vision model such as llava or llama3.2-vision). OSScreenObserver probes /api/v1/... first (OpenWebUI convention) then falls back to /v1/... (Ollama/OpenAI convention) automatically. Leave 'enabled' false unless you have a local endpoint configured. 'base_url' should NOT include a trailing path component. 'api_key' may be left null when the endpoint accepts $OWUI_API_KEY from the environment. Pick a 'model' interactively via `python main.py --mode inspect`, or set it here directly. 'prompt' is sent verbatim alongside the screenshot; tune it for your downstream agent.",
+  "_vlm": "Vision-LLM modality reached through any OpenAI-compatible chat-completions endpoint. Common values for 'base_url': 'http://localhost:3000' (OpenWebUI), 'http://localhost:11434' (Ollama direct; use a vision model such as 'qwen2.5vl:7b' for best UI/screen quality, 'llama3.2-vision:11b' for a different family, or 'minicpm-v:8b' for OCR-heavy screens). OSScreenObserver probes /api/v1/... first (OpenWebUI convention) then falls back to /v1/... (Ollama/OpenAI convention) automatically. 'base_url' should NOT include a trailing path component. 'api_key' may be left null when the endpoint accepts $OWUI_API_KEY from the environment. Pick a 'model' interactively via `python main.py --mode inspect`, or set it here directly. Two operating modes: 'single' sends one screenshot + grounded prompt and returns the raw response; 'multipass' runs a three-pass scene→controls→actions pipeline (plus an optional verify pass) and returns a structured JSON envelope. Each ground_with_* flag independently adds one source (accessibility tree, OCR text, ASCII sketch, or focused-element hint) to the prompt as in-context ground truth alongside the screenshot; a source that fails is silently omitted.",
   "vlm": {
     "enabled": true,
     "base_url": "http://localhost:11434",
     "api_key": null,
-    "model": "llama3.2-vision:11b",
-    "max_tokens": 1500,
-    "prompt": "You are analyzing a computer screen for an agentic AI system. Describe what you see in structured detail: (1) What application(s) are visible? (2) What is the main content or task shown? (3) What UI controls are present and in what state? (4) What is the spatial layout? (5) What is the current focus or active element? (6) What actions would be most natural to take next? Be specific and use exact names of buttons, labels, and fields as they appear."
+
+    "model":         "qwen2.5vl:7b",
+    "model_fast":    "qwen2.5vl:3b",
+    "model_actions": null,
+    "model_verify":  null,
+
+    "max_tokens":  2000,
+    "temperature": 0.1,
+    "timeout_s":   60,
+
+    "mode":          "multipass",
+    "output_format": "json",
+
+    "ground_with_tree":   true,
+    "ground_with_ocr":    true,
+    "ground_with_sketch": true,
+    "ground_with_focus":  true,
+
+    "tree_max_lines":   80,
+    "ocr_max_chars":    4000,
+    "sketch_max_chars": 6000,
+    "image_max_dim":    1600,
+    "focused_zoom_pad": 96,
+
+    "_ollama_runner": "How to invoke the Ollama CLI. Leave null and set vlm.enabled=true, then run `python main.py --mode inspect` once — OSScreenObserver will ask you interactively and save the choice here. Common values: [] to skip auto-pull; ['ollama'] for a native install; ['docker', 'exec', 'my_container', 'ollama'] for a Docker-based Ollama. On startup, OSScreenObserver checks which configured models are present in the local Ollama library and pulls any that are missing, printing pull progress to stderr.",
+    "ollama_runner": null,
+
+    "_prompt_legacy": "The legacy 'prompt' key remains honoured as a synonym for 'prompt_single' when mode=='single'. When unset, the built-in defaults in description.py are used; tune them per app/agent.",
+    "prompt_single":   null,
+    "prompt_scene":    null,
+    "prompt_controls": null,
+    "prompt_actions":  null,
+    "prompt_verify":   null
   },
 
   "_ascii_sketch": "Text-sketch renderer (ascii_renderer.py). 'grid_width'/'grid_height' control the output cell dimensions; the renderer projects the window's screen-pixel bounds into that grid. 'unicode_box' uses ┌─┐│└┘ glyphs (set false for plain ASCII +-|). The fidelity toggles are all independently switchable: 'role_glyphs' enables compact [x]/(•)/▼ control representations; 'occlusion_prune' hides siblings fully covered by later-drawn siblings (modals); 'tab_index_badges' writes ①②③ into focusable elements in DFS focus order; 'landmark_headers' bakes role+name into the top border of toolbars / dialogs / status bars; 'vlm_fallback' (off by default) lets the renderer call the VLM endpoint above to label unidentified custom widgets — set true only when you have vlm.enabled=true and accept the network cost.",
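Review note on the new config keys: the per-pass model slots fall back to the primary `model` when null, and the legacy `prompt` key is honoured as a synonym for `prompt_single`. A minimal sketch of that resolution logic follows. The function names, the pass names used as dictionary keys, and the choice that `prompt_single` takes precedence over `prompt` when both are set are assumptions made for illustration; the actual code in `description.py` is not shown in this diff and may differ.

```python
from typing import Optional

# Slot names come from config.json.example. The pass-to-slot mapping follows
# the README comments (Pass 1 -> model_fast, Pass 2 / single -> model,
# Pass 3 -> model_actions, verify -> model_verify).
_MODEL_SLOT = {
    "single": "model",
    "scene": "model_fast",
    "controls": "model",
    "actions": "model_actions",
    "verify": "model_verify",
}


def resolve_model(vlm_cfg: dict, pass_name: str) -> Optional[str]:
    """Pick the model for a pass, reusing the primary when the slot is unset."""
    return vlm_cfg.get(_MODEL_SLOT[pass_name]) or vlm_cfg.get("model")


def resolve_prompt(vlm_cfg: dict, pass_name: str, default: str) -> str:
    """Pick the prompt for a pass, falling back to a built-in default."""
    value = vlm_cfg.get(f"prompt_{pass_name}")
    if not value and pass_name == "single":
        # Assumption: the legacy key is consulted only when prompt_single is unset.
        value = vlm_cfg.get("prompt")
    return value or default
```

With the example config above, `resolve_model(cfg, "actions")` yields the primary `qwen2.5vl:7b` because `model_actions` is null, while `resolve_model(cfg, "scene")` yields `qwen2.5vl:3b`.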