101 changes: 80 additions & 21 deletions README.md
@@ -44,9 +44,11 @@ OSScreenObserver exposes a full REST API on port `5001` (configurable). Most `/a
### Startup modes

```bash
python main.py --mode inspect # HTTP server only (web UI + REST API)
python main.py --mode both # REST API + MCP stdio simultaneously
python main.py --mock --mode inspect # Mock mode with synthetic data (no OS access)
python main.py # Default: auto — TTY → inspect (web UI + interactive setup); piped (Claude Desktop) → both
python main.py --mode both # Force REST API + MCP stdio simultaneously
python main.py --mode inspect # Force web UI only
python main.py --mode mcp # Force MCP stdio only
python main.py --mock # Mock mode with synthetic data (no OS access)
python main.py --mock --scenario scenarios_examples/login.yaml # Scenario-driven mock
```
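The `auto` default described in the comments above (interactive terminal gives `inspect`, piped stdin gives `both`) could be resolved along the following lines. This is only a sketch: `resolve_mode` and `VALID_MODES` are hypothetical names, and both the use of `sys.stdin.isatty()` as the detection signal and the treatment of `auto` as an accepted `--mode` value are assumptions, since the diff does not show how `main.py` actually decides.

```python
import sys
from typing import Optional

# Assumed set of accepted --mode values; "auto" is the documented default.
VALID_MODES = ("auto", "inspect", "mcp", "both")


def resolve_mode(requested: str = "auto",
                 stdin_is_tty: Optional[bool] = None) -> str:
    """Map the requested mode to a concrete one (hypothetical helper)."""
    if requested not in VALID_MODES:
        raise ValueError(f"unknown mode: {requested!r}")
    if requested != "auto":
        return requested  # an explicit --mode always wins
    if stdin_is_tty is None:
        # Assumption: a TTY on stdin means a human launched the process.
        stdin_is_tty = sys.stdin.isatty()
    # Interactive terminal -> web UI only; piped stdin (an MCP client) -> both.
    return "inspect" if stdin_is_tty else "both"
```

Passing `stdin_is_tty` explicitly keeps the decision testable without a real terminal; leaving it as `None` falls back to inspecting the process's own stdin.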

@@ -125,6 +127,31 @@ The REST API endpoints map directly to the `SCREEN_TOOLS` OpenAI/OpenWebUI funct

## Installation

### Quick start — automated launchers

The fastest path is the platform launcher, which detects missing
dependencies (Python, Tesseract, Ollama, wmctrl on Linux), asks before
installing each one, sets up a `.venv/`, installs `requirements.txt`, and
starts the server:

```bash
# Linux
./start.sh

# macOS
./start-mac.sh

# Windows (Command Prompt or PowerShell)
start.bat
```

The scripts use `winget` on Windows, Homebrew on macOS, and the native
package manager on Linux (apt / dnf / pacman / zypper). Skip any prompt
to install manually later; the launcher will still bring up whatever is
already working.

For a manual install, follow the steps below.

### 1. Python environment

```bash
@@ -211,42 +238,74 @@ chat-completions endpoint. Two common setups:
| Setup | `base_url` | notes |
|-------|-----------|-------|
| **OpenWebUI** | `http://localhost:3000` | fronts Ollama, Anthropic, OpenAI, etc. |
| **Ollama direct** | `http://localhost:11434` | use a vision model such as `llava` or `llama3.2-vision` |
| **Ollama direct** | `http://localhost:11434` | use a vision model such as `qwen2.5vl:7b`, `llama3.2-vision`, or `minicpm-v` |
| **OpenAI / LiteLLM / other** | your endpoint URL | standard `/v1` path |

OSScreenObserver automatically probes `/api/v1/models` first (OpenWebUI
convention) and falls back to `/v1/models` (Ollama / OpenAI convention),
so pointing `base_url` straight at Ollama works without any extra
configuration.

```jsonc
// config.json — OpenWebUI example
"vlm": {
"enabled": true,
"base_url": "http://localhost:3000", // OpenWebUI URL
"api_key": null, // or set $OWUI_API_KEY
"model": null, // null → pick interactively on first launch
"max_tokens": 1500
}
```
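The model-list probing described earlier in this hunk (try `/api/v1/models` first, then fall back to `/v1/models`) can be sketched as below. The helper names `probe_models` and `_http_get_json` are hypothetical, as is the injectable `fetch` parameter; the project's real probing code is not part of this diff, so treat this as an illustration of the fallback order only.

```python
import json
import urllib.request

# Probe order taken from the README text: OpenWebUI convention first,
# then the Ollama / OpenAI convention.
MODEL_PATHS = ("/api/v1/models", "/v1/models")


def _http_get_json(url, api_key=None, timeout=5.0):
    """GET a URL and decode the JSON body (hypothetical helper)."""
    headers = {"Authorization": f"Bearer {api_key}"} if api_key else {}
    req = urllib.request.Request(url, headers=headers)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


def probe_models(base_url, api_key=None, fetch=_http_get_json):
    """Return (path, payload) for the first models path that answers."""
    base = base_url.rstrip("/")  # base_url should carry no trailing path
    for path in MODEL_PATHS:
        try:
            return path, fetch(base + path, api_key)
        except (OSError, ValueError):
            # URLError/HTTPError are OSError subclasses; invalid JSON raises
            # ValueError. Either way, move on to the next convention.
            continue
    return None, None
```

Because `fetch` is a parameter, the fallback order can be exercised with a stub instead of a live endpoint.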
The VLM channel has two operating modes:

* **`single`** — one screenshot + one prompt, optionally grounded with the
accessibility tree, OCR text, ASCII sketch, and focused-element hint as
`<X>...</X>` envelopes appended to the prompt. Cheap (one HTTP call) and
back-compatible with prior versions.
* **`multipass`** — a three-pass pipeline (scene → controls → next-actions)
with an optional verify pass. Returns a structured JSON envelope with
`summary`, `app`, `screen_type`, `focused`, `controls`, `next_actions`,
`modal_open`, `sensitive_regions`, and per-pass timings. The envelope
travels in the legacy `description` field as pretty-printed JSON and is
also exposed parsed under the new `vlm_structured` field for callers that
prefer not to re-parse.
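A caller that wants the multipass envelope regardless of which field it arrives in might use something like the following. It is a sketch under assumptions: `structured_envelope` is a hypothetical helper, and the shape of the surrounding result dict (a plain mapping with `description` and `vlm_structured` keys) is inferred from the description above rather than taken from the API.

```python
import json
from typing import Optional


def structured_envelope(result: dict) -> Optional[dict]:
    """Return the parsed multipass envelope, or None for free-form text."""
    parsed = result.get("vlm_structured")
    if isinstance(parsed, dict):
        return parsed  # already parsed, no need to re-parse
    raw = result.get("description")
    if not isinstance(raw, str):
        return None
    try:
        decoded = json.loads(raw)  # legacy field: pretty-printed JSON
    except json.JSONDecodeError:
        return None  # e.g. single mode returning prose
    return decoded if isinstance(decoded, dict) else None
```

Preferring `vlm_structured` and only then falling back to `description` lets the same client code work against both older and newer responses.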

```jsonc
// config.json — Ollama direct example
// config.json — Ollama direct, recommended starting configuration
"vlm": {
"enabled": true,
"base_url": "http://localhost:11434", // Ollama's native API port
"base_url": "http://localhost:11434",
"api_key": null,
"model": "llama3.2-vision:11b", // any vision-capable model pulled in Ollama
"max_tokens": 1500

"model": "qwen2.5vl:7b", // primary (Pass 2 / single-shot)
"model_fast": "qwen2.5vl:3b", // Pass 1 + per-widget crop labels
"model_actions": null, // Pass 3 (no image); falls back to primary
"model_verify": null, // optional second opinion

"mode": "multipass", // or "single" for legacy one-shot
"output_format": "json",
"max_tokens": 2000,
"temperature": 0.1,

"ground_with_tree": true, // inject <ACCESSIBILITY_TREE>
"ground_with_ocr": true, // inject <OCR_TEXT>
"ground_with_sketch": true, // inject <ASCII_SKETCH> with tab badges
"ground_with_focus": true // inject <FOCUSED_ELEMENT>
}
```

**Recommended Ollama models (24 GB RTX 4090, 128 GB RAM):**

| Role | Model | Tag | ~VRAM | Notes |
|---|---|---|---|---|
| Primary, best overall | Qwen2.5-VL 7B | `qwen2.5vl:7b` | ~8 GB | SOTA small open VLM for UI/document tasks; strong at small fonts. |
| Primary, premium | Qwen2.5-VL 32B (Q4_K_M) | `qwen2.5vl:32b` | ~20 GB | Top-tier reasoning; fits 24 GB at Q4; slower per image. |
| Different family (verify) | Llama 3.2 Vision 11B | `llama3.2-vision:11b` | ~9 GB | Good `model_verify` pair for genuine second opinion. |
| OCR-heavy screens | MiniCPM-V 2.6 | `minicpm-v:8b` | ~7 GB | Excellent on dense text and forms. |
| Pass 1 / crop labels | Qwen2.5-VL 3B | `qwen2.5vl:3b` | ~4 GB | Cheap scene tagging; reused for the ASCII renderer's crop labeller. |
| Pass 3 (text-only) | Qwen2.5 14B | `qwen2.5:14b` | ~9 GB | Pass 3 has no image; a text-only LLM is faster than a VLM. |

Set `OLLAMA_KEEP_ALIVE=30m` and `OLLAMA_MAX_LOADED_MODELS=2` so the primary
and fast models stay resident across multipass calls.
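If Ollama is started from a script rather than as a system service, the two variables can be injected into its environment as sketched here. `ollama_env` and `start_ollama_server` are hypothetical helpers; the sketch also assumes the `ollama` binary is on `PATH` and that the server is launched with `ollama serve`. A service-managed or Docker-based install would set the same variables through its own configuration instead.

```python
import os
import subprocess


def ollama_env(keep_alive="30m", max_loaded_models=2, base_env=None):
    """Copy an environment and add the two Ollama settings from the README."""
    env = dict(os.environ if base_env is None else base_env)
    env["OLLAMA_KEEP_ALIVE"] = keep_alive
    env["OLLAMA_MAX_LOADED_MODELS"] = str(max_loaded_models)
    return env


def start_ollama_server():
    """Launch `ollama serve` with those settings (assumes a native install)."""
    return subprocess.Popen(["ollama", "serve"], env=ollama_env())
```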

The first time you run `python main.py --mode inspect` with
`vlm.enabled=true` and `vlm.model=null`, OSScreenObserver fetches the
model list from the endpoint, shows a paginated picker, and saves your
choice back to `config.json`. In `mcp`/`both` mode the picker is
suppressed (stdin is owned by the MCP framing channel); set `vlm.model`
directly in `config.json` for non-interactive use.
choice back to `config.json`. When `mode="multipass"`, the picker also
prompts for the optional `model_fast`, `model_actions`, and `model_verify`
slots (skip any of them to reuse the primary). In `mcp`/`both` mode the
picker is suppressed (stdin is owned by the MCP framing channel); set the
model keys directly in `config.json` for non-interactive use.
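The "skip a slot to reuse the primary" behaviour can be pictured as a small lookup. This sketch mirrors the `vlm_cfg.get("model_fast") or vlm_cfg.get("model")` pattern that the `ascii_renderer.py` change in this PR uses for crop labels; `model_for_pass` and `SLOT_BY_PASS` are hypothetical names, and the pass-to-slot mapping is read off the config comments rather than from the pipeline code.

```python
from typing import Optional

# Pass name -> config key, per the comments in the example config:
# Pass 1 uses model_fast, Pass 2 the primary, Pass 3 model_actions.
SLOT_BY_PASS = {
    "scene": "model_fast",
    "controls": "model",
    "actions": "model_actions",
    "verify": "model_verify",
}


def model_for_pass(vlm_cfg: dict, pass_name: str) -> Optional[str]:
    """Pick the model for a pass, reusing the primary when a slot is unset."""
    slot = SLOT_BY_PASS[pass_name]  # KeyError for an unknown pass name
    return vlm_cfg.get(slot) or vlm_cfg.get("model")
```

With the recommended configuration, the scene pass resolves to `qwen2.5vl:3b`, while the controls, actions, and verify passes all resolve to `qwen2.5vl:7b` because their slots are `null`. Whether the verify pass runs at all is decided elsewhere and is not modelled here.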

---

13 changes: 10 additions & 3 deletions ascii_renderer.py
@@ -467,16 +467,23 @@ def _phash(crop: "Image.Image") -> str:

def _vlm_describe_crop(crop: "Image.Image", vlm_cfg: dict) -> str:
"""Single-line natural-language description from an OpenWebUI-compatible
chat-completions endpoint. Returns '' on any failure."""
if not vlm_cfg or not vlm_cfg.get("enabled") or not vlm_cfg.get("model"):
chat-completions endpoint. Returns '' on any failure.

Prefers ``vlm.model_fast`` when set (a small/cheap VLM is plenty for
per-widget labelling); falls back to the primary ``vlm.model``.
"""
if not vlm_cfg or not vlm_cfg.get("enabled"):
return ""
model = vlm_cfg.get("model_fast") or vlm_cfg.get("model")
if not model:
return ""
try:
import base64 as _b64
buf = io.BytesIO()
crop.save(buf, format="PNG")
b64 = _b64.b64encode(buf.getvalue()).decode()
payload = {
"model": vlm_cfg["model"],
"model": model,
"max_tokens": 60,
"messages": [{
"role": "user",
38 changes: 34 additions & 4 deletions config.json.example
@@ -27,14 +27,44 @@
"backend": "tesseract"
},

"_vlm": "Vision-LLM modality reached through any OpenAI-compatible chat-completions endpoint. Common values for 'base_url': 'http://localhost:3000' (OpenWebUI), 'http://localhost:11434' (Ollama direct — use a vision model such as llava or llama3.2-vision). OSScreenObserver probes /api/v1/... first (OpenWebUI convention) then falls back to /v1/... (Ollama/OpenAI convention) automatically. Leave 'enabled' false unless you have a local endpoint configured. 'base_url' should NOT include a trailing path component. 'api_key' may be left null when the endpoint accepts $OWUI_API_KEY from the environment. Pick a 'model' interactively via `python main.py --mode inspect`, or set it here directly. 'prompt' is sent verbatim alongside the screenshot; tune it for your downstream agent.",
"_vlm": "Vision-LLM modality reached through any OpenAI-compatible chat-completions endpoint. Common values for 'base_url': 'http://localhost:3000' (OpenWebUI), 'http://localhost:11434' (Ollama direct — use a vision model such as 'qwen2.5vl:7b' for best UI/screen quality, 'llama3.2-vision:11b' for a different family, or 'minicpm-v:8b' for OCR-heavy screens). OSScreenObserver probes /api/v1/... first (OpenWebUI convention) then falls back to /v1/... (Ollama/OpenAI convention) automatically. 'base_url' should NOT include a trailing path component. 'api_key' may be left null when the endpoint accepts $OWUI_API_KEY from the environment. Pick a 'model' interactively via `python main.py --mode inspect`, or set it here directly. Two operating modes — 'single' sends one screenshot + grounded prompt and returns the raw response; 'multipass' runs a three-pass scene→controls→actions pipeline (plus an optional verify pass) and returns a structured JSON envelope. The ground_with_* flags include the accessibility tree, OCR text, ASCII sketch, and focused-element hint as in-context ground truth alongside the screenshot, gated independently and silently omitted on failure.",
"vlm": {
"enabled": true,
"base_url": "http://localhost:11434",
"api_key": null,
"model": "llama3.2-vision:11b",
"max_tokens": 1500,
"prompt": "You are analyzing a computer screen for an agentic AI system. Describe what you see in structured detail: (1) What application(s) are visible? (2) What is the main content or task shown? (3) What UI controls are present and in what state? (4) What is the spatial layout? (5) What is the current focus or active element? (6) What actions would be most natural to take next? Be specific and use exact names of buttons, labels, and fields as they appear."

"model": "qwen2.5vl:7b",
"model_fast": "qwen2.5vl:3b",
"model_actions": null,
"model_verify": null,

"max_tokens": 2000,
"temperature": 0.1,
"timeout_s": 60,

"mode": "multipass",
"output_format": "json",

"ground_with_tree": true,
"ground_with_ocr": true,
"ground_with_sketch": true,
"ground_with_focus": true,

"tree_max_lines": 80,
"ocr_max_chars": 4000,
"sketch_max_chars": 6000,
"image_max_dim": 1600,
"focused_zoom_pad": 96,

"_ollama_runner": "How to invoke the Ollama CLI. Leave null and set vlm.enabled=true, then run `python main.py --mode inspect` once — OSScreenObserver will ask you interactively and save the choice here. Common values: [] to skip auto-pull; ['ollama'] for a native install; ['docker', 'exec', 'my_container', 'ollama'] for a Docker-based Ollama. On startup, OSScreenObserver checks which configured models are present in the local Ollama library and pulls any that are missing, printing pull progress to stderr.",
"ollama_runner": null,

"_prompt_legacy": "The legacy 'prompt' key remains honoured as a synonym for 'prompt_single' when mode=='single'. When unset, the built-in defaults in description.py are used; tune them per app/agent.",
"prompt_single": null,
"prompt_scene": null,
"prompt_controls": null,
"prompt_actions": null,
"prompt_verify": null
},

"_ascii_sketch": "Text-sketch renderer (ascii_renderer.py). 'grid_width'/'grid_height' control the output cell dimensions; the renderer projects the window's screen-pixel bounds into that grid. 'unicode_box' uses ┌─┐│└┘ glyphs (set false for plain ASCII +-|). The fidelity toggles are all independently switchable: 'role_glyphs' enables compact [x]/(•)/▼ control representations; 'occlusion_prune' hides siblings fully covered by later-drawn siblings (modals); 'tab_index_badges' writes ①②③ into focusable elements in DFS focus order; 'landmark_headers' bakes role+name into the top border of toolbars / dialogs / status bars; 'vlm_fallback' (off by default) lets the renderer call the VLM endpoint above to label unidentified custom widgets — set true only when you have vlm.enabled=true and accept the network cost.",