Precision-Enhanced AI Desktop Agent Fork
An AI that controls its own computer with enhanced accuracy, smart focus targeting, and advanced computer vision.
Fork of bytebot-ai/bytebot with precision tooling enhancements
This repository contains the Holo 1.5-7B variant of Bytebot Hawkeye. An alternative OmniParser variant is also available:
Computer Vision: Holo 1.5-7B (Qwen2.5-VL-based unified model)
- Highest accuracy (90%+ localization, ScreenSpot-Pro: 57.94)
- Single unified model approach
- Official transformers implementation
- Larger download (~14GB model)
- You are here
Computer Vision: OmniParser v2.0 (YOLOv8 + Florence-2 + OCR ensemble)
- Lighter weight (~850MB total models)
- Modular multi-component pipeline
- Good accuracy (89% clicks, ScreenSpot-Pro: 39.6)
- Faster initial setup
- View the OmniParser variant
Both variants share: Smart Focus System, Set-of-Mark annotations, real-time telemetry, 46-model support, cross-platform containers.
- Choose Holo if you want maximum accuracy and have 14GB+ disk space.
- Choose OmniParser if you prefer faster setup and lighter resource usage.
Hawkeye is a precision-enhanced fork of Bytebot that adds advanced computer vision, smart targeting, and accuracy measurement tools to make AI desktop agents more reliable. While upstream Bytebot provides the foundation, Hawkeye layers on:
- Smart Focus System - 3-stage coarse→focus→click workflow for precise targeting
- Holo 1.5-7B Vision - AI-powered UI element localization (90%+ accuracy)
- Real-time Telemetry - Live accuracy metrics and performance monitoring
- Grid Overlay Guidance - Always-on coordinate scaffolding for spatial reasoning
- 46-Model Support - Vision and text-only models with intelligent adaptation
| Feature | Hawkeye | Upstream |
|---|---|---|
| UI Element Detection | Holo 1.5-7B (90%+ accuracy) + Tesseract.js OCR | Basic screenshot analysis |
| Click Targeting | Smart Focus 3-stage workflow with zoom refinement | Single-shot click reasoning |
| Coordinate Accuracy | Telemetry pipeline with adaptive calibration | No automated measurement |
| Visual Grounding | Set-of-Mark (SOM) numbered annotations (70-85% accuracy) | Raw element IDs (20-30% accuracy) |
| Grid Overlays | Always-on 100px grids with debug mode | No spatial scaffolding |
| Real-time Monitoring | Live CV activity, GPU status, performance metrics | No visibility into detection methods |
| Model Support | 46 models (31 vision + 15 text-only) with auto-adaptation | Limited model support |
| Cross-Platform | Windows/Linux/macOS containers with identical workflows | Linux only |
| UI Theming | Light/dark mode toggle | Single default theme |
| Learning System | Empirical model performance tracking with recommendations | No performance learning |
See docs/FEATURES.md for comprehensive feature documentation.
- Docker (≥20.10) & Docker Compose (≥2.0)
- Git for cloning the repository
- Node.js ≥20.0.0 (local development only, not needed for Docker)
```shell
# Choose your preferred LLM provider(s):
ANTHROPIC_API_KEY=sk-ant-...     # Claude models (console.anthropic.com)
OPENAI_API_KEY=sk-...            # GPT models (platform.openai.com)
GEMINI_API_KEY=...               # Gemini models (aistudio.google.com)
OPENROUTER_API_KEY=sk-or-v1-...  # Multi-model proxy (openrouter.ai)
```

Or use LMStudio for free local models (see LMStudio Setup).
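Since the stack needs at least one provider key (or LMStudio), a minimal preflight check can catch a missing key before startup. This is a convenience sketch, not part of the repo's scripts; it only inspects the environment variable names listed above:

```shell
# Sketch: fail fast if no LLM provider key is configured.
# Uses only the variable names from docker/.env above; not an official script.
has_llm_key() {
  # Succeeds if at least one supported provider key is non-empty.
  [ -n "${ANTHROPIC_API_KEY:-}${OPENAI_API_KEY:-}${GEMINI_API_KEY:-}${OPENROUTER_API_KEY:-}" ]
}

if has_llm_key; then
  echo "LLM provider key found"
else
  echo "No LLM provider key set; add one to docker/.env or use LMStudio" >&2
fi
```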
Holo 1.5-7B provides precision UI localization with GPU acceleration:
| Platform | Performance | Requirements |
|---|---|---|
| NVIDIA GPU | ~1.5-3s/inference | 14GB+ VRAM, nvidia-container-toolkit |
| Apple Silicon | ~2-4s/inference | 16GB+ unified memory (M1-M4) |
| CPU-only | ~15-30s/inference | Automatic fallback to Tesseract.js OCR |
GPU setup guide: docs/GPU_SETUP.md
Note: CPU-only systems automatically disable Holo and use fast Tesseract.js OCR instead.
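The three tiers above boil down to a simple platform probe. The sketch below shows the general shape of such a check (the start scripts' actual logic may differ; the tier names here are illustrative):

```shell
# Illustrative platform probe for the three acceleration tiers above.
# nvidia-smi and uname are standard; the tier labels are this sketch's own.
detect_accel() {
  if command -v nvidia-smi >/dev/null 2>&1 && nvidia-smi >/dev/null 2>&1; then
    echo "nvidia"        # CUDA path: ~1.5-3s per inference
  elif [ "$(uname -s)" = "Darwin" ] && [ "$(uname -m)" = "arm64" ]; then
    echo "apple-silicon" # Metal/MPS path: ~2-4s per inference
  else
    echo "cpu"           # Fallback: Holo disabled, Tesseract.js OCR used instead
  fi
}

detect_accel
```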
```shell
git clone https://github.com/zhound420/bytebot-hawkeye-holo.git
cd bytebot-hawkeye-holo

# At least one LLM provider API key required
cat <<'EOF' > docker/.env
ANTHROPIC_API_KEY=sk-ant-...
OPENAI_API_KEY=sk-...
GEMINI_API_KEY=...
OPENROUTER_API_KEY=sk-or-v1-...
EOF

# Interactive setup (recommended for first-time users)
./scripts/start-stack.sh

# Or use flags for automation:
./scripts/start-stack.sh --os linux              # Linux desktop (default)
./scripts/start-stack.sh --os windows --prebaked # Windows 11 container
./scripts/start-stack.sh --os macos              # macOS container
```

What happens automatically:
- Detects GPU (NVIDIA/Apple Silicon) and configures Holo 1.5-7B accordingly
- Launches all services (agent, UI, desktop, database, LLM proxy)
- Downloads the Holo 1.5-7B model (~14GB, one-time, cached)
- Applies database migrations and verifies health
Access the stack:
- Web UI: http://localhost:9992
- Desktop (noVNC): http://localhost:9990
- Agent API: http://localhost:9991
- LiteLLM Proxy: http://localhost:4000
- Holo 1.5-7B: http://localhost:9989
Stop the stack:
```shell
./scripts/stop-stack.sh
```

Test automation on native Windows or macOS environments:

Windows 11 (Tiny11 pre-baked - 96% faster startup):

```shell
./scripts/start-stack.sh --os windows --prebaked
# Access: http://localhost:8006 (30-60s startup vs 10-15min runtime install)
```

macOS Sonoma/Sequoia:

```shell
./scripts/start-stack.sh --os macos --prebaked
# Access: http://localhost:8006
# One-time setup required: complete Setup Assistant, then run /shared/setup-macos-first-time.sh
```

See: docs/WINDOWS_SETUP.md | docs/MACOS_SETUP.md
Hawkeye supports 46 models across all major providers with intelligent vision/text-only adaptation:
- Claude: Opus 4, Sonnet 4, 4.1, Haiku 4.5, 3.5 Sonnet, 3.7 Sonnet
- GPT: 4o, 4o-mini, 4.1, 5-main, 5-mini, 5-thinking, 5-image (1 & mini)
- Gemini: 2.5 Pro/Flash, 2.0 Flash Thinking, 1.5 Pro/Flash, Computer Use Preview
- OpenRouter: Qwen3-VL series, InternVL3, Kimi-VL, Mistral Small 3.1
- LMStudio: Auto-detected local vision models (LLaVA, Qwen-VL, etc.)
- Reasoning: o3-pro, o3, o3-mini, o1, o1-mini, o1-preview
- OpenRouter: DeepSeek R1 series, Qwen Plus/Turbo/Max, Claude Haiku, Mistral Large, GLM 4.6
Automatic Adaptation:
- Vision models receive screenshots with SOM numbered annotations
- Text-only models receive detailed text descriptions
- System prompts are tailored to each model's capabilities
- Holo enrichment is skipped for text-only models (30-80s saved)
Enhanced Model Picker:
- Vision section (blue theme)
- Text-Only section (amber theme)
- Local Models section (green theme, FREE badge)
- Clear capability indicators
Hawkeye uses a streamlined two-method detection system:

Holo 1.5-7B (primary):
- Model: Hcompany/Holo1.5-7B (Qwen2.5-VL-7B base)
- Accuracy: 90%+ UI element localization
- Method: Direct coordinate prediction via vision-language model
- Size: ~14GB (bfloat16 precision, official transformers implementation)
- Performance: 1.5-3s on NVIDIA GPU, 2-4s on Apple Silicon, 15-30s on CPU
- Platforms: NVIDIA CUDA, Apple Metal (MPS), CPU fallback

Tesseract.js OCR (fallback):
- Purpose: Text extraction when Holo is unavailable or disabled
- Speed: Fast, pure JavaScript
- Use Case: CPU-only systems, text-based UI elements

Set-of-Mark (SOM) annotations:
- Accuracy Boost: 70-85% vs 20-30% with raw element IDs
- Method: Numbered bounding-box annotations on screenshots
- Usage: Vision models reference elements by visible numbers (e.g., "click [5]")
- Backend: Automatic number→coordinate resolution
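The number→coordinate resolution step can be pictured as a simple lookup. The sketch below uses a made-up "id x y" row format for illustration (the real backend stores richer detection data than this):

```shell
# Illustrative SOM resolution: map a referenced element number like "click [5]"
# to stored coordinates. The "id x y" row format here is invented for the sketch.
resolve_som() {
  # $1 = element number, stdin = "id x y" rows; prints "x y" or fails if absent.
  awk -v id="$1" '$1 == id { print $2, $3; found=1 } END { exit !found }'
}

printf '%s\n' "3 412 268" "5 910 540" | resolve_som 5   # prints: 910 540
```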
See: docs/HOLO_SETUP.md for detailed setup and troubleshooting.
Run models locally with zero API costs through automatic LMStudio integration:
```shell
# During stack startup, answer 'y' to the LMStudio prompt
# Or run standalone:
./scripts/setup-lmstudio.sh
```

Auto-Configuration:
- Discovers models from the LMStudio server
- Detects vision support (llava, qwen-vl, cogvlm)
- Generates LiteLLM config entries
- Displays models in the UI with a FREE badge
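Vision detection here is name-based, keying off substrings like the ones listed above. A minimal sketch of that idea (the setup script's actual matcher may differ):

```shell
# Name-based vision detection sketch using the substrings listed above
# (llava, qwen-vl, cogvlm); illustrative, not the script's exact matcher.
is_vision_model() {
  case "$1" in
    *llava*|*qwen*vl*|*cogvlm*) return 0 ;;  # known vision-model name patterns
    *) return 1 ;;
  esac
}

for m in "llava-v1.6-7b" "qwen2-vl-7b-instruct" "mistral-7b-instruct"; do
  if is_vision_model "$m"; then echo "$m: vision"; else echo "$m: text-only"; fi
done
```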
Best Setup: Run LMStudio on a separate GPU-equipped machine to avoid memory contention with Holo 1.5-7B.
See: docs/LMSTUDIO_SETUP.md for deployment strategies.
Hawkeye's signature precision feature - a 3-stage targeting workflow:
1. Coarse - Overview with a 200px grid to identify the region
2. Focus - Zoom into the region with a 25px grid for precise location
3. Click - Final coordinate selection with sub-pixel accuracy
Configuration:
```shell
BYTEBOT_SMART_FOCUS=true               # Enable/disable
BYTEBOT_SMART_FOCUS_MODEL=gpt-4o-mini  # Model for focus reasoning
BYTEBOT_OVERVIEW_GRID=200              # Coarse grid size (px)
BYTEBOT_FOCUSED_GRID=25                # Fine grid size (px)
```

See: docs/SMART_FOCUS_SYSTEM.md for a comprehensive guide.
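Some back-of-envelope arithmetic shows how the default grid sizes narrow the search space. This is purely illustrative; the real workflow is model-driven, not a mechanical cell scan:

```shell
# How the two default grid sizes partition a 1920x1080 screen (illustrative only).
screen_w=1920; screen_h=1080
coarse=200; fine=25    # BYTEBOT_OVERVIEW_GRID / BYTEBOT_FOCUSED_GRID defaults

coarse_cells=$(( (screen_w / coarse) * (screen_h / coarse) ))   # candidate regions
fine_cells=$(( (coarse / fine) * (coarse / fine) ))             # cells per region

echo "coarse pass: $coarse_cells regions of ${coarse}x${coarse}px"
echo "focus pass:  $fine_cells cells of ${fine}x${fine}px inside the chosen region"
```

So a click target is located with two cheap passes over tens of cells each, instead of one pass over the full pixel grid.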
- Live method indicators (Holo 1.5-7B, OCR)
- GPU status (NVIDIA GPU / Apple Silicon / CPU)
- Performance metrics (success rate, avg time, executions)
- Detection history with cache indicators
- Model leaderboard (ranked by success rate)
- Active model performance card
- Task outcome tracking
- Intelligent model recommendations
- Success rate per session
- Weighted offset tracking
- Convergence metrics
- Hotspot visualization
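To make the metrics above concrete, here is a toy aggregation over per-execution records. The "ok|fail latency_ms" row format is invented for this sketch and is not the telemetry pipeline's actual storage format:

```shell
# Toy telemetry rollup: success rate and mean latency from "ok|fail latency_ms"
# rows (format invented for illustration; not Hawkeye's real data model).
summarize() {
  awk '{ n++; ms += $2; if ($1 == "ok") ok++ }
       END { if (n) printf "success=%.0f%% avg=%.0fms runs=%d\n", 100*ok/n, ms/n, n }'
}

printf '%s\n' "ok 1800" "ok 2200" "fail 2600" | summarize
```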
Canvas build errors (Ubuntu/Debian):
```shell
# Install system dependencies BEFORE npm install
sudo apt update
sudo apt install -y pkg-config libcairo2-dev libpango1.0-dev libjpeg-dev libgif-dev librsvg2-dev
```

GPU not detected:
```shell
# Test GPU access
nvidia-smi  # For NVIDIA
# If this fails, the stack runs CPU-only (automatic fallback to Tesseract.js)
```

Holo 1.5-7B model not downloading:
```shell
# The model downloads automatically on first run
./scripts/start-holo.sh   # Apple Silicon
./scripts/start-stack.sh  # x86_64 Docker
```

Services won't start:
```shell
# Check service health
docker compose ps
docker logs bytebot-agent
docker logs bytebot-holo

# Verify database migrations
docker exec bytebot-agent npx prisma migrate status
```

Force clean reinstall:
```shell
./scripts/setup-holo.sh --force
./scripts/stop-stack.sh
./scripts/start-stack.sh
```

See: docs/TROUBLESHOOTING.md for a comprehensive troubleshooting guide.
- GPU Setup - Platform-specific GPU configuration
- Holo 1.5-7B Setup - Model installation and troubleshooting
- Windows Container Setup - Tiny11/Nano11 deployment
- macOS Container Setup - Sonoma/Sequoia deployment
- LMStudio Setup - Local model integration
- Features Overview - Comprehensive Hawkeye enhancements
- Smart Focus System - 3-stage targeting workflow
- Model Tier System - CV enforcement and capabilities
- Coordinate Accuracy - Telemetry and calibration
- Architecture Overview - Package structure and dependencies
- Build Instructions - Local development setup
- API Reference - Coming soon
Hawkeye is a precision-enhanced fork of bytebot-ai/bytebot. We maintain compatibility with upstream and regularly sync updates.
What we add:
- Advanced computer vision (Holo 1.5-7B)
- Smart Focus targeting system
- Real-time telemetry and monitoring
- Extended model support (46 models)
- Cross-platform containers (Windows/macOS)
What we preserve:
- Core agent architecture
- Task API and workflows
- Database schema
- Docker deployment
Upstream Resources:
Apache License 2.0 - Same as upstream Bytebot.
See LICENSE for details.
- Original Project: bytebot-ai/bytebot - Foundation for desktop agent capabilities
- Holo 1.5-7B: Hcompany for the precision UI localization model
- Community: Contributors and testers making Hawkeye more reliable
Built with precision. Forked with purpose.