
Add opt-in human behavior simulation and OS input path#396

Open
quantsquirrel wants to merge 9 commits into browser-use:main from quantsquirrel:feat/human-behavior-sim


@quantsquirrel quantsquirrel commented May 31, 2026

Summary

Adds an opt-in human-behavior simulation layer for browser-harness:

  • human-like mouse trajectories, timing, tremor, Fitts-law movement, hover/dwell, typing, scrolling, and wait pacing
  • daemon-side batched input dispatch for ~60Hz pointer delivery without per-event IPC overhead
  • default Runtime.enable removal to avoid console-serialization CDP tell, with BH_CDP_ENABLE_RUNTIME=1 escape hatch
  • live self-test probes for T1/T2/rate/isTrusted and validation notes
  • macOS OS-input path (human_move_os, human_click_os) using Quartz CGEvents for rare detection-sensitive clicks
  • Retina viewport fix: use cssLayoutViewport rather than device-pixel layoutViewport
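
As a rough illustration of the Fitts-law movement timing listed above, here is a minimal sketch using the Shannon formulation, MT = a + b * log2(D / W + 1). The function names, the default target width, and the a/b constants are illustrative assumptions, not the values used in agent_helpers.py.

```python
import math


def fitts_movement_time_ms(distance_px, target_width_px=20.0, a_ms=100.0, b_ms=150.0):
    """Movement time from Fitts' law (Shannon form): MT = a + b * log2(D / W + 1).

    a_ms and b_ms are placeholder constants chosen for illustration only.
    """
    if target_width_px <= 0:
        raise ValueError("target_width_px must be positive")
    index_of_difficulty = math.log2(max(distance_px, 0.0) / target_width_px + 1.0)
    return a_ms + b_ms * index_of_difficulty


def step_count(distance_px, step_ms=16.0, **fitts_kwargs):
    """Number of pointer events needed to fill the movement time at ~60Hz."""
    total_ms = fitts_movement_time_ms(distance_px, **fitts_kwargs)
    return max(1, round(total_ms / step_ms))
```

Longer or more precise moves (larger D, smaller W) get more time and therefore more intermediate events, which is the property the trajectory generator relies on.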

Live validation

On local Chrome 148 / macOS:

  • os_calibrate: error_px [0.0, 0]
  • os_selftest(stress=True): repeated coalesced max 2 / 3 / 2, so the OS-input path closes the T1 getCoalescedEvents() tell
  • CDP under the same renderer stress: coalesced max 1, confirming the CDP path remains uncoalesced
  • Active display count was 1; multi-monitor mapping is guarded and unit-tested, but not live-tested on multi-monitor hardware

Tests

Validated on the feature branch and again after merging into latest origin/main in a temporary worktree:

  • python3 tests/unit/test_human_behavior.py: 33/33 passed
  • python3 tests/unit/test_daemon_input_sequence.py: 3/3 passed
  • uv run --with pytest pytest tests/unit/test_admin.py tests/unit/test_helpers.py tests/unit/test_daemon.py tests/unit/test_run.py -q: 81 passed after merge with current origin/main
  • python3 -m py_compile agent-workspace/agent_helpers.py src/browser_harness/daemon.py src/browser_harness/helpers.py

Notes

  • OS-input mode is intentionally opt-in: it foregrounds the browser and moves the physical cursor.
  • os_selftest(stress=True) is the deterministic proof mode. Unstressed Chrome can legitimately report getCoalescedEvents()==1 even for real OS moves.

Summary by cubic

Adds an opt-in human behavior simulation layer to browser-harness with human-like mouse, typing, and scrolling, plus a macOS OS-input path for sensitive clicks. Improves realism, raises the pointer event cadence to ~60Hz via server-side batched dispatch, and omits the CDP Runtime.enable call by default to reduce detectability.

  • New Features

    • Human-like input: Fitts-timed mouse moves with tremor/overshoot, correct keycodes with key-hold, scroll detents.
    • Server-side batched dispatch (~60Hz) with automatic client fallback and resume-on-failure.
    • macOS OS-input mode: human_move_os, human_click_os, os_selftest, os_calibrate for detection-sensitive actions.
    • Built-in self-tests (T1/T2/rate/isTrusted) to validate behavior on your Chrome.
  • Migration

    • Opt-in only; no changes required for existing flows.
    • OS mode needs macOS, pyobjc-framework-Quartz, and Accessibility; it foregrounds the browser and moves the real cursor.
    • CDP Runtime.enable is now omitted by default; set BH_CDP_ENABLE_RUNTIME=1 if you need Runtime events.

Written for commit 306155f. Summary will update on new commits.


quantsquirrel and others added 9 commits May 29, 2026 05:46
Rewrite the human-behavior-simulation layer (agent-workspace/) with
correctness and detection-realism fixes from a multi-lens review.

- typing: default semantic mode now emits correct virtual-key codes
  (_vk_for_char) and a non-zero key hold; no longer routes through the
  core press_key (which emitted 0ms holds + VK_NUMPAD codes for letters)
- tremor: exact OU discretization, dt tied to the real per-event interval,
  amplitude re-calibrated to ~0.8px (inside the 0.3-1.2px human band),
  anisotropic 2:1 axes
- motion: asymmetric ballistic velocity (Beta(2,3)) replacing symmetric
  smoothstep; Fitts' Law movement time (optional target width); overshoot
  + correction on long moves
- scroll: cursor anchored before wheel; discrete detent multiples (wheel)
- idle: bounded cursor drift during human_wait (anchored, <=~15px)
- click: <=1px release micro-drift (clamped in-viewport); teleport
  invariant preserved (press == final move)
- session: cursor/click-bias/tremor-orientation persist across -c calls
  via a per-BU_NAME atomic state file
- hardening: underscore-private config tables/class (no namespace leak),
  narrowed _viewport except, physical-typing dd>=hold constraint
- docs: STALE banners on the two design/review drafts; add
  HUMAN_SIM_VALIDATION.md as the authoritative validation artifact
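
For readers unfamiliar with the "exact OU discretization" mentioned in the tremor item, a minimal sketch follows. It assumes an Ornstein-Uhlenbeck process parameterized by its stationary standard deviation; the names `ou_step` and `tremor_offsets` and the `theta` value are hypothetical, and only the ~0.8px amplitude and 2:1 axis ratio come from the notes above.

```python
import math
import random


def ou_step(x, dt_s, theta=8.0, stationary_std=0.8, rng=random):
    """One exact Ornstein-Uhlenbeck update over an interval of dt_s seconds.

    Unlike an Euler step, this form is valid for any dt, so dt can follow
    the real per-event interval. theta (mean-reversion rate) is an assumption.
    """
    decay = math.exp(-theta * dt_s)
    noise_std = stationary_std * math.sqrt(1.0 - decay * decay)
    return x * decay + noise_std * rng.gauss(0.0, 1.0)


def tremor_offsets(n, dt_s=0.016, major_std=0.8, axis_ratio=2.0, seed=None):
    """Generate n (dx, dy) tremor offsets with anisotropic 2:1 axes."""
    rng = random.Random(seed)
    x = y = 0.0
    offsets = []
    for _ in range(n):
        x = ou_step(x, dt_s, stationary_std=major_std, rng=rng)
        y = ou_step(y, dt_s, stationary_std=major_std / axis_ratio, rng=rng)
        offsets.append((x, y))
    return offsets
```

Because the update is exact, the long-run amplitude stays at `stationary_std` regardless of the event interval, which is why the amplitude does not need re-tuning when the step time changes.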

Known ceilings documented in-module (not fixable in this layer): event
rate ~20-40Hz (per-call IPC), getCoalescedEvents().length==1 / no
PointerEvent stream, CDP-presence detectability.

Adds tests/unit/test_human_behavior.py (17 hermetic tests, no browser).
Reviewed in a separate lane (APPROVE; 0 regressions).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… dispatch + Runtime.enable drop

Tackles the three documented behavioral-detection ceilings with what is
actually fixable (researched against Chromium source + fingerprinting lit),
and documents honestly what is not.

Event RATE (was ~28Hz) — FIXED:
- daemon: new `meta:"input_sequence"` handler dispatches a precomputed event
  list server-side over the persistent CDP WS, sleeping delay_ms before each.
  Decouples the rate from the per-call client<->daemon IPC round-trip.
- helpers: `_send(timeout, raise_on_error)` + `dispatch_input_sequence()`
  (timeout sized to sum(delay) + per-event slack).
- agent_helpers: human_move/human_click/human_scroll now build one event
  batch and emit via `_emit()`; move_step_ms lowered to ~16ms (~60Hz). A
  mid-batch failure resumes the remainder client-side (resume-from-count,
  never re-sends the dispatched prefix — no double-fire); pre-batch daemons
  fall back to the client path automatically.
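
The batch dispatch and resume-from-count behavior described above can be sketched roughly as follows. This is a simplified model, not the daemon code: `dispatch_sequence`, `emit`, and the callback parameters are hypothetical stand-ins, and the real handler speaks CDP over a WebSocket rather than calling a Python callable.

```python
import time


def dispatch_sequence(events, send, sleep=time.sleep):
    """Server-side loop: sleep delay_ms before each event, report how many were sent."""
    count = 0
    try:
        for event in events:
            sleep(event.get("delay_ms", 0) / 1000.0)
            send(event)
            count += 1
    except Exception as exc:  # report partial progress instead of raising
        return {"ok": False, "count": count, "error": str(exc)}
    return {"ok": True, "count": count}


def emit(events, batch_dispatch, send_one, sleep=time.sleep):
    """Try the batch path; on failure, resume the remainder client-side.

    Only events[count:] are re-sent, so the already-dispatched prefix
    is never fired twice.
    """
    result = batch_dispatch(events)
    done = result.get("count", 0)
    if not result.get("ok"):
        for event in events[done:]:
            sleep(event.get("delay_ms", 0) / 1000.0)
            send_one(event)
    return len(events)
```

The key design point is that the failure report carries a count of successfully dispatched events, so the client can pick up exactly where the daemon stopped.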

CDP-presence — mitigated:
- daemon omits Runtime.enable by default (`_enabled_domains()`), removing the
  console-serialization detection class. Nothing consumes Runtime events and
  Runtime.evaluate works without enable; BH_CDP_ENABLE_RUNTIME=1 restores it.
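
A minimal sketch of the env-gated domain list described above, assuming a simple list of CDP domains. The base domain names here are placeholders; only the `BH_CDP_ENABLE_RUNTIME=1` switch and the default omission of Runtime come from this commit.

```python
import os

# Placeholder base set; the daemon's real list may differ.
BASE_DOMAINS = ("Page", "DOM", "Network")


def enabled_domains(env=None):
    """Return the CDP domains to enable; Runtime only when explicitly requested."""
    env = os.environ if env is None else env
    domains = list(BASE_DOMAINS)
    if env.get("BH_CDP_ENABLE_RUNTIME") == "1":
        domains.append("Runtime")
    return domains
```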

Documented as NOT fixable in software (need a patched Chromium):
- getCoalescedEvents() stays empty — CDP injects via ForwardMouseEvent,
  bypassing the compositor coalescing queue (also why we target ~60Hz, not
  higher: extra uncoalesced events look more anomalous).
- screenX==clientX — CDP sets no window/desktop offset (Cloudflare Turnstile
  checks this); not settable via CDP.
Corrected: pressure 0/0.5, tilt 0, pointerType "mouse" are spec-correct for a
real mouse — NOT a bot tell (earlier over-statement removed).

Tests: tests/unit/test_daemon_input_sequence.py (3, hermetic via cdp_use stub)
+ test_human_behavior.py grows to 22 (batch dispatch, ~60Hz rate, single-batch
click invariant, fallback, resume-from-count). All 25 pass; py_compile clean.
Reviewed in a separate lane across two passes (APPROVE; the partial-failure
double-dispatch found in pass 1 is fixed and regression-tested).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…on record

Turns the residual-CDP-tell question from speculation into measurement, and
records the no-fork decision from the 6-lens investigation.

- agent_helpers: human_selftest() instruments the live page (transparent
  full-viewport overlay, clicks swallowed) while driving real human_* input,
  then reports for THIS Chrome: T2 screenX-vs-clientX delta, T1
  getCoalescedEvents length, delivered pointer rate (~60 fast / ~30 fallback),
  isTrusted. chrome_version() reads the major via UA (Runtime.enable-free).
  Both exported; _eval stays private.
- CEILING_DECISIONS.md: research conclusions + decision + phased plan.
  Key findings: T2 (screenX==clientX) is already fixed upstream in Chrome >=142
  (crbug 40280325) — verify, don't assume. T1 (getCoalescedEvents empty) has
  zero confirmed production deployments — theoretical, not shipped. Frida
  (macOS hardened-runtime / keychain break) and a Chromium fork (profile
  conflict + 4-week rebase) are both REJECTED for a personal tool. T1's only
  real fix is a sparingly-used OS-injection (CGEvent) mode, left on the shelf
  until a confirmed target is shown to check it.
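
To make the T1/T2 verdicts concrete, here is a hedged sketch of how such a verdict could be derived from captured pointermove samples. The sample field names and the `selftest_verdict` function are assumptions for illustration; the actual human_selftest() report format is not shown in this PR.

```python
def selftest_verdict(moves):
    """Derive T1/T2/isTrusted verdicts from a list of pointermove samples.

    Each sample is assumed to be a dict with screenX, screenY, clientX,
    clientY, coalesced_len, and isTrusted keys (hypothetical field names).
    """
    if not moves:
        return {"ok": False, "reason": "no pointermove samples captured"}
    max_delta = max(
        max(abs(m["screenX"] - m["clientX"]), abs(m["screenY"] - m["clientY"]))
        for m in moves
    )
    max_coalesced = max(m["coalesced_len"] for m in moves)
    return {
        "ok": True,
        "t2_exposed": max_delta == 0,      # screenX == clientX everywhere
        "t1_exposed": max_coalesced <= 1,  # no event ever carried >1 coalesced entry
        "is_trusted": all(m["isTrusted"] for m in moves),
        "screen_client_delta_px": max_delta,
        "coalesced_len_max": max_coalesced,
    }
```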

Tests: +2 selftest verdict-logic tests (exposed vs fixed-Chrome canned data);
test_human_behavior.py now 24, daemon 3. py_compile clean; load-path verified
(helpers auto-loads, human_selftest/chrome_version exported).

NOTE: selftest verdict logic is unit-tested; its LIVE behavior on real Chrome
(whether CDP fires pointermove on the overlay, getCoalescedEvents values) is
exactly what the tool is meant to reveal on first run — not yet verified here.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Ran human_selftest() against the live machine's real Chrome 148.0.7778.181:
- T2 (screenX==clientX): NOT exposed — screen/client delta 121px (upstream fix live).
- T1 (getCoalescedEvents): EXPOSED — max 1 (CDP coalescing bypass confirmed on 148).
- isTrusted true; delivered pointer rate 41Hz with the server-side fast path
  verified active (dispatch_input_sequence returned {ok,count:2} on a fresh daemon
  against real Chrome — the daemon batch handler works end-to-end).

Confirms the no-fork decision empirically: the only exposed tell (T1) is the one
with zero production deployment. Phases 1/2 not triggered.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…rate, robust click capture

Live runs surfaced two selftest flaws (verdict logic was always correct; these
are diagnostic-quality issues):

- Rate metric swung 19-41Hz run-to-run: (n-1)/span counted the large gaps BETWEEN
  the move/move/click trajectories. Switched to the MEDIAN inter-move interval,
  which reports the true per-event cadence — now a stable ~48-56Hz (server-side
  fast path), matching the verified daemon dispatch.
- clicks_captured was intermittently 0, which read like a bug. Verified via a
  catch-all probe that human_click DOES fire a full, correct event chain
  (pointerdown/mousedown/pointerup/mouseup/click) — it really clicks. The
  selftest's single-press capture can just miss the read window, so: capture
  pointerdown AND mousedown, retry-read briefly, and — decisively — derive the
  T1/T2/rate/isTrusted verdict from the deterministic MOVE stream only. Clicks are
  now labelled best-effort and never gate the verdict.
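
The difference between the two rate metrics can be shown with a small sketch. The function names and the sample timestamps are illustrative; the point is only that the median inter-move interval ignores the pauses between trajectories while (n-1)/span does not.

```python
from statistics import median


def rate_span_hz(timestamps_ms):
    """Old metric: (n - 1) / total span, which is dragged down by pauses."""
    span_s = (timestamps_ms[-1] - timestamps_ms[0]) / 1000.0
    return (len(timestamps_ms) - 1) / span_s


def rate_median_hz(timestamps_ms):
    """New metric: 1 / median interval, i.e. the typical per-event cadence."""
    intervals = [b - a for a, b in zip(timestamps_ms, timestamps_ms[1:])]
    return 1000.0 / median(intervals)


# Two short bursts at a 16ms cadence separated by a 1-second pause.
samples = [0, 16, 32, 48, 1048, 1064, 1080, 1096]
print(rate_span_hz(samples), rate_median_hz(samples))
```

For these samples the span-based figure comes out under 10Hz while the median-based figure is 62.5Hz, which mirrors the instability described in the first bullet.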

Recorded the live Chrome-148 measurement update in CEILING_DECISIONS.md.
Tests: 24/24 (verdict logic unchanged — canned signal lives in the move stream);
py_compile clean; verified across repeated live runs (T2 OK delta 121px, T1
EXPOSED coalesced<=1, rate ~50Hz, isTrusted true — consistent regardless of click capture).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…o close the T1 coalesced tell

Real Quartz CGEvents traverse the full HID->compositor->renderer pipeline, so the
page sees genuine coalesced pointer events + correct screenX + isTrusted — the one
thing CDP Input.dispatchMouseEvent provably cannot do (it bypasses the compositor
coalescing queue). Opt-in, macOS-only, lazy pyobjc import (core stays pure-stdlib).

API: human_click_os(x,y[,button,app_name]) / human_move_os(x,y) / os_selftest() /
os_calibrate(). Reuses the Fitts/Bezier/tremor trajectory, posting real moves at
~125Hz so Chrome's compositor coalesces them.

Safety (this posts REAL clicks on the live desktop — three-layer guard, never blind):
- frontmost check: foregrounds BH_BROWSER_APP (default "Google Chrome") and refuses
  if the wrong app is frontmost (Brave/Edge users pass app_name / set the env).
- display-bounds check: refuses if the mapped global point is off all displays
  (CGGetActiveDisplayList/CGDisplayBounds; handles negative multi-monitor origins).
- cursor-arrival check: posts the move, reads back the real cursor, refuses if it
  didn't reach the target (= Accessibility not granted) instead of a silent no-op.
- click uses kCGMouseEventClickState=1 (else clickState 0 may not register / exposes
  MouseEvent.detail===0).
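
The display-bounds guard can be sketched without Quartz by treating each display as an (x, y, width, height) rectangle in global coordinates. The function names and the example layout below are hypothetical; in the real path the rectangles would come from CGGetActiveDisplayList/CGDisplayBounds.

```python
def on_any_display(x, y, displays):
    """True if the global point (x, y) lies inside at least one display rect.

    displays is a list of (origin_x, origin_y, width, height) tuples;
    origins may be negative for monitors placed left of or above the primary.
    """
    return any(
        dx <= x < dx + dw and dy <= y < dy + dh
        for dx, dy, dw, dh in displays
    )


def guard_target(x, y, displays):
    """Refuse (raise) instead of posting an event to an off-screen point."""
    if not on_any_display(x, y, displays):
        raise RuntimeError(f"target ({x}, {y}) is off all displays; refusing to click")
    return (x, y)


# Example layout: primary display plus a monitor to the left with a negative origin.
layout = [(0, 0, 1440, 900), (-1920, -180, 1920, 1080)]
```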

Validated: 30 hermetic tests (mocked Quartz: capability, client->screen mapping,
full move+down+up sequence with clickState, off-screen refusal, Accessibility-denied
refusal). os_calibrate() run LIVE returned error_px [0.0, 0] — the client->screen
mapping matches the browser's reported screenX/screenY EXACTLY on the primary display,
so OS clicks land where intended (validated WITHOUT moving the cursor). Two adversarial
review passes (APPROVE).

NOT yet run live: the CGEvent path needs `pip install pyobjc-framework-Quartz` into the
env + Accessibility granted; os_selftest() then measures whether getCoalescedEvents()>1
actually results. Multi-monitor mapping unvalidated (os_calibrate covered primary only).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… OS injection

Running os_selftest() live (pyobjc installed + Accessibility granted) drove real
CGEvent moves and the page reported getCoalescedEvents() max = 2 (>1) with a real
screenX offset (delta 164px): T1 (getCoalescedEvents empty) is GENUINELY CLOSED via
the OS-injection path — the one tell CDP Input.dispatchMouseEvent provably cannot fix.

The live run also exposed a latent Retina bug: _viewport read layoutViewport, which
is DEVICE px (2x at devicePixelRatio 2); CDP/CGEvent coordinates are CSS px, so every
fractional target (e.g. 0.65*w) was 2x too large and mapped off-screen — the OS
display-bounds guard correctly refused. Fixed _viewport to prefer cssLayoutViewport
(CSS px), falling back to layoutViewport for older Chrome. This also corrects _clamp
and cursor-init on any dpr>1 display (affected the CDP path too, latent).
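
A minimal sketch of the viewport fix, assuming the metrics dict has the shape returned by CDP's Page.getLayoutMetrics (where both `cssLayoutViewport` and `layoutViewport` carry `clientWidth`/`clientHeight`). The helper name `css_viewport_size` is hypothetical and not the actual `_viewport` implementation.

```python
def css_viewport_size(metrics):
    """Return (width, height) in CSS px, preferring cssLayoutViewport.

    Falls back to layoutViewport for Chrome versions that do not
    report the css* fields.
    """
    viewport = metrics.get("cssLayoutViewport") or metrics["layoutViewport"]
    return int(viewport["clientWidth"]), int(viewport["clientHeight"])


# Canned Retina-style metrics (devicePixelRatio 2), matching the regression test sizes.
retina = {
    "cssLayoutViewport": {"clientWidth": 1200, "clientHeight": 800},
    "layoutViewport": {"clientWidth": 2400, "clientHeight": 1600},
}
width, height = css_viewport_size(retina)
target_x = 0.65 * width  # 780.0 CSS px, inside the viewport
```

With the old behavior the same fractional target would have been computed from 2400 and landed at 1560, outside the 1200px-wide CSS viewport.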

Tests: +1 Retina regression (test_viewport_uses_css_pixels_on_retina; _FakeQuartz cdp
now returns both cssLayoutViewport=1200x800 and layoutViewport=2400x1600, asserts the
CSS one is used). 31/31 + daemon 3. py_compile clean. os_calibrate live still [0.0,0].

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The live no-stress OS selftest was flaky on Chrome 148: real Quartz moves can legitimately report getCoalescedEvents()==1 when the renderer is idle. The proof now adds transient renderer stress so the compositor path difference is measured deterministically, while the actual OS movement model stays human-scale.

Constraint: Chrome 148 on a single-display Retina Mac produced no-stress OS coalescing inconsistently.
Rejected: Increase OS move rate or burst CGEvents | live repeats stayed flaky and changed the movement model.
Confidence: high
Scope-risk: narrow
Directive: Treat os_selftest(stress=True) as the T1 proof mode; do not claim unstressed Chrome always returns coalesced events.
Tested: python3 tests/unit/test_human_behavior.py; python3 tests/unit/test_daemon_input_sequence.py; uv run --with pytest pytest tests/unit/test_admin.py tests/unit/test_helpers.py tests/unit/test_daemon.py tests/unit/test_run.py -q; python3 -m py_compile agent-workspace/agent_helpers.py src/browser_harness/daemon.py src/browser_harness/helpers.py; live os_calibrate ok error_px [0.0,0]; live os_selftest stress repeats coalesced_len_max 2/3/3; live CDP stress coalesced_len_max 1.
Not-tested: live multi-monitor hardware; current machine exposes one active display only.
The stress-mode coalescing proof can make CGEventPost completion observable before the WindowServer has settled the final cursor location. The OS path now re-posts the final landing point with a short settle loop before treating the run as an Accessibility failure.

Constraint: live os_selftest(stress=True) once read the cursor 31px from target immediately after trajectory completion.
Rejected: Loosen the 4px verification threshold | would hide wrong-monitor or Accessibility failures instead of waiting for real cursor arrival.
Confidence: high
Scope-risk: narrow
Directive: Keep the final-arrival check strict; add settling, not broad tolerance, when WindowServer timing is the issue.
Tested: python3 tests/unit/test_human_behavior.py; python3 tests/unit/test_daemon_input_sequence.py; uv run --with pytest pytest tests/unit/test_admin.py tests/unit/test_helpers.py tests/unit/test_daemon.py tests/unit/test_run.py -q; python3 -m py_compile agent-workspace/agent_helpers.py src/browser_harness/daemon.py src/browser_harness/helpers.py; live os_calibrate ok error_px [0.0,0]; live os_selftest stress repeats coalesced_len_max 2/3/2; live CDP stress coalesced_len_max 1.
Not-tested: live multi-monitor hardware; current machine exposes one active display only.
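
The settle loop from the final commit (re-post the landing point, keep the 4px check strict) might look roughly like the sketch below. The function name, the attempt count, and the pause are assumptions; `post_move` and `read_cursor` stand in for the Quartz calls so the logic can run without a desktop.

```python
import math
import time


def settle_on_target(target, post_move, read_cursor,
                     tolerance_px=4.0, attempts=5, pause_s=0.02, sleep=time.sleep):
    """Re-post the final point until the real cursor is within tolerance.

    Returns True once the cursor has arrived, False if it never does
    (which the caller can treat as an Accessibility or mapping failure).
    """
    def arrived():
        cx, cy = read_cursor()
        return math.hypot(cx - target[0], cy - target[1]) <= tolerance_px

    for _ in range(attempts):
        if arrived():
            return True
        post_move(target)  # nudge the cursor to the landing point again
        sleep(pause_s)     # give the WindowServer time to settle
    return arrived()
```

The tolerance stays at the strict threshold; only the waiting is added, so a wrong-monitor mapping or a missing permission still fails the check.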
Contributor

@cubic-dev-ai cubic-dev-ai Bot left a comment


No issues found across 10 files

