
Blitzer's Blog >> 70941

Posted: 05 Jun 2026 15:52 [ permalink ]
[SYSTEM STATE RESTORE: HAUKI-GEM LABS X11 AUTOMATION]

Context: I am Carlos. We are developing an AI-Native X11 Robotic Process
Automation (RPA) suite on a Debian 12 LXC container (hauki-obs). The goal is
to create a fully lawful, human-in-the-loop accessibility proxy that drives a
real GUI (Chromium, QEMU, terminal) via X11 to assist users with web searching
and OS operation.

Current Infrastructure:

The Daemon: A FastAPI Python server running on port 7711 inside the X11
TigerVNC session.

Telemetry & Actuation: The API uses wmctrl and xdotool to read mouse
coordinates and window sizes, warp the mouse, click, and inject keystrokes.
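
As a rough illustration of the actuation layer described above, here is a minimal sketch of how the daemon might build and parse xdotool calls. The function names (click_argv, parse_mouse_location, read_mouse) are hypothetical and not taken from the actual API; only the xdotool subcommands themselves (mousemove, click, getmouselocation --shell) are real.

```python
import subprocess


def click_argv(x: int, y: int, button: int = 1) -> list[str]:
    """Build one chained xdotool call: warp the pointer, then click."""
    if x < 0 or y < 0:
        raise ValueError("screen coordinates must be non-negative")
    return ["xdotool", "mousemove", str(x), str(y), "click", str(button)]


def parse_mouse_location(shell_output: str) -> dict[str, int]:
    """Parse the KEY=value lines printed by `xdotool getmouselocation --shell`."""
    result = {}
    for line in shell_output.splitlines():
        key, sep, value = line.partition("=")
        # Keep only well-formed integer fields (X, Y, SCREEN, WINDOW).
        if sep and value.strip().lstrip("-").isdigit():
            result[key.strip()] = int(value)
    return result


def read_mouse() -> dict[str, int]:
    """Query the live pointer position (needs a running X11 session)."""
    out = subprocess.run(
        ["xdotool", "getmouselocation", "--shell"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_mouse_location(out)
```

Passing argument lists to subprocess.run instead of a shell string keeps injected text from being interpreted by a shell, which matters once the payloads come from a language model.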

Vision Engine: The API uses scrot to capture screenshots in under 50 ms,
caches them ephemerally, and serves them via /media/.

The Brain: The API uses OpenCV to perform pixel-perfect template matching
(finding an icon on screen and returning X/Y coordinates) and Tesseract OCR to
extract text from bounding boxes. We also have a "Tactical Grid" endpoint that
overlays absolute coordinates on the screen for easy human mapping.
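
The coordinate side of a "Tactical Grid" overlay can be sketched in a few lines. This is only an assumption about how such an endpoint might compute its labels; the function name grid_labels and the 100-pixel default step are made up for illustration, and the actual drawing onto the screenshot is left out.

```python
def grid_labels(width: int, height: int, step: int = 100) -> list[tuple[int, int, str]]:
    """Return (x, y, label) for every grid intersection, in absolute screen pixels."""
    if step <= 0:
        raise ValueError("step must be positive")
    return [
        (x, y, f"{x},{y}")
        for y in range(0, height, step)   # rows, top to bottom
        for x in range(0, width, step)    # columns, left to right
    ]
```

Each tuple gives the point where a label would be drawn and the text to draw there, so a human can read a click target straight off the overlay.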

The Bridge: We use x-console (a tmux orchestration wrapper) to link IRC/#ops
telemetry to the X11 API via an injected $ROBO_API environment variable.

Our Next Objectives in this Chat:

Upgrade the API with Spatial OCR (pytesseract.image_to_data) so it returns the
X/Y coordinates of specific strings on the screen, allowing the bot to click
text links without needing image templates.
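
One possible starting point for Objective 1 is sketched below. It assumes the screenshot has already been passed through pytesseract.image_to_data(image, output_type=pytesseract.Output.DICT), which returns parallel lists under keys such as "text", "conf", "left", "top", "width", "height", "block_num", "par_num" and "line_num". The helper name find_phrase, the returned dictionary layout, and the confidence threshold of 60 are assumptions for illustration, not part of the existing API.

```python
def find_phrase(data: dict, phrase: str, min_conf: float = 60.0) -> list[dict]:
    """Locate `phrase` in image_to_data output and return click points.

    `data` is the dict produced by pytesseract with Output.DICT.
    Matching is case-insensitive and limited to words on one OCR line.
    """
    target = phrase.lower().split()
    # Rows with empty text are layout entries (page, block, line), not words.
    words = [i for i, text in enumerate(data["text"]) if text.strip()]
    hits = []
    for start in range(len(words) - len(target) + 1):
        idxs = words[start:start + len(target)]
        lines = {(data["block_num"][i], data["par_num"][i], data["line_num"][i])
                 for i in idxs}
        if len(lines) != 1:
            continue  # phrase would span two lines (or the phrase is empty)
        if [data["text"][i].strip().lower() for i in idxs] != target:
            continue
        if any(float(data["conf"][i]) < min_conf for i in idxs):
            continue  # skip low-confidence recognitions
        left = min(data["left"][i] for i in idxs)
        top = min(data["top"][i] for i in idxs)
        right = max(data["left"][i] + data["width"][i] for i in idxs)
        bottom = max(data["top"][i] + data["height"][i] for i in idxs)
        hits.append({
            "x": (left + right) // 2,   # centre of the phrase, screen pixels
            "y": (top + bottom) // 2,
            "box": [left, top, right - left, bottom - top],
        })
    return hits
```

Returning every match rather than the first one leaves the choice between duplicate strings (for example two "Sign in" links) to the caller or to the human operator. If the OCR runs on a cropped region, the crop offset still has to be added to x and y before clicking.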

Design a "Teaching" script to let a human record a workflow on the X11
desktop, which the bot converts into a reusable JSON macro.
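
The conversion step of such a "Teaching" script could look roughly like this. The event format (a list of dicts with a timestamp "t" in seconds plus action fields) and the macro schema (name, version, steps with wait_ms) are invented here for illustration; the real recorder may capture something quite different.

```python
import json


def events_to_macro(name: str, events: list[dict], max_wait_ms: int = 5000) -> str:
    """Turn timestamped recorder events into a replayable JSON macro."""
    steps = []
    prev_t = None
    for ev in sorted(events, key=lambda e: e["t"]):
        # Replace absolute timestamps with the pause before each step.
        wait_ms = 0 if prev_t is None else int(round((ev["t"] - prev_t) * 1000))
        prev_t = ev["t"]
        step = {k: v for k, v in ev.items() if k != "t"}
        # Cap long pauses so a coffee break is not replayed verbatim.
        step["wait_ms"] = min(wait_ms, max_wait_ms)
        steps.append(step)
    return json.dumps({"name": name, "version": 1, "steps": steps}, indent=2)
```

Storing relative waits instead of wall-clock times keeps the macro reusable, and the version field leaves room to change the schema later.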

Implement a Local SLM Orchestrator (e.g., Ollama) to translate natural
language/voice commands into API payload sequences.
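
Whatever model ends up doing the translation, its output should probably be validated before it reaches the X11 API. The sketch below assumes the model is asked to reply with a JSON list of steps; the action names (click, type, click_text) and their argument schemas are hypothetical placeholders for whatever the real endpoints accept.

```python
import json

# Hypothetical allowlist: action name -> required argument names and types.
ALLOWED_ACTIONS = {
    "click": {"x": int, "y": int},
    "type": {"text": str},
    "click_text": {"phrase": str},
}


def parse_plan(model_output: str) -> list[dict]:
    """Parse and validate a model reply; raise ValueError on anything unexpected."""
    plan = json.loads(model_output)  # JSONDecodeError is a ValueError subclass
    if not isinstance(plan, list):
        raise ValueError("plan must be a JSON list of steps")
    for step in plan:
        if not isinstance(step, dict):
            raise ValueError("each step must be a JSON object")
        action = step.get("action")
        if action not in ALLOWED_ACTIONS:
            raise ValueError(f"unknown action: {action!r}")
        schema = ALLOWED_ACTIONS[action]
        args = {k: v for k, v in step.items() if k != "action"}
        if set(args) != set(schema):
            raise ValueError(f"wrong arguments for {action}: {sorted(args)}")
        for key, expected in schema.items():
            # Exact type check so that True/False is not accepted as an int.
            if type(args[key]) is not expected:
                raise ValueError(f"{action}.{key} must be {expected.__name__}")
    return plan
```

Rejecting the whole plan on the first bad step is a deliberately strict choice: a small local model will sometimes invent actions, and a partial plan is harder for the human operator to reason about than a clean failure.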

Build a TOS-compliant, human-in-the-loop web searcher. The bot physically
drives the Chromium browser, reads results via OCR, and pauses for human
approval before clicking links.
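
The approval pause can be modelled as a small gate around the executor. This is a sketch under assumptions: the step format matches the hypothetical action dicts used elsewhere in these notes, and execute and approve are callbacks supplied by the caller (for example an HTTP call to the API and a prompt on IRC/#ops).

```python
# Actions that change what the browser does and therefore need a human "yes".
NEEDS_APPROVAL = {"click", "click_text"}


def run_with_approval(steps, execute, approve):
    """Run steps in order, asking `approve(step)` before any gated action.

    Returns a log of ("done" | "skipped", step) tuples.
    """
    log = []
    for step in steps:
        if step["action"] in NEEDS_APPROVAL and not approve(step):
            log.append(("skipped", step))  # human declined: do nothing
            continue
        execute(step)
        log.append(("done", step))
    return log
```

Keeping the gate outside the executor means the same check applies whether the steps come from a recorded macro or from the language model.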

Please acknowledge you understand the architecture and let me know how we
should begin Objective 1 (Spatial OCR).