Download Game! Currently 106 players and visitors. Last logged in:EronkBylethSabaothLmt

Blitzer's Blog >> 70981

Back to blogs index
Posted: 10 Jun 2026 10:49 [ permalink ]
This is exactly how it should look. The `cluster_points` algorithm worked
perfectly. It merged the raw pixel hits into two discrete logical objects:

1. **Element #1 (serial_216 / hand2):** The URL bar or a header hyperlink
(`debian.org/News/`).
2. **Element #2 (serial_217 / xterm):** The actual paragraph text body of the
news feed where the I-beam text-selection cursor appears.

Your architectural instinct here is 100% correct. This is the **Unix
Philosophy applied to Cognitive AI**: *Write programs that do one thing and do
it well. Write programs to work together.* If we try to build a monolithic
"God Agent" that tries to do OCR, cursor sweeping, and layout analysis all at
the exact same time, it will be slow, brittle, and impossible to debug. But by
building these isolated, single-purpose primitives (a Sweeper, a Vision
module, an Actuator), we can orchestrate them from the UAH.

An orchestration script could simply say:

* *"Sweeper, find me all the links."*
* *"Vision, read the text of Link #3."*
* *"Actuator, click Link #3."*

### The Next Primitive: Visual Short-Term Memory (`engine_diff.py`)

If we are following the path of building these single-task performers, the
biggest blind spot the agents currently have is **Time / State Changes**.

Right now, an agent has amnesia. It looks at a frame, clicks, and forgets.

* If it clicks a "Settings" button, a dropdown menu appears. But *where* did
it appear?
* If it clicks a link, a page starts loading. But *when* is it done loading?

We need a **Delta Engine**. A module that takes a frame, waits a fraction of a
second, takes a second frame, and uses OpenCV to subtract them
(`cv2.absdiff`). The resulting pixels are exactly what changed on the screen.

Execute this block to build the Visual Delta Engine. It will allow agents to
instantly detect popups, menus, and loading states without needing OCR:

```
[GEM-TERMINAL-START]
cat << 'EOF' | tr '~' ' ' > /opt/hauki-gem/engine_diff.py
import cv2
import numpy as np
import time
import os
from engine_vision import capture_screen

def get_visual_delta(x, y, w, h, delay=0.5, noise_threshold=25, min_area=25):

def wait_for_ui_to_settle(x, y, w, h, max_wait=10.0, settle_time=0.5):
EOF

# Kirjataan uusi moduuli lokaaliin Git-varastoon
git add engine_diff.py
git commit -m "FEAT: Added Visual Delta Engine for short-term UI memory and
state change detection"

echo "[+] engine_diff.py generated and committed."
[GEM-TERMINAL-END]

```

With this primitive in place, your agents can execute logic like:

1. `api/mouse/click` (Click the application menu)
2. `get_visual_delta()` (Wait 0.5s and see what pixels just appeared)
3. *Agent instantly knows the exact X, Y, W, H of the newly opened dropdown
menu without needing to scan the whole screen again.*

Do you want to patch this into `daemon.py` so it's exposed to the mesh, or
should we write a quick test script to see the Delta Engine in action?