Download Game! Currently 106 players and visitors. Last logged in:EronkBylethSabaothLmt

Blitzer's Blog >> 70996

Back to blogs index
Posted: 11 Jun 2026 11:19 [ permalink ]
This is exactly how human cognition works.

We don't read every single letter on a webpage simultaneously. We first
perceive the gross structural layout (the "blobs" and the containers),
identify the areas of interest based on our goal, and *then* focus our foveal
vision (the high-resolution, text-reading part of our eye) exclusively on
those specific boxes.

You are describing a **Two-Pass Cognitive OCR Pipeline**.

1. **Pass 1 (The Gross Structure):** The `engine_hybrid.py` runs instantly and
maps the UI into physical containers (`bx, by, bw, bh`). It says, "There is a
vertical stack of 10 identical boxes on the left."
2. **Pass 2 (The Foveal Focus):** The agent says, "I am looking for 'Saarinen
Oy'. I will run the OCR engine *only* inside those 10 specific bounding
boxes."

This is incredibly efficient. Running Tesseract OCR on a full $1410 \times
1170$ high-resolution desktop screenshot is computationally expensive and
slow. Running it on ten $250 \times 80$ pixel slices is lightning fast.

### Building the Foveal Focus (Semantic OCR)

We already have the Daemon API endpoint for OCR: `http://127.0.0.1:7711/api/vis
ion/ocr`. It accepts `x, y, w, h` parameters.

Let's write the integration script. This script will:

1. Run the Hybrid Engine to find the boxes.
2. Loop through the boxes that look like "Lead Cards" (based on their
width/height ratio).
3. Send *only* those boxes to the OCR API.
4. Return a structured JSON of what the agent "sees."

Execute this block:

```
[GEM-TERMINAL-START]
cat << 'EOF' > /opt/hauki-gem/engine_semantic.py
#!/opt/hauki-gem/venv/bin/python3
import cv2
import numpy as np
import requests
import json
import sys

API_URL = "[http://127.0.0.1:7711/api](http://127.0.0.1:7711/api)"
BASE_URL = "[http://127.0.0.1:7711](http://127.0.0.1:7711)"

def get_semantic_map(x, y, w, h):
    # 1. Fetch Image
    try:
        payload = {"x": x, "y": y, "w": w, "h": h}
        res = requests.post(f"{API_URL}/vision/screenshot", json=payload)
        if res.status_code != 200: return {"error": res.text}
        
        image_url = res.json().get("url")
        img_res = requests.get(f"{BASE_URL}{image_url}")
        img_array = np.asarray(bytearray(img_res.content), dtype=np.uint8)
        img = cv2.imdecode(img_array, cv2.IMREAD_GRAYSCALE)
    except Exception as e:
        return {"error": str(e)}

    # 2. Hybrid Density Mapping
    chunk_w, chunk_h = int(w / 120), int(h / 60)
    density_grid = np.zeros((60, 120), dtype=np.uint8)
    
    for row in range(60):
        for col in range(120):
            y_start, y_end = row * chunk_h, min((row + 1) * chunk_h, h)
            x_start, x_end = col * chunk_w, min((col + 1) * chunk_w, w)
            
            chunk = img[y_start:y_end, x_start:x_end]
            if chunk.size > 0 and np.std(chunk) > 10.0:
                density_grid[row, col] = 255

    contours, _ = cv2.findContours(density_grid, cv2.RETR_EXTERNAL,
cv2.CHAIN_APPROX_SIMPLE)
    
    semantic_elements = []
    element_id = 1
    
    # 3. Foveal OCR Focus
    print(f"[*] Hybrid engine found {len(contours)} structural blocks.
Initiating Foveal OCR...")
    for c in contours:
        bx, by, bw, bh = cv2.boundingRect(c)
        # Filter out noise
        if bw >= 2 and bh >= 1:
            # Convert grid coordinates back to real pixel coordinates
            real_x = x + (bx * chunk_w)
            real_y = y + (by * chunk_h)
            real_w = bw * chunk_w
            real_h = bh * chunk_h
            
            # Check if it looks like a sidebar lead card (e.g., wide and
short)
            # Adjust these heuristics based on the CRM layout
            is_likely_card = (real_w > 200 and real_w < 500) and (real_h > 40
and real_h < 150)
            
            # Request OCR for this specific slice
            try:
                ocr_payload = {"x": real_x, "y": real_y, "w": real_w, "h":
real_h}
                ocr_res = requests.post(f"{API_URL}/vision/ocr",
json=ocr_payload).json()
                text = ocr_res.get("text", "").strip()
            except:
                text = ""
            
            if text:
                semantic_elements.append({
                    "id": f"element_{element_id}",
                    "type": "lead_card" if is_likely_card else
"content_block",
                    "bounds": {"x": real_x, "y": real_y, "w": real_w, "h":
real_h},
                    "text_content": text.replace('
', ' | ')
                })
                element_id += 1

    return {"elements": semantic_elements}

if __name__ == "__main__":
    x = int(sys.argv[1]) if len(sys.argv) > 1 else 0
    y = int(sys.argv[2]) if len(sys.argv) > 2 else 0
    w = int(sys.argv[3]) if len(sys.argv) > 3 else 1410
    h = int(sys.argv[4]) if len(sys.argv) > 4 else 1170
    
    result = get_semantic_map(x, y, w, h)
    print(json.dumps(result, indent=2))
EOF

chmod +x /opt/hauki-gem/engine_semantic.py
[GEM-TERMINAL-END]

```

Run this script. It will output a JSON array where every element has exact
coordinates *and* the text inside it.

If this works, you have achieved the holy grail of UI automation: an agent
that can look at an unknown web app, identify the repeating UI components,
read the data inside them, and know exactly where to click if you say "Click
on Saarinen Oy".