
feat: Sway backend + hybrid NL find + macro/OCR + Hermes skill #24

Open
Stijnman wants to merge 1 commit into agent-sh:main from Stijnman:feat/all-extra-functionalities-v1

Conversation

@Stijnman Stijnman commented Jun 8, 2026

Summary

Closes the highest-ROI gaps identified for making computer-use-linux the definitive production Linux desktop MCP:

Windowing

  • Sway/wlroots backend via swaymsg -t get_tree with SWAYSOCK discovery, container-id focus ([con_id=N] focus), and doctor probe registration (between Hyprland and i3).
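
The container-id focus path could look roughly like the sketch below. It is an illustration only, not the code in this PR: `focus_command` and `focus_container` are hypothetical names, and the sketch assumes `swaymsg` is on `PATH` and picks up `SWAYSOCK` from the environment.

```rust
use std::process::Command;

/// Build the Sway command that focuses a container by its container id.
/// Hypothetical helper; the name is not taken from the PR.
fn focus_command(con_id: i64) -> String {
    format!("[con_id={con_id}] focus")
}

/// Run `swaymsg` with the focus command and report whether it exited successfully.
/// Assumes `swaymsg` is installed and SWAYSOCK is set for the current session.
fn focus_container(con_id: i64) -> std::io::Result<bool> {
    let status = Command::new("swaymsg").arg(focus_command(con_id)).status()?;
    Ok(status.success())
}
```

Keeping the command string in its own function makes the criteria syntax testable without a running Sway session.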

Agent ergonomics

  • find_element — natural-language element discovery returning @eN refs with confidence scoring
  • hybrid_strategy — accessibility-first vs coordinate-fallback recommendation (COMPUTER_USE_LINUX_HYBRID=1)
  • get_clipboard / set_clipboard — wl-clipboard / xclip / xsel
  • start_recording / stop_recording / replay_macro — JSON workflow capture + Hermes skill skeleton export
  • screenshot_debug — element bounding-box highlights + optional tesseract OCR
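
As a rough idea of what confidence scoring for find_element might involve, here is a minimal sketch. The scoring rule (fraction of query words found in the element name) and the 1-based `@eN` numbering are assumptions made for illustration; the PR's actual matcher may work quite differently.

```rust
/// Score how well an element name matches a query: the fraction of query
/// words that appear in the name, case-insensitively. Illustrative only.
fn confidence(query: &str, name: &str) -> f32 {
    let name = name.to_lowercase();
    let words: Vec<String> = query.split_whitespace().map(str::to_lowercase).collect();
    if words.is_empty() {
        return 0.0;
    }
    let hits = words.iter().filter(|w| name.contains(w.as_str())).count();
    hits as f32 / words.len() as f32
}

/// Return the best-scoring `@eN` ref and its score, or None if nothing matches.
/// Refs are numbered from 1 in list order (an assumption of this sketch).
fn find_element(query: &str, names: &[&str]) -> Option<(String, f32)> {
    names
        .iter()
        .enumerate()
        .map(|(i, name)| (format!("@e{}", i + 1), confidence(query, name)))
        .filter(|(_, score)| *score > 0.0)
        .max_by(|a, b| a.1.total_cmp(&b.1))
}
```

For example, querying "save button" against the names "Cancel", "Save", and "Save button" would pick the third element with a score of 1.0.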

Hermes onboarding

  • Expanded skills/computer-use-linux/SKILL.md with the accessibility-first + hybrid decision tree, new tool table, and COMPUTER_USE_LINUX_HYBRID setup.

Test plan

  • cargo test — 110 unit tests pass (including new Sway parser + NL find_element tests)
  • computer-use-linux doctor on KDE/X11 session
  • Manual validation on Sway session with swaymsg available
  • Hermes MCP tool discovery with new tools enabled

Notes

  • Hybrid mode is opt-in via env var to preserve existing accessibility-first defaults.
  • OCR requires tesseract-ocr installed; fails gracefully when absent.
  • Macro replay returns steps for the host to execute (no silent auto-execution).
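
The opt-in described in the first note could be gated along these lines. This is a sketch under one assumption that the PR description does not confirm: only the exact value `1` enables hybrid mode. The function names are hypothetical.

```rust
use std::env;

/// Interpret a raw COMPUTER_USE_LINUX_HYBRID value.
/// Assumption: only "1" (ignoring surrounding whitespace) turns hybrid mode on.
fn hybrid_enabled_from(value: Option<&str>) -> bool {
    matches!(value.map(str::trim), Some("1"))
}

/// Read the environment variable; unset or unreadable values leave the
/// accessibility-first default in place.
fn hybrid_enabled() -> bool {
    hybrid_enabled_from(env::var("COMPUTER_USE_LINUX_HYBRID").ok().as_deref())
}
```

Splitting the parsing from the environment lookup keeps the default-off behavior easy to unit test.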

Add high-ROI agent desktop capabilities on top of the existing AT-SPI
foundation:

- Sway/wlroots window backend via swaymsg (list, focus, doctor probe)
- Natural-language find_element with @eN refs and hybrid_strategy guidance
- Clipboard get/set tools (wl-clipboard, xclip, xsel)
- Macro record/replay with JSON export and Hermes skill skeleton
- screenshot_debug with element bounding-box highlights and optional OCR
- Expanded Hermes skill with accessibility-first + hybrid decision tree

Enable hybrid coordinate fallback with COMPUTER_USE_LINUX_HYBRID=1.
@Stijnman Stijnman requested a review from avifenesh as a code owner June 8, 2026 08:17

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request introduces several new features to the Linux computer-use agent, including a Sway windowing backend, clipboard management, natural-language element finding, a hybrid input strategy recommendation system, macro recording/replay, and visual debugging with OCR and bounding-box highlights. Feedback on these changes highlights a bug in Sway's X11 PID hydration where the internal container ID is incorrectly used instead of the X11 window ID, an incomplete macro recording implementation that fails to capture steps during mutating actions, and a rendering issue in visual debugging where off-screen bounding boxes draw misleading borders along screen edges.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment on lines +141 to +151
fn hydrate_sway_window_pids(windows: &mut [WindowInfo]) {
    for window in windows {
        if window.pid.is_none() {
            if let Some(client_type) = window.client_type.as_deref() {
                if client_type == "x11" {
                    window.pid = sway_x11_window_pid(window.window_id);
                }
            }
        }
    }
}

high

In Sway, window_id in WindowInfo is set to the internal Sway container ID (self.id), not the X11 window ID (self.window). Passing window.window_id to sway_x11_window_pid causes xprop to fail because it expects a valid X11 window ID, not a Sway container ID.

To fix this, we should perform the PID hydration directly inside to_window_info where self.window is still available, and then remove hydrate_sway_window_pids entirely.

fn hydrate_sway_window_pids(_windows: &mut [WindowInfo]) {}

Comment on lines +318 to +320
    wm_class,
    pid: self.pid.and_then(|pid| u32::try_from(pid).ok()),
    bounds,

high

Hydrate the PID directly here using self.window (the actual X11 window ID) when self.pid is not populated by Sway.

Suggested change
-    wm_class,
-    pid: self.pid.and_then(|pid| u32::try_from(pid).ok()),
-    bounds,
+    wm_class,
+    pid: self.pid
+        .and_then(|pid| u32::try_from(pid).ok())
+        .or_else(|| self.window.and_then(sway_x11_window_pid)),
+    bounds,
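
To make the distinction between the two ids concrete, here is a self-contained sketch of the same fallback. `SwayNode` is a trimmed-down, hypothetical stand-in for the PR's tree node, the field types are assumptions, and `lookup` stands in for `sway_x11_window_pid` so the logic can be exercised without `xprop`.

```rust
/// Hypothetical, reduced version of a Sway tree node.
struct SwayNode {
    id: i64,             // Sway container id, used for `[con_id=N] focus`
    pid: Option<i64>,    // PID reported by Sway, when available
    window: Option<u32>, // X11 window id, present only for XWayland clients
}

/// Prefer the PID Sway reports; otherwise resolve it from the X11 window id.
/// The container id (`node.id`) is never passed to the lookup.
fn resolve_pid(node: &SwayNode, lookup: impl Fn(u32) -> Option<u32>) -> Option<u32> {
    node.pid
        .and_then(|pid| u32::try_from(pid).ok())
        .or_else(|| node.window.and_then(&lookup))
}
```

A native Wayland client (no `window`) with no reported PID simply yields `None` rather than triggering a failing lookup.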

Comment thread src/server.rs
Comment on lines +1196 to +1200
fn start_recording(
    &self,
    Parameters(params): Parameters<StartRecordingParams>,
) -> Json<RecordingOutput> {
    let message = self.macro_recorder.start(params.name.clone());

high

The macro_recorder is started and stopped via the start_recording and stop_recording tools, but self.macro_recorder.record_step is never actually called within any of the mutating tool implementations (such as click, type_text, press_key, drag, scroll, set_value, or perform_action).

As a result, any recorded macro will always contain 0 steps. Please ensure that record_step is called with the corresponding tool name and parameters inside each of these mutating tools to enable functional macro recording.
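
A minimal sketch of the wiring being asked for is shown below. The types are hypothetical stand-ins (the real `MacroRecorder`, its step type, and the handler signatures in src/server.rs are not visible in this thread), and parameters are kept as a pre-encoded JSON string to avoid extra dependencies.

```rust
use std::sync::Mutex;

/// One captured tool call. Field names are illustrative.
#[derive(Debug, Clone, PartialEq)]
struct MacroStep {
    tool: String,
    params: String, // JSON-encoded parameters
}

#[derive(Default)]
struct MacroRecorder {
    // None = not recording; Some(steps) = recording in progress.
    steps: Mutex<Option<Vec<MacroStep>>>,
}

impl MacroRecorder {
    fn start(&self) {
        *self.steps.lock().unwrap() = Some(Vec::new());
    }

    /// Append a step; does nothing unless a recording is active.
    fn record_step(&self, tool: &str, params: &str) {
        if let Some(steps) = self.steps.lock().unwrap().as_mut() {
            steps.push(MacroStep { tool: tool.to_owned(), params: params.to_owned() });
        }
    }

    /// End the recording and return the captured steps (empty if none was active).
    fn stop(&self) -> Vec<MacroStep> {
        self.steps.lock().unwrap().take().unwrap_or_default()
    }
}

/// Stand-in for a mutating tool handler such as `click`.
fn click(recorder: &MacroRecorder, x: i32, y: i32) {
    // ... perform the actual click here ...
    recorder.record_step("click", &format!(r#"{{"x":{x},"y":{y}}}"#));
}
```

Because `record_step` is a no-op when no recording is active, each mutating handler can call it unconditionally.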

Comment thread src/visual_debug.rs
Comment on lines +71 to +97
fn draw_hollow_rect(image: &mut RgbaImage, x: i32, y: i32, width: u32, height: u32, color: Rgba<u8>) {
    let image_width = image.width() as i32;
    let image_height = image.height() as i32;
    let left = x.max(0);
    let top = y.max(0);
    let right = (x + width as i32).min(image_width);
    let bottom = (y + height as i32).min(image_height);
    if left >= right || top >= bottom {
        return;
    }
    for px in left..right {
        if top < image_height {
            image.put_pixel(px as u32, top as u32, color);
        }
        if bottom - 1 < image_height {
            image.put_pixel(px as u32, (bottom - 1) as u32, color);
        }
    }
    for py in top..bottom {
        if left < image_width {
            image.put_pixel(left as u32, py as u32, color);
        }
        if right - 1 < image_width {
            image.put_pixel((right - 1) as u32, py as u32, color);
        }
    }
}

medium

The current implementation of draw_hollow_rect draws borders along the screen edges (left = x.max(0), top = y.max(0)) when a bounding box is partially off-screen. This creates misleading visual borders that do not correspond to the actual element boundaries.

We can implement proper clipping by only drawing the pixels of the rectangle's borders that actually fall within the image dimensions.

fn draw_hollow_rect(image: &mut RgbaImage, x: i32, y: i32, width: u32, height: u32, color: Rgba<u8>) {
    let iw = image.width() as i32;
    let ih = image.height() as i32;
    let x2 = x + width as i32 - 1;
    let y2 = y + height as i32 - 1;

    for px in x..=x2 {
        if px >= 0 && px < iw {
            if y >= 0 && y < ih {
                image.put_pixel(px as u32, y as u32, color);
            }
            if y2 >= 0 && y2 < ih {
                image.put_pixel(px as u32, y2 as u32, color);
            }
        }
    }

    for py in y..=y2 {
        if py >= 0 && py < ih {
            if x >= 0 && x < iw {
                image.put_pixel(x as u32, py as u32, color);
            }
            if x2 >= 0 && x2 < iw {
                image.put_pixel(x2 as u32, py as u32, color);
            }
        }
    }
}
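
The effect of per-pixel clipping can be checked without the `image` crate. The sketch below applies the same bounds test but collects coordinates instead of drawing; `clipped_border` is a hypothetical helper written for this illustration.

```rust
/// Collect the border pixels of a hollow rectangle that fall inside an
/// `iw` x `ih` image, clipping each pixel individually.
fn clipped_border(x: i32, y: i32, width: u32, height: u32, iw: i32, ih: i32) -> Vec<(i32, i32)> {
    let (x2, y2) = (x + width as i32 - 1, y + height as i32 - 1);
    let inside = |px: i32, py: i32| px >= 0 && px < iw && py >= 0 && py < ih;
    let mut pixels = Vec::new();
    // Top and bottom edges.
    for px in x..=x2 {
        for py in [y, y2] {
            if inside(px, py) {
                pixels.push((px, py));
            }
        }
    }
    // Left and right edges.
    for py in y..=y2 {
        for px in [x, x2] {
            if inside(px, py) {
                pixels.push((px, py));
            }
        }
    }
    pixels.sort();
    pixels.dedup();
    pixels
}
```

For a 4x4 box at (-2, -2) on a 4x4 image, only the visible parts of the bottom and right edges are returned, and nothing is placed along row 0 or column 0 beyond those edges, which is where the clamped version would have drawn.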

@avifenesh avifenesh left a comment

Collaborator


This is an auto review done by revuto.

Comment thread src/server.rs
    &self,
    Parameters(params): Parameters<StartRecordingParams>,
) -> Json<RecordingOutput> {
    let message = self.macro_recorder.start(params.name.clone());

Collaborator


This is an auto review done by revuto.


start_recording flips the recorder on, but none of the mutating tool handlers call self.macro_recorder.record_step(...) before returning (a search for record_step only finds the method definition). As a result stop_recording will always report an empty steps array, so the new macro/replay feature advertised by these tools cannot capture any workflow.

@avifenesh

Collaborator

@Stijnman Hi :)
Thanks for the PR!
Could you please ensure that CI passes and you address all the reviews so I can have a more focused scope to review?
