feat: Sway backend + hybrid NL find + macro/OCR + Hermes skill #24
Stijnman wants to merge 1 commit
Add high-ROI agent desktop capabilities on top of the existing AT-SPI foundation:
- Sway/wlroots window backend via swaymsg (list, focus, doctor probe)
- Natural-language find_element with @eN refs and hybrid_strategy guidance
- Clipboard get/set tools (wl-clipboard, xclip, xsel)
- Macro record/replay with JSON export and Hermes skill skeleton
- screenshot_debug with element bounding-box highlights and optional OCR
- Expanded Hermes skill with accessibility-first + hybrid decision tree

Enable hybrid coordinate fallback with COMPUTER_USE_LINUX_HYBRID=1.
Code Review
This pull request introduces several new features to the Linux computer-use agent, including a Sway windowing backend, clipboard management, natural-language element finding, a hybrid input strategy recommendation system, macro recording/replay, and visual debugging with OCR and bounding-box highlights. Feedback on these changes highlights a bug in Sway's X11 PID hydration where the internal container ID is incorrectly used instead of the X11 window ID, an incomplete macro recording implementation that fails to capture steps during mutating actions, and a rendering issue in visual debugging where off-screen bounding boxes draw misleading borders along screen edges.
fn hydrate_sway_window_pids(windows: &mut [WindowInfo]) {
    for window in windows {
        if window.pid.is_none() {
            if let Some(client_type) = window.client_type.as_deref() {
                if client_type == "x11" {
                    window.pid = sway_x11_window_pid(window.window_id);
                }
            }
        }
    }
}
In Sway, window_id in WindowInfo is set to the internal Sway container ID (self.id), not the X11 window ID (self.window). Passing window.window_id to sway_x11_window_pid causes xprop to fail because it expects a valid X11 window ID, not a Sway container ID.
To fix this, we should perform the PID hydration directly inside to_window_info where self.window is still available, and then remove hydrate_sway_window_pids entirely.
fn hydrate_sway_window_pids(_windows: &mut [WindowInfo]) {}

wm_class,
pid: self.pid.and_then(|pid| u32::try_from(pid).ok()),
bounds,
Hydrate the PID directly here using self.window (the actual X11 window ID) when self.pid is not populated by Sway.
  wm_class,
- pid: self.pid.and_then(|pid| u32::try_from(pid).ok()),
+ pid: self.pid
+     .and_then(|pid| u32::try_from(pid).ok())
+     .or_else(|| self.window.and_then(sway_x11_window_pid)),
  bounds,
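The fallback order in this suggestion can be checked in isolation. The following is a minimal sketch, not the PR's code: `SwayNode` is a hypothetical, trimmed-down stand-in for the tree node type, and `x11_lookup` stands in for `sway_x11_window_pid`, whose real signature is not shown in this thread and is assumed here to be `u32 -> Option<u32>`.

```rust
/// Hypothetical, trimmed-down stand-in for the Sway tree node in the PR.
#[allow(dead_code)]
struct SwayNode {
    id: i64,             // Sway container ID (con_id), not an X11 window ID
    window: Option<u32>, // X11 window ID, present only for Xwayland clients
    pid: Option<i64>,    // PID as reported by Sway, when available
}

/// Resolve a PID, preferring the value Sway reports and falling back to an
/// X11 lookup keyed by the X11 window ID (never by the container ID).
/// `x11_lookup` is an assumed stand-in for `sway_x11_window_pid`.
fn resolve_pid(node: &SwayNode, x11_lookup: impl Fn(u32) -> Option<u32>) -> Option<u32> {
    node.pid
        // Reject values that do not fit in u32 (for example, negatives).
        .and_then(|pid| u32::try_from(pid).ok())
        // Only consult X11 when Sway gave no usable PID.
        .or_else(|| node.window.and_then(x11_lookup))
}
```

Because the lookup is only reached through `node.window`, native Wayland clients (which have no X11 window ID) never trigger an `xprop`-style call, and the container ID cannot be passed by mistake.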
fn start_recording(
    &self,
    Parameters(params): Parameters<StartRecordingParams>,
) -> Json<RecordingOutput> {
    let message = self.macro_recorder.start(params.name.clone());
The macro_recorder is started and stopped via the start_recording and stop_recording tools, but self.macro_recorder.record_step is never actually called within any of the mutating tool implementations (such as click, type_text, press_key, drag, scroll, set_value, or perform_action).
As a result, any recorded macro will always contain 0 steps. Please ensure that record_step is called with the corresponding tool name and parameters inside each of these mutating tools to enable functional macro recording.
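To illustrate the missing wiring, here is a minimal sketch of a recorder whose `record_step` is called from a mutating handler. Everything in it is an assumption made for illustration: the PR's actual `MacroRecorder` fields, the `record_step` signature, and the step format are not shown in this thread, and parameters are kept as a plain string rather than a JSON value.

```rust
use std::sync::Mutex;

/// One captured tool invocation (hypothetical shape).
#[derive(Debug, Clone, PartialEq)]
struct MacroStep {
    tool: String,
    params: String,
}

/// Hypothetical recorder: `None` means no recording is active.
#[derive(Default)]
struct MacroRecorder {
    active: Mutex<Option<Vec<MacroStep>>>,
}

impl MacroRecorder {
    /// Begin a new recording, discarding any unfinished one.
    fn start(&self) {
        *self.active.lock().unwrap() = Some(Vec::new());
    }

    /// Append a step; a no-op unless a recording is in progress.
    fn record_step(&self, tool: &str, params: &str) {
        if let Some(steps) = self.active.lock().unwrap().as_mut() {
            steps.push(MacroStep { tool: tool.to_string(), params: params.to_string() });
        }
    }

    /// End the recording and return its steps (empty if none was active).
    fn stop(&self) -> Vec<MacroStep> {
        self.active.lock().unwrap().take().unwrap_or_default()
    }
}

/// Stand-in for a mutating tool handler such as `click`.
fn click(recorder: &MacroRecorder, x: i32, y: i32) {
    // ... perform the actual click here ...
    recorder.record_step("click", &format!("{{\"x\":{x},\"y\":{y}}}"));
}
```

Making `record_step` a no-op when idle keeps each handler to a single unconditional call, so the same line can be added to `type_text`, `press_key`, `drag`, `scroll`, `set_value`, and `perform_action` without extra branching.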
fn draw_hollow_rect(image: &mut RgbaImage, x: i32, y: i32, width: u32, height: u32, color: Rgba<u8>) {
    let image_width = image.width() as i32;
    let image_height = image.height() as i32;
    let left = x.max(0);
    let top = y.max(0);
    let right = (x + width as i32).min(image_width);
    let bottom = (y + height as i32).min(image_height);
    if left >= right || top >= bottom {
        return;
    }
    for px in left..right {
        if top < image_height {
            image.put_pixel(px as u32, top as u32, color);
        }
        if bottom - 1 < image_height {
            image.put_pixel(px as u32, (bottom - 1) as u32, color);
        }
    }
    for py in top..bottom {
        if left < image_width {
            image.put_pixel(left as u32, py as u32, color);
        }
        if right - 1 < image_width {
            image.put_pixel((right - 1) as u32, py as u32, color);
        }
    }
}
The current implementation of draw_hollow_rect draws borders along the screen edges (left = x.max(0), top = y.max(0)) when a bounding box is partially off-screen. This creates misleading visual borders that do not correspond to the actual element boundaries.
We can implement proper clipping by only drawing the pixels of the rectangle's borders that actually fall within the image dimensions.
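As a crate-free illustration of that clipping rule, the sketch below draws each edge only if the edge itself lies inside the image, and restricts the loop to the visible span of that edge. `Canvas` is a hypothetical stand-in for `RgbaImage`, and `draw_hollow_rect_clipped` is an illustrative name, not a function from the PR. Arithmetic is done in `i64` so that `x + width` cannot overflow.

```rust
/// Minimal stand-in for an image buffer so the sketch needs no crates.
struct Canvas {
    width: i32,
    height: i32,
    pixels: Vec<bool>, // true = a border pixel was drawn here
}

impl Canvas {
    fn new(width: i32, height: i32) -> Self {
        Canvas { width, height, pixels: vec![false; (width * height) as usize] }
    }
    fn get(&self, x: i32, y: i32) -> bool {
        self.pixels[(y * self.width + x) as usize]
    }
    fn set(&mut self, x: i32, y: i32) {
        self.pixels[(y * self.width + x) as usize] = true;
    }
}

/// Draw only the parts of the rectangle's true border that are on-screen.
fn draw_hollow_rect_clipped(canvas: &mut Canvas, x: i32, y: i32, w: u32, h: u32) {
    if w == 0 || h == 0 {
        return;
    }
    let (x1, y1) = (x as i64, y as i64);
    let (x2, y2) = (x1 + w as i64 - 1, y1 + h as i64 - 1);
    let (cw, ch) = (canvas.width as i64, canvas.height as i64);
    // Visible portion of each axis; these ranges are empty when off-screen.
    let visible_x = x1.max(0)..=x2.min(cw - 1);
    let visible_y = y1.max(0)..=y2.min(ch - 1);
    // Horizontal edges: skipped entirely if the edge row is off-screen.
    for edge_y in [y1, y2] {
        if (0..ch).contains(&edge_y) {
            for px in visible_x.clone() {
                canvas.set(px as i32, edge_y as i32);
            }
        }
    }
    // Vertical edges: skipped entirely if the edge column is off-screen.
    for edge_x in [x1, x2] {
        if (0..cw).contains(&edge_x) {
            for py in visible_y.clone() {
                canvas.set(edge_x as i32, py as i32);
            }
        }
    }
}
```

For a box starting at `x = -2`, the left edge is skipped rather than redrawn at column 0, which is the misleading border described above. Clamping the loop ranges also bounds the work by the image size, even for very large off-screen boxes.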
fn draw_hollow_rect(image: &mut RgbaImage, x: i32, y: i32, width: u32, height: u32, color: Rgba<u8>) {
    let iw = image.width() as i32;
    let ih = image.height() as i32;
    let x2 = x + width as i32 - 1;
    let y2 = y + height as i32 - 1;
    for px in x..=x2 {
        if px >= 0 && px < iw {
            if y >= 0 && y < ih {
                image.put_pixel(px as u32, y as u32, color);
            }
            if y2 >= 0 && y2 < ih {
                image.put_pixel(px as u32, y2 as u32, color);
            }
        }
    }
    for py in y..=y2 {
        if py >= 0 && py < ih {
            if x >= 0 && x < iw {
                image.put_pixel(x as u32, py as u32, color);
            }
            if x2 >= 0 && x2 < iw {
                image.put_pixel(x2 as u32, py as u32, color);
            }
        }
    }
}

    &self,
    Parameters(params): Parameters<StartRecordingParams>,
) -> Json<RecordingOutput> {
    let message = self.macro_recorder.start(params.name.clone());
This is an auto review done by revuto.
start_recording flips the recorder on, but none of the mutating tool handlers call self.macro_recorder.record_step(...) before returning (a search for record_step only finds the method definition). As a result stop_recording will always report an empty steps array, so the new macro/replay feature advertised by these tools cannot capture any workflow.
@Stijnman Hi :)
Summary
Closes the highest-ROI gaps identified for making computer-use-linux the definitive production Linux desktop MCP:

Windowing
- swaymsg -t get_tree with SWAYSOCK discovery, container-id focus ([con_id=N] focus), and doctor probe registration (between Hyprland and i3).

Agent ergonomics
- find_element: natural-language element discovery returning @eN refs with confidence scoring
- hybrid_strategy: accessibility-first vs coordinate-fallback recommendation (COMPUTER_USE_LINUX_HYBRID=1)
- get_clipboard / set_clipboard: wl-clipboard / xclip / xsel
- start_recording / stop_recording / replay_macro: JSON workflow capture + Hermes skill skeleton export
- screenshot_debug: element bounding-box highlights + optional tesseract OCR

Hermes onboarding
- skills/computer-use-linux/SKILL.md with the accessibility-first + hybrid decision tree, new tool table, and COMPUTER_USE_LINUX_HYBRID setup.

Test plan
- cargo test: 110 unit tests pass (including new Sway parser + NL find_element tests)
- computer-use-linux doctor on KDE/X11 session
- swaymsg available

Notes
- tesseract-ocr installed; fails gracefully when absent.