
gl clone: clean partial clone of repos with private subtrees #33

Open
beardthelion wants to merge 4 commits into Gitlawb:main from beardthelion:feat/clean-clone-ux

Conversation

@beardthelion
Contributor

@beardthelion beardthelion commented Jun 7, 2026

Pairs with #28 (subtree content withholding). That PR makes the node withhold private blob bytes from the served pack; the resulting pack is not closed under reachability, so a stock git clone is refused at fetch unless the user manually passes --filter. This adds the client-side piece so a non-reader gets a clean checkout with one command.

What's here

  • GET /api/v1/repos/{owner}/{repo}/withheld-paths: returns only the path globs the (optionally authenticated) caller is denied. Not owner-gated and never exposes reader DIDs, unlike list_visibility. Computed from the existing visibility_check via a new pure withheld_globs.
  • gl clone <gitlawb://owner/name | owner/name> [dir]: asks the node what's withheld for the caller, then clones as a promisor (git clone --filter=blob:none --no-checkout) and sparse-excludes those globs before checkout. Public files land, private paths are absent, the tree entries and SHAs stay visible, no error.
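
The clone flow described above can be sketched as a sequence of git invocations. This is a minimal illustration, not the PR's code: `clone_plan` and `argv` are hypothetical names, the branch is passed in rather than discovered, and the step that writes `.git/info/sparse-checkout` is only marked by a comment.

```rust
/// Turn a slice of string parts into an owned argument vector.
fn argv(parts: &[&str]) -> Vec<String> {
    parts.iter().map(|p| p.to_string()).collect()
}

/// Hypothetical helper: list the git commands for a promisor + sparse clone.
/// The real orchestration lives in `gl clone` and is not reproduced here.
fn clone_plan(url: &str, dir: &str, branch: &str) -> Vec<Vec<String>> {
    vec![
        // 1. Clone as a promisor without fetching blobs or checking out.
        argv(&["git", "clone", "--filter=blob:none", "--no-checkout", url, dir]),
        // 2. Enable sparse checkout in non-cone (pattern) mode.
        argv(&["git", "-C", dir, "sparse-checkout", "init", "--no-cone"]),
        // (The caller writes the exclude patterns to .git/info/sparse-checkout here.)
        // 3. Check out; only blobs matching the sparse spec are requested.
        argv(&["git", "-C", dir, "checkout", branch]),
    ]
}
```

Calling `clone_plan("gitlawb://alice/demo", "demo", "main")` yields three argument vectors that a caller could hand to `std::process::Command` in order; the URL and branch here are placeholder values.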

A connect-mode helper can't influence git's own clone logic, so the orchestration lives in gl clone rather than trying to make a bare git clone gitlawb://... work without flags (noted as a follow-up).

Notes

Test plan

  • cargo test -p gl -p gitlawb-node (new: withheld_globs, gl clone orchestration against a file:// bare, repo-arg parsing)
  • Manual smoke against a running node: as a non-reader of a /secret mode-B repo, gl clone it and confirm public/ present, secret/ absent from the worktree, secret path + SHA still in git ls-tree.

Summary by CodeRabbit

  • New Features

    • Added a new gl clone command that detects repos with withheld content and performs clean partial clones using sparse checkout when needed.
    • Added an API endpoint that returns a repo identifier plus withheld and reinclude path patterns (accepts optional authentication).
  • Tests

    • Added unit tests covering withheld-path computation and sparse checkout behavior.

@coderabbitai

coderabbitai Bot commented Jun 7, 2026


No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: c2aeca2c-1703-45ed-b43b-3e3c62238b90

📥 Commits

Reviewing files that changed from the base of the PR and between 2fd4fe1 and af97249.

📒 Files selected for processing (3)
  • crates/gitlawb-node/src/api/visibility.rs
  • crates/gitlawb-node/src/visibility.rs
  • crates/gl/src/clone.rs
🚧 Files skipped from review as they are similar to previous changes (1)
  • crates/gitlawb-node/src/api/visibility.rs

📝 Walkthrough


Adds backend helpers and a new REST endpoint that expose per-caller withheld and reinclude path globs, and a gl clone CLI subcommand that queries the endpoint and performs full or promisor+sparse clones to exclude withheld paths while re-including allowed nested paths.

Changes

Backend Withheld Paths Support

| Layer / File(s) | Summary |
|---|---|
| Withheld globs & reincluded globs helpers (crates/gitlawb-node/src/visibility.rs) | Adds withheld_globs and reincluded_globs that probe representative prefixes with visibility_check to return denied non-root path_glob strings and allowed nested re-inclusions; includes unit tests. |
| withheld-paths REST endpoint and routing (crates/gitlawb-node/src/api/visibility.rs, crates/gitlawb-node/src/server.rs) | Adds GET /api/v1/repos/{owner}/{repo}/withheld-paths handler that performs a whole-repo read gate, computes withheld and reinclude via the helpers, and returns JSON { repo, withheld, reinclude }; route is registered in read-only git routes. |
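
To make the withheld/reinclude split concrete, here is a toy model of the two helpers. It is a sketch under stated assumptions: the `Rule` struct and its `caller_allowed` flag are hypothetical stand-ins for the node's rule type and for the result of probing `visibility_check`, and `glob_prefix` here only strips a trailing `/**`.

```rust
/// Hypothetical rule: a path glob plus whether the current caller may read it.
/// The real helpers derive this by probing `visibility_check`.
struct Rule<'a> {
    path_glob: &'a str,
    caller_allowed: bool,
}

/// "/secret/**" -> "/secret"; a wildcard-free glob is returned unchanged.
fn glob_prefix(glob: &str) -> &str {
    glob.strip_suffix("/**").unwrap_or(glob)
}

/// A prefix that denotes the repository root rather than a subtree.
fn is_root(prefix: &str) -> bool {
    prefix.is_empty() || prefix == "/"
}

/// Globs the caller is denied, excluding whole-repo rules.
fn withheld_globs(rules: &[Rule<'_>]) -> Vec<String> {
    rules
        .iter()
        .filter(|r| !r.caller_allowed && !is_root(glob_prefix(r.path_glob)))
        .map(|r| r.path_glob.to_string())
        .collect()
}

/// Allowed globs that sit strictly under some withheld glob.
fn reincluded_globs(rules: &[Rule<'_>]) -> Vec<String> {
    let denied: Vec<&str> = rules
        .iter()
        .filter(|r| !r.caller_allowed)
        .map(|r| glob_prefix(r.path_glob))
        .filter(|p| !is_root(p))
        .collect();
    rules
        .iter()
        .filter(|r| r.caller_allowed)
        .filter(|r| {
            let p = glob_prefix(r.path_glob);
            // Strictly under: longer than the denied prefix and separated by '/'.
            denied
                .iter()
                .any(|d| p.len() > d.len() && p.starts_with(*d) && p.as_bytes()[d.len()] == b'/')
        })
        .map(|r| r.path_glob.to_string())
        .collect()
}
```

With a denied `/secret/**` and an allowed `/secret/public/**`, this model reports the first as withheld and the second as a re-include, while an unrelated allowed glob such as `/docs/**` appears in neither list.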

CLI Clone Command with Partial Clone Support

| Layer / File(s) | Summary |
|---|---|
| Clone args, module, and git helpers (crates/gl/src/clone.rs, crates/gl/src/main.rs) | Adds CloneArgs, declares mod clone, and helper wrappers to run git with consistent stderr-inclusive error reporting. |
| Sparse-checkout pattern translation (crates/gl/src/clone.rs) | Implements sparse_patterns to convert withheld/reinclude globs into sparse-checkout patterns, handling subtree vs exact-path semantics; includes unit tests for pattern translations. |
| Partial clone setup and checkout orchestration (crates/gl/src/clone.rs) | setup_partial_clone performs either a full clone or a promisor clone (--filter=blob:none --no-checkout), initializes sparse checkout (--no-cone), writes .git/info/sparse-checkout with negated withheld patterns and appended reinclude patterns, and checks out the target branch; includes integration tests verifying exclusions and re-inclusions. |
| Repo parsing, node query, and orchestration (crates/gl/src/clone.rs) | Adds parse_repo, fetch_withheld (with optional signing), and run, which validates the destination, queries the node for withheld/reinclude lists, and invokes setup_partial_clone. |
| Clone subcommand dispatch (crates/gl/src/main.rs) | Adds Clone variant to Commands enum and wires main() to call clone::run() for the subcommand. |
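
The pattern translation summarized in the table can be sketched roughly as follows. This is an illustrative version, not the PR's `sparse_patterns`: the leading `/*` include-everything line and the symmetric handling of re-includes are assumptions, and `expand` is a hypothetical helper.

```rust
/// Hypothetical helper: expand one glob into sparse-checkout path forms.
/// "/x/**" covers a subtree only; a wildcard-free "/x" covers the exact
/// path and the subtree beneath it.
fn expand(glob: &str) -> Vec<String> {
    match glob.strip_suffix("/**") {
        Some(dir) => vec![format!("{dir}/")],
        None => {
            let path = glob.trim_end_matches('/');
            vec![path.to_string(), format!("{path}/")]
        }
    }
}

/// Build the lines of a non-cone sparse-checkout file:
/// include everything, negate withheld paths, then re-include exceptions.
fn sparse_patterns(withheld: &[&str], reinclude: &[&str]) -> Vec<String> {
    // Assumption: the spec starts from "everything at the top level".
    let mut lines = vec!["/*".to_string()];
    for &g in withheld {
        lines.extend(expand(g).into_iter().map(|p| format!("!{p}")));
    }
    // Re-includes come last so they override the earlier negations.
    for &g in reinclude {
        lines.extend(expand(g));
    }
    lines
}
```

For withheld globs `/secret/**` and `/docs/private` plus a re-include of `/secret/public/**`, the sketch produces the negated lines first and the re-include line last; joining the result with newlines gives a candidate body for `.git/info/sparse-checkout`.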

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

  • Gitlawb/node#25: The main PR's new visibility::withheld_globs helper builds directly on the prior PR's path-scoped visibility logic (visibility_check/Decision), establishing a foundational dependency.

Suggested reviewers

  • kevincodex1

Poem

🐰 I sniffed the globs both near and far,
Hid the roots but left the star.
Clone with care, sparse patterns sing—
Secrets stay sealed, but small leaves spring. 🌿🧺

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The PR title directly and specifically describes the main change: adding a clean partial clone feature that handles repos with private subtrees. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 6

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@crates/gitlawb-node/src/api/visibility.rs`:
- Around line 204-212: The handler currently returns withheld metadata without
verifying the caller can read the repository root; before calling
crate::visibility::withheld_globs or returning the JSON, perform a root-read
gate by calling the existing visibility_check (or equivalent) with path "/"
using the same auth, rules, record.is_public, and record.owner_did; if
visibility_check(..., "/") denies access, return the repo-not-found/unauthorized
response used elsewhere (same error path as other root-read failures) instead of
returning the withheld data. Ensure this check is placed after fetching rules
(list_visibility_rules) but before invoking withheld_globs and constructing the
Json response.

In `@crates/gitlawb-node/src/visibility.rs`:
- Around line 114-119: The current withheld_globs filtering uses a single
representative probe (glob_prefix) and calls visibility_check, which
misclassifies a denied parent when a more-specific child rule allows access;
update the logic in withheld_globs (or change its API contract) to detect
allow-overrides by checking for any more-specific descendant allow before
emitting the parent glob: either (A) when
visibility_check(glob_prefix(&r.path_glob)) == Decision::Deny, scan rules for
any child globs under r.path_glob that yield Decision::Allow (using
visibility_check on those child probes) and skip emitting the parent if any
exist, or (B) change the return structure to emit tuples like (denying_glob,
allowed_exceptions) so the caller can re-include allowed descendants; reference
visibility_check, glob_prefix, Decision::Deny, and the withheld_globs helper to
locate where to implement this.

In `@crates/gl/src/clone.rs`:
- Around line 133-144: The current parsing uses split_once('/') which accepts
inputs like "owner/name/extra"; update the parsing logic in clone.rs (the block
that computes owner and name from stripped) to reject any repo string containing
more than one slash: after trimming trailing slashes and calling
split_once('/'), validate that neither owner nor name contains another '/' (or
alternately check that the count of '/' is exactly one) and bail! with the same
error message if extra slashes are present so malformed inputs like
"owner/name/extra" fail fast instead of producing an invalid repo identifier.
- Around line 147-171: fetch_withheld currently swallows all errors and returns
an empty Vec, which allows clones to proceed when the withheld-paths endpoint
actually failed; change fetch_withheld to return a Result<Vec<String>,
anyhow::Error> (or your crate's error type) instead of Vec<String>, and
propagate any network/HTTP/json errors to the caller except for an explicit
“endpoint unsupported” HTTP status (e.g. 404 Not Found or 501 Not Implemented)
where you should return Ok(Vec::new()). Concretely: update fetch_withheld
signature, stop using unwrap_or_default on resp.json(), treat Err(resp) and
non-success statuses as Err unless status is 404/501 (in which case return
Ok([])), and update callers of fetch_withheld to handle the Result; refer to
load_keypair_from_dir, NodeClient::get_signed/get and the path
"/api/v1/repos/{owner}/{name}/withheld-paths" to locate the code to change.
- Around line 112-121: The current logic in clone.rs extracts the default branch
by parsing stdout from `git remote show origin` (variables `out`, `text`,
`head`) and does not check `out.status.success()`, which breaks for localized
output or non-zero exit; replace this with a refs-based lookup by running `git
symbolic-ref --short refs/remotes/origin/HEAD` (or equivalent) and check the
child process exit status (`out.status.success()`), then read and trim stdout to
produce the branch name, returning an error if the command fails or stdout is
empty instead of relying on localized `HEAD branch:` parsing.
- Around line 99-105: The loop in setup_partial_clone that converts
withheld_globs currently always strips a trailing "**" and emits only a
directory-exclude like "!{dir}/", which fails to exclude the exact path when the
upstream sent "/prefix" (no "/**"). Change the logic inside the for g in
withheld_globs loop: detect whether g originally ended with "**" (or "/**"); if
it did keep the existing behavior (emit "!{dir}/"), but if it did not then emit
two exclude lines for that dir—one for the exact path ("!{dir}") and one for the
subtree ("!{dir}/")—so the sparse-checkout encoding matches visibility_check
semantics; update/add a unit/integration test that calls setup_partial_clone (or
the surrounding code) with a withheld glob like "/docs/private" to assert both
the exact path and subtree are withheld.
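
The `fetch_withheld` finding above (propagate failures, but treat an unsupported endpoint as "nothing withheld") can be illustrated with a small pure function. This is a sketch only: the `Fetch` enum and `withheld_from` are hypothetical, errors are plain strings instead of `anyhow::Error`, and the real code goes through `NodeClient`.

```rust
/// Hypothetical reduction of the HTTP call to what the decision needs.
enum Fetch {
    /// The request never completed (DNS, TLS, timeout, ...).
    Transport(String),
    /// A response arrived; `body` is the decoded glob list or a decode error.
    Response { status: u16, body: Result<Vec<String>, String> },
}

/// Decide what the caller of `fetch_withheld` should see.
fn withheld_from(fetch: Fetch) -> Result<Vec<String>, String> {
    match fetch {
        Fetch::Transport(e) => Err(format!("withheld-paths request failed: {e}")),
        // Endpoint not supported by this node: proceed as if nothing is withheld.
        Fetch::Response { status: 404 | 501, .. } => Ok(Vec::new()),
        // Success: a body that fails to decode is an error, not an empty list.
        Fetch::Response { status, body } if (200..300).contains(&status) => {
            body.map_err(|e| format!("withheld-paths returned an invalid body: {e}"))
        }
        // Any other status is surfaced instead of being swallowed.
        Fetch::Response { status, .. } => Err(format!("withheld-paths returned HTTP {status}")),
    }
}
```

Keeping the decision separate from the network call makes the three cases (transport failure, unsupported endpoint, genuine error status) straightforward to unit test without a running node.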

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: 43094733-35b0-4310-8cae-ee86146445f9

📥 Commits

Reviewing files that changed from the base of the PR and between 6abaf1d and 2fd4fe1.

📒 Files selected for processing (5)
  • crates/gitlawb-node/src/api/visibility.rs
  • crates/gitlawb-node/src/server.rs
  • crates/gitlawb-node/src/visibility.rs
  • crates/gl/src/clone.rs
  • crates/gl/src/main.rs

Comment thread crates/gitlawb-node/src/api/visibility.rs
Comment thread crates/gitlawb-node/src/visibility.rs
Comment thread crates/gl/src/clone.rs
Comment thread crates/gl/src/clone.rs
Comment thread crates/gl/src/clone.rs
Comment on lines +133 to +144
    let (owner, name) = stripped
        .trim_end_matches('/')
        .split_once('/')
        .context("repo must be <owner>/<name> or gitlawb://<owner>/<name>")?;
    if owner.is_empty() || name.is_empty() {
        bail!("repo must be <owner>/<name> or gitlawb://<owner>/<name>");
    }
    Ok((
        format!("gitlawb://{owner}/{name}"),
        owner.to_string(),
        name.to_string(),
    ))

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Reject repo inputs with more than one slash.

split_once('/') accepts owner/name/extra, and that name then flows into /api/v1/repos/{owner}/{name}/withheld-paths, which silently degrades to [] on the fetch path. This should fail fast as malformed input instead of building an invalid repo identifier.

Suggested fix
     if owner.is_empty() || name.is_empty() {
         bail!("repo must be <owner>/<name> or gitlawb://<owner>/<name>");
     }
+    if name.contains('/') {
+        bail!("repo must be <owner>/<name> or gitlawb://<owner>/<name>");
+    }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
-    let (owner, name) = stripped
-        .trim_end_matches('/')
-        .split_once('/')
-        .context("repo must be <owner>/<name> or gitlawb://<owner>/<name>")?;
-    if owner.is_empty() || name.is_empty() {
-        bail!("repo must be <owner>/<name> or gitlawb://<owner>/<name>");
-    }
-    Ok((
-        format!("gitlawb://{owner}/{name}"),
-        owner.to_string(),
-        name.to_string(),
-    ))
+    let (owner, name) = stripped
+        .trim_end_matches('/')
+        .split_once('/')
+        .context("repo must be <owner>/<name> or gitlawb://<owner>/<name>")?;
+    if owner.is_empty() || name.is_empty() {
+        bail!("repo must be <owner>/<name> or gitlawb://<owner>/<name>");
+    }
+    if name.contains('/') {
+        bail!("repo must be <owner>/<name> or gitlawb://<owner>/<name>");
+    }
+    Ok((
+        format!("gitlawb://{owner}/{name}"),
+        owner.to_string(),
+        name.to_string(),
+    ))
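
For readers who want to try the fail-fast rule in isolation, here is a standalone sketch of the parser with the extra-slash check folded in. It is not the PR's function: it returns `String` errors instead of using `anyhow`, and the way the `gitlawb://` prefix is stripped is an assumption, since that part of `parse_repo` is not shown in the review.

```rust
const USAGE: &str = "repo must be <owner>/<name> or gitlawb://<owner>/<name>";

/// Parse `owner/name` or `gitlawb://owner/name` into (url, owner, name).
fn parse_repo(input: &str) -> Result<(String, String, String), String> {
    // Assumption: the scheme prefix is optional and removed before splitting.
    let stripped = input.strip_prefix("gitlawb://").unwrap_or(input);
    let (owner, name) = stripped
        .trim_end_matches('/')
        .split_once('/')
        .ok_or_else(|| USAGE.to_string())?;
    // `split_once` splits at the first '/', so only `name` can still hold one.
    if owner.is_empty() || name.is_empty() || name.contains('/') {
        return Err(USAGE.to_string());
    }
    Ok((
        format!("gitlawb://{owner}/{name}"),
        owner.to_string(),
        name.to_string(),
    ))
}
```

With this shape, `alice/demo` and `gitlawb://alice/demo/` both parse to owner `alice` and name `demo`, while `alice/demo/extra`, `alice`, and `/demo` are rejected with the usage message.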

Comment thread crates/gl/src/clone.rs Outdated
Three fixes from the PR Gitlawb#33 review:

- withheld_paths now applies the whole-repo "/" read gate (returns
  repo-not-found when the caller cannot read the root), matching the git read
  endpoints. Without it the endpoint disclosed a private repo's existence and
  path layout to unauthorized callers. The withheld_globs doc already assumed
  this gate existed; now it does.

- A nested allow under a denied parent (e.g. "/secret/public/**" allowed,
  "/secret/**" denied) was over-withheld: the client sparse-excluded the whole
  parent and hid paths the caller may read. The endpoint now also returns a
  "reinclude" list (allowed globs strictly under a denied one) and gl clone
  re-includes them in the sparse spec after the excludes.

- Wildcard-free globs like "/docs/private" match both the exact path and a
  subtree (per glob_matches), but the client only emitted the subtree exclude.
  sparse_patterns now emits both "/docs/private" and "/docs/private/".

Verified the exclude-then-reinclude sparse ordering checks out cleanly with
real git, plus unit tests for reincluded_globs, the nested re-include, the
exact-path exclude, and sparse_patterns.