
feat(chutes): tiered NEAR-primary/Chutes-fallback across all TEE models #768

Open
Evrard-Nil wants to merge 4 commits into main from feat/chutes-tiered-fallback


@Evrard-Nil

Collaborator

Builds on #758 (Chutes as an attested provider). Turns Chutes from a single pinned provider (registered under its raw -TEE slug) into a fallback fabric across all Chutes TEE models, under canonical ids, with NEAR-primary routing and a hard "verifiable models never serve plaintext" rule.

Requirements (from product)

  1. Register all Chutes TEE models, not one.
  2. List them under canonical ids — the NEAR-served id when NEAR also serves the model (zai-org/GLM-5.1-FP8, not …-TEE), else the OpenRouter id.
  3. Route NEAR first, Chutes fallback when NEAR serves the model; Chutes-primary when NEAR doesn't.
  4. Do not fall back to a non-attested provider for verifiable models.
  5. (Streaming) enable E2EE streaming — gated on a staging check (below).

Design decisions (each grounded in the existing pool, with examples)

1. Canonical id vs chute slug — decouple them

The Chutes provider used model_name as both the catalog/public id and the upstream model it sends to Chutes. We split that:

  • CHUTES_MODELS is now comma-separated canonical_id=chute_slug pairs (a bare entry → canonical == slug), parsed into Vec<ChutesModelEntry>.
    • Example: CHUTES_MODELS=zai-org/GLM-5.1-FP8=zai-org/GLM-5.1-TEE,moonshotai/Kimi-K2.6-TEE
    • zai-org/GLM-5.1-FP8 (NEAR-served) is the public/route id; zai-org/GLM-5.1-TEE is the upstream chute slug; moonshotai/Kimi-K2.6-TEE (Chutes-only) is canonical==slug for now (swap to its OpenRouter id when known).
  • The provider is still built with the chute slug (request_body pins it + cached_chute_id resolves it upstream), but seeded/registered under the canonical id. So a client calling zai-org/GLM-5.1-FP8 reaches Chutes with the right -TEE slug.
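
As a rough illustration of the pair format above, here is a minimal sketch of the parsing. `parse_chutes_models` is a hypothetical name, and the real parser in the config crate may differ (for example in how it handles an empty slug); only the `ChutesModelEntry` fields `canonical_id` and `chute_slug` come from this PR.

```rust
/// One `CHUTES_MODELS` entry: the public routing id and the upstream chute slug.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct ChutesModelEntry {
    pub canonical_id: String,
    pub chute_slug: String,
}

/// Hypothetical helper: parse comma-separated `canonical_id=chute_slug` pairs.
/// A bare entry maps to itself (canonical == slug); empty entries are dropped.
pub fn parse_chutes_models(raw: &str) -> Vec<ChutesModelEntry> {
    raw.split(',')
        .map(str::trim)
        .filter(|entry| !entry.is_empty())
        .map(|entry| match entry.split_once('=') {
            // Split on the first '=' only; both halves are trimmed.
            Some((canonical, slug)) => ChutesModelEntry {
                canonical_id: canonical.trim().to_string(),
                chute_slug: slug.trim().to_string(),
            },
            // Bare entry: the slug doubles as the canonical id.
            None => ChutesModelEntry {
                canonical_id: entry.to_string(),
                chute_slug: entry.to_string(),
            },
        })
        .collect()
}
```

With the example value above, the first entry routes `zai-org/GLM-5.1-FP8` to the `zai-org/GLM-5.1-TEE` slug, and the bare Kimi entry keeps its slug as its id.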

2. Wire ProviderTier into selection

ProviderTier{Near, Attested3p, NonAttested} existed but carried no behavior. Added fn tier(&self) to the InferenceProvider trait (default NonAttested); nearai::Provider → Near, chutes::Provider → Attested3p.
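
A minimal sketch of that trait change, for orientation only. The unit structs below are stand-ins for `nearai::Provider`, `chutes::Provider` and an external provider; the real trait has many more methods.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ProviderTier {
    Near,
    Attested3p,
    NonAttested,
}

pub trait InferenceProvider {
    /// With this default, a provider that does not override `tier`
    /// is treated as non-attested.
    fn tier(&self) -> ProviderTier {
        ProviderTier::NonAttested
    }
}

// Stand-in types (hypothetical) for the three kinds of provider.
pub struct NearProvider;
pub struct ChutesProvider;
pub struct ExternalProvider;

impl InferenceProvider for NearProvider {
    fn tier(&self) -> ProviderTier {
        ProviderTier::Near
    }
}

impl InferenceProvider for ChutesProvider {
    fn tier(&self) -> ProviderTier {
        ProviderTier::Attested3p
    }
}

// No override: falls back to the default tier.
impl InferenceProvider for ExternalProvider {}
```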

3. Coexistence — Chutes as a secondary provider (highest-risk change)

register_pinned_provider overwrote a model's provider list, and the discovery guards made a pinned name mutually exclusive with DB discovery — so NEAR + Chutes couldn't share a canonical id. New register_pinned_secondary_provider pushes instead, and records the provider in a pinned_providers map (the source of truth for the pinned tier). The refresh merge now rebuilds each model's list as [fresh discovered NEAR..] + [re-attached pinned Chutes] — discovery refreshes/adds the NEAR side but can never drop or overwrite the per-request-verified Chutes provider (the security invariant from #758 is preserved: discovery can only ADD an attested provider, never substitute it).
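
The per-model rebuild can be sketched as follows. This is a simplified illustration, not the pool code: `Provider` is a stand-in struct and `rebuild_model_list` is a hypothetical name.

```rust
use std::sync::Arc;

#[derive(Debug)]
pub struct Provider {
    pub name: &'static str,
}

/// Hypothetical per-model merge: fresh discovered providers first, then every
/// pinned provider that is not already in the list (compared by Arc identity).
pub fn rebuild_model_list(
    fresh: Vec<Arc<Provider>>,
    pinned: &[Arc<Provider>],
) -> Vec<Arc<Provider>> {
    let mut merged = fresh;
    for p in pinned {
        // Pointer equality, so two distinct providers with equal fields both stay.
        if !merged.iter().any(|existing| Arc::ptr_eq(existing, p)) {
            merged.push(Arc::clone(p));
        }
    }
    merged
}
```

Whatever discovery returns for the NEAR side, the pinned provider is appended afterwards, so it cannot be dropped by a refresh in this sketch.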

4. Tier-ordered selection = NEAR-primary / Chutes-fallback (per-request)

get_providers_with_fallback stable-sorts by tier (Near < Attested3p < NonAttested) within each health group. The existing retry_with_fallback then walks that ordered list, so a retryable NEAR failure (5xx/429/connection) falls through to Chutes within the same request (a non-retryable client 4xx still fast-returns without spending Chutes quota). A Chutes-only id has a single provider, so it's primary automatically — no special case. NEAR demotion (10 consecutive failures) already moves it below Chutes.
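
A minimal sketch of the ordering, assuming health is reduced to a single `demoted` flag (the real pool tracks failure counts). `Candidate` and `order_for_fallback` are hypothetical names.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ProviderTier {
    Near,
    Attested3p,
    NonAttested,
}

#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Candidate {
    pub name: &'static str,
    pub tier: ProviderTier,
    /// Assumption: true once the provider has been demoted for repeated failures.
    pub demoted: bool,
}

fn tier_rank(tier: ProviderTier) -> u8 {
    match tier {
        ProviderTier::Near => 0,
        ProviderTier::Attested3p => 1,
        ProviderTier::NonAttested => 2,
    }
}

/// Healthy providers first, then by tier inside each health group.
pub fn order_for_fallback(mut candidates: Vec<Candidate>) -> Vec<Candidate> {
    // `sort_by_key` is stable: candidates with equal keys keep their relative order.
    candidates.sort_by_key(|c| (c.demoted, tier_rank(c.tier)));
    candidates
}
```

The retry loop then only has to walk the returned list front to back.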

5. Verifiable models never fall back to plaintext

If a model has any attested provider (Near or Attested3p), all NonAttested providers are dropped from selection. A verifiable request thus only ever tries attested providers and fails closed rather than silently downgrading to a plaintext backend — defending the case where a Chutes-only canonical id also has a stray external (OpenAI/OpenRouter) row. Pure non-attested models (no attested provider) are unaffected.
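
The rule can be sketched as a small filter. `Candidate` and `drop_plaintext_if_verifiable` are hypothetical names; the real check lives in the pool's selection path.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ProviderTier {
    Near,
    Attested3p,
    NonAttested,
}

#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Candidate {
    pub name: &'static str,
    pub tier: ProviderTier,
}

/// If any candidate is attested, keep only attested candidates.
/// A model with no attested provider keeps its list unchanged.
pub fn drop_plaintext_if_verifiable(candidates: Vec<Candidate>) -> Vec<Candidate> {
    let has_attested = candidates
        .iter()
        .any(|c| c.tier != ProviderTier::NonAttested);
    if has_attested {
        candidates
            .into_iter()
            .filter(|c| c.tier != ProviderTier::NonAttested)
            .collect()
    } else {
        candidates
    }
}
```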

6. Catalog seeding under the canonical id

Seed/register under the canonical id. A NEAR-served canonical id (e.g. zai-org/GLM-5.1-FP8) already has an active, attested row → it's respected verbatim and just gains Chutes as a fallback provider. A Chutes-only canonical id is auto-seeded inactive / $0 (operator activates + prices via PATCH /v1/admin/models, unchanged from #758).

Streaming

Decoder already implements the exact protocol (single e2e_init ML-KEM encapsulation → one HKDF stream key → per-chunk ChaCha20-Poly1305). It stays gated behind CHUTES_ENABLE_STREAMING pending a staging verification that Chutes emits the [DONE] terminator inside the encrypted channel (our truncation guard depends on it). Per-chunk AEAD authenticates every byte; frame reorder remains undetectable (no sequence numbers) — accepted residual, truncation is caught. No code change needed to flip it on.
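
To make the dependency on the terminator concrete, here is a minimal sketch of a truncation guard over frames that have already been decrypted. It is an assumption about the shape of the check, not the decoder's code; `collect_until_done` and `StreamError` are hypothetical names.

```rust
#[derive(Debug, PartialEq, Eq)]
pub enum StreamError {
    /// The stream ended before a `[DONE]` frame was seen.
    Truncated,
}

/// Hypothetical guard: collect decrypted frames until `[DONE]`.
/// Reaching the end of the input without the terminator is an error.
pub fn collect_until_done<'a, I>(frames: I) -> Result<Vec<&'a str>, StreamError>
where
    I: IntoIterator<Item = &'a str>,
{
    let mut chunks = Vec::new();
    for frame in frames {
        if frame == "[DONE]" {
            return Ok(chunks);
        }
        chunks.push(frame);
    }
    Err(StreamError::Truncated)
}
```

A guard of this shape only works if the terminator arrives as a decrypted frame, which is what the staging check is meant to confirm.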

Tests

  • config: canonical=slug pair parsing (+ bare entries, whitespace, dropped empties); empty when unset.
  • pool: verifiable_model_prefers_near_then_chutes_and_never_plaintext; chutes_only_model_serves_as_single_attested_primary; non_verifiable_model_is_served_by_its_non_attested_provider; pinned_secondary_chutes_coexists_with_db_near_provider. Existing pinned_provider_survives_refresh still green.
  • 297 inference_providers + config tests, 57 pool tests, clippy + fmt clean.

Follow-ups

Turns Chutes from a single pinned provider (registered under its raw `-TEE`
slug) into a fallback fabric for every Chutes TEE model, under canonical ids,
with NEAR-primary routing and a hard "verifiable models never serve plaintext"
rule.

Config (crates/config/src/types.rs)
- CHUTES_MODELS is now comma-separated `canonical_id=chute_slug` pairs (a bare
  entry means canonical == slug), parsed into Vec<ChutesModelEntry>. The
  canonical id is the user-facing/routing id (NEAR-served id when NEAR also
  serves the model, e.g. zai-org/GLM-5.1-FP8, else the OpenRouter id); the chute
  slug (zai-org/GLM-5.1-TEE) is the internal upstream identity.

Provider tier (crates/inference_providers)
- Wire the dormant ProviderTier into the InferenceProvider trait: new
  `fn tier(&self)` (default NonAttested); nearai::Provider => Near,
  chutes::Provider => Attested3p. MockProvider gains `with_tier` for tests.

Routing (crates/services/.../inference_provider_pool)
- register_pinned_secondary_provider: PUSHES a pinned (Chutes) provider onto a
  model's list instead of overwriting, so it coexists with NEAR's own
  DB-discovered providers; recorded in a new `pinned_providers` map.
- Refresh merge re-attaches pinned providers after rebuilding the NEAR side, so
  a discovery tick refreshes NEAR backends while NEVER dropping/overwriting the
  attested Chutes fallback (replaces the old skip-if-pinned behavior).
- get_providers_with_fallback orders providers by tier (Near < Attested3p <
  NonAttested) within each health group, so NEAR is primary and Chutes is the
  per-request fallback; a Chutes-only id has just the one provider (primary).
- VERIFIABLE SAFETY: if a model has any attested provider, all NonAttested
  providers are dropped from selection — a verifiable model fails closed rather
  than silently downgrading to a plaintext provider.

Catalog / registration (crates/api/src/lib.rs)
- Seed/register under the CANONICAL id (the provider still talks to Chutes with
  the chute slug, which request_body pins). A NEAR-served canonical id keeps its
  existing row and simply gains Chutes as a fallback provider.

Streaming stays gated (CHUTES_ENABLE_STREAMING) pending a staging check that
Chutes emits the [DONE] terminator inside the encrypted channel.

Tests: config pair-parsing; pool tier ordering, Chutes-only primary,
non-attested-only still served, NEAR+Chutes coexistence, and the
verifiable-never-plaintext filter.
Copilot AI review requested due to automatic review settings June 11, 2026 11:20
@Evrard-Nil Evrard-Nil temporarily deployed to Cloud API test env June 11, 2026 11:21 — with GitHub Actions Inactive

@gemini-code-assist gemini-code-assist Bot left a comment

Contributor


Code Review

This pull request introduces a tiered trust system for inference providers, allowing the system to prefer NEAR-served attested providers (Near) and fall back to attested third-party providers like Chutes (Attested3p), while ensuring that verifiable models never fall back to non-attested (plaintext) providers. It also updates the configuration to support mapping user-facing canonical IDs to internal Chutes slugs (canonical_id=chute_slug) and adds robust unit tests for the new routing, fallback, and parsing logic. The review feedback suggests improving the idiomaticity and performance of the provider deduplication logic by replacing raw pointer casts and unnecessary HashSet allocations with Arc::ptr_eq.


Comment on lines +695 to +701
    let ptr = Arc::as_ptr(&provider) as *const () as usize;
    if !entry
        .iter()
        .any(|p| Arc::as_ptr(p) as *const () as usize == ptr)
    {
        entry.push(provider);
    }

Contributor


medium

Using raw pointer casts to compare Arc allocations is unidiomatic and prone to errors. Since both p and provider are of the same type Arc<InferenceProviderTrait>, you can use Arc::ptr_eq directly to check for duplicates.

        if !entry.iter().any(|p| Arc::ptr_eq(p, &provider)) {
            entry.push(provider);
        }

Comment on lines +3396 to +3404
    let present: std::collections::HashSet<usize> = providers
        .iter()
        .map(|p| Arc::as_ptr(p) as *const () as usize)
        .collect();
    for p in extra {
        if !present.contains(&(Arc::as_ptr(p) as *const () as usize)) {
            providers.push(p.clone());
        }
    }

Contributor


medium

Instead of allocating a HashSet and performing raw pointer casts to deduplicate the providers list, you can perform a simple linear search using Arc::ptr_eq. Since the number of providers per model is typically very small (usually 1 or 2), a linear search is more idiomatic, avoids heap allocation, and is faster in practice.

                    for p in extra {
                        if !providers.iter().any(|existing| Arc::ptr_eq(existing, p)) {
                            providers.push(p.clone());
                        }
                    }

@claude

claude Bot commented Jun 11, 2026


Review — tiered NEAR/Chutes fallback

Traced the four moving parts (config parsing, tier wiring, register_pinned_secondary_provider, the refresh merge + get_providers_with_fallback) against the actual pool code. The design holds up and the invariants from #758 are preserved:

  • Verifiable fail-closed filter (mod.rs:1427): drops every NonAttested provider whenever any attested one is registered — correctly defends the stray-plaintext-row case and fails closed rather than downgrading. ✓
  • Refresh merge (mod.rs:3394): rebuilds as [fresh discovered] + [re-attached pinned] keyed off pinned_providers (source of truth), with Arc::as_ptr dedup. Discovery can only ADD/REFRESH the NEAR side, never drop/overwrite Chutes. ✓
  • Chutes-only survival: not in DB valid_model_names, but guarded by pinned_models in remove_stale_providers (mod.rs:3496) — survives refresh. ✓
  • Tier ordering: sort_by_key(tier_rank) is a stable sort, so round-robin order is preserved within a tier. Comment is accurate. ✓
  • Backward compat: a bare CHUTES_MODELS entry → canonical == slug, so the old zai-org/GLM-5.1-TEE form registers under the same id as before. ✓
  • E2EE-pinned requests (model_pub_key set) filter to the matching NEAR backend pubkey and won't reach Chutes — correct, since the client's encryption is keyed to that NEAR backend; Chutes is not a valid substitute there.

No critical/blocking issues found. Two minor, non-blocking notes:

  • Duplicate canonical_id in config (e.g. id=slugA,id=slugB): register_pinned_secondary_provider pushes into pinned_providers without dedup (it only dedups model_to_providers). Two distinct Chutes Arcs would both attach under one id. Harmless (both Attested3p), but a tracing::warn! on a repeated canonical id during registration would catch the misconfig.
  • Multi-= token: split_once('=') on a=b=c yields slug b=c. Not a realistic input, but worth a guard/comment if slugs could ever contain =.
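
For reference, the multi-`=` behaviour and one possible guard look like this. `split_pair` and `split_pair_strict` are hypothetical helpers written for illustration; only `str::split_once` is the standard-library call.

```rust
/// `split_once` splits on the FIRST '=' only, so any later '=' stays in the slug.
pub fn split_pair(token: &str) -> Option<(&str, &str)> {
    token.split_once('=')
}

/// Hypothetical stricter variant: reject tokens whose slug still contains '='.
/// A bare token maps to itself, matching the canonical == slug rule.
pub fn split_pair_strict(token: &str) -> Result<(&str, &str), String> {
    match token.split_once('=') {
        Some((_, slug)) if slug.contains('=') => {
            Err(format!("more than one '=' in CHUTES_MODELS entry: {token}"))
        }
        Some(pair) => Ok(pair),
        None => Ok((token, token)),
    }
}
```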

✅ Approved.

Copilot AI left a comment

Contributor


Pull request overview

This PR extends the Chutes attested provider integration from a single pinned -TEE model to a tiered, canonical-model routing strategy across all configured Chutes TEE models, making NEAR the primary provider where available and Chutes an attested fallback, while ensuring “verifiable models never fall back to plaintext.”

Changes:

  • Adds ProviderTier-driven provider ordering and “verifiable ⇒ drop non-attested providers” selection logic in the inference provider pool.
  • Introduces canonical-id-to-chute-slug mapping for CHUTES_MODELS and seeds/attaches Chutes providers under canonical model IDs as secondary providers.
  • Updates provider implementations/tests/mocks to support tier reporting and validates config parsing + pool behavior.

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| env.example | Documents CHUTES_MODELS as canonical_id=chute_slug pairs and explains tiered routing + plaintext fallback rules. |
| crates/services/src/inference_provider_pool/mod.rs | Adds secondary pinned providers, tier-ordered selection, verifiable plaintext filtering, and refresh-time re-attachment of pinned providers. |
| crates/inference_providers/src/mock.rs | Adds configurable ProviderTier to MockProvider for pool selection tests. |
| crates/inference_providers/src/lib.rs | Extends InferenceProvider trait with a default tier() method. |
| crates/inference_providers/src/attested/nearai/mod.rs | Marks NEAR provider tier as Near. |
| crates/inference_providers/src/attested/chutes/mod.rs | Marks Chutes provider tier as Attested3p. |
| crates/config/src/types.rs | Parses CHUTES_MODELS into ChutesModelEntry { canonical_id, chute_slug } and adds tests. |
| crates/api/src/lib.rs | Registers Chutes providers under canonical IDs while sending chute slugs upstream; uses secondary provider registration. |


Comment on lines +3389 to +3394
let pinned_providers = self
.pinned_providers
.read()
.unwrap_or_else(|e| e.into_inner())
.clone();
for (model_name, providers) in model_providers {
if pinned.contains(&model_name) {
warn!(
model = %model_name,
"DB discovery attempted to overwrite a pinned (attested) provider; ignoring"
);
continue;
for (model_name, mut providers) in model_providers {
Comment on lines +3396 to +3404
    let present: std::collections::HashSet<usize> = providers
        .iter()
        .map(|p| Arc::as_ptr(p) as *const () as usize)
        .collect();
    for p in extra {
        if !present.contains(&(Arc::as_ptr(p) as *const () as usize)) {
            providers.push(p.clone());
        }
    }

@PierreLeGuen PierreLeGuen left a comment

Contributor


This is a well-structured change — the canonical-id/slug split is threaded correctly (slug goes upstream in the request body, canonical id drives routing/catalog), the verifiable-never-plaintext filter sits on the single selection path all completion ops use, and the new config format is well documented in env.example. Three issues need addressing before merge, though.

Blocking / important

  1. Fail-closed gap when Chutes registration fails (crates/services/src/inference_provider_pool/mod.rs:1427). The plaintext filter only kicks in once an attested provider is registered. With the canonical IDs, an active external/OpenRouter row can share a model ID with a configured Chutes model. If Chutes is enabled but its API key is missing or provider construction fails, the model never gets pinned, the external row survives load_external_providers (the pinned_models guard there only protects models that actually got pinned), and plaintext is served for a model configured as verifiable. Reserve configured Chutes canonical IDs before external provider loading, or make Chutes config/registration failures fatal for those IDs, and add a regression test for the collision case.

  2. Streaming requests regress while CHUTES_ENABLE_STREAMING=false (the default) — mod.rs:1383-1517 with crates/inference_providers/src/attested/chutes/mod.rs:481-488. Selection has no streaming-capability filter, so the Chutes provider is included for streaming requests it categorically refuses. Its refusal is a CompletionError matching no retry keyword, and since Chutes is tried last it overwrites both last_error and last_retry_decision in retry_with_fallback: a transient NEAR 5xx that previously got exponential-backoff retries now gets none, and the client sees "set CHUTES_ENABLE_STREAMING" instead of the real NEAR error. Filter providers that can't serve the requested operation at selection time, or treat capability refusals as skip-without-clobbering.

  3. Legacy register_pinned_provider silently lost its overwrite protection — mod.rs:3394-3407 with mod.rs:651-662. The refresh merge now re-attaches only providers recorded in pinned_providers, but register_pinned_provider writes only pinned_models, so a same-name DB inference_url row overwrites a legacy-pinned provider on the next refresh — the exact substitution the removed guard existed to prevent. It's latent (production no longer calls the legacy function after this PR), but it's a broken security invariant on a still-public API that the e2e tests still use. Either have it populate pinned_providers too, or delete it and migrate its callers; either way, add an overwrite test on the load_inference_url_models path.
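
For item 1, the first suggested fix (reserving configured canonical ids before external loading) could look roughly like this. It is a sketch under assumptions: `Pool`, `reserve` and `register_external` are hypothetical names and the real pool stores providers, not URLs.

```rust
use std::collections::{HashMap, HashSet};

#[derive(Default)]
pub struct Pool {
    /// Canonical ids configured as verifiable, reserved before any external load.
    reserved: HashSet<String>,
    /// Model id -> plaintext backend URL (stand-in for external providers).
    external: HashMap<String, String>,
}

impl Pool {
    /// Reserve ids up front, independent of whether the attested provider builds.
    pub fn reserve(&mut self, ids: impl IntoIterator<Item = String>) {
        self.reserved.extend(ids);
    }

    /// Returns false (and registers nothing) when the id is reserved.
    pub fn register_external(&mut self, model_id: &str, url: &str) -> bool {
        if self.reserved.contains(model_id) {
            return false;
        }
        self.external.insert(model_id.to_string(), url.to_string());
        true
    }

    pub fn has_external(&self, model_id: &str) -> bool {
        self.external.contains_key(model_id)
    }
}
```

Because the reservation does not depend on the Chutes provider being constructed, a missing API key leaves the id with no provider at all rather than with a plaintext one.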

Non-blocking

  • mod.rs:686-691 — register_pinned_secondary_provider pushes into pinned_providers unconditionally, so duplicate canonical ids in CHUTES_MODELS accumulate duplicate providers that every refresh re-attaches. Dedup by Arc pointer there too, or reject duplicate canonical ids during CHUTES_MODELS parsing.
  • mod.rs:3394-3407 / 3483-3498 — if an operator drops the NEAR inference_url row but keeps the Chutes config, the model's entry is never rebuilt (absent from model_providers) and never stale-removed (pinned), so dead NEAR providers stay tier-first until per-provider failure demotion and are only pruned at restart. Consider rebuilding pinned models absent from fresh discovery as pinned-only.
  • mod.rs:1456 — round-robin rotation is applied over the full provider list before the stable tier sort, so with multiple NEAR providers plus a fallback the rotation skews same-tier balancing (e.g. with two NEAR providers and one Chutes, one NEAR provider leads two cycles out of three). Tier-group first and rotate within each tier.
  • E2EE/pubkey-routed requests get no Chutes fallback, since pinned secondaries are never added to pubkey_to_providers. That looks intentional (Chutes has no signing keys), but a comment stating it would help.
  • The refresh-merge re-attachment — the riskiest part of the change — has no direct test; the four new pool tests cover registration and selection ordering only. A test driving the actual load_inference_url_models merge with a pinned secondary would lock in the "discovery can never drop the Chutes provider" invariant.
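
The skew in the round-robin note can be reproduced with a few lines. The sketch assumes the rotation is a `rotate_left` by the request counter over the full list, followed by a stable sort on tier rank; provider names are made up.

```rust
/// Returns the leading provider for each of `cycles` consecutive requests when
/// the FULL list is rotated first and stable-sorted by tier afterwards.
pub fn leaders_when_rotating_full_list(cycles: usize) -> Vec<&'static str> {
    // (name, tier rank): two NEAR providers (rank 0) and one Chutes fallback (rank 1).
    let base = [("near-a", 0u8), ("near-b", 0u8), ("chutes", 1u8)];
    (0..cycles)
        .map(|cycle| {
            let mut list = base.to_vec();
            list.rotate_left(cycle % base.len());
            // Stable sort: same-tier entries keep their rotated order.
            list.sort_by_key(|&(_, rank)| rank);
            list[0].0
        })
        .collect()
}
```

Over three cycles the leaders are near-a, near-b, near-a, so near-a leads two cycles out of three.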

Checks run locally: cargo fmt --all -- --check clean; cargo clippy clean across services/config/inference_providers/api; cargo test -p config --lib 25 passed (incl. the two new CHUTES_MODELS parsing tests); cargo test -p services --lib inference_provider_pool 57 passed (incl. the four new tier/coexistence tests); cargo test -p inference_providers --lib 297 passed; cargo test -p api --no-run compiles all test targets against the new Vec<ChutesModelEntry> type. The chutes_catalog e2e tests fail for environmental reasons only (they need a dstack TEE unix socket absent outside a CVM; all e2e tests in this checkout are blocked identically). GitHub CI (Lint, Test Suite, security audit) was green at last read.

Blocking:
- Fail-closed reservation (Pierre #1): reserve every configured Chutes canonical
  id in the pool BEFORE external/discovery load (api init), via new
  pool.reserve_pinned_models. A plaintext external/OpenRouter row sharing a
  canonical id can no longer register even if the Chutes provider fails to build
  (missing key / construction error); the id serves only attested providers or
  fails closed (404), never plaintext. + regression test.
- Streaming capability filter (Pierre #2 / review F1): InferenceProvider gains
  supports_streaming() (default true; Chutes returns allow_streaming). For
  streaming ops, retry_with_fallback drops streaming-incapable providers when a
  capable sibling exists (extracted to filter_streaming_capable), so a NEAR 5xx no
  longer falls through to Chutes' "streaming disabled" error — which masked the
  real error and suppressed NEAR's backoff retry. + unit test.
- Legacy register_pinned_provider (Pierre #3): now also records in
  pinned_providers, so the refresh merge re-attaches it — restoring the
  overwrite-protection the old skip-if-pinned guard gave (the secondary path
  already did this).

Non-blocking:
- Dedup duplicate canonical ids in CHUTES_MODELS parsing (first wins + warn);
  symmetric Arc dedup in register_pinned_secondary_provider. + test.
- Stale-NEAR prune: a pinned canonical id absent from a (complete) discovery cycle
  is rebuilt pinned-only, so a dead NEAR backend isn't tried first until demotion.
- Round-robin within the leading tier: order by (health, tier) first, then rotate
  the leading group — fixes same-tier load skew from rotating the full list.
- Comment that pinned secondaries are intentionally absent from pubkey routing
  (E2EE path has no Chutes fallback — Chutes has no signing key).

The refresh merge (re-append + stale-prune) is extracted to a pure
merge_discovered_and_pinned helper with a direct unit test — the riskiest path
Pierre flagged as untested.
@Evrard-Nil Evrard-Nil temporarily deployed to Cloud API test env June 11, 2026 12:00 — with GitHub Actions Inactive
@Evrard-Nil

Collaborator Author

Thanks for the thorough review, Pierre — all three blocking issues + the five non-blocking ones are addressed in 2f89b1a1. I also ran an independent adversarial review pass (multi-dimension, refute-verified) that confirmed the verifiable-never-plaintext invariant holds on every path it probed; it flagged the same streaming/stale/dedup issues you did (and refuted a handful of false positives), and your #1 fail-closed gap was the most valuable catch — fixed.

Blocking

  1. Fail-closed when Chutes registration fails — new InferenceProviderPool::reserve_pinned_models, called in init_inference_providers before the external/discovery load, reserves every configured Chutes canonical id. A colliding plaintext external/OpenRouter row can no longer register for a configured-verifiable id even if the Chutes provider fails to build (missing key / Provider::new error); the id then serves only attested providers or fails closed (404). Regression test reserved_canonical_id_blocks_plaintext_external_collision.

  2. Streaming regresses with CHUTES_ENABLE_STREAMING=false: InferenceProvider::supports_streaming() (default true; Chutes returns allow_streaming). retry_with_fallback now filters streaming-incapable providers for streaming ops when a capable sibling exists (extracted to filter_streaming_capable), so a retryable NEAR 5xx no longer falls through to Chutes' refusal (which clobbered last_retry_decision and surfaced the wrong error). If no provider can stream (Chutes-only, streaming off), the list is kept so the clear error still surfaces. Unit test filter_streaming_capable_prefers_streaming_providers.

  3. Legacy register_pinned_provider lost overwrite protection — it now also records in pinned_providers, so the refresh merge re-attaches it (a same-name DB inference_url row can't overwrite it). The refresh merge itself is extracted to a pure merge_discovered_and_pinned helper with a direct unit test (merge_discovered_and_pinned_never_drops_pinned_and_prunes_stale) — the riskiest path you flagged as untested.
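
For item 2, the filter behaves roughly like the sketch below. The function name matches the helper mentioned above, but the signature and the `Candidate` type are assumptions for illustration.

```rust
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Candidate {
    pub name: &'static str,
    pub supports_streaming: bool,
}

/// For streaming operations, drop providers that cannot stream, but only when
/// at least one streaming-capable sibling exists. Otherwise keep the list so
/// the provider's own refusal error can still surface.
pub fn filter_streaming_capable(candidates: Vec<Candidate>, is_streaming: bool) -> Vec<Candidate> {
    let any_capable = candidates.iter().any(|c| c.supports_streaming);
    if !is_streaming || !any_capable {
        return candidates;
    }
    candidates
        .into_iter()
        .filter(|c| c.supports_streaming)
        .collect()
}
```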

Non-blocking

  • Dedup duplicate canonical ids in CHUTES_MODELS parsing (first wins + WARN), plus symmetric Arc-pointer dedup in register_pinned_secondary_provider. Test chutes_models_dedup_duplicate_canonical_ids_first_wins.
  • Stale-NEAR prune — a pinned canonical id absent from a (complete) discovery cycle is rebuilt pinned-only, so a dropped NEAR backend isn't tried tier-first until demotion. (Safe because load_inference_url_models early-returns on empty and fetch_inference_url_models is all-or-error — documented on the helper.)
  • Round-robin within the leading tier — providers are ordered by (health, tier) first, then the leading group is rotated, fixing the same-tier balance skew.
  • pubkey/E2EE fallback — added the comment: pinned secondaries are intentionally absent from pubkey_to_providers (Chutes has no signing key; its integrity is the ML-KEM AEAD channel), so the E2EE path has no Chutes fallback by design.
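
The round-robin change can be sketched as "sort first, then rotate only the leading group". `Candidate` and `order_and_rotate` are hypothetical names, and health is again reduced to a `demoted` flag.

```rust
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Candidate {
    pub name: &'static str,
    pub demoted: bool,
    /// 0 = Near, 1 = Attested3p, 2 = NonAttested.
    pub tier_rank: u8,
}

/// Order by (health, tier), then rotate only the leading group by `counter`.
pub fn order_and_rotate(mut candidates: Vec<Candidate>, counter: usize) -> Vec<Candidate> {
    candidates.sort_by_key(|c| (c.demoted, c.tier_rank));
    if let Some(first) = candidates.first() {
        let key = (first.demoted, first.tier_rank);
        // Length of the run of candidates sharing the leading (health, tier) key.
        let lead_len = candidates
            .iter()
            .take_while(|c| (c.demoted, c.tier_rank) == key)
            .count();
        candidates[..lead_len].rotate_left(counter % lead_len);
    }
    candidates
}
```

With two NEAR providers and one Chutes fallback, the leader now alternates between the two NEAR providers and Chutes stays last.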

All green locally: cargo fmt/clippy clean across the four crates; config 25, pool 60, inference_providers 297. Re-requesting review.

@Evrard-Nil Evrard-Nil requested a review from PierreLeGuen June 11, 2026 12:00

@PierreLeGuen PierreLeGuen left a comment

Contributor


The tiered NEAR-primary/Chutes-fallback design is sound and well-tested on the happy path: reserve_pinned_models runs before any external load so plaintext rows can't claim a canonical id (fail-closed), the verifiable-tier filter prevents silent downgrade to non-attested providers, and discovery merges re-append pinned providers rather than overwriting them. However, the merge logic has real gaps around its "complete discovery set" precondition that affect existing call paths.

Blocking

  • crates/services/src/inference_provider_pool/mod.rs:1888-1893, caller crates/api/src/routes/admin.rs:501: merge_discovered_and_pinned rebuilds every pinned canonical id absent from discovered as pinned-only, but the admin PATCH /v1/admin/models runtime path calls load_inference_url_models with only the models in the PATCH batch, violating the documented complete-set precondition. Any admin PATCH carrying an inference_url overwrites the model_to_providers entry of every Chutes-enabled, NEAR-served model not in the batch to [Chutes only]. Until the next periodic refresh (up to refresh_interval_secs, default 900s; indefinitely if refresh is disabled), traffic for those models silently shifts to the Chutes fallback (inverting the PR's routing goal), and E2EE pubkey-routed requests hard-fail with NoPubKeyProvider, since Chutes intentionally has no pubkey_to_providers entry. Fix: gate the absent-pinned rebuild on an explicit complete_set: bool (true only on init and the discovery/refresh path), or give the admin path a variant that merges pinned providers only for the models actually in the batch. Please add a regression test for a partial-batch call with an unrelated pinned model; the new merge tests only cover complete sets.

Important

  • crates/services/src/inference_provider_pool/mod.rs:2884 — the inverse edge case: load_inference_url_models returns early on an empty discovery set, so when the last NEAR backend disappears the pinned merge never runs at all. Since remove_stale_providers exempts pinned names, a Chutes-pinned model that previously had NEAR+Chutes keeps the stale NEAR provider in model_to_providers until restart, breaking the "pinned id absent from discovery is rebuilt pinned-only" invariant. Treat an empty fetch as a complete empty set: still run the merge with HashMap::new() and include the dropped providers in cleanup.
  • crates/services/src/inference_provider_pool/mod.rs:3468-3478 — related cleanup gap: when the merge rebuilds a pinned id as pinned-only, the dropped NEAR provider Arcs are never pruned from pubkey_to_providers (old_provider_ptrs only covers keys present in this cycle's discovery, and remove_stale_providers skips pinned names). Orphaned entries keep stale pubkeys routable and surface as NoPubKeyProvider instead of a clean miss. Collect the pointers dropped by the absent-pinned branch and prune them in the same atomic update.

Minor

  • crates/inference_providers/src/attested/nearai/mod.rs:1068: Fleet implements InferenceProvider but doesn't override tier(), so it reports NonAttested. Not registered in the pool today, but if it ever is, the verifiable filter would classify NEAR's own fleet as plaintext. A one-line tier() override closes the latent footgun.
  • crates/services/src/inference_provider_pool/mod.rs:1842 — the streaming-capability filter keys off operation_name.ends_with("_stream"). Only chat_completion_stream exists today, but a future streaming op that breaks the suffix convention silently skips the filter. An explicit is_streaming flag on retry_with_fallback, or a const list of streaming ops asserted in a test, would make this fail loudly.

Checks run locally: CI (Lint, Test Suite, security_audit) green on the PR; cargo fmt --all -- --check clean; clippy clean on the touched crates; cargo test -p config 26/26 (incl. new CHUTES_MODELS parsing tests); cargo test -p services pool suite 60/60 (incl. new tier/merge/streaming-filter/reservation tests); cargo test -p inference_providers 297/297; cargo test -p api 137/137; combined services/inference_providers/config run 373/373. Skipped: e2e (needs PostgreSQL) and vLLM integration tests (need a live inference server). Note the blocking and important findings above are not exercised by any current test.

…inors

Pierre round 2 — the F2 stale-prune introduced a real regression and gaps:

Blocking — admin-PATCH partial batch:
- merge_discovered_and_pinned is now RE-APPEND ONLY and touches only the models
  in `discovered`, so it's safe for a partial batch. The admin PATCH path
  (load_inference_url_models with just the batch) can no longer reset unrelated
  NEAR+Chutes models to Chutes-only.
- The stale-prune moved to a complete-set-only prune_stale_pinned, invoked ONLY
  from sync_inference_url_models (the periodic refresh, which passes the full
  fetch). Regression test: merge leaves absent pinned ids untouched.

Important:
- Empty discovery: sync_inference_url_models captures the (possibly empty)
  complete name set and still runs prune_stale_pinned, so when the last NEAR
  backend disappears a Chutes-pinned model is correctly rebuilt pinned-only
  (load_inference_url_models early-returns on empty; the prune does the rebuild).
- Cleanup: prune_stale_pinned prunes the dropped NEAR Arcs from
  pubkey_to_providers and the failure counters, so they don't linger as orphaned
  / NoPubKeyProvider-routable entries.

Minor:
- nearai::Fleet now overrides tier() -> Near (defensive; Provider is what's pooled).
- Streaming ops are an explicit STREAMING_OPERATIONS const (not a "_stream" name
  suffix), with a test asserting chat_completion_stream is registered.

Tests: prune_stale_pinned_rebuilds_absent_pinned_only,
merge_discovered_and_pinned_reappends_and_leaves_absent_pinned_untouched,
streaming_operations_list_is_explicit. pool 62, inference_providers 297, clippy+fmt clean.

Evrard-Nil temporarily deployed to Cloud API test env on June 11, 2026 12:21 with GitHub Actions (Inactive)
@Evrard-Nil

Collaborator Author

Round 2 addressed in e2e92028 — you were right, the F2 stale-prune I added had a real partial-batch regression. Thanks for catching it before merge.

Blocking — admin-PATCH partial batch

  • merge_discovered_and_pinned is now re-append only and touches only the models in discovered, so a partial batch (the admin PATCH /v1/admin/models runtime registration) can no longer reset unrelated NEAR+Chutes models to Chutes-only.
  • The stale-prune moved to a separate complete-set-only prune_stale_pinned, invoked ONLY from sync_inference_url_models (the periodic refresh, which passes the full fetch). The admin path calls load_inference_url_models directly → re-append only, no prune. Regression test merge_discovered_and_pinned_reappends_and_leaves_absent_pinned_untouched asserts an absent pinned id is left untouched.
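
A minimal sketch of the re-append-only merge described above, assuming the pool is a plain map from model id to an ordered provider list. `Provider`, `ProviderMap`, and the exact signature are hypothetical stand-ins for illustration, not the code in mod.rs:

```rust
use std::collections::HashMap;
use std::sync::Arc;

/// Hypothetical stand-in for a pooled provider; only the name matters here.
#[derive(Debug)]
pub struct Provider {
    pub name: &'static str,
}

pub type ProviderMap = HashMap<String, Vec<Arc<Provider>>>;

/// Re-append-only merge: for each model in `discovered`, install the
/// discovered (NEAR) providers and re-append the pinned (Chutes) provider
/// behind them. Models absent from `discovered` are never touched, so a
/// partial batch cannot reset unrelated entries.
pub fn merge_discovered_and_pinned(
    model_to_providers: &mut ProviderMap,
    discovered: &ProviderMap,
    pinned_providers: &HashMap<String, Arc<Provider>>,
) {
    for (model, near) in discovered {
        let mut merged = near.clone();
        if let Some(pinned) = pinned_providers.get(model) {
            // Skip the push if the pinned Arc is already in the list.
            if !merged.iter().any(|p| Arc::ptr_eq(p, pinned)) {
                merged.push(Arc::clone(pinned));
            }
        }
        model_to_providers.insert(model.clone(), merged);
    }
}
```

With a batch containing only one model, every other NEAR+Chutes entry keeps its existing provider list, which is the property the regression test checks.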

Important — empty discovery

  • sync_inference_url_models now captures the (possibly empty) complete name set and still runs prune_stale_pinned. So when the last NEAR backend disappears, a Chutes-pinned model is rebuilt pinned-only (load_inference_url_models early-returns on empty; the prune does the rebuild). Test prune_stale_pinned_rebuilds_absent_pinned_only covers both "still served → coexist" and "dropped → pinned-only".
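
As an illustration of the complete-set prune, here is a sketch under the assumption that the caller passes the full set of discovered names (possibly empty). The types and signature are hypothetical, and the real function also handles pubkey and failure-counter cleanup:

```rust
use std::collections::{HashMap, HashSet};
use std::sync::Arc;

/// Hypothetical stand-in for a pooled provider.
#[derive(Debug)]
pub struct Provider {
    pub name: &'static str,
}

/// Complete-set prune: any pinned id missing from `complete_discovered` is
/// rebuilt as pinned-only. Returns the dropped (stale NEAR) Arcs so the
/// caller can clean up other indexes that still point at them.
pub fn prune_stale_pinned(
    model_to_providers: &mut HashMap<String, Vec<Arc<Provider>>>,
    pinned_providers: &HashMap<String, Arc<Provider>>,
    complete_discovered: &HashSet<String>,
) -> Vec<Arc<Provider>> {
    let mut dropped = Vec::new();
    for (model, pinned) in pinned_providers {
        if complete_discovered.contains(model) {
            continue; // still served by NEAR: NEAR and Chutes coexist
        }
        let entry = model_to_providers.entry(model.clone()).or_default();
        // Everything that is not the pinned Arc is a stale NEAR backend.
        dropped.extend(entry.iter().filter(|p| !Arc::ptr_eq(*p, pinned)).cloned());
        *entry = vec![Arc::clone(pinned)];
    }
    dropped
}
```

An empty `complete_discovered` is a valid input here: it means NEAR serves nothing, so every pinned model collapses to its Chutes provider.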

Important — cleanup

  • prune_stale_pinned collects the dropped NEAR Arc pointers and prunes them from pubkey_to_providers (and the failure counters), so they don't linger as orphaned / NoPubKeyProvider-routable entries.

Minor

  • nearai::Fleet now overrides tier() -> Near (defensive — Provider is what's pooled, but no misclassification if a Fleet is ever pooled directly).
  • Streaming ops are now an explicit STREAMING_OPERATIONS const (not a _stream suffix), with streaming_operations_list_is_explicit asserting chat_completion_stream is registered so a rename/new op can't silently bypass the filter.
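
For readers skimming the thread, the explicit list can be pictured roughly as follows. Only `STREAMING_OPERATIONS` and `chat_completion_stream` come from this PR; the helper name `is_streaming_operation` is an assumption made for the sketch:

```rust
/// Operations that stream their response. Membership is explicit, so a new
/// streaming op has to be added here rather than matched by a name suffix.
pub const STREAMING_OPERATIONS: &[&str] = &["chat_completion_stream"];

/// Hypothetical helper: true only for operations registered in the list.
pub fn is_streaming_operation(operation_name: &str) -> bool {
    STREAMING_OPERATIONS.contains(&operation_name)
}
```

Under this shape an op that merely ends in `_stream` is not treated as streaming until it is registered, which is what lets a test pin the list down.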

pool 62, inference_providers 297, config 26; clippy + fmt clean. Re-requesting.

Evrard-Nil requested a review from PierreLeGuen on June 11, 2026 12:21

PierreLeGuen left a comment

Contributor


The tiered NEAR-primary/Chutes-fallback design is sound and well-tested: NEAR inference_url backends stay tier Near and pass the verifiable filter, merge_discovered_and_pinned is partial-batch safe for the admin PATCH path, prune_stale_pinned only runs on complete-set refreshes, the model-pubkey (E2EE) routing path deliberately gets no Chutes fallback, and the streaming-capability filter correctly covers chat_completion_stream, the only streaming op routed through retry_with_fallback. New log lines carry only IDs/counts. Two issues should be addressed before merge.

Blocking

  • crates/services/src/inference_provider_pool/mod.rs:2030 — provider filtering before fallback only handles streaming capability, not client-side E2EE. A request with x-client-pub-key but no x-model-pub-key is valid on NEAR but rejected by Chutes at crates/inference_providers/src/attested/chutes/mod.rs:397 with a non-retryable "client-facing E2EE is not supported" error. If NEAR fails with a retryable 5xx/connection error, fallback reaches Chutes and that error masks the NEAR failure and suppresses further retry. Add a client-E2EE capability filter (mirroring the streaming one) or exclude Chutes when x-client-pub-key is present, plus a regression test.

Important

  • crates/services/src/inference_provider_pool/mod.rs:1907,3645 + crates/api/src/lib.rs:764 — a reserved-only pinned id (Chutes enabled but provider construction failed, or CHUTES_API_KEY missing — the reservation at lib.rs:764 runs before the key check) lands in pinned_models but not pinned_providers. remove_stale_providers skips everything in pinned_models, while prune_stale_pinned iterates only pinned_providers (early return at mod.rs:1913). So if NEAR later drops such a model from discovery, its dead nearai::Provider mapping, pubkey_to_providers entries, and failure counters linger for the process lifetime. The catalog is_active gate bounds the blast radius to staleness rather than wrong serving, but it breaks cleanup in exactly the failure mode the reservation was added for. Fix: in prune_stale_pinned, also process ids in pinned_models with no pinned_providers entry that are absent from the complete set — remove the mapping entirely (true fail-closed 404) — or make remove_stale_providers skip only ids present in pinned_providers.

Non-blocking

  • crates/services/src/inference_provider_pool/mod.rs:1907-1953: prune_stale_pinned strips dropped provider Arcs from pubkey_to_providers globally by pointer. Since inference_url_providers is keyed by URL, two model names sharing one inference URL share one Arc; pruning a stale backend for one pinned id also removes the pubkey entries the other, still-discovered model relies on, transiently breaking its pubkey-routed selection until the next refresh backfills. This mirrors the pre-existing pattern in remove_stale_providers and self-heals, but a guard (skip pointers still referenced by other models' model_to_providers) would be strictly correct.
  • crates/api/src/routes/admin.rs:463 — the is_pinned skip-unregister guard now covers every NEAR-served canonical id with a Chutes fallback, not just dedicated Chutes ids. A PATCH that changes/clears inference_url or deactivates such a model no longer runs unregister_provider cleanup, leaving stale pubkey_to_providers entries, failure counters, and the inference_url_providers cache entry until the next refresh tick; a runtime provider_type transition for these ids is silently ignored. Behavior stays safe (pubkey intersection plus catalog gating), but update the comment at admin.rs:456-462 and ideally let the URL-change case refresh cleanly.
  • crates/inference_providers/src/attested/chutes/mod.rs:705 — under one canonical id, a Chutes-fallback-served response has supports_chat_signatures = false, so signature availability becomes nondeterministic per request for a model whose NEAR-served responses are normally signable. This is the documented #758 trade-off, but it's now per-request visible on an existing model id and deserves client-facing documentation.

Checks run locally

  • cargo fmt --all -- --check — clean
  • cargo clippy --all-targets / --no-deps across services, config, inference_providers, api — clean
  • cargo test -p services --lib inference_provider_pool — 62 passed (includes new tier/merge/prune/streaming-filter/reservation tests)
  • cargo test -p config — 26 passed (includes CHUTES_MODELS parsing/dedup tests)
  • cargo test -p inference_providers --lib — 297 passed
  • cargo test --lib --bins — 375 passed, 1 ignored
  • cargo check -p api, git diff --check origin/main...HEAD — clean
  • GitHub CI (security_audit, Lint, Test Suite) green. E2E suites skipped locally — they require a running PostgreSQL instance; the unit tests cover the new logic directly.

Pierre round 3:

Blocking — client-side E2EE fallback gap (mirrors the streaming one):
- InferenceProvider gains supports_client_e2ee() (default true; Chutes false).
- retry_with_fallback_caps applies filter_client_e2ee_capable: when a request
  carries x_client_pub_key and a capable (NEAR) sibling exists, the non-capable
  Chutes provider is dropped — so a retryable NEAR failure can't fall through to
  Chutes' non-retryable "client E2EE not supported" rejection (which masked the
  NEAR error + suppressed retry). chat_completion[_stream] compute the flag from
  params.extra and call the _caps variant; the 15 other callers go through a thin
  retry_with_fallback wrapper (needs_client_e2ee=false), unchanged. + unit test.

Important — reserved-only pinned cleanup:
- remove_stale_providers now skips ids present in pinned_PROVIDERS, not all of
  pinned_models. A reserved-only id (reserved for the fail-closed external block
  but with no provider because Chutes failed to build / key missing) is no longer
  exempt: if NEAR also drops it, it's removed entirely (fail-closed 404) with full
  pubkey/lb/failure-counter cleanup, instead of lingering as a dead NEAR mapping.
  + unit test (reserved-only removed, real pinned survives).

Non-blocking:
- admin.rs: documented that the is_pinned skip-unregister now also covers
  NEAR+Chutes coexisting ids, so a runtime inference_url change leaves the stale
  old-url NEAR provider until the next refresh prunes it (self-heals; safe).
- chutes: documented the per-request signature-availability trade-off for a model
  that lists both a NEAR (signable) and a Chutes (unsigned) provider.
- (Deferred, noted) prune_stale_pinned prunes dropped Arcs from
  pubkey_to_providers globally by pointer, mirroring remove_stale_providers; a
  shared-Arc guard would be strictly correct, but the prune self-heals on the
  next refresh, consistent with existing behavior.

pool 64, inference_providers 297, config 26; clippy + fmt clean.

Evrard-Nil deployed to Cloud API test env on June 11, 2026 12:47 with GitHub Actions (Active)
@Evrard-Nil

Collaborator Author

Round 3 addressed in a459509b.

Blocking — client-side E2EE fallback gap
Same class as the streaming one, fixed the same way:

  • InferenceProvider::supports_client_e2ee() (default true; Chutes returns false).
  • retry_with_fallback_caps applies filter_client_e2ee_capable: when a request carries x_client_pub_key and a capable (NEAR) sibling exists, the non-capable Chutes provider is dropped — so a retryable NEAR failure can't fall through to Chutes' non-retryable "client E2EE not supported" rejection (which masked the NEAR error + suppressed retry). chat_completion/chat_completion_stream compute the flag from params.extra and call the _caps variant; the other 15 callers go through a thin retry_with_fallback wrapper (needs_client_e2ee=false) and are unchanged. Test filter_client_e2ee_capable_prefers_capable_providers.
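
The capability filter can be sketched like this. The trait method `supports_client_e2ee` and the function name follow the description above, while the `Near`/`Chutes` unit structs, the `name` method, and the signature are simplifications assumed for the example:

```rust
use std::sync::Arc;

pub trait InferenceProvider {
    fn name(&self) -> &str;
    /// Defaults to true; a provider without client-facing E2EE overrides it.
    fn supports_client_e2ee(&self) -> bool {
        true
    }
}

/// Hypothetical stand-ins for the two tiers.
pub struct Near;
pub struct Chutes;

impl InferenceProvider for Near {
    fn name(&self) -> &str {
        "near"
    }
}

impl InferenceProvider for Chutes {
    fn name(&self) -> &str {
        "chutes"
    }
    fn supports_client_e2ee(&self) -> bool {
        false
    }
}

/// Drops non-capable providers only when the request needs client E2EE AND
/// at least one capable sibling exists; otherwise the list is unchanged.
pub fn filter_client_e2ee_capable(
    providers: Vec<Arc<dyn InferenceProvider>>,
    needs_client_e2ee: bool,
) -> Vec<Arc<dyn InferenceProvider>> {
    if !needs_client_e2ee || !providers.iter().any(|p| p.supports_client_e2ee()) {
        return providers;
    }
    providers
        .into_iter()
        .filter(|p| p.supports_client_e2ee())
        .collect()
}
```

Leaving the list unchanged when no capable sibling exists means a Chutes-only model still reports its own rejection instead of ending up with an empty provider list.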

Important — reserved-only pinned cleanup
remove_stale_providers now skips ids present in pinned_providers (an actual pinned provider), not all of pinned_models. So a reserved-only id (reserved for the fail-closed external block but with no provider because Chutes failed to build / the key was missing) is no longer exempt: if NEAR also drops it, it's removed entirely (fail-closed 404) with full pubkey/lb/failure-counter cleanup, instead of lingering as a dead NEAR mapping. The reservation still blocks plaintext external (that check uses pinned_models, untouched). Test reserved_only_pinned_id_is_removed_when_stale_but_real_pinned_survives.
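
To make the pinned_models versus pinned_providers distinction concrete, here is a reduced sketch in which providers are plain strings and both pinned collections are sets of model ids. The signatures and the `external_load_allowed` helper are assumptions for illustration only:

```rust
use std::collections::{HashMap, HashSet};

/// Removes ids that discovery no longer serves, exempting only ids that have
/// an actual pinned provider. A reserved-only id (in `pinned_models` but not
/// in `pinned_providers`) is therefore removed like any other stale id.
pub fn remove_stale_providers(
    model_to_providers: &mut HashMap<String, Vec<String>>,
    pinned_providers: &HashSet<String>,
    discovered: &HashSet<String>,
) -> Vec<String> {
    let mut stale: Vec<String> = model_to_providers
        .keys()
        .filter(|id| !discovered.contains(*id) && !pinned_providers.contains(*id))
        .cloned()
        .collect();
    stale.sort(); // deterministic order for logging and tests
    for id in &stale {
        model_to_providers.remove(id);
    }
    stale
}

/// Hypothetical helper: the plaintext-external block still keys off
/// `pinned_models`, so a reservation keeps blocking even without a provider.
pub fn external_load_allowed(pinned_models: &HashSet<String>, id: &str) -> bool {
    !pinned_models.contains(id)
}
```

The two checks deliberately read different sets: cleanup follows what is actually pooled, while the fail-closed block follows what was reserved.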

Non-blocking

  • admin.rs: documented that the is_pinned skip-unregister now also covers NEAR+Chutes coexisting ids, so a runtime inference_url change leaves the stale old-url NEAR provider until the next refresh prunes it (self-heals; pubkey-intersection + catalog gating keep it safe).
  • chutes: documented the per-request signature-availability trade-off for a model listing both a NEAR (signable) and a Chutes (unsigned) provider.
  • The shared-Arc pubkey-prune guard: I left prune_stale_pinned mirroring remove_stale_providers' existing global-by-pointer prune (it self-heals on the next refresh, exactly as you noted) rather than adding the guard to one path only — happy to add it to both as a separate cleanup if you'd prefer.

pool 64, inference_providers 297, config 26; clippy + fmt clean. Re-requesting.

Evrard-Nil requested a review from PierreLeGuen on June 11, 2026 12:47

PierreLeGuen left a comment

Contributor


Solid iteration overall — the fail-closed reservation before external load, the partial-batch-safe merge_discovered_and_pinned, the verifiable→never-plaintext filter at selection time, and the capability filters that only drop a provider when a capable sibling exists are all well constructed, and the new pool tests cover them well. The #758 invariant (discovery can never substitute the attested provider) is preserved, and the pubkey-routed E2EE path is unaffected since Chutes never enters pubkey_to_providers. One substantive gap remains, which is why I'm requesting changes.

Blocking

  • crates/inference_providers/src/attested/chutes/mod.rs:433, with pass-through at :465 (non-streaming raw_bytes) and :516 (each decrypted stream chunk): the provider receives the internal chute slug and pins it into the upstream request, then returns the decrypted upstream body verbatim — so response.model carries the slug (e.g. zai-org/GLM-5.1-TEE) instead of the canonical id the client requested. crates/api/src/routes/completions.rs:1601 and :3299 forward these bytes unmodified. For a NEAR-primary model this happens on every fallback-served response; for a Chutes-only model with canonical_id≠chute_slug (the documented OpenRouter-id case) it happens on every response. This breaks the PR's own canonical-id contract ("never the raw -TEE slug") and the expectation that response.model matches an id listable in /v1/models; response shape also ends up depending on which tier served the request. Since supports_chat_signatures is false for Chutes, rewriting model in the decrypted plaintext doesn't sacrifice any hash/signature guarantee. Suggested fix: carry both the canonical id and the chute slug — use the slug only for resolution and the upstream call, and rewrite model to the canonical id in chat_completion and in the stream decoder before returning.
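
One way to read the suggested fix, as a sketch only: it assumes the decrypted body has already been parsed into a typed value, whereas the real path forwards raw bytes and would need to decode and re-encode. `ChutesModelIds`, `ChatChunk`, and both method names are hypothetical:

```rust
/// Hypothetical pair carried by the provider: the public id and the slug.
#[derive(Debug, Clone, PartialEq)]
pub struct ChutesModelIds {
    pub canonical_id: String,
    pub chute_slug: String,
}

/// Hypothetical, heavily reduced response or stream-chunk shape.
#[derive(Debug, Clone, PartialEq)]
pub struct ChatChunk {
    pub model: String,
    pub content: String,
}

impl ChutesModelIds {
    /// The slug is used only for resolution and the upstream request body.
    pub fn upstream_model(&self) -> &str {
        &self.chute_slug
    }

    /// Rewrites `model` to the canonical id before the chunk leaves the
    /// provider, so clients never see the raw slug.
    pub fn to_client(&self, mut chunk: ChatChunk) -> ChatChunk {
        chunk.model = self.canonical_id.clone();
        chunk
    }
}
```

Applying `to_client` in both the non-streaming return path and the stream decoder would keep `response.model` independent of which tier served the request.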

Non-blocking

  • crates/services/src/inference_provider_pool/mod.rs (prune_stale_pinned): the needs_rewrite check uses cur.len() != pinned.len() as a proxy for set equality. That's only correct while the current list is guaranteed a superset of the pinned list; a future partial overwrite would silently skip the rebuild. Compare the Arc-pointer sets directly — you already compute them for dropped.
  • Same function, cleanup path: dropped prunes pubkey_to_providers and failure counters by Arc pointer without checking whether the same Arc still backs another discovery-served model (providers are cached per-URL; see the shared-Arc comment near mod.rs:3263). If two model names share an inference_url and only one is in CHUTES_MODELS, that model leaving discovery breaks pubkey-routed requests for the sibling until the next refresh (~one cycle). Consider excluding pointers still referenced by other model_to_providers entries.
  • crates/api/src/routes/admin.rs:463-474: the is_pinned skip-unregister guard means a runtime inference_url change on a pinned model never registers the new URL's provider during the PATCH — the periodic refresh eventually applies it, but the window is unbounded if refresh_interval_secs is long or refresh is disabled, leaving the model de facto Chutes-only until restart. Add an unregister→re-register path for the URL-change case inside the guard, or document the window.
  • crates/config/src/types.rs:258: the duplicate-CHUTES_MODELS warning uses eprintln!; use tracing::warn! so it reaches the logging pipeline in containerized deployments.
  • crates/services/src/inference_provider_pool/mod.rs:651: register_pinned_provider no longer has production callers (tests and crates/api/tests/e2e_all/chutes_catalog.rs only); production uses register_pinned_secondary_provider exclusively. Two registration paths with replace-vs-push semantics is a maintenance hazard — migrate the tests and remove it, or mark it test-only.
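
The first two points could look roughly like this, assuming concrete `Arc<Provider>` values (with `Arc<dyn Trait>` the pointers would need a cast to a thin pointer before comparison). `Provider`, `needs_rewrite` as a free function, and `exclusively_dropped` are illustrative names, not existing code:

```rust
use std::collections::{HashMap, HashSet};
use std::sync::Arc;

/// Hypothetical stand-in for a pooled provider.
#[derive(Debug)]
pub struct Provider {
    pub name: &'static str,
}

/// Identity set of a provider list, keyed by allocation address.
fn ptr_set(list: &[Arc<Provider>]) -> HashSet<*const Provider> {
    list.iter().map(|p| Arc::as_ptr(p)).collect()
}

/// True when the current list is not exactly the pinned list, compared by
/// Arc identity instead of by length.
pub fn needs_rewrite(cur: &[Arc<Provider>], pinned: &[Arc<Provider>]) -> bool {
    ptr_set(cur) != ptr_set(pinned)
}

/// Shared-Arc guard: keep only dropped Arcs that no remaining model entry
/// still references, so a sibling model sharing the Arc is not pruned.
pub fn exclusively_dropped(
    dropped: Vec<Arc<Provider>>,
    model_to_providers: &HashMap<String, Vec<Arc<Provider>>>,
) -> Vec<Arc<Provider>> {
    let live: HashSet<*const Provider> = model_to_providers
        .values()
        .flat_map(|list| list.iter().map(|p| Arc::as_ptr(p)))
        .collect();
    dropped
        .into_iter()
        .filter(|p| !live.contains(&Arc::as_ptr(p)))
        .collect()
}
```

The guard is meant to run after the pinned entry has been rewritten, so `model_to_providers` reflects only the references that survive the prune.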

Checks run locally: cargo fmt --all -- --check and cargo clippy (config, services, inference_providers, api) clean; cargo test -p config --lib 26 passed (incl. 3 new CHUTES_MODELS parsing tests); cargo test -p inference_providers --lib 297 passed; cargo test -p services --lib 377 passed, 1 ignored (incl. all 64 inference_provider_pool tests and the new tier/merge/prune/capability tests). Skipped: e2e tests (need PostgreSQL) and vLLM integration tests (need a live server). GitHub CI on the PR head: Lint, Test Suite, and security_audit all green.



3 participants