feat(chutes): tiered NEAR-primary/Chutes-fallback across all TEE models#768
feat(chutes): tiered NEAR-primary/Chutes-fallback across all TEE models#768Evrard-Nil wants to merge 4 commits into
Conversation
Turns Chutes from a single pinned provider (registered under its raw `-TEE` slug) into a fallback fabric for every Chutes TEE model, under canonical ids, with NEAR-primary routing and a hard "verifiable models never serve plaintext" rule. Config (crates/config/src/types.rs) - CHUTES_MODELS is now comma-separated `canonical_id=chute_slug` pairs (a bare entry means canonical == slug), parsed into Vec<ChutesModelEntry>. The canonical id is the user-facing/routing id (NEAR-served id when NEAR also serves the model, e.g. zai-org/GLM-5.1-FP8, else the OpenRouter id); the chute slug (zai-org/GLM-5.1-TEE) is the internal upstream identity. Provider tier (crates/inference_providers) - Wire the dormant ProviderTier into the InferenceProvider trait: new `fn tier(&self)` (default NonAttested); nearai::Provider => Near, chutes::Provider => Attested3p. MockProvider gains `with_tier` for tests. Routing (crates/services/.../inference_provider_pool) - register_pinned_secondary_provider: PUSHES a pinned (Chutes) provider onto a model's list instead of overwriting, so it coexists with NEAR's own DB-discovered providers; recorded in a new `pinned_providers` map. - Refresh merge re-attaches pinned providers after rebuilding the NEAR side, so a discovery tick refreshes NEAR backends while NEVER dropping/overwriting the attested Chutes fallback (replaces the old skip-if-pinned behavior). - get_providers_with_fallback orders providers by tier (Near < Attested3p < NonAttested) within each health group, so NEAR is primary and Chutes is the per-request fallback; a Chutes-only id has just the one provider (primary). - VERIFIABLE SAFETY: if a model has any attested provider, all NonAttested providers are dropped from selection — a verifiable model fails closed rather than silently downgrading to a plaintext provider. Catalog / registration (crates/api/src/lib.rs) - Seed/register under the CANONICAL id (the provider still talks to Chutes with the chute slug, which request_body pins). A NEAR-served canonical id keeps its existing row and simply gains Chutes as a fallback provider. Streaming stays gated (CHUTES_ENABLE_STREAMING) pending a staging check that Chutes emits the [DONE] terminator inside the encrypted channel. Tests: config pair-parsing; pool tier ordering, Chutes-only primary, non-attested-only still served, NEAR+Chutes coexistence, and the verifiable-never-plaintext filter.
There was a problem hiding this comment.
Code Review
This pull request introduces a tiered trust system for inference providers, allowing the system to prefer NEAR-served attested providers (Near) and fall back to attested third-party providers like Chutes (Attested3p), while ensuring that verifiable models never fall back to non-attested (plaintext) providers. It also updates the configuration to support mapping user-facing canonical IDs to internal Chutes slugs (canonical_id=chute_slug) and adds robust unit tests for the new routing, fallback, and parsing logic. The review feedback suggests improving the idiomaticity and performance of the provider deduplication logic by replacing raw pointer casts and unnecessary HashSet allocations with Arc::ptr_eq.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
| let ptr = Arc::as_ptr(&provider) as *const () as usize; | ||
| if !entry | ||
| .iter() | ||
| .any(|p| Arc::as_ptr(p) as *const () as usize == ptr) | ||
| { | ||
| entry.push(provider); | ||
| } |
There was a problem hiding this comment.
| let present: std::collections::HashSet<usize> = providers | ||
| .iter() | ||
| .map(|p| Arc::as_ptr(p) as *const () as usize) | ||
| .collect(); | ||
| for p in extra { | ||
| if !present.contains(&(Arc::as_ptr(p) as *const () as usize)) { | ||
| providers.push(p.clone()); | ||
| } | ||
| } |
There was a problem hiding this comment.
Instead of allocating a HashSet and performing raw pointer casts to deduplicate the providers list, you can perform a simple linear search using Arc::ptr_eq. Since the number of providers per model is typically very small (usually 1 or 2), a linear search is more idiomatic, avoids heap allocation, and is faster in practice.
for p in extra {
if !providers.iter().any(|existing| Arc::ptr_eq(existing, p)) {
providers.push(p.clone());
}
}
Review — tiered NEAR/Chutes fallbackTraced the four moving parts (config parsing, tier wiring,
No critical/blocking issues found. Two minor, non-blocking notes:
✅ Approved. |
There was a problem hiding this comment.
Pull request overview
This PR extends the Chutes attested provider integration from a single pinned -TEE model to a tiered, canonical-model routing strategy across all configured Chutes TEE models, making NEAR the primary provider where available and Chutes an attested fallback, while ensuring “verifiable models never fall back to plaintext.”
Changes:
- Adds
ProviderTier-driven provider ordering and “verifiable ⇒ drop non-attested providers” selection logic in the inference provider pool. - Introduces canonical-id-to-chute-slug mapping for
CHUTES_MODELSand seeds/attaches Chutes providers under canonical model IDs as secondary providers. - Updates provider implementations/tests/mocks to support tier reporting and validates config parsing + pool behavior.
Reviewed changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| env.example | Documents CHUTES_MODELS as canonical_id=chute_slug pairs and explains tiered routing + plaintext fallback rules. |
| crates/services/src/inference_provider_pool/mod.rs | Adds secondary pinned providers, tier-ordered selection, verifiable plaintext filtering, and refresh-time re-attachment of pinned providers. |
| crates/inference_providers/src/mock.rs | Adds configurable ProviderTier to MockProvider for pool selection tests. |
| crates/inference_providers/src/lib.rs | Extends InferenceProvider trait with a default tier() method. |
| crates/inference_providers/src/attested/nearai/mod.rs | Marks NEAR provider tier as Near. |
| crates/inference_providers/src/attested/chutes/mod.rs | Marks Chutes provider tier as Attested3p. |
| crates/config/src/types.rs | Parses CHUTES_MODELS into ChutesModelEntry { canonical_id, chute_slug } and adds tests. |
| crates/api/src/lib.rs | Registers Chutes providers under canonical IDs while sending chute slugs upstream; uses secondary provider registration. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| let pinned_providers = self | ||
| .pinned_providers | ||
| .read() | ||
| .unwrap_or_else(|e| e.into_inner()) | ||
| .clone(); | ||
| for (model_name, providers) in model_providers { | ||
| if pinned.contains(&model_name) { | ||
| warn!( | ||
| model = %model_name, | ||
| "DB discovery attempted to overwrite a pinned (attested) provider; ignoring" | ||
| ); | ||
| continue; | ||
| for (model_name, mut providers) in model_providers { |
| let present: std::collections::HashSet<usize> = providers | ||
| .iter() | ||
| .map(|p| Arc::as_ptr(p) as *const () as usize) | ||
| .collect(); | ||
| for p in extra { | ||
| if !present.contains(&(Arc::as_ptr(p) as *const () as usize)) { | ||
| providers.push(p.clone()); | ||
| } | ||
| } |
PierreLeGuen
left a comment
There was a problem hiding this comment.
This is a well-structured change — the canonical-id/slug split is threaded correctly (slug goes upstream in the request body, canonical id drives routing/catalog), the verifiable-never-plaintext filter sits on the single selection path all completion ops use, and the new config format is well documented in env.example. Three issues need addressing before merge, though.
Blocking / important
-
Fail-closed gap when Chutes registration fails —
crates/services/src/inference_provider_pool/mod.rs:1427. The plaintext filter only kicks in once an attested provider is registered. With the canonical IDs, an active external/OpenRouter row can share a model ID with a configured Chutes model; if Chutes is enabled but its API key is missing or provider construction fails, the model never gets pinned, the external row survivesload_external_providers(thepinned_modelsguard there only protects models that actually got pinned), and plaintext is served for a model configured as verifiable. Reserve configured Chutes canonical IDs before external provider loading, or make Chutes config/registration failures fatal for those IDs — and add a regression test for the collision case. -
Streaming requests regress while
CHUTES_ENABLE_STREAMING=false(the default) — mod.rs:1383-1517 withcrates/inference_providers/src/attested/chutes/mod.rs:481-488. Selection has no streaming-capability filter, so the Chutes provider is included for streaming requests it categorically refuses. Its refusal is aCompletionErrormatching no retry keyword, and since Chutes is tried last it overwrites bothlast_errorandlast_retry_decisioninretry_with_fallback: a transient NEAR 5xx that previously got exponential-backoff retries now gets none, and the client sees "set CHUTES_ENABLE_STREAMING" instead of the real NEAR error. Filter providers that can't serve the requested operation at selection time, or treat capability refusals as skip-without-clobbering. -
Legacy
register_pinned_providersilently lost its overwrite protection — mod.rs:3394-3407 with mod.rs:651-662. The refresh merge now re-attaches only providers recorded inpinned_providers, butregister_pinned_providerwrites onlypinned_models, so a same-name DBinference_urlrow overwrites a legacy-pinned provider on the next refresh — the exact substitution the removed guard existed to prevent. It's latent (production no longer calls the legacy function after this PR), but it's a broken security invariant on a still-public API that the e2e tests still use. Either have it populatepinned_providerstoo, or delete it and migrate its callers; either way, add an overwrite test on theload_inference_url_modelspath.
Non-blocking
- mod.rs:686-691 —
register_pinned_secondary_providerpushes intopinned_providersunconditionally, so duplicate canonical ids inCHUTES_MODELSaccumulate duplicate providers that every refresh re-attaches. Dedup by Arc pointer there too, or reject duplicate canonical ids duringCHUTES_MODELSparsing. - mod.rs:3394-3407 / 3483-3498 — if an operator drops the NEAR
inference_urlrow but keeps the Chutes config, the model's entry is never rebuilt (absent frommodel_providers) and never stale-removed (pinned), so dead NEAR providers stay tier-first until per-provider failure demotion and are only pruned at restart. Consider rebuilding pinned models absent from fresh discovery as pinned-only. - mod.rs:1456 — round-robin rotation is applied over the full provider list before the stable tier sort, so with multiple NEAR providers plus a fallback the rotation skews same-tier balancing (e.g. with two NEAR providers and one Chutes, one NEAR provider leads two cycles out of three). Tier-group first and rotate within each tier.
- E2EE/pubkey-routed requests get no Chutes fallback, since pinned secondaries are never added to
pubkey_to_providers. That looks intentional (Chutes has no signing keys), but a comment stating it would help. - The refresh-merge re-attachment — the riskiest part of the change — has no direct test; the four new pool tests cover registration and selection ordering only. A test driving the actual
load_inference_url_modelsmerge with a pinned secondary would lock in the "discovery can never drop the Chutes provider" invariant.
Checks run locally: cargo fmt --all -- --check clean; cargo clippy clean across services/config/inference_providers/api; cargo test -p config --lib 25 passed (incl. the two new CHUTES_MODELS parsing tests); cargo test -p services --lib inference_provider_pool 57 passed (incl. the four new tier/coexistence tests); cargo test -p inference_providers --lib 297 passed; cargo test -p api --no-run compiles all test targets against the new Vec<ChutesModelEntry> type. The chutes_catalog e2e tests fail for environmental reasons only (they need a dstack TEE unix socket absent outside a CVM; all e2e tests in this checkout are blocked identically). GitHub CI (Lint, Test Suite, security audit) was green at last read.
Blocking: - Fail-closed reservation (Pierre #1): reserve every configured Chutes canonical id in the pool BEFORE external/discovery load (api init), via new pool.reserve_pinned_models. A plaintext external/OpenRouter row sharing a canonical id can no longer register even if the Chutes provider fails to build (missing key / construction error); the id serves only attested providers or fails closed (404), never plaintext. + regression test. - Streaming capability filter (Pierre #2 / review F1): InferenceProvider gains supports_streaming() (default true; Chutes returns allow_streaming). For streaming ops, retry_with_fallback drops streaming-incapable providers when a capable sibling exists (extracted to filter_streaming_capable), so a NEAR 5xx no longer falls through to Chutes' "streaming disabled" error — which masked the real error and suppressed NEAR's backoff retry. + unit test. - Legacy register_pinned_provider (Pierre #3): now also records in pinned_providers, so the refresh merge re-attaches it — restoring the overwrite-protection the old skip-if-pinned guard gave (the secondary path already did this). Non-blocking: - Dedup duplicate canonical ids in CHUTES_MODELS parsing (first wins + warn); symmetric Arc dedup in register_pinned_secondary_provider. + test. - Stale-NEAR prune: a pinned canonical id absent from a (complete) discovery cycle is rebuilt pinned-only, so a dead NEAR backend isn't tried first until demotion. - Round-robin within the leading tier: order by (health, tier) first, then rotate the leading group — fixes same-tier load skew from rotating the full list. - Comment that pinned secondaries are intentionally absent from pubkey routing (E2EE path has no Chutes fallback — Chutes has no signing key). The refresh merge (re-append + stale-prune) is extracted to a pure merge_discovered_and_pinned helper with a direct unit test — the riskiest path Pierre flagged as untested.
|
Thanks for the thorough review, Pierre — all three blocking issues + the five non-blocking ones are addressed in Blocking
Non-blocking
All green locally: |
PierreLeGuen
left a comment
There was a problem hiding this comment.
The tiered NEAR-primary/Chutes-fallback design is sound and well-tested on the happy path: reserve_pinned_models runs before any external load so plaintext rows can't claim a canonical id (fail-closed), the verifiable-tier filter prevents silent downgrade to non-attested providers, and discovery merges re-append pinned providers rather than overwriting them. However, the merge logic has real gaps around its "complete discovery set" precondition that affect existing call paths.
Blocking
crates/services/src/inference_provider_pool/mod.rs:1888-1893, callercrates/api/src/routes/admin.rs:501—merge_discovered_and_pinnedrebuilds every pinned canonical id absent fromdiscoveredas pinned-only, but the adminPATCH /v1/admin/modelsruntime path callsload_inference_url_modelswith only the models in the PATCH batch, violating the documented complete-set precondition. Any admin PATCH carrying aninference_urloverwrites themodel_to_providersentry of every Chutes-enabled, NEAR-served model not in the batch to[Chutes only]. Until the next periodic refresh (up torefresh_interval_secs, default 900s; indefinitely if refresh is disabled), traffic for those models silently shifts to the Chutes fallback — inverting the PR's routing goal — and E2EE pubkey-routed requests hard-fail withNoPubKeyProvider, since Chutes intentionally has nopubkey_to_providersentry. Fix: gate the absent-pinned rebuild on an explicitcomplete_set: bool(true only on init and the discovery/refresh path), or give the admin path a variant that merges pinned providers only for the models actually in the batch. Please add a regression test for a partial-batch call with an unrelated pinned model — the new merge tests only cover complete sets.
Important
crates/services/src/inference_provider_pool/mod.rs:2884— the inverse edge case:load_inference_url_modelsreturns early on an empty discovery set, so when the last NEAR backend disappears the pinned merge never runs at all. Sinceremove_stale_providersexempts pinned names, a Chutes-pinned model that previously had NEAR+Chutes keeps the stale NEAR provider inmodel_to_providersuntil restart, breaking the "pinned id absent from discovery is rebuilt pinned-only" invariant. Treat an empty fetch as a complete empty set: still run the merge withHashMap::new()and include the dropped providers in cleanup.crates/services/src/inference_provider_pool/mod.rs:3468-3478— related cleanup gap: when the merge rebuilds a pinned id as pinned-only, the dropped NEAR provider Arcs are never pruned frompubkey_to_providers(old_provider_ptrsonly covers keys present in this cycle's discovery, andremove_stale_providersskips pinned names). Orphaned entries keep stale pubkeys routable and surface asNoPubKeyProviderinstead of a clean miss. Collect the pointers dropped by the absent-pinned branch and prune them in the same atomic update.
Minor
crates/inference_providers/src/attested/nearai/mod.rs:1068—FleetimplementsInferenceProviderbut doesn't overridetier(), so it reportsNonAttested. Not registered in the pool today, but if it ever is, the verifiable filter would classify NEAR's own fleet as plaintext. A one-linetier()override closes the latent footgun.crates/services/src/inference_provider_pool/mod.rs:1842— the streaming-capability filter keys offoperation_name.ends_with("_stream"). Onlychat_completion_streamexists today, but a future streaming op that breaks the suffix convention silently skips the filter. An explicitis_streamingflag onretry_with_fallback, or a const list of streaming ops asserted in a test, would make this fail loudly.
Checks run locally: CI (Lint, Test Suite, security_audit) green on the PR; cargo fmt --all -- --check clean; clippy clean on the touched crates; cargo test -p config 26/26 (incl. new CHUTES_MODELS parsing tests); cargo test -p services pool suite 60/60 (incl. new tier/merge/streaming-filter/reservation tests); cargo test -p inference_providers 297/297; cargo test -p api 137/137; combined services/inference_providers/config run 373/373. Skipped: e2e (needs PostgreSQL) and vLLM integration tests (need a live inference server). Note the blocking and important findings above are not exercised by any current test.
…inors Pierre round 2 — the F2 stale-prune introduced a real regression and gaps: Blocking — admin-PATCH partial batch: - merge_discovered_and_pinned is now RE-APPEND ONLY and touches only the models in `discovered`, so it's safe for a partial batch. The admin PATCH path (load_inference_url_models with just the batch) can no longer reset unrelated NEAR+Chutes models to Chutes-only. - The stale-prune moved to a complete-set-only prune_stale_pinned, invoked ONLY from sync_inference_url_models (the periodic refresh, which passes the full fetch). Regression test: merge leaves absent pinned ids untouched. Important: - Empty discovery: sync_inference_url_models captures the (possibly empty) complete name set and still runs prune_stale_pinned, so when the last NEAR backend disappears a Chutes-pinned model is correctly rebuilt pinned-only (load_inference_url_models early-returns on empty; the prune does the rebuild). - Cleanup: prune_stale_pinned prunes the dropped NEAR Arcs from pubkey_to_providers and the failure counters, so they don't linger as orphaned / NoPubKeyProvider-routable entries. Minor: - nearai::Fleet now overrides tier() -> Near (defensive; Provider is what's pooled). - Streaming ops are an explicit STREAMING_OPERATIONS const (not a "_stream" name suffix), with a test asserting chat_completion_stream is registered. Tests: prune_stale_pinned_rebuilds_absent_pinned_only, merge_discovered_and_pinned_reappends_and_leaves_absent_pinned_untouched, streaming_operations_list_is_explicit. pool 62, inference_providers 297, clippy+fmt clean.
|
Round 2 addressed in Blocking — admin-PATCH partial batch
Important — empty discovery
Important — cleanup
Minor
pool 62, inference_providers 297, config 26; clippy + fmt clean. Re-requesting. |
PierreLeGuen
left a comment
There was a problem hiding this comment.
The tiered NEAR-primary/Chutes-fallback design is sound and well-tested: NEAR inference_url backends stay tier Near and pass the verifiable filter, merge_discovered_and_pinned is partial-batch safe for the admin PATCH path, prune_stale_pinned only runs on complete-set refreshes, the model-pubkey (E2EE) routing path deliberately gets no Chutes fallback, and the streaming-capability filter correctly covers chat_completion_stream, the only streaming op routed through retry_with_fallback. New log lines carry only IDs/counts. Two issues should be addressed before merge.
Blocking
crates/services/src/inference_provider_pool/mod.rs:2030— provider filtering before fallback only handles streaming capability, not client-side E2EE. A request withx-client-pub-keybut nox-model-pub-keyis valid on NEAR but rejected by Chutes atcrates/inference_providers/src/attested/chutes/mod.rs:397with a non-retryable "client-facing E2EE is not supported" error. If NEAR fails with a retryable 5xx/connection error, fallback reaches Chutes and that error masks the NEAR failure and suppresses further retry. Add a client-E2EE capability filter (mirroring the streaming one) or exclude Chutes whenx-client-pub-keyis present, plus a regression test.
Important
crates/services/src/inference_provider_pool/mod.rs:1907,3645+crates/api/src/lib.rs:764— a reserved-only pinned id (Chutes enabled but provider construction failed, orCHUTES_API_KEYmissing — the reservation at lib.rs:764 runs before the key check) lands inpinned_modelsbut notpinned_providers.remove_stale_providersskips everything inpinned_models, whileprune_stale_pinnediterates onlypinned_providers(early return at mod.rs:1913). So if NEAR later drops such a model from discovery, its deadnearai::Providermapping,pubkey_to_providersentries, and failure counters linger for the process lifetime. The catalogis_activegate bounds the blast radius to staleness rather than wrong serving, but it breaks cleanup in exactly the failure mode the reservation was added for. Fix: inprune_stale_pinned, also process ids inpinned_modelswith nopinned_providersentry that are absent from the complete set — remove the mapping entirely (true fail-closed 404) — or makeremove_stale_providersskip only ids present inpinned_providers.
Non-blocking
crates/services/src/inference_provider_pool/mod.rs:1907-1953—prune_stale_pinnedstrips dropped provider Arcs frompubkey_to_providersglobally by pointer. Sinceinference_url_providersis keyed by URL, two model names sharing one inference URL share one Arc; pruning a stale backend for one pinned id also removes the pubkey entries the other, still-discovered model relies on, transiently breaking its pubkey-routed selection until the next refresh backfills. This mirrors the pre-existing pattern inremove_stale_providersand self-heals, but a guard (skip pointers still referenced by other models'model_to_providers) would be strictly correct.crates/api/src/routes/admin.rs:463— theis_pinnedskip-unregister guard now covers every NEAR-served canonical id with a Chutes fallback, not just dedicated Chutes ids. A PATCH that changes/clearsinference_urlor deactivates such a model no longer runsunregister_providercleanup, leaving stalepubkey_to_providersentries, failure counters, and theinference_url_providerscache entry until the next refresh tick; a runtimeprovider_typetransition for these ids is silently ignored. Behavior stays safe (pubkey intersection plus catalog gating), but update the comment at admin.rs:456-462 and ideally let the URL-change case refresh cleanly.crates/inference_providers/src/attested/chutes/mod.rs:705— under one canonical id, a Chutes-fallback-served response hassupports_chat_signatures = false, so signature availability becomes nondeterministic per request for a model whose NEAR-served responses are normally signable. This is the documented #758 trade-off, but it's now per-request visible on an existing model id and deserves client-facing documentation.
Checks run locally
cargo fmt --all -- --check— cleancargo clippy --all-targets/--no-depsacrossservices,config,inference_providers,api— cleancargo test -p services --lib inference_provider_pool— 62 passed (includes new tier/merge/prune/streaming-filter/reservation tests)cargo test -p config— 26 passed (includesCHUTES_MODELSparsing/dedup tests)cargo test -p inference_providers --lib— 297 passedcargo test --lib --bins— 375 passed, 1 ignoredcargo check -p api,git diff --check origin/main...HEAD— clean- GitHub CI (security_audit, Lint, Test Suite) green. E2E suites skipped locally — they require a running PostgreSQL instance; the unit tests cover the new logic directly.
Pierre round 3: Blocking — client-side E2EE fallback gap (mirrors the streaming one): - InferenceProvider gains supports_client_e2ee() (default true; Chutes false). - retry_with_fallback_caps applies filter_client_e2ee_capable: when a request carries x_client_pub_key and a capable (NEAR) sibling exists, the non-capable Chutes provider is dropped — so a retryable NEAR failure can't fall through to Chutes' non-retryable "client E2EE not supported" rejection (which masked the NEAR error + suppressed retry). chat_completion[_stream] compute the flag from params.extra and call the _caps variant; the 15 other callers go through a thin retry_with_fallback wrapper (needs_client_e2ee=false), unchanged. + unit test. Important — reserved-only pinned cleanup: - remove_stale_providers now skips ids present in pinned_PROVIDERS, not all of pinned_models. A reserved-only id (reserved for the fail-closed external block but with no provider because Chutes failed to build / key missing) is no longer exempt: if NEAR also drops it, it's removed entirely (fail-closed 404) with full pubkey/lb/failure-counter cleanup, instead of lingering as a dead NEAR mapping. + unit test (reserved-only removed, real pinned survives). Non-blocking: - admin.rs: documented that the is_pinned skip-unregister now also covers NEAR+Chutes coexisting ids, so a runtime inference_url change leaves the stale old-url NEAR provider until the next refresh prunes it (self-heals; safe). - chutes: documented the per-request signature-availability trade-off for a model that lists both a NEAR (signable) and a Chutes (unsigned) provider. - (Deferred, noted) prune_stale_pinned prunes dropped Arcs from pubkey globally by pointer, mirroring remove_stale_providers; a shared-Arc guard would be strictly correct but it self-heals on the next refresh, consistent with existing behavior. pool 64, inference_providers 297, config 26; clippy + fmt clean.
|
Round 3 addressed in Blocking — client-side E2EE fallback gap
Important — reserved-only pinned cleanup Non-blocking
pool 64, inference_providers 297, config 26; clippy + fmt clean. Re-requesting. |
PierreLeGuen
left a comment
There was a problem hiding this comment.
Solid iteration overall — the fail-closed reservation before external load, the partial-batch-safe merge_discovered_and_pinned, the verifiable→never-plaintext filter at selection time, and the capability filters that only drop a provider when a capable sibling exists are all well constructed, and the new pool tests cover them well. The #758 invariant (discovery can never substitute the attested provider) is preserved, and the pubkey-routed E2EE path is unaffected since Chutes never enters pubkey_to_providers. One substantive gap remains, which is why I'm requesting changes.
Blocking
crates/inference_providers/src/attested/chutes/mod.rs:433, with pass-through at:465(non-streamingraw_bytes) and:516(each decrypted stream chunk): the provider receives the internal chute slug and pins it into the upstream request, then returns the decrypted upstream body verbatim — soresponse.modelcarries the slug (e.g.zai-org/GLM-5.1-TEE) instead of the canonical id the client requested.crates/api/src/routes/completions.rs:1601and:3299forward these bytes unmodified. For a NEAR-primary model this happens on every fallback-served response; for a Chutes-only model withcanonical_id≠chute_slug(the documented OpenRouter-id case) it happens on every response. This breaks the PR's own canonical-id contract ("never the raw-TEEslug") and the expectation thatresponse.modelmatches an id listable in/v1/models; response shape also ends up depending on which tier served the request. Sincesupports_chat_signaturesisfalsefor Chutes, rewritingmodelin the decrypted plaintext doesn't sacrifice any hash/signature guarantee. Suggested fix: carry both the canonical id and the chute slug — use the slug only for resolution and the upstream call, and rewritemodelto the canonical id inchat_completionand in the stream decoder before returning.
Non-blocking
crates/services/src/inference_provider_pool/mod.rs(prune_stale_pinned): theneeds_rewritecheck usescur.len() != pinned.len()as a proxy for set equality. That's only correct while the current list is guaranteed a superset of the pinned list; a future partial overwrite would silently skip the rebuild. Compare the Arc-pointer sets directly — you already compute them fordropped.- Same function, cleanup path:
droppedprunespubkey_to_providersand failure counters by Arc pointer without checking whether the same Arc still backs another discovery-served model (providers are cached per-URL; see the shared-Arc comment near mod.rs:3263). If two model names share aninference_urland only one is inCHUTES_MODELS, that model leaving discovery breaks pubkey-routed requests for the sibling until the next refresh (~one cycle). Consider excluding pointers still referenced by othermodel_to_providersentries. crates/api/src/routes/admin.rs:463-474: theis_pinnedskip-unregister guard means a runtimeinference_urlchange on a pinned model never registers the new URL's provider during the PATCH — the periodic refresh eventually applies it, but the window is unbounded ifrefresh_interval_secsis long or refresh is disabled, leaving the model de facto Chutes-only until restart. Add an unregister→re-register path for the URL-change case inside the guard, or document the window.crates/config/src/types.rs:258: the duplicate-CHUTES_MODELSwarning useseprintln!; usetracing::warn!so it reaches the logging pipeline in containerized deployments.crates/services/src/inference_provider_pool/mod.rs:651:register_pinned_providerno longer has production callers (tests andcrates/api/tests/e2e_all/chutes_catalog.rsonly); production usesregister_pinned_secondary_providerexclusively. Two registration paths with replace-vs-push semantics is a maintenance hazard — migrate the tests and remove it, or mark it test-only.
Checks run locally: cargo fmt --all -- --check and cargo clippy (config, services, inference_providers, api) clean; cargo test -p config --lib 26 passed (incl. 3 new CHUTES_MODELS parsing tests); cargo test -p inference_providers --lib 297 passed; cargo test -p services --lib 377 passed, 1 ignored (incl. all 64 inference_provider_pool tests and the new tier/merge/prune/capability tests). Skipped: e2e tests (need PostgreSQL) and vLLM integration tests (need a live server). GitHub CI on the PR head: Lint, Test Suite, and security_audit all green.
Builds on #758 (Chutes as an attested provider). Turns Chutes from a single pinned provider (registered under its raw
-TEEslug) into a fallback fabric across all Chutes TEE models, under canonical ids, with NEAR-primary routing and a hard "verifiable models never serve plaintext" rule.Requirements (from product)
zai-org/GLM-5.1-FP8, not…-TEE), else the OpenRouter id.Design decisions (each grounded in the existing pool, with examples)
1. Canonical id vs chute slug — decouple them
The Chutes provider used
model_nameas both the catalog/public id and the upstreammodelit sends to Chutes. We split that:CHUTES_MODELSis now comma-separatedcanonical_id=chute_slugpairs (a bare entry →canonical == slug), parsed intoVec<ChutesModelEntry>.CHUTES_MODELS=zai-org/GLM-5.1-FP8=zai-org/GLM-5.1-TEE,moonshotai/Kimi-K2.6-TEEzai-org/GLM-5.1-FP8(NEAR-served) is the public/route id;zai-org/GLM-5.1-TEEis the upstream chute slug;moonshotai/Kimi-K2.6-TEE(Chutes-only) is canonical==slug for now (swap to its OpenRouter id when known).request_bodypins it +cached_chute_idresolves it upstream), but seeded/registered under the canonical id. So a client callingzai-org/GLM-5.1-FP8reaches Chutes with the right-TEEslug.2. Wire
ProviderTierinto selectionProviderTier{Near, Attested3p, NonAttested}existed but carried no behavior. Addedfn tier(&self)to theInferenceProvidertrait (defaultNonAttested);nearai::Provider → Near,chutes::Provider → Attested3p.3. Coexistence — Chutes as a secondary provider (highest-risk change)
register_pinned_provideroverwrote a model's provider list, and the discovery guards made a pinned name mutually exclusive with DB discovery — so NEAR + Chutes couldn't share a canonical id. Newregister_pinned_secondary_providerpushes instead, and records the provider in apinned_providersmap (the source of truth for the pinned tier). The refresh merge now rebuilds each model's list as[fresh discovered NEAR..] + [re-attached pinned Chutes]— discovery refreshes/adds the NEAR side but can never drop or overwrite the per-request-verified Chutes provider (the security invariant from #758 is preserved: discovery can only ADD an attested provider, never substitute it).4. Tier-ordered selection = NEAR-primary / Chutes-fallback (per-request)
get_providers_with_fallbackstable-sorts by tier (Near < Attested3p < NonAttested) within each health group. The existingretry_with_fallbackthen walks that ordered list, so a retryable NEAR failure (5xx/429/connection) falls through to Chutes within the same request (a non-retryable client 4xx still fast-returns without spending Chutes quota). A Chutes-only id has a single provider, so it's primary automatically — no special case. NEAR demotion (10 consecutive failures) already moves it below Chutes.5. Verifiable models never fall back to plaintext
If a model has any attested provider (
NearorAttested3p), allNonAttestedproviders are dropped from selection. A verifiable request thus only ever tries attested providers and fails closed rather than silently downgrading to a plaintext backend — defending the case where a Chutes-only canonical id also has a stray external (OpenAI/OpenRouter) row. Pure non-attested models (no attested provider) are unaffected.6. Catalog seeding under the canonical id
Seed/register under the canonical id. A NEAR-served canonical id (e.g.
zai-org/GLM-5.1-FP8) already has an active, attested row → it's respected verbatim and just gains Chutes as a fallback provider. A Chutes-only canonical id is auto-seeded inactive / $0 (operator activates + prices viaPATCH /v1/admin/models, unchanged from #758).Streaming
Decoder already implements the exact protocol (single
e2e_initML-KEM encapsulation → one HKDF stream key → per-chunk ChaCha20-Poly1305). It stays gated behindCHUTES_ENABLE_STREAMINGpending a staging verification that Chutes emits the[DONE]terminator inside the encrypted channel (our truncation guard depends on it). Per-chunk AEAD authenticates every byte; frame reorder remains undetectable (no sequence numbers) — accepted residual, truncation is caught. No code change needed to flip it on.Tests
canonical=slugpair parsing (+ bare entries, whitespace, dropped empties); empty when unset.verifiable_model_prefers_near_then_chutes_and_never_plaintext;chutes_only_model_serves_as_single_attested_primary;non_verifiable_model_is_served_by_its_non_attested_provider;pinned_secondary_chutes_coexists_with_db_near_provider. Existingpinned_provider_survives_refreshstill green.Follow-ups
CHUTES_ENABLE_STREAMINGafter the staging inner-[DONE]check.zai-org/GLM-5.1-TEEstaging row activated for PR7: Chutes as a fully attested inference provider (E2EE, hard-off) #758 verification — that should be deactivated; Chutes attaches underzai-org/GLM-5.1-FP8.