nightshift: perf-regression — 12 allocation hotspots in scan pipeline #26

# nightshift: perf-regression — 12 allocation hotspots in the scan pipeline

## Summary

**jarspect** is a Rust security scanner for Minecraft `.jar` mods. Analysis of 28 source files (299KB) reveals 12 performance-relevant patterns in the scan pipeline: excessive `clone()` calls (134 total), unbounded `Vec` allocations without capacity hints, and redundant string allocations in the hot path (`lib.rs` → `scan.rs` → detectors → `verdict.rs`).

## Findings

### 🟡 P1 — Excessive `.clone()` in hot path (134 occurrences)

The scan pipeline creates strings for paths, severities, categories, and evidence, then clones them at every stage boundary.

**Worst offenders:**

| File | `.clone()` count | Hot path context |
|------|-------------------|------------------|
| `src/lib.rs` | ~15 | `entry.path.clone()`, `signature.id.clone()`, `signature.severity.clone()` in signature matching loop (lines 256-288) |
| `src/scan.rs` | ~12 | `request.upload_id.clone()`, `capability_profile.capabilities.clone()`, `root_label.clone()` in `run_scan` (lines 96-264) |
| `src/verdict.rs` | ~20 | `static_findings.matches` iteration with per-indicator cloning for evidence aggregation |

**Impact:** For a mod with 1000+ class files, the signature matching loop in `lib.rs` clones strings for every match on every entry. With 11 detectors, this compounds.
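One way to avoid the per-match clones is to let each match borrow from the entry and signature that produced it. The following is a minimal sketch, not jarspect's actual code: `Entry`, `Signature`, `Match`, `contains`, and `match_signatures` are hypothetical stand-ins, and it assumes the entries and signatures outlive the collected matches.

```rust
/// Hypothetical stand-in for a loaded signature; real field names may differ.
struct Signature {
    id: String,
    severity: String,
    needle: Vec<u8>,
}

/// Hypothetical stand-in for an archive entry.
struct Entry {
    path: String,
    bytes: Vec<u8>,
}

/// A match that borrows its strings instead of owning cloned copies.
#[derive(Debug, PartialEq)]
struct Match<'a> {
    path: &'a str,
    signature_id: &'a str,
    severity: &'a str,
}

/// Naive byte-substring search, present only to keep the sketch self-contained.
fn contains(haystack: &[u8], needle: &[u8]) -> bool {
    !needle.is_empty() && haystack.windows(needle.len()).any(|w| w == needle)
}

/// Collects matches without allocating any new strings.
fn match_signatures<'a>(entries: &'a [Entry], signatures: &'a [Signature]) -> Vec<Match<'a>> {
    let mut matches = Vec::new();
    for entry in entries {
        for sig in signatures {
            if contains(&entry.bytes, &sig.needle) {
                matches.push(Match {
                    path: &entry.path,
                    signature_id: &sig.id,
                    severity: &sig.severity,
                });
            }
        }
    }
    matches
}
```

If matches must outlive their inputs (for example, when they are serialized after the archive is dropped), borrowing does not fit and shared ownership such as `Arc<str>` may be the better trade-off.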

### 🟡 P1 — `Vec::new()` without capacity hints (87 `.collect()` + many `Vec::new()`)

Key locations:

| File | Location | Issue |
|------|----------|-------|
| `src/lib.rs:151-153` | `let mut matches = Vec::new()`, `matched_pattern_ids = Vec::new()`, `matched_signature_ids = Vec::new()` | Could pre-allocate based on known pattern/signature counts |
| `src/lib.rs:354-355` | `counts_by_category: HashMap::new()`, `counts_by_severity: HashMap::new()` | Small fixed number of categories (≤5) — could use `with_capacity(8)` |
| `src/lib.rs:415,456,475` | Signature/rulepack loading vectors | Size known from parsed JSON; use `with_capacity(signatures.len())` |
| `src/detectors/mod.rs:92` | `merge_strings(target, incoming)` — takes `mut Vec<String>` | Could use `extend` with `reserve` instead of pushing one-by-one |
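The capacity hints in the table could look roughly like this. It is a sketch under assumptions: the `Signature` shape and the functions `matched_signature_ids` and `count_by_severity` are hypothetical, and `merge_strings` here is a guess at the helper's intent (plain append), not its real body.

```rust
use std::collections::HashMap;

/// Hypothetical signature record; only the id matters for this sketch.
struct Signature {
    id: String,
}

/// Pre-allocates for the worst case: every signature matches.
fn matched_signature_ids<'a>(
    signatures: &'a [Signature],
    is_hit: impl Fn(&Signature) -> bool,
) -> Vec<&'a str> {
    let mut ids = Vec::with_capacity(signatures.len());
    for sig in signatures {
        if is_hit(sig) {
            ids.push(sig.id.as_str());
        }
    }
    ids
}

/// Appends `incoming` to `target`, growing the buffer at most once.
fn merge_strings(target: &mut Vec<String>, incoming: Vec<String>) {
    // `extend` on an exact-size iterator already reserves up front;
    // the explicit `reserve` just makes the intent visible.
    target.reserve(incoming.len());
    target.extend(incoming);
}

/// Counts occurrences per severity label with a small pre-sized map.
fn count_by_severity<'a>(severities: &[&'a str]) -> HashMap<&'a str, usize> {
    let mut counts = HashMap::with_capacity(8);
    for s in severities {
        *counts.entry(*s).or_insert(0) += 1;
    }
    counts
}
```

Pre-sizing to the worst case trades some memory for fewer reallocations, so it is worth confirming with a profiler that these vectors actually grow in the hot path.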

### 🟢 P2 — Redundant string formatting in verdict generation

`src/verdict.rs` (38KB, largest file) generates AI prompts by concatenating strings in loops:

| Line(s) | Pattern | Cost / issue |
|------|---------|-------|
| 198-216 | Per-indicator string extraction loops | 3 loops over `static_findings.matches` |
| 256 | `for value in strings` formatting | O(indicators) allocations |
| 551 | Second full iteration over `static_findings.matches` | Duplicates work from line 198 |
| 636 | Profile capability iteration | Additional string building |

The verdict builder iterates over all findings multiple times to construct the AI prompt, creating intermediate `String` allocations each pass.
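A single-pass builder can fill every prompt section during one iteration by writing directly into reusable buffers. The sketch below assumes a hypothetical `Indicator` shape and an invented two-section prompt layout; the real prompt format in `verdict.rs` is not shown in this issue.

```rust
use std::fmt::Write;

/// Hypothetical indicator shape; the real struct likely has more fields.
struct Indicator {
    id: String,
    severity: String,
    evidence: String,
}

/// Builds both prompt sections in one pass over the indicators.
fn build_prompt(indicators: &[Indicator]) -> String {
    let mut ids = String::new();
    let mut evidence = String::new();
    for ind in indicators {
        // `writeln!` appends to the existing buffer, so no temporary
        // `String` is created per line. Writing to a `String` cannot fail.
        let _ = writeln!(ids, "- {} ({})", ind.id, ind.severity);
        let _ = writeln!(evidence, "- {}: {}", ind.id, ind.evidence);
    }
    // Size the final buffer once, then concatenate the sections.
    let mut prompt = String::with_capacity(ids.len() + evidence.len() + 32);
    prompt.push_str("Indicators:\n");
    prompt.push_str(&ids);
    prompt.push_str("Evidence:\n");
    prompt.push_str(&evidence);
    prompt
}
```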

### 🟢 P2 — Archive entry cloning in scan pipeline

`src/scan.rs:150-151`:
```rust
path: root_label.clone(),
bytes: bytes.clone(),
```

`bytes.clone()` copies the full archive content, so for large mods (10MB+) it duplicates the entire byte buffer. Consider sharing the buffer through `Arc<[u8]>` or passing it by reference.
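As a rough illustration of the `Arc<[u8]>` option, cloning the `Arc` only bumps a reference count and leaves the bytes in place. `ArchiveEntry` and `root_entry` are hypothetical names, and the sketch assumes the buffer is read-only after upload.

```rust
use std::sync::Arc;

/// Hypothetical entry type holding a shared, immutable view of the archive.
struct ArchiveEntry {
    path: String,
    bytes: Arc<[u8]>,
}

/// Builds the root entry without copying the archive contents.
fn root_entry(root_label: &str, bytes: &Arc<[u8]>) -> ArchiveEntry {
    ArchiveEntry {
        path: root_label.to_owned(),
        // Increments the reference count; the byte buffer is not duplicated.
        bytes: Arc::clone(bytes),
    }
}
```

Converting an existing `Vec<u8>` with `Arc::from(vec)` still copies the bytes once into the new allocation, so the saving comes from every later clone.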

### 🟢 P2 — Metadata analysis allocations

`src/analysis/metadata.rs` (22KB) is the largest analysis file:
- Creates `MetadataFinding` structs with cloned strings for every entry
- `analyze_layer()` iterates all entries in a layer, cloning paths for each finding
- `analyze_fabric_metadata()`, `analyze_forge_metadata()`, `analyze_spigot_metadata()` all build finding vectors without capacity hints
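For the metadata findings, one possible shape borrows the entry path and uses `Cow<str>` so that fixed messages cost nothing and only dynamic ones allocate. This is a sketch with assumed names: the `MetadataFinding` fields and the two checks inside `analyze_layer` are invented for illustration and do not reflect the real rules.

```rust
use std::borrow::Cow;

/// Hypothetical finding that borrows its path from the scanned layer.
struct MetadataFinding<'a> {
    path: &'a str,
    message: Cow<'a, str>,
}

/// Scans entry paths and records findings without cloning the paths.
fn analyze_layer<'a>(paths: &'a [String]) -> Vec<MetadataFinding<'a>> {
    // Upper bound: at most one finding per path in this sketch.
    let mut findings = Vec::with_capacity(paths.len());
    for path in paths {
        if path.ends_with(".exe") {
            findings.push(MetadataFinding {
                path: path.as_str(),
                // Static text: no allocation.
                message: Cow::Borrowed("embedded executable"),
            });
        } else if path.contains("..") {
            findings.push(MetadataFinding {
                path: path.as_str(),
                // Dynamic text: allocates only for this finding.
                message: Cow::Owned(format!("path traversal in {path}")),
            });
        }
    }
    findings
}
```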

## Benchmarks Suggested

1. **Add criterion benchmarks** for:
   - `run_capability_detectors()` with a 1000-entry evidence index
   - `generate_verdict()` with 50+ indicators
   - `analyze_metadata()` on a multi-layer mod
2. **Track allocation count** with `dhat` or `jemalloc` profiling on a real 10MB mod
3. **Set up CI perf regression** — fail if scan time on a fixed fixture exceeds baseline by >10%
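The CI gate in step 3 reduces to a threshold comparison. The sketch below uses only the standard library as a stand-in for criterion; `median_time` and `exceeds_baseline` are hypothetical helpers, and a real setup would likely read the baseline from a stored file rather than a constant.

```rust
use std::time::{Duration, Instant};

/// Runs `work` several times and returns the median wall-clock duration.
fn median_time<F: FnMut()>(runs: usize, mut work: F) -> Duration {
    assert!(runs > 0, "need at least one run");
    let mut samples = Vec::with_capacity(runs);
    for _ in 0..runs {
        let start = Instant::now();
        work();
        samples.push(start.elapsed());
    }
    samples.sort();
    samples[runs / 2]
}

/// Returns true when `measured` is more than `tolerance` (0.10 = 10%)
/// slower than `baseline`.
fn exceeds_baseline(measured: Duration, baseline: Duration, tolerance: f64) -> bool {
    measured.as_secs_f64() > baseline.as_secs_f64() * (1.0 + tolerance)
}
```

A CI job could call `median_time` around a scan of the fixed fixture and exit non-zero when `exceeds_baseline` returns true. Wall-clock timings on shared runners are noisy, so the 10% tolerance may need tuning.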

## Low-Hanging Fruit (Estimated Impact)

| Fix | Effort | Impact |
|-----|--------|--------|
| Replace `bytes.clone()` with `Arc<[u8]>` in `scan.rs` | Low | High for large mods |
| Add `Vec::with_capacity()` in signature matching loop | Low | Medium |
| Use `&str` / `Cow<str>` for indicator fields passed between stages | Medium | Medium |
| Single-pass verdict prompt builder instead of 3 iterations | Medium | Medium |
| `HashMap::with_capacity(8)` for category/severity counts | Low | Low |

## Files Analyzed

- `src/lib.rs` (17.8KB) — main analysis pipeline
- `src/scan.rs` (19.6KB) — scan orchestrator
- `src/verdict.rs` (38.1KB) — AI verdict generation
- `src/analysis/metadata.rs` (22.1KB) — metadata analysis
- `src/detectors/mod.rs` (6.8KB) — detector dispatch
- All 11 detector files (`src/detectors/capability_*.rs`)

