fix(kernelio): create-or-replace for drop-in writes (live-caught bug)#115
Merged
Conversation
LIVE agent-mode validation on the test fleet (RHEL 9.6, 211) caught a
real bug in the merged Phase 6/7 kernel-IO handlers: they used
fsatomic.AtomicReplace to write drop-in files (sysctl persist, modprobe
blacklist, audit rules.d, dconf profile/snippet/lock) and AtomicRemove to
remove them on rollback. But fsatomic deliberately splits the cases —
AtomicWrite errors ErrAlreadyExists on an existing target, AtomicReplace
and AtomicRemove error ErrNotExist on a missing one. A drop-in file is
usually NEW on first apply, so AtomicReplace failed:
sysctl_set: persist write failed: fsatomic: target does not exist:
/etc/sysctl.d/99-kensa.conf
On the live host this caused apply to fail → the transaction rolled back →
and the rollback's AtomicRemove ALSO errored ErrNotExist on the
never-created file, so rollback returned early and left the runtime sysctl
value UNRESTORED (changed but not rolled back). The in-memory test fake
modeled AtomicReplace/Remove as a plain map set/delete (no existence
semantics), so every unit test passed — the fake masked the bug; only the
real fsatomic primitives on a real host surfaced it.
Fix:
- internal/agent/kernelio: WriteFile (create-OR-replace: AtomicWrite, fall
back to AtomicReplace on ErrAlreadyExists; race-tolerant) and RemoveFile
(AtomicRemove, ErrNotExist tolerated as a no-op success). Direct tests.
- The 4 affected handlers (sysctl_set, kernel_module_disable,
audit_rule_set, dconf_set) switch their drop-in writes/removes to
WriteFile/RemoveFile. mount_option_set is unaffected (/etc/fstab always
exists; AtomicReplace is correct there).
- FakeSysctlTransport now MODELS the real fsatomic existence semantics:
AtomicReplace/AtomicRemove return ErrNotExist on a missing file,
AtomicWrite returns ErrAlreadyExists on an existing one. This is what
makes the unit tests catch this class of bug going forward — with the
hardened fake, the pre-fix handlers FAIL.
Live re-validation on 211 (sysctl_set, benign log_martians fixture,
restored after each run): apply now COMMITS with the correct runtime
value + drop-in content; a forced apply-time rollback (2-step rule whose
2nd step the kernel rejects) restores the runtime value AND removes the
drop-in BYTE-PERFECT. Both the apply-commit and apply-then-rollback paths
verified on the real host.
Failure-mode analysis:
1. What could this do wrong in production? This IS the production bug —
agent-mode remediation of sysctl/module/audit/dconf would fail to apply
(and, worse, fail to fully roll back, leaving runtime state changed).
The fix makes both paths correct and the hardened fake prevents
regression. Shell-path (non-agent) hosts were never affected (printf >
file create-or-replaces fine).
2. Captured-state sufficiency: unchanged — the fix is in the write/remove
primitives, not the captured PreState. Rollback now actually completes
(RemoveFile tolerates absence) so the captured runtime value IS
restored, which the pre-fix early-return prevented.
3. Edge case not safe for / gated: WriteFile's two-step (AtomicWrite then
AtomicReplace) is race-tolerant against a concurrent writer
creating/removing the file between attempts. Live-validated for sysctl;
the other three handlers use the identical helper + fsatomic primitives.
Co-Authored-By: Claude Opus 4.8 (1M context) <[email protected]>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Live agent-mode validation on the test fleet (RHEL 9.6, 211) caught a real bug in the merged Phase 6/7 kernel-IO handlers. They used
fsatomic.AtomicReplaceto write drop-in files andAtomicRemoveto remove them on rollback — butfsatomicsplits the cases:AtomicWriteerrorsErrAlreadyExistson an existing target;AtomicReplace/AtomicRemoveerrorErrNotExiston a missing one. A drop-in is usually new on first apply, so:On the live host this made apply fail → the transaction rolled back → and the rollback's
AtomicRemovealso erroredErrNotExiston the never-created file, so rollback returned early and left the runtime sysctl value unrestored (changed but not rolled back). The in-memory fake modeledAtomicReplace/Removeas a plain map set/delete (no existence semantics), so every unit test passed — the fake masked the bug; only the realfsatomicprimitives on a real host surfaced it.Fix
internal/agent/kernelio:WriteFile(create-OR-replace;AtomicWrite, fall back toAtomicReplaceonErrAlreadyExists; race-tolerant) andRemoveFile(AtomicRemove,ErrNotExisttolerated as no-op). Direct tests.sysctl_set,kernel_module_disable,audit_rule_set,dconf_set) switch their drop-in writes/removes toWriteFile/RemoveFile.mount_option_setis unaffected (/etc/fstabalways exists).FakeSysctlTransportnow models the realfsatomicexistence semantics (ErrNotExist/ErrAlreadyExists) — so the unit tests catch this class going forward (the pre-fix handlers now FAIL the fake).Live re-validation on 211 (sysctl_set, benign
log_martians, restored after each run)FIXED) with correct runtime value + drop-in content.Verification
go test ./...green;go build ./...clean;golangci-lint0;comment-lintclean (post-commit);specter syncall pass.Failure-mode analysis
WriteFile's two-step is race-tolerant. Live-validated for sysctl; the other three handlers use the identical helper + primitives.🤖 Generated with Claude Code