feat: Implement metadata resilience and audit suite #193
sushant-suse merged 5 commits into openSUSE:main
Signed-off-by: sushant-suse <[email protected]>
Coverage Report for commit 8e91d67
Name Stmts Miss Branch BrPart Cover
--------------------------------------------------------------------------------
+ src/docbuild/models/deliverable.py 180 1 22 0 99.5%
+ src/docbuild/cli/cmd_check/process.py 58 0 22 1 98.8%
+ src/docbuild/models/manifest.py 111 1 12 1 98.4%
+ src/docbuild/cli/cmd_cli.py 93 1 8 1 98.0%
+ src/docbuild/utils/pidlock.py 79 1 14 1 97.8%
+ src/docbuild/cli/cmd_validate/process.py 178 5 52 4 96.1%
+ src/docbuild/cli/callback.py 35 0 10 2 95.6%
- src/docbuild/cli/cmd_config/__init__.py 9 1 0 0 88.9%
- src/docbuild/config/xml/stitch.py 47 5 12 0 88.1%
- src/docbuild/cli/cmd_metadata/metaprocess.py 215 26 66 13 82.6%
- src/docbuild/cli/cmd_check/__init__.py 18 5 2 0 65.0%
- src/docbuild/cli/cmd_build/__init__.py 13 5 0 0 61.5%
- src/docbuild/cli/cmd_metadata/__init__.py 27 10 2 0 58.6%
- src/docbuild/cli/cmd_config/environment.py 11 6 2 0 38.5%
--------------------------------------------------------------------------------
+ TOTAL 2891 67 670 23 97.0%
46 files skipped due to complete coverage.
tomschr
left a comment
Thanks @sushant-suse! Awesome work!
I have some ideas/suggestions/questions for you below. 😉
) Signed-off-by: sushant-suse <[email protected]>
Hi Toms, I’ve consolidated the 4 separate scripts into a single, portable
tomschr
left a comment
Before this old man forgets it, I'll send you this little idea. 😂
…openSUSE#192) Signed-off-by: sushant-suse <[email protected]>
You are younger than me 😂
tomschr
left a comment
I have some more ideas for you. 🙂
…tadata validation Signed-off-by: sushant-suse <[email protected]>
tomschr
left a comment
Unfortunately I have some more suggestions. 😇
…penSUSE#192) Signed-off-by: sushant-suse <[email protected]>
Fixes #192
This PR is a comprehensive fix for the "All-or-Nothing" failure mode in our metadata generation. Previously, if a single legacy document (like those in SLES 12 or 15) had missing XML fields, the Pydantic models would throw a validation error and drop the entire document or crash the manifest generation.
I’ve moved the logic from "Strict Validation" to "Resilient Recovery," ensuring that we always get a usable manifest even if the source data is messy.
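As a rough illustration of the "Resilient Recovery" idea, here is a minimal, standard-library-only sketch. It is not the code from this PR: the real models are Pydantic classes, while `Doc`, `parse_doc`, and `load_documents` below are hypothetical stand-ins that only show the pattern of defaulting missing fields and skipping records that still cannot be parsed.

```python
import logging
from dataclasses import dataclass
from typing import List, Optional

log = logging.getLogger("metadata-sketch")


@dataclass
class Doc:
    """Hypothetical stand-in for a document entry (the PR uses Pydantic models)."""

    dcfile: Optional[str] = None
    title: str = "No Title Available"
    description: Optional[str] = None


def parse_doc(raw: object) -> Doc:
    """Hypothetical validator: tolerate missing fields, reject unusable records."""
    if not isinstance(raw, dict):
        raise ValueError(f"unusable document record: {raw!r}")
    return Doc(
        dcfile=raw.get("dcfile"),
        # Fall back to a placeholder so the entry stays in the manifest.
        title=raw.get("title") or "No Title Available",
        description=raw.get("description"),
    )


def load_documents(raw_docs: list) -> List[Doc]:
    """Parse every record, logging and skipping the ones that still fail."""
    docs: List[Doc] = []
    for raw in raw_docs:
        try:
            docs.append(parse_doc(raw))
        except ValueError as err:
            # One bad record no longer aborts the whole product version.
            log.error("Skipping document: %s", err)
            continue
    return docs
```

With input such as `[{"title": "Admin Guide"}, {"dcfile": "DC-legacy"}, None]`, this sketch keeps the first two entries (the second with the placeholder title) and logs the third instead of raising.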
What this PR changes
1. Model Resilience (`src/docbuild/models/manifest.py`)
I updated the `SingleDocument` and `Description` models to be less fragile.
- `title`, `description`, and `dcfile` are now optional.
- A missing title falls back to "No Title Available" instead of failing. This keeps the document entry alive in the JSON so it can still be tracked and fixed later.
- The `DocumentFormat` model now accepts empty strings for HTML paths, which was a common point of failure for legacy products.

2. Fault-Tolerant Processing (`src/docbuild/cli/cmd_metadata/metaprocess.py`)
I wrapped the document loading loop in a `try/except` block. If one document is so corrupted that it still fails validation, the tool now logs the error and continues to the next document rather than killing the entire process for that product version.

3. Decoupling the Stitcher (`src/docbuild/config/xml/stitch.py`)
I removed the `ValueError` that was triggered by unresolved references. We can now build metadata for specific products (like `suma-retail`) even if the products they reference aren't currently present in the build environment. It still logs the missing reference, but it no longer blocks the build.

4. Git Performance & Disk Safety (`src/docbuild/utils/git.py`)
I ran into major storage issues (120GB+ usage) during testing, so I removed the forced `--local` flag in the Git helper. This allows the tool to use our symlinked worktrees more efficiently without trying to perform heavy-duty local clones that eat up container storage.

📊 New Audit Tooling
I’ve added a suite of tools to help us benchmark the automated output against our manual "Gold Standard" JSONs:
- `tools/mass_audit.py`: A runner that iterates through the entire manual catalog and attempts to generate metadata for every entry, producing a `status_summary.csv`.
- `tools/mass_audit_lean.py`: A faster version of the above that reads from `lean_audit.txt`. Use this for quick verification of code changes without running the full catalog.
- `tools/audit_metadata.py`: A statistical tool that compares the manual vs. generated JSONs and gives a "Match Rate %" for every product. This is how we track our progress toward 100% parity.

How to verify
This was tested on both macOS and an Ubuntu-based Docker environment. To replicate:
1. Edit `env.development.toml` to point to your local documentation checkouts.
2. Create `lean_audit.txt` in the root and add a few problematic doctypes (for example, `sles/12-SP5/en-us`).
3. Run `python3 tools/mass_audit_lean.py`.
4. Check `audit_reports/lean_audit/` for the logs.
5. `pytest` should pass with 100% success (I updated the test suite to reflect the new resilient behavior).

This PR also includes the mandatory news fragment in `changelog.d/192.feature.rst` and updates `.gitignore`.
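For readers unfamiliar with the "Match Rate %" idea from the audit tooling, the sketch below shows one plausible way to compute it. How `tools/audit_metadata.py` actually compares the JSONs is not described here, so `match_rate` and its key-by-key comparison are assumptions made purely for illustration.

```python
def match_rate(manual: dict, generated: dict) -> float:
    """Percentage of top-level keys in `manual` whose values also appear,
    unchanged, in `generated`. (Assumed metric, not the tool's real one.)"""
    if not manual:
        return 100.0
    matches = sum(
        1
        for key, value in manual.items()
        if key in generated and generated[key] == value
    )
    return 100.0 * matches / len(manual)
```

A stricter variant could recurse into nested documents or weight fields differently; the flat comparison is only the simplest starting point.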