
feat: Implement metadata resilience and audit suite #193

Merged
sushant-suse merged 5 commits into openSUSE:main from sushant-suse:test_json_PR-183
Feb 27, 2026

Conversation

@sushant-suse
Collaborator

Fixes #192

This PR is a comprehensive fix for the "All-or-Nothing" failure mode in our metadata generation. Previously, if a single legacy document (like those in SLES 12 or 15) had missing XML fields, the Pydantic models raised a validation error, which either dropped the entire document or crashed the manifest generation.

I’ve moved the logic from "Strict Validation" to "Resilient Recovery," ensuring that we always get a usable manifest even if the source data is messy.

What this PR changes

1. Model Resilience (src/docbuild/models/manifest.py)

I updated the SingleDocument and Description models to be less fragile.

  • Optional Fields: Fields like title, description, and dcfile are now optional.
  • Smart Defaults: If a title is missing, it now defaults to "No Title Available" instead of failing. This keeps the document entry alive in the JSON so it can still be tracked and fixed later.
  • Empty String Handling: The DocumentFormat model now accepts empty strings for HTML paths, which was a common point of failure for legacy products.
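The defaulting behavior described above could look roughly like the following. This is a minimal sketch assuming Pydantic v2; the class names mirror the ones mentioned in this PR, but the fields and the validator `_fallback_title` are illustrative stand-ins, not the actual contents of manifest.py.

```python
from typing import Optional

from pydantic import BaseModel, field_validator

DEFAULT_TITLE = "No Title Available"


class Description(BaseModel):
    """Illustrative stand-in; the real model has more fields."""

    # Optional: legacy documents may not carry a description at all.
    description: Optional[str] = None


class SingleDocument(BaseModel):
    """Illustrative stand-in; the real model has more fields."""

    title: str = DEFAULT_TITLE
    dcfile: Optional[str] = None
    description: Optional[Description] = None

    @field_validator("title", mode="before")
    @classmethod
    def _fallback_title(cls, value):
        # Treat None and blank strings the same as a missing title.
        if value is None or not str(value).strip():
            return DEFAULT_TITLE
        return value
```

With a model shaped like this, `SingleDocument(title=None)` still produces an entry whose title is the placeholder, so the document stays visible in the JSON.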

2. Fault-Tolerant Processing (src/docbuild/cli/cmd_metadata/metaprocess.py)

I wrapped the document loading loop in a try/except block. If one document is so corrupted that it still fails validation, the tool now logs the error and continues to the next document rather than killing the entire process for that product version.
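The log-and-continue pattern can be sketched as below. The function names `load_documents` and `strict_parse` and the caught exception types are assumptions made for illustration; the real loop in metaprocess.py works on the project's own models and may catch different errors.

```python
import logging
from typing import Callable, Iterable

log = logging.getLogger("docbuild.sketch")


def load_documents(raw_docs: Iterable[dict], parse: Callable[[dict], dict]) -> list:
    """Parse every document, logging and skipping the ones that fail."""
    loaded = []
    for index, raw in enumerate(raw_docs):
        try:
            loaded.append(parse(raw))
        except (ValueError, KeyError) as err:
            # One broken document must not abort the whole product version.
            log.error("Skipping document %d: %s", index, err)
            continue
    return loaded


def strict_parse(raw: dict) -> dict:
    """Hypothetical validator that insists on a 'dcfile' key."""
    if "dcfile" not in raw:
        raise ValueError("missing dcfile")
    return {"dcfile": raw["dcfile"], "title": raw.get("title", "No Title Available")}
```

The key design choice is that the `try` wraps a single iteration, not the whole loop, so a failure costs one document rather than the rest of the batch.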

3. Decoupling the Stitcher (src/docbuild/config/xml/stitch.py)

I removed the ValueError that was triggered by unresolved references. We can now build metadata for specific products (like suma-retail) even if the products they reference aren't currently present in the build environment. It still logs the missing reference, but it no longer blocks the build.
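In spirit, the change replaces a hard failure with a warning. The sketch below is not the code in stitch.py; `check_references` and its plain-string identifiers are hypothetical and only show the behavior of reporting unresolved references without raising.

```python
import logging

log = logging.getLogger("docbuild.sketch.stitch")


def check_references(references, known_ids):
    """Return the unresolved references, logging each one instead of raising."""
    unresolved = [ref for ref in references if ref not in known_ids]
    for ref in unresolved:
        # Previously this situation would have raised ValueError.
        log.warning("Unresolved reference: %s", ref)
    return unresolved
```

Returning the unresolved list keeps the information available to callers that still want to treat it as an error, for example in a stricter CI mode.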

4. Git Performance & Disk Safety (src/docbuild/utils/git.py)

I ran into major storage issues (120GB+ usage) during testing, so I removed the forced --local flag in the Git helper. This lets the tool use our symlinked worktrees more efficiently instead of performing heavy local clones that eat up container storage.

📊 New Audit Tooling

I’ve added a suite of tools to help us benchmark the automated output against our manual "Gold Standard" JSONs:

  • tools/mass_audit.py: A runner that iterates through the entire manual catalog and attempts to generate metadata for every single one, producing a status_summary.csv.
  • tools/mass_audit_lean.py: A faster version of the above that reads from lean_audit.txt. Use this for quick verification of code changes without running the full catalog.
  • tools/audit_metadata.py: A statistical tool that compares the manual vs. generated JSONs and gives a "Match Rate %" for every product. This is how we track our progress toward 100% parity.
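To make the "Match Rate %" idea concrete, here is one possible definition: the share of leaf values in the manual JSON that appear unchanged at the same path in the generated JSON. This definition and the helper names `flatten` and `match_rate` are assumptions for illustration; audit_metadata.py may measure parity differently.

```python
def flatten(node, prefix=""):
    """Map every leaf value of a JSON-like structure to a path string."""
    if isinstance(node, dict):
        items = [(f"{prefix}/{key}", value) for key, value in node.items()]
    elif isinstance(node, list):
        items = [(f"{prefix}/{index}", value) for index, value in enumerate(node)]
    else:
        return {prefix: node}
    leaves = {}
    for path, value in items:
        leaves.update(flatten(value, path))
    return leaves


def match_rate(manual, generated):
    """Percentage of manual leaves reproduced exactly in the generated data."""
    expected = flatten(manual)
    if not expected:
        return 100.0
    actual = flatten(generated)
    hits = sum(
        1 for path, value in expected.items()
        if path in actual and actual[path] == value
    )
    return 100.0 * hits / len(expected)
```

A path-based comparison like this is strict about list order, so reordered documents count as mismatches; a real audit might key list entries by an identifier such as the DC file instead.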

How to verify

This was tested on both macOS and an Ubuntu-based Docker environment. To replicate:

  1. Set up your env.development.toml to point to your local documentation checkouts.
  2. Create a lean_audit.txt in the root and add a few problematic doctypes (for example, sles/12-SP5/en-us).
  3. Execute python3 tools/mass_audit_lean.py.
  4. Check audit_reports/lean_audit/ for the logs.
    • You will see that the tool now generates JSON files even for products that previously caused tracebacks.
  5. Run pytest. All tests should pass (I updated the test suite to reflect the new resilient behavior).
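For step 2, a lean_audit.txt with one doctype per line could be read as sketched below. The one-entry-per-line layout, the `#` comment handling, and the names `Doctype` and `parse_doctypes` are assumptions; check mass_audit_lean.py for the format it actually expects.

```python
from typing import NamedTuple


class Doctype(NamedTuple):
    product: str
    docset: str
    lang: str


def parse_doctypes(text: str) -> list:
    """Parse lines such as 'sles/12-SP5/en-us' into Doctype tuples."""
    entries = []
    for line in text.splitlines():
        line = line.strip()
        # Assumption: blank lines and '#' comments are ignored.
        if not line or line.startswith("#"):
            continue
        parts = line.split("/")
        if len(parts) != 3:
            raise ValueError(f"expected product/docset/lang, got {line!r}")
        entries.append(Doctype(*parts))
    return entries
```

For example, `parse_doctypes("sles/12-SP5/en-us\n")` yields a single `Doctype` with product `sles`, docset `12-SP5`, and language `en-us`.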

This PR also includes the mandatory news fragment in changelog.d/192.feature.rst and updates .gitignore.

@github-actions

github-actions Bot commented Feb 24, 2026

Coverage Report

For commit 8e91d67

  Name                                           Stmts   Miss Branch BrPart  Cover
  --------------------------------------------------------------------------------
+ src/docbuild/models/deliverable.py               180      1     22      0  99.5%
+ src/docbuild/cli/cmd_check/process.py             58      0     22      1  98.8%
+ src/docbuild/models/manifest.py                  111      1     12      1  98.4%
+ src/docbuild/cli/cmd_cli.py                       93      1      8      1  98.0%
+ src/docbuild/utils/pidlock.py                     79      1     14      1  97.8%
+ src/docbuild/cli/cmd_validate/process.py         178      5     52      4  96.1%
+ src/docbuild/cli/callback.py                      35      0     10      2  95.6%
- src/docbuild/cli/cmd_config/__init__.py            9      1      0      0  88.9%
- src/docbuild/config/xml/stitch.py                 47      5     12      0  88.1%
- src/docbuild/cli/cmd_metadata/metaprocess.py     215     26     66     13  82.6%
- src/docbuild/cli/cmd_check/__init__.py            18      5      2      0  65.0%
- src/docbuild/cli/cmd_build/__init__.py            13      5      0      0  61.5%
- src/docbuild/cli/cmd_metadata/__init__.py         27     10      2      0  58.6%
- src/docbuild/cli/cmd_config/environment.py        11      6      2      0  38.5%
  --------------------------------------------------------------------------------
+ TOTAL                                           2891     67    670     23  97.0%
  
  46 files skipped due to complete coverage.

@sushant-suse sushant-suse requested a review from tomschr February 25, 2026 06:47
Contributor

@tomschr tomschr left a comment


Thanks @sushant-suse! Awesome work!

I have some ideas/suggestions/questions for you below. 😉

Comment thread src/docbuild/cli/cmd_metadata/metaprocess.py Outdated
Comment thread src/docbuild/config/xml/stitch.py Outdated
Comment thread src/docbuild/models/manifest.py Outdated
Comment thread tools/audit_metadata.py Outdated
Comment thread tools/mass_audit.py Outdated
Comment thread tools/mass_audit_lean.py Outdated
Comment thread tools/audit_metadata.py Outdated
@sushant-suse
Collaborator Author

Hi Toms, I’ve consolidated the 4 separate scripts into a single, portable audit_suite.py. It now handles environment-aware path resolution (detecting Docker vs. local macOS) and includes a CLI for stats and parity. I also updated the manifest models to be more resilient, ensuring we get 'Hollow JSONs' with metadata instead of empty files when XML fields are missing. Pytests are also green at 97% coverage.
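Environment-aware path resolution of this kind can be sketched as follows. This is not the logic in audit_suite.py: the `/.dockerenv` marker is a common heuristic for detecting Docker, and the function name, parameters, and return labels are hypothetical.

```python
import platform
from pathlib import Path
from typing import Optional


def detect_environment(
    marker: Path = Path("/.dockerenv"), system: Optional[str] = None
) -> str:
    """Guess whether we run inside Docker, on macOS, or on another host."""
    # Heuristic: Docker creates /.dockerenv inside containers.
    if marker.exists():
        return "docker"
    system = system or platform.system()
    return "macos" if system == "Darwin" else "other"
```

A script could then pick its base directories from a small lookup keyed by the returned label, which keeps the path choices in one place.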

@sushant-suse sushant-suse requested a review from tomschr February 25, 2026 15:06
Contributor

@tomschr tomschr left a comment


Before this old man forgets it, I'll send you this little idea. 😂

Comment thread tools/audit_suite.py Outdated
@sushant-suse
Collaborator Author

> Before this old man forgets it, I'll send you this little idea. 😂

You are younger than me 😂

@sushant-suse sushant-suse requested a review from tomschr February 26, 2026 05:10
Comment thread src/docbuild/models/manifest.py Outdated
Contributor

@tomschr tomschr left a comment


I have some more ideas for you. 🙂

Comment thread tools/audit_suite.py Outdated
Comment thread tools/audit_suite.py Outdated
@sushant-suse sushant-suse requested a review from tomschr February 27, 2026 08:23
Contributor

@tomschr tomschr left a comment


Unfortunately I have some more suggestions. 😇

Comment thread src/docbuild/config/xml/stitch.py
Comment thread src/docbuild/models/manifest.py Outdated
Comment thread src/docbuild/models/manifest.py Outdated
Comment thread tools/audit_suite.py
Comment thread tools/audit_suite.py
@sushant-suse sushant-suse requested a review from tomschr February 27, 2026 09:12
Contributor

@tomschr tomschr left a comment


Thanks a lot Sushant! 👍

@sushant-suse sushant-suse merged commit 26f7c37 into openSUSE:main Feb 27, 2026
9 of 10 checks passed
@sushant-suse sushant-suse deleted the test_json_PR-183 branch February 27, 2026 13:04

Development

Successfully merging this pull request may close these issues.

Audit: Metadata Generation Parity Gap & Validation Resilience Improvements

2 participants