feat: Implement metadata resilience and audit suite by sushant-suse · Pull Request #193 · openSUSE/docbuild

sushant-suse · 2026-02-24T14:48:18Z

Fixes #192

This PR fixes the "All-or-Nothing" failure mode in our metadata generation. Previously, if a single legacy document (such as those in SLES 12 or 15) had missing XML fields, the Pydantic models raised a validation error, which either dropped the entire document or crashed the manifest generation.

I’ve moved the logic from "Strict Validation" to "Resilient Recovery," ensuring that we always get a usable manifest even if the source data is messy.

What this PR changes

1. Model Resilience (`src/docbuild/models/manifest.py`)

I updated the `SingleDocument` and `Description` models to be less fragile.

- Optional Fields: Fields like `title`, `description`, and `dcfile` are now optional.
- Smart Defaults: If a title is missing, it now defaults to "No Title Available" instead of failing. This keeps the document entry alive in the JSON so it can still be tracked and fixed later.
- Empty String Handling: The `DocumentFormat` model now accepts empty strings for HTML paths, which was a common point of failure for legacy products.
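
The defaulting behaviour listed above can be sketched roughly as follows. This is not the code from `manifest.py`: it uses a plain dataclass as a dependency-free stand-in for the Pydantic models, and the class name `SingleDocumentSketch`, the `from_raw` helper, the `html` field, and the sample `dcfile` value are all hypothetical.

```python
from dataclasses import dataclass
from typing import Any, Mapping, Optional

DEFAULT_TITLE = "No Title Available"  # default quoted in the PR description


@dataclass
class SingleDocumentSketch:
    """Hypothetical stand-in for SingleDocument; the field set is illustrative."""

    title: str = DEFAULT_TITLE
    description: Optional[str] = None
    dcfile: Optional[str] = None
    html: str = ""  # an empty HTML path is accepted rather than rejected

    @classmethod
    def from_raw(cls, raw: Mapping[str, Any]) -> "SingleDocumentSketch":
        # Treat a missing, None, or blank title the same way: use the default.
        title = (raw.get("title") or "").strip() or DEFAULT_TITLE
        return cls(
            title=title,
            description=raw.get("description"),
            dcfile=raw.get("dcfile"),
            html=raw.get("html") or "",
        )


# A legacy record with no title still yields a usable entry.
legacy = SingleDocumentSketch.from_raw({"dcfile": "DC-example"})
print(legacy.title)
```

The point of the design is that an incomplete record stays visible in the manifest with a recognisable placeholder, so it can be found and fixed later.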

2. Fault-Tolerant Processing (`src/docbuild/cli/cmd_metadata/metaprocess.py`)

I wrapped the document loading loop in a try/except block. If one document is so corrupted that it still fails validation, the tool now logs the error and continues to the next document rather than killing the entire process for that product version.
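
A minimal sketch of that pattern, assuming a per-document validation callable. The names `load_documents` and `require_dcfile`, the logger name, and the exception types caught are illustrative assumptions, not the code in `metaprocess.py`.

```python
import logging
from typing import Any, Callable, Dict, Iterable, List

log = logging.getLogger("metadata-sketch")  # hypothetical logger name


def load_documents(
    raw_docs: Iterable[Dict[str, Any]],
    validate: Callable[[Dict[str, Any]], Dict[str, Any]],
) -> List[Dict[str, Any]]:
    """Validate each document; log and skip the ones that fail."""
    loaded = []
    for index, raw in enumerate(raw_docs):
        try:
            loaded.append(validate(raw))
        except (ValueError, KeyError, TypeError) as err:
            # One bad document must not abort the whole product version.
            log.error("Skipping document %d: %s", index, err)
    return loaded


def require_dcfile(raw: Dict[str, Any]) -> Dict[str, Any]:
    """Toy validator standing in for model validation."""
    if not raw.get("dcfile"):
        raise ValueError("missing dcfile")
    return raw


docs = load_documents([{"dcfile": "DC-a"}, {}, {"dcfile": "DC-b"}], require_dcfile)
print([d["dcfile"] for d in docs])
```

The second input record fails validation and is logged, while the first and third are still returned.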

3. Decoupling the Stitcher (`src/docbuild/config/xml/stitch.py`)

I removed the ValueError that was triggered by unresolved references. We can now build metadata for specific products (like suma-retail) even if the products they reference aren't currently present in the build environment. It still logs the missing reference, but it no longer blocks the build.
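
The change from "raise" to "log and continue" might look roughly like this. The XML layout, the element and attribute names, and the function `unresolved_refs` are hypothetical and only illustrate the idea; the real stitcher in `stitch.py` may work differently.

```python
import logging
import xml.etree.ElementTree as ET
from typing import List

log = logging.getLogger("stitch-sketch")  # hypothetical logger name

# Hypothetical config layout: element and attribute names are made up.
SAMPLE = """
<config>
  <product id="suma-retail">
    <ref product="suma"/>
    <ref product="suma-retail"/>
  </product>
</config>
"""


def unresolved_refs(root: ET.Element) -> List[str]:
    """Return referenced product IDs that are not defined in the tree."""
    known = {p.get("id") for p in root.iter("product")}
    missing = []
    for ref in root.iter("ref"):
        target = ref.get("product", "")
        if target not in known:
            # Previously this situation would have raised ValueError.
            log.warning("Unresolved reference to product %r", target)
            missing.append(target)
    return missing


root = ET.fromstring(SAMPLE.strip())
missing = unresolved_refs(root)
print(missing)
```

Returning the list of missing targets keeps the information available to callers (for example, for a summary report) without blocking the build.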

4. Git Performance & Disk Safety (`src/docbuild/utils/git.py`)

I ran into major storage issues (120 GB+ of disk usage) during testing, so I removed the forced `--local` flag in the Git helper. This allows the tool to use our symlinked worktrees more efficiently without trying to perform heavy-duty local clones that eat up container storage.
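
As a rough illustration of no longer forcing the flag, a helper could make `--local` opt-in instead. `--local` is an existing `git clone` option; the function `build_clone_command` and its parameters are hypothetical and not taken from `git.py`.

```python
from typing import List


def build_clone_command(source: str, target: str, *, local: bool = False) -> List[str]:
    """Build the argv for `git clone`; `--local` is only added on request."""
    cmd = ["git", "clone"]
    if local:
        cmd.append("--local")
    cmd += [source, target]
    return cmd


# Default: no forced --local.
print(build_clone_command("/repos/doc-sle", "/tmp/doc-sle"))
```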

📊 New Audit Tooling

I’ve added a suite of tools to help us benchmark the automated output against our manual "Gold Standard" JSONs:

- `tools/mass_audit.py`: A runner that iterates through the entire manual catalog and attempts to generate metadata for every single one, producing a `status_summary.csv`.
- `tools/mass_audit_lean.py`: A faster version of the above that reads from `lean_audit.txt`. Use this for quick verification of code changes without running the full catalog.
- `tools/audit_metadata.py`: A statistical tool that compares the manual vs. generated JSONs and gives a "Match Rate %" for every product. This is how we track our progress toward 100% parity.
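
How `audit_metadata.py` computes its "Match Rate %" is not shown here. One simple interpretation, given purely as an assumption, is the share of top-level keys in the manual JSON whose values are reproduced exactly in the generated JSON. The function name and the sample data below are hypothetical.

```python
from typing import Any, Dict

_MISSING = object()  # sentinel so a missing key never equals a real value


def match_rate(manual: Dict[str, Any], generated: Dict[str, Any]) -> float:
    """Percentage of manual top-level keys reproduced exactly in generated."""
    if not manual:
        return 100.0
    matched = sum(
        1 for key, value in manual.items() if generated.get(key, _MISSING) == value
    )
    return 100.0 * matched / len(manual)


manual = {"title": "Admin Guide", "lang": "en-us", "dcfile": "DC-admin", "format": "html"}
generated = {"title": "No Title Available", "lang": "en-us", "dcfile": "DC-admin", "format": "html"}
print(match_rate(manual, generated))  # 75.0
```

Here the defaulted title counts as a mismatch, so three of four keys match. A real comparison would probably need to descend into nested structures as well.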

How to verify

This was tested on both macOS and an Ubuntu-based Docker environment. To replicate:

1. Set up your `env.development.toml` to point to your local documentation checkouts.
2. Create a `lean_audit.txt` in the project root and add a few problematic doctypes (for example, `sles/12-SP5/en-us`).
3. Run `python3 tools/mass_audit_lean.py`.
4. Check `audit_reports/lean_audit/` for the logs.
   - The tool now generates JSON files even for products that previously caused tracebacks.
5. Run `pytest`; all tests should pass (I updated the test suite to reflect the new resilient behavior).
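
For reference, reading such a `lean_audit.txt` could look like the sketch below. It assumes one `product/version/lang` doctype per line, as in the `sles/12-SP5/en-us` example, and additionally assumes that blank lines and `#` comments are ignored; the actual format accepted by `mass_audit_lean.py` may differ, and `Doctype` and `parse_lean_audit` are hypothetical names.

```python
from typing import List, NamedTuple


class Doctype(NamedTuple):
    product: str
    version: str
    lang: str


def parse_lean_audit(text: str) -> List[Doctype]:
    """Parse one product/version/lang entry per line (assumed format)."""
    doctypes = []
    for line in text.splitlines():
        entry = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if not entry:
            continue
        parts = entry.split("/")
        if len(parts) != 3:
            raise ValueError(f"expected product/version/lang, got {entry!r}")
        doctypes.append(Doctype(*parts))
    return doctypes


print(parse_lean_audit("sles/12-SP5/en-us\n"))
```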

This PR also includes the mandatory news fragment in changelog.d/192.feature.rst and updates .gitignore.

Signed-off-by: sushant-suse <[email protected]>

github-actions · 2026-02-24T15:00:04Z

Coverage Report

For commit 8e91d67

  Name                                           Stmts   Miss Branch BrPart  Cover
  --------------------------------------------------------------------------------
+ src/docbuild/models/deliverable.py               180      1     22      0  99.5%
+ src/docbuild/cli/cmd_check/process.py             58      0     22      1  98.8%
+ src/docbuild/models/manifest.py                  111      1     12      1  98.4%
+ src/docbuild/cli/cmd_cli.py                       93      1      8      1  98.0%
+ src/docbuild/utils/pidlock.py                     79      1     14      1  97.8%
+ src/docbuild/cli/cmd_validate/process.py         178      5     52      4  96.1%
+ src/docbuild/cli/callback.py                      35      0     10      2  95.6%
- src/docbuild/cli/cmd_config/__init__.py            9      1      0      0  88.9%
- src/docbuild/config/xml/stitch.py                 47      5     12      0  88.1%
- src/docbuild/cli/cmd_metadata/metaprocess.py     215     26     66     13  82.6%
- src/docbuild/cli/cmd_check/__init__.py            18      5      2      0  65.0%
- src/docbuild/cli/cmd_build/__init__.py            13      5      0      0  61.5%
- src/docbuild/cli/cmd_metadata/__init__.py         27     10      2      0  58.6%
- src/docbuild/cli/cmd_config/environment.py        11      6      2      0  38.5%
  --------------------------------------------------------------------------------
+ TOTAL                                           2891     67    670     23  97.0%
  
  46 files skipped due to complete coverage.

tomschr

Thanks @sushant-suse! Awesome work!

I have some ideas/suggestions/questions for you below. 😉

sushant-suse · 2026-02-25T15:06:45Z

Hi Toms, I’ve consolidated the 4 separate scripts into a single, portable audit_suite.py. It now handles environment-aware path resolution (detecting Docker vs. local macOS) and includes a CLI for stats and parity. I also updated the manifest models to be more resilient, ensuring we get 'Hollow JSONs' with metadata instead of empty files when XML fields are missing. Pytests are also green at 97% coverage.

tomschr

Before this old man forgets it, I'll send you this little idea. 😂

sushant-suse · 2026-02-26T05:10:23Z

> Before this old man forgets it, I'll send you this little idea. 😂

You are younger than me 😂

tomschr

I have some more ideas for you. 🙂

tomschr

Unfortunately I have some more suggestions. 😇

tomschr

Thanks a lot Sushant! 👍

feat openSUSE#192: implement metadata resilience and audit suite

1376d07

Signed-off-by: sushant-suse <[email protected]>

sushant-suse requested a review from tomschr February 25, 2026 06:47

tomschr requested changes Feb 25, 2026

feat: unified audit suite and improved metadata resilience (openSUSE#192)

079c017

Signed-off-by: sushant-suse <[email protected]>

sushant-suse requested a review from tomschr February 25, 2026 15:06

tomschr requested changes Feb 25, 2026

Comment thread tools/audit_suite.py Outdated

feat: implement resilient metadata validation and unified audit suite (openSUSE#192)

d487216

Signed-off-by: sushant-suse <[email protected]>

sushant-suse requested a review from tomschr February 26, 2026 05:10

tomschr reviewed Feb 27, 2026

Comment thread src/docbuild/models/manifest.py Outdated

tomschr requested changes Feb 27, 2026

Comment thread tools/audit_suite.py Outdated

Comment thread tools/audit_suite.py Outdated

feat openSUSE#192: unified audit suite with argparse and resilient metadata validation

5df3e12

Signed-off-by: sushant-suse <[email protected]>

sushant-suse requested a review from tomschr February 27, 2026 08:23

tomschr requested changes Feb 27, 2026

Comment thread src/docbuild/config/xml/stitch.py

Comment thread src/docbuild/models/manifest.py Outdated

Comment thread src/docbuild/models/manifest.py Outdated

Comment thread tools/audit_suite.py

Comment thread tools/audit_suite.py

feat: implement resilient metadata pipeline and unified audit suite (openSUSE#192)

8e91d67

Signed-off-by: sushant-suse <[email protected]>

sushant-suse requested a review from tomschr February 27, 2026 09:12

tomschr approved these changes Feb 27, 2026

sushant-suse merged commit 26f7c37 into openSUSE:main Feb 27, 2026
9 of 10 checks passed

sushant-suse deleted the test_json_PR-183 branch February 27, 2026 13:04
