registry: auto-generate 475 entity stubs from disclosure data #83

Merged: kurtseifried merged 1 commit into main from feat/entity-stubs-from-disclosure on May 23, 2026
@kurtseifried

Collaborator

Closes the disclosure→entity coverage gap. 486 disclosure entries existed but only 11 had parallel entity entries — this PR ships 475 auto-generated entity stubs.

What

`scripts/generate-entity-stubs-from-disclosure.py` (new):

  • Reads each `registry/disclosure/.json`
  • Skips if any entity entry already exists for that namespace
  • Emits a parallel entity stub at the mirrored path with:
    • `namespace`, `official_name`, `common_name`, `alternate_names` copied
    • Website URL copied (CNA-specific URLs like disclosure-policy stay on the disclosure record)
    • `wikidata`, `wikipedia` preserved (usually null in source data)
    • `notes` cross-referencing `secid:disclosure//cna`
    • `status: "draft"` with `status_notes` flagging the auto-gen and empty match_nodes
    • `match_nodes: []` — the disclosure data doesn't tell us specific products
  • `--dry-run` flag for preview
  • Idempotent: re-running emits zero stubs if all disclosure entries have entity coverage

Bulk result: 475 new entity files created across the appropriate reverse-DNS directories.
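
The generator's flow described above can be sketched roughly as follows. This is an illustrative reconstruction, not the contents of `scripts/generate-entity-stubs-from-disclosure.py`: the function names `build_stub` and `generate_stubs`, the shape of the `urls` list (dicts with `type` and `url` keys), the wording of `status_notes`, and copying `schema_version` from the disclosure record are all assumptions made for the example.

```python
import json
from pathlib import Path


def build_stub(disclosure: dict) -> dict:
    """Build a minimal entity stub from one disclosure record (field layout assumed)."""
    ns = disclosure["namespace"]
    # Assumption: "urls" is a list of {"type": ..., "url": ...} dicts.
    # Only the website URL is carried over; CNA-specific URLs stay on the disclosure record.
    website = [u for u in disclosure.get("urls", []) if u.get("type") == "website"]
    return {
        "schema_version": disclosure.get("schema_version"),  # assumed to be copied as-is
        "namespace": ns,
        "type": "entity",
        "status": "draft",
        # Hypothetical wording; the real status_notes text is not shown in this PR.
        "status_notes": "Auto-generated from disclosure data; match_nodes empty pending research.",
        "official_name": disclosure.get("official_name"),
        "common_name": disclosure.get("common_name"),
        "alternate_names": disclosure.get("alternate_names"),
        "wikidata": disclosure.get("wikidata"),
        "wikipedia": disclosure.get("wikipedia"),
        "urls": website,
        "notes": f"Companion disclosure record: secid:disclosure/{ns}/cna",
        "match_nodes": [],
    }


def generate_stubs(registry: Path, dry_run: bool = False) -> list[Path]:
    """Create entity stubs for uncovered disclosure namespaces; return the target paths."""
    disclosure_root = registry / "disclosure"
    entity_root = registry / "entity"

    # Inventory existing entities by namespace (not path), so a record stored
    # at a non-mirror path still prevents a duplicate stub.
    covered = set()
    for path in entity_root.rglob("*.json"):
        covered.add(json.loads(path.read_text(encoding="utf-8"))["namespace"])

    created = []
    for src in sorted(disclosure_root.rglob("*.json")):
        record = json.loads(src.read_text(encoding="utf-8"))
        if record["namespace"] in covered:
            continue  # already has entity coverage, so re-runs emit nothing
        dest = entity_root / src.relative_to(disclosure_root)  # mirrored path
        if not dry_run:
            dest.parent.mkdir(parents=True, exist_ok=True)
            dest.write_text(json.dumps(build_stub(record), indent=2) + "\n", encoding="utf-8")
        covered.add(record["namespace"])
        created.append(dest)
    return created
```

Calling `generate_stubs(Path("registry"), dry_run=True)` would list the paths that a real run would write without touching the filesystem; a second real run returns an empty list because every namespace is then covered.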

Test plan

  • All 690 entity JSON files (475 new + 215 existing) parse cleanly
  • All 475 new entries have required schema fields (`schema_version`, `namespace`, `type`, `status`, `official_name`) and `type: entity`
  • `scripts/validate-subtypes.py` passes — none of the stubs introduce new subtype values (match_nodes are empty)
  • After merge + auto-deploy: `curl 'https://secid.cloudsecurityalliance.org/api/v1/resolve?secid=secid:entity/adobe.com'` returns the new Adobe entity entry
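
The first two checks in the test plan (clean parsing, required fields, `type: entity`) could be reproduced with a small script along these lines. It is a sketch under assumptions, not the validation actually run for this PR: the helper names `check_entity_file` and `check_tree` are hypothetical, and only the five required fields listed above are checked.

```python
import json
from pathlib import Path

# Required fields as listed in the test plan above.
REQUIRED_FIELDS = ("schema_version", "namespace", "type", "status", "official_name")


def check_entity_file(path: Path) -> list[str]:
    """Return a list of problems for one entity file (empty list means it passes)."""
    try:
        record = json.loads(path.read_text(encoding="utf-8"))
    except json.JSONDecodeError as exc:
        return [f"{path}: invalid JSON ({exc})"]
    if not isinstance(record, dict):
        return [f"{path}: top-level value is not an object"]
    problems = [f"{path}: missing {field}" for field in REQUIRED_FIELDS if field not in record]
    # Only report a wrong type when the field is present; absence is reported above.
    if "type" in record and record["type"] != "entity":
        problems.append(f"{path}: type is {record['type']!r}, expected 'entity'")
    return problems


def check_tree(entity_root: Path) -> list[str]:
    """Check every JSON file under the entity directory and collect all problems."""
    problems = []
    for path in sorted(entity_root.rglob("*.json")):
        problems.extend(check_entity_file(path))
    return problems
```

An empty list from `check_tree(Path("registry/entity"))` would correspond to the "parse cleanly" and "required schema fields" items; subtype validation remains the job of `scripts/validate-subtypes.py`.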

Notes on the result

  • The stubs are intentionally minimal. Future human research per vendor will populate `match_nodes` with product/service patterns. The stub just makes the entity citable by namespace.
  • The reference-type coverage gap remains. 397 disclosure entries still lack reference entries; that's a separate problem because "what's worth referencing" needs per-vendor judgement (which blog? which advisory archive? which whitepapers page?) that doesn't auto-generate cleanly.
  • Why `match_nodes: []` is OK: existing entity entries like `amazon.com` already use this pattern — namespace-level metadata only, no sub-resolution. Resolves at `secid:entity/`.

🤖 Generated with Claude Code

Closes the disclosure->entity coverage gap audited earlier: 486 disclosure
entries (mostly CNAs) existed but only 11 had parallel entity entries.
This commit ships 475 auto-generated entity stubs, one for each disclosure
namespace that didn't already have an entity record.

Each stub carries:
- namespace, official_name, common_name, alternate_names, urls
  (website only — CNA-specific URLs like disclosure-policy stay on the
  disclosure record where they belong)
- wikidata, wikipedia (usually null in the source disclosure data;
  preserved as-is for opportunistic future population)
- notes pointing at the companion secid:disclosure/<ns>/cna for the
  vulnerability-reporting program
- status: "draft" with status_notes explicitly marking the stub as
  auto-generated and flagging that match_nodes are empty pending
  human research
- match_nodes: [] (the disclosure data doesn't tell us specific
  products or services to pattern-match; that's per-vendor research)

scripts/generate-entity-stubs-from-disclosure.py (new):
- Idempotent: re-running emits zero stubs if all disclosure entries
  already have entity coverage
- Inventories existing entity files by namespace (not just by path)
  so an entity record at a non-mirror path still skips creation
- --dry-run flag for preview
- Preserves disclosure-side directory structure under entity/

Validation:
- All 690 entity JSON files (475 new + 215 existing) parse cleanly
- All 475 new entries have the required schema fields and type: entity
- scripts/validate-subtypes.py passes (none of the stubs introduce
  new subtype values)

Future work (not in this PR):
- Per-vendor human research to populate match_nodes with product
  patterns (e.g., for Adobe, entries for Reader, Acrobat, Photoshop...)
- Reference-type coverage is separately uneven (89 of 486 disclosure
  entries have parallel reference entries); flagged but not addressed
  here — reference content requires per-vendor judgement (which blog,
  which advisory page) that doesn't auto-generate cleanly

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
@kurtseifried kurtseifried merged commit 892e241 into main May 23, 2026
1 check passed
@kurtseifried kurtseifried deleted the feat/entity-stubs-from-disclosure branch May 23, 2026 02:41
kurtseifried added a commit that referenced this pull request May 28, 2026
- Update stale namespace counts (1,151 -> 1,768) in three places;
  table refreshed by scripts/update-counts.sh.
- Add five scripts missing from the Scripts section
  (generate-entity-stubs-from-disclosure, check-security-txt,
  scan-well-known, scan-mcp-endpoints, validate-subtypes).
- Document the auto-generated-entity-stub pattern introduced in PR #83
  so future contributors know thin entity files are deliberate.
- Trim Repository Structure tree (45 -> 14 lines); collapse the
  exhaustive docs/ enumeration to one line pointing at the Document
  Map above, which already routes by question.

Co-authored-by: Claude Opus 4.7 (1M context) <[email protected]>