Add 247 entity stubs from CSA public member list #87

Merged
kurtseifried merged 1 commit into main from entity-csa-members on Jun 8, 2026
@kurtseifried

Collaborator

What

Adds 247 new entity stub records generated from the CSA public member roster (csa-website-members-2026-06-08.csv), plus a reusable generator script.

This is a third entity-ingestion track alongside the existing CNA disclosure-stub generator and hand-curated entries. The CSA member list is a curated, security-relevant roster (overwhelmingly commercial vendors) that fills gaps the CNA-derived stubs miss — e.g. members like Abnormal AI, A-LIGN, AVEVA had no entity record at all.

How

scripts/generate-entity-from-csa-members.py (parallel to generate-entity-stubs-from-disclosure.py):

  • Cleans scraped domains (strips www., zero-width chars, trailing protocol junk like "huaweicloud.com http:")
  • Skips by namespace — never overwrites existing/hand-curated files
  • Deterministic output, --dry-run

| Outcome | Count |
| --- | --- |
| New entity stubs created | 247 |
| Skipped: namespace already exists (untouched) | 44 |
| Skipped: duplicate domain within CSV | 3 |
| Skipped: no usable domain (need hand-fill) | 10 |
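For readers who have not opened the script, the cleanup and skip logic described above can be sketched roughly as follows. This is an illustrative approximation, not the code in scripts/generate-entity-from-csa-members.py: the names `clean_domain` and `plan`, the outcome keys, the order of the skip checks, and the assumption that a namespace is simply the cleaned domain are all hypothetical.

```python
import re
from collections import Counter
from typing import Optional

# Zero-width and BOM characters that commonly survive web scraping.
_ZERO_WIDTH = dict.fromkeys(map(ord, "\u200b\u200c\u200d\u2060\ufeff"))


def clean_domain(raw: str) -> Optional[str]:
    """Return a normalized bare domain, or None if nothing usable remains."""
    text = raw.translate(_ZERO_WIDTH).strip().lower()
    if not text:
        return None
    # Keep the first whitespace-separated token; drops trailing junk like " http:".
    text = text.split()[0]
    # Drop a leading scheme and anything after the host.
    text = re.sub(r"^[a-z][a-z0-9+.-]*://", "", text)
    text = text.split("/", 1)[0]
    if text.startswith("www."):
        text = text[4:]
    # Require hostname characters only and at least one dot.
    if not re.fullmatch(r"[a-z0-9-]+(\.[a-z0-9-]+)+", text):
        return None
    return text


def plan(rows, existing_namespaces):
    """Classify (entity_name, raw_domain) rows without touching the filesystem.

    Assumption: the namespace is the cleaned domain. The real script may
    derive namespaces differently and may apply the checks in another order.
    """
    counts = Counter()
    seen = set()
    to_create = []
    for name, raw in rows:
        domain = clean_domain(raw)
        if domain is None:
            counts["no_usable_domain"] += 1
        elif domain in seen:
            counts["duplicate_domain"] += 1
        else:
            seen.add(domain)
            if domain in existing_namespaces:
                # Existing or hand-curated records are never overwritten.
                counts["namespace_exists"] += 1
            else:
                to_create.append((name, domain))
                counts["created"] += 1
    return to_create, counts
```

Because `plan` only classifies rows and returns what would be written, a --dry-run mode could print `counts` and stop, while a normal run would go on to write one file per entry in `to_create`.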

Design notes

  • Generated untagged (no subtype:) to match the current state of the entity registry. The entity subtype catalog (organization/product/service/model/dataset + org-nature tags) is still being designed; provenance notes flag these as prime candidates for ["organization","company"] in a later tagging sweep.
  • CSA membership is intentionally not encoded on the records — that's a relationship-layer fact (an open-ended, changing edge), not an intrinsic entity property. The registry says what each node is; the future Relationship layer says how nodes connect.
  • No regex patterns added (match_nodes: []), so no regex-safety considerations.
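As a rough picture of what these design notes imply for each generated record, here is a minimal sketch of a stub builder. Only `status: draft`, the empty `match_nodes`, the absence of a subtype, and the idea of a provenance note come from this PR; the other field names (`name`, `domain`, `provenance.source`, `provenance.note`) and the functions `build_stub` and `render` are hypothetical and may not match the registry schema.

```python
import json


def build_stub(entity_name: str, domain: str) -> dict:
    """Build a draft entity stub (field names are illustrative assumptions)."""
    return {
        "name": entity_name,
        "domain": domain,
        "status": "draft",
        # No regex patterns are generated, so there is nothing to safety-review.
        "match_nodes": [],
        # Deliberately no subtype key and no CSA-membership field:
        # membership belongs to the future relationship layer.
        "provenance": {
            "source": "CSA public member roster",
            "note": 'Candidate for ["organization","company"] in a later tagging sweep.',
        },
    }


def render(stub: dict) -> str:
    """Serialize with sorted keys so repeated runs produce byte-identical files."""
    return json.dumps(stub, indent=2, sort_keys=True, ensure_ascii=False) + "\n"
```

Sorting keys at serialization time is one simple way to get deterministic output regardless of how the dictionary was assembled; the actual script may achieve this differently.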

Validation

Both CI checks pass locally:

  • validate-registry-schema.py → all 2015 files validate against the schema
  • validate-subtypes.py → clean (new files add no subtypes)

Follow-ups (not in this PR)

  • 10 members need a domain hand-filled: Baker Tilly, DATADOG, Dassault Systèmes, Dynatrace, HiveMQ, Ometria, OpenAI, Prescient, SecqureOne, Securisea
  • Subtype tagging once the entity subtype catalog lands

🤖 Generated with Claude Code

New ingestion track parallel to the CNA disclosure-stub generator. Reads
the CSA website member roster (entity_name + domain) and emits draft entity
stubs for members with no existing entity record.

- scripts/generate-entity-from-csa-members.py: generator (domain cleanup for
  scraped junk, skip-by-namespace so existing/hand-curated files are never
  overwritten, deterministic output, --dry-run)
- 247 new registry/entity/**/*.json: status draft, empty match_nodes, same
  fidelity as the existing CNA stubs
- Generated untagged (no subtype:) to match the rest of the entity registry;
  the entity subtype catalog is still being designed. Provenance notes flag
  them as candidates for ["organization","company"] in a later sweep.
- CSA membership itself is intentionally NOT encoded — that is a
  relationship-layer fact, not an intrinsic entity property.

Skipped: 44 namespaces already present (untouched), 3 in-CSV duplicate
domains, 10 rows with no usable domain (need a domain hand-filled).

All 2015 registry files validate against the schema; subtype check clean.

Co-Authored-By: Claude Opus 4.8 (1M context) <[email protected]>
kurtseifried merged commit 06dedf0 into main on Jun 8, 2026
2 checks passed
kurtseifried deleted the entity-csa-members branch on June 8, 2026 at 15:54