This repository was archived by the owner on Mar 31, 2026. It is now read-only.

CatalogDataToCsv

This driver reads data directly from the NuGet.org catalog and projects it into CSV. Data that is not directly available in the catalog should not be seen or processed by this driver.

CatalogScanDriverType enum value: CatalogDataToCsv
Driver implementation: CatalogDataToCsvDriver
Processing mode: process latest catalog leaf per package ID and version
Cursor dependencies: V3 package content (blocks on this cursor to align with other drivers)
Components using driver output: Kusto ingestion via KustoIngestionMessageProcessor, since this driver produces CSV data
Temporary storage config: Table Storage
  • CsvRecordTableName (name prefix): holds CSV records before they are added to a CSV blob
  • TaskStateTableName (name prefix): tracks completion of CSV blob aggregation
Persistent storage config: Blob Storage
  • CatalogLeafItemContainerName: contains CSVs for the CatalogLeafItems table
  • PackageDeprecationContainerName: contains CSVs for the PackageDeprecations table
  • PackageVulnerabilityContainerName: contains CSVs for the PackageVulnerabilities table
Output CSV tables:
  • CatalogLeafItems
  • PackageDeprecations
  • PackageVulnerabilities

Algorithm

This driver produces multiple views (or projections) of the catalog data.

  • CatalogLeafItems is the raw data pulled from each PackageDetails catalog leaf. It includes the historical data found in the catalog.
  • PackageDeprecations contains the latest deprecation data for each package, in a cleaned-up format.
  • PackageVulnerabilities contains the latest vulnerability data for each package, in a cleaned-up format.
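
The idea of projecting one catalog leaf into several CSV views can be illustrated with a small Python sketch. This is not the driver's code (the driver is `CatalogDataToCsvDriver`). The leaf below is a hypothetical, heavily simplified stand-in for a PackageDetails leaf, and the column choices for each table are assumptions made for illustration only.

```python
import csv
import io

# Hypothetical, simplified leaf; real PackageDetails leaves carry many more fields.
leaf = {
    "id": "Example.Package",
    "version": "1.0.0",
    "commitTimestamp": "2024-01-02T03:04:05Z",
    "deprecation": {"reasons": ["Legacy"], "message": "Use Example.Package2"},
    "vulnerabilities": [
        {"advisoryUrl": "https://example.test/advisory/1", "severity": "2"},
    ],
}


def project(leaf):
    """Return the rows for each output table (columns here are assumptions)."""
    key = [leaf["id"], leaf["version"]]
    deprecation = leaf.get("deprecation")
    return {
        # One row per leaf, whether or not it has extra metadata.
        "CatalogLeafItems": [key + [leaf["commitTimestamp"]]],
        # At most one row: a leaf carries zero or one deprecation object.
        "PackageDeprecations": (
            [key + ["|".join(deprecation["reasons"]), deprecation.get("message", "")]]
            if deprecation
            else []
        ),
        # One row per vulnerability listed on the leaf.
        "PackageVulnerabilities": [
            key + [v["advisoryUrl"], v["severity"]]
            for v in leaf.get("vulnerabilities", [])
        ],
    }


def to_csv(rows):
    """Serialize rows to CSV text using the standard library writer."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()


views = {table: to_csv(rows) for table, rows in project(leaf).items()}
```

Each view is derived only from fields present on the leaf itself, which matches the constraint that this driver should not process data unavailable in the catalog.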

The driver reads the set of package leaf documents that fall within the commit timestamp bounds for the catalog scan, generates CSV record instances in memory, and appends them to a temporary CSV record table in Azure Table Storage. When all of the catalog leaves have been processed, batches of records from Table Storage are pulled into memory and merged into CSV blobs.

When all of the CSV blobs have been updated, the temporary table is deleted, leaving just the updated CSV blobs.
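
The stage-then-merge flow described above can be sketched in a few lines of Python. This is a minimal illustration, not the driver's implementation. Plain dictionaries stand in for Azure Table Storage and Blob Storage, integers stand in for commit timestamps, and the bucketing function, the exclusive-minimum/inclusive-maximum bounds, and the "newest commit wins" merge rule are all assumptions.

```python
from collections import defaultdict


def run_scan(leaves, min_commit, max_commit, blobs, bucket_count=2):
    """Stage records for leaves in (min_commit, max_commit], then merge into blobs."""
    # Keep only the latest leaf per (package ID, version) inside the scan bounds.
    latest = {}
    for leaf in leaves:
        if not (min_commit < leaf["commit"] <= max_commit):
            continue
        key = (leaf["id"].lower(), leaf["version"])
        if key not in latest or leaf["commit"] > latest[key]["commit"]:
            latest[key] = leaf

    # Phase 1: append records to a temporary staging table, grouped by bucket.
    staging = defaultdict(list)
    for key, leaf in latest.items():
        bucket = sum(key[0].encode()) % bucket_count  # toy deterministic hash
        staging[bucket].append((key, leaf["commit"], leaf["data"]))

    # Phase 2: pull each bucket into memory and merge it into the matching blob.
    for bucket, records in staging.items():
        merged = dict(blobs.get(bucket, {}))
        for key, commit, data in records:
            if key not in merged or commit > merged[key][0]:
                merged[key] = (commit, data)
        blobs[bucket] = merged

    # Phase 3: drop the temporary table so only the updated blobs remain.
    staging.clear()
    return blobs


leaves = [
    {"id": "A", "version": "1.0.0", "commit": 1, "data": "old"},
    {"id": "A", "version": "1.0.0", "commit": 3, "data": "new"},
    {"id": "B", "version": "2.0.0", "commit": 2, "data": "only"},
    {"id": "C", "version": "1.0.0", "commit": 9, "data": "out of bounds"},
]
blobs = run_scan(leaves, min_commit=0, max_commit=5, blobs={})
```

Staging records before merging means each blob is rewritten once per scan rather than once per leaf, which is the likely motivation for the temporary table in a design like this.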