rrp-bot/kg-ingest

kg-ingest

kg-ingest builds a generic knowledge graph of a GitHub org by ingesting that org's repositories into Neo4j. Documentation is the primary source of understanding; code files (go.mod, Terraform, Makefile, Helm, docker-compose) are secondary.

All classification knowledge lives in config/tech_map.yaml. The ingestion engine has zero hardcoded org, repo, or technology names.

Quick start

# 1. Copy and populate the env file
cp .env.example .env
# Set INGEST_ORG=your-github-org  (and optionally GITHUB_TOKEN)

# 2. Start Neo4j and run ingestion
./run.sh up

# 3. Query the graph
./run.sh query

Neo4j Browser is available at http://localhost:7474 (credentials: neo4j / password).

Ingestion options

# Ingest an entire org (top N by stars)
./run.sh ingest -- --org myorg --top 20

# Ingest specific repos
./run.sh ingest -- --repos owner/repo1,owner/repo2

# Filter by name glob
./run.sh ingest -- --org myorg --filter "my-service-*"
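
As a rough illustration of how a name glob and a top-N limit could combine, here is a minimal sketch. It is not taken from kg-ingest: the `select_repos` helper, the dictionary shape, and the choice to filter before truncating are all assumptions, and the tool's actual glob dialect is not documented here.

```python
from fnmatch import fnmatchcase


def select_repos(repos, pattern="*", top=None):
    """Keep repos whose name matches the glob, ordered by stars (descending)."""
    # Hypothetical helper: filter first, then apply the top-N limit.
    matched = [r for r in repos if fnmatchcase(r["name"], pattern)]
    matched.sort(key=lambda r: r["stars"], reverse=True)
    return matched if top is None else matched[:top]


# Made-up sample data, for illustration only.
repos = [
    {"name": "my-service-api", "stars": 12},
    {"name": "my-service-worker", "stars": 30},
    {"name": "docs", "stars": 99},
]

selected = select_repos(repos, pattern="my-service-*", top=20)
print([r["name"] for r in selected])  # ['my-service-worker', 'my-service-api']
```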

Ingestion is incremental — safe to re-run. Existing nodes are merged, not duplicated.
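
The merge-not-duplicate behaviour matches what Cypher's MERGE clause provides: match an existing node on a key, or create it if absent. The sketch below only illustrates that idea. The `full_name` key, the `stars` property, and the in-memory `FakeGraph` class are assumptions and do not come from kg-ingest.

```python
# Hypothetical Cypher an ingester could send. MERGE matches an existing node
# on the key or creates it, so re-running does not create duplicates.
MERGE_REPO = """
MERGE (r:Repository {full_name: $full_name})
SET r.stars = $stars
"""


class FakeGraph:
    """In-memory stand-in for the database that mimics merge-on-key semantics."""

    def __init__(self):
        self.nodes = {}  # (label, key) -> properties

    def merge(self, label, key, **props):
        # Reuse the node if it exists, otherwise create it, then update properties.
        node = self.nodes.setdefault((label, key), {})
        node.update(props)
        return node


graph = FakeGraph()
for _ in range(3):  # simulate three ingestion runs over the same repo
    graph.merge("Repository", "owner/repo1", stars=42)
graph.merge("Repository", "owner/repo1", stars=43)  # a later run with fresh data

print(len(graph.nodes))  # 1
```

Keying the merge on a stable identifier is what makes a re-run safe; properties that change between runs are overwritten rather than producing a second node.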

Configuration

Edit config/tech_map.yaml to teach the engine about your technology stack:

Section                                    Controls
patterns                                   Architectural pattern detection from documentation prose
prose_keywords                             Technology mentions detected in README / docs
go_modules                                 Go direct-dependency classification
terraform_providers / terraform_resources  Terraform technology detection
compose_services                           docker-compose service classification
makefile_tools                             Tool usage detected in Makefiles
helm_charts                                Helm chart name classification
component_name_kinds                       Map repo/binary names to component kinds
component_subdirs                          Map cmd/ subdirectory names to component kinds
parsers                                    Enable/disable optional file parsers (openapi, ocm_model_dsl)

No code changes are needed — only edit the YAML.
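
To make the config-driven idea concrete, here is a minimal sketch of classification by lookup table. The real layout of config/tech_map.yaml is not shown in this README, so the dictionary below is a hypothetical stand-in for a parsed `go_modules` section, and `classify_go_module` is an illustrative helper, not the engine's actual function.

```python
# Hypothetical parsed form of a `go_modules` section; the real schema of
# config/tech_map.yaml may differ.
TECH_MAP = {
    "go_modules": {
        "github.com/neo4j/neo4j-go-driver": {
            "technology": "Neo4j",
            "category": "database",
        },
        "google.golang.org/grpc": {
            "technology": "gRPC",
            "category": "communication",
        },
    }
}


def classify_go_module(module_path, tech_map):
    """Return the entry whose key equals or is a path prefix of module_path."""
    for prefix, entry in tech_map["go_modules"].items():
        if module_path == prefix or module_path.startswith(prefix + "/"):
            return entry
    return None  # unknown modules stay unclassified


print(classify_go_module("github.com/neo4j/neo4j-go-driver/v5", TECH_MAP))
print(classify_go_module("example.com/unknown/module", TECH_MAP))
```

Because the function holds no technology names of its own, teaching it a new module means adding an entry to the mapping and nothing else.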

Graph schema

Organisation  -[:HAS_REPO]---------->  Repository
Repository    -[:CONTAINS]---------->  Component
Repository    -[:INTRODUCES_CONCEPT]-> Concept
Repository    -[:HAS_DOCUMENT]------->  Document
Repository    -[:REFERENCES]--------->  Repository
Repository    -[:DEFINES_SERVICE]---->  ApiService
Component     -[:USES_TECHNOLOGY]---->  Technology
Component     -[:FOLLOWS_PATTERN]---->  Pattern
Component     -[:EXPOSES_API]-------->  ApiContract
Component     -[:DEPENDS_ON]--------->  Component
ApiContract   -[:HAS_ENDPOINT]------->  HttpEndpoint
ApiService    -[:HAS_TYPE|HAS_RESOURCE|HAS_ENUM]-> ApiResource
ApiResource   -[:HAS_FIELD]---------->  ApiField
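
The schema can also be written down as data, which is handy for checking that an edge is legal before writing it. The triples below are transcribed from the listing above; the helper functions are an illustrative sketch and not part of kg-ingest.

```python
# (source label, relationship type, target label), transcribed from the schema.
SCHEMA = [
    ("Organisation", "HAS_REPO", "Repository"),
    ("Repository", "CONTAINS", "Component"),
    ("Repository", "INTRODUCES_CONCEPT", "Concept"),
    ("Repository", "HAS_DOCUMENT", "Document"),
    ("Repository", "REFERENCES", "Repository"),
    ("Repository", "DEFINES_SERVICE", "ApiService"),
    ("Component", "USES_TECHNOLOGY", "Technology"),
    ("Component", "FOLLOWS_PATTERN", "Pattern"),
    ("Component", "EXPOSES_API", "ApiContract"),
    ("Component", "DEPENDS_ON", "Component"),
    ("ApiContract", "HAS_ENDPOINT", "HttpEndpoint"),
    ("ApiService", "HAS_TYPE", "ApiResource"),
    ("ApiService", "HAS_RESOURCE", "ApiResource"),
    ("ApiService", "HAS_ENUM", "ApiResource"),
    ("ApiResource", "HAS_FIELD", "ApiField"),
]


def outgoing(label):
    """Relationship types that may start at a node with this label."""
    return sorted(rel for src, rel, _ in SCHEMA if src == label)


def is_valid_edge(src, rel, dst):
    """True if the schema allows (src)-[:rel]->(dst)."""
    return (src, rel, dst) in SCHEMA


print(outgoing("Component"))
print(is_valid_edge("Repository", "USES_TECHNOLOGY", "Technology"))  # False
```

Note that technologies and patterns hang off Component, not Repository, so a per-repo technology query has to traverse CONTAINS first.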

Named queries

Run ./run.sh query, then type list to see all available queries, including:

  • overview — repos, languages, component counts
  • tech-stack — every technology and how widely it is used
  • shared-tech — technologies shared across multiple components
  • patterns — architectural patterns detected from documentation
  • concepts — domain concepts extracted from documentation
  • communications — protocols and message buses
  • databases — data stores per component
  • auth — authentication and authorisation technologies
  • platform-arch — cross-repo component dependency graph
  • cross-references — which repos reference each other in docs
  • endpoints — all HTTP API endpoints
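
A named query is essentially a short alias for a Cypher statement. The sketch below shows one way such a registry could look. The Cypher is derived from the graph schema in this README, but the `name` property, the `NAMED_QUERIES` dictionary, and the `resolve` helper are assumptions; the queries actually shipped with kg-ingest may differ.

```python
# Hypothetical registry mapping query names to Cypher text.
NAMED_QUERIES = {
    "tech-stack": (
        "MATCH (c:Component)-[:USES_TECHNOLOGY]->(t:Technology) "
        "RETURN t.name AS technology, count(DISTINCT c) AS components "
        "ORDER BY components DESC"
    ),
    "shared-tech": (
        "MATCH (c:Component)-[:USES_TECHNOLOGY]->(t:Technology) "
        "WITH t, count(DISTINCT c) AS components "
        "WHERE components > 1 "
        "RETURN t.name AS technology, components "
        "ORDER BY components DESC"
    ),
}


def resolve(user_input):
    """Expand a known query name; pass anything else through as raw Cypher."""
    text = user_input.strip()
    return NAMED_QUERIES.get(text, text)


print(resolve("tech-stack"))
print(resolve("MATCH (n) RETURN count(n)"))
```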

Services

Service  Purpose
neo4j    Graph database (persisted in a Docker volume)
ingest   Clones repos, extracts knowledge, writes graph
query    Interactive Cypher REPL + named queries

Requirements

  • Docker + Docker Compose
  • GitHub token (recommended; authenticated requests get a much higher API rate limit): set GITHUB_TOKEN in .env
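
For context on why the token matters: the GitHub REST API allows far fewer requests per hour for unauthenticated clients than for authenticated ones. A minimal sketch of how a client might build its request headers from the environment follows. The `github_headers` helper is hypothetical and is not kg-ingest's code; the `Authorization: Bearer` and `Accept` header forms are the ones GitHub documents for its REST API.

```python
import os


def github_headers(env=None):
    """Build GitHub REST API headers, adding auth only when a token is set."""
    env = os.environ if env is None else env
    headers = {"Accept": "application/vnd.github+json"}
    token = env.get("GITHUB_TOKEN", "").strip()
    if token:
        # Authenticated requests are subject to a higher rate limit.
        headers["Authorization"] = f"Bearer {token}"
    return headers


print(github_headers({}))  # no token: anonymous headers only
print(sorted(github_headers({"GITHUB_TOKEN": "example-token"})))
```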
