Research Software Observatory – Data Pipeline

Metadata integration and quality assessment of research software.

Overview

This repository contains the data pipeline that powers the Research Software Observatory, a platform for monitoring and assessing the quality and FAIRness of research software in the life sciences.

It consolidates software records, resolves duplicates, and precomputes the quality and FAIRness statistics displayed in the Observatory’s interface.

-> 📄 Documentation

Pipeline

The ETL runs in modular stages. Each stage can be executed independently, or all stages can be orchestrated end-to-end through the unified CLI command rsetl.

Transformation – Fetches raw records from source collections and standardizes them.
License normalization – Maps license strings to SPDX identifiers.
Blocking and recovery – Groups related software records from normalized data.
Metrics removal (optional) – Filters low-information OpenEBench metrics.
Conflict detection – Identifies inconsistent or duplicate records.
Simplification – Reduces block complexity for later processing.
Conversion to JSONL – Formats data for large-scale or LLM-based steps.
Disambiguation – Uses heuristics and AI-assisted agreement scoring to resolve conflicts.
Human integration – Incorporates curator decisions from Git-based annotations.
Merge – Produces final, merged software entries and updates the database.
FAIRsoft scores and statistics – Computes FAIR compliance metrics and aggregated statistics stored in the database to support visualization and longitudinal monitoring.
Similarity – Embeds tool descriptions and precomputes the top-10 nearest neighbours per tool to power "similar software" recommendations.
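To make the license normalization stage more concrete, here is a minimal sketch of mapping free-text license strings to SPDX identifiers. The function name normalize_license and the alias table SPDX_ALIASES are hypothetical and only illustrate the idea; the mapping actually used by the pipeline may be larger and work differently.

```python
import re

# Hypothetical alias table: a tiny illustrative subset,
# not the mapping used by the pipeline itself.
SPDX_ALIASES = {
    "mit": "MIT",
    "mit license": "MIT",
    "apache 2.0": "Apache-2.0",
    "apache license 2.0": "Apache-2.0",
    "apache-2.0": "Apache-2.0",
    "bsd 3-clause": "BSD-3-Clause",
    "bsd-3-clause": "BSD-3-Clause",
}


def normalize_license(raw):
    """Return an SPDX identifier for a license string, or None if unknown."""
    if not raw:
        return None
    # Lowercase and collapse whitespace so trivial variants share one key.
    key = re.sub(r"\s+", " ", raw.strip().lower())
    return SPDX_ALIASES.get(key)


if __name__ == "__main__":
    for text in ["MIT License", "  apache   2.0 ", "Custom in-house license"]:
        print(repr(text), "->", normalize_license(text))
```

Returning None for unrecognized strings, rather than guessing, keeps unknown licenses visible so they can be reviewed later.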

Each execution creates a versioned run directory under data/integration/runs/<run_id>/ with a manifest file tracking inputs, outputs, and environment metadata.
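The run directory and manifest described above could be produced along these lines. This is a sketch under stated assumptions: the function create_run_dir, the timestamp-based run_id format, the file name manifest.json, and the manifest keys are all hypothetical and are not taken from the pipeline's code.

```python
import json
import platform
import sys
from datetime import datetime, timezone
from pathlib import Path


def create_run_dir(base, inputs):
    """Create <base>/<run_id>/ and write a manifest describing the run."""
    # Assumption: the run id is a UTC timestamp; the real format may differ.
    run_id = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%S%fZ")
    run_dir = Path(base) / run_id
    run_dir.mkdir(parents=True, exist_ok=False)

    manifest = {
        "run_id": run_id,
        "inputs": list(inputs),
        "outputs": [],  # stages would append their output paths here
        "environment": {
            "python": sys.version.split()[0],
            "platform": platform.platform(),
        },
    }
    # Assumption: the manifest is stored as JSON under this file name.
    (run_dir / "manifest.json").write_text(json.dumps(manifest, indent=2))
    return run_dir


if __name__ == "__main__":
    path = create_run_dir("data/integration/runs", ["example_input.json"])
    print("created", path)
```

Recording the interpreter version and platform next to the inputs makes it easier to explain differences between two runs after the fact.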

Getting started

# install in editable mode
pip install -e .

# set up environment variables (MongoDB + API tokens)
export MONGO_HOST=...
export MONGO_DB=...
export GITHUB_TOKEN=...
# etc.

# run full integration
rsetl

All intermediate and final files are automatically stored in timestamped directories, and a latest symlink always points to the most recent run.
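One common way to keep a latest symlink pointing at the newest run is to create a temporary link and rename it over the old one, so readers never see a missing link. The sketch below assumes a POSIX filesystem; the function point_latest_at and the temporary name .latest.tmp are hypothetical, and the pipeline may update the link differently.

```python
import os
from pathlib import Path


def point_latest_at(runs_dir, run_id):
    """Make <runs_dir>/latest a symlink to <runs_dir>/<run_id>."""
    runs_dir = Path(runs_dir)
    tmp_link = runs_dir / ".latest.tmp"  # hypothetical scratch name

    # Clear a leftover temporary link from an interrupted earlier call.
    if tmp_link.is_symlink() or tmp_link.exists():
        tmp_link.unlink()

    # A relative target keeps the link valid if runs_dir is moved.
    os.symlink(run_id, tmp_link, target_is_directory=True)

    # On POSIX, renaming over an existing symlink replaces it atomically.
    final_link = runs_dir / "latest"
    os.replace(tmp_link, final_link)
    return final_link


if __name__ == "__main__":
    base = Path("data/integration/runs")
    (base / "example_run").mkdir(parents=True, exist_ok=True)
    print(point_latest_at(base, "example_run"), "->", "example_run")
```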

Repository structure

adapters/cli/                # CLI entry points and integration scripts
scripts/                     # Auxiliary scripts (simplify, convert, cleanup)
domain/                      # Data models and logic
data/integration/runs/       # Versioned outputs per run
