A collection of hands-on examples, helper utilities, Jupyter notebooks, Airflow DAGs, and data workflows showcasing how to work with the OKDP Platform. This repository is meant to help you explore OKDP capabilities around compute, object storage, data catalog, SQL engines, Spark, workflow orchestration, and analytics.
Over time, these examples will be extended with lakehouse-oriented features, such as:
- Open table formats (e.g. Apache Iceberg and/or Delta Lake).
- Shared metadata with stronger schema enforcement and evolution.
- Snapshot-based table management (time travel, retention, cleanup).
- Incremental processing and analytics-ready datasets.
The notebooks analyze datasets stored as Parquet on S3-compatible storage (MinIO). The same underlying dataset is queried using Trino and Spark.
An index.ipynb notebook is also provided as an entry point.
The following notebooks query data using Trino:
- Querying data using Trino (Python/SQLAlchemy).
- Querying data using Trino (SQL engine).
These notebooks use Trino external tables defined over Parquet data stored in object storage and registered via a metadata service.
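A connection from Python might look like the sketch below. This is a hypothetical example: the host, port, user, catalog, schema, and table names (`trino`, `jovyan`, `hive`, `examples`, `trips`) are assumptions, not values taken from this repository; adjust them to your OKDP deployment.

```python
# Hypothetical sketch: connection parameters and table names are assumptions.
def row_count_query(table: str) -> str:
    """Build a simple aggregate query over a Trino external table."""
    return f"SELECT count(*) FROM {table}"

def count_rows(host="trino", port=8080, user="jovyan",
               catalog="hive", schema="examples", table="trips"):
    # Requires the `trino` Python client; imported lazily so the query
    # helper above works without a live cluster.
    from trino.dbapi import connect
    conn = connect(host=host, port=port, user=user,
                   catalog=catalog, schema=schema)
    cur = conn.cursor()
    cur.execute(row_count_query(table))
    return cur.fetchone()[0]
```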
A PySpark notebook is included to showcase Spark-native exploratory data analysis on the same dataset.
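A minimal PySpark read over the same Parquet data could be sketched as follows. The bucket and prefix names are illustrative assumptions; the Spark session must already be configured with S3A credentials and the object-store endpoint.

```python
# Hypothetical sketch: bucket/prefix names are assumptions.
def parquet_uri(bucket: str, prefix: str) -> str:
    """Build an s3a:// URI for a Parquet dataset in object storage."""
    return f"s3a://{bucket}/{prefix}"

def explore(bucket="examples", prefix="trips"):
    # Requires pyspark with S3A credentials and endpoint configured
    # for the platform's object store.
    from pyspark.sql import SparkSession
    spark = SparkSession.builder.appName("okdp-eda").getOrCreate()
    df = spark.read.parquet(parquet_uri(bucket, prefix))
    df.printSchema()      # inspect the inferred schema
    return df.count()     # simple exploratory aggregate
```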
Use Apache Superset (SQL Lab) to query Trino and build visualizations/dashboards on top of the same datasets.
The airflow/ directory contains example DAGs orchestrated by Apache Airflow on the OKDP platform. They demonstrate how to:
- Submit Spark jobs to the Spark Operator via `SparkApplication` custom resources from a DAG.
- Build daily ETL pipelines reading from and writing to S3-compatible storage (SeaweedFS).
- Use Airflow `gitSync` to pull DAGs directly from this repository at runtime.
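Submitting a `SparkApplication` from a DAG can be sketched as below. This is an illustrative assumption, not one of the repository's actual DAGs: the `dag_id`, namespace, image, and file paths are made up, and it uses `SparkKubernetesOperator` from the `cncf.kubernetes` Airflow provider.

```python
# Hypothetical sketch: dag_id, namespace, image, and paths are assumptions.
def spark_application(name: str, image: str, main_file: str) -> dict:
    """Build a minimal SparkApplication custom-resource body."""
    return {
        "apiVersion": "sparkoperator.k8s.io/v1beta2",
        "kind": "SparkApplication",
        "metadata": {"name": name},
        "spec": {
            "type": "Python",
            "mode": "cluster",
            "image": image,
            "mainApplicationFile": main_file,
            "driver": {"cores": 1, "memory": "1g"},
            "executor": {"cores": 1, "instances": 2, "memory": "1g"},
        },
    }

def build_dag():
    # Requires apache-airflow and apache-airflow-providers-cncf-kubernetes.
    import yaml
    import pendulum
    from airflow import DAG
    from airflow.providers.cncf.kubernetes.operators.spark_kubernetes import (
        SparkKubernetesOperator,
    )
    with DAG(dag_id="daily_etl", schedule="@daily",
             start_date=pendulum.datetime(2024, 1, 1, tz="UTC"),
             catchup=False) as dag:
        SparkKubernetesOperator(
            task_id="submit_spark_job",
            namespace="spark",
            application_file=yaml.dump(
                spark_application("daily-etl", "spark:3.5", "local:///app/etl.py")
            ),
        )
    return dag
```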
See airflow/README.md for the full list of DAGs and quick-start instructions.
Using okdp-ui, deploy the following components:
- Storage: SeaweedFS
- Data Catalog: Hive Metastore
- Interactive Query: Trino
- Notebooks: Jupyter
- DataViz: Apache Superset
- Workflow orchestration: Apache Airflow
- Applications: okdp-examples
At deployment time, the Helm chart:
- Downloads public datasets.
- Uploads them into object storage.
- Creates the corresponding Trino external tables.
> ℹ️ **NOTE**
> The datasets are not bundled in this repository and are not baked into container images.
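The steps the chart automates can be sketched manually as below. This is a hypothetical illustration: the bucket, schema, and column names are assumptions, not the chart's actual values.

```python
# Hypothetical sketch: bucket, schema, and column names are assumptions.
def external_table_ddl(table: str, bucket: str, prefix: str) -> str:
    """Trino (Hive connector) DDL for an external table over Parquet files."""
    return (
        f"CREATE TABLE IF NOT EXISTS hive.examples.{table} (\n"
        "  id bigint,\n"
        "  value double\n"
        ") WITH (\n"
        "  format = 'PARQUET',\n"
        f"  external_location = 's3a://{bucket}/{prefix}'\n"
        ")"
    )

def load_dataset(url: str, bucket: str, key: str, endpoint: str):
    # Requires boto3 and credentials for the S3-compatible object store.
    import urllib.request
    import boto3
    data = urllib.request.urlopen(url).read()          # download public dataset
    s3 = boto3.client("s3", endpoint_url=endpoint)
    s3.put_object(Bucket=bucket, Key=key, Body=data)   # upload to object storage
```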