Data Processing for Splink

This repository contains Python scripts to clean and merge CSV data in preparation for probabilistic record linkage using Splink. The codebase provides two main functionalities:

Data Cleaning: Standardizes and sanitizes CSV data with tailored rules ensuring high data quality.
Data Merging: Combines multiple pre-processed datasets into a single, coherent file, ensuring consistency and completeness.

Both processes are designed to facilitate the effective ingestion of your data into Splink.

Overview
Prerequisites
Installation
Usage
- Data Cleaning
- Data Merging
Directory Structure
Module Documentation
- data_cleaner.py
  - Constants & Configuration
  - DataCleaner Class Overview
- data_merger.py
  - Purpose and Functionality
  - Merging Logic
Feeding Data to Splink
Error Handling and Logging
Customization
License
Contributing

Overview

This repository prepares your CSV data for advanced record linkage tasks using Splink by:

Cleaning: Converting CSV files to Parquet (if not already available), standardizing text, validating fields (emails, phone numbers, names, addresses, dates), and applying mapping rules.
Merging: Integrating data from multiple sources or consecutive cleaning outputs into a unified dataset, while ensuring that all schema and content standards are preserved.

Prerequisites

Ensure the following are installed:

Python 3.11+ (developed and tested on Python 3.11.5)
Pandas – Data manipulation and analysis
NumPy – Numerical operations support
PyArrow – Fast Parquet processing

Installation

Clone the repository and install dependencies:

git clone https://your-repository-url.git
cd your-repository-folder
pip install -r requirements.txt

The requirements.txt should include (or be similar to):

pandas==2.2.3
numpy==2.1.2
pyarrow==19.0.1

Usage

Data Cleaning

The cleaning process is managed by the DataCleaner class in data_cleaner.py. It transforms an input CSV by:

CSV to Parquet Conversion:
Converts the CSV file into a Parquet format for optimized loading.
Data Loading & Validation:
Loads a Parquet file into a Pandas DataFrame. It also imports additional data (like regex patterns, honorifics, and mapping files) required for the cleaning rules.
Data Cleaning Operations:
- Standardization: Lowercases all text and cleans column names.
- Removal of Unwanted Characters: Filters out quotes and irregular characters.
- Field-Specific Cleaning: Validates and formats emails, phone numbers, names, addresses, and dates.
- Mapping & Replacement: Uses mapping CSV files (for states, cities, and address abbreviations) to guarantee consistency.
- Empty Values & Sorting: Replaces recognized empty indicators with NaN and optionally sorts by predefined keys (e.g., town).
Output Generation:
Saves a cleaned dataset as both a readable CSV and a compressed CSV (.gz), printing out a summary of data types and missing values.

To run the cleaning process, execute:

python data_cleaner.py

Data Merging

The merging functionality is implemented in data_merger.py. It is designed to:

Integrate multiple data sources:
Combine various cleaned datasets into one aggregated file.
Ensure Schema Consistency:
Validate that all files to be merged follow the same schema and data type conventions.
Handle Duplicates & Missing Data:
Incorporate checks to remove duplicates and appropriately merge missing or overlapping information.

To merge your datasets, simply run:

python data_merger.py

Additional parameters (like specifying file paths or merge keys) can be configured via constants or command-line arguments (if implemented in your version).

Directory Structure

A suggested directory layout:

.
├── data
│   ├── combined_data.csv              # Original or pre-merged CSV file(s)
│   ├── regex_patterns.csv             # Regex rules for cleaning
│   ├── honorifics.csv                 # List of honorifics to remove from names
│   ├── state_mapping.csv              # Mapping of state correct/incorrect abbreviations and spelling to full names
│   ├── city_mapping.csv               # Mapping of city correct/incorrect abbreviations and spelling to full names
│   └── address_abbreviations.csv      # Address abbreviation mappings
├── data_cleaner.py                    # Script for cleaning and standardizing data
├── data_merger.py                     # Script for merging multiple cleaned datasets
├── requirements.txt                   # Python dependency list
└── README.md                          # Documentation (this file)

Module Documentation

data_cleaner.py Module

Constants & Configuration

DATA_DIR:
Directory for your data files (default settings target Google Colab but can be modified for local environments).
CSV_DELIMITER, DEFAULT_ENCODING, PARQUET_COMPRESSION:
Define characteristics for CSV input/output.
DATE_FORMATS & EMPTY_INDICATORS:
Lists available date formats for parsing and custom markers for empty data.
UNWANTED_CHARACTERS:
Defines a regex pattern to remove undesirable characters.

DataCleaner Class Overview

Initialization:
The constructor (__init__) performs pre-checks such as verifying the existence of the input file, triggering CSV-to-Parquet conversion, and loading supplementary CSV files (regex, mappings).
Data Conversion:
Methods like convert_csv_to_parquet and load_parquet_file ensure efficient data loading.
Cleaning Operations:
A series of methods standardize text, clean special fields (emails, phone numbers, names, etc.), and enforce mappings from external CSV files.
Saving & Sorting:
Final cleaned data is written to CSV formats. Sorting (by, for example, town) is optionally applied to aid later processing.

data_merger.py Module

Purpose and Functionality

This module is focused on the aggregation of multiple data files. It is useful when:

Data is split across files or processed separately.
Merging tasks require combining similar datasets into one unified file before feeding into Splink.

Merging Logic

Schema Verification:
Before merging, each file’s structure (i.e., column names and data types) is confirmed to be consistent.
Duplicate Management:
The script includes routines to identify and remove duplicate rows.
Data Integration:
Cleaned datasets are concatenated, and keys used for merging (e.g., unique identifiers or common columns) are maintained to support later record linkage processes.
Output Creation:
The resulting merged dataset is saved in a CSV format, ensuring compatibility with subsequent Splink ingestion.

Feeding Data to Splink

Splink requires consistent, high-quality data for probabilistic record linkage. Both the cleaning and merging processes ensure that:

Consistency:
Text standardization (e.g., lowercasing) enhances matching accuracy.
Quality:
Rigorous cleaning routines reduce errors arising from missing or misformatted data.
Preparation:
Tailored field-specific transformations ensure that the final dataset fits Splink's expected input format.

For more details on Splink’s configuration, refer to the official documentation.

Error Handling and Logging

Validation Checks:
Both modules perform pre-execution validations. Missing critical input files or configuration errors trigger exceptions (like FileNotFoundError).
Data Integrity:
The code includes verification of regex patterns and data formats before executing transformations.
Logging:
Informative console messages track progress, report any discrepancies, and summarize conversions and cleaning activities.

Customization

The repository is highly configurable:

Data Paths & Formats:
Edit constants such as DATA_DIR, CSV_DELIMITER, and DEFAULT_ENCODING to match your local setup.
Mapping & Regex Files:
You can extend or modify the CSV files (e.g., regex_patterns.csv, state_mapping.csv) to suit changes in data formats or new rules.
Cleaning & Merging Rules:
Both clean_text and the merging logic can be adjusted to incorporate new field transformations or duplicate handling rules.

License

This project is licensed under the MIT License. See the LICENSE file for details.

Contributing

Contributions and suggestions are welcome! Please open an issue or submit a pull request with improvements or bug fixes.

Conclusion

The codebase is designed as a preliminary exploratory data analyses to prepare your CSV data with high fidelity for Splink’s record linkage. By cleaning and merging your datasets with care, you ensure reliable matching and efficient processing in downstream applications.

For any questions, ideas, or contributions, feel free to reach out via the repository’s issue tracker.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Data Processing for Splink

Table of Contents

Overview

Prerequisites

Installation

Usage

Data Cleaning

Data Merging

Directory Structure

Module Documentation

data_cleaner.py Module

Constants & Configuration

DataCleaner Class Overview

data_merger.py Module

Purpose and Functionality

Merging Logic

Feeding Data to Splink

Error Handling and Logging

Customization

License

Contributing

Conclusion

About

Uh oh!

Releases

Packages

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 5 Commits
data		data
.gitignore.txt		.gitignore.txt
LICENSE		LICENSE
README.md		README.md
data_cleaner.py		data_cleaner.py
data_merger.py		data_merger.py
requirements.txt		requirements.txt

Folders and files

Latest commit

History

Repository files navigation

Data Processing for Splink

Table of Contents

Overview

Prerequisites

Installation

Usage

Data Cleaning

Data Merging

Directory Structure

Module Documentation

data_cleaner.py Module

Constants & Configuration

DataCleaner Class Overview

data_merger.py Module

Purpose and Functionality

Merging Logic

Feeding Data to Splink

Error Handling and Logging

Customization

License

Contributing

Conclusion

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Packages