Skip to content

enginux/DataCleaner

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

5 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Data Processing for Splink

This repository contains Python scripts to clean and merge CSV data in preparation for probabilistic record linkage using Splink. The codebase provides two main functionalities:

  • Data Cleaning: Standardizes and sanitizes CSV data with tailored rules ensuring high data quality.
  • Data Merging: Combines multiple pre-processed datasets into a single, coherent file, ensuring consistency and completeness.

Both processes are designed to facilitate the effective ingestion of your data into Splink.

Table of Contents


Overview

This repository prepares your CSV data for advanced record linkage tasks using Splink by:

  • Cleaning: Converting CSV files to Parquet (if not already available), standardizing text, validating fields (emails, phone numbers, names, addresses, dates), and applying mapping rules.
  • Merging: Integrating data from multiple sources or consecutive cleaning outputs into a unified dataset, while ensuring that all schema and content standards are preserved.

Prerequisites

Ensure the following are installed:

  • Python 3.11+ (developed and tested on Python 3.11.5)
  • Pandas – Data manipulation and analysis
  • NumPy – Numerical operations support
  • PyArrow – Fast Parquet processing

Installation

Clone the repository and install dependencies:

git clone https://your-repository-url.git
cd your-repository-folder
pip install -r requirements.txt

The requirements.txt should include (or be similar to):

pandas==2.2.3
numpy==2.1.2
pyarrow==19.0.1

Usage

Data Cleaning

The cleaning process is managed by the DataCleaner class in data_cleaner.py. It transforms an input CSV by:

  1. CSV to Parquet Conversion:
    Converts the CSV file into a Parquet format for optimized loading.
  2. Data Loading & Validation:
    Loads a Parquet file into a Pandas DataFrame. It also imports additional data (like regex patterns, honorifics, and mapping files) required for the cleaning rules.
  3. Data Cleaning Operations:
    • Standardization: Lowercases all text and cleans column names.
    • Removal of Unwanted Characters: Filters out quotes and irregular characters.
    • Field-Specific Cleaning: Validates and formats emails, phone numbers, names, addresses, and dates.
    • Mapping & Replacement: Uses mapping CSV files (for states, cities, and address abbreviations) to guarantee consistency.
    • Empty Values & Sorting: Replaces recognized empty indicators with NaN and optionally sorts by predefined keys (e.g., town).
  4. Output Generation:
    Saves a cleaned dataset as both a readable CSV and a compressed CSV (.gz), printing out a summary of data types and missing values.

To run the cleaning process, execute:

python data_cleaner.py

Data Merging

The merging functionality is implemented in data_merger.py. It is designed to:

  • Integrate multiple data sources:
    Combine various cleaned datasets into one aggregated file.
  • Ensure Schema Consistency:
    Validate that all files to be merged follow the same schema and data type conventions.
  • Handle Duplicates & Missing Data:
    Incorporate checks to remove duplicates and appropriately merge missing or overlapping information.

To merge your datasets, simply run:

python data_merger.py

Additional parameters (like specifying file paths or merge keys) can be configured via constants or command-line arguments (if implemented in your version).


Directory Structure

A suggested directory layout:

.
├── data
│   ├── combined_data.csv              # Original or pre-merged CSV file(s)
│   ├── regex_patterns.csv             # Regex rules for cleaning
│   ├── honorifics.csv                 # List of honorifics to remove from names
│   ├── state_mapping.csv              # Mapping of state correct/incorrect abbreviations and spelling to full names
│   ├── city_mapping.csv               # Mapping of city correct/incorrect abbreviations and spelling to full names
│   └── address_abbreviations.csv      # Address abbreviation mappings
├── data_cleaner.py                    # Script for cleaning and standardizing data
├── data_merger.py                     # Script for merging multiple cleaned datasets
├── requirements.txt                   # Python dependency list
└── README.md                          # Documentation (this file)

Module Documentation

data_cleaner.py Module

Constants & Configuration

  • DATA_DIR:
    Directory for your data files (default settings target Google Colab but can be modified for local environments).
  • CSV_DELIMITER, DEFAULT_ENCODING, PARQUET_COMPRESSION:
    Define characteristics for CSV input/output.
  • DATE_FORMATS & EMPTY_INDICATORS:
    Lists available date formats for parsing and custom markers for empty data.
  • UNWANTED_CHARACTERS:
    Defines a regex pattern to remove undesirable characters.

DataCleaner Class Overview

  • Initialization:
    The constructor (__init__) performs pre-checks such as verifying the existence of the input file, triggering CSV-to-Parquet conversion, and loading supplementary CSV files (regex, mappings).
  • Data Conversion:
    Methods like convert_csv_to_parquet and load_parquet_file ensure efficient data loading.
  • Cleaning Operations:
    A series of methods standardize text, clean special fields (emails, phone numbers, names, etc.), and enforce mappings from external CSV files.
  • Saving & Sorting:
    Final cleaned data is written to CSV formats. Sorting (by, for example, town) is optionally applied to aid later processing.

data_merger.py Module

Purpose and Functionality

This module is focused on the aggregation of multiple data files. It is useful when:

  • Data is split across files or processed separately.
  • Merging tasks require combining similar datasets into one unified file before feeding into Splink.

Merging Logic

  • Schema Verification:
    Before merging, each file’s structure (i.e., column names and data types) is confirmed to be consistent.
  • Duplicate Management:
    The script includes routines to identify and remove duplicate rows.
  • Data Integration:
    Cleaned datasets are concatenated, and keys used for merging (e.g., unique identifiers or common columns) are maintained to support later record linkage processes.
  • Output Creation:
    The resulting merged dataset is saved in a CSV format, ensuring compatibility with subsequent Splink ingestion.

Feeding Data to Splink

Splink requires consistent, high-quality data for probabilistic record linkage. Both the cleaning and merging processes ensure that:

  • Consistency:
    Text standardization (e.g., lowercasing) enhances matching accuracy.
  • Quality:
    Rigorous cleaning routines reduce errors arising from missing or misformatted data.
  • Preparation:
    Tailored field-specific transformations ensure that the final dataset fits Splink's expected input format.

For more details on Splink’s configuration, refer to the official documentation.


Error Handling and Logging

  • Validation Checks:
    Both modules perform pre-execution validations. Missing critical input files or configuration errors trigger exceptions (like FileNotFoundError).
  • Data Integrity:
    The code includes verification of regex patterns and data formats before executing transformations.
  • Logging:
    Informative console messages track progress, report any discrepancies, and summarize conversions and cleaning activities.

Customization

The repository is highly configurable:

  • Data Paths & Formats:
    Edit constants such as DATA_DIR, CSV_DELIMITER, and DEFAULT_ENCODING to match your local setup.
  • Mapping & Regex Files:
    You can extend or modify the CSV files (e.g., regex_patterns.csv, state_mapping.csv) to suit changes in data formats or new rules.
  • Cleaning & Merging Rules:
    Both clean_text and the merging logic can be adjusted to incorporate new field transformations or duplicate handling rules.

License

This project is licensed under the MIT License. See the LICENSE file for details.


Contributing

Contributions and suggestions are welcome! Please open an issue or submit a pull request with improvements or bug fixes.


Conclusion

The codebase is designed as a preliminary exploratory data analyses to prepare your CSV data with high fidelity for Splink’s record linkage. By cleaning and merging your datasets with care, you ensure reliable matching and efficient processing in downstream applications.

For any questions, ideas, or contributions, feel free to reach out via the repository’s issue tracker.

About

This repository contains a Python codebase dedicated to cleaning and standardizing CSV data, with a specific focus on preparing the dataset ready for Splink.

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages