w-mayer/eviction_data


Eviction-LLC Linkage Pipeline

In Virginia, landlords who file evictions through LLCs can be difficult to identify — corporate registration obscures who is behind repeated filings, and the same owner may operate through dozens of separate entities. This project links ~660,000 residential eviction court records to ~1.3 million LLC registrations from the Virginia State Corporation Commission using fuzzy name matching, then groups matched entities by shared registered agents, addresses, and attorneys to reveal networks of related LLCs. The analysis identifies serial eviction filers, measures how quickly newly formed LLCs begin filing, and surfaces clusters of entities operating through shared legal infrastructure. Built as part of housing justice research at the University of Virginia.

Data Sources

| Source | Records | Key Fields |
|---|---|---|
| Virginia court eviction filings (cases_residential_only.txt, ~357 MB) | ~660K residential cases | plaintiff name, filing date, attorney, case outcome |
| Virginia SCC LLC registrations (LLC.csv, ~574 MB) | ~1.3M entities | entity name, registered agent, incorporation date, address |

Both datasets are too large to include in this repository. The eviction records were provided by the UVA Center for Community Partnerships under a supervised research arrangement; LLC data comes from the Virginia State Corporation Commission, which makes entity registration records publicly available.

Methods

  1. Ingest — Download eviction records and LLC registrations from Box.com (src/pull_from_box.py)
  2. Clean — Normalize the raw LLC CSV to a consistent 31-field schema (src/cleaning.py)
  3. Preprocess — Lowercase names, strip punctuation, and remove organizational suffixes (LLC, Inc, Corp, etc.) to reduce noise before comparison
  4. Match — Use RapidFuzz with the WRatio scorer to fuzzy match each LLC-flagged plaintiff to the best candidate in the LLC registry. Matching runs in parallel via concurrent.futures.ProcessPoolExecutor across all available CPU cores (~50 GB RAM, <1 hour on 16 cores)
  5. Filter — Apply a confidence threshold (default 90, calibrated in notebooks/confidence_testing.ipynb) to retain only high-quality matches
  6. Analyze — Group matched entities by composite keys (registered agent + address + attorney) to identify networks, compute serial filer rates, and visualize patterns (src/grouping.py, notebooks/analysis.ipynb)
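The preprocessing step (step 3) can be sketched roughly as follows. This is a minimal illustration, not the code in the repository: the function name `normalize_name` and the suffix list `ORG_SUFFIXES` are assumptions, since the steps above name only "LLC, Inc, Corp, etc."

```python
import re

# Assumed suffix list; the pipeline's actual list may differ.
ORG_SUFFIXES = {"llc", "inc", "corp", "co", "ltd", "lp", "llp"}


def normalize_name(name: str) -> str:
    """Lowercase, strip punctuation, and drop trailing organizational suffixes."""
    # Delete periods and apostrophes so "L.L.C." collapses to "llc".
    text = re.sub(r"[.']", "", name.lower())
    # Turn any remaining punctuation into spaces.
    text = re.sub(r"[^\w\s]", " ", text)
    tokens = text.split()
    # Remove suffixes only from the end of the name, so a word such as
    # "co" in the middle of a name is left alone.
    while tokens and tokens[-1] in ORG_SUFFIXES:
        tokens.pop()
    return " ".join(tokens)


court_name = normalize_name("APEX GLENWOOD VA LLC")
scc_name = normalize_name("Apex Glenwood VA, L.L.C.")
```

With this normalization, both spellings reduce to the same string before any fuzzy comparison is attempted, which leaves the scorer to deal only with genuine differences in wording.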

Why WRatio?

LLC names appear in inconsistent formats across court filings and SCC records (e.g., "APEX GLENWOOD VA LLC" vs. "Apex Glenwood VA LLC"). RapidFuzz's WRatio scorer computes a weighted combination of several ratio algorithms (full, partial, and token-based), so it tolerates word reordering and partial matches better than a plain ratio or token sort scorer used alone.

Key Findings

  • 546,089 eviction-LLC matches at confidence >= 90 after fuzzy matching ~660K filings against ~1.3M LLC registrations
  • 16,246 LLC networks identified by grouping on shared registered agent, address, and plaintiff attorney
  • Largest network: 5,076 LLCs sharing a single Richmond address and law firm, suggesting centralized legal representation for coordinated filing
  • 20.9% average serial filer rate within identified networks, with 139 networks where over 75% of entities are serial filers (5+ filings)
  • 967 networks filed within 30 days of LLC formation, suggesting entities created specifically for eviction activity
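The network and serial filer figures above rest on two simple computations, sketched here under stated assumptions. The key columns `RA-Name`, `Street1`, and `plaintiff_attorney` come from the grouping command in this README; the `llc_name`, `filing_date`, and `formed` fields, the helper names, and the rule that a network needs at least two LLCs are illustrative assumptions rather than the repository's actual schema or logic.

```python
from collections import defaultdict
from datetime import date

SERIAL_THRESHOLD = 5  # serial filer = 5+ filings
KEY_FIELDS = ("RA-Name", "Street1", "plaintiff_attorney")


def build_networks(matches):
    """Group LLC names by composite key; keep groups with 2+ LLCs (assumed rule)."""
    groups = defaultdict(set)
    for row in matches:
        groups[tuple(row[f] for f in KEY_FIELDS)].add(row["llc_name"])
    return {key: names for key, names in groups.items() if len(names) >= 2}


def serial_filer_rate(matches, members):
    """Share of LLCs in a network with at least SERIAL_THRESHOLD filings."""
    counts = defaultdict(int)
    for row in matches:
        if row["llc_name"] in members:
            counts[row["llc_name"]] += 1
    serial = sum(1 for n in counts.values() if n >= SERIAL_THRESHOLD)
    return serial / len(members)


def filed_within(matches, members, days=30):
    """True if any member filed within `days` of its formation date."""
    return any(
        row["llc_name"] in members
        and 0 <= (row["filing_date"] - row["formed"]).days <= days
        for row in matches
    )


def filing(name, agent, street, attorney, filed, formed):
    """Build one hypothetical matched record."""
    return {"llc_name": name, "RA-Name": agent, "Street1": street,
            "plaintiff_attorney": attorney, "filing_date": filed, "formed": formed}


# Made-up records: LLC A files five times, LLC B once, both via the same
# agent, address, and attorney; LLC C stands alone.
matches = [filing("A", "agent1", "1 Main St", "firm1", date(2020, 3, d), date(2019, 1, 1))
           for d in range(1, 6)]
matches.append(filing("B", "agent1", "1 Main St", "firm1", date(2020, 1, 20), date(2020, 1, 5)))
matches.append(filing("C", "agent2", "9 Oak Ave", "firm2", date(2021, 6, 1), date(2015, 1, 1)))

networks = build_networks(matches)
members = networks[("agent1", "1 Main St", "firm1")]
rate = serial_filer_rate(matches, members)
early = filed_within(matches, members)
```

In this toy data, A and B form one network, one of its two members is a serial filer, and B's first filing falls 15 days after its formation date.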

Selected Visualizations

Network size distribution — Most LLC networks are small (2-3 entities), but a long tail of large networks suggests concentrated ownership structures:

[Figure: Network size distribution]

Serial filing rates — Distribution of serial filer percentages within identified networks:

[Figure: Serial filing histogram]

Repository Structure

├── src/
│   ├── main.py              # CLI orchestrator — runs full pipeline
│   ├── matching.py           # Core fuzzy matching (RapidFuzz, parallel)
│   ├── cleaning.py           # Normalize raw LLC CSV to 31-field schema
│   ├── grouping.py           # Group matched LLCs by composite keys
│   └── pull_from_box.py      # Download source data from Box.com
├── notebooks/
│   ├── analysis.ipynb        # EDA: grouping, network stats, visualizations
│   └── confidence_testing.ipynb  # Threshold calibration for match confidence
├── OUTPUT/                   # Generated visualizations (PNGs tracked in git)
├── DATA/                     # Source data files (not tracked — too large)
├── requirements.txt
└── README.md

Requirements & Setup

Python 3.10+

python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

Running the pipeline

# Full pipeline: pull data, clean CSV, run matching
python src/main.py --pull-data --file-id <BOX_FILE_ID> --run-matching --data-dir DATA

# Run only fuzzy matching (requires DATA/ files already present)
python src/main.py --run-matching --data-dir DATA

# Set the confidence threshold explicitly (90 is the default)
python src/main.py --run-matching --data-dir DATA --confidence 90

# Network/grouping analysis
python src/grouping.py RA-Name Street1 plaintiff_attorney
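For readers who want to see how the flags above fit together, here is a minimal argparse sketch of a CLI with the same options. It is an assumption about how such an orchestrator could be wired, not the contents of src/main.py: the flag names come from the commands above, while the help text, the defaults other than 90, and the rule that --pull-data needs --file-id are guesses.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the flags used in the commands above."""
    parser = argparse.ArgumentParser(description="Eviction-LLC linkage pipeline")
    parser.add_argument("--pull-data", action="store_true",
                        help="download source files from Box before running")
    parser.add_argument("--file-id", help="Box file ID (assumed required with --pull-data)")
    parser.add_argument("--run-matching", action="store_true",
                        help="run the fuzzy matching step")
    parser.add_argument("--data-dir", default="DATA",
                        help="directory holding the source files")
    parser.add_argument("--confidence", type=float, default=90,
                        help="minimum match score to keep")
    return parser


def parse(argv):
    """Parse argv and apply the assumed cross-flag check."""
    parser = build_parser()
    args = parser.parse_args(argv)
    if args.pull_data and not args.file_id:
        parser.error("--pull-data requires --file-id")
    return args


args = parse(["--run-matching", "--data-dir", "DATA"])
```

Because --confidence has a default, omitting it and passing 90 explicitly behave the same way in this sketch.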

License

This project is licensed under the MIT License.
