In Virginia, landlords who file evictions through LLCs can be difficult to identify — corporate registration obscures who is behind repeated filings, and the same owner may operate through dozens of separate entities. This project links ~660,000 residential eviction court records to ~1.3 million LLC registrations from the Virginia State Corporation Commission using fuzzy name matching, then groups matched entities by shared registered agents, addresses, and attorneys to reveal networks of related LLCs. The analysis identifies serial eviction filers, measures how quickly newly formed LLCs begin filing, and surfaces clusters of entities operating through shared legal infrastructure. Built as part of housing justice research at the University of Virginia.
| Source | Records | Key Fields |
|---|---|---|
| Virginia court eviction filings (cases_residential_only.txt, ~357 MB) | ~660K residential cases | plaintiff name, filing date, attorney, case outcome |
| Virginia SCC LLC registrations (LLC.csv, ~574 MB) | ~1.3M entities | entity name, registered agent, incorporation date, address |
Both datasets are too large to include in this repository. The eviction records were provided by the UVA Center for Community Partnerships under a supervised research arrangement; LLC data comes from the Virginia State Corporation Commission, which makes entity registration records publicly available.
- Ingest: Download eviction records and LLC registrations from Box.com (src/pull_from_box.py)
- Clean: Normalize the raw LLC CSV to a consistent 31-field schema (src/cleaning.py)
- Preprocess: Lowercase names, strip punctuation, and remove organizational suffixes (LLC, Inc, Corp, etc.) to reduce noise before comparison
- Match: Use RapidFuzz with the WRatio scorer to fuzzy match each LLC-flagged plaintiff to the best candidate in the LLC registry. Matching runs in parallel via concurrent.futures.ProcessPoolExecutor across all available CPU cores (~50 GB RAM, <1 hour on 16 cores)
- Filter: Apply a confidence threshold (default 90, calibrated in notebooks/confidence_testing.ipynb) to retain only high-quality matches
- Analyze: Group matched entities by composite keys (registered agent + address + attorney) to identify networks, compute serial filer rates, and visualize patterns (src/grouping.py, notebooks/analysis.ipynb)
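
The Preprocess step can be illustrated with a minimal sketch. This is not the code in this repository: the function name `normalize_name` and the suffix list are assumptions chosen for illustration, and the project's actual rules may differ.

```python
import re

# Illustrative suffix list (an assumption, not the project's actual list).
SUFFIXES = {"llc", "inc", "corp", "co", "ltd", "lp"}


def normalize_name(raw: str) -> str:
    """Lowercase, strip punctuation, and drop organizational suffix tokens."""
    text = raw.lower()
    text = text.replace(".", "")              # "l.l.c." becomes "llc"
    text = re.sub(r"[^a-z0-9\s]", " ", text)  # other punctuation becomes a space
    # Suffix tokens are dropped wherever they appear, not only at the end.
    tokens = [tok for tok in text.split() if tok not in SUFFIXES]
    return " ".join(tokens)


print(normalize_name("APEX GLENWOOD VA LLC"))
print(normalize_name("Apex Glenwood VA, L.L.C."))
```

Both calls reduce to the same string, so the later fuzzy comparison scores the distinctive part of the name instead of shared boilerplate such as "LLC".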
LLC names appear in inconsistent formats across court filings and SCC records (e.g., "APEX GLENWOOD VA LLC" vs. "Apex Glenwood VA LLC"). The WRatio scorer handles word reordering and partial matches better than simple ratio or token sort alternatives.
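
A rough sketch of the Match and Filter steps for a single plaintiff name follows. It assumes names are already normalized, uses made-up registry entries, and runs serially (the parallel fan-out is not shown). The helper `best_match` is hypothetical. If RapidFuzz is not installed, the sketch falls back to the standard library's `difflib`, which is a stand-in only: its scores are not comparable to WRatio.

```python
try:
    from rapidfuzz import fuzz, process

    def best_match(name, candidates, threshold=90):
        """Return (candidate, score) for the top WRatio match, or None below threshold."""
        hit = process.extractOne(
            name, candidates, scorer=fuzz.WRatio, score_cutoff=threshold
        )
        return None if hit is None else (hit[0], hit[1])

except ImportError:
    # Stand-in so the sketch still runs without RapidFuzz; not equivalent to WRatio.
    from difflib import SequenceMatcher

    def best_match(name, candidates, threshold=90):
        scored = [(c, 100 * SequenceMatcher(None, name, c).ratio()) for c in candidates]
        top = max(scored, key=lambda pair: pair[1], default=None)
        return top if top is not None and top[1] >= threshold else None


# Made-up, already-normalized registry names for illustration.
registry = ["apex glenwood va", "blue ridge holdings"]

print(best_match("apex glenwood va", registry))  # identical name, kept
print(best_match("zzzz qqqq", registry))         # nothing above threshold, dropped
```

Returning `None` for anything under the threshold keeps the filter in one place, so changing the cutoff only requires changing the `threshold` argument.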
- 546,089 eviction-LLC matches at confidence >= 90 after fuzzy matching ~660K filings against ~1.3M LLC registrations
- 16,246 LLC networks identified by grouping on shared registered agent, address, and plaintiff attorney
- Largest network: 5,076 LLCs sharing a single Richmond address and law firm, suggesting centralized legal representation for coordinated filing
- 20.9% average serial filer rate within identified networks, with 139 networks where over 75% of entities are serial filers (5+ filings)
- 967 networks filed within 30 days of LLC formation, suggesting entities created specifically for eviction activity
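
The network and serial filer figures above rest on two simple computations, sketched here on made-up rows. The key columns `RA-Name`, `Street1`, and `plaintiff_attorney` are the ones passed to src/grouping.py in the usage section; the `entity` and `filings` fields, the helper names, and the sample data are assumptions for illustration, not the project's implementation.

```python
from collections import defaultdict
from datetime import date

KEY_FIELDS = ("RA-Name", "Street1", "plaintiff_attorney")
SERIAL_MIN_FILINGS = 5  # "serial filer" means 5+ filings

# Made-up matched LLCs, one row per entity.
matched = [
    {"entity": "a one", "RA-Name": "agent x", "Street1": "1 main st",
     "plaintiff_attorney": "firm p", "filings": 7},
    {"entity": "a two", "RA-Name": "agent x", "Street1": "1 main st",
     "plaintiff_attorney": "firm p", "filings": 2},
    {"entity": "b one", "RA-Name": "agent y", "Street1": "9 oak ave",
     "plaintiff_attorney": "firm q", "filings": 5},
]


def build_networks(rows, key_fields=KEY_FIELDS):
    """Group rows that share the same composite key."""
    networks = defaultdict(list)
    for row in rows:
        networks[tuple(row[f] for f in key_fields)].append(row)
    return dict(networks)


def serial_filer_rate(members, min_filings=SERIAL_MIN_FILINGS):
    """Share of entities in a network with at least min_filings filings."""
    serial = sum(1 for m in members if m["filings"] >= min_filings)
    return serial / len(members)


def days_to_first_filing(formed: date, first_filed: date) -> int:
    """Days between LLC formation and its first eviction filing."""
    return (first_filed - formed).days


networks = build_networks(matched)
rates = {key: serial_filer_rate(members) for key, members in networks.items()}
print(rates)
print(days_to_first_filing(date(2020, 1, 1), date(2020, 1, 21)) <= 30)
```

Using the full tuple of key fields as the dictionary key means two LLCs land in the same network only when all three attributes agree, which is stricter than grouping on any one of them.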
Network size distribution — Most LLC networks are small (2-3 entities), but a long tail of large networks suggests concentrated ownership structures:
Serial filing rates — Distribution of serial filer percentages within identified networks:
├── src/
│ ├── main.py # CLI orchestrator — runs full pipeline
│ ├── matching.py # Core fuzzy matching (RapidFuzz, parallel)
│ ├── cleaning.py # Normalize raw LLC CSV to 31-field schema
│ ├── grouping.py # Group matched LLCs by composite keys
│ └── pull_from_box.py # Download source data from Box.com
├── notebooks/
│ ├── analysis.ipynb # EDA: grouping, network stats, visualizations
│ └── confidence_testing.ipynb # Threshold calibration for match confidence
├── OUTPUT/ # Generated visualizations (PNGs tracked in git)
├── DATA/ # Source data files (not tracked — too large)
├── requirements.txt
└── README.md
Python 3.10+
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

# Full pipeline: pull data, clean CSV, run matching
python src/main.py --pull-data --file-id <BOX_FILE_ID> --run-matching --data-dir DATA
# Run only fuzzy matching (requires DATA/ files already present)
python src/main.py --run-matching --data-dir DATA
# Customize confidence threshold
python src/main.py --run-matching --data-dir DATA --confidence 90
# Network/grouping analysis
python src/grouping.py RA-Name Street1 plaintiff_attorney

This project is licensed under the MIT License.

