This project demonstrates comprehensive data cleaning and exploratory data analysis techniques using SQL on a world layoffs dataset. The project includes systematic data cleaning operations to prepare raw data for analysis, followed by in-depth EDA to uncover patterns, trends, and insights about global layoffs from 2020 to 2023.
The dataset is stored in worldlayoffs.db, a SQLite database containing global company layoffs data with fields like company name, location, industry, layoff counts, dates, and funding information.
- SQLite3 installed on your system
- Basic knowledge of SQL and data cleaning concepts
Clone the repository:
git clone https://github.com/mrithip/world-layoffs-data-cleaning-sql.git
cd world-layoffs-data-cleaning-sql
Ensure SQLite3 is installed:
sqlite3 --version
Run the data cleaning script:
sqlite3 worldlayoffs.db < datacleaning.sql
Run the EDA script to explore cleaned data patterns:
sqlite3 worldlayoffs.db < EDA.sql
View the comprehensive insights and visualizations in insights.md.
The datacleaning.sql script performs the following operations:
- Date Standardization: Converts layoff dates from MM/DD/YYYY format to ISO YYYY-MM-DD format using SQLite date functions
- Staging Table Creation: Creates backup tables to preserve the original data during the cleaning process
- Duplicate Removal: Identifies and removes duplicate records using ROW_NUMBER() window function and CTE
- Text Standardization: Trims whitespace from company and industry names for consistency
- Industry Normalization: Consolidates similar industry names (e.g., all Crypto variations to "Crypto")
- Country Standardization: Removes trailing punctuation from country names (e.g., "United States." to "United States")
- Missing Data Handling: Populates null industry values using self-joins to match companies with existing industry data
- Data Quality Filtering: Removes rows with no useful information (null layoff counts and percentages)
- Final Cleanup: Removes temporary columns and presents the clean dataset
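The core cleaning steps can be sketched in miniature with Python's built-in `sqlite3` module. This is an illustrative sketch against a toy table, not the actual datacleaning.sql: column names mirror the schema listed later in this README, but the real script's table and column names may differ.

```python
import sqlite3

# In-memory toy table standing in for the raw layoffs data (hypothetical names).
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE layoffs_staging (
    company TEXT, industry TEXT, total_laid_off INTEGER, layoff_date TEXT
);
INSERT INTO layoffs_staging VALUES
    ('Acme ', 'Crypto Currency', 100, '03/15/2022'),
    ('Acme ', 'Crypto Currency', 100, '03/15/2022'),  -- exact duplicate
    ('Beta',  'Retail',          50,  '01/02/2023');
""")

con.executescript("""
-- Duplicate removal: ROW_NUMBER() over the identifying columns inside a CTE,
-- then delete every row numbered 2 or higher.
DELETE FROM layoffs_staging
WHERE rowid IN (
    WITH numbered AS (
        SELECT rowid AS rid,
               ROW_NUMBER() OVER (
                   PARTITION BY company, industry, total_laid_off, layoff_date
               ) AS rn
        FROM layoffs_staging
    )
    SELECT rid FROM numbered WHERE rn > 1
);
-- Date standardization: MM/DD/YYYY -> ISO YYYY-MM-DD via substr().
UPDATE layoffs_staging
SET layoff_date = substr(layoff_date, 7, 4) || '-' ||
                  substr(layoff_date, 1, 2) || '-' ||
                  substr(layoff_date, 4, 2);
-- Text standardization and industry normalization.
UPDATE layoffs_staging SET company = TRIM(company);
UPDATE layoffs_staging SET industry = 'Crypto' WHERE industry LIKE 'Crypto%';
""")

rows = con.execute(
    "SELECT company, industry, layoff_date FROM layoffs_staging ORDER BY company"
).fetchall()
print(rows)  # [('Acme', 'Crypto', '2022-03-15'), ('Beta', 'Retail', '2023-01-02')]
```

Window functions such as ROW_NUMBER() require SQLite 3.25 or later.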
The EDA.sql script analyzes the cleaned dataset to uncover patterns and insights about global layoffs. The analysis includes:
- Maximum layoffs in a single event (12,000 employees)
- Companies with 100% layoffs (complete shutdowns)
- Total layoffs by major companies (Amazon: 18,150, Google: 12,000, Meta: 11,000)
- Layoffs timeline: March 2020 - March 2023
- Yearly trends: 2022 peak (160,661 layoffs), followed by 2023 (125,677)
- Monthly patterns: January highest (92,037), November second (55,758)
- Rolling totals using window functions
- Top companies by total layoffs and funding raised
- Industry impact: Consumer (45,182), Retail (43,613), Other (36,289)
- Company funding vs layoffs paradox analysis
- Country breakdown: US (256,559), India (35,993), Netherlands (17,220)
- Stage-wise analysis: Post-IPO (204,132), Series C/D impact
- Year-over-year top 5 companies rankings
- Rolling total calculations using CTEs and window functions
- Dense ranking for comparative analysis across years
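The rolling-total and ranking techniques above can be sketched on a toy table. This is a hedged illustration of the general pattern (CTE + window function), not the exact queries in EDA.sql; the table and column names are assumptions based on the schema below.

```python
import sqlite3

# Toy cleaned table with a handful of rows (hypothetical data).
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE layoffs_staging2 (company TEXT, layoff_date TEXT, total_laid_off INTEGER);
INSERT INTO layoffs_staging2 VALUES
    ('A', '2022-01-10', 100),
    ('B', '2022-02-05', 200),
    ('A', '2022-02-20', 50),
    ('B', '2023-01-15', 300);
""")

# Rolling monthly total: substr() pulls the YYYY-MM prefix, a CTE aggregates
# per month, and SUM() OVER an ordered window accumulates the running total.
rolling = con.execute("""
WITH monthly AS (
    SELECT substr(layoff_date, 1, 7) AS month,
           SUM(total_laid_off) AS laid_off
    FROM layoffs_staging2
    GROUP BY month
)
SELECT month, laid_off,
       SUM(laid_off) OVER (ORDER BY month) AS rolling_total
FROM monthly
ORDER BY month
""").fetchall()
print(rolling)  # [('2022-01', 100, 100), ('2022-02', 250, 350), ('2023-01', 300, 650)]

# Year-over-year company ranking with DENSE_RANK() partitioned by year.
ranked = con.execute("""
WITH by_year AS (
    SELECT company, substr(layoff_date, 1, 4) AS year,
           SUM(total_laid_off) AS laid_off
    FROM layoffs_staging2
    GROUP BY company, year
)
SELECT company, year, laid_off,
       DENSE_RANK() OVER (PARTITION BY year ORDER BY laid_off DESC) AS rnk
FROM by_year
ORDER BY year, rnk
""").fetchall()
print(ranked)  # [('B', '2022', 200, 1), ('A', '2022', 150, 2), ('B', '2023', 300, 1)]
```

EDA.sql additionally filters such rankings to the top 5 per year, which the same pattern supports by wrapping the ranked query in another CTE and selecting rows where the rank is at most 5.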
After cleaning, the main table layoffs_staging2 contains standardized columns including:
- company (trimmed and standardized)
- location
- industry (normalized and populated)
- total_laid_off
- percentage_laid_off
- layoff_date (ISO format)
- stage
- country (standardized)
- funds_raised_millions
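The column list corresponds to DDL along these lines. The types here are plausible guesses, not taken from worldlayoffs.db (a CSV import may store every column as TEXT):

```python
import sqlite3

con = sqlite3.connect(":memory:")
# Hypothetical schema matching the column list above; actual types may differ.
con.execute("""
CREATE TABLE layoffs_staging2 (
    company TEXT,
    location TEXT,
    industry TEXT,
    total_laid_off INTEGER,
    percentage_laid_off REAL,
    layoff_date TEXT,   -- ISO YYYY-MM-DD
    stage TEXT,
    country TEXT,
    funds_raised_millions REAL
)
""")
cols = [row[1] for row in con.execute("PRAGMA table_info(layoffs_staging2)")]
print(cols)
```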
- Date parsing and conversion
- Common Table Expressions (CTEs)
- Window functions (ROW_NUMBER)
- Self-joins for data population
- String manipulation functions
- NULL handling strategies
- Duplicate detection and removal
- Aggregate functions (SUM, MAX, COUNT)
- GROUP BY and ORDER BY clauses
- Substring extraction for date/time analysis
- Rolling calculations with window functions
- Dense ranking for comparative analysis
- Multi-level grouping and filtering
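Of these, the self-join for data population is the least common pattern, so here is a minimal sketch of the idea: a row with a NULL industry inherits the industry recorded on another row for the same company. The table and column names are assumptions mirroring the schema above, not the exact statement from datacleaning.sql.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE layoffs_staging2 (company TEXT, industry TEXT);
INSERT INTO layoffs_staging2 VALUES
    ('Acme', 'Retail'),
    ('Acme', NULL),   -- will be filled from the other Acme row
    ('Beta', NULL);   -- stays NULL: no other row to borrow from
""")

# Correlated-subquery form of the self-join fill: for each NULL-industry row,
# look up a non-NULL industry from another row with the same company.
con.execute("""
UPDATE layoffs_staging2 AS t1
SET industry = (
    SELECT t2.industry
    FROM layoffs_staging2 AS t2
    WHERE t2.company = t1.company AND t2.industry IS NOT NULL
)
WHERE t1.industry IS NULL
""")

rows = con.execute(
    "SELECT company, industry FROM layoffs_staging2 ORDER BY company"
).fetchall()
print(rows)  # [('Acme', 'Retail'), ('Acme', 'Retail'), ('Beta', None)]
```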
Feel free to submit issues and enhancement requests.