
World Layoffs Data Analysis: SQL Cleaning & EDA

This project demonstrates comprehensive data cleaning and exploratory data analysis techniques using SQL on a world layoffs dataset. It includes systematic data cleaning operations to prepare the raw data for analysis, followed by in-depth EDA to uncover patterns, trends, and insights into global layoffs from 2020 to 2023.

Dataset

The dataset is stored in worldlayoffs.db, a SQLite database containing global company layoffs data with fields like company name, location, industry, layoff counts, dates, and funding information.

Prerequisites

  • SQLite3 installed on your system
  • Basic knowledge of SQL and data cleaning concepts

Installation

  1. Clone the repository:

    git clone https://github.com/mrithip/world-layoffs-data-cleaning-sql.git
    cd world-layoffs-data-cleaning-sql
  2. Ensure SQLite3 is installed:

    sqlite3 --version

Usage

Data Cleaning

Run the data cleaning script:

    sqlite3 worldlayoffs.db < datacleaning.sql

Exploratory Data Analysis

Run the EDA script to explore cleaned data patterns:

    sqlite3 worldlayoffs.db < EDA.sql

View the comprehensive insights and visualizations in insights.md.

Data Cleaning Steps

The datacleaning.sql script performs the following comprehensive operations:

  1. Date Standardization: Converts layoff dates from MM/DD/YYYY format to ISO YYYY-MM-DD format using SQLite date functions
  2. Staging Table Creation: Creates backup tables to preserve the original data during the cleaning process
  3. Duplicate Removal: Identifies and removes duplicate records using the ROW_NUMBER() window function within a CTE
  4. Text Standardization: Trims whitespace from company and industry names for consistency
  5. Industry Normalization: Consolidates similar industry names (e.g., all Crypto variations to "Crypto")
  6. Country Standardization: Removes trailing punctuation from country names (e.g., "United States." to "United States")
  7. Missing Data Handling: Populates null industry values using self-joins to match companies with existing industry data
  8. Data Quality Filtering: Removes rows with no useful information (null layoff counts and percentages)
  9. Final Cleanup: Removes temporary columns and presents the clean dataset
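The duplicate-removal step (3) can be sketched in miniature. The snippet below is a hypothetical illustration, not the project's actual script: it uses Python's built-in sqlite3 module, an in-memory database, and made-up rows, but applies the same ROW_NUMBER()-over-a-partition technique.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE layoffs_staging (
    company TEXT, location TEXT, industry TEXT,
    total_laid_off INTEGER, layoff_date TEXT
);
INSERT INTO layoffs_staging VALUES
    ('Acme', 'SF', 'Retail', 100, '2022-01-15'),
    ('Acme', 'SF', 'Retail', 100, '2022-01-15'),  -- exact duplicate
    ('Beta', 'NY', 'Crypto', 50, '2022-03-01');
""")

# Tag each row with a row number within its group of identical values;
# any row numbered > 1 is a duplicate of an earlier row and gets deleted.
conn.executescript("""
DELETE FROM layoffs_staging
WHERE rowid IN (
    SELECT rowid FROM (
        SELECT rowid,
               ROW_NUMBER() OVER (
                   PARTITION BY company, location, industry,
                                total_laid_off, layoff_date
               ) AS rn
        FROM layoffs_staging
    )
    WHERE rn > 1
);
""")

remaining = conn.execute("SELECT COUNT(*) FROM layoffs_staging").fetchone()[0]
print(remaining)
```

SQLite's implicit rowid makes it easy to target the duplicate rows directly; window functions require SQLite 3.25 or newer.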

Exploratory Data Analysis

The EDA.sql script performs comprehensive analysis on the cleaned dataset to uncover patterns and insights about global layoffs. The analysis includes:

Statistical Overview

  • Maximum layoffs in a single event (12,000 employees)
  • Companies with 100% layoffs (complete shutdowns)
  • Total layoffs by major companies (Amazon: 18,150, Google: 12,000, Meta: 11,000)

Temporal Analysis

  • Layoffs timeline: March 2020 - March 2023
  • Yearly trends: 2022 peak (160,661 layoffs), followed by 2023 (125,677)
  • Monthly patterns: January highest (92,037), November second (55,758)
  • Rolling totals using window functions
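The rolling-total pattern mentioned above can be sketched as follows. This is a minimal illustration with invented rows, assuming the same two-step shape the script uses: aggregate per month, then accumulate with SUM() OVER an ordered window.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE layoffs_staging2 (layoff_date TEXT, total_laid_off INTEGER);
INSERT INTO layoffs_staging2 VALUES
    ('2022-01-10', 100), ('2022-01-20', 50),
    ('2022-02-05', 200), ('2022-03-01', 25);
""")

# Extract the YYYY-MM prefix, sum per month, then compute a running total.
rows = conn.execute("""
    WITH monthly AS (
        SELECT substr(layoff_date, 1, 7) AS month,
               SUM(total_laid_off) AS laid_off
        FROM layoffs_staging2
        GROUP BY month
    )
    SELECT month, laid_off,
           SUM(laid_off) OVER (ORDER BY month) AS rolling_total
    FROM monthly
""").fetchall()

for month, laid_off, rolling in rows:
    print(month, laid_off, rolling)
```

Because the dates are already in ISO format after cleaning, substr() and an ordered window are enough; no date parsing is needed at analysis time.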

Company & Industry Insights

  • Top companies by total layoffs and funding raised
  • Industry impact: Consumer (45,182), Retail (43,613), Other (36,289)
  • Analysis of the funding-vs-layoffs paradox: heavily funded companies still carrying out large layoffs

Geographic Distribution

  • Country breakdown: US (256,559), India (35,993), Netherlands (17,220)
  • Stage-wise analysis: Post-IPO (204,132), Series C/D impact

Advanced Analytics

  • Year-over-year top 5 companies rankings
  • Rolling total calculations using CTEs and window functions
  • Dense ranking for comparative analysis across years
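The year-over-year top-5 ranking combines both techniques above. The sketch below is a hypothetical miniature (toy rows, and `company_year`/`ranked` are illustrative CTE names): DENSE_RANK() partitioned by year orders companies within each year, and an outer filter keeps the top 5.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE layoffs_staging2 (company TEXT, layoff_date TEXT, total_laid_off INTEGER);
INSERT INTO layoffs_staging2 VALUES
    ('Acme', '2022-01-10', 300), ('Beta', '2022-06-01', 500),
    ('Acme', '2023-02-15', 400), ('Gamma', '2023-03-10', 100);
""")

# Sum layoffs per company per year, then rank within each year;
# DENSE_RANK() leaves no gaps when totals tie.
rows = conn.execute("""
    WITH company_year AS (
        SELECT company,
               substr(layoff_date, 1, 4) AS year,
               SUM(total_laid_off) AS laid_off
        FROM layoffs_staging2
        GROUP BY company, year
    ),
    ranked AS (
        SELECT company, year, laid_off,
               DENSE_RANK() OVER (
                   PARTITION BY year ORDER BY laid_off DESC
               ) AS rnk
        FROM company_year
    )
    SELECT company, year, laid_off, rnk
    FROM ranked
    WHERE rnk <= 5
    ORDER BY year, rnk
""").fetchall()

print(rows)
```

The filter has to live in an outer query (or a second CTE) because window functions cannot appear directly in a WHERE clause.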

Database Schema

After cleaning, the main table layoffs_staging2 contains standardized columns including:

  • company (trimmed and standardized)
  • location
  • industry (normalized and populated)
  • total_laid_off
  • percentage_laid_off
  • layoff_date (ISO format)
  • stage
  • country (standardized)
  • funds_raised_millions
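A DDL sketch matching the columns listed above might look like the following. The column types are reasonable guesses, not taken from the actual script (which builds the table during cleaning), so treat this as an assumption.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema mirroring the column list above; types are assumed.
conn.execute("""
CREATE TABLE layoffs_staging2 (
    company TEXT,
    location TEXT,
    industry TEXT,
    total_laid_off INTEGER,
    percentage_laid_off REAL,
    layoff_date TEXT,      -- ISO YYYY-MM-DD after cleaning
    stage TEXT,
    country TEXT,
    funds_raised_millions REAL
)
""")

# PRAGMA table_info returns one row per column; index 1 is the column name.
columns = [row[1] for row in conn.execute("PRAGMA table_info(layoffs_staging2)")]
print(columns)
```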

Key SQL Techniques Demonstrated

Data Cleaning Techniques

  • Date parsing and conversion
  • Common Table Expressions (CTEs)
  • Window functions (ROW_NUMBER)
  • Self-joins for data population
  • String manipulation functions
  • NULL handling strategies
  • Duplicate detection and removal
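The self-join technique for populating missing values can be sketched as below. This is an illustrative miniature with invented rows, expressing the self-join as a correlated scalar subquery; the project's script may phrase it differently.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE layoffs_staging2 (company TEXT, industry TEXT);
INSERT INTO layoffs_staging2 VALUES
    ('Acme', 'Retail'),
    ('Acme', NULL),        -- missing, recoverable from the other Acme row
    ('Beta', NULL);        -- no other row to borrow from; stays NULL
""")

# Borrow the industry from another row for the same company.
conn.execute("""
UPDATE layoffs_staging2
SET industry = (
    SELECT t2.industry
    FROM layoffs_staging2 t2
    WHERE t2.company = layoffs_staging2.company
      AND t2.industry IS NOT NULL
)
WHERE industry IS NULL;
""")

rows = conn.execute(
    "SELECT company, industry FROM layoffs_staging2 ORDER BY company, industry"
).fetchall()
print(rows)
```

Rows with no non-NULL sibling (Beta here) remain NULL, which is why the cleaning pipeline follows this step with data-quality filtering.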

Exploratory Data Analysis Techniques

  • Aggregate functions (SUM, MAX, COUNT)
  • GROUP BY and ORDER BY clauses
  • Substring extraction for date/time analysis
  • Rolling calculations with window functions
  • Dense ranking for comparative analysis
  • Multi-level grouping and filtering

Contributing

Feel free to submit issues and enhancement requests.