MLB SQL Analytics Casebook is a polished SQL portfolio project built around historical Major League Baseball data from the Lahman Baseball Database.
The project analyzes player schools, team salary spending, player career timelines, and player attributes using PostgreSQL. The goal is not only to write SQL queries, but to show how SQL can be used to answer business-style analytical questions, structure a reproducible project, and communicate results clearly.
This repository presents SQL as a practical analytics tool for:
- historical trend analysis
- sports analytics
- payroll and cumulative spending analysis
- career lifecycle analysis
- player attribute comparison
- data storytelling through structured case notes and results
Real analytics work is rarely just about writing one query.
The challenge is usually understanding the question, identifying the right tables, building the correct logic, validating the output, and explaining what the result actually means.
This project demonstrates that process through a sports analytics casebook.
Each analysis shows how I:
- translate business-style questions into SQL queries
- inspect and use relational database tables
- aggregate data across years, teams, schools, and players
- use CTEs and window functions to build multi-step analysis
- calculate running totals and decade-over-decade changes
- summarize findings in clean result files
- document the logic behind each analysis clearly
This project uses selected Lahman Baseball Database tables needed for the analysis.
| Table | Purpose |
|---|---|
players |
Player biographical details, birth dates, debut/final game dates, height, weight, batting side, and throwing hand |
salaries |
Player salary records by year and team |
schools |
Player-school-year relationships |
school_details |
School names and location metadata |
| Case | Analysis Area | Main Question | Main SQL Concepts |
|---|---|---|---|
| 01 | School Pipeline Analysis | What schools do MLB players attend, and how did school pipelines change over time? | Recursive CTEs, aggregation, ranking, joins |
| 02 | Team Salary Spending Analysis | How much do teams spend on player salaries, and when did cumulative spending pass major milestones? | CTEs, window functions, cumulative sums, threshold analysis |
| 03 | Player Career Analysis | What does each player's career look like from debut to final game? | Date logic, age calculation, joins, career span analysis |
| 04 | Player Attribute Comparison | How do player birthdays, batting sides, height, and weight compare across players and decades? | Self joins, conditional aggregation, percentages, LAG() |
mlb-sql-analytics-casebook/
│
├── .gitignore
├── LICENSE
├── README.md
├── DATA_LICENSE.md
│
├── assets
│ ├── repo-banner.png
│ └── schema.png
│
├── database
│ └── create_lahman_mlb_database.sql
│
├── queries
│ ├── 01_school_pipeline_analysis.sql
│ ├── 02_team_salary_spending_analysis.sql
│ ├── 03_player_career_analysis.sql
│ └── 04_player_attribute_comparison.sql
│
├── case-notes
│ ├── 01_school_pipeline_analysis.md
│ ├── 02_team_salary_spending_analysis.md
│ ├── 03_player_career_analysis.md
│ └── 04_player_attribute_comparison.md
│
└── results
├── 01_school_pipeline_results.md
├── 02_team_salary_results.md
├── 03_player_career_results.md
└── 04_player_attribute_results.md
database/
Contains the PostgreSQL setup script used to create and populate the required project tables.
queries/
Contains the clean SQL analysis files. Each file answers one major analysis area and is organized by question.
case-notes/
Contains the reasoning behind each analysis, including the business framing, SQL approach, concepts used, and interpretation guide.
results/
Contains the documented outputs and insights from each query file.
assets/
Contains the visual assets used in this README, including the repository banner and database schema diagram.
Each case follows a structured analytics process:
Read the business question
↓
Identify the relevant tables
↓
Build the SQL logic using CTEs, joins, and aggregations
↓
Run the query in PostgreSQL
↓
Review and validate the output
↓
Summarize the result in plain English
↓
Document the insight in the results file
This mirrors a real analytics workflow: start with a question, understand the data, build the query, validate the result, and communicate the finding clearly.
The first case analyzes which schools produced MLB players and how those school pipelines changed over time.
This case answers:
- In each decade, how many schools produced MLB players?
- Which schools produced the most players overall?
- Which schools were the top producers in each decade?
Key findings include:
- MLB school-linked player production increased sharply in the modern era.
- The strongest decades in the output were 1984-1993, 1994-2003, and 1974-1983.
- The top overall schools included the University of Texas at Austin, University of Southern California, Arizona State University, Stanford University, and University of Michigan.
- The top schools changed across eras, showing that MLB talent pipelines shifted over time.
Main SQL concepts:
- recursive CTEs
- decade bucketing
COUNT(DISTINCT ...)- ranking with window functions
- joins to lookup tables
Files:
queries/01_school_pipeline_analysis.sql
case-notes/01_school_pipeline_analysis.md
results/01_school_pipeline_results.md
The second case analyzes MLB payroll spending by team.
This case answers:
- Which teams appear among the highest salary spenders?
- How does each team's cumulative payroll change over time?
- When did each team first surpass $1 billion in cumulative spending?
Key findings include:
- The New York Yankees appear repeatedly among the highest payroll team-years.
- The Los Angeles Dodgers, Philadelphia Phillies, and Boston Red Sox also appear strongly in the highest-spending group.
- The Yankees were the first team in the output to surpass $1 billion in cumulative recorded salary spending.
- Cumulative payroll gives a clearer view of long-term team investment than annual spending alone.
Main SQL concepts:
- grouped aggregation
- average annual spending
- window functions
- cumulative sums
- threshold filtering
- milestone analysis
Files:
queries/02_team_salary_spending_analysis.sql
case-notes/02_team_salary_spending_analysis.md
results/02_team_salary_results.md
The third case analyzes player career timelines.
This case answers:
- How old was each player at their first game?
- How old was each player at their last game?
- How long was each player's career?
- What team did each player start and end with?
- How many players started and ended on the same team while playing for more than a decade?
Key findings include:
- The longest careers in the output span more than 25 years.
- The top career length in the sample is 35 years.
- Some players started and ended with the same team, while many others moved teams across their careers.
- 19 players started and ended on the same team while also playing for more than a decade.
Main SQL concepts:
- date construction with
MAKE_DATE - age calculation with
AGE - extracting years from dates
- joining player records to salary/team records
- career span calculation
- conditional aggregation
Files:
queries/03_player_career_analysis.sql
case-notes/03_player_career_analysis.md
results/03_player_career_results.md
The fourth case compares player attributes and tracks physical trends over time.
This case answers:
- Which players have the same birthday?
- For each team, what percentage of players bat right, left, and both?
- How have average height and weight at debut changed by decade?
Key findings include:
- A self join can identify player pairs with the same birthdate.
- Right-handed batters make up the largest share for every team shown in the sample.
- Average player weight increased substantially over time.
- Average player height also increased over time.
- The largest weight jump in the output occurred between 1991-2000 and 2001-2010.
Main SQL concepts:
- self joins
- pairwise comparison
- conditional aggregation
- percentage calculations
- recursive CTEs
- decade grouping
LAG()for decade-over-decade differences
Files:
queries/04_player_attribute_comparison.sql
case-notes/04_player_attribute_comparison.md
results/04_player_attribute_results.md
This project is designed to be run locally in PostgreSQL using pgAdmin.
Open pgAdmin and create a new PostgreSQL database manually.
Example database name:
mlb_sql_analytics
Open the following file:
database/create_lahman_mlb_database.sql
Run the full script inside your new PostgreSQL database.
This script creates and populates the tables required for the analysis.
Run the SQL files in the queries/ folder:
queries/01_school_pipeline_analysis.sql
queries/02_team_salary_spending_analysis.sql
queries/03_player_career_analysis.sql
queries/04_player_attribute_comparison.sql
For each analysis, review the matching notes and results:
case-notes/
results/
The case notes explain the analytical approach. The results files summarize the outputs and key insights.
This repository demonstrates practical SQL skills including:
SELECTqueries- filtering with
WHERE - joins across relational tables
- grouped aggregation with
GROUP BY - distinct counting with
COUNT(DISTINCT ...) - conditional aggregation with
CASE WHEN - recursive CTEs
- multi-step CTE workflows
- window functions
ROW_NUMBER()NTILE()LAG()- cumulative sums with
SUM() OVER (...) - date construction with
MAKE_DATE - age calculations with
AGE - percentage calculations
- threshold-based filtering
- ordered analytical reporting
Beyond SQL syntax, this project highlights analytical thinking skills including:
- translating business questions into SQL logic
- choosing the correct level of aggregation
- comparing results across decades
- identifying top performers and outliers
- building cumulative metrics
- interpreting trends over time
- validating outputs with sample results
- communicating technical findings clearly
- documenting analysis in a recruiter-friendly format
This project also demonstrates technical documentation through:
- a structured README
- organized SQL files
- separate case notes
- separate results summaries
- visual schema documentation
- clear folder organization
- data license attribution
- reproducible setup instructions
- PostgreSQL
- pgAdmin
- SQL
- Markdown
- GitHub
- Lahman Baseball Database
- Relational database analysis
- Schema documentation
This repository is designed as a data portfolio project for roles involving:
- data analysis
- business analytics
- business intelligence
- analytics engineering
- reporting automation
- SQL development
- sports analytics
- data storytelling
It demonstrates the ability to move beyond basic query writing and use SQL to answer structured analytical questions, produce clean outputs, and explain insights clearly.
This project uses selected tables from the Lahman Baseball Database.
The original data is credited to Sean Lahman and is made available through SABR. The Lahman data is licensed under the Creative Commons Attribution-ShareAlike 3.0 Unported License.
See:
DATA_LICENSE.md
for full data attribution and license details.
The goals of this project are to:
- practice advanced SQL using a real historical dataset
- build a polished SQL portfolio project
- analyze baseball data through business-style questions
- demonstrate SQL skills beyond basic querying
- document query logic and results professionally
- create a reproducible PostgreSQL analysis casebook
Adham Elkhouly
- Reporting & Analytics Specialist @ Nestlé
- Microsoft Power Platform Functional Consultant Associate
This repository was created as a portfolio project to demonstrate SQL investigation skills, structured problem-solving, and clear technical documentation.
This project will be licensed under the license included in the repository.
See the LICENSE file for details.
The Lahman Baseball Database data used in this project is covered separately under the terms described in DATA_LICENSE.md.

