Skip to content

Latest commit

 

History

History
118 lines (79 loc) · 6.13 KB

File metadata and controls

118 lines (79 loc) · 6.13 KB

📂 Topic 1.2: File Systems vs. DBMS (Why do we need databases?)

To truly respect SQL, you must understand exactly how traditional Operating System files fail under load. "File System vs DBMS" isn't just theory—it is a study in advanced computer science addressing race conditions, data corruption, and memory limitations.


1. Definition

  • File System (Early Days): The OS's built-in mechanism (like NTFS, FAT32, ext4) for grouping bytes on a disk into logical named "files" (like .txt, .csv).
  • DBMS (Modern Approach): A dedicated software layer that runs as a persistent service daemon. It intercepts data requests and manages them using complex structures (B-Trees, Hash maps) to optimize hardware usage securely.

2. Why This Concept Exists

If you write a simple Python script to log website visitors to logs.txt, it works fine for 10 users. But if 100,000 users visit concurrently, the Linux Operating System places a "File Lock" on logs.txt. 99,999 users will encounter an error or their data will be permanently lost. The DBMS was invented specifically to bypass these OS limitations.


3. Why We Use It (The Core Differences - PRO Level)

Here are the 5 major problems with File Systems that a DBMS completely fixes:

  1. Data Redundancy & Normalization:
    • System: Duplicating data wastes disk space and slows down OS page caching.
    • DBMS: Centralizes data into normalized tables. Pointers (Foreign Keys) link the data mathematically, keeping storage minimal.
  2. Data Isolation (Siloing):
    • System: Data is trapped in .xml, .csv, and .json in various folders. Writing a single report linking them is an engineering nightmare.
    • DBMS: Enforces a unified schema so all data can be joined across the entire system dynamically using SQL.
  3. Data Integrity (Constraints):
    • System: A file has no logic. It will accept the word Apple into a variable meant for Age.
    • DBMS: Enforces strict memory types down to the byte level (e.g., an INT occupies exactly 4 bytes). Rejecting invalid writes directly at the DBMS engine saves the application from crashing.
  4. Atomicity & WAL (All-or-Nothing):
    • System: Transferring money crashes halfway. The file is half-written and permanently corrupted.
    • DBMS: Uses Write-Ahead Logging (WAL). Before changing the table, it writes its intent to a secure log. If power fails mid-transfer, on reboot, the DBMS reads the log and perfectly "rolls back" the corrupted half-transfer.
  5. Concurrency (MVCC):
    • System: Relies on strict OS-level total file locks. "File in use."
    • DBMS: Uses Multi-Version Concurrency Control (MVCC) and Row-Level Locking. Instead of locking the whole table, it locks only the specific row being edited. It also generates "snapshots" so Reader A can safely read old data while Writer B is actively modifying it.

4. When to Use It

  • File System: Storing massive unstructured binary objects (BLOBs) like Images, 4K Videos, or application source code.
  • DBMS: Tracking the metadata of those images (who uploaded it, timestamp, exact Amazon S3 bucket path string), and all relational business logic.

5. How It Works (The Mechanics)

  • File System Approach: O(N) Time Complexity. The OS opens users.csv and reads it line-by-line into RAM until it finds "John". Terrible for performance.
  • DBMS Approach: O(log N) Time Complexity. The DBMS builds a B+ Tree Index (like a massive, highly optimized Book Index). It traverses tree nodes to jump directly to John's specific block on the hard drive, completely bypassing the need to read the other millions of rows.

6. Syntax / Implementation Difference

The difference in defensive programming:

File System (Python Code example - Manual & Error Prone):

file = open("users.csv", "a")
# As a developer, I have to write 50 lines of complex locking/validation logic here manually
file.write("John, Apple") 
file.close()

DBMS (SQL Code example - Strict & Safe):

-- All validation, locking, and concurrency is handled by the Engine safely.
INSERT INTO users (name, age) VALUES ('John', 'Apple');
-- ❌ ERROR 1366: Incorrect integer value: 'Apple' for column 'age'

7. Real-Life Examples

The Netflix Example: If Netflix used a file system, millions of concurrent video streams trying to update their "watch history text file" would melt the OS scheduler due to block locking. Instead, they use a DBMS, which queues writes in RAM (Buffer Pool) and efficiently flushed them to disk in batches, preventing disks from locking up under immense parallel pressure.


8. SQL Examples (MySQL)

This is how the DBMS enforces those strict Integrity rules that files can't:

CREATE TABLE Accounts (
    Account_ID INT PRIMARY KEY,
    Balance DECIMAL(15, 2) NOT NULL, 
    CONSTRAINT positive_balance CHECK (Balance >= 0) -- DBMS physically prevents negative balances!
);

9. Common Mistakes

  • "Databases don't use files." Internally, the DBMS must write to files (Linux ext4 or Windows NTFS). However, it uses raw I/O techniques, creating specific "Page Blocks" (usually 16KB sections) filled with raw binary arrays to bypass normal OS interference.
  • Storing Images in the DBMS: Storing massive Base64 strings or images in BLOB columns bloats the 16KB page blocks. This destroys caching efficiency. Always store the image path string in the DB, and the file on AWS S3/File System.

10. Tips & Best Practices

The Polyglot Architecture: In a master-level architecture, you don't use just one. You store standard files (Videos) on the OS File System -> You store logs and metrics in a NoSQL DBMS -> You store payments, user relationships, and critical data in a Relational DBMS (MySQL).


11. Mini Practice Tasks

  • Task 1: Look up "Write-Ahead Log (WAL) Database". Write down a one-sentence summary of how it saves a database if a server catches on fire mid-transaction.
  • Task 2: Memorize what O(log N) vs O(N) means for searching. Why is a DBMS B+Tree mathematically faster than an OS text file scan?