Skip to content

nabakrishna/search-engine

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

1 Commit
 
 
 
 
 
 
 
 
 
 

Repository files navigation

About

Tiny Search Engine is a simple, ground-up search engine implemented in Java, which builds an index on a folder full of text documents, and then allows you to search this corpus interactively through the command line. Tiny Search Engine functions by tokenizing each of its documents, generating an inverted index mapping every word to a set of documents containing that word, and finally assigning scores to matching documents based on their TF-IDF score – a well-known formula that assigns high scores to documents with multiple occurrences of the query, but penalizes extremely frequent queries that don’t add much meaning to the result set.

Indexing pipeline

1. Read : Java NIO Files.readString() slurps each .txt file as UTF-8.
2. Normalize : Lowercase, strip punctuation via regex, split on whitespace — identical pipeline for docs and queries.
3. Post : Each token is inserted into HashMap<term, List<Posting>> with its document ID and position.
4. Rank : At query time, cumulative TF-IDF scores are computed per document, sorted descending, and returned.

Scoring Formula

score(d, q) = Σ TF(t,d) × IDF(t)
// TF(t,d) = 1 + log₁₀(count of t in d) — log-normalised
// IDF(t) = log₁₀(N / df(t)) — penalises common terms

Terms

TF (Term Frequency) – how frequently does your search term occur in the document? In case the search term “java” occurs in some documents ten times, this is much more likely to be relevant to your search than the document mentioning it only once. Nevertheless, rather than using the actual frequency, the engine applies logarithmics in order not to let the larger size of the document determine its relevance.

IDF (Inverse Document Frequency) – how common or rare is the term in relation to all the documents? The terms like “the” or “is”, which are used almost everywhere and do not provide any distinctive characteristic of the documents, receive low scores here.

📁 Project Structure

search-engine/
├── docs/                          ← Your .txt corpus (4 sample files included)
│   ├── algorithms.txt
│   ├── data_structures.txt
│   ├── java_basics.txt
│   └── search_engines.txt
├── src/main/java/com/tinysearch/
│   ├── Posting.java               ← Core data model
│   ├── DocumentParser.java        ← NIO file reading + Stream tokenization
│   ├── Indexer.java               ← Inverted index builder
│   ├── SearchResult.java          ← Ranked result value object
│   ├── Searcher.java              ← TF-IDF query engine
│   └── Main.java                  ← Interactive CLI
├── out/                           ← Compiled .class files go here
└── build_and_run.sh               ← One-command build script

run:------

Compile all sources

javac -d out src/main/java/com/tinysearch/*.java

Run (defaults to the docs/ folder)

java -cp out com.tinysearch.Main docs

Or run against any folder of .txt files

java -cp out com.tinysearch.Main /path/to/your/texts
---

💡 Class-by-Class Design Walkthrough

**`Posting.java`** — the atom of the entire engine. Stores a `docId` (int) and a `List<Integer>` of every word-position where a term appeared. `getTermFrequency()` is trivially `positions.size()`, keeping the data model clean.

**`DocumentParser.java`** — a pure utility class (no state). Uses `Files.readString(path)` (Java NIO, Java 11+) then a single Stream pipeline: lowercase → strip punctuation via regex `[^\p{L}\p{N}\s]` → split → filter blanks → collect to List.

**`Indexer.java`** — the heart of the engine. Walks a directory with `Files.walk()` (NIO), filters `.txt` files, assigns each an integer docId, and builds the `HashMap<String, List<Posting>>`. The `computeIfAbsent` idiom keeps the insertion code elegant.

**`SearchResult.java`** — an immutable value object. Implements `Comparable<SearchResult>` in *descending* score order so `Collections.sort(results)` produces the ranked list directly.

**`Searcher.java`** — executes TF-IDF ranking. For each query term:
- **TF** (log-normalised): `1 + log₁₀(count)` — dampens extreme frequencies
- **IDF**: `log₁₀(N / df)` — rare terms across the corpus score higher
- Final document score = **Σ TF × IDF** across all query terms (OR-combined)

**`Main.java`** — a full interactive CLI with ANSI colour output, plus special commands:

| Command | What it does |
|---|---|
| `java` | Ranked search for the word "java" |
| `search engines` | Multi-term OR search |
| `:stats` | Shows document count, unique terms, and the ID→filename table |
| `:inspect 2` | Lists every indexed term for document 2 |
| `:quit` | Exit |

---

🧠 TF-IDF in One Diagram

Query: "java"

Doc 0 (java_basics.txt): "java" appears 10× → TF = 1 + log₁₀(10) = 2.0 Doc 2 (search_engines.txt): "java" appears 0× → Not in posting list, skipped

IDF("java") = log₁₀(4 docs / 1 doc containing "java") = 0.602

Score(doc 0) = 2.0 × 0.602 = 1.204 ← wins

About

A search engine from scratch in Java language – inverted index, TF-IDF algorithm, and a command-line interface.

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors