Tiny Search Engine is a simple, ground-up search engine implemented in Java, which builds an index on a folder full of text documents, and then allows you to search this corpus interactively through the command line. Tiny Search Engine functions by tokenizing each of its documents, generating an inverted index mapping every word to a set of documents containing that word, and finally assigning scores to matching documents based on their TF-IDF score – a well-known formula that assigns high scores to documents with multiple occurrences of the query, but penalizes extremely frequent queries that don’t add much meaning to the result set.
1. Read : Java NIO Files.readString() slurps each .txt file as UTF-8.
2. Normalize : Lowercase, strip punctuation via regex, split on whitespace — identical pipeline for docs and queries.
3. Post : Each token is inserted into HashMap<term, List<Posting>> with its document ID and position.
4. Rank : At query time, cumulative TF-IDF scores are computed per document, sorted descending, and returned.
score(d, q) = Σ TF(t,d) × IDF(t)
// TF(t,d) = 1 + log₁₀(count of t in d) — log-normalised
// IDF(t) = log₁₀(N / df(t)) — penalises common terms
TF (Term Frequency) – how frequently does your search term occur in the document? In case the search term “java” occurs in some documents ten times, this is much more likely to be relevant to your search than the document mentioning it only once. Nevertheless, rather than using the actual frequency, the engine applies logarithmics in order not to let the larger size of the document determine its relevance.
IDF (Inverse Document Frequency) – how common or rare is the term in relation to all the documents? The terms like “the” or “is”, which are used almost everywhere and do not provide any distinctive characteristic of the documents, receive low scores here.
search-engine/
├── docs/ ← Your .txt corpus (4 sample files included)
│ ├── algorithms.txt
│ ├── data_structures.txt
│ ├── java_basics.txt
│ └── search_engines.txt
├── src/main/java/com/tinysearch/
│ ├── Posting.java ← Core data model
│ ├── DocumentParser.java ← NIO file reading + Stream tokenization
│ ├── Indexer.java ← Inverted index builder
│ ├── SearchResult.java ← Ranked result value object
│ ├── Searcher.java ← TF-IDF query engine
│ └── Main.java ← Interactive CLI
├── out/ ← Compiled .class files go here
└── build_and_run.sh ← One-command build script
javac -d out src/main/java/com/tinysearch/*.java
java -cp out com.tinysearch.Main docs
java -cp out com.tinysearch.Main /path/to/your/texts
---
**`Posting.java`** — the atom of the entire engine. Stores a `docId` (int) and a `List<Integer>` of every word-position where a term appeared. `getTermFrequency()` is trivially `positions.size()`, keeping the data model clean.
**`DocumentParser.java`** — a pure utility class (no state). Uses `Files.readString(path)` (Java NIO, Java 11+) then a single Stream pipeline: lowercase → strip punctuation via regex `[^\p{L}\p{N}\s]` → split → filter blanks → collect to List.
**`Indexer.java`** — the heart of the engine. Walks a directory with `Files.walk()` (NIO), filters `.txt` files, assigns each an integer docId, and builds the `HashMap<String, List<Posting>>`. The `computeIfAbsent` idiom keeps the insertion code elegant.
**`SearchResult.java`** — an immutable value object. Implements `Comparable<SearchResult>` in *descending* score order so `Collections.sort(results)` produces the ranked list directly.
**`Searcher.java`** — executes TF-IDF ranking. For each query term:
- **TF** (log-normalised): `1 + log₁₀(count)` — dampens extreme frequencies
- **IDF**: `log₁₀(N / df)` — rare terms across the corpus score higher
- Final document score = **Σ TF × IDF** across all query terms (OR-combined)
**`Main.java`** — a full interactive CLI with ANSI colour output, plus special commands:
| Command | What it does |
|---|---|
| `java` | Ranked search for the word "java" |
| `search engines` | Multi-term OR search |
| `:stats` | Shows document count, unique terms, and the ID→filename table |
| `:inspect 2` | Lists every indexed term for document 2 |
| `:quit` | Exit |
---
Query: "java"
Doc 0 (java_basics.txt): "java" appears 10× → TF = 1 + log₁₀(10) = 2.0 Doc 2 (search_engines.txt): "java" appears 0× → Not in posting list, skipped
IDF("java") = log₁₀(4 docs / 1 doc containing "java") = 0.602
Score(doc 0) = 2.0 × 0.602 = 1.204 ← wins