SmartIndex is a MongoDB composite index recommendation tool. Given a CSV that describes an application's query patterns, it produces a minimal set of composite indexes that respect MongoDB's prefix rule, follow the ESR ordering convention, and place the most selective fields first.
- The problem
- How SmartIndex solves it
- Theory
- Pipeline
- Project layout
- Quick start
- CSV format
- Output
- Tests
Designing MongoDB composite indexes by hand is hard:
- Field order matters. A composite index
{a:1, b:1, c:1}only efficiently supports filters/sorts on its leftmost prefixes:{a},{a,b},{a,b,c}. It does not efficiently serve{b},{c},{b,c}, or{b,a}. - Different operators want different positions. Equality predicates, sort keys, and range predicates each behave differently inside an index and should appear in a specific order.
- Indexes are expensive. Each index costs memory, disk, and write throughput. A naive "one index per query" strategy explodes in both number and total size.
- Queries overlap. Many application queries share leading fields and can be served by a single, longer composite index — but only if the field order lines up exactly.
A correct recommendation must therefore choose which fields, in which order, per index, and then eliminate redundancy across indexes.
SmartIndex is a four-stage pipeline:
- Parse the CSV of query patterns into structured
Queryobjects. - Score each field's selectivity with Shannon entropy, either sampled live from MongoDB or estimated heuristically when no collection is available.
- Order the fields of each query using a small weighted directed graph that combines ESR priority with selectivity differences.
- Minimise the resulting candidate set with a trie-based MinSet optimiser that absorbs any index that is an exact leftmost prefix of a longer one.
The result is a small set of composite indexes, each annotated with the queries it covers, the partial-index condition it should use (if any), and a short justification.
A composite index in MongoDB is an ordered list of fields. The index can efficiently serve any query whose filter/sort uses a leftmost prefix of that list, in the same order:
Index: [A, B, C]
Efficient for: [A], [A, B], [A, B, C]
NOT efficient: [B], [C], [B, C], [B, A]
This is the single most important rule in SmartIndex. Both the ordering algorithm (which decides what the prefix should be) and the minimisation algorithm (which removes indexes already implied by a longer prefix) are built around it.
For a single composite index, MongoDB's recommended field order is:
Equality → Sort → Range
- Equality (E) fields filter to a single value and shrink the remaining keys the most, so they should appear first.
- Sort (S) fields can be served in order from the index only if they come after equality and before range fields.
- Range (R) fields (
$gt,$lt,$gte, …) scan a contiguous segment of the index. If they appear before sort or equality fields, they break the prefix and prevent the rest of the index from being used efficiently.
SmartIndex encodes this as a numeric priority used in graph edge weights:
ESR_PRIORITY = {"E": 3, "S": 2, "R": 1}When several fields share the same ESR class (e.g. all equality), SmartIndex breaks the tie using selectivity — a measure of how effectively a filter on that field reduces the number of matching documents.
A common shortcut is "count distinct values", but that can be misleading: a field with 1,000 distinct values where 99% of documents have the same value is in practice not very selective. Shannon entropy captures both the number of values and how evenly they are distributed:
where
Sample distributions:
| Distribution | Entropy (bits) |
|---|---|
[0.5, 0.5] |
1.000 |
[0.25, 0.25, 0.25, 0.25] |
2.000 |
[0.7, 0.1, 0.1, 0.1] |
1.356 |
[0.99, 0.01] |
0.081 |
Equality fields are scored directly from their entropy. Range
fields are continuous; SmartIndex either samples bucket counts from
MongoDB or falls back to the default distribution [0.5, 0.5], then
multiplies the result by a range penalty (0.4) so range fields
keep a meaningful score but do not displace equality fields. Sort
fields receive a smaller 0.8 penalty.
Implementation: smartindex/selectivity_calculator.py.
For each query, SmartIndex builds a small directed weighted graph:
- Nodes = the fields used in that query.
-
Edges = an edge
u → vwith weight$w(u, v)$ expresses how stronglyushould come beforev.
Edge weight combines three terms:
-
$w_{ESR}(u, v) = 10 \cdot (\text{prio}(u) - \text{prio}(v))$ — large positive ifubelongs to a higher ESR class thanv. -
$w_{sel}(u, v) = \text{sel}(u) - \text{sel}(v)$ — positive whenuis more selective. -
$w_{range}(u, v)$ — explicit penalty that pushes range fields toward the end of the index.
Each node is then ranked by its net pull:
Fields are sorted by score descending. The resulting list is the
candidate composite index for that query.
Why a graph instead of plain sorting? It lets us compose multiple
ordering signals (ESR, selectivity, the range penalty, and future
signals such as query frequency or partial-index bonuses) into a single
comparable quantity without writing brittle nested if chains.
Implementation: smartindex/graph_ordering.py.
After ordering, SmartIndex has one candidate index per unique query shape. Many of these indexes overlap — a longer index can serve every query whose required prefix it begins with.
A trie is the perfect structure for this. Each candidate is inserted as a path from the root; shared prefixes share trie nodes:
Candidates: [A, B, C], [A, B], [A], [D, E]
Trie:
root
├── A (candidate [A])
│ └── B (candidate [A, B])
│ └── C (candidate [A, B, C])
└── D
└── E (candidate [D, E])
The optimiser walks the trie and drops every candidate that has a strict descendant, since that descendant is a longer index with the shorter one as an exact prefix. The query coverage of the absorbed shorter indexes is propagated upward to the surviving longer index, so nothing is forgotten.
Critically, this only collapses true leftmost prefixes:
[A, B, C]absorbs[A, B]and[A]. ✅[A, C, B]does not absorb[A, B]— different order, different trie path. ✅[B, C]does not absorb[A, B, C]— not a prefix of it. ✅
Implementation: smartindex/minset.py.
CSV file
│
▼
CSVReader ─── parse rows, expand [or] alternations
│
▼
Encoder ─── short symbols (a, b, c, …) for graph stages
│
▼
SelectivityCalculator ── Shannon entropy per field (live or heuristic)
│
▼
GraphOrderer ─── per-query weighted graph → ordered field list
│
▼
IndexGenerator ─── dedupe identical orderings, merge query coverage
│
▼
MinSetOptimiser ─── trie-based prefix absorption
│
▼
(optional) MongoTester ── create indexes, run .explain(), collect metrics
│
▼
ReportGenerator ─── decode field names + print createIndex() lines
smartindex/
__init__.py
main.py # CLI entry point
models.py # Query / QueryField / IndexCandidate
csv_reader.py # CSV parsing
encoder.py # field name ↔ short symbol
selectivity_calculator.py # Shannon entropy (live + heuristic)
graph_ordering.py # per-query field ordering
index_generator.py # candidate index generation
minset.py # trie-based prefix optimiser
mongo_tester.py # optional .explain() runner
kendall_tau_explorer.py # optional add-on validator
greedy_remover.py # optional add-on optimiser
report.py # human-readable output
tests/
test_csv_reader.py
test_encoder.py
test_graph_ordering.py
test_index_generator.py
test_minset.py
test_kendall_tau_explorer.py
test_greedy_remover.py
synthetic_data/ # MongoDB synthetic data generators
safe/ # legacy backup scripts (not used)
test_query.csv # sample CSV (uses synthetic data fields)
query.csv # second sample CSV
# 1. (Optional) populate MongoDB with synthetic data
python synthetic_data\mongodbscript3.py
# 2. Run SmartIndex on a query CSV (offline, heuristic selectivity)
python -m smartindex.main test_query.csv
# 3. Run SmartIndex against MongoDB and benchmark each recommendation
python -m smartindex.main test_query.csv `
--mongo-uri mongodb://localhost:27017 `
--db testdb --collection orders --testWhen --mongo-uri, --db, and --collection are supplied,
SelectivityCalculator samples real distributions from MongoDB. When
--test is also passed, MongoTester creates each recommended index,
runs an .explain() for a representative query, collects the metrics,
and drops the index again.
Two opt-in validators run after the main pipeline and require a live
MongoDB connection. They are off by default; enable each with a -
flag:
| Flag | Tool | Question it answers |
|---|---|---|
-kt / --kendall-tau |
Kendall-Tau Explorer | Is there a better order near the predicted one? |
-gr / --greedy-remove |
Greedy Remover | Can we drop trailing fields without losing perf? |
Kendall-Tau Explorer (smartindex/kendall_tau_explorer.py)
generates permutations within Kendall-Tau distance ≤ --kt-distance
(default 2) of the graph-derived order, scores each via .explain(),
and keeps the best one. Tunables: --kt-distance, --kt-max-perms.
Greedy Remover (smartindex/greedy_remover.py)
repeatedly drops the trailing field of one index, re-scores the whole
workload, and accepts the removal only if total degradation stays
within --gr-degradation (default 0.10 = 10%). Stops when no further
removal is acceptable. Tunables: --gr-degradation, --gr-min-length.
python -m smartindex.main test_query.csv `
--mongo-uri mongodb://localhost:27017 --db testdb --collection orders `
--test -kt --kt-distance 2 -gr --gr-degradation 0.1Search/Combination,ESR,Range Criteria,Partial,Sorting Logic,Limit
"balance, name, string","R, E, E",None,None,"","200"
"string, date","E,R","date=30",None,"","200"
"name, age","E, R",None,None,"","200"
"city, age [or] fav","E, [R,R]","age=20,fav=10","age/lt/120","","200"
Column meanings:
| Column | Description |
|---|---|
Search/Combination |
Comma-separated field names used by the query |
ESR |
Per-field tag: E equality, S sort, R range |
Range Criteria |
Optional field=value or field op value constraints |
Partial |
Optional partial-index filter (e.g. balance/gt/50) |
Sorting Logic |
Optional sort description |
Limit |
Optional query .limit() value |
[or] expands a single row into one query per alternative:
city, age [or] fav → (city, age) and (city, fav)
E, [R,R] → (E, R) and (E, R)
Example run on test_query.csv:
[1] db.collection.createIndex({ "name": 1, "string": 1, "balance": 1, "age": 1 },
{ name: "si_b_c_a_e" })
encoded fields : ['b', 'c', 'a', 'e']
queries covered: [1, 3]
partial : balance/gt/50,name/ne/Chris Doe
reason : ESR + selectivity ordering of … (absorbs prefix)
Each line maps directly to a db.<collection>.createIndex(...)
command. The report also lists which input queries each index covers,
the partial filter (if any), and a short reason that records the
ordering decision.
python -m pytest -qThe test suite covers CSV parsing (including [or] expansion), the
encoder roundtrip, ESR + selectivity ordering, candidate generation
and deduplication, prefix absorption, non-prefix preservation, and an
end-to-end run from CSV to final indexes.