Id3arium/Etymon

Etymon

The true root hiding beneath the surface.

A word-connection engine built on GloVe embeddings. Give it two or more words and it finds the word that links them all: the etymon, the hidden root beneath them.

It surfaces the strongest shared association across the whole set, which is useful for brainstorming, word games, finding a category that covers a list, or any time you want the link hiding beneath a group of words.

Etymon runs two methods in parallel, a fast set intersection and a deeper best-first graph traversal, then merges the candidates and ranks each one by its weakest-link score.

Quick Start

# Just run it — downloads GloVe and builds the graph automatically on first run
python server.py
# Open http://localhost:8080

The first run downloads the GloVe embeddings (~822 MB), extracts the 300d file, and builds the neighbor graph (~2 min). Subsequent runs load the pre-built graph in seconds.

Screenshot

Etymon UI

The browser UI: enter a few words, get the words that connect them.

Command Line Usage

Find the words that connect a set of words (auto-builds the graph if needed):

$ python graph.py search cat lion

Targets: ['cat', 'lion']
Method: both
Time: 1ms

Results:
  dog                  0.565  [traversal]
    cat: cat → dog
    lion: lion → bear → dog
  cats                 0.558  [traversal]
    cat: cat → cats
    lion: lion → leopard → cats
  elephant             0.515  [traversal]
    cat: cat → monkey → elephant
    lion: lion → elephant
  ...

Each result shows the connecting word, its score (the weakest of its links, so higher means it sits close to every input), which method found it, and the path the traversal walked from each input.
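
The weakest-link idea can be sketched in a few lines. This is a minimal illustration, not the code in graph.py: the function name `weakest_link_score`, the `similarity` callback, and the toy numbers are all assumptions made for the example, and it treats each "link" as a direct similarity between the candidate and one input.

```python
def weakest_link_score(candidate, targets, similarity):
    """Score a candidate by its lowest similarity to any target word."""
    return min(similarity(candidate, t) for t in targets)


# Toy similarity table with made-up values (not real GloVe similarities).
TOY_SIMS = {
    ("dog", "cat"): 0.80, ("dog", "lion"): 0.57,
    ("zoo", "cat"): 0.40, ("zoo", "lion"): 0.62,
}


def toy_similarity(a, b):
    return TOY_SIMS[(a, b)]


targets = ["cat", "lion"]
# Rank candidates so the one closest to *every* target comes first.
ranked = sorted(
    ["zoo", "dog"],
    key=lambda w: weakest_link_score(w, targets, toy_similarity),
    reverse=True,
)
print(ranked)
```

Taking the minimum rather than the average means one strong link cannot hide a weak one: "zoo" is closer to "lion" than "dog" is, but its weak link to "cat" drags it down.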

Exclude specific words from the results with --avoid. The search itself is unchanged; any word you list is simply filtered out of the output:

$ python graph.py search cat lion tiger --avoid king

Targets: ['cat', 'lion', 'tiger']
Method: both
Time: 2ms

Results:
  cats                 0.475  [traversal]
  elephant             0.474  [traversal]
  leopard              0.459  [both]
  ...
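
The `[traversal]` and `[both]` tags, together with the `--avoid` filter, suggest a merge step along the following lines. This is a hedged sketch only: `merge_candidates`, the `{word: score}` input shape, and the rule of keeping the higher score when both methods find a word are assumptions for illustration, not the behavior of graph.py.

```python
def merge_candidates(intersection, traversal, avoid=()):
    """Merge two {word: score} dicts, tag each word's source, drop avoided words."""
    avoid = set(avoid)
    merged = {}
    for method, found in (("intersection", intersection), ("traversal", traversal)):
        for word, score in found.items():
            if word in avoid:
                continue  # filtered from the output only; the search is untouched
            if word in merged:
                # Found by both methods: keep the better score, retag as "both".
                merged[word] = (max(merged[word][0], score), "both")
            else:
                merged[word] = (score, method)
    # Highest score first; ties broken alphabetically for stable output.
    rows = [(word, score, method) for word, (score, method) in merged.items()]
    return sorted(rows, key=lambda row: (-row[1], row[0]))


results = merge_candidates(
    {"leopard": 0.459, "king": 0.5},
    {"cats": 0.475, "leopard": 0.459},
    avoid=["king"],
)
for word, score, method in results:
    print(f"  {word:<20} {score:.3f}  [{method}]")
```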

Explore a single word's nearest neighbors:

$ python graph.py neighbors engine --n 50

Top 50 neighbors of 'engine':
  engines              0.881
  cylinder             0.591
  diesel               0.589
  horsepower           0.577
  powered              0.567
  turbine              0.555
  ...

Build with custom settings, or point at an existing GloVe file:

# Larger vocabulary, more neighbors per word
python graph.py build --vocab 75000 --top-k 200

# Use a GloVe file you already have
python graph.py build ~/downloads/glove.6B.300d.txt
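
For intuition about what the build step produces, here is a small NumPy sketch that computes the top-k cosine neighbors of every row in an embedding matrix. The function name `build_neighbor_graph` and the dense all-pairs approach are assumptions for illustration; the actual build in graph.py may work differently.

```python
import numpy as np


def build_neighbor_graph(embeddings, top_k):
    """Return (indices, scores) of each row's top_k most cosine-similar rows."""
    # Normalize rows so that a dot product equals cosine similarity.
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.maximum(norms, 1e-12)
    sims = unit @ unit.T
    np.fill_diagonal(sims, -np.inf)  # a word is not its own neighbor
    # Sort each row by descending similarity and keep the first top_k columns.
    indices = np.argsort(-sims, axis=1)[:, :top_k]
    scores = np.take_along_axis(sims, indices, axis=1)
    return indices, scores


# Three toy 2-d "embeddings": rows 0 and 1 point almost the same way.
toy = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
indices, scores = build_neighbor_graph(toy, top_k=1)
print(indices.ravel(), scores.ravel())
```

A dense similarity matrix is fine for a toy example but would need roughly 10 GB in float32 for 50,000 words, so a real build would more plausibly process the vocabulary in row chunks and use `np.argpartition` to avoid a full sort.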

Architecture

┌────────────────────────────────────────────────────────┐
│  GloVe embeddings (50k words × 300 dimensions)         │
│  Auto-downloaded on first run from Stanford NLP        │
└──────────────────┬─────────────────────────────────────┘
                   │ build step (~2 min, one time)
                   ▼
┌────────────────────────────────────────────────────────┐
│  Neighbor graph (50k words × 150 neighbors each)       │
│  Stored as numpy arrays (~60 MB on disk)               │
└──────────────────┬─────────────────────────────────────┘
                   │ query time
                   ▼
┌────────────────────────────────────────────────────────┐
│  Search engine — runs BOTH methods, then merges        │
│                                                        │
│  A. Set intersection (fast, ~1ms)                      │
│     neighbors(word_A) ∩ neighbors(word_B)              │
│     Progressive widening: top-50 → top-100 → top-150   │
│                                                        │
│  B. Best-first traversal (deep)                        │
│     Walks the graph from each target independently,    │
│     using embedding similarity as heuristic, then      │
│     intersects the reachable sets                      │
│     Depth limit: 2     Node budget: 500 max explored   │
│     Similarity floor: 0.05 minimum                     │
│                                                        │
│  → Candidates from both are merged and ranked by       │
│    strongest connection (weakest-link score). The      │
│    best word wins regardless of which method found it. │
└────────────────────────────────────────────────────────┘
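
Method A in the diagram can be sketched as follows. This is an illustrative reading, not the code in graph.py: the `neighbors` mapping (word to a best-first list of `(neighbor, similarity)` pairs), the function name, and the choice to stop widening at the first non-empty intersection are all assumptions.

```python
def intersect_neighbors(targets, neighbors, widths=(50, 100, 150)):
    """Return words found in every target's top-N list, widening N until some exist.

    `neighbors` maps word -> list of (neighbor, similarity), best first.
    """
    for width in widths:
        tops = [dict(neighbors[t][:width]) for t in targets]
        shared = set(tops[0]).intersection(*tops[1:])
        shared -= set(targets)  # an input word is not a useful answer
        if shared:
            # Weakest-link score: the lowest similarity to any target.
            scored = {w: min(top[w] for top in tops) for w in shared}
            return sorted(scored.items(), key=lambda kv: (-kv[1], kv[0]))
    return []


# Toy graph with made-up similarities; widths shrunk to fit the toy data.
toy_neighbors = {
    "cat": [("dog", 0.80), ("leopard", 0.60)],
    "lion": [("tiger", 0.70), ("leopard", 0.65)],
}
print(intersect_neighbors(["cat", "lion"], toy_neighbors, widths=(1, 2)))
```

With a width of 1 the two lists share nothing, so the search widens to 2 and finds "leopard", scored by its weaker link.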

Tuning

All thresholds are configurable. Good starting points:

| Parameter | Default | What it does |
|---|---|---|
| --vocab | 50,000 | Dictionary size. 50k covers most common English words. |
| --top-k | 150 | Neighbors per word. Higher = more creative leaps, more noise. |
| max_depth | 2 | Graph traversal depth. 2 is usually enough; 3 for desperate cases. |
| max_nodes | 500 | Safety valve on traversal. Prevents runaway searches. |
| min_similarity | 0.05 | Don't explore branches below this similarity. Prunes dead ends. |
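
To show how max_depth, max_nodes, and min_similarity interact, here is a sketch of a best-first walk from each input followed by an intersection of the reachable sets. It is a simplified illustration under stated assumptions: the function names and the `neighbors` mapping are hypothetical, and edge similarity is used as the priority, which may not match the heuristic graph.py uses.

```python
import heapq


def best_first_reach(start, neighbors, max_depth=2, max_nodes=500, min_similarity=0.05):
    """Explore outward from `start`, most similar edges first.

    `neighbors` maps word -> list of (neighbor, similarity).
    Returns {word: path from start} for every word reached.
    """
    paths = {start: [start]}
    # heapq is a min-heap, so similarities are negated to pop the best first.
    frontier = [(-1.0, start, 0)]
    explored = 0
    while frontier and explored < max_nodes:  # node budget
        _, word, depth = heapq.heappop(frontier)
        explored += 1
        if depth == max_depth:  # depth limit: do not expand further
            continue
        for nxt, sim in neighbors.get(word, []):
            if sim < min_similarity or nxt in paths:  # similarity floor, no revisits
                continue
            paths[nxt] = paths[word] + [nxt]
            heapq.heappush(frontier, (-sim, nxt, depth + 1))
    return paths


def connecting_words(targets, neighbors, **limits):
    """Words reachable from every target, with the path from each one."""
    reached = [best_first_reach(t, neighbors, **limits) for t in targets]
    shared = set(reached[0]).intersection(*reached[1:]) - set(targets)
    return {w: [r[w] for r in reached] for w in shared}


# Toy graph with made-up similarities.
toy_graph = {
    "cat": [("dog", 0.8)],
    "lion": [("bear", 0.7)],
    "bear": [("dog", 0.6)],
}
print(connecting_words(["cat", "lion"], toy_graph))
```

Lowering max_depth to 1 in this toy graph removes "dog" from the results, because "lion" only reaches it through "bear".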

File Structure

Etymon/
├── graph.py      # Core engine: loading, building, searching
├── server.py     # Web server with JSON API
├── ui.html       # Browser UI
├── README.md     # This file
└── graph_data/   # Built graph (auto-created on first run)
    ├── words.json
    ├── embeddings.npy
    ├── neighbor_indices.npy
    ├── neighbor_scores.npy
    └── meta.json
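
If you want to inspect the built graph from your own script, something like the following may work. The file names come from the tree above, but their contents are assumptions: this sketch supposes words.json is a JSON list of strings and that the two neighbor arrays are shaped (vocab, top_k) with rows aligned to that list. The helper names `load_graph` and `top_neighbors` are hypothetical.

```python
import json
from pathlib import Path

import numpy as np


def load_graph(directory):
    """Load the word list and neighbor arrays from a graph_data-style folder."""
    directory = Path(directory)
    words = json.loads((directory / "words.json").read_text(encoding="utf-8"))
    indices = np.load(directory / "neighbor_indices.npy")
    scores = np.load(directory / "neighbor_scores.npy")
    return words, indices, scores


def top_neighbors(word, words, indices, scores, n=10):
    """Return up to n (neighbor, score) pairs for `word`."""
    row = words.index(word)  # linear scan; a dict lookup would be faster
    return [(words[j], float(s)) for j, s in zip(indices[row, :n], scores[row, :n])]


if __name__ == "__main__":
    words, indices, scores = load_graph("graph_data")
    for neighbor, score in top_neighbors("engine", words, indices, scores, n=5):
        print(f"  {neighbor:<20} {score:.3f}")
```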
