A Go implementation of Word2Vec / Doc2Vec (Paragraph Vector) for training word and document embeddings. Based on Tomas Mikolov's word2vec and doc2vec papers, with support for Semantic Word Embedding (SWE) synonym constraints.
- CBOW and Skip-Gram model architectures
- Negative Sampling and Hierarchical Softmax optimization
- Doc2Vec document embedding training (PV-DM / PV-DBOW)
- Online inference for new document vectors
- Semantic Word Embedding (SWE): ordinal knowledge constraints based on ACL-2015 paper
- Similarity queries: word2words / word2docs / doc2docs / doc2words / sen2words / sen2docs (see the sketch after this list)
- Document likelihood estimation
- Leave-one-out keyword extraction
- Document similarity calculation (DocSimCal)
- Efficient model serialization via MessagePack
- Word Mover's Distance (WMD)
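
The similarity queries above are nearest-neighbor lookups over the learned vectors. As a minimal, hypothetical sketch (the map-based storage and function names are assumptions, not this repository's API), word2words amounts to cosine-similarity KNN:

```go
// Hypothetical sketch of a word2words query over in-memory word vectors.
// Storage layout and names are illustrative, not this repo's actual API.
package main

import (
	"fmt"
	"math"
	"sort"
)

// cosine returns the cosine similarity of two equal-length vectors.
func cosine(a, b []float32) float64 {
	var dot, na, nb float64
	for i := range a {
		dot += float64(a[i]) * float64(b[i])
		na += float64(a[i]) * float64(a[i])
		nb += float64(b[i]) * float64(b[i])
	}
	if na == 0 || nb == 0 {
		return 0
	}
	return dot / (math.Sqrt(na) * math.Sqrt(nb))
}

type neighbor struct {
	Word string
	Sim  float64
}

// word2words ranks every other vocabulary word by similarity to the query.
func word2words(vecs map[string][]float32, query string, k int) []neighbor {
	q, ok := vecs[query]
	if !ok {
		return nil
	}
	out := make([]neighbor, 0, len(vecs))
	for w, v := range vecs {
		if w == query {
			continue
		}
		out = append(out, neighbor{w, cosine(q, v)})
	}
	sort.Slice(out, func(i, j int) bool { return out[i].Sim > out[j].Sim })
	if len(out) > k {
		out = out[:k]
	}
	return out
}

func main() {
	// Toy 2-dimensional vectors, purely for demonstration.
	vecs := map[string][]float32{
		"网页": {0.9, 0.1}, "浏览": {0.8, 0.2}, "邮件": {0.7, 0.3},
	}
	for _, n := range word2words(vecs, "网页", 2) {
		fmt.Printf("%.2f %s\n", n.Sim, n.Word)
	}
}
```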
- Go >= 1.24
```
go build -o train train.go
go build -o knn knn.go
```

Or use the included script:

```
./control build
```

Training data format: one document per line, two TAB-separated columns (docid + space-tokenized text):

```
1	why does zhihu have some users with avatars and some without
2	the avatar feature is still in beta testing ...
```
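
As a hedged illustration (not this repository's actual loader), reading that two-column format in Go could look like:

```go
// Sketch of a loader for the docid<TAB>tokenized-text corpus format
// described above. Names here are illustrative, not the repo's actual API.
package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

type document struct {
	ID     string
	Tokens []string
}

func loadCorpus(path string) ([]document, error) {
	f, err := os.Open(path)
	if err != nil {
		return nil, err
	}
	defer f.Close()

	var docs []document
	sc := bufio.NewScanner(f)
	for sc.Scan() {
		// Two TAB-separated columns: document ID, then space-tokenized text.
		parts := strings.SplitN(sc.Text(), "\t", 2)
		if len(parts) != 2 {
			continue // skip malformed lines
		}
		docs = append(docs, document{
			ID:     parts[0],
			Tokens: strings.Fields(parts[1]),
		})
	}
	return docs, sc.Err()
}

func main() {
	docs, err := loadCorpus("data/zhihu_data.1w")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Printf("loaded %d documents\n", len(docs))
}
```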
Basic training (Skip-Gram + Negative Sampling):

```
./train data/zhihu_data.1w
```

Training with full parameters:

```
./train -corpus data/zhihu_data.1w \
    -dim 100 \
    -window 5 \
    -iters 50 \
    -neg \
    -output my.model
```

Training with synonym constraints (SWE):

```
./train -corpus data/zhihu_data.1w \
    -swe data/synonym_constraints.txt \
    -swe-coeff 0.1 \
    -output swe.model
```

Run queries against a trained model with the interactive knn tool:

```
./knn 2.model
```

Interactive operation selection:
```
please select operation type:
0:word2words
1:doc_likelihood
2:leave one out key words
3:sen2words
4:sen2docs
5:word2docs
6:doc2docs
7:doc2words
0
Enter text:网页
1 网页
0.78 不让
0.77 浏览
0.76 邮件
...
```

Here the query word 网页 ("web page") returns its nearest neighbors with similarity scores, e.g. 浏览 ("browse") and 邮件 ("email").
| Flag | Default | Description |
|---|---|---|
| `-corpus` | (required) | Training corpus file path (also accepts a positional argument) |
| `-output` | `2.model` | Output model file path |
| `-cbow` | `false` | Use the CBOW model (default: Skip-Gram) |
| `-hs` | `false` | Use Hierarchical Softmax |
| `-neg` | `true` | Use Negative Sampling |
| `-window` | `5` | Context window size |
| `-dim` | `50` | Word/document embedding dimension |
| `-iters` | `50` | Number of training iterations |
| Flag | Default | Description |
|---|---|---|
| `-swe` | (empty) | Semantic constraint file path; leave empty to disable SWE |
| `-swe-coeff` | `0.1` | Semantic loss weight; higher means stronger constraints |
| `-swe-hinge` | `0.0` | Hinge loss margin |
| `-swe-decay` | `0.0` | Weight decay coefficient (L2 regularization) |
| `-swe-addtime` | `0.0` | Start applying constraints after this percentage of training progress |
An implementation of Semantic Word Embedding based on the ACL-2015 paper, which enhances word vector quality by introducing ordinal knowledge constraints.
Each line contains four space-separated words, expressing sim(A, B) > sim(C, D):

```
question answer eat fast
user person website eat
like good website technology
```

Lines starting with `#` are comments. See `data/synonym_constraints.txt` for a complete example.
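
A hedged sketch of parsing this constraint format (type and function names are illustrative, not the repo's actual API):

```go
// Sketch of a reader for the constraint file format above: four
// space-separated words per line meaning sim(A, B) > sim(C, D), with
// '#' comment lines skipped.
package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

type constraint struct {
	A, B, C, D string // asserts sim(A, B) > sim(C, D)
}

func loadConstraints(path string) ([]constraint, error) {
	f, err := os.Open(path)
	if err != nil {
		return nil, err
	}
	defer f.Close()

	var cs []constraint
	sc := bufio.NewScanner(f)
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if line == "" || strings.HasPrefix(line, "#") {
			continue // skip blanks and comments
		}
		w := strings.Fields(line)
		if len(w) != 4 {
			continue // skip malformed lines
		}
		cs = append(cs, constraint{w[0], w[1], w[2], w[3]})
	}
	return cs, sc.Err()
}

func main() {
	cs, err := loadConstraints("data/synonym_constraints.txt")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Printf("loaded %d constraints\n", len(cs))
}
```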
On the Zhihu corpus, the constraint satisfaction rate improved from 37% to 91%:

```
SWE initial: hinge_loss=3.7730 satisfy_rate=0.3714
SWE final:   hinge_loss=0.1785 satisfy_rate=0.9143
```
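
For intuition on those two numbers: a constraint (A, B, C, D) is satisfied when sim(A, B) > sim(C, D), and a common hinge formulation penalizes each constraint by max(0, margin − (sim(A, B) − sim(C, D))), with the margin set by `-swe-hinge`. A hedged sketch, not necessarily this repo's exact formulation:

```go
// Sketch of evaluating SWE constraints with a common hinge loss:
// max(0, margin - (sim(A,B) - sim(C,D))). Mirrors the log line above
// in spirit; the repo's training-time loss may differ in detail.
package main

import (
	"fmt"
	"math"
)

type constraint struct{ A, B, C, D string }

// cosine returns the cosine similarity of two equal-length vectors.
func cosine(a, b []float32) float64 {
	var dot, na, nb float64
	for i := range a {
		dot += float64(a[i]) * float64(b[i])
		na += float64(a[i]) * float64(a[i])
		nb += float64(b[i]) * float64(b[i])
	}
	if na == 0 || nb == 0 {
		return 0
	}
	return dot / (math.Sqrt(na) * math.Sqrt(nb))
}

func evaluate(vecs map[string][]float32, cs []constraint, margin float64) (hingeLoss, satisfyRate float64) {
	satisfied := 0
	for _, c := range cs {
		gap := cosine(vecs[c.A], vecs[c.B]) - cosine(vecs[c.C], vecs[c.D])
		if gap > 0 {
			satisfied++
		}
		hingeLoss += math.Max(0, margin-gap)
	}
	if len(cs) > 0 {
		satisfyRate = float64(satisfied) / float64(len(cs))
	}
	return hingeLoss, satisfyRate
}

func main() {
	// Toy vectors for demonstration only.
	vecs := map[string][]float32{
		"question": {0.9, 0.1}, "answer": {0.8, 0.2},
		"eat": {0.1, 0.9}, "fast": {0.9, 0.05},
	}
	cs := []constraint{{"question", "answer", "eat", "fast"}}
	loss, rate := evaluate(vecs, cs, 0.0)
	fmt.Printf("hinge_loss=%.4f satisfy_rate=%.4f\n", loss, rate)
}
```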
```
├── train.go                     # Training entry point
├── knn.go                       # Query entry point (interactive KNN)
├── control                      # Build/deploy script
├── doc2vec/
│   ├── doc2vec.go               # Core algorithms (training, inference, queries)
│   ├── swe.go                   # SWE synonym constraint implementation
│   └── wiretypes.go             # Data structures and interface definitions
├── corpus/                      # Corpus management (vocabulary, Huffman tree, document index)
├── neuralnet/                   # Neural network layer (vector operations, weight matrices)
├── common/                      # Common utility functions
├── segmenter/                   # Chinese word segmentation (jiebago wrapper)
├── conf/                        # Jieba segmentation dictionaries
├── data/
│   ├── zhihu_data.1w            # Sample corpus (1000 Zhihu Q&A entries)
│   └── synonym_constraints.txt  # Sample synonym constraints
├── interface/                   # Thrift IDL definitions
├── SWE_Train.c                  # Reference: C implementation of SWE (standalone)
└── codewiki.md                  # Detailed code documentation
```
For full architecture details, algorithm explanations, and data structure documentation, see codewiki.md.
| Library | Purpose |
|---|---|
| tinylib/msgp | MessagePack model serialization |
| wangbin/jiebago | Chinese word segmentation (used during queries) |
| astaxie/beego/logs | Logging |
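
As a hedged sketch of the msgp workflow: the `msgp` tool generates `EncodeMsg`/`DecodeMsg` methods for annotated structs via `go generate`, which the runtime helpers `msgp.Encode`/`msgp.Decode` then use. The `Model` struct below is hypothetical, not this repository's actual model layout, and the code compiles only after code generation:

```go
// Hypothetical model type showing tinylib/msgp serialization; run
// `go generate` to produce the EncodeMsg/DecodeMsg methods it relies on.
package model

//go:generate msgp

import (
	"os"

	"github.com/tinylib/msgp/msgp"
)

// Model is an illustrative stand-in for the trained embedding tables.
type Model struct {
	Dim   int                  `msg:"dim"`
	Words map[string][]float32 `msg:"words"`
}

// Save writes the model to disk in MessagePack format.
func (m *Model) Save(path string) error {
	f, err := os.Create(path)
	if err != nil {
		return err
	}
	defer f.Close()
	return msgp.Encode(f, m)
}

// Load reads a model back from disk.
func Load(path string) (*Model, error) {
	f, err := os.Open(path)
	if err != nil {
		return nil, err
	}
	defer f.Close()
	var m Model
	if err := msgp.Decode(f, &m); err != nil {
		return nil, err
	}
	return &m, nil
}
```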
- Mikolov et al., Efficient Estimation of Word Representations in Vector Space (2013)
- Le & Mikolov, Distributed Representations of Sentences and Documents (2014)
- Dai et al., Document Embedding with Paragraph Vectors (2015)
- Liu et al., Learning Semantic Word Embeddings based on Ordinal Knowledge Constraints (ACL 2015)
- Google word2vec (C)
- hiyijian/doc2vec (C++)
- iunderstand/SWE (C)