
doc2vec-golang

A Go implementation of Word2Vec / Doc2Vec (Paragraph Vector) for training word and document embeddings. Based on Tomas Mikolov's word2vec and doc2vec papers, with support for Semantic Word Embedding (SWE) synonym constraints.

Chinese documentation (中文文档)

Features

  • CBOW and Skip-Gram model architectures
  • Negative Sampling and Hierarchical Softmax optimization
  • Doc2Vec document embedding training (PV-DM / PV-DBOW)
  • Online inference for new document vectors
  • Semantic Word Embedding (SWE): ordinal knowledge constraints based on ACL-2015 paper
  • Similarity and inference queries: word2words / word2docs / doc2docs / doc2words / sen2words / sen2docs
  • Document likelihood estimation
  • Leave-one-out keyword extraction
  • Document similarity calculation (DocSimCal)
  • Efficient model serialization via MessagePack
  • Word Mover's Distance (WMD)

Quick Start

Requirements

  • Go >= 1.24

Build

go build -o train train.go
go build -o knn knn.go

Or use the included script:

./control build

Training

Training data format: one document per line, two TAB-separated columns (docid + space-tokenized text):

1	why does zhihu have some users with avatars and some without
2	the avatar feature is still in beta testing ...

Basic training (Skip-Gram + Negative Sampling):

./train data/zhihu_data.1w

Training with full parameters:

./train -corpus data/zhihu_data.1w \
        -dim 100 \
        -window 5 \
        -iters 50 \
        -neg \
        -output my.model

Training with synonym constraints (SWE):

./train -corpus data/zhihu_data.1w \
        -swe data/synonym_constraints.txt \
        -swe-coeff 0.1 \
        -output swe.model

Querying

./knn 2.model

Interactive operation selection:

please select operation type:
        0:word2words
        1:doc_likelihood
        2:leave one out key words
        3:sen2words
        4:sen2docs
        5:word2docs
        6:doc2docs
        7:doc2words
0
Enter text:网页
        1       网页
        0.78    不让
        0.77    浏览
        0.76    邮件
        ...

Training Parameters

| Flag | Default | Description |
|------|---------|-------------|
| `-corpus` | (required) | Training corpus file path (also accepts a positional argument) |
| `-output` | `2.model` | Output model file path |
| `-cbow` | `false` | Use CBOW model (default: Skip-Gram) |
| `-hs` | `false` | Use Hierarchical Softmax |
| `-neg` | `true` | Use Negative Sampling |
| `-window` | `5` | Context window size |
| `-dim` | `50` | Word/document embedding dimension |
| `-iters` | `50` | Number of training iterations |
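Parsing these flags can be sketched with Go's standard `flag` package. The identifiers below are illustrative, not necessarily the repo's actual ones; the defaults and the positional-argument fallback for `-corpus` follow the table above.

```go
package main

import (
	"flag"
	"fmt"
)

// trainConfig mirrors the training parameters table.
type trainConfig struct {
	Corpus, Output     string
	CBOW, HS, Neg      bool
	Window, Dim, Iters int
}

// parseTrainFlags parses arguments into a trainConfig, honoring the
// documented defaults and the fallback that -corpus may also be given
// as a positional argument.
func parseTrainFlags(args []string) (trainConfig, error) {
	var c trainConfig
	fs := flag.NewFlagSet("train", flag.ContinueOnError)
	fs.StringVar(&c.Corpus, "corpus", "", "training corpus file path")
	fs.StringVar(&c.Output, "output", "2.model", "output model file path")
	fs.BoolVar(&c.CBOW, "cbow", false, "use CBOW model (default: Skip-Gram)")
	fs.BoolVar(&c.HS, "hs", false, "use Hierarchical Softmax")
	fs.BoolVar(&c.Neg, "neg", true, "use Negative Sampling")
	fs.IntVar(&c.Window, "window", 5, "context window size")
	fs.IntVar(&c.Dim, "dim", 50, "embedding dimension")
	fs.IntVar(&c.Iters, "iters", 50, "training iterations")
	if err := fs.Parse(args); err != nil {
		return c, err
	}
	if c.Corpus == "" && fs.NArg() > 0 {
		c.Corpus = fs.Arg(0) // positional corpus path
	}
	return c, nil
}

func main() {
	c, err := parseTrainFlags([]string{"-dim", "100", "data/zhihu_data.1w"})
	if err != nil {
		panic(err)
	}
	fmt.Printf("%+v\n", c)
}
```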

SWE Synonym Constraint Parameters

| Flag | Default | Description |
|------|---------|-------------|
| `-swe` | (empty) | Semantic constraint file path; empty disables SWE |
| `-swe-coeff` | `0.1` | Semantic loss weight; higher means stronger constraints |
| `-swe-hinge` | `0.0` | Hinge loss margin |
| `-swe-decay` | `0.0` | Weight decay coefficient (L2 regularization) |
| `-swe-addtime` | `0.0` | Start applying constraints after this training progress (%) |

Semantic Word Embedding (SWE)

An implementation of Semantic Word Embedding based on the ACL 2015 paper "Learning Semantic Word Embeddings based on Ordinal Knowledge Constraints" (Liu et al.), which improves word vector quality by introducing ordinal knowledge constraints.

Constraint File Format

Each line contains 4 space-separated words, expressing sim(A, B) > sim(C, D):

question answer eat fast
user person website eat
like good website technology

Lines starting with # are comments. See data/synonym_constraints.txt for a complete example.
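Conceptually, each constraint sim(A, B) > sim(C, D) contributes a hinge loss max(0, margin − (sim(A, B) − sim(C, D))), which is zero when the ordering holds by at least the `-swe-hinge` margin. A minimal sketch in Go, assuming cosine similarity as the similarity function (one common choice; function names here are illustrative, not the repo's API):

```go
package main

import (
	"fmt"
	"math"
)

// cosine returns the cosine similarity of two equal-length vectors.
func cosine(a, b []float64) float64 {
	var dot, na, nb float64
	for i := range a {
		dot += a[i] * b[i]
		na += a[i] * a[i]
		nb += b[i] * b[i]
	}
	return dot / (math.Sqrt(na) * math.Sqrt(nb))
}

// hingeLoss scores one ordinal constraint sim(A,B) > sim(C,D).
// The loss is zero when the constraint is satisfied by at least
// `margin` (cf. -swe-hinge); a positive loss would drive an update.
func hingeLoss(a, b, c, d []float64, margin float64) float64 {
	return math.Max(0, margin-(cosine(a, b)-cosine(c, d)))
}

func main() {
	a := []float64{1, 0}
	b := []float64{0.9, 0.1}
	c := []float64{1, 0}
	d := []float64{0, 1}
	// Satisfied constraint: sim(a,b) > sim(c,d), so the loss is zero.
	fmt.Println(hingeLoss(a, b, c, d, 0.0))
}
```

The "satisfy rate" reported below is then simply the fraction of constraints whose hinge loss is zero.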

Results

On the Zhihu corpus, constraint satisfaction rate improved from 37% to 91%:

SWE initial: hinge_loss=3.7730 satisfy_rate=0.3714
SWE final:   hinge_loss=0.1785 satisfy_rate=0.9143

Project Structure

├── train.go              # Training entry point
├── knn.go                # Query entry point (interactive KNN)
├── control               # Build/deploy script
├── doc2vec/
│   ├── doc2vec.go        # Core algorithms (training, inference, queries)
│   ├── swe.go            # SWE synonym constraint implementation
│   └── wiretypes.go      # Data structures and interface definitions
├── corpus/               # Corpus management (vocabulary, Huffman tree, document index)
├── neuralnet/            # Neural network layer (vector operations, weight matrices)
├── common/               # Common utility functions
├── segmenter/            # Chinese word segmentation (jiebago wrapper)
├── conf/                 # Jieba segmentation dictionaries
├── data/
│   ├── zhihu_data.1w     # Sample corpus (1000 Zhihu Q&A entries)
│   └── synonym_constraints.txt  # Sample synonym constraints
├── interface/            # Thrift IDL definitions
├── SWE_Train.c           # Reference: C implementation of SWE (standalone)
└── codewiki.md           # Detailed code documentation

For full architecture details, algorithm explanations, and data structure documentation, see codewiki.md.

Dependencies

| Library | Purpose |
|---------|---------|
| `tinylib/msgp` | MessagePack model serialization |
| `wangbin/jiebago` | Chinese word segmentation (used during queries) |
| `astaxie/beego/logs` | Logging |

References

  • Mikolov et al., "Efficient Estimation of Word Representations in Vector Space" (2013)
  • Mikolov et al., "Distributed Representations of Words and Phrases and their Compositionality" (NIPS 2013)
  • Le & Mikolov, "Distributed Representations of Sentences and Documents" (ICML 2014)
  • Liu et al., "Learning Semantic Word Embeddings based on Ordinal Knowledge Constraints" (ACL 2015)
  • Kusner et al., "From Word Embeddings to Document Distances" (ICML 2015)

License

Apache License 2.0