Tokenizer

How to run

To create the vocabulary and embeddings, run:

$ python3 create_vocab.py <embedding_size> <num_iterations>

Here, embedding_size is the dimensionality of the embeddings to generate (e.g. 128), and num_iterations is the number of iterations to run over the dictionary when generating the embeddings (e.g. 10).

This creates two files: vocab.txt and embeddings (a pickle dump).
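
As a rough illustration of how the two output files might be inspected afterwards, here is a minimal sketch. It assumes vocab.txt holds one token per line and that embeddings can be read with the standard pickle module; neither the file layout nor the type of the pickled object is documented here, so the helper names (load_vocab, load_embeddings, describe) are hypothetical and the summary is deliberately type-agnostic.

```python
import pickle


def load_vocab(path="vocab.txt"):
    """Read tokens from the vocabulary file.

    Assumption: one token per line; blank lines are skipped.
    """
    with open(path, encoding="utf-8") as fh:
        return [line.rstrip("\n") for line in fh if line.strip()]


def load_embeddings(path="embeddings"):
    """Unpickle the embeddings file.

    Assumption: the file is a plain pickle dump; the type of the
    stored object (dict, list, array, ...) is not specified here.
    """
    with open(path, "rb") as fh:
        return pickle.load(fh)


def describe(obj):
    """Return a short summary that works for any loaded object."""
    size = len(obj) if hasattr(obj, "__len__") else "n/a"
    return f"{type(obj).__name__} (len={size})"
```

For example, `print(describe(load_vocab()))` and `print(describe(load_embeddings()))` could be run from the repository root after create_vocab.py has finished. Only unpickle files you generated yourself, since loading a pickle can execute arbitrary code.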

To test the tokenizer, run:

$ python3 run_tokenizer.py <path/to/file/to/tokenize>

This prints each line and its tokenization to the screen. Redirect the output to a file for better readability.
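
The redirection can also be done from Python, which may be convenient when tokenizing several files in a script. The sketch below is one possible way to do it with the standard subprocess module; run_and_save is a hypothetical helper, and tokenized.txt in the commented example is an assumed output name.

```python
import subprocess
import sys


def run_and_save(cmd, out_path):
    """Run a command and write its standard output to out_path.

    Returns the command's exit code so the caller can check for failure.
    """
    with open(out_path, "w", encoding="utf-8") as out:
        result = subprocess.run(cmd, stdout=out, check=False)
    return result.returncode


# Example call (the output file name is an assumption):
# run_and_save([sys.executable, "run_tokenizer.py", "example_text.txt"],
#              "tokenized.txt")
```

Using sys.executable runs the script with the same interpreter that is executing the helper, which avoids depending on how python3 resolves on the current PATH.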

References

Dictionary - https://github.com/tusharlock10/Dictionary/blob/master/data.7z
Lemma list - https://github.com/skywind3000/lemma.en
Suffixes - https://en.wiktionary.org/wiki/Appendix:English_suffixes
Prefixes - https://en.wikipedia.org/wiki/Prefix
