Commit 98f2700

Add warnings / messages about long texts / wrong doc_id's as I have been bitten myself by it
1 parent 28b7a0f commit 98f2700

3 files changed

Lines changed: 16 additions & 2 deletions

NEWS.md

Lines changed: 1 addition & 0 deletions
@@ -1,6 +1,7 @@
 ## CHANGES IN doc2vec VERSION 0.2.1
 
 - Make sure words are only 100 characters when getting embeddings of documents (issue #20)
+- Limit documents to 1000 words by explicitely keeping only the first 1000 words per document + provide warning if doc_id contains spaces
 
 ## CHANGES IN doc2vec VERSION 0.2.0
 
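The NEWS entry above describes two input constraints: at most 1000 space- or tab-separated words per document, and no spaces in `doc_id`. A minimal sketch of how a caller might prepare a data.frame to satisfy both before training is given below. The helpers `truncate_words` and `clean_doc_id` are hypothetical names introduced for illustration and are not part of doc2vec; replacing whitespace by an underscore is an assumption about what a sensible identifier looks like.

```r
# Hypothetical helper (not part of doc2vec): keep only the first `max_words`
# space/tab separated tokens of each text.
truncate_words <- function(text, max_words = 1000L) {
  tokens <- strsplit(text, split = "[ \t]")
  vapply(tokens,
         FUN = function(w) paste(utils::head(w, n = max_words), collapse = " "),
         FUN.VALUE = character(1))
}

# Hypothetical helper (not part of doc2vec): replace runs of spaces/tabs
# in identifiers by a single underscore.
clean_doc_id <- function(doc_id) {
  gsub("[ \t]+", "_", doc_id)
}

# Toy data; a small limit of 3 words is used so the effect is visible.
x <- data.frame(doc_id = c("doc 1", "doc2"),
                text   = c("a b c d e", "f g"),
                stringsAsFactors = FALSE)
x$doc_id <- clean_doc_id(x$doc_id)
x$text   <- truncate_words(x$text, max_words = 3L)
```

Note that rewriting identifiers this way can make two previously distinct `doc_id` values collide, so checking `anyDuplicated(x$doc_id)` afterwards is a reasonable precaution.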

R/paragraph2vec.R

Lines changed: 13 additions & 1 deletion
@@ -6,7 +6,8 @@
 #' where an additional vector for every paragraph is added directly in the training.
 #' @param x a data.frame with columns doc_id and text or the path to the file on disk containing training data.\cr
 #' Note that the text column should be of type character, should contain less than 1000 words where space or tab is
-#' used as a word separator and that the text should not contain newline characters as these are considered document delimiters.
+#' used as a word separator and that the text should not contain newline characters as these are considered document delimiters.\cr
+#' The doc_id should not contain spaces.
 #' @param type character string with the type of algorithm to use, either one of
 #' \itemize{
 #' \item{'PV-DM': Distributed Memory paragraph vectors}
@@ -99,6 +100,17 @@ paragraph2vec <- function(x,
         file_train <- x
     }else{
         stopifnot(is.data.frame(x) && all(c("doc_id", "text") %in% colnames(x)))
+        nwords <- txt_count_words(x$text, pattern = "[ \t]")
+        idx <- which(nwords >= 1000)
+        if(length(idx) > 0){
+            message(sprintf("Note: there are texts which are longer than 1000 words, for these we will take only the first 1000 words, example doc_id: %s", x$doc_id[sample(idx, size = 1)]))
+            x$text[idx] <- sapply(strsplit(x$text[idx], split = "[ \t]"), FUN = function(x) paste(head(x, n = 1000), collapse = " "))
+        }
+        idx <- grepl(x$doc_id, pattern = "[ \t]")
+        idx <- which(idx)
+        if(length(idx) > 0){
+            warning(sprintf("There are doc_id's containing spaces, make sure your doc_id has no spaces otherwise the doc_id will be everything before the space and the remainder will be a word which is considered part of the document, e.g look at doc_id: %s", x$doc_id[sample(idx, size = 1)]))
+        }
         file_train <- tempfile(pattern = "textspace_", fileext = ".txt")
         on.exit({
             if (file.exists(file_train)) file.remove(file_train)
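
One caveat when reading the `sample(idx, size = 1)` calls in this hunk: in base R, `sample()` called on a single number `n >= 1` draws from `1:n` rather than returning `n` itself. When exactly one row is affected, the example `doc_id` printed in the message or warning may therefore belong to a different row. A minimal sketch of an alternative that indexes by position is shown below; `pick_one` is a hypothetical helper introduced for illustration and is not part of the commit.

```r
# Hypothetical helper (not part of doc2vec): pick one element of `idx`
# uniformly at random, also when `idx` has length one.
pick_one <- function(idx) {
  idx[sample.int(length(idx), size = 1L)]
}

# Toy identifiers: only the fourth one contains a space.
doc_id <- c("a", "b", "c", "d e")
idx <- which(grepl(pattern = "[ \t]", doc_id))

# Always refers to an offending row, here the only one.
example_id <- doc_id[pick_one(idx)]
```

With `sample(idx, size = 1)` and `idx` equal to `4`, the same lookup could return any of the four identifiers.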

man/paragraph2vec.Rd

Lines changed: 2 additions & 1 deletion
Some generated files are not rendered by default.
