Choosing a definition of 'word' #169

@xfq

The concept of a "word" is difficult to define. It usually refers to a grammatical unit smaller than a phrase and containing one or more syllables.

Word separators differ across languages, and specs should not assume that words are always separated by spaces. Even for the same language, ancient and modern usage may differ.

I think there are two places where related guidelines could be added: one is 6.1 Choosing text units for segmentation, indexing, etc., and the other is 9. Typographic support (possibly in 9.9 Miscellaneous).

Here are some examples:

In Arabic, short words like "and" (و) can be written directly next to the following word without a space (e.g., الجامعات والكليات means "universities and colleges", but the text contains only one space). In typesetting, these words can be treated as part of the word they are attached to.

Many scripts, such as Balinese, Batak, Tai Lue, and Khmer, do not have word separators, and the definition of a word is subjective. Spaces may appear in these languages, but they may be phrase separators rather than word separators.

Also, in Vietnamese written with the Latin alphabet and in Fraser script, spaces are used to separate syllables, not words.

In scripts like Chinese, Japanese, and Tibetan, spaces are not used at all (with a few exceptions, such as textbooks for foreigners).
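
To make the point concrete, here is a minimal Python sketch of what goes wrong when "word" is taken to mean "whitespace-delimited token". The helper name `whitespace_tokens` is hypothetical, the Arabic string is the one quoted above, and the Chinese and Vietnamese strings are my own illustrative samples, not taken from any spec.

```python
def whitespace_tokens(text: str) -> list[str]:
    """Naive segmentation: split on runs of whitespace (hypothetical helper)."""
    return text.split()


# Arabic phrase quoted above: "universities and colleges".
# Three words in translation, but only one space in the text.
arabic = "الجامعات والكليات"

# Illustrative sample (assumption, not from the issue): a Chinese sentence
# written without any spaces.
chinese = "我喜欢学习中文"

# Illustrative sample (assumption): the name "Việt Nam", where the space
# separates syllables rather than words.
vietnamese = "Việt Nam"

arabic_tokens = whitespace_tokens(arabic)
chinese_tokens = whitespace_tokens(chinese)
vietnamese_tokens = whitespace_tokens(vietnamese)

for label, tokens in [
    ("Arabic", arabic_tokens),
    ("Chinese", chinese_tokens),
    ("Vietnamese", vietnamese_tokens),
]:
    # The token count reflects spaces in the text, not linguistic words.
    print(f"{label}: {len(tokens)} whitespace token(s)")
```

The counts come out as 2, 1, and 2: the split under-segments Arabic and Chinese and over-segments Vietnamese. A spec that needs word boundaries would probably have to defer to language-aware segmentation rather than to spaces.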
