Measure vocabulary size and diversity

Paste any text and get three figures: total words, unique words (after lowercasing), and lexical diversity as a percentage. Lexical diversity is the type-token ratio: unique / total * 100. The transform runs in your browser; nothing uploads. For the unique words themselves, see find unique words.


Type-token ratio at a glance

The total-word count uses the regex \b[a-z']+\b on the lowercased input. Letters and apostrophes are inside words; hyphens, digits, punctuation, and whitespace are boundaries. The unique count is the size of the resulting Set. Lexical diversity is unique divided by total, expressed as a percentage to one decimal.
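A minimal JavaScript sketch of this computation (the function name is illustrative, not the tool's actual source):

```javascript
// Count total tokens, unique types, and type-token ratio:
// lowercase first, then match runs of letters and apostrophes.
function lexicalStats(text) {
  const tokens = text.toLowerCase().match(/\b[a-z']+\b/g) || [];
  const total = tokens.length;
  const unique = new Set(tokens).size;
  const diversity = total === 0 ? 0 : Number(((unique / total) * 100).toFixed(1));
  return { total, unique, diversity };
}

console.log(lexicalStats("The cat sat on the mat."));
// { total: 6, unique: 5, diversity: 83.3 }
```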

Lexical diversity is sometimes called the type-token ratio (TTR). A ratio near 100% means almost every word is unique (very short or very varied input). A ratio near 0% means heavy repetition. For natural prose, the ratio falls as the input grows, because common function words (the, of, and) repeat. A 1,000-word essay might score 40 to 50%; a 10,000-word novel chapter might score 25 to 30%.
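The length effect is easy to demonstrate: repeating the same sentence adds tokens but no new types, so the ratio drops. A quick sketch (the helper name is illustrative):

```javascript
// Type-token ratio as a percentage; 0 for empty input.
const ttr = (text) => {
  const tokens = text.toLowerCase().match(/\b[a-z']+\b/g) || [];
  return tokens.length ? (new Set(tokens).size / tokens.length) * 100 : 0;
};

const sentence = "the quick brown fox jumps over the lazy dog";
console.log(ttr(sentence).toFixed(1));                   // "88.9" (8 types / 9 tokens)
console.log(ttr((sentence + " ").repeat(4)).toFixed(1)); // "22.2" (still 8 types / 36 tokens)
```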

Because the tokenizer here is letters-only (no digits), tokens like 2024 and 3rd are excluded. For a digit-aware unique count, see find unique words. For the same lexical diversity figure inside a fuller stats block, see text statistics.

How to use measure vocabulary size and diversity

  1. Paste or type your text into the input panel on the left.
  2. The three lines (total, unique, diversity) appear in the output panel as you type.
  3. Read the diversity percentage as a measure of repetition: lower means more repetition.
  4. Click Copy in the output header to copy the report.
  5. Use find unique words if you also want the list itself.

Keyboard shortcuts

Drive TextResult without touching the mouse.

Shortcut            Action
Ctrl F              Open the find & replace panel inside the input (Plus)
Ctrl Z              Undo the last input change
Ctrl Shift Z        Redo
Ctrl Shift Enter    Toggle fullscreen focus on the editor (Plus)
Esc                 Close find & replace, or exit fullscreen
Ctrl K              Open the command palette to jump to any tool (Plus)
Ctrl S              Save current workflow draft (Plus)
Ctrl P              Run a saved workflow (Plus)

What this tool actually does

Tokenizer is \b[a-z']+\b on lowercased input

Letters and apostrophes only. The lowercase step folds The and the into one type. Digits are not part of the alphabet here, so number-only tokens are excluded.

Three-line output

Total words (every match), unique words (set size), lexical diversity (unique / total * 100, one decimal place). Each on its own line, plain text, easy to paste.
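A sketch of how the three-line report could be assembled (the function name is hypothetical, and the zero guard reflects the empty-input behavior described in this section):

```javascript
function vocabularyReport(text) {
  const tokens = text.toLowerCase().match(/\b[a-z']+\b/g) || [];
  const total = tokens.length;
  const unique = new Set(tokens).size;
  // Guard against divide-by-zero: empty input reports 0.0%.
  const pct = total === 0 ? "0.0" : ((unique / total) * 100).toFixed(1);
  return `Total words: ${total}\nUnique words: ${unique}\nLexical diversity: ${pct}%`;
}

console.log(vocabularyReport("a a b"));
// Total words: 3
// Unique words: 2
// Lexical diversity: 66.7%
```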

Lexical diversity is a percentage

The type-token ratio is (unique / total) * 100 rounded to one decimal place. A ratio of 100% means every word is unique; 50% means half are repeats; 0% is mathematically impossible for non-empty input.

Hyphenated tokens are split; digit-bearing tokens are excluded

state-of-the-art splits on the hyphen into four words (state, of, the, art). 2024 contains no letters and is excluded entirely. iPhone7 is also excluded: letters and digits are both word characters, so there is no \b boundary between iphone and 7, and the pattern never matches.
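These edge cases can be checked directly against the regex in a browser console:

```javascript
// Same tokenizer as the tool describes: lowercase, then \b[a-z']+\b.
const tokenize = (s) => s.toLowerCase().match(/\b[a-z']+\b/g) || [];

console.log(tokenize("state-of-the-art")); // [ 'state', 'of', 'the', 'art' ]
console.log(tokenize("2024"));             // [] (no letters at all)
console.log(tokenize("iPhone7"));          // [] (no \b between the letters and the digit)
console.log(tokenize("don't"));            // [ "don't" ] (the apostrophe stays inside)
```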

Empty input reports zeros and 0.0%

When the regex finds nothing the report is Total words: 0, Unique words: 0, Lexical diversity: 0.0%. No divide-by-zero crash.

Worked example

Thirteen total words after lowercasing; the appears three times, and quick and fox twice each, leaving 9 unique types. Diversity = 9 / 13 ≈ 69.2%.

Input
the quick brown fox jumps over the lazy dog. the fox is quick.
Output
Total words: 13
Unique words: 9
Lexical diversity: 69.2%
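The worked example can be reproduced directly with the stated regex:

```javascript
const text = "the quick brown fox jumps over the lazy dog. the fox is quick.";
const tokens = text.toLowerCase().match(/\b[a-z']+\b/g) || [];
const unique = new Set(tokens).size;

console.log(tokens.length);                                // 13
console.log(unique);                                       // 9
console.log(((unique / tokens.length) * 100).toFixed(1));  // "69.2"
```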

Settings reference

Output line             Meaning
Total words: N          Every match of \b[a-z']+\b after lowercasing.
Unique words: M         Size of the Set of lowercased words.
Lexical diversity: P%   (unique / total) * 100, rounded to 1 decimal place.
Hyphenated word         Split on the hyphen into separate tokens.
Number-only token       Excluded (no letters).
Empty input             All three lines report 0 / 0.0%.

FAQ

What is lexical diversity?
The type-token ratio: unique words divided by total words, here as a percentage. It measures repetition. Low values mean lots of repeats; high values mean almost every word is fresh.
Why is the ratio lower for longer texts?
Because common function words (the, of, and) keep repeating as the input grows, while genuinely new vocabulary arrives more and more slowly. This is normal; comparing diversity scores across texts of very different lengths is misleading.
Are The and the counted as one type?
Yes. The input is lowercased before the regex runs, so case never affects the unique count.
Why is 2024 not counted?
The tokenizer is [a-z']+ (letters and apostrophes only). Digit-only tokens have no letters and are skipped. For a digit-aware unique count, use find unique words.
Is the input sent anywhere?
No. The computation runs in your browser. Nothing uploads, nothing is logged.