Type-token ratio at a glance
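The total-word count is the number of matches of the regex \b[a-z']+\b on the lowercased input. Letters and apostrophes are inside words; hyphens, punctuation, and whitespace are boundaries. Digits are neither: \b treats them as word characters, so a run of letters that touches a digit (3rd) is not matched at all. The unique count is the size of the Set built from the matches. Lexical diversity is unique divided by total, expressed as a percentage to one decimal place.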
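The counting can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the tool's own code: the name `count_words` is hypothetical, and the `re.ASCII` flag is assumed so that `\b` treats only ASCII letters, digits, and underscore as word characters.

```python
import re
from typing import Tuple

# Pattern quoted in the text; re.ASCII is an assumption (see the lead-in).
WORD_RE = re.compile(r"\b[a-z']+\b", re.ASCII)

def count_words(text: str) -> Tuple[int, int, float]:
    """Return (total, unique, diversity percentage) for the given text."""
    words = WORD_RE.findall(text.lower())   # every match counts toward the total
    total = len(words)
    unique = len(set(words))                # distinct lowercased words
    # Rounding here uses Python's round(); the tool's rounding mode may differ.
    diversity = round(unique / total * 100, 1) if total else 0.0
    return total, unique, diversity

print(count_words("The cat and the hat"))   # (5, 4, 80.0)
```

The and the fold into one type after lowercasing, which is why five tokens yield four unique words.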
Lexical diversity is sometimes called the type-token ratio (TTR). A ratio near 100% means almost every word is unique (very short or very varied input). A ratio near 0% means heavy repetition. For natural prose, the ratio falls as the input grows, because common function words (the, of, and) repeat. A 1,000-word essay might score 40 to 50%; a 10,000-word novel chapter might score 25 to 30%.
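One way to see the length effect: repeating a passage adds tokens but no new types, so the ratio drops even though the vocabulary is unchanged. The Python sketch below illustrates this; the helper `ttr` is hypothetical and the `re.ASCII` flag is an assumption about how the word boundary should behave.

```python
import re

WORD_RE = re.compile(r"\b[a-z']+\b", re.ASCII)

def ttr(text: str) -> float:
    """Type-token ratio as a percentage, rounded to one decimal place."""
    words = WORD_RE.findall(text.lower())
    return round(len(set(words)) / len(words) * 100, 1) if words else 0.0

passage = "the cat sat on the mat and the dog sat by the door "

print(ttr(passage))        # one copy: 9 types over 13 tokens
print(ttr(passage * 4))    # four copies: still 9 types, now over 52 tokens
```

The second figure is a quarter of the unrounded first one, purely because of length. That is the reason scores from texts of different sizes should not be compared directly.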
Because the tokenizer here is letters-only (no digits), tokens like 2024 and 3rd are excluded. For a digit-aware unique count, see find unique words. For the same lexical diversity figure inside a fuller stats block, see text statistics.
How to use measure vocabulary size and diversity
1. Paste or type your text into the input panel on the left.
2. The three lines (total, unique, diversity) appear in the output panel as you type.
3. Read the diversity percentage as a measure of repetition: lower means more repetition.
4. Click Copy in the output header to copy the report.
5. Use find unique words if you also want the list itself.
Keyboard shortcuts
Drive TextResult without touching the mouse.
| Shortcut | Action |
|---|---|
| Ctrl F | Open the find & replace panel inside the input (Plus) |
| Ctrl Z | Undo the last input change |
| Ctrl Shift Z | Redo |
| Ctrl Shift Enter | Toggle fullscreen focus on the editor (Plus) |
| Esc | Close find & replace, or exit fullscreen |
| Ctrl K | Open the command palette to jump to any tool (Plus) |
| Ctrl S | Save current workflow draft (Plus) |
| Ctrl P | Run a saved workflow (Plus) |
What this tool actually does
Tokenizer is \b[a-z']+\b on lowercased input
Letters and apostrophes only. The lowercase step folds The and the into one type. Digits are not part of the alphabet here, so number-only tokens are excluded.
Three-line output
Total words (every match), unique words (set size), lexical diversity (unique / total * 100, one decimal place). Each on its own line, plain text, easy to paste.
Lexical diversity is a percentage
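The type-token ratio is (unique / total) * 100 rounded to one decimal place. A ratio of 100% means every word is unique; 50% means there are half as many unique words as total words. An exact 0% cannot occur once the input contains at least one word, because unique is then at least 1, although a long and very repetitive text can still round down to 0.0%.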
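Hyphenated tokens are split; digit-bearing tokens are excluded
state-of-the-art splits on the hyphens into four words. 2024 contains no letters and is excluded entirely. iPhone7 is excluded as well: the 7 is a word character for \b, so there is no word boundary after iphone and the pattern does not match, just as with 3rd.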
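The hyphen and digit-only cases can be checked with a few lines of Python. This sketch assumes Python's `re` module with the `re.ASCII` flag behaves like the tool's regex engine for these inputs; `tokens` is a hypothetical helper, not part of the tool.

```python
import re

WORD_RE = re.compile(r"\b[a-z']+\b", re.ASCII)

def tokens(text: str) -> list:
    """Lowercase the text and return every match of the word pattern."""
    return WORD_RE.findall(text.lower())

print(tokens("state-of-the-art"))   # ['state', 'of', 'the', 'art']
print(tokens("2024"))               # []
print(tokens("Don't stop"))         # ["don't", 'stop']
```

The apostrophe stays inside don't because it is part of the character class, while the hyphen is not and therefore separates tokens.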
Empty input reports zeros and 0.0%
When the regex finds nothing the report is Total words: 0, Unique words: 0, Lexical diversity: 0.0%. No divide-by-zero crash.
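A minimal sketch of the three-line report with the empty-input guard, written in Python. The function name `report` is hypothetical and the `re.ASCII` flag is an assumption; the line labels are the ones quoted above.

```python
import re

WORD_RE = re.compile(r"\b[a-z']+\b", re.ASCII)

def report(text: str) -> str:
    """Build the three-line plain-text report described above."""
    words = WORD_RE.findall(text.lower())
    total = len(words)
    unique = len(set(words))
    # Guard the division: with no matches, report 0.0% instead of dividing by zero.
    diversity = unique / total * 100 if total else 0.0
    return (
        f"Total words: {total}\n"
        f"Unique words: {unique}\n"
        f"Lexical diversity: {diversity:.1f}%"
    )

print(report(""))                      # all three lines report zero
print(report("to be or not to be"))    # 6 tokens, 4 types
```

Checking `total` before dividing keeps the empty case on the same code path as every other input, so no special-case output is needed.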
Worked example
Thirteen total words after lowercasing; the appears 3 times and quick and fox twice each, leaving 9 unique types. Diversity = 9 / 13 = 69.2%.
the quick brown fox jumps over the lazy dog. the fox is quick.
Total words: 13
Unique words: 9
Lexical diversity: 69.2%
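To check the repeated words by hand, the per-word frequencies for the example input can be tallied with `collections.Counter`. This is a Python sketch under the same assumption as the pattern above (the `re.ASCII` flag); the variable names are illustrative.

```python
import re
from collections import Counter

WORD_RE = re.compile(r"\b[a-z']+\b", re.ASCII)

text = "the quick brown fox jumps over the lazy dog. the fox is quick."
freq = Counter(WORD_RE.findall(text.lower()))

# Words that occur more than once, most frequent first.
repeated = [(word, n) for word, n in freq.most_common() if n > 1]
print(repeated)     # [('the', 3), ('quick', 2), ('fox', 2)]
print(len(freq))    # 9 unique types
```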
Settings reference
| Output line | Meaning |
|---|---|
| Total words: N | Every match of \b[a-z']+\b after lowercasing. |
| Unique words: M | Size of the lowercased Set of words. |
| Lexical diversity: P% | (unique / total) * 100, rounded to 1 decimal. |
| Hyphenated word | Split on the hyphen into separate tokens. |
| Number-only token | Excluded (no letters). |
| Empty input | All three lines report 0 / 0.0%. |
FAQ
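What is lexical diversity?
The number of unique words divided by the total number of words, shown as a percentage. It is also called the type-token ratio (TTR).
Why is the ratio lower for longer texts?
Common function words (the, of, and) keep appearing as the input grows, while new vocabulary slows down. This is normal; comparing diversity across different lengths is misleading.
Are The and the counted as one type?
Yes. The input is lowercased before matching, so The and the fold into one type.
Why is 2024 not counted?
The tokenizer is [a-z']+ (letters and apostrophes only). Digit-only tokens have no letters and are skipped. For a digit-aware unique count, use find unique words.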