Measure vocabulary size and diversity

Paste any text and get three figures: total words, unique words (after lowercasing), and lexical diversity as a percentage. Lexical diversity is the type-token ratio: unique / total * 100. The transform runs in your browser; nothing uploads. For the unique words themselves, see find unique words.


Type-token ratio at a glance

The total-word count uses the regex \b[a-z']+\b on the lowercased input. Letters and apostrophes are inside words; hyphens, digits, punctuation, and whitespace are boundaries. The unique count is the size of the resulting Set. Lexical diversity is unique divided by total, expressed as a percentage to one decimal.
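A minimal JavaScript sketch of this computation (the function name is illustrative, not the tool's actual source):

```javascript
// Count total tokens, unique types, and type-token ratio:
// lowercase first, then match runs of letters and apostrophes.
function lexicalStats(text) {
  const tokens = text.toLowerCase().match(/\b[a-z']+\b/g) || [];
  const total = tokens.length;
  const unique = new Set(tokens).size;
  const diversity = total === 0 ? 0 : Number(((unique / total) * 100).toFixed(1));
  return { total, unique, diversity };
}

console.log(lexicalStats("The cat sat on the mat."));
// { total: 6, unique: 5, diversity: 83.3 }
```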

Lexical diversity is sometimes called the type-token ratio (TTR). A ratio near 100% means almost every word is unique (very short or very varied input). A ratio near 0% means heavy repetition. For natural prose, the ratio falls as the input grows, because common function words (the, of, and) repeat. A 1,000-word essay might score 40 to 50%; a 10,000-word novel chapter might score 25 to 30%.
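The length effect is easy to demonstrate: repeating the same sentence adds tokens but no new types, so the ratio drops. A quick sketch (the helper name is illustrative):

```javascript
// Type-token ratio as a percentage; 0 for empty input.
const ttr = (text) => {
  const tokens = text.toLowerCase().match(/\b[a-z']+\b/g) || [];
  return tokens.length ? (new Set(tokens).size / tokens.length) * 100 : 0;
};

const sentence = "the quick brown fox jumps over the lazy dog";
console.log(ttr(sentence).toFixed(1));                   // "88.9" (8 types / 9 tokens)
console.log(ttr((sentence + " ").repeat(4)).toFixed(1)); // "22.2" (still 8 types / 36 tokens)
```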

Because the tokenizer here is letters-only (no digits), tokens like 2024 and 3rd are excluded. For a digit-aware unique count, see find unique words. For the same lexical diversity figure inside a fuller stats block, see text statistics.

How to use measure vocabulary size and diversity

  1. Paste or type your text into the input panel on the left.
  2. The three lines (total, unique, diversity) appear in the output panel as you type.
  3. Read the diversity percentage as a measure of repetition: lower means more repetition.
  4. Click Copy in the output header to copy the report.
  5. Use find unique words if you also want the list itself.

Keyboard shortcuts

Drive TextResult without touching the mouse.

Shortcut            Action
Ctrl F              Open the find & replace panel inside the input (Plus)
Ctrl Z              Undo the last input change
Ctrl Shift Z        Redo
Ctrl Shift Enter    Toggle fullscreen focus on the editor (Plus)
Esc                 Close find & replace, or exit fullscreen
Ctrl K              Open the command palette to jump to any tool (Plus)
Ctrl S              Save current workflow draft (Plus)
Ctrl P              Run a saved workflow (Plus)

What this tool actually does

Tokenizer is \b[a-z']+\b on lowercased input

Letters and apostrophes only. The lowercase step folds The and the into one type. Digits are not part of the alphabet here, so number-only tokens are excluded.

Three-line output

Total words (every match), unique words (set size), lexical diversity (unique / total * 100, one decimal place). Each on its own line, plain text, easy to paste.
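A sketch of how the three-line report could be assembled (the function name is hypothetical, and the zero guard reflects the empty-input behavior described in this section):

```javascript
function vocabularyReport(text) {
  const tokens = text.toLowerCase().match(/\b[a-z']+\b/g) || [];
  const total = tokens.length;
  const unique = new Set(tokens).size;
  // Guard against divide-by-zero: empty input reports 0.0%.
  const pct = total === 0 ? "0.0" : ((unique / total) * 100).toFixed(1);
  return `Total words: ${total}\nUnique words: ${unique}\nLexical diversity: ${pct}%`;
}

console.log(vocabularyReport("a a b"));
// Total words: 3
// Unique words: 2
// Lexical diversity: 66.7%
```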

Lexical diversity is a percentage

The type-token ratio is (unique / total) * 100 rounded to one decimal place. A ratio of 100% means every word is unique; 50% means half are repeats; 0% is mathematically impossible for non-empty input.

Hyphenated tokens are split; digit-bearing tokens are excluded

state-of-the-art splits on the hyphen into four words (state, of, the, art). 2024 contains no letters and is excluded entirely. iPhone7 is also excluded: letters and digits are both word characters, so there is no \b boundary between iphone and 7, and the pattern never matches.
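These edge cases can be checked directly against the regex in a browser console:

```javascript
// Same tokenizer as the tool describes: lowercase, then \b[a-z']+\b.
const tokenize = (s) => s.toLowerCase().match(/\b[a-z']+\b/g) || [];

console.log(tokenize("state-of-the-art")); // [ 'state', 'of', 'the', 'art' ]
console.log(tokenize("2024"));             // [] (no letters at all)
console.log(tokenize("iPhone7"));          // [] (no \b between the letters and the digit)
console.log(tokenize("don't"));            // [ "don't" ] (the apostrophe stays inside)
```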

Empty input reports zeros and 0.0%

When the regex finds nothing the report is Total words: 0, Unique words: 0, Lexical diversity: 0.0%. No divide-by-zero crash.

Worked example

Thirteen total words after lowercasing; the appears three times, and quick and fox twice each, leaving 9 unique types. Diversity = 9 / 13 ≈ 69.2%.

Input
the quick brown fox jumps over the lazy dog. the fox is quick.
Output
Total words: 13
Unique words: 9
Lexical diversity: 69.2%
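The worked example can be reproduced directly with the stated regex:

```javascript
const text = "the quick brown fox jumps over the lazy dog. the fox is quick.";
const tokens = text.toLowerCase().match(/\b[a-z']+\b/g) || [];
const unique = new Set(tokens).size;

console.log(tokens.length);                                // 13
console.log(unique);                                       // 9
console.log(((unique / tokens.length) * 100).toFixed(1));  // "69.2"
```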

Settings reference

Output line             Meaning
Total words: N          Every match of \b[a-z']+\b after lowercasing.
Unique words: M         Size of the Set of lowercased words.
Lexical diversity: P%   (unique / total) * 100, rounded to 1 decimal place.
Hyphenated word         Split on the hyphen into separate tokens.
Number-only token       Excluded (no letters).
Empty input             All three lines report 0 / 0.0%.

FAQ

What is lexical diversity?
The type-token ratio: unique words divided by total words, here as a percentage. It measures repetition. Low values mean lots of repeats; high values mean almost every word is fresh.
Why is the ratio lower for longer texts?
Because common function words (the, of, and) keep repeating as the input grows, while genuinely new vocabulary arrives more and more slowly. This is normal; comparing diversity scores across texts of very different lengths is misleading.
Are The and the counted as one type?
Yes. The input is lowercased before the regex runs, so case never affects the unique count.
Why is 2024 not counted?
The tokenizer is [a-z']+ (letters and apostrophes only). Digit-only tokens have no letters and are skipped. For a digit-aware unique count, use find unique words.
Is the input sent anywhere?
No. The computation runs in your browser. Nothing uploads, nothing is logged.