Measure text similarity

Text similarity scores how alike two passages are using the Jaccard coefficient over word sets. Paste both texts into the input pane separated by a line of ---; the tool tokenises each side into lowercase words, builds two sets, and reports the intersection over the union as a percentage. The score is computed in your browser; nothing is uploaded. For exact line-by-line changes use diff.

Jaccard similarity in a single click

Jaccard similarity is a classic set-overlap measure: the size of the intersection divided by the size of the union, expressed as a value between 0 and 1. The text-similarity tool tokenises each side using the regex \b\w+\b, lowercases every token, then deduplicates the two lists into sets. Each shared word is counted once, regardless of frequency, and that count is divided by the total number of distinct words across both texts.

Output reports four numbers: the percentage score, the count of unique words in A, the count in B, and the count of words shared by both. A score of 100% means the two texts use the exact same vocabulary; 0% means no overlap at all. Word frequency is ignored on purpose, so repeating the word "the" on one side does not change the score.

Tokenisation is Latin-letter friendly out of the box. Hyphens split words (well-formed becomes well and formed) and punctuation is dropped. For a stricter character-level metric see Levenshtein distance; for the actual list of shared words see word set intersection.

How to use measure text similarity

  1. Paste text A into the input panel, then a line with ---, then paste text B.
  2. The Jaccard percentage and word counts appear in the output panel as you type.
  3. Click Copy to copy the score block, or Download to save it.
  4. Read the Shared words figure to see how much vocabulary the two texts have in common.
  5. For the actual shared word list, pivot to word set intersection.

Keyboard shortcuts

Drive TextResult without touching the mouse.

Shortcut          Action
Ctrl+F            Open the find & replace panel inside the input (Plus)
Ctrl+Z            Undo the last input change
Ctrl+Shift+Z      Redo
Ctrl+Shift+Enter  Toggle fullscreen focus on the editor (Plus)
Esc               Close find & replace, or exit fullscreen
Ctrl+K            Open the command palette to jump to any tool (Plus)
Ctrl+S            Save current workflow draft (Plus)
Ctrl+P            Run a saved workflow (Plus)

How the score is computed

Jaccard coefficient over word sets

The score is |A ∩ B| / |A ∪ B| where A and B are the sets of distinct lowercase words on each side. Multiplied by 100 it becomes the percentage shown in the output.
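The formula can be sketched in a few lines of JavaScript. This is an illustration, not the tool's source: the function name jaccard is hypothetical, and returning 0 for two empty sets is an assumption chosen to match the "Empty input" behaviour described in the settings reference.

```javascript
// Jaccard coefficient of two sets: |A ∩ B| / |A ∪ B|.
// `jaccard` is an illustrative name, not the tool's own function.
function jaccard(a, b) {
  let shared = 0;
  for (const word of a) {
    if (b.has(word)) shared += 1;
  }
  // |A ∪ B| = |A| + |B| - |A ∩ B|
  const union = a.size + b.size - shared;
  // Assumption: two empty sets score 0 rather than dividing by zero.
  return union === 0 ? 0 : shared / union;
}

const a = new Set(["red", "green", "blue"]);
const b = new Set(["green", "blue", "black", "white"]);
const score = jaccard(a, b); // 2 shared words out of 5 distinct words
console.log((score * 100).toFixed(1) + "%"); // prints "40.0%"
```

Computing the union size from the two set sizes avoids building a third set.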

Case folded, frequency ignored

Every token is lowercased before being added to the set, so Fox and fox count as one word. Repeats inside a single text do not boost the score; sets store each word once.
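A small sketch of the folding and deduplication step, assuming tokens have already been extracted. The helper name toWordSet is hypothetical.

```javascript
// Lowercase every token, then let a Set discard the repeats.
// `toWordSet` is an illustrative helper, not part of the tool.
function toWordSet(tokens) {
  return new Set(tokens.map((token) => token.toLowerCase()));
}

const words = toWordSet(["The", "Fox", "saw", "the", "fox", "THE"]);
console.log([...words].join(", ")); // prints "the, fox, saw"
console.log(words.size); // prints 3
```

Six input tokens collapse to three set members, which is why repeats cannot move the score.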

Word boundary tokenisation

Words are extracted with the regex \b\w+\b. That picks up letters, digits, and underscores. Apostrophes split tokens (don't becomes don and t); strip punctuation in advance if you want different behaviour.
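The splitting behaviour is easy to check with a short sketch. It assumes the tool applies the pattern as a plain JavaScript regular expression, where \w matches only ASCII letters, digits, and the underscore; tokenise is a hypothetical name.

```javascript
// Extract word tokens with \b\w+\b.
// Assumption: a plain JavaScript regex with the global flag.
function tokenise(text) {
  // match() returns null when nothing matches, so fall back to [].
  return text.match(/\b\w+\b/g) || [];
}

console.log(tokenise("Don't split well-formed snake_case in 2024!"));
// tokens: Don, t, split, well, formed, snake_case, in, 2024
```

The apostrophe and hyphen both act as boundaries, while the underscore keeps snake_case in one piece.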

Four-line output block

Output reports Jaccard similarity as a percentage, Unique words in A, Unique words in B, and Shared words. The percentage is rounded to one decimal place.
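One way to produce a block of this shape is sketched below. The line labels are taken from the worked example on this page; the function name formatReport and its parameters are hypothetical.

```javascript
// Build the four-line report from a raw score (0..1) and three counts.
// `formatReport` is an illustrative name, not the tool's own function.
function formatReport(score, uniqueA, uniqueB, shared) {
  return [
    // toFixed(1) rounds the percentage to one decimal place.
    `Jaccard similarity: ${(score * 100).toFixed(1)}%`,
    `Unique words in A: ${uniqueA}`,
    `Unique words in B: ${uniqueB}`,
    `Shared words: ${shared}`,
  ].join("\n");
}

// Two texts with 2 unique words each and 1 shared word: 1 / 3.
console.log(formatReport(1 / 3, 2, 2, 1));
// First line printed: "Jaccard similarity: 33.3%"
```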

Three-hyphen separator required

Both texts go in the same input pane, separated by a line containing exactly ---. Without that marker the tool returns a prompt asking for two halves.
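A minimal sketch of the split, under two assumptions the page does not confirm: the first matching line wins if the marker appears more than once, and a marker line with surrounding whitespace does not count. The name splitHalves is hypothetical.

```javascript
// Split the input on a line that is exactly "---".
// Assumptions: first separator wins; no whitespace trimming.
function splitHalves(input) {
  const lines = input.split(/\r?\n/);
  const at = lines.indexOf("---");
  if (at === -1) return null; // no separator: the caller should ask for two halves
  return {
    a: lines.slice(0, at).join("\n"),
    b: lines.slice(at + 1).join("\n"),
  };
}

console.log(splitHalves("first text\n---\nsecond text"));
// a is "first text", b is "second text"
console.log(splitHalves("only one text")); // prints null
```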

Worked example

Three words (the, brown, fox) appear in both texts. The union has five distinct words (the, quick, brown, fox, slow), so 3 / 5 = 60%. To list the shared words themselves, use word set intersection.

Input
the quick brown fox
---
the slow brown fox
Output
Jaccard similarity: 60.0%
Unique words in A: 4
Unique words in B: 4
Shared words: 3
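The worked example above can be reproduced end to end with a compact sketch. It follows the steps this page describes (split on ---, tokenise with \b\w+\b, lowercase, build sets, divide) but is not the tool's source code; the names wordSet and textSimilarity are hypothetical, and returning null when the separator is missing stands in for the tool's prompt.

```javascript
// Tokenise, lowercase, and deduplicate one side of the input.
function wordSet(text) {
  const tokens = text.match(/\b\w+\b/g) || [];
  return new Set(tokens.map((token) => token.toLowerCase()));
}

// Hypothetical end-to-end helper: input in, four-line report out.
function textSimilarity(input) {
  const lines = input.split(/\r?\n/);
  const at = lines.indexOf("---");
  if (at === -1) return null; // stand-in for the "two halves" prompt

  const a = wordSet(lines.slice(0, at).join("\n"));
  const b = wordSet(lines.slice(at + 1).join("\n"));
  const shared = [...a].filter((word) => b.has(word)).length;
  const union = a.size + b.size - shared;
  const percent = union === 0 ? 0 : (shared / union) * 100;

  return [
    `Jaccard similarity: ${percent.toFixed(1)}%`,
    `Unique words in A: ${a.size}`,
    `Unique words in B: ${b.size}`,
    `Shared words: ${shared}`,
  ].join("\n");
}

console.log(textSimilarity("the quick brown fox\n---\nthe slow brown fox"));
// Jaccard similarity: 60.0%
// Unique words in A: 4
// Unique words in B: 4
// Shared words: 3
```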

Settings reference

Behaviour     Effect on output
Separator     A line containing exactly --- splits text A from text B.
Tokeniser     Regex \b\w+\b, lowercased before being added to the set.
Case          Folded to lowercase before comparison.
Frequency     Ignored. A word is counted once per side regardless of how many times it appears.
Punctuation   Dropped during tokenisation.
Score format  Percentage with one decimal place plus three count lines.
Empty input   A score of 0% with zero counts.

FAQ

What is Jaccard similarity?
It is the ratio of words shared by both texts to the total number of distinct words across both: |A ∩ B| / |A ∪ B|. The raw value runs from 0.0 (no words in common) to 1.0 (identical vocabulary); the tool displays it as a percentage.
Why does my repeated word not change the score?
Jaccard works on sets, so each word is counted once per side. If you want a frequency-weighted measure, count words first with word set intersection and weight the result yourself.
How does this differ from Levenshtein distance?
Levenshtein measures the number of single-character edits between two strings. Jaccard ignores order and works at word level. Use Levenshtein for short strings (names, codes) and Jaccard for paragraphs and documents.
Is the comparison case sensitive?
No. Tokens are lowercased before being placed in the set, so Fox and fox count as one shared word.
Where does the text go?
Nowhere. The score is computed in your browser using JavaScript. No upload, no logging.
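For readers who want the frequency-weighted measure mentioned in the FAQ, one common option is weighted Jaccard: for each word, take the smaller of its two counts and the larger of its two counts, then divide the sum of the smaller counts by the sum of the larger ones. The sketch below is not something the tool computes; countWords and weightedJaccard are hypothetical names, and the tokenisation simply mirrors the regex described on this page.

```javascript
// Count how often each lowercase word appears in a text.
function countWords(text) {
  const counts = new Map();
  for (const token of text.match(/\b\w+\b/g) || []) {
    const word = token.toLowerCase();
    counts.set(word, (counts.get(word) || 0) + 1);
  }
  return counts;
}

// Weighted Jaccard: sum of per-word minimum counts over sum of maximum counts.
function weightedJaccard(textA, textB) {
  const a = countWords(textA);
  const b = countWords(textB);
  let minSum = 0;
  let maxSum = 0;
  for (const word of new Set([...a.keys(), ...b.keys()])) {
    const countA = a.get(word) || 0;
    const countB = b.get(word) || 0;
    minSum += Math.min(countA, countB);
    maxSum += Math.max(countA, countB);
  }
  return maxSum === 0 ? 0 : minSum / maxSum;
}

// Plain Jaccard scores these two texts at 100%; the weighted form gives 2 / 3.
console.log(weightedJaccard("the the fox", "the fox"));
```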