Jaccard similarity in a single click
Jaccard similarity is a classic set-overlap measure: the size of the intersection divided by the size of the union, expressed as a value between 0 and 1. The text-similarity tool tokenises each side using the regex \b\w+\b, lowercases every token, then deduplicates the two lists into sets. Shared words are counted once, regardless of frequency, then divided by the total number of distinct words across both texts.
Output reports four numbers: the percentage score, the count of unique words in A, the count in B, and the count of words shared by both. A score of 100% means the two texts use the exact same vocabulary; 0% means no overlap at all. Word frequency is ignored on purpose, so doubling the word "the" in one side does not change the score.
Tokenisation is Latin-letter friendly out of the box. Hyphens split words (well-formed becomes well and formed) and punctuation is dropped. For a stricter character-level metric see Levenshtein distance; for the actual list of shared words see word set intersection.
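The computation described above can be sketched in a few lines of Python. The tokenizer regex and the set logic come from this page; the function names are illustrative, not the tool's actual internals:

```python
import re

def tokenize(text):
    # \b\w+\b captures runs of letters, digits, and underscores,
    # lowercased so "Fox" and "fox" collapse to one token
    return {t.lower() for t in re.findall(r"\b\w+\b", text)}

def jaccard(a, b):
    # |A ∩ B| / |A ∪ B| over the two word sets
    sa, sb = tokenize(a), tokenize(b)
    if not (sa | sb):
        return 0.0  # empty input scores 0%
    return len(sa & sb) / len(sa | sb)
```

With this sketch, `jaccard("the quick brown fox", "the slow brown fox")` returns `0.6`, matching the worked example further down the page.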
How to measure text similarity
1. Paste text A into the input panel, then a line with ---, then paste text B.
2. The Jaccard percentage and word counts appear in the output panel as you type.
3. Click Copy to copy the score block, or Download to save it.
4. Read the Shared words figure to see how much vocabulary the two texts have in common.
5. For the actual shared word list, pivot to word set intersection.
Keyboard shortcuts
Drive TextResult without touching the mouse.
| Shortcut | Action |
|---|---|
| Ctrl F | Open the find & replace panel inside the input (Plus) |
| Ctrl Z | Undo the last input change |
| Ctrl Shift Z | Redo |
| Ctrl Shift Enter | Toggle fullscreen focus on the editor (Plus) |
| Esc | Close find & replace, or exit fullscreen |
| Ctrl K | Open the command palette to jump to any tool (Plus) |
| Ctrl S | Save current workflow draft (Plus) |
| Ctrl P | Run a saved workflow (Plus) |
How the score is computed
Jaccard coefficient over word sets
The score is |A ∩ B| / |A ∪ B| where A and B are the sets of distinct lowercase words on each side. Multiplied by 100 it becomes the percentage shown in the output.
Case folded, frequency ignored
Every token is lowercased before being added to the set, so Fox and fox count as one word. Repeats inside a single text do not boost the score; sets store each word once.
Word boundary tokenisation
Words are extracted with the regex \b\w+\b. That picks up letters, digits, and underscores. Apostrophes split tokens (don't becomes don and t); strip punctuation in advance if you want different behaviour.
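A quick check of the boundary behaviour, assuming Python's `re` module (which treats `\w` the same way: letters, digits, underscore):

```python
import re

TOKEN = re.compile(r"\b\w+\b")

print(TOKEN.findall("don't"))        # apostrophe splits: ['don', 't']
print(TOKEN.findall("well-formed"))  # hyphen splits: ['well', 'formed']
print(TOKEN.findall("v2_final"))     # digits and _ stay: ['v2_final']
```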
Four-line output block
Output reports Jaccard similarity as a percentage, Unique words in A, Unique words in B, and Shared words. The percentage is rounded to one decimal place.
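A sketch of how that block could be assembled once the word sets exist. The one-decimal rounding and line labels come from this page; the function signature is an assumption:

```python
def score_block(a_words: set, b_words: set) -> str:
    shared = a_words & b_words
    union = a_words | b_words
    pct = 100 * len(shared) / len(union) if union else 0.0
    return "\n".join([
        f"Jaccard similarity: {pct:.1f}%",   # one decimal place
        f"Unique words in A: {len(a_words)}",
        f"Unique words in B: {len(b_words)}",
        f"Shared words: {len(shared)}",
    ])
```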
Three-hyphen separator required
Both texts go in the same input pane, separated by a line containing exactly ---. Without that marker the tool returns a prompt asking for two halves.
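The splitting logic might look like this sketch; requiring the line to be exactly three hyphens, with no surrounding whitespace, is an assumption based on "exactly ---":

```python
def split_on_separator(raw: str):
    # Both texts share one pane; the first line that is exactly
    # "---" divides text A from text B
    lines = raw.splitlines()
    for i, line in enumerate(lines):
        if line == "---":
            return "\n".join(lines[:i]), "\n".join(lines[i + 1:])
    return None  # no separator: the tool prompts for two halves
```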
Worked example
Input:
the quick brown fox
---
the slow brown fox

Output:
Jaccard similarity: 60.0%
Unique words in A: 4
Unique words in B: 4
Shared words: 3

Three words (the, brown, fox) appear in both. The union has five distinct words, so 3 / 5 = 60%. To list the shared words themselves, use word set intersection.
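The arithmetic can be verified directly with Python sets:

```python
a = {"the", "quick", "brown", "fox"}  # 4 unique words in A
b = {"the", "slow", "brown", "fox"}   # 4 unique words in B
shared = a & b                        # {'the', 'brown', 'fox'} -> 3
union = a | b                         # 5 distinct words overall
print(f"Jaccard similarity: {100 * len(shared) / len(union):.1f}%")
```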
Settings reference
| Behaviour | Effect on output |
|---|---|
| Separator | A line containing exactly --- splits text A from text B. |
| Tokeniser | Regex \b\w+\b, lowercased before being added to the set. |
| Case | Folded to lowercase before comparison. |
| Frequency | Ignored. A word counted once per side regardless of how many times it appears. |
| Punctuation | Dropped during tokenisation. |
| Score format | Percentage with one decimal place plus three count lines. |
| Empty input | A score of 0% with zero counts. |
FAQ
What is Jaccard similarity?
|A ∩ B| / |A ∪ B|: the size of the intersection of two sets divided by the size of their union. A value of 1.0 means identical vocabulary; 0.0 means none in common.
Why does my repeated word not change the score?
Sets store each word once, so frequency is ignored. Repeating a word on one side adds nothing to the intersection or the union.
How does this differ from Levenshtein distance?
Levenshtein is a character-level edit distance between two strings; Jaccard compares the sets of distinct words, ignoring order and frequency.
Is the comparison case sensitive?
No. Every token is lowercased before comparison, so Fox and fox count as one shared word.