Text Similarity - Jaccard % Score Online

Jaccard similarity in a single click

Jaccard similarity is a classic set-overlap measure: the size of the intersection divided by the size of the union, expressed as a value between 0 and 1. The text-similarity tool tokenises each side using the regex \b\w+\b, lowercases every token, then deduplicates the two lists into sets. The number of shared words, each counted once regardless of frequency, is then divided by the total number of distinct words across both texts.

Output reports four numbers: the percentage score, the count of unique words in A, the count in B, and the count of words shared by both. A score of 100% means the two texts use the exact same vocabulary; 0% means no overlap at all. Word frequency is ignored on purpose, so doubling the word "the" in one side does not change the score.

Tokenisation is Latin-letter friendly out of the box. Hyphens split words (well-formed becomes well and formed) and punctuation is dropped. For a stricter character-level metric see Levenshtein distance; for the actual list of shared words see word set intersection.

How to measure text similarity

1. Paste text A into the input panel, then a line with ---, then paste text B.
2. The Jaccard percentage and word counts appear in the output panel as you type.
3. Click Copy to copy the score block, or Download to save it.
4. Read the Shared words figure to see how much vocabulary the two texts have in common.
5. For the actual shared word list pivot to word set intersection.

Keyboard shortcuts

Drive TextResult without touching the mouse.

Shortcut	Action
`Ctrl` `F`	Open the find & replace panel inside the input (Plus)
`Ctrl` `Z`	Undo the last input change
`Ctrl` `Shift` `Z`	Redo
`Ctrl` `Shift` `Enter`	Toggle fullscreen focus on the editor (Plus)
`Esc`	Close find & replace, or exit fullscreen
`Ctrl` `K`	Open the command palette to jump to any tool (Plus)
`Ctrl` `S`	Save current workflow draft (Plus)
`Ctrl` `P`	Run a saved workflow (Plus)

How the score is computed

Jaccard coefficient over word sets

The score is |A ∩ B| / |A ∪ B| where A and B are the sets of distinct lowercase words on each side. Multiplied by 100 it becomes the percentage shown in the output.
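The formula above can be sketched in a few lines of JavaScript. This is a minimal illustration, not the tool's actual source: the function name `jaccard` is hypothetical, and returning 0 for two empty sets is an assumption chosen to avoid dividing by zero.

```javascript
// Jaccard coefficient over two collections of words.
// `jaccard` is a hypothetical helper name used for illustration only.
function jaccard(wordsA, wordsB) {
  const a = new Set(wordsA); // sets keep each distinct word once
  const b = new Set(wordsB);

  // |A ∩ B|: words of A that also occur in B.
  let shared = 0;
  for (const w of a) {
    if (b.has(w)) shared += 1;
  }

  // |A ∪ B| = |A| + |B| - |A ∩ B|
  const union = a.size + b.size - shared;

  // Assumption: two empty sets score 0 instead of dividing by zero.
  return union === 0 ? 0 : shared / union;
}

const score = jaccard(
  ["the", "quick", "brown", "fox"],
  ["the", "slow", "brown", "fox"]
);
console.log((score * 100).toFixed(1) + "%"); // 60.0%
```

Computing the union size from the two set sizes and the intersection count avoids building a third set.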

Case folded, frequency ignored

Every token is lowercased before being added to the set, so Fox and fox count as one word. Repeats inside a single text do not boost the score; sets store each word once.

Word boundary tokenisation

Words are extracted with the regex \b\w+\b. That picks up letters, digits, and underscores. Apostrophes split tokens (don't becomes don and t); strip punctuation in advance if you want different behaviour.
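To see how that regex behaves on apostrophes, hyphens, underscores, and digits, here is a small sketch. The helper name `tokenise` is hypothetical, and lowercasing after matching is one possible ordering, not necessarily the tool's.

```javascript
// Hypothetical helper; illustrates the \b\w+\b tokenisation described above.
function tokenise(text) {
  // match() returns null when nothing matches, hence the empty-array fallback.
  const tokens = text.match(/\b\w+\b/g) || [];
  return tokens.map((t) => t.toLowerCase());
}

console.log(tokenise("Don't split well-formed snake_case 42!"));
// [ 'don', 't', 'split', 'well', 'formed', 'snake_case', '42' ]
```

In JavaScript, \w covers only ASCII letters, digits, and the underscore, so in this sketch an accented letter would also act as a token boundary. Whether the tool itself behaves that way is not stated here.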

Four-line output block

Output reports Jaccard similarity as a percentage, Unique words in A, Unique words in B, and Shared words. The percentage is rounded to one decimal place.

Three-hyphen separator required

Both texts go in the same input pane, separated by a line containing exactly ---. Without that marker the tool returns a prompt asking for two halves.
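One way to split the pane into two halves is sketched below. The helper name `splitHalves` is hypothetical, and two details are assumptions: the first line that is exactly --- is treated as the separator, and a missing separator is signalled with null so the caller can show its prompt.

```javascript
// Hypothetical helper: split one input pane into text A and text B.
function splitHalves(input) {
  const lines = input.split(/\r?\n/); // tolerate Windows line endings

  // Assumption: the first line that is exactly "---" is the separator.
  const idx = lines.indexOf("---");
  if (idx === -1) return null; // no separator: caller can ask for two halves

  return {
    a: lines.slice(0, idx).join("\n"),
    b: lines.slice(idx + 1).join("\n"),
  };
}

console.log(splitHalves("the quick brown fox\n---\nthe slow brown fox"));
// { a: 'the quick brown fox', b: 'the slow brown fox' }
```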

Worked example

Three words (the, brown, fox) appear in both. The union has five distinct words, so 3 / 5 = 60%. To list the shared words themselves, use word set intersection.

Input

the quick brown fox
---
the slow brown fox

Output

Jaccard similarity: 60.0%
Unique words in A: 4
Unique words in B: 4
Shared words: 3
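The worked example can be reproduced end to end with a short sketch. The names `wordSet` and `similarityReport` are hypothetical, the two texts are passed in already split, and the zero-score handling for empty input follows the settings table rather than any published source code.

```javascript
// Hypothetical helper: distinct lowercase words of a text.
function wordSet(text) {
  const tokens = text.match(/\b\w+\b/g) || [];
  return new Set(tokens.map((t) => t.toLowerCase()));
}

// Hypothetical helper: build the four-line score block for two texts.
function similarityReport(textA, textB) {
  const a = wordSet(textA);
  const b = wordSet(textB);

  let shared = 0;
  for (const w of a) {
    if (b.has(w)) shared += 1;
  }

  const union = a.size + b.size - shared;
  const pct = union === 0 ? 0 : (shared / union) * 100;

  return [
    `Jaccard similarity: ${pct.toFixed(1)}%`, // one decimal place
    `Unique words in A: ${a.size}`,
    `Unique words in B: ${b.size}`,
    `Shared words: ${shared}`,
  ].join("\n");
}

console.log(similarityReport("the quick brown fox", "the slow brown fox"));
```

Run on the two example texts, this prints the same four lines as the output above.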

Settings reference

Behaviour	Effect on output
Separator	A line containing exactly `---` splits text A from text B.
Tokeniser	Regex `\b\w+\b`, lowercased before being added to the set.
Case	Folded to lowercase before comparison.
Frequency	Ignored. A word is counted once per side regardless of how many times it appears.
Punctuation	Dropped during tokenisation.
Score format	Percentage with one decimal place plus three count lines.
Empty input	A score of 0% with zero counts.

FAQ

What is Jaccard similarity?

It is the ratio of words shared by both texts to the total number of distinct words across both: |A ∩ B| / |A ∪ B|. The tool shows this ratio as a percentage. A ratio of 1.0 (100%) means identical vocabulary; 0.0 (0%) means no words in common.

Why does my repeated word not change the score?

Jaccard works on sets, so each word is counted once per side. If you want a frequency-weighted measure, count words first with word set intersection and weight the result yourself.
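As one possible way to weight by frequency yourself, the sketch below uses a common variant often called weighted Jaccard (or Ruzicka similarity): for each word, take the smaller of the two counts, sum those, and divide by the sum of the larger counts. This is not something the tool computes, and the names `wordCounts` and `weightedJaccard` are hypothetical.

```javascript
// Hypothetical helper: count occurrences of each lowercase word.
function wordCounts(text) {
  const counts = new Map();
  for (const t of text.match(/\b\w+\b/g) || []) {
    const w = t.toLowerCase();
    counts.set(w, (counts.get(w) || 0) + 1);
  }
  return counts;
}

// Weighted Jaccard: sum of per-word minimum counts over sum of maximum counts.
function weightedJaccard(textA, textB) {
  const a = wordCounts(textA);
  const b = wordCounts(textB);
  let minSum = 0;
  let maxSum = 0;
  for (const w of new Set([...a.keys(), ...b.keys()])) {
    const ca = a.get(w) || 0;
    const cb = b.get(w) || 0;
    minSum += Math.min(ca, cb);
    maxSum += Math.max(ca, cb);
  }
  return maxSum === 0 ? 0 : minSum / maxSum;
}

// "the" appears twice on one side, so the weighted score drops to 2/3,
// whereas the set-based score for the same pair is 1.
console.log(weightedJaccard("the the fox", "the fox"));
```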

How does this differ from Levenshtein distance?

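Levenshtein distance is the minimum number of single-character edits (insertions, deletions, or substitutions) needed to turn one string into the other, so it is sensitive to character order. Jaccard ignores order and works at word level. Use Levenshtein for short strings (names, codes) and Jaccard for paragraphs and documents.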
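For contrast, here is a minimal sketch of the standard dynamic-programming approach to Levenshtein distance. The function name `levenshtein` is hypothetical, and the code compares UTF-16 code units, which is an assumption that suits plain ASCII names and codes.

```javascript
// Edit distance between s and t, keeping only two rows of the DP table.
function levenshtein(s, t) {
  // prev[j] = distance between the first i-1 chars of s and the first j chars of t.
  let prev = Array.from({ length: t.length + 1 }, (_, j) => j);

  for (let i = 1; i <= s.length; i++) {
    const curr = [i]; // turning i chars into an empty prefix costs i deletions
    for (let j = 1; j <= t.length; j++) {
      const cost = s[i - 1] === t[j - 1] ? 0 : 1;
      curr[j] = Math.min(
        prev[j] + 1,        // delete a character from s
        curr[j - 1] + 1,    // insert a character into s
        prev[j - 1] + cost  // substitute, or keep a matching character
      );
    }
    prev = curr;
  }
  return prev[t.length];
}

console.log(levenshtein("kitten", "sitting")); // 3
```

Unlike the word-set score, swapping two characters or reordering words changes this result.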

Is the comparison case sensitive?

No. Tokens are lowercased before being placed in the set, so Fox and fox count as one shared word.

Where does the text go?

Nowhere. The score is computed in your browser using JavaScript. No upload, no logging.

Also known as

text similarity, jaccard similarity, compare two texts, similarity percentage, word overlap calculator, text matching score, document similarity, compare paragraphs