Set intersection on words
Word set intersection treats each text as a set of distinct words and returns the words the two sets share. Tokens are extracted with the regex \b[\w']+\b, which matches runs of letters, digits, underscores, and apostrophes. Each token is lowercased before being placed in a JavaScript Set, which handles deduplication for free. The intersection is computed by walking set A and keeping any token also in set B.
The split between text A and text B is a line containing exactly ---. Output is one word per line, alphabetically sorted. Frequency is ignored; a word that appears five times in A and twice in B contributes one line to the output. Punctuation is stripped during tokenisation so commas, full stops, and brackets do not interfere.
For the words in A but not B see word set difference; for whole-line overlap rather than word-level use find common lines; for a single Jaccard percentage instead of the actual word list see text similarity.
How to use word set intersection
1. Paste text A into the input panel, then a line with ---, then paste text B.
2. The shared word list appears in the output panel, one word per line, sorted alphabetically.
3. Click Copy to copy the list, or Download to save it as a .txt file.
4. For words unique to one side, pivot to word set difference.
5. For a single overlap percentage use text similarity.
Keyboard shortcuts
Drive TextResult without touching the mouse.
| Shortcut | Action |
|---|---|
| Ctrl+F | Open the find & replace panel inside the input (Plus) |
| Ctrl+Z | Undo the last input change |
| Ctrl+Shift+Z | Redo |
| Ctrl+Shift+Enter | Toggle fullscreen focus on the editor (Plus) |
| Esc | Close find & replace, or exit fullscreen |
| Ctrl+K | Open the command palette to jump to any tool (Plus) |
| Ctrl+S | Save current workflow draft (Plus) |
| Ctrl+P | Run a saved workflow (Plus) |
How the intersection is computed
Word boundary tokenisation
Tokens are extracted with the regex \b[\w']+\b. That picks up letters, digits, underscores, and apostrophes. A contraction such as don't stays a single token; all other punctuation is dropped.
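The tokenisation step can be sketched as follows. This is a minimal illustration, assuming the regex is applied with the global flag; the helper name tokenize is hypothetical and not taken from the tool.

```javascript
// Hypothetical helper: the name `tokenize` is an assumption for illustration.
function tokenize(text) {
  // With the g flag, match() returns every token, or null when nothing matches.
  return text.match(/\b[\w']+\b/g) || [];
}

console.log(tokenize("Don't panic, it's fine (really)."));
// → ["Don't", "panic", "it's", "fine", "really"]

// Quotes wrapped around a word are dropped, because \b cannot sit
// between a quote and the start or end of the string.
console.log(tokenize("'quoted'")); // → ["quoted"]
```

In JavaScript, \w covers only ASCII letters, digits, and the underscore. If the regex is used exactly as written, an accented word such as café would be cut short at the é.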
Case folded, deduplicated
Every token is lowercased before being added to the set. Fox and fox count as the same word. Frequency is ignored; sets store each word once.
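A small sketch of the folding and deduplication step, using an illustrative token list rather than real tool output:

```javascript
// Illustrative tokens; in the tool these would come from the tokeniser.
const tokens = ["Fox", "fox", "FOX", "Dog", "fox"];

// Lowercase each token, then let the Set discard the repeats.
const words = new Set(tokens.map((t) => t.toLowerCase()));

console.log([...words]); // → ["fox", "dog"]
console.log(words.size); // → 2
```

Five tokens collapse to two entries, which is why a word's frequency never affects the result.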
Alphabetically sorted output
After computing the intersection the tool calls .sort(), so the words come out in standard JavaScript lexicographic order. That makes results stable across runs.
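To see what that ordering means in practice, here is a sketch that assumes .sort() is called with no comparator. In that case JavaScript compares strings by UTF-16 code unit, so an apostrophe sorts before digits, digits before the underscore, and the underscore before lowercase letters. The word list is made up for illustration.

```javascript
// Made-up shared words, already lowercased.
const shared = ["zebra", "apple", "its", "2nd", "_tmp", "it's"];

// Copy first so the original array is left untouched, then sort by code unit.
const sorted = [...shared].sort();

console.log(sorted);
// → ["2nd", "_tmp", "apple", "it's", "its", "zebra"]
```

Because every token is lowercased beforehand, the uppercase-before-lowercase quirk of code unit ordering never shows up in the output.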
Set intersection algorithm
Both sides are converted to Set objects for O(1) lookup. The intersection walks set A and keeps any token also present in set B.
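The walk described above might look like this. The function name intersect is hypothetical; the sketch only mirrors the stated approach of iterating A and testing membership in B.

```javascript
// Hypothetical helper: walk set A and keep members that are also in B.
function intersect(a, b) {
  const out = [];
  for (const word of a) {
    // Set.prototype.has is a constant-time lookup on average.
    if (b.has(word)) out.push(word);
  }
  return out;
}

const a = new Set(["the", "quick", "brown", "fox", "jumps"]);
const b = new Set(["a", "fast", "brown", "fox", "runs"]);

console.log(intersect(a, b)); // → ["brown", "fox"]
```

The total work is proportional to the size of A, since each membership test against B costs roughly the same regardless of how large B is.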
Three-hyphen separator
The split between A and B is a line containing exactly ---. The tool needs this marker to know where text A ends and text B begins.
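One way to sketch the split is shown below. The helper name splitTexts is hypothetical, and two behaviours are assumptions rather than documented facts: the first separator line wins, and input without a separator returns null.

```javascript
// Hypothetical helper: split the input at the first line that is exactly "---".
function splitTexts(input) {
  const lines = input.split(/\r?\n/); // accept both LF and CRLF line endings
  const i = lines.indexOf("---");     // exact match only; "--- " or "a --- b" do not count
  if (i === -1) return null;          // assumption: the tool's behaviour here is not documented
  return {
    a: lines.slice(0, i).join("\n"),
    b: lines.slice(i + 1).join("\n"),
  };
}

console.log(splitTexts("first text\n---\nsecond text"));
// → { a: "first text", b: "second text" }
console.log(splitTexts("first --- second")); // → null
```

Three hyphens inside a longer line do not count as the marker, which is why the separator must sit on a line of its own.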
Worked example
Two words (brown and fox) appear in both texts. The words the, quick, jumps, a, fast, and runs each appear on only one side. For the words in A but not B see word set difference.
Input:

    the quick brown fox jumps
    ---
    a fast brown fox runs

Output:

    brown
    fox
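The whole pipeline can be reproduced for this example in a few lines. This is a sketch under stated assumptions, not the tool's actual source: the names wordSet and sharedWords are hypothetical, and returning an empty list when the separator is missing is a guess.

```javascript
// Hypothetical helper: tokenise, lowercase, and deduplicate one text.
function wordSet(text) {
  const tokens = text.match(/\b[\w']+\b/g) || [];
  return new Set(tokens.map((t) => t.toLowerCase()));
}

// Hypothetical helper: split on the "---" line, intersect, then sort.
function sharedWords(input) {
  const lines = input.split(/\r?\n/);
  const sep = lines.indexOf("---");
  if (sep === -1) return []; // assumption: no separator means no result
  const a = wordSet(lines.slice(0, sep).join("\n"));
  const b = wordSet(lines.slice(sep + 1).join("\n"));
  return [...a].filter((w) => b.has(w)).sort();
}

const input = "the quick brown fox jumps\n---\na fast brown fox runs";
console.log(sharedWords(input).join("\n"));
// brown
// fox
```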
Settings reference
| Behaviour | Effect on output |
|---|---|
| Separator | A line containing exactly --- splits text A from text B. |
| Tokeniser | Regex \b[\w']+\b, lowercased before set insertion. |
| Case | Folded to lowercase. |
| Frequency | Ignored. Each word listed once regardless of how often it appears. |
| Order | Alphabetically sorted via JavaScript .sort(). |
| Punctuation | Dropped during tokenisation. |
| Empty input | Empty output (no shared words). |
FAQ
Why is don't kept as one word?
\b[\w']+\b includes the apostrophe in the token character class, so contractions stay intact. don't, it's, and can't all behave as single tokens.
Are matches case sensitive?
No. Tokens are lowercased before comparison, so Fox and fox count as the same word. Use line diff if you need to detect casing changes.