How word extraction works here
The pattern \b[\w']+\b matches runs of word characters and straight apostrophes and treats everything else as a token break. Letters A-Z a-z, digits 0-9, underscore and ' stay together; spaces, punctuation, hyphens, em-dashes and smart quotes all break tokens. So "don't" is one token and "check-in" is two (check and in).
Smart curly apostrophes (’) are not in the character class, so a contraction written with smart punctuation (don’t) splits into don and t. To keep curly apostrophes attached, run find and replace first to swap ’ for ', or switch to extract regex matches with a pattern like [\w’']+.
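The two workarounds above can be sketched in JavaScript. This is only an illustration under the assumption that the tool's patterns behave like standard JavaScript regular expressions; the helper names are hypothetical and not part of the tool.

```javascript
// Assumed default pattern, taken from the description above.
const DEFAULT_WORDS = /\b[\w']+\b/g;

// Option 1 (hypothetical helper): swap curly apostrophes (U+2019)
// for straight ones, then extract with the default pattern.
function extractAfterNormalising(text) {
  return text.replace(/\u2019/g, "'").match(DEFAULT_WORDS) ?? [];
}

// Option 2 (hypothetical helper): put the curly apostrophe in the
// character class, as in the [\w’']+ pattern suggested above.
function extractWithCurlyClass(text) {
  return text.match(/[\w\u2019']+/g) ?? [];
}

const sample = "I don\u2019t know"; // contraction written with a curly apostrophe

console.log(sample.match(DEFAULT_WORDS));     // contraction splits into don and t
console.log(extractAfterNormalising(sample)); // contraction kept, apostrophe now straight
console.log(extractWithCurlyClass(sample));   // contraction kept, apostrophe still curly
```

Option 1 changes the apostrophe in the output; option 2 leaves the original character in place, which matters if the tokens are later matched back against the source text.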
Output is one token per line in source order. Duplicates are kept; for a unique vocabulary list pipe the result through remove duplicate lines. To count tokens, use word counter on the original text.
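For readers scripting the same steps outside the tool, a minimal sketch of "extract, then remove duplicates" might look like this. It assumes case-sensitive comparison (so The and the would count as different words); whether remove duplicate lines behaves that way is not stated here. The function name is hypothetical.

```javascript
// Hypothetical helper: extract tokens, then keep only the first
// occurrence of each one, preserving source order.
function uniqueTokens(text) {
  const tokens = text.match(/\b[\w']+\b/g) ?? [];
  // A Set remembers insertion order, so spreading it back into an
  // array keeps each token at the position of its first appearance.
  return [...new Set(tokens)];
}

const vocabulary = uniqueTokens("the cat saw the other cat");
console.log(vocabulary.join("\n")); // one unique token per line
```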
How to use extract words from text
1. Paste your text into the input panel.
2. The output panel shows every word, one per line.
3. Click Copy to copy the list.
4. Click Download to save it as a plain-text file.
5. For a unique vocabulary list, send the result to remove duplicate lines.
Keyboard shortcuts
Drive TextResult without touching the mouse.
| Shortcut | Action |
|---|---|
| Ctrl+F | Open the find & replace panel inside the input (Plus) |
| Ctrl+Z | Undo the last input change |
| Ctrl+Shift+Z | Redo |
| Ctrl+Shift+Enter | Toggle fullscreen focus on the editor (Plus) |
| Esc | Close find & replace, or exit fullscreen |
| Ctrl+K | Open the command palette to jump to any tool (Plus) |
| Ctrl+S | Save current workflow draft (Plus) |
| Ctrl+P | Run a saved workflow (Plus) |
What counts as a word here
Letters, digits, underscore, straight apostrophe
The character class is [\w']: ASCII letters, digits, underscore and '. So fox, 2026, order_id and don't all match as single tokens.
Punctuation and dashes split words
Hyphens, dashes, slashes, dots and other punctuation are token breaks. check-in splits to check and in; up/down splits to up and down.
Smart apostrophes split contractions
Curly ’ is not in the word class, so don’t splits to don and t. Replace with straight ' first if your text uses smart punctuation.
Accented and non-Latin letters split words
The ASCII word class does not include é, ñ, Cyrillic, Greek, CJK or any other non-ASCII letter. café comes out as caf alone (the é is dropped). For Unicode-aware tokenisation, use extract regex matches with the pattern [\p{L}\p{N}']+ and the g and u flags.
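The difference between the two patterns can be seen in a short JavaScript sketch, assuming standard JavaScript regex semantics. Accented letters are written as escapes so the sample is guaranteed to use single precomposed code points.

```javascript
// Default ASCII pattern and the Unicode-aware alternative from the text.
const ASCII_WORDS = /\b[\w']+\b/g;
const UNICODE_WORDS = /[\p{L}\p{N}']+/gu; // \p{...} requires the u flag

// "café señor don't" with é (U+00E9) and ñ (U+00F1) as escapes.
const sampleText = "caf\u00e9 se\u00f1or don't";

const asciiTokens = sampleText.match(ASCII_WORDS) ?? [];
const unicodeTokens = sampleText.match(UNICODE_WORDS) ?? [];

console.log(asciiTokens);   // accented letters act as token breaks
console.log(unicodeTokens); // accented letters stay inside the word
```

Note that [\p{L}\p{N}'] contains no underscore, so a token such as order_id would split under this pattern unless _ is added to the class.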
Order preserved, duplicates kept
Tokens appear in source order. Duplicates are not removed; for a unique list pipe the output through remove duplicate lines.
Worked example
don't stays as one token because the apostrophe is in the word class. check-in splits because the hyphen is not. ORDER-4821 splits the same way.
Input:
The quick brown fox jumps over the lazy dog. Don't worry: check-in at 5pm. ORDER-4821 ships today.
Output (one token per line in the tool; shown here on a single line):
The quick brown fox jumps over the lazy dog Don't worry check in at 5pm ORDER 4821 ships today
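The worked example can be reproduced with a few lines of JavaScript, assuming the tool applies the documented pattern with a global match. The variable names are illustrative only.

```javascript
const exampleInput =
  "The quick brown fox jumps over the lazy dog. Don't worry: check-in at 5pm. ORDER-4821 ships today.";

// Collect every run of word characters and straight apostrophes.
const exampleTokens = exampleInput.match(/\b[\w']+\b/g) ?? [];

// One token per line, in source order, duplicates kept.
console.log(exampleTokens.join("\n"));
```

Joining with "\n" mirrors the one-token-per-line output; the punctuation and hyphens never appear because they are not part of any match.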
Settings reference
| Behaviour | Effect on output |
|---|---|
| Word body characters | Letters A-Z a-z, digits 0-9, underscore, straight '. |
| Hyphens and dashes | Token breaks. check-in splits. |
| Smart apostrophes | Token break. Replace ’ with ' first to keep contractions intact. |
| Accented and Unicode letters | Not in the word class. café splits. |
| Punctuation and symbols | Token break. Dropped from output. |
| Order and duplicates | Source order kept, duplicates kept. |
FAQ
Why is café coming out as caf?
The default pattern uses the ASCII \w shorthand, which is letters, digits and underscore only. Accented letters and other Unicode letters split tokens. For full Unicode tokenisation, use extract regex matches with [\p{L}\p{N}']+ and the g and u flags.
How do I keep contractions written with smart quotes?
Run find and replace first to swap ’ for a straight ', then extract. Or use extract regex matches with [\w’']+.