Extract text from HTML

Extract text from HTML parses pasted markup with the browser's built-in DOMParser and returns just the readable text. Tags are removed, entities are decoded (&amp; becomes &), and the text content of every element is concatenated. Two toggles let you control line breaks and whitespace. This differs from the regex-based strip HTML tags tool, which is faster on tiny snippets but does not decode entities.


How HTML extraction works here

The tool runs new DOMParser().parseFromString(html, "text/html") and reads document.body.textContent. That gives you the text of every node in the parsed body, with no markup or styling. Entities like &amp;, &lt; and &#x2014; are decoded to their characters. Script and style content is included by default because textContent walks every node; remove those upstream if needed.
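
A minimal sketch of the parsing step described above, assuming a browser environment. The function name extractText and the optional parser argument are illustrative and not taken from the tool's source; the parser argument exists only so the function can be exercised where DOMParser is not defined.

```javascript
// Hypothetical helper: parse markup and return the body's text content.
// `parser` defaults to the browser's DOMParser; anything exposing a
// compatible parseFromString(html, mimeType) method can be passed instead.
function extractText(html, parser = new DOMParser()) {
  const doc = parser.parseFromString(html, "text/html");
  // textContent concatenates every text node under <body>, including
  // the bodies of <script> and <style> elements.
  return doc.body.textContent;
}
```

In a browser, extractText("<p>Tom &amp; Jerry</p>") returns "Tom & Jerry": the tags are gone and the entity is decoded by the parser.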

The BR to Newline toggle replaces every <br> tag with a literal newline before parsing. Without it, <br> tags vanish silently because they have no text content. With it, you keep the visible line breaks from the original markup.
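
The replacement step might look like the sketch below. The function name and the exact pattern are assumptions; the tool's own pattern is not documented here, so treat this as one plausible way to swap <br> tags for newlines before parsing.

```javascript
// Hypothetical pre-processing step: swap each <br> for "\n" before parsing.
// The pattern accepts <br>, <br/>, <br /> and <br> with attributes, in any
// letter case. The \b keeps it from matching tags such as <brand>.
function brToNewline(html) {
  return html.replace(/<br\b[^>]*>/gi, "\n");
}
```

Because the newline is inserted into the markup itself, it ends up as an ordinary text node after parsing and survives into textContent.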

The Collapse WS toggle (on by default) flattens runs of whitespace (spaces, tabs, newlines) to a single space and trims each line. Switch it off to keep the raw whitespace from the source HTML, including the indentation around tags. Compared with strip HTML tags (regex-based), this tool decodes entities, walks nested tags correctly, and follows the browser's actual parsing rules.
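
A rough sketch of the collapsing step, applied to the text after extraction. The function name and the keepLineBreaks flag are hypothetical; the per-line mode is an assumption about how line breaks are preserved when BR to Newline is also on.

```javascript
// Hypothetical whitespace step applied to the extracted text.
// keepLineBreaks = false: every run of whitespace becomes one space.
// keepLineBreaks = true: each line is collapsed and trimmed on its own,
// so the newlines between lines survive.
function collapseWhitespace(text, keepLineBreaks = false) {
  const squeeze = (s) => s.replace(/\s+/g, " ").trim();
  if (!keepLineBreaks) return squeeze(text);
  return text.split("\n").map(squeeze).join("\n");
}
```

Splitting on newlines first is what keeps the line structure: each line loses its indentation and inner runs of spaces, but the breaks themselves are rejoined untouched.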

How to use Extract text from HTML

  1. Paste the HTML markup into the input panel.
  2. The plain-text content appears in the output panel.
  3. Toggle BR to Newline in the action bar to keep <br> line breaks.
  4. Toggle Collapse WS off if you want the raw whitespace from the source.
  5. Click Copy or Download to save the result.

Keyboard shortcuts

Drive TextResult without touching the mouse.

Shortcut Action
Ctrl+F Open the find & replace panel inside the input (Plus)
Ctrl+Z Undo the last input change
Ctrl+Shift+Z Redo
Ctrl+Shift+Enter Toggle fullscreen focus on the editor (Plus)
Esc Close find & replace, or exit fullscreen
Ctrl+K Open the command palette to jump to any tool (Plus)
Ctrl+S Save current workflow draft (Plus)
Ctrl+P Run a saved workflow (Plus)

What this tool actually does

DOMParser-based, not regex

The browser builds a real DOM tree from the HTML and the tool reads textContent. Nested tags, malformed markup and self-closing elements are all handled the way your browser handles them. Compare with strip HTML tags, which is regex-based and faster on small snippets.

HTML entities decoded

Named entities (&amp;, &lt;, &copy;), numeric entities (&#8212;) and hex entities (&#x2014;) are all decoded to their characters. Regex-based strippers leave entities in place; this tool does not.
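
To see why a regex stripper leaves entities behind, here is a minimal sketch of that approach. The function name and pattern are assumptions for illustration, not the strip HTML tags tool's actual code.

```javascript
// Hypothetical regex-based stripper: deletes anything between < and >.
// It never interprets the remaining text, so entities stay encoded.
function stripTagsWithRegex(html) {
  return html.replace(/<[^>]*>/g, "");
}
```

Applied to <a href="https://example.com">Visit &amp; say hi</a>, this returns Visit &amp; say hi with the entity still encoded, whereas the DOM-based route decodes it to Visit & say hi.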

BR to Newline toggle

Off (default): <br> tags vanish because textContent ignores them. On: every <br> in the source is replaced by a newline before parsing, so visible line breaks come through.

Collapse WS toggle

On (default): runs of whitespace are flattened to single spaces and lines are trimmed. Off: the raw whitespace from the source HTML is preserved, including indentation around tags. With both BR to Newline and Collapse WS on, each line is collapsed individually so you keep the line breaks but lose the per-line indentation.

Script and style content included

Because textContent walks every node, the bodies of <script> and <style> tags come through as text. Strip those tags upstream with find and replace if you only want the visible body copy.
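
One way to do that upstream cleanup is sketched below. The function name is hypothetical, and the pattern is a simple regex in the spirit of a find-and-replace pass; it can be fooled by unusual markup, such as a closing tag that appears inside a script string.

```javascript
// Hypothetical upstream cleanup: delete <script> and <style> elements,
// bodies included, before the markup is parsed. \1 refers back to
// whichever tag name matched, so each opener pairs with its own closer.
function stripScriptAndStyle(html) {
  return html.replace(/<(script|style)\b[^>]*>[\s\S]*?<\/\1\s*>/gi, "");
}
```

The lazy [\s\S]*? stops at the first matching closing tag, so two separate script blocks are removed individually rather than swallowing the markup between them.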

Worked example

With Collapse WS on (default) and BR to Newline on, you get one tidy line per <br> break. Notice &amp; decoded to &, and the href attribute did not appear in the output.

Input
<p>Hello <strong>world</strong>!</p>
<p>Line one.<br>Line two.</p>
<a href="https://example.com">Visit &amp; say hi</a>
Output
Hello world! Line one. Line two. Visit & say hi

Settings reference

Setting Effect on output
BR to Newline off (default) <br> tags vanish. Line one and two run together unless other whitespace separates them.
BR to Newline on Every <br> becomes a literal newline before parsing.
Collapse WS on (default) Runs of whitespace (spaces, tabs, newlines) flatten to a single space; lines are trimmed.
Collapse WS off Raw whitespace from the source is kept, including indentation between tags.
HTML entities Always decoded. &amp; becomes &.
Script and style bodies Included as text. Remove tags upstream if you do not want their content.

FAQ

How is this different from strip HTML tags?
Strip HTML tags is a regex-based tool that removes anything between < and >. It is fast for tiny snippets but does not decode entities and can mishandle nested or malformed markup. This tool uses the browser's real DOMParser, which builds a proper DOM, decodes entities, and follows the browser's actual parsing rules.
Why does my <br> tag disappear?
Because <br> elements have no text content, the DOM walker skips them. Turn on BR to Newline in the action bar to swap each <br> for a literal newline before parsing.
Are <script> and <style> bodies removed?
No. textContent walks every node, so script and style code comes through as text. If you only want the visible body copy, run find and replace with regex first to delete <script[^>]*>[\s\S]*?</script> and the equivalent for style, then paste the result here.
Are HTML entities decoded?
Yes. Named (&amp;), numeric (&#8212;) and hex (&#x2014;) entities all decode to their characters because the browser does it for you when it builds the DOM.
Is anything sent to a server?
No. DOMParser runs entirely in your browser. Nothing uploads, nothing is logged.