Extract Text from HTML Online

How HTML extraction works here

The tool runs new DOMParser().parseFromString(html, "text/html") and reads document.body.textContent. That gives you the text of every node in the body, with the tags and styling stripped away. Entities like &amp;, &lt; and &mdash; are decoded to their characters. Script and style content is included by default because textContent walks every node; remove those upstream if needed.
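The call described above can be sketched as follows. This is a minimal illustration, not the tool's source: the function name extractText is an assumption, and DOMParser is a browser API, so the example call is guarded to keep the snippet loadable elsewhere.

```javascript
// Sketch only: extractText is an illustrative name, not taken from the tool.
function extractText(html) {
  // DOMParser is a browser API; it is not defined in plain Node.js.
  const doc = new DOMParser().parseFromString(html, "text/html");
  // textContent concatenates every text node under <body>, including
  // the contents of <script> and <style> elements.
  return doc.body.textContent;
}

// Guarded so the snippet also loads outside a browser.
if (typeof DOMParser !== "undefined") {
  console.log(extractText('<a href="https://example.com">Visit &amp; say hi</a>'));
  // prints: Visit & say hi
}
```

The href attribute never reaches the output because attributes are not text nodes.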

The BR to Newline toggle replaces every <br> tag with a literal newline before parsing. Without it, <br> tags vanish silently because they have no text content. With it, you keep the visible line breaks from the original markup.
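A rough sketch of that pre-parse replacement is shown below. The helper name brToNewline and the exact pattern are assumptions; the tool's real pattern may differ.

```javascript
// Hypothetical helper; the pattern is an assumption, not the tool's actual regex.
function brToNewline(html) {
  // Matches <br>, <br/>, <br /> and <BR>, with or without attributes.
  return html.replace(/<br\b[^>]*>/gi, "\n");
}

console.log(JSON.stringify(brToNewline("Line one.<br>Line two.<br />Line three.")));
// prints "Line one.\nLine two.\nLine three."
```

Doing the swap on the raw string, before parsing, means the newline ends up as an ordinary text node that textContent picks up.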

The Collapse WS toggle (on by default) flattens runs of whitespace (spaces, tabs, newlines) to a single space and trims each line. Switch it off to keep the raw whitespace from the source HTML, including the indentation around tags. Compared with strip HTML tags (regex-based), this tool decodes entities, walks nested tags correctly, and follows the browser's actual parsing rules.

How to use Extract Text from HTML

1. Paste the HTML markup into the input panel.
2. The plain-text content appears in the output panel.
3. Toggle BR to Newline in the action bar to keep <br> line breaks.
4. Toggle Collapse WS off if you want the raw whitespace from the source.
5. Click Copy or Download to save the result.

Keyboard shortcuts

Drive TextResult without touching the mouse.

Shortcut	Action
`Ctrl` `F`	Open the find & replace panel inside the input (Plus)
`Ctrl` `Z`	Undo the last input change
`Ctrl` `Shift` `Z`	Redo
`Ctrl` `Shift` `Enter`	Toggle fullscreen focus on the editor (Plus)
`Esc`	Close find & replace, or exit fullscreen
`Ctrl` `K`	Open the command palette to jump to any tool (Plus)
`Ctrl` `S`	Save current workflow draft (Plus)
`Ctrl` `P`	Run a saved workflow (Plus)

What this tool actually does

DOMParser-based, not regex

The browser builds a real DOM tree from the HTML and the tool reads textContent. Nested tags, malformed markup and self-closing elements are all handled the way your browser handles them. Compare with strip HTML tags, which is regex-based and faster on small snippets.

HTML entities decoded

Named entities (&amp;, &lt;, &copy;), numeric entities (&#8212;) and hex entities (&#x2014;) are all decoded to their characters. Regex-based strippers leave entities in place; this tool does not.
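To show why a decimal reference such as &#8212; and a hex reference such as &#x2014; yield the same character, here is a small stand-alone decoder. It is an illustration only: in the tool the browser's HTML parser does the decoding, decodeNumericEntities is a hypothetical name, and named entities are deliberately not handled.

```javascript
// Illustration only: in the tool, the browser's HTML parser does this decoding.
function decodeNumericEntities(text) {
  return text
    // Hex form, e.g. &#x2014;
    .replace(/&#x([0-9a-f]+);/gi, (_, hex) => String.fromCodePoint(parseInt(hex, 16)))
    // Decimal form, e.g. &#8212;
    .replace(/&#(\d+);/g, (_, dec) => String.fromCodePoint(parseInt(dec, 10)));
}

// 8212 in decimal and 2014 in hex are the same code point, U+2014.
console.log(decodeNumericEntities("&#8212;") === decodeNumericEntities("&#x2014;")); // true
console.log(decodeNumericEntities("&#8212;") === "\u2014"); // true
```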

BR to Newline toggle

Off (default): <br> tags vanish because textContent ignores them. On: every <br> in the source is replaced by a newline before parsing, so visible line breaks come through.

Collapse WS toggle

On (default): runs of whitespace are flattened to single spaces and lines are trimmed. Off: the raw whitespace from the source HTML is preserved, including indentation around tags. With both BR to Newline and Collapse WS on, each line is collapsed individually so you keep the line breaks but lose the per-line indentation.
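The two collapse behaviours can be sketched like this. The helper collapseWhitespace and its keepLineBreaks flag are hypothetical names chosen for illustration; how the tool treats blank lines is not stated, so this sketch simply leaves them in place.

```javascript
// Hypothetical helper: collapseWhitespace and keepLineBreaks are illustrative names.
function collapseWhitespace(text, keepLineBreaks) {
  if (!keepLineBreaks) {
    // Every run of spaces, tabs and newlines becomes one space.
    return text.replace(/\s+/g, " ").trim();
  }
  // Collapse each line on its own so the newlines survive.
  return text
    .split("\n")
    .map((line) => line.replace(/[ \t]+/g, " ").trim())
    .join("\n");
}

const raw = "  Line one.\n      Line   two.";
console.log(JSON.stringify(collapseWhitespace(raw, false))); // "Line one. Line two."
console.log(JSON.stringify(collapseWhitespace(raw, true)));  // "Line one.\nLine two."
```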

Script and style content included

Because textContent walks every node, the bodies of <script> and <style> tags come through as text. Strip those tags upstream with find and replace if you only want the visible body copy.

Worked example

With Collapse WS on (default) and BR to Newline on, you get one tidy line per <br> break. Notice &amp; decoded to &, and the href attribute did not appear in the output.

Input

<p>Hello <strong>world</strong>!</p>
<p>Line one.<br>Line two.</p>
<a href="https://example.com">Visit &amp; say hi</a>

Output

Hello world! Line one. Line two. Visit & say hi

Settings reference

Setting	Effect on output
BR to Newline off (default)	`<br>` tags vanish. Line one and two run together unless other whitespace separates them.
BR to Newline on	Every `<br>` becomes a literal newline before parsing.
Collapse WS on (default)	Runs of whitespace (spaces, tabs, newlines) flatten to a single space; lines are trimmed.
Collapse WS off	Raw whitespace from the source is kept, including indentation between tags.
HTML entities	Always decoded. `&amp;` becomes `&`.
Script and style bodies	Included as text. Remove tags upstream if you do not want their content.

FAQ

How is this different from strip HTML tags?

Strip HTML tags is a regex-based tool that removes anything between < and >. It is fast for tiny snippets but does not decode entities and can choke on nested or malformed markup. This tool uses the browser's real DOMParser, which builds a proper DOM, decodes entities, and returns the text content of that parsed document, script and style bodies included.
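For comparison, a naive regex stripper might look like the sketch below. It is an assumption made for illustration and is not the code behind strip HTML tags; stripTagsNaive is a hypothetical name.

```javascript
// Naive tag stripper for comparison; an assumption, not the strip HTML tags source.
function stripTagsNaive(html) {
  // Deletes everything from "<" to the next ">".
  return html.replace(/<[^>]*>/g, "");
}

console.log(stripTagsNaive('<a href="https://example.com">Visit &amp; say hi</a>'));
// prints: Visit &amp; say hi   (the entity is left undecoded)

console.log(stripTagsNaive('<img alt="a > b">text'));
// leaves ' b">text' because the ">" inside the attribute ends the match early
```

A DOM-based extractor avoids both problems because the parser, not a pattern, decides where each tag ends.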

Why does my <br> tag disappear?

Because <br> elements have no text content, the DOM walker skips them. Turn on BR to Newline in the action bar to swap each <br> for a literal newline before parsing.

Are <script> and <style> bodies removed?

No. textContent walks every node, so script and style code comes through as text. If you only want the visible body copy, run find and replace with regex first to delete <script[^>]*>[\s\S]*?</script> and the equivalent for style, then paste the result here.
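Applied in code, that upstream clean-up could look like the following sketch. The function name removeScriptAndStyle is hypothetical, the style pattern is the assumed "equivalent" of the script one, and the case-insensitive flag is an addition. Like any regex approach, it can miss unusual markup such as a closing tag written with extra whitespace.

```javascript
// Sketch of the upstream clean-up step; removeScriptAndStyle is a hypothetical name.
function removeScriptAndStyle(html) {
  return html
    .replace(/<script[^>]*>[\s\S]*?<\/script>/gi, "")
    .replace(/<style[^>]*>[\s\S]*?<\/style>/gi, "");
}

const page = "<style>p { color: red; }</style><p>Hello</p><script>alert(1);</script>";
console.log(removeScriptAndStyle(page)); // prints: <p>Hello</p>
```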

Are HTML entities decoded?

Yes. Named (&amp;), numeric (&#8212;) and hex (&#x2014;) entities all decode to their characters because the browser does it for you when it builds the DOM.

Is anything sent to a server?

No. DOMParser runs entirely in your browser. Nothing uploads, nothing is logged.

Also known as

extract text from html, html to text, html text extractor, pull text out of html, dom text content, html stripper, parse html to plain text, get text from markup
