- html_entity_decode
- Parameters
- Return Values
- htmlspecialchars_decode
- Parameters
- Return Values
- Changelog
- soundasleep/html2text
- README.md
html_entity_decode
html_entity_decode() is the opposite of htmlentities() in that it converts HTML entities in the string to their corresponding characters.
More precisely, this function decodes all the entities (including all numeric entities) that a) are necessarily valid for the chosen document type — i.e., for XML, this function does not decode named entities that might be defined in some DTD — and b) whose character or characters are in the coded character set associated with the chosen encoding and are permitted in the chosen document type. All other entities are left as is.
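As a rough illustration of the rules above, the sketch below decodes named, decimal, and hexadecimal entities as HTML 4.01, then decodes a second string as XML 1 to show a named entity being left alone. The sample strings and variable names are invented for the example.

```php
<?php
// Named, decimal, and hexadecimal entities in one input string.
$encoded = 'caf&eacute; &amp; cr&egrave;me: &#8364;5 &#x263A;';

// Decode as HTML 4.01 into UTF-8; the encoding is passed explicitly.
$asHtml = html_entity_decode($encoded, ENT_QUOTES | ENT_HTML401, 'UTF-8');

// Decode as XML 1: &eacute; is not a built-in XML entity, so it stays as is,
// while &lt; and the numeric entity &#233; are decoded.
$asXml = html_entity_decode('&eacute; &lt; &#233;', ENT_QUOTES | ENT_XML1, 'UTF-8');

echo $asHtml, PHP_EOL; // café & crème: €5 ☺
echo $asXml, PHP_EOL;  // &eacute; < é
```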
Parameters
flags: A bitmask of one or more of the following flags, which specify how to handle quotes and which document type to use. The default is ENT_QUOTES | ENT_SUBSTITUTE | ENT_HTML401.
Constant Name | Description |
---|---|
ENT_COMPAT | Will convert double-quotes and leave single-quotes alone. |
ENT_QUOTES | Will convert both double and single quotes. |
ENT_NOQUOTES | Will leave both double and single quotes unconverted. |
ENT_SUBSTITUTE | Replace invalid code unit sequences with a Unicode Replacement Character U+FFFD (UTF-8) or &#xFFFD; (otherwise) instead of returning an empty string. |
ENT_HTML401 | Handle code as HTML 4.01. |
ENT_XML1 | Handle code as XML 1. |
ENT_XHTML | Handle code as XHTML. |
ENT_HTML5 | Handle code as HTML 5. |
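To make the quote flags in the table above concrete, the following sketch decodes one input under each of the three quote modes. The sample string and variable names are made up for the example.

```php
<?php
// The same input decoded under each quote-handling flag.
$input = '&quot;double&quot; and &#039;single&#039;';

// ENT_COMPAT: only double quotes are converted.
$compat = html_entity_decode($input, ENT_COMPAT | ENT_HTML401, 'UTF-8');

// ENT_QUOTES: both double and single quotes are converted.
$quotes = html_entity_decode($input, ENT_QUOTES | ENT_HTML401, 'UTF-8');

// ENT_NOQUOTES: neither kind of quote is converted.
$noQuotes = html_entity_decode($input, ENT_NOQUOTES | ENT_HTML401, 'UTF-8');

echo $compat, PHP_EOL;   // "double" and &#039;single&#039;
echo $quotes, PHP_EOL;   // "double" and 'single'
echo $noQuotes, PHP_EOL; // &quot;double&quot; and &#039;single&#039;
```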
encoding: An optional argument defining the encoding used when converting characters.
If omitted, encoding defaults to the value of the default_charset configuration option.
Although this argument is technically optional, you are highly encouraged to specify the correct value for your code if the default_charset configuration option may be set incorrectly for the given input.
The following character sets are supported:
Charset | Aliases | Description |
---|---|---|
ISO-8859-1 | ISO8859-1 | Western European, Latin-1. |
ISO-8859-5 | ISO8859-5 | Little-used Cyrillic charset (Latin/Cyrillic). |
ISO-8859-15 | ISO8859-15 | Western European, Latin-9. Adds the Euro sign, French and Finnish letters missing in Latin-1 (ISO-8859-1). |
UTF-8 | | ASCII compatible multi-byte 8-bit Unicode. |
cp866 | ibm866, 866 | DOS-specific Cyrillic charset. |
cp1251 | Windows-1251, win-1251, 1251 | Windows-specific Cyrillic charset. |
cp1252 | Windows-1252, 1252 | Windows-specific charset for Western European. |
KOI8-R | koi8-ru, koi8r | Russian. |
BIG5 | 950 | Traditional Chinese, mainly used in Taiwan. |
GB2312 | 936 | Simplified Chinese, national standard character set. |
BIG5-HKSCS | | Big5 with Hong Kong extensions, Traditional Chinese. |
Shift_JIS | SJIS, SJIS-win, cp932, 932 | Japanese. |
EUC-JP | EUCJP, eucJP-win | Japanese. |
MacRoman | | Charset that was used by Mac OS. |
'' | | An empty string activates detection from script encoding (Zend multibyte), default_charset and current locale (see nl_langinfo() and setlocale()), in this order. Not recommended. |
Note: Any other character sets are not recognized. The default encoding will be used instead and a warning will be emitted.
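The effect of the encoding argument can be sketched as follows: the same entities are decoded into UTF-8 and into ISO-8859-1. The Euro sign has no code point in ISO-8859-1, so under that encoding the entity is left as is. The input string and variable names are illustrative only.

```php
<?php
// One entity that exists in Latin-1 (é) and one that does not (€).
$input = '&eacute; &euro;';

// UTF-8 can represent both characters, so both entities are decoded.
$utf8 = html_entity_decode($input, ENT_QUOTES | ENT_HTML401, 'UTF-8');

// ISO-8859-1 has no Euro sign, so &euro; is left untouched.
$latin1 = html_entity_decode($input, ENT_QUOTES | ENT_HTML401, 'ISO-8859-1');

echo strlen($utf8), PHP_EOL;   // 6 bytes: é (2) + space (1) + € (3)
echo strlen($latin1), PHP_EOL; // 8 bytes: é (1) + space (1) + "&euro;" (6)
```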
Return Values
Returns the decoded string.
htmlspecialchars_decode
This function is the opposite of htmlspecialchars(). It converts special HTML entities back to characters.
The converted entities are: &amp;, &quot; (when ENT_NOQUOTES is not set), &#039; (when ENT_QUOTES is set), &lt; and &gt;.
Parameters
flags: A bitmask of one or more of the following flags, which specify how to handle quotes and which document type to use. The default is ENT_QUOTES | ENT_SUBSTITUTE | ENT_HTML401.
Constant Name | Description |
---|---|
ENT_COMPAT | Will convert double-quotes and leave single-quotes alone. |
ENT_QUOTES | Will convert both double and single quotes. |
ENT_NOQUOTES | Will leave both double and single quotes unconverted. |
ENT_SUBSTITUTE | Replace invalid code unit sequences with a Unicode Replacement Character U+FFFD (UTF-8) or &#xFFFD; (otherwise) instead of returning an empty string. |
ENT_HTML401 | Handle code as HTML 4.01. |
ENT_XML1 | Handle code as XML 1. |
ENT_XHTML | Handle code as XHTML. |
ENT_HTML5 | Handle code as HTML 5. |
Return Values
Returns the decoded string.
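A short comparison may help show how this function differs from html_entity_decode(): only the special characters are converted, and other named entities pass through unchanged. The input string and variable names below are invented for the example.

```php
<?php
// Markup that was escaped once, containing an extra named entity (&eacute;)
// and a doubly escaped one (&amp;copy;).
$input = '&lt;b&gt;caf&eacute;&lt;/b&gt; &amp;copy; 2024';

// Only &amp;, &quot;, &#039;, &lt; and &gt; are converted.
$special = htmlspecialchars_decode($input, ENT_QUOTES | ENT_HTML401);

// html_entity_decode() also converts &eacute;. Decoding is a single pass,
// so &amp;copy; becomes &copy; and is not decoded a second time.
$all = html_entity_decode($input, ENT_QUOTES | ENT_HTML401, 'UTF-8');

echo $special, PHP_EOL; // <b>caf&eacute;</b> &copy; 2024
echo $all, PHP_EOL;     // <b>café</b> &copy; 2024
```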
Changelog
Version | Description |
---|---|
8.1.0 | The default value of flags changed from ENT_COMPAT to ENT_QUOTES \| ENT_SUBSTITUTE \| ENT_HTML401. |
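Because the default flags differ before and after 8.1.0, one option is to pass them explicitly so that the result does not depend on the PHP version. The helper below is a hypothetical wrapper written for this illustration, not part of PHP.

```php
<?php
// Hypothetical helper: always passes the flags explicitly, so single quotes
// are decoded the same way regardless of the PHP version's default.
function decode_special_chars(string $text): string
{
    return htmlspecialchars_decode($text, ENT_QUOTES | ENT_SUBSTITUTE | ENT_HTML401);
}

$result = decode_special_chars('&lt;a title=&#039;x&#039;&gt;');

echo $result, PHP_EOL; // <a title='x'>
```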
A PHP component to convert HTML into a plain text format
soundasleep/html2text
README.md
html2text is a very simple script that uses DOM methods to convert HTML into a format similar to what would be rendered by a browser — perfect for places where you need a quick text representation. For example:
    <html>
    <title>Ignored Title</title>
    <body>
      <h1>Hello, World!</h1>

      <p>This is some e-mail content.
      Even though it has whitespace and newlines, the e-mail converter
      will handle it correctly.

      <p>Even mismatched tags.</p>

      <div>A div</div>
      <div>Another div</div>
      <div>A div<div>within a div</div></div>

      <a href="http://foo.com">A link</a>

    </body>
    </html>

Will be converted into:

    Hello, World!

    This is some e-mail content. Even though it has whitespace and newlines, the e-mail converter will handle it correctly.

    Even mismatched tags.

    A div
    Another div
    A div
    within a div

    [A link](http://foo.com)
You can use Composer to add the package to your project:
    {
      "require": {
        "soundasleep/html2text": "~1.1"
      }
    }
And then use it quite simply:
$text = \Soundasleep\Html2Text::convert($html);
You can also include the supplied html2text.php and use $text = convert_html_to_text($html); instead.
Option | Default | Description |
---|---|---|
ignore_errors | false | Set to true to ignore any XML parsing errors. |
drop_links | false | Set to true to not render links as [My Link](http://foo.com), but rather just My Link. |
char_set | 'auto' | Specify a character set. Pass multiple comma-separated character sets to detect the encoding; the default is ASCII,UTF-8. |
Pass along options as a second argument to convert , for example:
    $options = array(
      'ignore_errors' => true,
      // other options go here
    );
    $text = \Soundasleep\Html2Text::convert($html, $options);
Some very basic tests are provided in the tests/ directory. Run them with composer install && vendor/bin/phpunit.
Class ‘DOMDocument’ not found
You need to install the PHP XML extension for your PHP version, e.g. apt-get install php7.4-xml.
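Before calling the converter, a script can check whether DOMDocument is available at all. This is a minimal sketch, assuming the class is provided by PHP's dom extension; the variable name and the message text are made up for the example.

```php
<?php
// True when the dom extension is loaded and the DOMDocument class exists.
$hasDom = extension_loaded('dom') && class_exists('DOMDocument');

if (!$hasDom) {
    // Illustrative message only; adapt the package name to your PHP version.
    echo 'DOMDocument is unavailable: install the PHP XML extension.', PHP_EOL;
}
```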
html2text is licensed under MIT, making it suitable for both Eclipse and GPL projects.
Also see html2text_ruby, a Ruby implementation.