PHP library to convert HTML to formatted plain text
mtibben/html2text
A PHP library for converting HTML to formatted plain text.
composer require html2text/html2text
    $html = new \Html2Text\Html2Text('Hello, &quot;<b>world</b>&quot;');
    echo $html->getText();  // Hello, "WORLD"
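The example's output suggests that bold text is rendered in upper case and that entities are decoded. As a rough illustration of that idea only (written in Python with the standard library, not the PHP library's actual code), a minimal sketch could look like the following. The class `UppercaseBoldParser` and the function `to_text` are hypothetical names introduced here.

```python
from html.parser import HTMLParser


class UppercaseBoldParser(HTMLParser):
    """Hypothetical helper: collects text, uppercasing anything inside <b> or <strong>."""

    def __init__(self):
        # convert_charrefs=True lets the parser decode entities such as &quot;
        super().__init__(convert_charrefs=True)
        self._bold_depth = 0
        self._parts = []

    def handle_starttag(self, tag, attrs):
        if tag in ("b", "strong"):
            self._bold_depth += 1

    def handle_endtag(self, tag):
        if tag in ("b", "strong") and self._bold_depth > 0:
            self._bold_depth -= 1

    def handle_data(self, data):
        # Text inside a bold element is emphasised by upper-casing it.
        self._parts.append(data.upper() if self._bold_depth else data)

    def get_text(self):
        return "".join(self._parts)


def to_text(html):
    """Assumed convenience wrapper around the sketch parser."""
    parser = UppercaseBoldParser()
    parser.feed(html)
    parser.close()
    return parser.get_text()


print(to_text('Hello, &quot;<b>world</b>&quot;'))  # Hello, "WORLD"
```

The real library handles far more elements (links, lists, headings, tables); this sketch covers only the single rule visible in the example.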
This library started life on the blog of Jon Abernathy http://www.chuggnutt.com/html2text
A number of projects picked up the library and started using it — among those was RoundCube mail. They made a number of updates to it over time to suit their webmail client.
Now it has been extracted as a standalone library. Hopefully it can be of use to others.
html2text
html2text is a Python script that converts a page of HTML into clean, easy-to-read plain ASCII text. Better yet, that ASCII also happens to be valid Markdown (a text-to-HTML format).
Usage: html2text [filename [encoding]]
Option | Description |
---|---|
`--version` | Show program's version number and exit |
`-h`, `--help` | Show this help message and exit |
`--ignore-links` | Don't include any formatting for links |
`--escape-all` | Escape all special characters. Output is less readable, but avoids corner case formatting issues. |
`--reference-links` | Use reference links instead of links to create markdown |
`--mark-code` | Mark preformatted and code blocks with [code]...[/code] |
For a complete list of options see the docs
Or you can use it from within Python:

    >>> import html2text
    >>>
    >>> print(html2text.html2text("<p><strong>Zed's</strong> dead baby, <em>Zed's</em> dead.</p>"))
    **Zed's** dead baby, _Zed's_ dead.
Or with some configuration options:
    >>> import html2text
    >>>
    >>> h = html2text.HTML2Text()
    >>> # Ignore converting links from HTML
    >>> h.ignore_links = True
    >>> print(h.handle("<p>Hello, <a href='https://www.google.com/earth/'>world</a>!"))
    Hello, world!

    >>> # Don't ignore links anymore, I like links
    >>> h.ignore_links = False
    >>> print(h.handle("<p>Hello, <a href='https://www.google.com/earth/'>world</a>!"))
    Hello, [world](https://www.google.com/earth/)!
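To make the effect of `ignore_links` more concrete, here is a much-reduced sketch of an HTML-to-Markdown converter built on the standard library's `html.parser`. It is not html2text's implementation; the class `TinyMarkdownConverter` is a hypothetical name, and it handles only bold, emphasis, and links.

```python
from html.parser import HTMLParser


class TinyMarkdownConverter(HTMLParser):
    """Hypothetical, much-reduced converter; not html2text's real implementation."""

    def __init__(self, ignore_links=False):
        super().__init__(convert_charrefs=True)
        self.ignore_links = ignore_links
        self._out = []
        self._hrefs = []  # stack of href values for currently open <a> tags

    def handle_starttag(self, tag, attrs):
        if tag in ("strong", "b"):
            self._out.append("**")
        elif tag in ("em", "i"):
            self._out.append("_")
        elif tag == "a":
            self._hrefs.append(dict(attrs).get("href"))
            if not self.ignore_links:
                self._out.append("[")

    def handle_endtag(self, tag):
        if tag in ("strong", "b"):
            self._out.append("**")
        elif tag in ("em", "i"):
            self._out.append("_")
        elif tag == "a" and self._hrefs:
            href = self._hrefs.pop()
            if not self.ignore_links:
                self._out.append("](%s)" % (href or ""))

    def handle_data(self, data):
        self._out.append(data)

    def handle(self, html):
        """Convert one HTML fragment and return the Markdown-like text."""
        self._out = []
        self._hrefs = []
        self.reset()
        self.feed(html)
        self.close()
        return "".join(self._out).strip()


h = TinyMarkdownConverter()
h.ignore_links = True
print(h.handle("<p>Hello, <a href='https://www.google.com/earth/'>world</a>!"))
h.ignore_links = False
print(h.handle("<p>Hello, <a href='https://www.google.com/earth/'>world</a>!"))
```

With `ignore_links` set, the link text is kept but the `[text](url)` wrapper is skipped, which mirrors the behaviour shown in the session above.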
Originally written by Aaron Swartz. This code is distributed under the GPLv3.
How to install
How to run unit tests
To see the coverage results:
then open the ./htmlcov/index.html file in your browser.
Simple library for extracting text from html
emludei/html_to_text
This simple library helps you to extract useful text information from html documents.
To install html_to_text, simply:
Prepare your html document by removing excess tags from it. To do this, you can use the html_cleaner object returned by the get_html_cleaner function. The get_html_cleaner function takes three parameters:
- remove_without_content : A set of tags which will be removed without their content.
- remove_with_content : A set of tags which will be removed with their content.
- convert_charrefs : If it is True, all character references will be automatically converted to the corresponding Unicode characters (default True).
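The difference between removing a tag with and without its content can be sketched as follows. This is an assumption-laden illustration built on the standard library, not the html_to_text implementation: the class `SketchCleaner` is hypothetical, and only the parameter names are taken from the list above.

```python
from html.parser import HTMLParser


class SketchCleaner(HTMLParser):
    """Hypothetical cleaner; parameter names mirror the list above, the logic is assumed."""

    def __init__(self, remove_without_content=(), remove_with_content=(),
                 convert_charrefs=True):
        super().__init__(convert_charrefs=convert_charrefs)
        self.remove_without_content = set(remove_without_content)
        self.remove_with_content = set(remove_with_content)
        self._skip_depth = 0  # > 0 while inside a tag that is dropped with its content
        self._parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.remove_with_content:
            self._skip_depth += 1
        elif not self._skip_depth and tag not in self.remove_without_content:
            # Keep the tag exactly as it appeared in the source.
            self._parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag in self.remove_with_content:
            self._skip_depth = max(0, self._skip_depth - 1)
        elif not self._skip_depth and tag not in self.remove_without_content:
            self._parts.append("</%s>" % tag)

    def handle_data(self, data):
        if not self._skip_depth:
            self._parts.append(data)

    # Only called when convert_charrefs is False: keep references untouched.
    def handle_entityref(self, name):
        if not self._skip_depth:
            self._parts.append("&%s;" % name)

    def handle_charref(self, name):
        if not self._skip_depth:
            self._parts.append("&#%s;" % name)

    @property
    def data(self):
        return "".join(self._parts)


cleaner = SketchCleaner(remove_without_content={"span"},
                        remove_with_content={"script"})
cleaner.feed("<p>Hi <span>there</span><script>var x = 1;</script></p>")
cleaner.close()
print(cleaner.data)  # <p>Hi there</p>
```

The `span` tags disappear but their text survives, while the `script` element is dropped together with everything inside it.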
To extract useful text from an html document, use the parser object returned by the get_parser function. This function takes:
- tags_to_save : A set of tags for saving.
- tags_to_remove : A set of tags for removing.
- punctuation : Punctuation marks.
- min_allowed_weight : Minimum allowed weight for chunk (html block).
- save_attrs : If true, the attributes of a tag are saved (default False).
- tag_class : Tag class.
- tag_link : Tag link (‘a’ default).
- chunk_class : Chunk class.
- tag_wrapper : Wrapper for tags.
- chunks_wrapper : Wrapper for chunks (blocks with html).
- save_chunks_wrapper : Wrapper for ‘save’ chunks.
- splitter : HTMLSplitter instance, which splits the html document into chunks (small blocks of html).
- chunks_cleaner : HTMLChunksCleaner instance, which removes tags from chunks and calculates the length of links.
- save_chunks_cleaner : HTMLChunksCleaner instance, which removes tags from chunks.
    >>> from html_to_text import get_parser
    >>> parser = get_parser(
    ...     tags_to_save={'title', 'h1', 'h2'},
    ...     tags_to_remove={'h1', 'h2', 'script', 'style'},
    ...     min_allowed_weight=2.3
    ... )
    ...
    >>> parser.feed(cleaner.data)
    >>> print(parser.data)
    This is some text information. This is some text information. This is some text information. This is some text information.
    >>> print(parser.saved_tags)
    {'h1': ['This is h1 example.'], 'title': ['Example'], 'h2': ['This is h2 example.']}
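How `tags_to_save`, `tags_to_remove`, `parser.data` and `parser.saved_tags` relate to each other can be sketched with the standard library. The class `SketchTextParser` below is hypothetical and deliberately ignores chunk weights (`min_allowed_weight`) and all wrapper options, so its output will not match the real parser on every document.

```python
from collections import defaultdict
from html.parser import HTMLParser


class SketchTextParser(HTMLParser):
    """Hypothetical parser: saves text of chosen tags, drops text of others."""

    def __init__(self, tags_to_save=(), tags_to_remove=()):
        super().__init__(convert_charrefs=True)
        self.tags_to_save = set(tags_to_save)
        self.tags_to_remove = set(tags_to_remove)
        self.saved_tags = defaultdict(list)
        self._stack = []  # currently open tags, innermost last
        self._text = []

    def handle_starttag(self, tag, attrs):
        self._stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed children.
        if tag in self._stack:
            while self._stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = " ".join(data.split())  # collapse whitespace
        if not text:
            return
        current = self._stack[-1] if self._stack else None
        if current in self.tags_to_save:
            self.saved_tags[current].append(text)
        # Text survives only if no enclosing tag is marked for removal.
        if not any(tag in self.tags_to_remove for tag in self._stack):
            self._text.append(text)

    @property
    def data(self):
        return " ".join(self._text)


parser = SketchTextParser(tags_to_save={"title", "h1"},
                          tags_to_remove={"title", "h1", "script"})
parser.feed("<html><head><title>Example</title></head>"
            "<body><h1>Heading</h1><p>Body text.</p>"
            "<script>x()</script></body></html>")
parser.close()
print(parser.data)              # Body text.
print(dict(parser.saved_tags))  # {'title': ['Example'], 'h1': ['Heading']}
```

A tag can appear in both sets: its text is then kept in `saved_tags` but excluded from the main text, as with `h1` and `h2` in the example above.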
Convert HTML documents to plain text
kranemora/html2text
Convert HTML documents to plain text.
composer require kranemora/html2text
    $html = <<<EOF
    Welcome to html2text
    The best html to text converter!
    EOF;
    $html2Text = new \kranemora\Html2Text\Html2Text;
    $text = $html2Text->convert($html);
Welcome to html2text The best html to text converter!
    Test Document

    Lorem ipsum

    Lorem ipsum dolor sit amet, consectetur adipiscing elit. Curabitur porttitor nisi nec finibus bibendum. Donec at elementum leo. Donec eu felis vehicula, efficitur est at, fringilla nisi. Donec congue tortor vel pulvinar mattis. Etiam id ornare magna. In dapibus et nisl eget convallis. Etiam eu feugiat ante. Phasellus vulputate nec velit nec sagittis. Vestibulum ante ipsum primis in faucibus orci luctus et ultrices posuere cubilia Curae; Ut gravida accumsan lorem, id viverra nunc ultrices quis. Duis in tristique ligula, vel semper urna.

    Dolor sit amet consectetur adipiscing elit. Curabitur porttitor nisi nec finibus bibendum Donec at elementum leo. Donec eu felis vehicula Efficitur est at.

    +-----------+---------------+-------+
    | Position  | Gender        | Total |
    |           |---------------|       |
    |           | Male | Female |       |
    +-----------+------+--------+-------+
    | Tutor     | 5    | 8      | 13    |
    +-----------+------+--------+-------+
    | Professor | 10   | 8      | 18    |
    +-----------+------+--------+-------+

    Aenean a massa convallis
    - Ultrices magna vitae
    - Gravida velit
    - Nunc lobortis
    - Tortor nec auctor ultricies

    Curabitur bibendum eu diam et venenatis
    - Donec vitae enim suscipit
    - Porta nunc tincidunt
    - Consequat leo
    - Nunc eu risus rutrum

    Lorem ipsum
    - Facebook [https://www.facebook.com]
    - Twitter [https://www.twitter.com]
    - Linkedin [https://www.linkedin.com/]
    - Instagram [https://www.instagram.com]

    Lorem ipsum
    [Ultrices magna vitae], [Gravida velit], [Nunc lobortis], [Tortor nec auctor ultricies]

    [Tortor nec auctor ultricies], [Nunc lobortis], [Gravida velit], [Ultrices magna vitae]
    namespace kranemora\Html2Text\Parsers;

    use DOMElement;

    class OlParser extends BaseParser
    {
        // Override this method and return the node as plain text
        public function getText(DOMElement $node)
        {
            // Gets the options that were set with Html2Text::setDefaultOptions
            $options = $this->getOptions();
            // Write here the algorithm to convert the node to plain text
            return "node in plain text";
        }
    }
Set the Parser to the HTML element
    $options = [
        'ol' => [
            'break' => "\n",
            'parser' => [
                'class' => '\kranemora\Html2Text\Parsers\OlParser',
                'options' => [
                    'reverse' => 0
                ]
            ]
        ]
    ];
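The per-element configuration above pairs an HTML tag with a parser class and its options. The dispatch idea can be sketched in Python as below. This is a language-neutral illustration, not the PHP library's code: `BaseParser`, `OlParser`, `OPTIONS` and `render` are hypothetical stand-ins, list items are passed as plain strings rather than DOM nodes, and the meaning of `reverse` is inferred from the sample output, where the same list appears in both orders.

```python
class BaseParser:
    """Hypothetical base class holding per-element options."""

    def __init__(self, options=None):
        self._options = dict(options or {})

    def get_options(self):
        return self._options

    def get_text(self, items):
        raise NotImplementedError


class OlParser(BaseParser):
    """Renders list items as a bracketed, comma-separated line (assumed format)."""

    def get_text(self, items):
        if self.get_options().get("reverse"):
            items = list(reversed(items))
        return ", ".join("[%s]" % item for item in items)


# Mirrors the shape of the $options array: tag -> break string + parser spec.
OPTIONS = {
    "ol": {
        "break": "\n",
        "parser": {"class": OlParser, "options": {"reverse": 0}},
    },
}


def render(tag, items, options=OPTIONS):
    """Look up the parser configured for `tag`, run it, and append the break."""
    conf = options[tag]
    parser = conf["parser"]["class"](conf["parser"]["options"])
    return parser.get_text(items) + conf["break"]


print(render("ol", ["Ultrices magna vitae", "Gravida velit"]), end="")
# [Ultrices magna vitae], [Gravida velit]
```

Keeping the parser class and its options in one lookup table means a new element type can be supported by adding an entry rather than changing the dispatch code.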
This project is licensed under the MIT license.