How to convert html entities to readable text?
I have HTML numeric entities like &#x119; and want to convert them to real characters. I have emails, mostly from LinkedIn, that look like this:
chcia&#x142;abym zapyta&#x107;, czy rozwa&#x17C;a Pan takze udzia&#x142; w nowych projektach w Warszawie ? Obecnie poszukujemy specjalisty javascript/architekta z bardzo dobr&#x105; znajomo&#x15B;ci&#x105; Angular.js do projektu, kt&#xF3;ry dotyczy systemu, s&#x142;u&#x17C;&#x105;cego do monitorowania i zarz&#x105;dzania flot&#x105; pojazd&#xF3;w. Zesp&#xF3;&#x142;, do kt&#xF3;rego poszukujemy
I tried:
xclip -o -sel clip | html2text | less
but it didn't convert the entities. Is there a way to get that text using command-line tools? The only way I can think of is to use data:text/html,
With Free recode (formerly known as GNU recode):
recode html
If you don't have recode or HTML::Entities and only need to decode &#x<hex>; entities, you could do it by hand with:
perl -Mopen=locale -pe 's/&#x([\da-f]+);/chr hex $1/gie'
I didn't have html2text; not sure it matters. This example fails with recode: Request 'html' is erroneous. It seems it now needs to be run with a range instead of a single identifier: recode html..utf-8. A bit strange, but I guess it's all just translating between encodings at some level.
@Pysis, you'll notice the first version of this answer had html.., later changed to html in 2014. html alone definitely works with the latest version (git head from December 2019) and with 3.6 from 2008. Is it possible you have a very old version?
From How can I decode HTML entities? on StackOverflow, you may be able to implement a simple perl solution such as
perl -Mopen=locale -MHTML::Entities -pe '$_ = decode_entities($_)' email.txt
e.g. using your example text
$ perl -Mopen=locale -MHTML::Entities -pe '$_ = decode_entities($_)' email.txt
chciałabym zapytać, czy rozważa Pan takze udział w nowych projektach w Warszawie ? Obecnie poszukujemy specjalisty javascript/architekta z bardzo dobrą znajomością Angular.js do projektu, który dotyczy systemu, służącego do monitorowania i zarządzania flotą pojazdów. Zespół, do którego poszukujemy
With -Mopen=locale, I/O is done in the locale's character set. That includes input from email.txt. It looks like email.txt contains only ASCII characters (the whole point of encoding those characters using the &#x<hex>; notation, I suppose), but if not, you may need to adapt the above to decode that file using the right charset (if it's not the same as the locale's) instead of using open=locale.
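If Python is easier to reach than Perl, a roughly equivalent filter can be sketched with html.unescape from the standard library, which decodes numeric references such as &#x119; as well as named ones such as &amp;. This is only a sketch: the function name decode_entities and the script name used below are made up for illustration, and it assumes the input arrives on standard input in the locale's encoding.

```python
import html
import sys


def decode_entities(text: str) -> str:
    """Replace HTML character references (&#x119;, &#281;, &amp;, ...) with real characters."""
    return html.unescape(text)


if __name__ == "__main__":
    # Filter mode: read standard input line by line and write the decoded text.
    for line in sys.stdin:
        sys.stdout.write(decode_entities(line))
```

Saved under a hypothetical name such as decode_entities.py, it could be used as xclip -o -sel clip | python3 decode_entities.py | less.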
HTML to UNFORMATTED plain text?
I’m looking for a way to convert a folder full of HTML files to plain text. What I want is for the text files to be as much as possible like what I’d get if I selected all the text in a web browser, copied it, and pasted the text into a plain text file. NO, REALLY, I WANT UNFORMATTED PLAIN TEXT. All of the solutions that I’m finding produce Markdown or something that looks like it, or tries to preserve layout, or uses asterisks and underscores to indicate text formatting, or preserves the content of scripts in the output file, or some clever goddam thing. All I want is the words written by the author in the order that the author wrote them. I don’t even care if the processing converts all of the list items in a list into a single paragraph, or even collapses the entire document into a single paragraph. Any of this is much better than giving me anything at all other than the actual language contained in the document. I’d love a terminal application or Python script, but I’ll take anything I can get.
Yup, sed can do it, and so can a host of other utilities. This is a basic scrape for content, I think, but you're not saying whether you want the header information: there are tags that don't show in the body, including JavaScript and such. Can you clarify whether what you want is just the text content of a page?
@gronostaj That gets me closer, but isn't perfect: some tags (those that force a line break) act as whitespace and really should be converted into space characters, because they separate actual words (as in «Here are some lines<br>in a quote»). OTOH, some tags (like <script> for inline scripts) are or can be containers for things that don't count as «plain text.»
html2text(1) — Linux man page
html2text -help
html2text -version
html2text [ -unparse | -check ] [ -debug-scanner ] [ -debug-parser ] [ -rcfile path ] [ -style ( compact | pretty ) ] [ -width width ] [ -o output-file ] [ -nobs ] [ -ascii ] [ input-url ... ]
Description
html2text reads HTML documents from the input-urls, formats each of them into a stream of plain text characters, and writes the result to standard output (or into output-file, if the -o command line option is used).
Documents that are specified by a URL (RFC 1738) that begins with «http:» are retrieved with the Hypertext Transfer Protocol (RFC 1945). URLs that begin with «file:» and URLs that do not contain a colon specify local files. All other URLs are invalid.
If no input-urls are specified on the command line, html2text reads from standard input. A dash as the input-url is an alternate way to specify standard input.
html2text understands all HTML 3.2 constructs, but can render only part of them due to the limitations of the text output format. However, the program attempts to provide good substitutes for the elements it cannot render. html2text parses HTML 4 input, too, but not always as successfully as other HTML processors. It also accepts syntactically incorrect input, and attempts to interpret it «reasonably».
The way html2text formats the HTML documents is controlled by formatting properties read from an RC file. html2text attempts to read $HOME/.html2textrc (or the file specified by the -rcfile command line option); if that file cannot be read, html2text attempts to read /etc/html2textrc. If no RC file can be read (or if the RC file does not override all formatting properties), then «reasonable» defaults are assumed. The RC file format is described in the html2textrc(5) manual page.
Options
-ascii
By default, html2text uses ISO 8859-1 for the output. Specifying this option, plain ASCII is used instead. To find out how non-ASCII characters are rendered, refer to the file «ascii.substitutes».
-check
This option is for diagnostic purposes: The HTML document is only parsed and not processed otherwise. In this mode of operation, html2text will report on parse errors and scan errors, which it does not in other modes of operation. Note that parse and scan errors are not fatal for html2text, but may cause mis-interpretation of the HTML code and/or portions of the document being swallowed.
-debug-parser
Let html2text report on the tokens being shifted, rules being applied, etc., while scanning the HTML document. This option is for diagnostic purposes.
-debug-scanner
Let html2text report on each lexical token scanned, while scanning the HTML document. This option is for diagnostic purposes.
-help
Print command line summary and exit.
-nobs
By default, html2text renders underlined letters with sequences like «underscore-backspace-character» and boldface letters like «character-backspace-character», which works fine when the output is piped into more(1), less(1), or similar. For other applications, or when redirecting the output into a file, it may be desirable not to render character attributes with such backspace sequences, which can be accomplished with this command line option.
-o output-file
Write the output to output-file instead of standard output. A dash as the output-file is an alternate way to specify the standard output.
-rcfile path
Attempt to read the file specified in path as RC file.
-style ( compact | pretty )
Style pretty changes some of the default values of the formatting parameters documented in html2textrc(5). To find out which and how the formatting parameter defaults are changed, check the file «pretty.style». If this option is omitted, style compact is assumed as default.
-unparse
This option is for diagnostic purposes: Instead of formatting the parsed document, generate HTML code, that is guaranteed to be syntactically correct. If html2text has problems parsing a syntactically incorrect HTML document, this option may help you to understand what html2text thinks that the original HTML code means.
-version
Print program version and exit.
-width width
By default, html2text formats the HTML documents for a screen width of 79 characters. If redirecting the output into a file, or if your terminal has a width other than 80 characters, or if you just want to get an idea how html2text deals with large tables and different terminal widths, you may want to specify a different width.
Files
/etc/html2textrc
System wide parser configuration file.
$HOME/.html2textrc
Personal parser configuration file, overrides the system wide values.
Conforming To
HTML 3.2 (HTML 3.2 Reference Specification — http://www.w3.org/TR/REC-html32),
RFC 1945 (Hypertext Transfer Protocol — HTTP).
Restrictions
html2text provides only a basic implementation of the Hypertext Transfer Protocol (HTTP). It requires the complete and exactly matching URL to be given as argument and will not follow redirections (HTTP 301/307).
html2text was written to convert HTML 3.2 documents. When using it with HTML 4 or even XHTML 1 documents, some constructs present only in these HTML versions might not be rendered.
Author
html2text was written up to version 1.2.2 by Arno Unkrig for GMRS Software GmbH, Unterschleissheim.
Echo HTML Into Text File [duplicate]
That depends on how bash was built. In the bash of Solaris 11.4 for instance, \x sequences are expanded by default and echo -e outputs -e (as POSIX currently requires). Use printf instead to get a consistent behaviour.
You can solve this by using single quotes instead of double quotes. So, this should work as expected:
echo '\n\n\t\n\t\tHello World!\n\t\n' > index.html
When you use single quotes, bash doesn’t try to interpret special characters and simply preserves the literal string.
You are triggering a history expansion in bash with the ! character. Either turn off history expansions with set +H, use a single-quoted string, or use a here-document to write your HTML:
$ cat >index.html <<'END_HTML'
Hello World!
END_HTML
Or, if you want to write out those encoded tabs and newlines as they are:
$ cat >index.html <<'END_HTML'
\n\n\t\n\t\tHello World!\n\t\n
END_HTML
History expansions are not triggered within here-documents in bash.