- NAME
- SYNOPSIS
- EXPORTS
- REQUIRES
- DESCRIPTION
- FUNCTIONS
- BUGS
- AUTHOR
- VERSION
NAME
Convert::SimpleHtml2Text — Converts an HTML document to text.
SYNOPSIS
    use Convert::SimpleHtml2Text;
    $plainText = simpleHtmlToText($htmlText);
    $plainText = simpleHtmlToText($htmlText, 1);
EXPORTS
Default: simpleHtmlToText
REQUIRES
DESCRIPTION
Converts an HTML document to text by stripping out all formatting tags and doing simple conversions. On each line, runs of multiple blanks are collapsed, and leading and trailing blanks are removed. Multiple consecutive line breaks are removed.
Example: This block of HTML is an abbreviated version of Google’s search page, a refreshingly simple web page.
    <html>
    <head>
    <meta http-equiv="content-type" content="text/html; charset=UTF-8">
    <title>Google</title>
    </head>
    <body bgcolor=#ffffff text=#000000 link=#0000cc vlink=#551a8b alink=#ff0000 onLoad=sf()>
    <br>
    <form action="/search" name=f>
    <table cellspacing=0 cellpadding=0>
    <tr>
    <input type=submit value="I'm Feeling Lucky" name=btnI>
    </td>
    <td valign=top nowrap>
    • Advanced Search • Preferences • Language Tools
    </td>
    </tr>
    <p>
    <font size=-2>©2002 Google</font>
    <font size=-2>- Searching 3,083,324,652 web pages</font>
    </body>
    </html>
Running that piece of HTML through the simpleHtmlToText function results in:
Google • Advanced Search • Preferences • Language Tools ©2002 Google - Searching 3,083,324,652 web pages
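The optional keepImages flag allows you to retain a little bit of information about a graphic file: its base file name. Given input that combines an image tag with the text "Some text here.", the output keeps the image's base file name in [# #] brackets alongside that text, assuming you have set the keepImages flag to true.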
FUNCTIONS
simpleHtmlToText(text, keepImages)
simpleHtmlToText(text)
Convert an HTML document (represented by a string) into text, optionally keeping the image references. If keepImages is true, then image references will be condensed and retained. For example, an img tag whose source file is hello.gif will be replaced by [#hello.gif#]. The «[# #]» brackets are used for easy selection of images by other applications.
text — string; a string representing the HTML document.
keepImages — optional; a boolean indicating to keep image file names in output.
Returns a string representing the text of the HTML document.
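The behaviour described above can be approximated in a few lines of plain Perl. This is only a rough sketch under stated assumptions, not the module's actual code: the name sketch_html_to_text is hypothetical, turning br and p tags into line breaks is a guess at what "simple conversions" covers, and character entities are left untouched.

```perl
use strict;
use warnings;
use File::Basename qw(basename);

# Hypothetical stand-in for simpleHtmlToText; NOT the module's implementation.
sub sketch_html_to_text {
    my ($html, $keep_images) = @_;
    my $text = $html;

    if ($keep_images) {
        # Condense each image reference to [#basename#].
        $text =~ s/<img\b[^>]*?\bsrc\s*=\s*["']?([^"'\s>]+)[^>]*>/'[#' . basename($1) . '#]'/gie;
    }

    # Assumption: line-break and paragraph tags become newlines.
    $text =~ s/<\s*(?:br|p)\b[^>]*>/\n/gi;

    # Strip every remaining tag.
    $text =~ s/<[^>]*>//g;

    my @lines;
    for my $line (split /\n/, $text) {
        $line =~ s/\s+/ /g;      # collapse runs of blanks
        $line =~ s/^ | $//g;     # trim leading and trailing blanks
        push @lines, $line if length $line;   # drop empty lines
    }
    return join "\n", @lines;
}

my $html = "<html><body>\n<p>  Some   text <b>here</b>.  </p>\n\n\n"
         . "<img src=\"images/hello.gif\">\n</body></html>";

print sketch_html_to_text($html), "\n";
print sketch_html_to_text($html, 1), "\n";
```

A regular-expression approach like this is fragile on real-world HTML (comments, scripts, a ">" inside an attribute value), so treat it purely as an illustration of the documented rules.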
BUGS
AUTHOR
VERSION
$Revision: 8 $ $Date: 2006-12-19 21:13:43 -0800 (Tue, 19 Dec 2006) $
perl HTML::Formatter family of html to text/postscript/rtf conversion modules
HTML::Formatter — Base class for HTML formatters
    use HTML::FormatSomething;
    my $infile  = "whatever.html";
    my $outfile = "whatever.file";
    open OUT, ">$outfile"
      or die "Can't write-open $outfile: $!\n";
    print OUT HTML::FormatSomething->format_file(
        $infile,
        'option1' => 'value1',
        'option2' => 'value2',
        ...
    );
    close(OUT);
HTML::Formatter is a base class for classes that take HTML and format it to some output format. When you take an object of a class derived from it and call $formatter->format( $tree ) with an HTML::TreeBuilder (or HTML::Element) object, it returns the appropriately formatted string for the input HTML.
HTML formatters are able to format an HTML syntax tree into various printable formats. Different formatters produce output for different output media. Common to all formatters is that they return the formatted output when the format() method is called. The format() method takes an HTML::Element object (usually the HTML::TreeBuilder root object) as its parameter.
    my $formatter = FormatterClass->new( option1 => value1, option2 => value2, ... );
This creates a new formatter object with the given options.
    $string = FormatterClass->format_file( $html_source, option1 => value1, option2 => value2, ... );
Return a string consisting of the result of using the given class to format the given HTML file according to the given (optional) options. Internally it calls SomeClass->new( ... )->format( ... ) on a new HTML::TreeBuilder object based on the given HTML file.
    $string = FormatterClass->format_string( $html_source, option1 => value1, option2 => value2, ... );
Return a string consisting of the result of using the given class to format the given HTML source according to the given (optional) options. Internally it calls SomeClass->new( ... )->format( ... ) on a new HTML::TreeBuilder object based on the given source.
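As a concrete illustration of the class-method form, the sketch below uses HTML::FormatText, the plain-text formatter in this distribution, with its leftmargin and rightmargin options. The HTML snippet and the margin values are made up for the example, and the exact layout of the output depends on the formatter.

```perl
use strict;
use warnings;
use HTML::FormatText;

# Made-up input for illustration only.
my $html = '<html><body><h1>Release notes</h1>'
         . '<p>First paragraph of <b>formatted</b> text.</p></body></html>';

# One call parses the source, builds a formatter and renders the text.
my $text = HTML::FormatText->format_string(
    $html,
    leftmargin  => 0,
    rightmargin => 50,
);

print $text;
```

Because format_string builds a fresh tree and a fresh formatter on every call, it is a convenient way to avoid reusing either object.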
my $render_string = $formatter->format( $html_tree_object );
This renders the given HTML object according to the options set for $formatter.
After you’ve used a particular formatter object to format a particular HTML tree object, you probably should not use either again.
The three specific formatters:
HTML::FormatText: Format HTML into plain text
HTML::FormatPS: Format HTML into PostScript
HTML::FormatRTF: Format HTML into Rich Text Format
See also the HTML manipulation libraries used: HTML::TreeBuilder, HTML::Element and HTML::Tree.
See perlmodinstall for information and options on installing Perl modules.
No bugs have been reported.
Please report any bugs or feature requests through the web interface at http://rt.cpan.org/Public/Dist/Display.html?Name=HTML-Format.
The latest version of this module is available from the Comprehensive Perl Archive Network (CPAN). Visit http://www.perl.com/CPAN/ to find a CPAN site near you, or see http://search.cpan.org/dist/HTML-Format/.
The development version lives at http://github.com/nigelm/html-format and may be cloned from git://github.com/nigelm/html-format.git. Instead of sending patches, please fork this project using the standard git and github infrastructure.
This software is copyright (c) 2011 by Nigel Metheringham, 2002-2005 Sean M Burke, 1999-2002 Gisle Aas.
This is free software; you can redistribute it and/or modify it under the same terms as the Perl 5 programming language system itself.
HTML::Formatter — Base class for HTML formatters
    use HTML::FormatSomething;
    my $infile  = "whatever.html";
    my $outfile = "whatever.file";
    open OUT, ">$outfile"
      or die "Can't write-open $outfile: $!\n";
    print OUT HTML::FormatSomething->format_file(
        $infile,
        'option1' => 'value1',
        'option2' => 'value2',
        ...
    );
    close(OUT);
HTML::Formatter is a base class for classes that take HTML and format it to some output format. When you take an object of a class derived from it and call $formatter->format( $tree ) with an HTML::TreeBuilder (or HTML::Element) object, it returns the appropriately formatted string for the input HTML.
HTML formatters are able to format an HTML syntax tree into various printable formats. Different formatters produce output for different output media. Common to all formatters is that they return the formatted output when the format() method is called. The format() method takes an HTML::Element object (usually the HTML::TreeBuilder root object) as its parameter.
The distribution name has been changed to HTML-Formatter, as detailed in «DISTRIBUTION NAME».
    my $formatter = FormatterClass->new( option1 => value1, option2 => value2, ... );
This creates a new formatter object with the given options.
    $string = FormatterClass->format_file( $html_source, option1 => value1, option2 => value2, ... );
Return a string consisting of the result of using the given class to format the given HTML file according to the given (optional) options. Internally it calls SomeClass->new( ... )->format( ... ) on a new HTML::TreeBuilder object based on the given HTML file.
    $string = FormatterClass->format_string( $html_source, option1 => value1, option2 => value2, ... );
Return a string consisting of the result of using the given class to format the given HTML source according to the given (optional) options. Internally it calls SomeClass->new( ... )->format( ... ) on a new HTML::TreeBuilder object based on the given source.
my $render_string = $formatter->format( $html_tree_object );
This renders the given HTML object according to the options set for $formatter.
After you’ve used a particular formatter object to format a particular HTML tree object, you probably should not use either again.
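The object-based form can be sketched as follows, again assuming HTML::FormatText as the concrete formatter and a made-up HTML fragment. The tree comes from HTML::TreeBuilder's new_from_content constructor, and both the tree and the formatter are used exactly once, in line with the advice above.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;
use HTML::FormatText;

# Build a syntax tree from an illustrative HTML fragment.
my $tree = HTML::TreeBuilder->new_from_content('<p>Hello, <i>world</i>.</p>');

# Create a formatter with explicit margins and render the tree once.
my $formatter = HTML::FormatText->new( leftmargin => 0, rightmargin => 60 );
my $rendered  = $formatter->format($tree);

# Release the tree instead of reusing it for another formatting pass.
$tree->delete;

print $rendered;
```

For a second document, create a new tree and a new formatter rather than recycling these two objects.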
This module was originally named HTML-Format despite not containing an HTML::Format module within it. As rules on naming have been taken more seriously, and the PAUSE toolchain adapted so that getting the distribution indexed was more difficult, it became obvious that I should rename the distribution to HTML-Formatter, matching the base HTML::Formatter module.
As of release 2.13 this is released as the HTML-Formatter distribution, with corresponding changes to the git repository name and associated items.
Due to the way that the module is put together, this should have no effect on code using the module. The only issues will be where the distribution name was used within dependencies.
The three specific formatters:
HTML::FormatText: Format HTML into plain text
HTML::FormatPS: Format HTML into PostScript
HTML::FormatRTF: Format HTML into Rich Text Format
See also the HTML manipulation libraries used: HTML::TreeBuilder, HTML::Element and HTML::Tree.
Please report any bugs or feature requests through the issue tracker at http://rt.cpan.org/Public/Dist/Display.html?Name=HTML-Formatter. You will be notified automatically of any progress on your issue.
This is open source software. The code repository is available for public review and contribution under the terms of the license.
git clone https://github.com/nigelm/html-formatter.git
This software is copyright (c) 2016 by Nigel Metheringham, 2002-2005 Sean M Burke, 1999-2002 Gisle Aas.
This is free software; you can redistribute it and/or modify it under the same terms as the Perl 5 programming language system itself.