Perl html to xml


lacika has asked for the wisdom of the Perl Monks concerning the following question:

Hi everyone, I'm having trouble converting an HTML web page to XML. I know there is a Perl module out there that can do this; I just don't know which one. Fetching the HTML from the web is pretty straightforward; the only thing I need is to output the fetched results as XML. I'm an initiate and a little lost, and I just can't do this by myself. I humbly ask anyone to help with anything that might solve this problem. Many thanks in advance. P.S. Great service you guys run here 🙂

Probably the simplest way is to use the -asxml flag of tidy. Tidy itself is a C program rather than a Perl one, but there is a Perl wrapper for TidyLib, called HTML::Tidy.

$ tidy -asxml foo.html > foo.xml
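If you would rather drive the command-line tidy from a Perl script, a minimal sketch along these lines might do. It assumes a tidy executable is on your PATH; the sub names tidy_command and tidy_to_xml are made up for illustration and are not part of any module.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: build the tidy command line for one conversion.
sub tidy_command {
    my ($in, $out) = @_;
    # -asxml: convert to well-formed XML/XHTML, -quiet: less chatter,
    # -output: write the result to a file instead of STDOUT
    return ('tidy', '-asxml', '-quiet', '-output', $out, $in);
}

# Hypothetical helper: run tidy and interpret its exit status.
sub tidy_to_xml {
    my ($in, $out) = @_;
    system(tidy_command($in, $out));
    die "could not run tidy: $!\n" if $? == -1;
    my $status = $? >> 8;    # tidy exits 0 = clean, 1 = warnings, 2 = errors
    die "tidy reported errors for $in\n" if $status >= 2;
    return $status;
}

# Only runs when called as: script.pl input.html output.xml
tidy_to_xml(@ARGV) if @ARGV == 2;
```

An exit status of 1 only means tidy printed warnings, which is normal for real-world HTML, so the sketch treats it as success.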

While we’re suggesting modules, I’d also point to libxml2 and its associated utilities, which are probably already installed if you have a recentish Linux installation and can be installed separately if not. It also has an associated Perl module, XML::LibXML. The bonus is that, once that stuff is installed, you can also process the resulting XML with Perl. The drawback compared with the tidy-based approach is that the libxml2 code is more generic, so you’d have to work to get DOCTYPE lines to come out correctly; on the other hand, libxml2 has a wider area of application.
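For the XML::LibXML route, a minimal sketch might look like this. It parses the HTML with the parser's HTML mode and then serialises the DOM as XML; the sub name html_to_xml and the sample input are invented for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::LibXML;

# Parse (possibly sloppy) HTML and return it serialised as XML.
sub html_to_xml {
    my ($html) = @_;
    my $dom = XML::LibXML->load_html(
        string          => $html,
        recover         => 1,    # keep going after parse errors
        suppress_errors => 1,    # may be ignored by some XML::LibXML versions
    );
    return $dom->toString();     # XML serialisation, unlike toStringHTML()
}

print html_to_xml('<p>Hello<br>world');
```

The unclosed p and br elements in the input come out closed in the result. The serialised document will usually carry an HTML DOCTYPE line, which is the part you may need to adjust by hand.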


If not P, what? Q maybe?
«Sidney Morgenbesser»

The HTML::Tree suite seems to have some XML capabilities. HTML::Element has an XML dump method: $h->as_XML(), which might be a first step, depending on what you want to do.
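As a first step with HTML::Tree, something like the following sketch could be tried. It builds a tree with HTML::TreeBuilder (part of the HTML::Tree distribution) and dumps it with as_XML; the sub name tree_to_xml and the sample input are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Build an HTML::Element tree from a string and dump it as XML.
sub tree_to_xml {
    my ($html) = @_;
    my $tree = HTML::TreeBuilder->new_from_content($html);
    my $xml  = $tree->as_XML;
    $tree->delete;    # free the tree explicitly once we are done with it
    return $xml;
}

print tree_to_xml('<p>Hello<br>world');
```

TreeBuilder fills in the implied html, head and body elements, so the output is a complete document even when the input is only a fragment.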

There is also an HTML::DOMbo module, which turns your HTML tree into an XML tree and, AFAICS, lets you use all of the DOM tools you want on it.

While I have been using HTML::Tree a lot recently (and I highly recommend it for doing most anything with HTML), I haven’t experimented with the XML stuff yet. But it seems promising.

    Re: Re: How to convert HTML to XML w/ Perl?
    by lacika (Initiate) on Feb 07, 2004 at 16:16 UTC
      First of all, I would like to thank you guys, Zaxo, Arturo, Anonymous Monk and Skillet Thief for the prompt response. I will try the modules you suggested and hopefully come back with a big smile on my face. Your help was very much appreciated! See you soon!


    html -> xml::document converter

    tony-o/perl6-html-parser-xml


    README.md

    This module will read HTML and attempt to build an XML::Document (https://github.com/supernovus/exemel/#xmldocument-xmlnode)

    • Automatically closes certain tags if certain other tags are encountered
    • Parses dirty HTML fairly well (AFAIK), submit a bug if it doesn’t
    • Perl6 Magicness

    Bugs/feature requests Maintenance mode

    use LWP::Simple;
    use HTML::Parser::XML;

    # Parse first, then fetch the document from the parser
    my $html   = LWP::Simple.get('http://some-non-https-site.com/');
    my $parser = HTML::Parser::XML.new;
    $parser.parse($html);
    $parser.xmldoc; # XML::Document

    # Or use the return value of parse directly
    my $html   = LWP::Simple.get('http://some-non-https-site.com/');
    my $parser = HTML::Parser::XML.new;
    my $xmldoc = $parser.parse($html);

    Contact me, tony-o on irc.freenode #perl6 (tony-o)


    Perl API to convert HTML to XML

    Use Cells Conversion REST API to create customized spreadsheet workflows in Perl. This is a professional solution to convert HTML to XML and other document formats online using Perl.

    Convert a HTML file to XML in Perl

    Converting file formats from HTML to XML is a complex task. All HTML to XML format transitions are performed by our Perl SDK while maintaining the main structural and logical content of the source HTML spreadsheet. Our Perl library is a professional solution for converting HTML files to XML online. This Cloud SDK gives Perl developers powerful functionality and perfect XML output.

    Code example in Perl using REST API to convert HTML to XML format

    # For complete examples and data files, please go to https://github.com/aspose-cells-cloud/aspose-cells-cloud-perl/
    use strict;
    use warnings;
    use utf8;
    use File::Slurp;
    use AsposeCellsCloud::CellsApi;

    # Credentials are read from the environment
    my $config = AsposeCellsCloud::Configuration->new(
        client_id     => $ENV{'ProductClientId'},
        client_secret => $ENV{'ProductClientSecret'},
    );
    my $instance = AsposeCellsCloud::CellsApi->new(AsposeCellsCloud::ApiClient->new($config));

    my $format    = "xml";
    my $Book1Data = undef;
    my $result    = undef;

    # Read the whole source file in binary mode
    my @fileinfos  = stat("Book1.html");
    my $filelength = $fileinfos[7];
    open(my $in, '<', "Book1.html") or die "file can not open, $!";
    binmode($in);
    read($in, $Book1Data, $filelength);
    close($in);

    # Convert and write the result to Dest.xml
    $result = $instance->cells_workbook_put_convert_workbook(workbook => $Book1Data, format => $format);
    open(my $fh, '>', "Dest.xml") or die "Could not open file!";
    binmode $fh;
    print $fh $result;
    close $fh;

    How to use Perl API to convert HTML to XML

    1. Create an account at Dashboard to get free API quota & authorization details
    2. Initialize CellsApi with Client Id, Client Secret, Base URL & API version
    3. Call cells_workbook_put_convert_workbook method to get the resultant stream


    Working with HTML

    If you ever need to extract text and data from HTML documents, the libxml parser and DOM provide very useful tools. You might imagine that libxml would only work with XHTML and even then only strictly well-formed documents. In fact, the parser has an HTML mode that handles unclosed tags like <img> and <br> and is even able to recover from parse errors caused by poorly formed HTML.

    Let’s start with a mess of HTML tag soup, saved in the file untidy.html.

    To read the file in, you’d use the load_html() method rather than load_xml(). You’ll almost certainly want to use the recover => 1 option to tell the parser to try to recover from parse errors and carry on to produce a DOM.

    #!/usr/bin/perl

    use 5.010;
    use strict;
    use warnings;

    use XML::LibXML;

    my $filename = 'untidy.html';

    my $dom = XML::LibXML->load_html(
        location => $filename,
        recover  => 1,
    );

    say $dom->toStringHTML();

    When the DOM is serialised with toStringHTML(), some rudimentary formatting is applied automatically. Unfortunately there is no option to add indenting to the HTML output.

    While the document is being parsed, you’ll see messages like this on STDERR:

    untidy.html:2: HTML parser error : Opening and ending tag mismatch: i and b
    Here's a paragraph with poorly nested
    ^
    untidy.html:2: HTML parser error : Unexpected end tag : b
    Here's a paragraph with poorly nested
    ^

    You can turn off the error output with the suppress_errors option:

    my $dom = XML::LibXML->load_html(
        location        => $filename,
        recover         => 1,
        suppress_errors => 1,
    );

    That option doesn’t seem to work with all versions of XML::LibXML so you may want to use a routine like this that sends STDERR to /dev/null during parsing, but still allows other output to STDERR when the parse function returns:

    use File::Spec;

    sub parse_html_file {
        my($filename) = @_;

        # Redirect STDERR only for the duration of this sub
        local(*STDERR);
        open STDERR, '>>', File::Spec->devnull();
        return XML::LibXML->load_html(
            location        => $filename,
            recover         => 1,
            suppress_errors => 1,
        );
    }

    Querying HTML with XPath

    The main tool you’ll use for extracting data from HTML is the findnodes() method that was introduced in A Basic Example and XPath Expressions. For these examples, the source HTML comes from the CSS Zen Garden Project and is in the file css-zen-garden.html.

    This script locates every <h3> element inside the <div> with an id attribute value of "zen-supporting":

    my $filename = 'css-zen-garden.html';

    my $dom = XML::LibXML->load_html(
        location        => $filename,
        recover         => 1,
        suppress_errors => 1,
    );

    my $xpath = '//div[@id="zen-supporting"]//h3';
    say "$_" foreach $dom->findnodes($xpath)->to_literal_list;

    So What is This About?
    Participation
    Benefits
    Requirements

    A second script collects the name, designer and URL of each entry in the "design-selection" list and prints them as JSON:

    use 5.010;
    use strict;
    use warnings;

    use XML::LibXML;
    use URI::URL;
    use JSON qw(to_json);

    my $base_url = 'http://csszengarden.com/';
    my $filename = 'css-zen-garden.html';

    my $dom = XML::LibXML->load_html(
        location        => $filename,
        recover         => 1,
        suppress_errors => 1,
    );

    my @designs;
    my $xpath = '//div[@id="design-selection"]//li';
    foreach my $design ($dom->findnodes($xpath)) {
        # The first link in each item is the design, the second the designer
        my($name, $designer) = $design->findnodes('./a')->to_literal_list;
        my($url) = $design->findnodes('./a/@href')->to_literal_list;
        # Resolve relative links against the site's base URL
        $url = URI::URL->new($url, $base_url)->abs;
        push @designs, {
            name     => $name,
            designer => $designer,
            url      => "$url",
        };
    }
    say to_json(\@designs, {pretty => 1});

    [
       {
          "designer" : "Andrew Lohman",
          "url" : "http://csszengarden.com/221/",
          "name" : "Mid Century Modern"
       },
       {
          "name" : "Garments",
          "url" : "http://csszengarden.com/220/",
          "designer" : "Dan Mall"
       },
       {
          "name" : "Steel",
          "designer" : "Steffen Knoeller",
          "url" : "http://csszengarden.com/219/"
       },
       {
          "designer" : "Trent Walton",
          "url" : "http://csszengarden.com/218/",
          "name" : "Apothecary"
       },
       {
          "name" : "Screen Filler",
          "designer" : "Elliot Jay Stocks",
          "url" : "http://csszengarden.com/217/"
       },
       {
          "name" : "Fountain Kiss",
          "designer" : "Jeremy Carlson",
          "url" : "http://csszengarden.com/216/"
       },
       {
          "name" : "A Robot Named Jimmy",
          "designer" : "meltmedia",
          "url" : "http://csszengarden.com/215/"
       },
       {
          "name" : "Verde Moderna",
          "designer" : "Dave Shea",
          "url" : "http://csszengarden.com/214/"
       }
    ]

    In both these examples we were fortunate to be dealing with ‘semantic markup’ – where sections of the document could be readily identified using id attributes. If there were no id attributes, we could change the XPath expression to select using element text content instead:

    my $xpath = '//h3[contains(.,"Select a Design")]/..//li'; 

    Another common problem is finding that although your XPath expressions do match the content you want, they also match content you don’t want – for example from a block of navigation links. In these cases you might identify a block of uninteresting content using findnodes() and then use removeChild() to remove that whole section from the DOM before running your main XPath query. Because you’re only removing the nodes from the in-memory copy of the document, the original source remains unchanged. This technique is used in the spell-check script used to find typos in this document.
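The remove-then-query approach can be sketched as follows. The markup, the id values "nav" and "content", and the sub name list_items_without_nav are all invented for illustration; only findnodes(), removeChild() and to_literal_list come from the text above.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::LibXML;

# Made-up sample page: a navigation list we don't want and a content list we do.
my $html = <<'HTML';
<html><body>
<ul id="nav"><li>Home</li><li>About</li></ul>
<ul id="content"><li>First item</li><li>Second item</li></ul>
</body></html>
HTML

sub list_items_without_nav {
    my ($html) = @_;
    my $dom = XML::LibXML->load_html(
        string          => $html,
        recover         => 1,
        suppress_errors => 1,
    );
    # Detach the uninteresting block from the in-memory DOM only
    foreach my $nav ($dom->findnodes('//ul[@id="nav"]')) {
        $nav->parentNode->removeChild($nav);
    }
    # The main query no longer sees the navigation items
    return $dom->findnodes('//li')->to_literal_list;
}

print "$_\n" foreach list_items_without_nav($html);
```

Without the removeChild() loop, the same //li query would also return "Home" and "About".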

    Matching class names

    $xpath = '//li[contains(@class, "member")]'; 
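One thing to watch with the expression above is that contains() in XPath 1.0 is a plain substring test, so it also matches class values such as "nonmember". The sketch below, using invented sample markup, compares it with a commonly used stricter pattern that pads the class list with spaces and looks for the whole word.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::LibXML;

# Made-up sample markup with three different class attribute values.
my $html = <<'HTML';
<ul>
<li class="member">Alice</li>
<li class="team member active">Bob</li>
<li class="nonmember">Carol</li>
</ul>
HTML

my $dom = XML::LibXML->load_html(
    string          => $html,
    recover         => 1,
    suppress_errors => 1,
);

# Plain substring test: also matches class="nonmember"
my @loose = $dom->findnodes('//li[contains(@class, "member")]')->to_literal_list;

# Whole-word test: pad the normalised class list with spaces first
my $strict_xpath =
    '//li[contains(concat(" ", normalize-space(@class), " "), " member ")]';
my @strict = $dom->findnodes($strict_xpath)->to_literal_list;

print "loose:  @loose\n";     # loose:  Alice Bob Carol
print "strict: @strict\n";    # strict: Alice Bob
```

The stricter form is more verbose, but it behaves the way a CSS selector like li.member would.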
