I am receiving an XML file from a third party which has an HTML element within one of the XML tags. I cannot work out how to unmarshal this to get the href URL. XML example:

    type Href struct {
        Link string `xml:"href"`
    }

    type Link struct {
        URL []Href `xml:"a"`
    }

    type XmlFile struct {
        HTMLTag []Link `xml:"SOME_HTML"`
    }

    myFile := []byte(`   google `)
    var output XmlFile
    err := xml.Unmarshal(myFile, &output)
    fmt.Println(output) //

Welcome @asnow! Your example, which I’m guessing you’ve chosen to keep simple, doesn’t really tell us the kind of HTML you’re expecting. HTML (as opposed to XHTML) has tag omission and other shorthand syntax which, when inserted as a string into XML, will make Go’s XML parser (or any other, for that matter) fail hard. But it could well be that you’re receiving HTML in XML serialization; it’s not 100% clear from your question. So please show us the actual HTML you’re receiving.

The example is pretty much exactly what I am receiving; the only difference is some other fields around it, and the tag is actually

3 Answers

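    package main

    import (
        "encoding/xml"
        "fmt"
        "log"
    )

    // aElement reads the href attribute of an <a> element.
    type aElement struct {
        Href string `xml:"href,attr"`
    }

    // content is the element that wraps the <a> element.
    type content struct {
        A aElement `xml:"a"`
    }

    func main() {
        test := `google`
        var result content
        if err := xml.Unmarshal([]byte(test), &result); err != nil {
            log.Fatal(err)
        }
        fmt.Println(result)
    }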
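As a self-contained illustration of the `href,attr` mapping, here is a minimal sketch. The input document is an assumption: the wrapper element name SOME_HTML comes from the question, but the URL https://example.com and the helper name firstHref are hypothetical and only there to make the example runnable.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"log"
)

// anchor maps an <a> element. The ",attr" option tells encoding/xml to read
// href from the element's attribute instead of from a child element.
type anchor struct {
	Href string `xml:"href,attr"`
}

// wrapper maps the enclosing XML element that carries the HTML.
type wrapper struct {
	A anchor `xml:"a"`
}

// firstHref is a hypothetical helper: it returns the href of the <a> element
// found directly inside the document's root element.
func firstHref(doc []byte) (string, error) {
	var w wrapper
	if err := xml.Unmarshal(doc, &w); err != nil {
		return "", err
	}
	return w.A.Href, nil
}

func main() {
	// Assumed input; the real feed has other fields around it.
	doc := []byte(`<SOME_HTML><a href="https://example.com">google</a></SOME_HTML>`)

	href, err := firstHref(doc)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(href)
}
```

The struct in the question used `xml:"href"` without `,attr`, which makes the decoder look for a child element named href rather than an attribute, so the field stays empty.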

This parses everything in the XML, assuming there could be more than one a tag in the HTML, or other tags (like a div).

If this is not needed, just replace XmlFile.Links with XmlFile.Link of type Link (not []Link).

    package main

    import (
        "encoding/xml"
        "fmt"
        "log"
    )

    func main() {
        type Link struct {
            XMLName xml.Name `xml:"a"`
            URL     string   `xml:"href,attr"`
            Target  string   `xml:"target,attr"`
            Content string   `xml:",chardata"`
        }

        type Div struct {
            XMLName xml.Name `xml:"div"`
            Classes string   `xml:"class,attr"`
            Content string   `xml:",chardata"`
        }

        type XmlFile struct {
            XMLName xml.Name `xml:"SOME_HTML"`
            Links   []Link   `xml:"a"`
            Divs    []Div    `xml:"div"`
        }

        myFile := []byte(`  google facebook `)
        var output XmlFile
        if err := xml.Unmarshal(myFile, &output); err != nil {
            log.Fatal(err)
        }
        fmt.Println(output)
    }

Edit: Added more tags in the XML to show how to parse different tag types.
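To make the slice-based mapping concrete, here is a sketch run against an assumed input. The markup, the URLs, the class name and the helper parseSomeHTML are all hypothetical; only the SOME_HTML, a and div element names and the href, target and class attributes are taken from the answer.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"log"
)

// link captures one <a> element: two attributes plus its text content.
type link struct {
	URL     string `xml:"href,attr"`
	Target  string `xml:"target,attr"`
	Content string `xml:",chardata"`
}

// div captures one <div> element.
type div struct {
	Classes string `xml:"class,attr"`
	Content string `xml:",chardata"`
}

// someHTML is the wrapper; slice fields collect every matching child.
type someHTML struct {
	XMLName xml.Name `xml:"SOME_HTML"`
	Links   []link   `xml:"a"`
	Divs    []div    `xml:"div"`
}

// parseSomeHTML is a hypothetical helper wrapping xml.Unmarshal.
func parseSomeHTML(doc []byte) (someHTML, error) {
	var out someHTML
	err := xml.Unmarshal(doc, &out)
	return out, err
}

func main() {
	// Assumed input with two links and one div.
	doc := []byte(`<SOME_HTML>
  <a href="https://example.com/one" target="_blank">google</a>
  <a href="https://example.com/two">facebook</a>
  <div class="note">some text</div>
</SOME_HTML>`)

	out, err := parseSomeHTML(doc)
	if err != nil {
		log.Fatal(err)
	}
	for _, l := range out.Links {
		fmt.Printf("%s -> %s\n", l.Content, l.URL)
	}
	for _, d := range out.Divs {
		fmt.Printf("div.%s: %s\n", d.Classes, d.Content)
	}
}
```

Attributes that are absent in the input (target on the second link here) simply leave the corresponding field as an empty string.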

You can parse the example you posted using a regular XML parser. There are, however, a whole lot of exceptions to the XML syntax which are commonly accepted as valid HTML.

The simplest example I can think of: all HTML interpreters I know of understand that <br> (an unclosed <br> tag) is the same as a self-closing <br />.
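A small sketch of what this means for encoding/xml: with its default strict rules the decoder rejects a fragment containing an unclosed <br>, while a decoder configured with Strict = false and the package's HTMLAutoClose and HTMLEntity tables accepts it. The input string and the names hrefOnly, fragment, parseStrict and parseLenient are assumptions made for illustration.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"strings"
)

// hrefOnly reads just the href attribute of an <a> element.
type hrefOnly struct {
	Href string `xml:"href,attr"`
}

// fragment collects every <a> directly under the root element.
type fragment struct {
	Links []hrefOnly `xml:"a"`
}

// parseStrict uses the default, strict XML rules.
func parseStrict(doc string) (fragment, error) {
	var f fragment
	err := xml.Unmarshal([]byte(doc), &f)
	return f, err
}

// parseLenient relaxes the decoder so that void HTML elements such as <br>
// are closed automatically and HTML entity names are recognized.
func parseLenient(doc string) (fragment, error) {
	var f fragment
	d := xml.NewDecoder(strings.NewReader(doc))
	d.Strict = false
	d.AutoClose = xml.HTMLAutoClose
	d.Entity = xml.HTMLEntity
	err := d.Decode(&f)
	return f, err
}

func main() {
	// Assumed input: valid HTML, but not well-formed XML.
	doc := `<SOME_HTML>line1<br>line2 <a href="https://example.com">Some link</a></SOME_HTML>`

	if _, err := parseStrict(doc); err != nil {
		fmt.Println("strict parse failed:", err)
	}

	f, err := parseLenient(doc)
	if err != nil {
		fmt.Println("lenient parse failed:", err)
		return
	}
	for _, l := range f.Links {
		fmt.Println("href:", l.Href)
	}
}
```

The lenient mode only papers over some of HTML's quirks, so it is best seen as a stopgap for markup that is almost XML.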

If you don’t know how the HTML on the other end of the service is generated, you are better off using an HTML parser.

For example, there is the golang.org/x/net/html package, which provides several functions to parse HTML:

    package main

    import (
        "fmt"
        "strings"

        "golang.org/x/net/html"
    )

    // findFirstHref walks the node tree depth-first and returns the href of
    // the first <a> element it finds, or "" if there is none.
    func findFirstHref(n *html.Node, indent string) string {
        if n.Type == html.ElementNode {
            fmt.Println(" * scanning:" + indent + n.Data)
        }

        if n.Type == html.ElementNode && n.Data == "a" {
            for _, a := range n.Attr {
                if a.Key == "href" {
                    return a.Val
                }
            }
        }

        for c := n.FirstChild; c != nil; c = c.NextSibling {
            href := findFirstHref(c, indent+" ")
            if href != "" {
                return href
            }
        }
        return ""
    }

    func main() {
        doc1, err := html.Parse(strings.NewReader(sample1))
        if err != nil {
            fmt.Println(err)
        } else {
            fmt.Println("href in sample1:", findFirstHref(doc1, ""))
        }

        doc2, err := html.Parse(strings.NewReader(sample2))
        if err != nil {
            fmt.Println(err)
        } else {
            fmt.Println("href in sample2:", findFirstHref(doc2, ""))
        }
    }

    const (
        sample1 = `   google `

        // sample2 is an invalid XML document (it has unclosed "<br>" tags):
        sample2 = `

    line2 Some

    `
    )


Parsing HTML files with Go

Is encoding/xml the best library for parsing HTML table files like this one, and are there any examples of how to do it?

Test 1
Type Region

Type   Count    Percent
T1     34,314   31.648%
T2     25,820   23.814%
T3      4,871    4.493%

Type   Count    Percent
T4     34,314   31.648%
T5     11,187   10.318%
T6     25,820   23.814%

Have you tried — first Google result for «golang html parser» 😉

1 Answer

Strictly speaking, the only kind of HTML guaranteed to be parsed by a conforming XML parser is XHTML. But although XHTML was once expected to become the HTML standard, it never really got off the ground, and these days it’s considered obsolete (in favor of the much-hyped "HTML5" and the ecosystem around it). The basic problem with HTML is that while it looks like XML, it has different rules. One glaring distinction is that <br> is perfectly legal HTML but is an unterminated element in XML (in the latter, it has to be spelled <br/>), and there are a lot more differences.

On the other hand, your particular example looks quite XML-ish to me, so if you can guarantee that your data, while being HTML, will always be well-formed XML at the same time, you can just use the encoding/xml package. Otherwise go for , as suggested by @elithrar, or find some other package.
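For the well-formed case, a sketch of reading table rows with encoding/xml could look like this. The markup is an assumption (tr elements directly under table, th for the header row, td for data cells); only the cell values are taken from the sample in the question, and parseTable is a hypothetical helper.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"log"
	"strings"
)

// row collects the text of every <td> cell in one <tr>.
type row struct {
	Cells []string `xml:"td"`
}

// table collects every <tr> that is a direct child of the root element.
type table struct {
	Rows []row `xml:"tr"`
}

// parseTable returns the data rows of a well-formed table as string slices.
// Rows without <td> cells (for example a header row of <th>) are skipped.
func parseTable(doc string) ([][]string, error) {
	var t table
	if err := xml.Unmarshal([]byte(doc), &t); err != nil {
		return nil, err
	}
	var out [][]string
	for _, r := range t.Rows {
		if len(r.Cells) == 0 {
			continue
		}
		cells := make([]string, len(r.Cells))
		for i, c := range r.Cells {
			cells[i] = strings.TrimSpace(c)
		}
		out = append(out, cells)
	}
	return out, nil
}

func main() {
	// Assumed markup; the values come from the sample table.
	doc := `<table>
  <tr><th>Type</th><th>Count</th><th>Percent</th></tr>
  <tr><td>T1</td><td>34,314</td><td>31.648%</td></tr>
  <tr><td>T2</td><td>25,820</td><td>23.814%</td></tr>
</table>`

	rows, err := parseTable(doc)
	if err != nil {
		log.Fatal(err)
	}
	for _, r := range rows {
		fmt.Println(strings.Join(r, " | "))
	}
}
```

If the real file wraps its rows in tbody, or leaves tags unclosed, this mapping will not match, which is where a dedicated HTML parser becomes the better fit.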

go-xslt is a Go module that performs basic XSLT 1.0 transformations via Libxslt.


You’ll need the development libraries for libxml2 and libxslt, along with those for liblzma and zlib. Install these via your package manager. For instance, if using apt then:

sudo apt install libxml2-dev libxslt1-dev liblzma-dev zlib1g-dev 

This module can be installed with the go get command:

go get -u 


    // style is an XSLT 1.0 stylesheet, as []byte.
    xs, err := xslt.NewStylesheet(style)
    if err != nil {
        panic(err)
    }
    defer xs.Close()

    // doc is an XML document to be transformed and res is the result of
    // the XSL transformation, both as []byte.
    res, err := xs.Transform(doc)
    if err != nil {
        panic(err)
    }

