Parsing XML with Go Programming

Welcome to part 11 of the Go programming tutorial series. In this tutorial, we're going to cover how to parse the XML document that we fetched in the previous tutorial. Our previous code:

package main

import (
  "fmt"
  "io/ioutil"
  "net/http"
)

func main() {
  // Fetch the sitemap index and read the whole response body.
  resp, _ := http.Get("https://www.washingtonpost.com/news-sitemap-index.xml")
  bytes, _ := ioutil.ReadAll(resp.Body)
  // Convert the raw bytes to a string so we can print it.
  string_body := string(bytes)
  fmt.Println(string_body)
  resp.Body.Close()
}

In this case, we're just reading the data like we might any source code. Since we want to parse it, it makes sense that we might want to look into available libraries to help us with this. Luckily for us, we can use the encoding/xml package to give us some aid.

package main

import (
  "encoding/xml"
  "fmt"
  "io/ioutil"
  "net/http"
)

Next, we're going to use some structs to help us in describing the structure of this document to Go and to later ascribe methods to it.

First, our sitemap has the following structure:

<sitemapindex>
   <sitemap>
      <loc>http://www.washingtonpost.com/news-politics-sitemap.xml</loc>
   </sitemap>
   <sitemap>
      <loc>http://www.washingtonpost.com/news-blogs-politics-sitemap.xml</loc>
   </sitemap>
   <sitemap>
      <loc>http://www.washingtonpost.com/news-opinions-sitemap.xml</loc>
   </sitemap>
</sitemapindex>

So the sitemapindex tag is the parent of the rest of the document. Then you have the sitemap tags that encase the final loc tags, which hold the URLs we're interested in. Our end goal here is to create a "list" of URLs that we could then iterate over. In Go, that "list" is actually going to be a slice, which is like an array, except that an array has a fixed size and a slice doesn't! Okay, so we're going to build a slice of URLs. We'll then make a struct for our data like so:

type Sitemapindex struct {
  Locations []Location `xml:"sitemap"`
}

In this case, we're creating our custom type called Sitemapindex using a struct. Inside, we're creating a slice, called Locations, whose elements are of the Location type. Finally, we add the `xml:"sitemap"` struct tag at the end so the parser knows which XML element to look for when we go to unpack this with the encoding/xml package. Since we're throwing two things at you here, let's look at a simple array example:

An array with 6 integers might look something like: var ArrExample [6]int.

How about a slice? We could make a slice with var SliceExample []float32. This would be a slice that's made up of float32 values.
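To make the difference concrete, here is a minimal sketch that declares one of each and grows the slice with append. The helper name appendValue is made up for this illustration and is not part of the tutorial's code.

```go
package main

import "fmt"

// appendValue returns the slice with v added to the end.
// Hypothetical helper, used only for this illustration.
func appendValue(s []float32, v float32) []float32 {
	return append(s, v)
}

func main() {
	// An array: the length (6) is part of the type and can never change.
	var arrExample [6]int
	arrExample[0] = 10
	fmt.Println(len(arrExample), arrExample) // 6 [10 0 0 0 0 0]

	// A slice: no length in the type, so it starts empty and can grow.
	var sliceExample []float32
	fmt.Println(len(sliceExample), sliceExample) // 0 []

	sliceExample = appendValue(sliceExample, 1.5)
	sliceExample = appendValue(sliceExample, 2.5)
	fmt.Println(len(sliceExample), sliceExample) // 2 [1.5 2.5]
}
```

This is why a slice suits our sitemap: we don't know ahead of time how many URLs the document holds.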

Alright, back to our example, we've got a struct that at least gets us through to the sitemap XML tag, and then we know that we're going to build a slice called Locations, which is made up of data of the type Location. Obviously, Location isn't a built-in type, so let's make a struct for it too!

type Location struct {
  Loc string `xml:"loc"`
}

In this case, our Location type simply consists of a Loc variable, which is a string type, and then we again have the xml information giving the location of the tag that we'll be using for this variable.

Alright, we feel like at this point, we're probably all set. Let's see how that goes. We'll build our main function now:

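func main() {
  resp, err := http.Get("https://www.washingtonpost.com/news-sitemap-index.xml")
  if err != nil {
    // Without a response there is nothing to read, so stop here.
    fmt.Println("request failed:", err)
    return
  }
  defer resp.Body.Close()
  bytes, _ := ioutil.ReadAll(resp.Body)
  var s Sitemapindex
  // Fill s from the XML, guided by the struct tags.
  xml.Unmarshal(bytes, &s)
  fmt.Println(s.Locations)
}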

The result here looks something like:

[{http://www.washingtonpost.com/news-politics-sitemap.xml} {http://www.washingtonpost.com/news-blogs-politics-sitemap.xml} {http://www.washingtonpost.com/news-opinions-sitemap.xml} {http://www.washingtonpost.com/news-blogs-opinions-sitemap.xml} {http://www.washingtonpost.com/news-local-sitemap.xml} {http://www.washingtonpost.com/news-blogs-local-sitemap.xml} {http://www.washingtonpost.com/news-sports-sitemap.xml} {http://www.washingtonpost.com/news-blogs-sports-sitemap.xml} {http://www.washingtonpost.com/news-national-sitemap.xml} {http://www.washingtonpost.com/news-blogs-national-sitemap.xml} {http://www.washingtonpost.com/news-world-sitemap.xml} {http://www.washingtonpost.com/news-blogs-world-sitemap.xml} {http://www.washingtonpost.com/news-business-sitemap.xml} {http://www.washingtonpost.com/news-blogs-business-sitemap.xml} {http://www.washingtonpost.com/news-technology-sitemap.xml} {http://www.washingtonpost.com/news-blogs-technology-sitemap.xml} {http://www.washingtonpost.com/news-lifestyle-sitemap.xml} {http://www.washingtonpost.com/news-blogs-lifestyle-sitemap.xml} {http://www.washingtonpost.com/news-entertainment-sitemap.xml} {http://www.washingtonpost.com/news-blogs-entertainment-sitemap.xml} {http://www.washingtonpost.com/news-blogs-goingoutguide-sitemap.xml} {http://www.washingtonpost.com/news-goingoutguide-sitemap.xml}]

Hmm, in general, we know square brackets to be associated with things like lists and arrays, and curly braces to be associated more with other things. Each element in this Go slice is {url} rather than just the plain url. Why is this? It's fairly subtle, but each of these elements is actually a Location value, which just so happens to have a Loc field that holds the string we're looking for. We'd rather have this slice print like a slice of strings, not a slice of Location structs. To fix this, we need to build a String method to associate with our Location type.

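// String lets fmt print a Location as just its URL.
func (l Location) String() string {
  // Pass Loc as a value, not as the format string, so a % in a URL is kept as-is.
  return fmt.Sprintf("%s", l.Loc)
}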

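Now we've made ourselves a String method for the Location type that returns its URL as a string. The fmt package calls a value's String method automatically when printing it, which is what gets rid of the curly braces. We're using fmt.Sprintf() here, which takes a format string followed by the values to format, and returns the resulting string. More information can be found at Sprintf in the Golang Documentation.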
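To see the effect in isolation, here is a small sketch that prints two otherwise identical types, one with a String method and one without. The Plain type and the example.com URL are placeholders invented for this comparison, and the String method here returns the field directly instead of going through Sprintf, purely to keep the sketch short.

```go
package main

import "fmt"

// Plain has no String method, so fmt prints it as a struct: {value}.
// Hypothetical type, used only for comparison.
type Plain struct {
	Loc string
}

// Location mirrors the tutorial's type, with a String method attached.
type Location struct {
	Loc string
}

// String makes Location satisfy the fmt.Stringer interface.
func (l Location) String() string {
	return l.Loc
}

func main() {
	url := "http://example.com/a.xml" // placeholder URL

	fmt.Println([]Plain{{url}})    // [{http://example.com/a.xml}]
	fmt.Println([]Location{{url}}) // [http://example.com/a.xml]
}
```

Note that the slice is still a slice of Location values in both cases. Only the way each element is printed changes.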

Full code up to this point:

package main

import (
  "encoding/xml"
  "fmt"
  "io/ioutil"
  "net/http"
)

type Sitemapindex struct {
  Locations []Location `xml:"sitemap"`
}

type Location struct {
  Loc string `xml:"loc"`
}

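// String lets fmt print a Location as just its URL.
func (l Location) String() string {
  return fmt.Sprintf("%s", l.Loc)
}

func main() {
  resp, err := http.Get("https://www.washingtonpost.com/news-sitemap-index.xml")
  if err != nil {
    fmt.Println("request failed:", err)
    return
  }
  defer resp.Body.Close()
  bytes, _ := ioutil.ReadAll(resp.Body)
  var s Sitemapindex
  xml.Unmarshal(bytes, &s)
  fmt.Println(s.Locations)
}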

Now, when we run this, we get:

[http://www.washingtonpost.com/news-politics-sitemap.xml http://www.washingtonpost.com/news-blogs-politics-sitemap.xml http://www.washingtonpost.com/news-opinions-sitemap.xml http://www.washingtonpost.com/news-blogs-opinions-sitemap.xml http://www.washingtonpost.com/news-local-sitemap.xml http://www.washingtonpost.com/news-blogs-local-sitemap.xml http://www.washingtonpost.com/news-sports-sitemap.xml http://www.washingtonpost.com/news-blogs-sports-sitemap.xml http://www.washingtonpost.com/news-national-sitemap.xml http://www.washingtonpost.com/news-blogs-national-sitemap.xml http://www.washingtonpost.com/news-world-sitemap.xml http://www.washingtonpost.com/news-blogs-world-sitemap.xml http://www.washingtonpost.com/news-business-sitemap.xml http://www.washingtonpost.com/news-blogs-business-sitemap.xml http://www.washingtonpost.com/news-technology-sitemap.xml http://www.washingtonpost.com/news-blogs-technology-sitemap.xml http://www.washingtonpost.com/news-lifestyle-sitemap.xml http://www.washingtonpost.com/news-blogs-lifestyle-sitemap.xml http://www.washingtonpost.com/news-entertainment-sitemap.xml http://www.washingtonpost.com/news-blogs-entertainment-sitemap.xml http://www.washingtonpost.com/news-blogs-goingoutguide-sitemap.xml http://www.washingtonpost.com/news-goingoutguide-sitemap.xml]

Woo! That's more like what we were wanting! Now we just need to iterate through this list of sitemaps, visit them, and we can finally pull some information on the recent news...there's just one minor problem: We haven't talked at all about iterating in Go! In the next tutorial, we'll be talking about the for loop in Go!
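As an aside on the earlier wish for a plain slice of strings: encoding/xml also accepts a path-style tag of the form a>b, which descends through nested elements. The following is only a sketch of that alternative, not what this series uses going forward. The names SitemapURLs, parseLocs, and sampleXML are made up for the example, and the XML is a trimmed, hard-coded stand-in for the real sitemap index.

```go
package main

import (
	"encoding/xml"
	"fmt"
)

// sampleXML is a hard-coded stand-in for the downloaded sitemap index.
var sampleXML = []byte(`
<sitemapindex>
   <sitemap>
      <loc>http://www.washingtonpost.com/news-politics-sitemap.xml</loc>
   </sitemap>
   <sitemap>
      <loc>http://www.washingtonpost.com/news-opinions-sitemap.xml</loc>
   </sitemap>
</sitemapindex>
`)

// SitemapURLs is a hypothetical alternative to Sitemapindex. The tag
// "sitemap>loc" tells the parser to look inside each <sitemap> element
// for <loc> and collect the text straight into a []string.
type SitemapURLs struct {
	Locs []string `xml:"sitemap>loc"`
}

// parseLocs is a hypothetical helper that returns just the URL strings.
func parseLocs(data []byte) ([]string, error) {
	var s SitemapURLs
	if err := xml.Unmarshal(data, &s); err != nil {
		return nil, err
	}
	return s.Locs, nil
}

func main() {
	locs, err := parseLocs(sampleXML)
	if err != nil {
		fmt.Println("parse failed:", err)
		return
	}
	fmt.Println(locs)
}
```

The trade-off is that there is no Location type left to attach methods to, so the struct-plus-String approach above remains the better fit if you plan to add behavior later.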

Something I personally ran into that you might too in the future is with casing. As we learned earlier in this tutorial series, Go puts emphasis on the casing of elements to determine whether or not they are exported. In our case here, our XML tags are all lower-cased, which encouraged me to write the Location type as:

type Location struct {
  loc string `xml:"loc"`
}

And then of course this would be referenced again in our String method as l.loc. With no other changes than lower-casing that field name, every URL in our output comes up empty. Womp. That's rather unfortunate! The reason for this is that the xml package we're using only fills in exported fields, so we *must* capitalize the first letter of the field name. By this time, I had already forgotten about this, so it was a good reminder, and something you need to pay attention to if you're not used to it!
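Here is a small sketch that reproduces the casing problem side by side. The type names Exported and Unexported, the helper decodeBoth, and the example.com URL are all invented for this demonstration.

```go
package main

import (
	"encoding/xml"
	"fmt"
)

// sampleXML is a single hard-coded <sitemap> element with a placeholder URL.
var sampleXML = []byte(`<sitemap><loc>http://example.com/a.xml</loc></sitemap>`)

// Exported has a capitalized field, so encoding/xml can fill it in.
type Exported struct {
	Loc string `xml:"loc"`
}

// Unexported has a lower-case field. The tag is present, but the xml
// package skips unexported fields, so it stays empty. (go vet warns
// about tagged fields like this one.)
type Unexported struct {
	loc string `xml:"loc"`
}

// decodeBoth parses the same data into both types and returns what
// each field ended up holding. Hypothetical helper for this demo.
func decodeBoth(data []byte) (string, string, error) {
	var e Exported
	var u Unexported
	if err := xml.Unmarshal(data, &e); err != nil {
		return "", "", err
	}
	if err := xml.Unmarshal(data, &u); err != nil {
		return "", "", err
	}
	return e.Loc, u.loc, nil
}

func main() {
	exported, unexported, err := decodeBoth(sampleXML)
	if err != nil {
		fmt.Println("parse failed:", err)
		return
	}
	fmt.Printf("exported field:   %q\n", exported)   // "http://example.com/a.xml"
	fmt.Printf("unexported field: %q\n", unexported) // ""
}
```

Notice that neither call returns an error. The lower-case field is silently ignored, which is what makes this mistake easy to miss.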

Now, for the inevitable day when the Washington Post changes its structure, you can use the following code, which simulates the same thing:

package main

import (
  "encoding/xml"
  "fmt"
)

var washPostXML = []byte(`
<sitemapindex>
   <sitemap>
      <loc>http://www.washingtonpost.com/news-politics-sitemap.xml</loc>
   </sitemap>
   <sitemap>
      <loc>http://www.washingtonpost.com/news-blogs-politics-sitemap.xml</loc>
   </sitemap>
   <sitemap>
      <loc>http://www.washingtonpost.com/news-opinions-sitemap.xml</loc>
   </sitemap>
</sitemapindex>
`)

type Sitemapindex struct {
  Locations []Location `xml:"sitemap"`
}

type Location struct {
  Loc string `xml:"loc"`
}

// String lets fmt print a Location as just its URL.
func (e Location) String() string {
  return fmt.Sprintf("%s", e.Loc)
}

func main() {
  bytes := washPostXML
  var s Sitemapindex
  xml.Unmarshal(bytes, &s)
  fmt.Println(s.Locations)
}
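The code in this tutorial ignores the error that xml.Unmarshal returns. As a hedged sketch of what checking it could look like, the helper below hands the error back to the caller. The name parseSitemapIndex and both inline XML samples are assumptions made for this example, and the second sample is deliberately malformed.

```go
package main

import (
	"encoding/xml"
	"fmt"
)

type Sitemapindex struct {
	Locations []Location `xml:"sitemap"`
}

type Location struct {
	Loc string `xml:"loc"`
}

// parseSitemapIndex is a hypothetical helper that returns the parse
// error instead of discarding it.
func parseSitemapIndex(data []byte) (Sitemapindex, error) {
	var s Sitemapindex
	err := xml.Unmarshal(data, &s)
	return s, err
}

func main() {
	// Well-formed sample with a placeholder URL.
	good := []byte(`<sitemapindex><sitemap><loc>http://example.com/a.xml</loc></sitemap></sitemapindex>`)
	// Malformed sample: <loc> is closed by </sitemap>.
	bad := []byte(`<sitemapindex><sitemap><loc>http://example.com/a.xml</sitemap>`)

	s, err := parseSitemapIndex(good)
	fmt.Println("good:", len(s.Locations), "location(s), error:", err)

	_, err = parseSitemapIndex(bad)
	fmt.Println("bad: error:", err)
}
```

If the site ever serves something other than valid XML, a check like this tells you why the slice came back empty instead of leaving you to guess.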

The next tutorial: Looping in Go Programming