Welcome to part 11 of the Go programming tutorial series. In this tutorial, we're going to cover how we can go about parsing this XML document. Our previous code:
package main

import (
	"fmt"
	"io/ioutil"
	"net/http"
)

func main() {
	resp, _ := http.Get("https://www.washingtonpost.com/news-sitemap-index.xml")
	bytes, _ := ioutil.ReadAll(resp.Body)
	string_body := string(bytes)
	fmt.Println(string_body)
	resp.Body.Close()
}
In this case, we're just reading the data as one big string, just as we would the source of any web page. Since we want to parse it, it makes sense to look into available libraries to help us with this. Luckily for us, we can use the encoding/xml package to give us some aid.
package main

import (
	"encoding/xml"
	"fmt"
	"io/ioutil"
	"net/http"
)
Next, we're going to use some structs to help us describe the structure of this document to Go, and to later attach methods to it.
First, our sitemap has the following structure:
<sitemapindex>
	<sitemap>
		<loc>http://www.washingtonpost.com/news-politics-sitemap.xml</loc>
	</sitemap>
	<sitemap>
		<loc>http://www.washingtonpost.com/news-blogs-politics-sitemap.xml</loc>
	</sitemap>
	<sitemap>
		<loc>http://www.washingtonpost.com/news-opinions-sitemap.xml</loc>
	</sitemap>
</sitemapindex>
So the sitemapindex tag is the parent of the rest of the document. Then you have the sitemap tags, which encase the final loc tags, which hold the URLs we're interested in. Our end goal here is to create a "list" of URLs that we could then iterate over. In Go, that "list" is actually going to be a slice, which is like an array, except that an array has a fixed size and a slice doesn't! Okay, so we're going to build a slice of URLs. We'll then make a struct for our data like so:
type Sitemapindex struct {
	Locations []Location `xml:"sitemap"`
}
In this case, we're creating our custom type called Sitemapindex using a struct. Inside, we're creating a slice, called Locations, which is of the Location type. Finally, we add the `xml:"sitemap"` syntax at the end so the parser understands where to look when we go to unpack this with the encoding/xml package. Since we're throwing two things at you here, let's look at a simple array example:
An array with 6 integers might look something like: var ArrExample [6]int.
How about a slice? We could make a slice with var SliceExample []float32. This would be a slice that's made up of float32 values.
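To make the array/slice difference concrete, here's a small sketch. The growSlice helper is made up purely for illustration and isn't part of our sitemap program; the point is just that an array's length is fixed by its type, while a slice can grow with append.

```go
package main

import "fmt"

// growSlice is a hypothetical helper for this sketch: it appends a value
// to a slice and returns the (possibly reallocated) result.
func growSlice(s []float32, v float32) []float32 {
	return append(s, v)
}

func main() {
	// An array's length is part of its type and cannot change.
	var arrExample [6]int
	arrExample[0] = 42
	fmt.Println(len(arrExample), arrExample) // 6 [42 0 0 0 0 0]

	// A slice declared like this starts out nil, with length 0.
	var sliceExample []float32
	fmt.Println(len(sliceExample), sliceExample == nil) // 0 true

	// Unlike the array, the slice can grow.
	sliceExample = growSlice(sliceExample, 1.5)
	sliceExample = growSlice(sliceExample, 2.5)
	fmt.Println(len(sliceExample), sliceExample) // 2 [1.5 2.5]
}
```

This is why a slice suits us here: we don't know ahead of time how many sitemap entries the document will contain.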
Alright, back to our example. We've got a struct that at least gets us through to the sitemap XML tag, and we know that we're going to build a slice called Locations, which is made up of data of the type Location. Obviously, Location isn't a built-in type, so let's make a struct for it too!
type Location struct {
	Loc string `xml:"loc"`
}
In this case, our Location type simply consists of a Loc field, which is a string, and then we again have the xml tag naming the XML element that will be used to fill this field.
Alright, we feel like at this point, we're probably all set. Let's see how that goes. We'll build our main function now:
func main() {
	resp, _ := http.Get("https://www.washingtonpost.com/news-sitemap-index.xml")
	bytes, _ := ioutil.ReadAll(resp.Body)

	var s Sitemapindex
	xml.Unmarshal(bytes, &s)

	fmt.Println(s.Locations)
}
The result here looks something like:
[{http://www.washingtonpost.com/news-politics-sitemap.xml} {http://www.washingtonpost.com/news-blogs-politics-sitemap.xml} {http://www.washingtonpost.com/news-opinions-sitemap.xml} {http://www.washingtonpost.com/news-blogs-opinions-sitemap.xml} {http://www.washingtonpost.com/news-local-sitemap.xml} {http://www.washingtonpost.com/news-blogs-local-sitemap.xml} {http://www.washingtonpost.com/news-sports-sitemap.xml} {http://www.washingtonpost.com/news-blogs-sports-sitemap.xml} {http://www.washingtonpost.com/news-national-sitemap.xml} {http://www.washingtonpost.com/news-blogs-national-sitemap.xml} {http://www.washingtonpost.com/news-world-sitemap.xml} {http://www.washingtonpost.com/news-blogs-world-sitemap.xml} {http://www.washingtonpost.com/news-business-sitemap.xml} {http://www.washingtonpost.com/news-blogs-business-sitemap.xml} {http://www.washingtonpost.com/news-technology-sitemap.xml} {http://www.washingtonpost.com/news-blogs-technology-sitemap.xml} {http://www.washingtonpost.com/news-lifestyle-sitemap.xml} {http://www.washingtonpost.com/news-blogs-lifestyle-sitemap.xml} {http://www.washingtonpost.com/news-entertainment-sitemap.xml} {http://www.washingtonpost.com/news-blogs-entertainment-sitemap.xml} {http://www.washingtonpost.com/news-blogs-goingoutguide-sitemap.xml} {http://www.washingtonpost.com/news-goingoutguide-sitemap.xml}]
Hmm. In general, we associate square brackets with things like lists and arrays, and curly braces with other things. Each element in this Go slice is {url} rather than just url. Why is this? It's fairly subtle, but each of these elements is actually a Location value, which just so happens to have a Loc field holding the string we're looking for. We'd rather have this slice print like a slice of strings, not a slice of Location structs. To fix this, we need to build a String method to associate with our Location type.
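func (l Location) String() string {
	// Pass the URL as a value for %s; used as the format string itself,
	// any % in the URL would be treated as a formatting verb.
	return fmt.Sprintf("%s", l.Loc)
}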
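Now we've made ourselves a method for the Location type that converts it to a string. Because the method is named String and returns a string, Location satisfies the fmt.Stringer interface, and fmt.Println calls it for each element when printing the slice. The slice itself is still a slice of Location values; only the way it prints has changed. We're using fmt.Sprintf() here, which takes a format string plus any values, formats them according to the specifiers in that string, and returns the resulting string. More information can be found at Sprintf in the Golang Documentation.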
Full code up to this point:
package main

import (
	"encoding/xml"
	"fmt"
	"io/ioutil"
	"net/http"
)

type Sitemapindex struct {
	Locations []Location `xml:"sitemap"`
}

type Location struct {
	Loc string `xml:"loc"`
}

func (l Location) String() string {
	// Pass the URL as a value for %s rather than as the format string.
	return fmt.Sprintf("%s", l.Loc)
}

func main() {
	resp, _ := http.Get("https://www.washingtonpost.com/news-sitemap-index.xml")
	bytes, _ := ioutil.ReadAll(resp.Body)

	var s Sitemapindex
	xml.Unmarshal(bytes, &s)

	fmt.Println(s.Locations)
}
Now, when we run this, we get:
[http://www.washingtonpost.com/news-politics-sitemap.xml http://www.washingtonpost.com/news-blogs-politics-sitemap.xml http://www.washingtonpost.com/news-opinions-sitemap.xml http://www.washingtonpost.com/news-blogs-opinions-sitemap.xml http://www.washingtonpost.com/news-local-sitemap.xml http://www.washingtonpost.com/news-blogs-local-sitemap.xml http://www.washingtonpost.com/news-sports-sitemap.xml http://www.washingtonpost.com/news-blogs-sports-sitemap.xml http://www.washingtonpost.com/news-national-sitemap.xml http://www.washingtonpost.com/news-blogs-national-sitemap.xml http://www.washingtonpost.com/news-world-sitemap.xml http://www.washingtonpost.com/news-blogs-world-sitemap.xml http://www.washingtonpost.com/news-business-sitemap.xml http://www.washingtonpost.com/news-blogs-business-sitemap.xml http://www.washingtonpost.com/news-technology-sitemap.xml http://www.washingtonpost.com/news-blogs-technology-sitemap.xml http://www.washingtonpost.com/news-lifestyle-sitemap.xml http://www.washingtonpost.com/news-blogs-lifestyle-sitemap.xml http://www.washingtonpost.com/news-entertainment-sitemap.xml http://www.washingtonpost.com/news-blogs-entertainment-sitemap.xml http://www.washingtonpost.com/news-blogs-goingoutguide-sitemap.xml http://www.washingtonpost.com/news-goingoutguide-sitemap.xml]
Woo! That's more like what we wanted! Now we just need to iterate through this list of sitemaps, visit them, and we can finally pull some information on the recent news. There's just one minor problem: we haven't talked at all about iterating in Go! In the next tutorial, we'll be talking about the for loop in Go!
Something I personally ran into, and that you might run into as well, is casing. As we learned earlier in this tutorial series, Go uses the casing of a name's first letter to determine whether or not it is exported. In our case here, our XML tags are all lower-cased, which encouraged me to write the Location type as:
type Location struct {
	loc string `xml:"loc"`
}
And then of course this would be referenced again in our String method as l.loc. With no other change than lower-casing that field, our output comes up empty: the slice still gets one entry per sitemap, but every URL in it is an empty string. Womp. That's rather unfortunate! The reason for this is that the xml package we're using only fills exported fields, so we *must* start the field name with an upper-case letter. By this time, I had already forgotten about this, so it was a good reminder, and something you need to pay attention to if you're not used to it!
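You can reproduce this without touching the network. The sketch below uses two hypothetical types, GoodLocation and BadLocation, that differ only in the casing of the field, plus a made-up decodeBoth helper. Notice that Unmarshal doesn't return an error for the lower-cased version; the field is just quietly skipped.

```go
package main

import (
	"encoding/xml"
	"fmt"
)

// GoodLocation has an exported field, so encoding/xml can set it.
type GoodLocation struct {
	Loc string `xml:"loc"`
}

// BadLocation has an unexported field. encoding/xml skips it, so it
// keeps its zero value (go vet also flags a tag on an unexported field).
type BadLocation struct {
	loc string `xml:"loc"`
}

// decodeBoth is a hypothetical helper that unmarshals the same data
// into both types and returns the two resulting strings.
func decodeBoth(data []byte) (string, string, error) {
	var good GoodLocation
	var bad BadLocation
	if err := xml.Unmarshal(data, &good); err != nil {
		return "", "", err
	}
	if err := xml.Unmarshal(data, &bad); err != nil {
		return "", "", err
	}
	return good.Loc, bad.loc, nil
}

func main() {
	data := []byte(`<sitemap><loc>http://www.washingtonpost.com/news-politics-sitemap.xml</loc></sitemap>`)

	good, bad, err := decodeBoth(data)
	fmt.Printf("exported:   %q\n", good) // the URL
	fmt.Printf("unexported: %q\n", bad)  // ""
	fmt.Println("error:", err)           // error: <nil>
}
```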
Since the Washington Post will inevitably change its sitemap structure at some point, you can use the following code, which simulates the same thing with the XML embedded directly in the program:
package main

import (
	"encoding/xml"
	"fmt"
)

var washPostXML = []byte(`
<sitemapindex>
	<sitemap>
		<loc>http://www.washingtonpost.com/news-politics-sitemap.xml</loc>
	</sitemap>
	<sitemap>
		<loc>http://www.washingtonpost.com/news-blogs-politics-sitemap.xml</loc>
	</sitemap>
	<sitemap>
		<loc>http://www.washingtonpost.com/news-opinions-sitemap.xml</loc>
	</sitemap>
</sitemapindex>
`)

type Sitemapindex struct {
	Locations []Location `xml:"sitemap"`
}

type Location struct {
	Loc string `xml:"loc"`
}

func (l Location) String() string {
	// Pass the URL as a value for %s rather than as the format string.
	return fmt.Sprintf("%s", l.Loc)
}

func main() {
	bytes := washPostXML

	var s Sitemapindex
	xml.Unmarshal(bytes, &s)

	fmt.Println(s.Locations)
}
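One thing the code in this tutorial skips is error handling: the value returned by xml.Unmarshal is simply dropped. As a rough sketch of how you might check it, here's a variant built around a parseSitemapIndex helper. That name is my own invention for this example, and String returns the field directly to keep things short.

```go
package main

import (
	"encoding/xml"
	"fmt"
)

type Sitemapindex struct {
	Locations []Location `xml:"sitemap"`
}

type Location struct {
	Loc string `xml:"loc"`
}

func (l Location) String() string {
	return l.Loc
}

// parseSitemapIndex is a hypothetical helper that returns the
// Unmarshal error to the caller instead of discarding it.
func parseSitemapIndex(data []byte) (Sitemapindex, error) {
	var s Sitemapindex
	if err := xml.Unmarshal(data, &s); err != nil {
		return Sitemapindex{}, fmt.Errorf("parsing sitemap index: %w", err)
	}
	return s, nil
}

func main() {
	good := []byte(`<sitemapindex><sitemap><loc>http://www.washingtonpost.com/news-politics-sitemap.xml</loc></sitemap></sitemapindex>`)

	s, err := parseSitemapIndex(good)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(s.Locations)

	// Truncated input: the closing tags are missing, so Unmarshal
	// reports a syntax error rather than returning partial data silently.
	_, err = parseSitemapIndex([]byte(`<sitemapindex><sitemap><loc>http://`))
	fmt.Println("got an error for truncated input:", err != nil)
}
```

The same idea applies to http.Get and ioutil.ReadAll in the networked version: each returns an error as its second value that you'd normally check before using the result.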