Sometimes you need to extract data points from HTML using XPath (read What is XPath first). First, install the htmlquery package:

```
go get github.com/antchfx/htmlquery
```

Here is the code; read the explanation after it:
```go
package main

import (
	"bytes"
	"fmt"
	"strings"

	"github.com/antchfx/htmlquery"
)

// ExtractXPath accepts a single XPath expression and returns a single string.
func ExtractXPath(htmlStr string, xpathExpr string) (string, error) {
	// Load the HTML document
	var buffer bytes.Buffer
	buffer.WriteString(htmlStr)
	doc, err := htmlquery.Parse(&buffer)
	if err != nil {
		return "", err
	}

	// Find the nodes matching the XPath expression
	nodes := htmlquery.Find(doc, xpathExpr)
	var content []string

	// Iterate over the nodes and extract the content
	for _, node := range nodes {
		content = append(content, htmlquery.InnerText(node))
	}

	// Join the extracted content if multiple nodes were found
	result := strings.Join(content, " ")

	return result, nil
}

func main() {
	htmlStr := `
<html>
  <head>
    <title>Test Page</title>
  </head>
  <body>
    <div class="content">
      <p>Hello, World!</p>
      <p>This is a test.</p>
    </div>
  </body>
</html>`

	xpathExpr := "//div[@class='content']/p"

	content, err := ExtractXPath(htmlStr, xpathExpr)
	if err != nil {
		fmt.Println("Error:", err)
	} else {
		fmt.Println("Extracted content:", content)
	}
}
```

You will receive the output:
```
Extracted content: Hello, World! This is a test.
```

Luckily, there is an open-source lib, htmlquery, for that. Install it first:

```
go get github.com/antchfx/htmlquery
```

Then, do a basic query against the document:

```go
nodes, err := htmlquery.QueryAll(doc, "//a")
if err != nil {
	panic(`not a valid XPath expression.`)
}
```

See more examples in the docs.
```go
package extract

import (
	"bytes"
	"strings"

	"github.com/antchfx/htmlquery"
)

type Rules = map[string]string
type Content = map[string]string

func XPath(htmlStr string, filter Rules) (Content, error) {
	// Load the HTML document
	var buffer bytes.Buffer
	buffer.WriteString(htmlStr)
	doc, err := htmlquery.Parse(&buffer)
	if err != nil {
		return nil, err
	}

	result := make(Content)

	// Iterate over the filter to apply each XPath expression
	for key, xpathExpr := range filter {
		// Find the nodes matching the XPath expression
		nodes := htmlquery.Find(doc, xpathExpr)
		var content []string

		// Iterate over the nodes and extract the content
		for _, node := range nodes {
			content = append(content, htmlquery.InnerText(node))
		}

		// Join the extracted content if multiple nodes were found
		result[key] = strings.Join(content, " ")
	}

	return result, nil
}
```

Extracting multiple XPath elements requires more complicated code. First, define two maps: one for the extraction rules and one for the result. Each rule has its own key, which is reused as the key in the result map after extraction.

Then, iterate over the filter rules and find the elements for each rule. Extract the content and put it into the result map under the corresponding key.
Here is the usage example:
```go
filter := Rules{
	"Title":          "//title/text()",
	"Header":         "//h1/text()",
	"link_more_info": "//a[contains(text(),'More information')]/@href",
	"link_fb":        "//a[contains(text(),'Another link fb')]/@href",
}

content, err := XPath(html, filter)
if err != nil {
	fmt.Printf("Error: %s\n", err)
}
fmt.Printf("Extracted content: %v\n", content)
```

The result map will be:
```
map[
  Header:Example Domain
  Title:Example Domain
  link_fb:https://fb.com/test
  link_more_info:https://www.iana.org/domains/example
]
```