Sometimes you need to extract data points from HTML using XPath (read What is XPath first). First, install the htmlquery package:

```
go get github.com/antchfx/htmlquery
```

Here is the code; read the explanation after it:
```go
package main

import (
	"bytes"
	"fmt"
	"strings"

	"github.com/antchfx/htmlquery"
)

// ExtractXPath accepts a single XPath expression and returns a single string.
func ExtractXPath(htmlStr string, xpathExpr string) (string, error) {
	// Load the HTML document
	var buffer bytes.Buffer
	buffer.WriteString(htmlStr)
	doc, err := htmlquery.Parse(&buffer)
	if err != nil {
		return "", err
	}

	// Find the nodes matching the XPath expression
	nodes := htmlquery.Find(doc, xpathExpr)
	var content []string

	// Iterate over the nodes and extract the content
	for _, node := range nodes {
		content = append(content, htmlquery.InnerText(node))
	}

	// Join the extracted content if multiple nodes were found
	result := strings.Join(content, " ")

	return result, nil
}

func main() {
	htmlStr := `
<html>
  <head>
    <title>Test Page</title>
  </head>
  <body>
    <div class="content">
      <p>Hello, World!</p>
      <p>This is a test.</p>
    </div>
  </body>
</html>`

	xpathExpr := "//div[@class='content']/p"

	content, err := ExtractXPath(htmlStr, xpathExpr)
	if err != nil {
		fmt.Println("Error:", err)
	} else {
		fmt.Println("Extracted content:", content)
	}
}
```

You will receive the output:
```
Extracted content: Hello, World! This is a test.
```

Luckily, there is an open-source lib, htmlquery, for that. Install it first:

```
go get github.com/antchfx/htmlquery
```

Then, do a basic query against the document:

```go
nodes, err := htmlquery.QueryAll(doc, "//a")
if err != nil {
	panic(`not a valid XPath expression.`)
}
```

See more examples in the docs.
```go
package extract

import (
	"bytes"
	"strings"

	"github.com/antchfx/htmlquery"
)

type Rules = map[string]string
type Content = map[string]string

func XPath(htmlStr string, filter Rules) (Content, error) {
	// Load the HTML document
	var buffer bytes.Buffer
	buffer.WriteString(htmlStr)
	doc, err := htmlquery.Parse(&buffer)
	if err != nil {
		return nil, err
	}

	result := make(Content)

	// Iterate over the filter to apply each XPath expression
	for key, xpathExpr := range filter {
		// Find the nodes matching the XPath expression
		nodes := htmlquery.Find(doc, xpathExpr)
		var content []string

		// Iterate over the nodes and extract the content
		for _, node := range nodes {
			content = append(content, htmlquery.InnerText(node))
		}

		// Join the extracted content if multiple nodes were found
		result[key] = strings.Join(content, " ")
	}

	return result, nil
}
```

Extracting multiple XPath elements requires more complicated code. First, define two maps: one for the extraction rules and one for the result. Each rule has its own key, which is reused as the key in the result map after extraction.

Then, iterate over the filter rules and find the elements for each rule. Extract the content and put it into the result map under the corresponding key.
Here is the usage example:
```go
filter := Rules{
	"Title":          "//title/text()",
	"Header":         "//h1/text()",
	"link_more_info": "//a[contains(text(),'More information')]/@href",
	"link_fb":        "//a[contains(text(),'Another link fb')]/@href",
}

content, err := XPath(html, filter)
if err != nil {
	fmt.Printf("Error: %s\n", err)
}
fmt.Printf("Extracted content: %v\n", content)
```

The result map will be:
```
map[
  Header:Example Domain
  Title:Example Domain
  link_fb:https://fb.com/test
  link_more_info:https://www.iana.org/domains/example
]
```