Web Scraping

What is XPath?

XPath (XML Path Language) is a query language for selecting nodes, such as elements and attributes, in an XML document. It provides a syntax for addressing parts of a document and is widely used in XML processing, web scraping, and data extraction tasks.

Key Concepts

1. Nodes

XPath treats XML documents as a tree of nodes. There are seven types of nodes:

Element nodes
Attribute nodes
Text nodes
Namespace nodes
Processing instruction nodes
Comment nodes
Document (root) nodes
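
To make the node kinds concrete, here is a minimal sketch that walks a small document with Python's standard xml.dom.minidom module. It assumes the DOM's node kinds are a close enough stand-in for XPath's (the two models differ in details, and the DOM has no namespace nodes, so that kind is not shown). The sample XML and the node_kinds helper are made up for illustration.

```python
from xml.dom import Node, minidom

# Hypothetical sample document, written without whitespace between tags
# so that no whitespace-only text nodes appear.
XML = (
    '<?xml-stylesheet type="text/xsl" href="style.xsl"?>'
    "<bookstore>"
    "<!-- featured titles -->"
    '<book lang="en">XPath Basics</book>'
    "</bookstore>"
)

KIND_NAMES = {
    Node.DOCUMENT_NODE: "document",
    Node.ELEMENT_NODE: "element",
    Node.ATTRIBUTE_NODE: "attribute",
    Node.TEXT_NODE: "text",
    Node.COMMENT_NODE: "comment",
    Node.PROCESSING_INSTRUCTION_NODE: "processing-instruction",
}


def node_kinds(node):
    """Return the kind of this node and of everything below it."""
    kinds = [KIND_NAMES[node.nodeType]]
    if node.nodeType == Node.ELEMENT_NODE:
        # Attributes hang off their element; they are not child nodes.
        attrs = node.attributes
        kinds += [KIND_NAMES[attrs.item(i).nodeType] for i in range(attrs.length)]
    for child in node.childNodes:
        kinds += node_kinds(child)
    return kinds


kinds = node_kinds(minidom.parseString(XML))
print(kinds)
```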

2. Path Expressions

XPath uses path expressions to select nodes or node-sets in XML documents. Examples:

/bookstore/book - Selects all book elements that are children of bookstore
//title - Selects all title elements anywhere in the document
@lang - Selects the lang attribute of the current node
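
The path expressions above can be tried with Python's standard xml.etree.ElementTree module, which implements only a limited subset of XPath. This is a sketch under that assumption: paths are relative to the element they are called on, so the absolute forms are rewritten accordingly. The bookstore document is invented for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data
XML = """
<bookstore>
  <book>
    <title lang="en">XPath Basics</title>
    <price>29.99</price>
  </book>
  <book>
    <title lang="fr">Advanced XML</title>
    <price>39.95</price>
  </book>
</bookstore>
"""

root = ET.fromstring(XML)  # root is the <bookstore> element

# /bookstore/book: starting from <bookstore>, the equivalent is "book"
books = root.findall("book")

# //title: ".//title" searches every descendant of the current element
titles = [t.text for t in root.findall(".//title")]

# @lang: ElementTree returns elements only, so the attribute value is
# read with .get() after the element has been selected
langs = [t.get("lang") for t in root.findall(".//title")]

print(len(books), titles, langs)
```

A full XPath engine would return the attribute nodes directly for //title/@lang; the two-step approach here is a consequence of ElementTree's subset.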

3. Axes

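An axis defines the relationship between the current (context) node and the nodes a step selects. XPath 1.0 defines 13 axes; the most commonly used are: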

child (default)
parent
ancestor
descendant
following-sibling
preceding-sibling
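
As a rough illustration of these axes, the sketch below uses Python's standard xml.etree.ElementTree. Its XPath subset covers the child, descendant, and parent axes only (through the abbreviated forms "name", ".//", and ".."), so the sibling axes are emulated with list slicing rather than expressed in XPath. The document and its id values are made up.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data
root = ET.fromstring(
    "<bookstore>"
    "<book id='b1'><title>A</title></book>"
    "<book id='b2'><title>B</title></book>"
    "<book id='b3'><title>C</title></book>"
    "</bookstore>"
)

# child axis (the default): direct children named "book"
children = [b.get("id") for b in root.findall("book")]

# descendant axis: every element below <bookstore>, at any depth
descendants = [e.tag for e in root.findall(".//*")]

# parent axis: ".." steps from each <title> up to its <book>
parents = [p.get("id") for p in root.findall(".//title/..")]

# following-sibling and preceding-sibling are not in ElementTree's
# subset, so they are emulated by slicing the parent's child list.
siblings = list(root)
position = siblings.index(root.find("book[@id='b2']"))
following = [b.get("id") for b in siblings[position + 1:]]
preceding = [b.get("id") for b in siblings[:position]]

print(children, parents, following, preceding)
```

With a complete XPath implementation the last step would be a single expression such as //book[@id='b2']/following-sibling::book.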

4. Predicates

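Predicates are written in square brackets and filter the nodes selected by a step:

/bookstore/book[1] - The first book child of bookstore (positions start at 1, not 0)
/bookstore/book[price>35] - Books whose price child element has a value greater than 35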
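
Both predicates can be approximated with Python's standard xml.etree.ElementTree, with one caveat: its XPath subset supports positional predicates but not numeric comparisons, so the price test is applied in Python in this sketch. The sample titles and prices are invented.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data
root = ET.fromstring("""
<bookstore>
  <book><title>XPath Basics</title><price>29.99</price></book>
  <book><title>Advanced XML</title><price>39.95</price></book>
  <book><title>XSLT Cookbook</title><price>49.99</price></book>
</bookstore>
""")

# /bookstore/book[1]: positions are 1-based, so [1] is the first book
first = root.find("book[1]").findtext("title")

# /bookstore/book[price>35]: comparison predicates are outside
# ElementTree's subset, so the numeric test is done in Python.
expensive = [
    book.findtext("title")
    for book in root.findall("book")
    if float(book.findtext("price")) > 35
]

print(first, expensive)
```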

5. Functions

XPath includes built-in functions for:

String manipulation (concat(), substring())
Numeric operations (sum(), round())
Boolean logic (not(), true())
Node-set functions (count(), last())

XPath Versions

Version	Year	Key Features
XPath 1.0	1999	Path expressions, four data types (node-set, string, number, boolean)
XPath 2.0	2007	Extended data types, sequences
XPath 3.0	2014	Higher-order functions, inline functions, string concatenation operator (||)
XPath 3.1	2017	Maps, arrays, and JSON support

Common Use Cases

XML document navigation and data extraction
Web scraping (often used with Selenium, lxml, or Scrapy; BeautifulSoup itself does not support XPath)
XSLT transformations
XQuery and XPointer operations
XML Schema (XSD) identity constraints (xs:key, xs:unique, xs:keyref) and XSD 1.1 assertions

Example Syntax

//div[@class='product']/h2/text()  # Text of h2 children of divs whose class is exactly 'product'
/bookstore/book[last()]           # Select last book element
//*[contains(@class,'warning')]   # Elements whose class attribute contains the substring 'warning'
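
For comparison, here is a sketch of how these selections might look with Python's standard xml.etree.ElementTree, assuming the markup is well-formed XML (real-world HTML often is not and usually needs an HTML-aware parser). ElementTree has no text() step and no contains() function, so those parts are handled in Python. The page content is hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample page
page = ET.fromstring("""
<html><body>
  <div class="product"><h2>Widget</h2></div>
  <div class="product sale"><h2>Gadget</h2></div>
  <p class="warning big">Low stock</p>
</body></html>
""")

# //div[@class='product']/h2/text(): the attribute test is an exact
# string match, so the "product sale" div is not selected.
names = [h2.text for h2 in page.findall(".//div[@class='product']/h2")]

# [last()]: the last <div> among the children of <body>
last_div_class = page.find("body/div[last()]").get("class")

# //*[contains(@class,'warning')]: emulated with a substring test
# over every element in the tree.
warnings = [e.tag for e in page.iter() if "warning" in e.get("class", "")]

print(names, last_div_class, warnings)
```

The exact-match behaviour of @class='product' is a common surprise when scraping pages whose elements carry several classes.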

Advantages

Concise syntax for document navigation
Platform and language independent
Supported by most XML libraries and web scraping tools, though many implement only XPath 1.0
Powerful filtering capabilities

Limitations

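XPath 1.0 has only four data types (node-set, string, number, boolean) and no support for dates, sequences, or regular expressions
Complex expressions can become difficult to read
Expressions that start with // or traverse many axes can be slow on large documents, because they may visit every node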