What is XPath?
XPath (XML Path Language) is a query language for selecting nodes, such as elements and attributes, in an XML document. It provides a syntax for addressing parts of an XML document and is widely used in XML processing, web scraping, and data extraction tasks.
Key Concepts
1. Nodes
XPath treats XML documents as a tree of nodes. There are seven types of nodes:
- Element nodes
- Attribute nodes
- Text nodes
- Namespace nodes
- Processing instruction nodes
- Comment nodes
- Document (root) nodes
2. Path Expressions
XPath uses path expressions to select nodes or node-sets in XML documents. Examples:
- /bookstore/book - Selects all book elements that are children of bookstore
- //title - Selects all title elements anywhere in the document
- @lang - Selects attributes named "lang"
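As a rough illustration, the three expressions above can be tried with Python's standard-library `xml.etree.ElementTree`, which implements only a limited subset of XPath. The sample document and its contents are invented for this sketch, and the paths are written relative to the root element because ElementTree does not accept absolute paths on an element.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document, invented for illustration.
XML = """
<bookstore>
  <book category="web">
    <title lang="en">Learning XML</title>
    <price>39.95</price>
  </book>
  <book category="cooking">
    <title lang="en">Everyday Cooking</title>
    <price>29.99</price>
  </book>
</bookstore>
"""

root = ET.fromstring(XML)  # root is the <bookstore> element

# /bookstore/book: ElementTree paths are relative to the element they are
# called on, so from <bookstore> the equivalent path is "./book".
books = root.findall("./book")

# //title: ".//title" searches all descendants of the starting element.
titles = [t.text for t in root.findall(".//title")]

# @lang: ElementTree cannot return attribute nodes from a path, so the
# value is read from each selected element instead.
langs = [t.get("lang") for t in root.findall(".//title[@lang]")]

print(len(books), titles, langs)
```

A full XPath engine would return the attribute nodes directly for `//title/@lang`; reading the value with `get()` is a workaround specific to this library.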
3. Axes
Axes define the relationship between the context node and the nodes a step selects. Commonly used axes:
- child (default)
- parent
- ancestor
- descendant
- following-sibling
- preceding-sibling
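The sketch below shows how a few of these axes appear in abbreviated path syntax, again using Python's standard-library `xml.etree.ElementTree`. That module does not support the named-axis form (such as `ancestor::` or `following-sibling::`), so only the abbreviations it understands are shown. The sample document is invented for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document, invented for illustration.
root = ET.fromstring(
    "<bookstore>"
    "<book><title>Learning XML</title><price>39.95</price></book>"
    "<book><title>Everyday Cooking</title><price>29.99</price></book>"
    "</bookstore>"
)

# child axis (the default): a bare name such as "book" means child::book.
children = [el.tag for el in root.findall("book")]

# "//" abbreviates a descendant-or-self step, so ".//price" reaches
# price elements at any depth below the starting element.
prices = [el.text for el in root.findall(".//price")]

# parent axis: ".." steps from each title up to its parent element.
parents = [el.tag for el in root.findall(".//title/..")]

print(children, prices, parents)
```

Sibling and ancestor axes need a complete XPath 1.0 implementation; the expressions themselves would look like `title/following-sibling::price` or `title/ancestor::bookstore`.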
4. Predicates
Predicates are written in square brackets and filter the selected nodes by position or by a condition:
- /bookstore/book[1] - First book element
- /bookstore/book[price>35] - Books with price > 35
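A minimal sketch of positional and value-based filtering, using Python's standard-library `xml.etree.ElementTree`. That module supports positional predicates such as `[1]` and `[last()]` but not comparisons such as `[price>35]`, so the comparison is done in Python here as a stand-in. The sample data is invented for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document, invented for illustration.
root = ET.fromstring(
    "<bookstore>"
    "<book><title>Learning XML</title><price>39.95</price></book>"
    "<book><title>Everyday Cooking</title><price>29.99</price></book>"
    "<book><title>Data Modeling</title><price>49.99</price></book>"
    "</bookstore>"
)

# book[1]: XPath positions are 1-based, so this is the first book.
first = root.find("./book[1]/title").text

# book[last()]: the final book among its siblings.
last = root.find("./book[last()]/title").text

# book[price>35]: not supported by ElementTree's XPath subset, so select
# books that have a price child and compare the value in Python.
expensive = [
    book.findtext("title")
    for book in root.findall("./book[price]")
    if float(book.findtext("price")) > 35
]

print(first, last, expensive)
```

With a full XPath engine the last step collapses to the single expression `/bookstore/book[price>35]/title`.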
5. Functions
XPath includes built-in functions for:
- String manipulation (concat(), substring())
- Numeric operations (sum(), round())
- Boolean logic (not(), true())
- Node sets and position (count(), last())
XPath Versions
| Version | Year | Key Features |
|---|---|---|
| XPath 1.0 | 1999 | Basic path expressions |
| XPath 2.0 | 2007 | Extended data types, sequences |
| XPath 3.0 | 2014 | Higher-order functions, inline function expressions |
| XPath 3.1 | 2017 | Maps and arrays, JSON support |
Common Use Cases
- XML document navigation and data extraction
- Web scraping (often used with Selenium, lxml, or Scrapy)
- XSLT transformations
- XQuery and XPointer operations
- XML Schema (XSD) identity constraints and assertions
Example Syntax
//div[@class='product']/h2/text() # Get text from h2 in product divs
/bookstore/book[last()] # Select last book element
//*[contains(@class,'warning')] # Find elements whose class attribute contains 'warning'
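The first and third expressions can be approximated with Python's standard-library `xml.etree.ElementTree`, assuming the markup is well-formed XML. The page fragment below is invented for illustration. ElementTree supports neither `text()` nor `contains()`, so both are replaced with plain Python in this sketch.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed page fragment invented for illustration.
page = ET.fromstring(
    "<html><body>"
    "<div class='product'><h2>Widget</h2></div>"
    "<div class='product'><h2>Gadget</h2></div>"
    "<p class='note warning'>Low stock</p>"
    "</body></html>"
)

# //div[@class='product']/h2/text(): select the h2 elements, then read
# their text, because text() is not part of ElementTree's XPath subset.
names = [h2.text for h2 in page.findall(".//div[@class='product']/h2")]

# //*[contains(@class,'warning')]: contains() is unavailable here, so the
# substring test runs in Python over every element in the tree.
flagged = [el.tag for el in page.iter() if "warning" in el.get("class", "")]

print(names, flagged)
```

Real-world HTML is often not well-formed, in which case an HTML-aware parser with full XPath support is the more practical choice.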
Advantages
- Concise syntax for document navigation
- Platform and language independent
- Supported by most XML parsers and web scraping tools
- Powerful filtering capabilities
Limitations
- XPath 1.0 has only four data types (node-set, string, number, boolean)
- Complex expressions can become difficult to read
- Expressions that scan the whole tree, such as those starting with //, can be slow on large documents