What is XPath?

XPath (XML Path Language) is a query language for selecting nodes, such as elements and attributes, in an XML document. It provides a compact syntax for addressing parts of a document and is widely used in XML processing, web scraping, and data extraction.

Key Concepts

1. Nodes

XPath treats XML documents as a tree of nodes. There are seven types of nodes:

  • Element nodes
  • Attribute nodes
  • Text nodes
  • Namespace nodes
  • Processing instruction nodes
  • Comment nodes
  • Document (root) nodes

2. Path Expressions

XPath uses path expressions to select nodes or node-sets in XML documents. Examples:

  • /bookstore/book - Selects all book elements that are children of bookstore
  • //title - Selects all title elements anywhere in the document
  • @lang - Selects the lang attribute of the context node
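The path expressions above can be tried with Python's standard-library xml.etree.ElementTree module, which implements a limited subset of XPath. This is a minimal sketch: the sample document and its values are hypothetical, and because ElementTree evaluates paths relative to the element they are called on, the absolute paths are rewritten with a leading "." here.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document, used only for illustration.
XML = """
<bookstore>
  <book>
    <title lang="en">Learning XML</title>
    <price>39.95</price>
  </book>
  <book>
    <title lang="fr">Le Petit Prince</title>
    <price>12.50</price>
  </book>
</bookstore>
"""

root = ET.fromstring(XML)  # root is the <bookstore> element

# /bookstore/book: from <bookstore>, the equivalent relative path is "./book".
books = root.findall("./book")

# //title: ".//title" searches all descendants of the root element.
titles = [t.text for t in root.findall(".//title")]

# @lang: select titles that carry a lang attribute, then read its value.
langs = [t.get("lang") for t in root.findall(".//title[@lang]")]

print(len(books), titles, langs)
```

ElementTree returns element objects rather than attribute nodes, which is why the attribute value is read with get() after the match.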

3. Axes

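Axes define the relationship between the context node and the nodes a step selects. XPath 1.0 defines 13 axes; the most commonly used are: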

  • child (default)
  • parent
  • ancestor
  • descendant
  • following-sibling
  • preceding-sibling
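A short sketch of three of these axes, using Python's standard-library xml.etree.ElementTree. That module accepts only the abbreviated forms (a bare name for child, ".." for parent, "//" for descendants) and does not support named axes such as following-sibling, so those are not shown. The sample document and the id values are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document, used only for illustration.
root = ET.fromstring(
    "<bookstore>"
    "<book id='b1'><title>Learning XML</title></book>"
    "<book id='b2'><title>XPath Basics</title><note>used</note></book>"
    "</bookstore>"
)

# child axis (the default): "book" is short for child::book.
children = [b.get("id") for b in root.findall("book")]

# descendants: ".//title" reaches title elements at any depth below root.
descendants = [t.text for t in root.findall(".//title")]

# parent axis: ".." steps from each matched <note> up to its parent <book>.
parents = [p.get("id") for p in root.findall(".//note/..")]

print(children, descendants, parents)
```

A full XPath engine would also accept the unabbreviated spellings, for example child::book or parent::node().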

4. Predicates

Predicates, written in square brackets, filter a node-set by position or by a condition:

  • /bookstore/book[1] - First book element
  • /bookstore/book[price>35] - Books with price > 35
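The two predicates above can be approximated with Python's standard-library xml.etree.ElementTree. This is a sketch under stated assumptions: the sample data is hypothetical, and since ElementTree supports positional predicates but not numeric comparisons such as price>35, that filter is performed in plain Python instead of inside the path.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document, used only for illustration.
root = ET.fromstring(
    "<bookstore>"
    "<book><title>Learning XML</title><price>39.95</price></book>"
    "<book><title>XPath Basics</title><price>29.99</price></book>"
    "<book><title>XSLT Cookbook</title><price>49.00</price></book>"
    "</bookstore>"
)

# book[1]: XPath positions are 1-based, so this is the first book.
first = root.find("book[1]/title").text

# book[price>35]: not supported by ElementTree, so the comparison
# is applied in Python to each <book> child instead.
expensive = [
    b.findtext("title")
    for b in root.findall("book")
    if float(b.findtext("price")) > 35
]

print(first, expensive)
```

With a full XPath 1.0 engine, the second selection could be written directly as /bookstore/book[price>35]/title.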

5. Functions

XPath includes built-in functions for:

  • String manipulation (concat(), substring())
  • Numeric operations (sum(), round())
  • Boolean logic (not(), true())
  • Node-set functions (count(), last())
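To make the results of these functions concrete, the sketch below computes in plain Python what a few of them would return. It is not an XPath evaluation: Python's standard-library xml.etree.ElementTree does not implement function calls apart from last() inside a positional predicate, so the XPath expression each line stands in for is given in a comment. The sample data is hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document, used only for illustration.
root = ET.fromstring(
    "<bookstore>"
    "<book><title>Learning XML</title><price>39.95</price></book>"
    "<book><title>XPath Basics</title><price>29.99</price></book>"
    "</bookstore>"
)
books = root.findall("book")

# Stand-in for count(/bookstore/book)
book_count = len(books)

# Stand-in for sum(/bookstore/book/price)
total = sum(float(p.text) for p in root.findall("book/price"))

# Stand-in for concat(title, ' costs ', price) on the first book
label = books[0].findtext("title") + " costs " + books[0].findtext("price")

# /bookstore/book[last()]/title: last() is accepted by ElementTree here.
last_title = root.find("book[last()]/title").text

print(book_count, round(total, 2), label, last_title)
```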

XPath Versions

Version     Year   Key Features
XPath 1.0   1999   Basic path expressions
XPath 2.0   2007   Extended data types, sequences
XPath 3.0   2014   Higher-order functions, inline functions
XPath 3.1   2017   Maps, arrays, JSON support

Common Use Cases

  1. XML document navigation and data extraction
  2. Web scraping (often with lxml, Scrapy, or Selenium)
  3. XSLT transformations
  4. XQuery and XPointer operations
  5. XML Schema (XSD) identity constraints and assertions

Example Syntax

//div[@class='product']/h2/text()  # Get text from h2 in product divs
/bookstore/book[last()]           # Select last book element
//*[contains(@class,'warning')]   # Find elements whose class attribute contains 'warning'

Advantages

  • Concise syntax for document navigation
  • Platform and language independent
  • Supported by most XML parsers and web scraping tools
  • Powerful filtering capabilities

Limitations

  • XPath 1.0 has only four data types (node-set, boolean, number, string)
  • Complex expressions can become difficult to read
  • Performance can degrade on large documents, especially with broad searches such as //