Web Scraping in Python

1. What is Web Scraping?

  • Automated extraction of data from websites
  • Converts unstructured web data into structured format
  • Common use cases: Price monitoring, market research, lead generation

2. Core Components

2.1 HTTP Requests

  • Libraries:

    • requests (simple HTTP requests)
    • urllib (built-in library)
  • Example:

    import requests
    response = requests.get('https://example.com')
    html_content = response.text
    

2.2 HTML Parsing

  • Popular Parsers:

    • BeautifulSoup (most popular)
    • lxml (fast XML/HTML parser)
    • pyquery (jQuery-like syntax)
  • BeautifulSoup Example:

    from bs4 import BeautifulSoup
    soup = BeautifulSoup(html_content, 'lxml')  # 'lxml' must be installed; 'html.parser' is the built-in fallback
    title = soup.find('h1').text                # raises AttributeError if the page has no <h1>
    

2.3 Data Extraction

  • CSS Selectors:

    soup.select('div.product > h3.title')
    
  • XPath (with lxml):

    from lxml import html
    tree = html.fromstring(html_content)
    prices = tree.xpath('//span[@class="price"]/text()')
    

3. Handling Dynamic Content

  • JavaScript-rendered pages:

    • Tools:
      • Selenium (browser automation)
      • Playwright (modern alternative to Selenium; see the sketch below)
      • Scrapy-Splash (for Scrapy integration)
  • Selenium Example:

    from selenium import webdriver
    driver = webdriver.Chrome()              # launches a local Chrome browser
    driver.get('https://example.com')        # loads the page and runs its JavaScript
    dynamic_content = driver.page_source     # HTML after rendering
    driver.quit()                            # always close the browser when done
    
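  • Playwright Example (a minimal sketch using the synchronous API; assumes Playwright and its browser binaries are installed):

    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch()         # headless Chromium by default
        page = browser.new_page()
        page.goto('https://example.com')      # loads the page and runs its JavaScript
        dynamic_content = page.content()      # HTML after rendering
        browser.close()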

4. Common Challenges & Solutions

  Challenge        Solution
  Pagination       Loop through page numbers (see the sketch below)
  Rate Limiting    Add delays between requests
  CAPTCHAs         Use CAPTCHA-solving services
  Honeypot Traps   Avoid interacting with hidden fields
  IP Blocking      Rotate proxies and user-agents
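
  • Pagination sketch (the URL pattern and page count are placeholders):

    import time
    import requests

    for page in range(1, 6):                      # pages 1..5 of a hypothetical listing
        url = f'https://example.com/products?page={page}'
        response = requests.get(url, timeout=10)
        print(page, response.status_code)         # parse/extract here instead of printing
        time.sleep(2)                             # polite delay between requests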

5. Best Practices

  1. Check robots.txt: Respect the website's scraping policies (a check sketch follows)
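
    A minimal check using the standard library's urllib.robotparser (the URLs are placeholders):

    from urllib.robotparser import RobotFileParser

    rp = RobotFileParser()
    rp.set_url('https://example.com/robots.txt')
    rp.read()                                              # fetch and parse robots.txt
    print(rp.can_fetch('*', 'https://example.com/page'))   # True if scraping this path is allowed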

  2. Rate Limiting: Add delays between requests

    import time
    time.sleep(2)  # 2-second delay
    
  3. User-Agent Rotation:

    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) ...'
    }
    
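    A rotation sketch (the user-agent strings are illustrative placeholders):

    import random

    user_agents = [
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64) ...',
        'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) ...',
        'Mozilla/5.0 (X11; Linux x86_64) ...',
    ]
    headers = {'User-Agent': random.choice(user_agents)}   # pick a different one per request
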
  4. Error Handling:

    try:
        response = requests.get(url, timeout=10)
    except requests.exceptions.RequestException as e:
        print(f"Request failed: {e}")
    
6. Scraping Frameworks

  • Scrapy (see the minimal spider sketch below):
    • Full-featured framework
    • Built-in support for:
      • Request throttling
      • Pipeline processing
      • Export formats (JSON/CSV/XML)
  • PySpider:
    • Distributed architecture
    • Web-based UI
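
  • Minimal Scrapy spider (a sketch; spider name, start URL, and selectors are placeholders):

    import scrapy

    class ProductSpider(scrapy.Spider):
        name = 'products'
        start_urls = ['https://example.com/products']

        def parse(self, response):
            for product in response.css('div.product'):
                yield {
                    'title': product.css('h3.title::text').get(),
                    'price': product.css('span.price::text').get(),
                }

    Run with: scrapy runspider spider.py -o products.json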

7. Example Workflow

  1. Identify target website
  2. Inspect page structure (DevTools)
  3. Write extraction logic
  4. Implement data storage
  5. Add error handling
  6. Schedule/maintain scraper
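
  • End-to-end sketch combining the steps above (the URL and selectors are placeholders):

    import csv
    import requests
    from bs4 import BeautifulSoup

    def scrape(url):
        response = requests.get(url, timeout=10)
        response.raise_for_status()                   # step 5: surface HTTP errors
        soup = BeautifulSoup(response.text, 'lxml')   # step 3: extraction logic
        return [{'title': tag.get_text(strip=True)}
                for tag in soup.select('div.product h3.title')]

    rows = scrape('https://example.com/products')     # steps 1-2: target chosen, structure inspected
    with open('products.csv', 'w', newline='') as f:  # step 4: data storage
        writer = csv.DictWriter(f, fieldnames=['title'])
        writer.writeheader()
        writer.writerows(rows)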

8. Storage Options

  • CSV files
  • Databases (SQL/NoSQL)
  • Cloud storage (S3, Google Drive)
  • JSON files
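
  • Storage sketch using the built-in sqlite3 module (database path, table, and rows are placeholders):

    import sqlite3

    conn = sqlite3.connect('scraped.db')
    conn.execute('CREATE TABLE IF NOT EXISTS products (title TEXT, price TEXT)')
    conn.executemany('INSERT INTO products VALUES (?, ?)',
                     [('Widget', '9.99'), ('Gadget', '19.99')])   # rows would come from the scraper
    conn.commit()
    conn.close()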

9. Advanced Techniques

  • Parallel scraping (multithreading; see the sketch below)
  • Distributed scraping (Scrapy Cluster)
  • Machine learning for data extraction
  • Browser fingerprint rotation
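
  • Parallel scraping sketch with concurrent.futures (URLs are placeholders; keep per-site concurrency modest):

    import requests
    from concurrent.futures import ThreadPoolExecutor

    urls = [f'https://example.com/products?page={n}' for n in range(1, 6)]

    def fetch(url):
        return requests.get(url, timeout=10).text     # one download per worker thread

    with ThreadPoolExecutor(max_workers=4) as pool:   # at most 4 concurrent requests
        pages = list(pool.map(fetch, urls))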