Web Scraping in Python

1. What is Web Scraping?

  • Automated extraction of data from websites
  • Converts unstructured web data into structured format
  • Common use cases: Price monitoring, market research, lead generation

2. Core Components

2.1 HTTP Requests

  • Libraries:

    • requests (simple HTTP requests)
    • urllib (built-in library)
  • Example:

    import requests
    response = requests.get('https://example.com')
    html_content = response.text
    

2.2 HTML Parsing

  • Popular Parsers:

    • BeautifulSoup (most popular)
    • lxml (fast XML/HTML parser)
    • pyquery (jQuery-like syntax)
  • BeautifulSoup Example:

    from bs4 import BeautifulSoup
    soup = BeautifulSoup(html_content, 'lxml')  # 'lxml' must be installed; 'html.parser' is the built-in fallback
    title = soup.find('h1').text                # raises AttributeError if the page has no <h1>
    

2.3 Data Extraction

  • CSS Selectors:

    soup.select('div.product > h3.title')
    
  • XPath (with lxml):

    from lxml import html
    tree = html.fromstring(html_content)
    prices = tree.xpath('//span[@class="price"]/text()')
    

3. Handling Dynamic Content

  • JavaScript-rendered pages:

    • Tools:
      • Selenium (browser automation)
      • Playwright (modern alternative to Selenium; see the sketch below)
      • Scrapy-Splash (for Scrapy integration)
  • Selenium Example:

    from selenium import webdriver
    driver = webdriver.Chrome()              # launches a local Chrome browser
    driver.get('https://example.com')        # loads the page and runs its JavaScript
    dynamic_content = driver.page_source     # HTML after rendering
    driver.quit()                            # always close the browser when done
    
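  • Playwright Example (a minimal sketch using the synchronous API; assumes Playwright and its browser binaries are installed):

    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch()         # headless Chromium by default
        page = browser.new_page()
        page.goto('https://example.com')      # loads the page and runs its JavaScript
        dynamic_content = page.content()      # HTML after rendering
        browser.close()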

4. Common Challenges & Solutions

  Challenge        Solution
  Pagination       Loop through page numbers (see the sketch below)
  Rate Limiting    Add delays between requests
  CAPTCHAs         Use CAPTCHA-solving services
  Honeypot Traps   Avoid interacting with hidden fields
  IP Blocking      Rotate proxies and user-agents
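
  • Pagination sketch (the URL pattern and page count are placeholders):

    import time
    import requests

    for page in range(1, 6):                      # pages 1..5 of a hypothetical listing
        url = f'https://example.com/products?page={page}'
        response = requests.get(url, timeout=10)
        print(page, response.status_code)         # parse/extract here instead of printing
        time.sleep(2)                             # polite delay between requests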

5. Best Practices

  1. Check robots.txt: Respect the website's scraping policies (a check sketch follows)
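
    A minimal check using the standard library's urllib.robotparser (the URLs are placeholders):

    from urllib.robotparser import RobotFileParser

    rp = RobotFileParser()
    rp.set_url('https://example.com/robots.txt')
    rp.read()                                              # fetch and parse robots.txt
    print(rp.can_fetch('*', 'https://example.com/page'))   # True if scraping this path is allowed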

  2. Rate Limiting: Add delays between requests

    import time
    time.sleep(2)  # 2-second delay
    
  3. User-Agent Rotation:

    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) ...'
    }
    
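    A rotation sketch (the user-agent strings are illustrative placeholders):

    import random

    user_agents = [
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64) ...',
        'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) ...',
        'Mozilla/5.0 (X11; Linux x86_64) ...',
    ]
    headers = {'User-Agent': random.choice(user_agents)}   # pick a different one per request
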
  4. Error Handling:

    try:
        response = requests.get(url, timeout=10)
    except requests.exceptions.RequestException as e:
        print(f"Request failed: {e}")
    
6. Scraping Frameworks

  • Scrapy (see the minimal spider sketch below):
    • Full-featured framework
    • Built-in support for:
      • Request throttling
      • Pipeline processing
      • Export formats (JSON/CSV/XML)
  • PySpider:
    • Distributed architecture
    • Web-based UI
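
  • Minimal Scrapy spider (a sketch; spider name, start URL, and selectors are placeholders):

    import scrapy

    class ProductSpider(scrapy.Spider):
        name = 'products'
        start_urls = ['https://example.com/products']

        def parse(self, response):
            for product in response.css('div.product'):
                yield {
                    'title': product.css('h3.title::text').get(),
                    'price': product.css('span.price::text').get(),
                }

    Run with: scrapy runspider spider.py -o products.json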

7. Example Workflow

  1. Identify target website
  2. Inspect page structure (DevTools)
  3. Write extraction logic
  4. Implement data storage
  5. Add error handling
  6. Schedule/maintain scraper
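
  • End-to-end sketch combining the steps above (the URL and selectors are placeholders):

    import csv
    import requests
    from bs4 import BeautifulSoup

    def scrape(url):
        response = requests.get(url, timeout=10)
        response.raise_for_status()                   # step 5: surface HTTP errors
        soup = BeautifulSoup(response.text, 'lxml')   # step 3: extraction logic
        return [{'title': tag.get_text(strip=True)}
                for tag in soup.select('div.product h3.title')]

    rows = scrape('https://example.com/products')     # steps 1-2: target chosen, structure inspected
    with open('products.csv', 'w', newline='') as f:  # step 4: data storage
        writer = csv.DictWriter(f, fieldnames=['title'])
        writer.writeheader()
        writer.writerows(rows)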

8. Storage Options

  • CSV files
  • Databases (SQL/NoSQL)
  • Cloud storage (S3, Google Drive)
  • JSON files
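
  • Storage sketch using the built-in sqlite3 module (database path, table, and rows are placeholders):

    import sqlite3

    conn = sqlite3.connect('scraped.db')
    conn.execute('CREATE TABLE IF NOT EXISTS products (title TEXT, price TEXT)')
    conn.executemany('INSERT INTO products VALUES (?, ?)',
                     [('Widget', '9.99'), ('Gadget', '19.99')])   # rows would come from the scraper
    conn.commit()
    conn.close()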

9. Advanced Techniques

  • Parallel scraping (multithreading; see the sketch below)
  • Distributed scraping (Scrapy Cluster)
  • Machine learning for data extraction
  • Browser fingerprint rotation
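
  • Parallel scraping sketch with concurrent.futures (URLs are placeholders; keep per-site concurrency modest):

    import requests
    from concurrent.futures import ThreadPoolExecutor

    urls = [f'https://example.com/products?page={n}' for n in range(1, 6)]

    def fetch(url):
        return requests.get(url, timeout=10).text     # one download per worker thread

    with ThreadPoolExecutor(max_workers=4) as pool:   # at most 4 concurrent requests
        pages = list(pool.map(fetch, urls))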