Web Scraping in Python
1. What is Web Scraping?
- Automated extraction of data from websites
- Converts unstructured web data into structured format
- Common use cases: Price monitoring, market research, lead generation
2. Core Components
2.1 HTTP Requests
- Libraries:
    - requests (simple HTTP requests)
    - urllib (built-in library)
- Example:

    import requests

    response = requests.get('https://example.com')
    html_content = response.text
2.2 HTML Parsing
- Popular Parsers:
    - BeautifulSoup (most popular)
    - lxml (fast XML/HTML parser)
    - pyquery (jQuery-like syntax)
- BeautifulSoup Example:

    from bs4 import BeautifulSoup

    soup = BeautifulSoup(html_content, 'lxml')
    title = soup.find('h1').text
2.3 Data Extraction
- CSS Selectors:

    soup.select('div.product > h3.title')

- XPath (with lxml):

    from lxml import html

    tree = html.fromstring(html_content)
    prices = tree.xpath('//span[@class="price"]/text()')
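Putting the CSS-selector approach into context, here is a short sketch; the `div.product`, `h3.title`, and `span.price` structure is assumed for illustration, and `html_content` is the page fetched in section 2.1:

```python
# Hedged sketch: extract product titles and prices with CSS selectors.
# The class names are hypothetical placeholders, not from a real site.
from bs4 import BeautifulSoup

soup = BeautifulSoup(html_content, 'lxml')
products = []
for card in soup.select('div.product'):
    products.append({
        'title': card.select_one('h3.title').text.strip(),
        'price': card.select_one('span.price').text.strip(),
    })
```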
3. Handling Dynamic Content
- JavaScript-rendered pages need a real (often headless) browser to execute scripts before the HTML can be parsed
- Tools:
    - Selenium (browser automation)
    - Playwright (modern alternative to Selenium)
    - Scrapy-Splash (for Scrapy integration)
- Selenium Example:

    from selenium import webdriver

    driver = webdriver.Chrome()
    driver.get('https://example.com')
    dynamic_content = driver.page_source  # HTML after JavaScript has run
    driver.quit()
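For comparison, a minimal sketch with Playwright's synchronous API (assumes `pip install playwright` plus `playwright install` to download browser binaries; the URL is a placeholder):

```python
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)   # run without a visible window
    page = browser.new_page()
    page.goto('https://example.com')
    page.wait_for_load_state('networkidle')      # let JavaScript-driven requests settle
    dynamic_content = page.content()             # fully rendered HTML
    browser.close()
```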
4. Common Challenges & Solutions
| Challenge | Solution |
|---|---|
| Pagination | Loop through page numbers |
| Rate Limiting | Use delays between requests |
| CAPTCHAs | CAPTCHA-solving services |
| Honeypot Traps | Avoid hidden fields |
| IP Blocking | Rotate proxies/user-agents |
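A hedged sketch combining two rows of the table above (pagination and rate limiting); the URL pattern, page range, header value, and selector are hypothetical:

```python
# Loop through numbered pages with a fixed delay between requests.
import time
import requests
from bs4 import BeautifulSoup

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) ...'}
all_titles = []

for page in range(1, 6):                                 # pages 1-5 (assumed range)
    url = f'https://example.com/products?page={page}'    # placeholder pagination URL
    response = requests.get(url, headers=headers, timeout=10)
    soup = BeautifulSoup(response.text, 'lxml')
    all_titles.extend(tag.text.strip() for tag in soup.select('h3.title'))
    time.sleep(2)                                         # simple rate limiting
```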
5. Best Practices
- Check robots.txt: Respect the website's scraping policies
- Rate Limiting: Add delays between requests

    import time

    time.sleep(2)  # 2-second delay

- User-Agent Rotation:

    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) ...'
    }

- Error Handling:

    try:
        response = requests.get(url, timeout=10)
    except requests.exceptions.RequestException as e:
        print(f"Request failed: {e}")
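Putting these practices together, a hedged sketch of a polite fetch helper; the retry count, backoff, and header value are illustrative assumptions:

```python
import time
import requests

def polite_get(url, retries=3, delay=2):
    """Fetch a URL with a custom User-Agent, timeout, and simple retry/backoff."""
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) ...'}
    for attempt in range(retries):
        try:
            response = requests.get(url, headers=headers, timeout=10)
            response.raise_for_status()
            return response
        except requests.exceptions.RequestException as e:
            print(f"Attempt {attempt + 1} failed: {e}")
            time.sleep(delay * (attempt + 1))   # back off a little more each time
    return None
```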
6. Popular Frameworks
- Scrapy:
- Full-featured framework (see the minimal spider sketch after this list)
- Built-in support for:
- Request throttling
- Pipeline processing
- Export formats (JSON/CSV/XML)
- PySpider:
- Distributed architecture
- Web-based UI
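A minimal Scrapy spider sketch, assuming the same hypothetical `div.product` page structure used earlier; the spider name, start URL, and `a.next` pagination selector are placeholders:

```python
# Run with: scrapy runspider product_spider.py -o items.json
import scrapy

class ProductSpider(scrapy.Spider):
    name = 'products'
    start_urls = ['https://example.com/products']

    def parse(self, response):
        for card in response.css('div.product'):
            yield {
                'title': card.css('h3.title::text').get(),
                'price': card.css('span.price::text').get(),
            }
        # Follow the pagination link, if present.
        next_page = response.css('a.next::attr(href)').get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)
```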
7. Example Workflow
- Identify target website
- Inspect page structure (DevTools)
- Write extraction logic
- Implement data storage
- Add error handling
- Schedule/maintain scraper
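The workflow above, condensed into a hedged sketch covering fetching, extraction, and error handling (storage options follow in the next section); the URL and selectors are placeholders:

```python
import requests
from bs4 import BeautifulSoup

def scrape_products(url):
    """Fetch a page and return a list of product dicts (empty list on failure)."""
    try:
        response = requests.get(url, timeout=10)
        response.raise_for_status()
    except requests.exceptions.RequestException as e:
        print(f"Request failed: {e}")
        return []
    soup = BeautifulSoup(response.text, 'lxml')
    return [
        {
            'title': card.select_one('h3.title').text.strip(),
            'price': card.select_one('span.price').text.strip(),
        }
        for card in soup.select('div.product')
    ]

products = scrape_products('https://example.com/products')
```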
8. Storage Options
- CSV files
- Databases (SQL/NoSQL)
- Cloud storage (S3, Google Drive)
- JSON files
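A short sketch of two of these options, writing the same records to CSV and JSON with standard-library modules; file names and field names are illustrative:

```python
import csv
import json

records = [{'title': 'Widget', 'price': '9.99'}]   # example scraped records

# CSV
with open('products.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.DictWriter(f, fieldnames=['title', 'price'])
    writer.writeheader()
    writer.writerows(records)

# JSON
with open('products.json', 'w', encoding='utf-8') as f:
    json.dump(records, f, ensure_ascii=False, indent=2)
```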
9. Advanced Techniques
- Parallel scraping (multithreading)
- Distributed scraping (Scrapy Cluster)
- Machine learning for data extraction
- Browser fingerprint rotation
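A sketch of parallel scraping with a thread pool, which suits the I/O-bound nature of HTTP requests; the URL list and worker count are placeholders:

```python
from concurrent.futures import ThreadPoolExecutor
import requests

urls = [f'https://example.com/products?page={n}' for n in range(1, 6)]

def fetch(url):
    """Download one page and return its HTML."""
    response = requests.get(url, timeout=10)
    return response.text

with ThreadPoolExecutor(max_workers=4) as executor:
    pages = list(executor.map(fetch, urls))   # fetched HTML, in the same order as urls
```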