
Building an AI-Powered Web Scraping Infrastructure: From Traditional Selectors to Vision Models

The Evolution of Web Scraping

Web scraping has evolved dramatically. What once required brittle CSS selectors and constant maintenance now leverages AI to understand pages like humans do. This is the journey of building a production-grade scraping infrastructure that combines traditional reliability with cutting-edge AI capabilities.

The Problem with Traditional Scraping

Traditional web scraping breaks constantly:

# Breaks when class names change
soup.find('div', class_='product-price-2023')

# Breaks when structure changes
driver.find_element(By.XPATH, '/html/body/div[3]/div[2]/span[1]')

# Breaks when site redesigns
data = response.css('.old-layout-selector::text').get()

Real-world impact:

The Modern Solution: AI-Powered Scraping Stack

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                  Scraping Orchestration                  β”‚
β”‚              (N8N Workflows / Airflow)                   β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                     β”‚
          β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
          β”‚          β”‚          β”‚
          β–Ό          β–Ό          β–Ό
    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
    β”‚ Firecrawlβ”‚ β”‚Playwrightβ”‚ β”‚ GPT-4o   β”‚
    β”‚   API    β”‚ β”‚ Stealth  β”‚ β”‚ Vision   β”‚
    β””β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”˜
         β”‚            β”‚             β”‚
         β”‚ Markdown   β”‚ Screenshots β”‚ Visual
         β”‚ + Schema   β”‚ + HTML      β”‚ Understanding
         β”‚            β”‚             β”‚
         β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                      β”‚
                      β–Ό
         β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
         β”‚   AI Processing Layer    β”‚
         β”‚  - Claude 3.5 Sonnet     β”‚
         β”‚  - GPT-4o / GPT-4o-mini  β”‚
         β”‚  - Gemini 1.5 Pro        β”‚
         β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                      β”‚
         β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
         β”‚            β”‚             β”‚
         β–Ό            β–Ό             β–Ό
    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
    β”‚  Schema  β”‚ β”‚Transform β”‚ β”‚ Validate β”‚
    β”‚Generationβ”‚ β”‚ & Clean  β”‚ β”‚ & Store  β”‚
     β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Architecture Components

1. Multi-Tier Scraping Strategy

The infrastructure uses a waterfall approach for maximum reliability: each request starts with the cheapest, fastest tier and falls back to more capable (and more expensive) tiers only when needed. A dispatcher sketch follows the three tier descriptions below.

Tier 1: Firecrawl API (Fastest)

from firecrawl import FirecrawlApp

firecrawl = FirecrawlApp(api_key=FIRECRAWL_API_KEY)

# AI-powered markdown extraction with schema
result = firecrawl.scrape_url(
    url="https://example.com/product",
    formats=['markdown', 'html'],
    actions=[
        {"type": "wait", "milliseconds": 2000},
        {"type": "click", "selector": ".load-more"}
    ],
    extract={
        "schema": {
            "type": "object",
            "properties": {
                "product_name": {"type": "string"},
                "price": {"type": "number"},
                "description": {"type": "string"},
                "images": {"type": "array"}
            }
        }
    }
)

Advantages:

When to use: Standard e-commerce, blogs, news sites, public APIs


Tier 2: Playwright + LLM (Most Flexible)

import base64
import json

import anthropic
from playwright.async_api import async_playwright

async def scrape_with_ai(url: str, extraction_prompt: str):
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        context = await browser.new_context(
            user_agent='Mozilla/5.0...',
            viewport={'width': 1920, 'height': 1080}
        )
        
        page = await context.new_page()
        await page.goto(url, wait_until='networkidle')
        
        # Wait for dynamic content
        await page.wait_for_selector('.main-content', timeout=10000)
        
        # Screenshot for vision models
        screenshot = await page.screenshot(full_page=True)
        
        # Get clean HTML
        html = await page.content()
        
        await browser.close()
        
        # Send to Claude with vision
        client = anthropic.Anthropic(api_key=ANTHROPIC_API_KEY)
        
        response = client.messages.create(
            model="claude-3-5-sonnet-20241022",
            max_tokens=4096,
            messages=[{
                "role": "user",
                "content": [
                    {
                        "type": "image",
                        "source": {
                            "type": "base64",
                            "media_type": "image/png",
                            "data": base64.b64encode(screenshot).decode()
                        }
                    },
                    {
                        "type": "text",
                        "text": f"""Analyze this webpage and extract:
                        {extraction_prompt}
                        
                        Return as JSON with proper structure.
                        HTML for reference: {html[:5000]}"""
                    }
                ]
            }]
        )
        
        return json.loads(response.content[0].text)

Advantages:

When to use: SPAs, protected content, complex interactions, high-value targets


Tier 3: GPT-4o Vision (Visual Understanding)

import base64
import io
import json

import openai
from PIL import Image

def extract_with_vision(screenshot: bytes, prompt: str, target_schema: dict):
    # Optimize image size for API
    img = Image.open(io.BytesIO(screenshot))
    img.thumbnail((2000, 2000), Image.Resampling.LANCZOS)
    
    buffered = io.BytesIO()
    img.save(buffered, format="PNG", optimize=True)
    img_base64 = base64.b64encode(buffered.getvalue()).decode()
    
    response = openai.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": f"""Extract information from this screenshot:
                        {prompt}
                        
                        Return structured JSON matching this schema:
                        {json.dumps(target_schema, indent=2)}"""
                    },
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": f"data:image/png;base64,{img_base64}",
                            "detail": "high"
                        }
                    }
                ]
            }
        ],
        max_tokens=4096,
        temperature=0.1  # Low temperature for consistent extraction
    )
    
    return json.loads(response.choices[0].message.content)

Advantages:

When to use: Heavy client-side rendering, PDFs rendered as images, visual-first content
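
Putting the three tiers together, a simple waterfall dispatcher tries the cheapest tier first and escalates only on failure. The sketch below assumes a thin scrape_with_firecrawl wrapper around the Tier 1 call and a capture_screenshot helper for Tier 3; both are illustrative names, while scrape_with_ai and extract_with_vision are the functions shown above.

import logging

async def scrape_waterfall(url: str, extraction_prompt: str, target_schema: dict) -> dict:
    """Waterfall dispatch: Firecrawl -> Playwright + Claude -> GPT-4o Vision."""
    # Tier 1: Firecrawl (fastest and cheapest for standard pages)
    try:
        return await scrape_with_firecrawl(url, target_schema)  # hypothetical wrapper
    except Exception as e:
        logging.warning(f"Tier 1 (Firecrawl) failed for {url}: {e}")

    # Tier 2: Playwright + Claude (SPAs, protected content)
    try:
        return await scrape_with_ai(url, extraction_prompt)
    except Exception as e:
        logging.warning(f"Tier 2 (Playwright + LLM) failed for {url}: {e}")

    # Tier 3: GPT-4o Vision on a full-page screenshot (last resort)
    screenshot = await capture_screenshot(url)  # hypothetical helper
    return extract_with_vision(screenshot, extraction_prompt, target_schema)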

2. Intelligent Schema Generation

Instead of hardcoding schemas, let AI generate them:

def generate_extraction_schema(sample_urls: list[str], description: str):
    """AI generates optimal extraction schema from examples"""
    
    # Scrape samples
    samples = [scrape_sample(url) for url in sample_urls[:3]]
    
    prompt = f"""
    Analyze these {len(samples)} sample pages and generate an optimal 
    extraction schema.
    
    Goal: {description}
    
    Samples:
    {json.dumps(samples, indent=2)[:3000]}
    
    Generate a JSON schema that:
    1. Captures all relevant fields
    2. Uses appropriate data types
    3. Handles optional fields
    4. Includes validation rules
    
    Return only the JSON schema, no explanation.
    """
    
    response = anthropic_client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=2048,
        messages=[{"role": "user", "content": prompt}]
    )
    
    schema = json.loads(response.content[0].text)
    
    # Validate schema works on samples
    for sample in samples:
        validate_against_schema(sample, schema)
    
    return schema
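
The validate_against_schema helper used above isn't spelled out; a minimal version, assuming the jsonschema package, could look like this:

from jsonschema import Draft7Validator

def validate_against_schema(sample: dict, schema: dict) -> None:
    """Raise ValueError if a scraped sample does not satisfy the generated schema."""
    validator = Draft7Validator(schema)
    errors = list(validator.iter_errors(sample))
    if errors:
        details = "; ".join(error.message for error in errors)
        raise ValueError(f"Schema validation failed: {details}")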

Benefits:

3. Anti-Detection & Stealth

Modern sites have sophisticated bot detection. Combat strategies:

Residential Proxies with Rotation

import random

PROXY_POOL = [
    "http://user:[email protected]:8080",
    "http://user:[email protected]:8080",
    # ... 100+ proxies
]

async def get_rotating_context(playwright):
    proxy = random.choice(PROXY_POOL)
    
    # launch() returns a Browser; open a context on it so we can inject scripts
    browser = await playwright.chromium.launch(
        proxy={"server": proxy},
        args=[
            '--disable-blink-features=AutomationControlled',
            '--disable-dev-shm-usage',
            '--no-sandbox'
        ]
    )
    context = await browser.new_context()
    
    # Inject stealth scripts
    await context.add_init_script("""
        Object.defineProperty(navigator, 'webdriver', {
            get: () => undefined
        });
        
        Object.defineProperty(navigator, 'plugins', {
            get: () => [1, 2, 3, 4, 5]
        });
    """)
    
    return context

Browser Fingerprint Randomization

import random

from playwright_stealth import stealth_async

async def create_stealth_browser(playwright):
    browser = await playwright.chromium.launch(
        headless=True,
        args=['--disable-blink-features=AutomationControlled']
    )
    
    context = await browser.new_context(
        viewport={
            'width': random.randint(1366, 1920),
            'height': random.randint(768, 1080)
        },
        user_agent=random.choice(USER_AGENTS),
        locale=random.choice(['en-US', 'en-GB', 'en-CA']),
        timezone_id=random.choice(TIMEZONES),
        geolocation={'latitude': 40.7128, 'longitude': -74.0060},
        permissions=['geolocation']
    )
    
    # playwright_stealth patches pages rather than contexts, so apply it to a fresh page
    page = await context.new_page()
    await stealth_async(page)
    return context, page

Human-Like Behavior

import asyncio
import random

import numpy as np

async def human_like_scroll(page):
    """Scrolls like a human, not a bot"""
    viewport_height = page.viewport_size['height']
    total_height = await page.evaluate('document.body.scrollHeight')
    
    current_position = 0
    
    while current_position < total_height:
        # Random scroll distance (200-600px)
        scroll_amount = np.random.normal(400, 100)
        scroll_amount = max(200, min(600, scroll_amount))
        
        # Smooth scroll with easing
        await page.evaluate(f"""
            window.scrollBy({{
                top: {scroll_amount},
                behavior: 'smooth'
            }});
        """)
        
        current_position += scroll_amount
        
        # Random pause (0.5-2 seconds)
        await asyncio.sleep(np.random.uniform(0.5, 2.0))
        
        # Occasionally scroll back up (10% chance)
        if random.random() < 0.1:
            await page.evaluate("""
                window.scrollBy({top: -100, behavior: 'smooth'});
            """)
            await asyncio.sleep(0.5)

async def human_like_typing(page, selector, text):
    """Types like a human with realistic delays"""
    await page.click(selector)
    
    for char in text:
        await page.keyboard.type(char)
        # Typing speed: 50-150ms per character
        delay = np.random.normal(100, 30)
        await asyncio.sleep(delay / 1000)
        
        # Occasional longer pause (thinking)
        if random.random() < 0.1:
            await asyncio.sleep(random.uniform(0.5, 1.5))

4. Rate Limiting & Request Management

Adaptive Rate Limiting

import asyncio
import random
from datetime import datetime

class AdaptiveRateLimiter:
    def __init__(self, base_delay: float = 2.0):
        self.base_delay = base_delay
        self.current_delay = base_delay
        self.last_request = None
        self.consecutive_errors = 0
        self.success_streak = 0
        
    async def wait(self):
        """Intelligently waits between requests"""
        if self.last_request:
            elapsed = (datetime.now() - self.last_request).total_seconds()
            remaining = self.current_delay - elapsed
            
            if remaining > 0:
                # Add jitter (Β±20%)
                jitter = remaining * random.uniform(-0.2, 0.2)
                await asyncio.sleep(remaining + jitter)
        
        self.last_request = datetime.now()
    
    def on_success(self):
        """Gradually decrease delay on success"""
        self.consecutive_errors = 0
        self.success_streak += 1
        
        # After 10 successes, decrease delay by 10%
        if self.success_streak >= 10:
            self.current_delay = max(
                self.base_delay * 0.5,  # Never go below 50% of base
                self.current_delay * 0.9
            )
            self.success_streak = 0
    
    def on_error(self, status_code: int = None):
        """Exponentially backoff on errors"""
        self.success_streak = 0
        self.consecutive_errors += 1
        
        if status_code == 429:  # Rate limited
            self.current_delay *= 3
        elif status_code and status_code >= 500:  # Server error
            self.current_delay *= 2
        else:  # Other errors
            self.current_delay *= 1.5
        
        # Cap at 60 seconds
        self.current_delay = min(60, self.current_delay)

# Usage
limiter = AdaptiveRateLimiter(base_delay=2.0)

for url in urls:
    await limiter.wait()
    
    try:
        result = await scrape(url)
        limiter.on_success()
    except Exception as e:
        limiter.on_error(getattr(e, 'status_code', None))
        await asyncio.sleep(limiter.current_delay)

5. Data Validation & Quality Assurance

AI-Powered Validation

def validate_with_ai(extracted_data: dict, url: str) -> dict:
    """Use AI to validate extracted data quality"""
    
    prompt = f"""
    Validate this extracted data from {url}:
    
    {json.dumps(extracted_data, indent=2)}
    
    Check for:
    1. Missing required fields
    2. Unrealistic values (e.g., negative prices, future dates)
    3. Formatting issues
    4. Inconsistencies
    5. Placeholder or dummy data
    
    Return JSON:
    {{
        "is_valid": true/false,
        "confidence": 0.0-1.0,
        "issues": ["list", "of", "problems"],
        "corrected_data": {{corrected version if possible}}
    }}
    """
    
    response = openai.chat.completions.create(
        model="gpt-4o-mini",  # Cheaper for validation
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
        response_format={"type": "json_object"}
    )
    
    validation_result = json.loads(response.choices[0].message.content)
    
    # Clean data passes through untouched
    if validation_result['is_valid']:
        return extracted_data
    
    # Confident correction: use the AI-fixed version
    if validation_result['confidence'] > 0.8 and validation_result.get('corrected_data'):
        return validation_result['corrected_data']
    # Low confidence: fail loudly rather than store bad data
    elif validation_result['confidence'] < 0.5:
        raise ValueError(f"Low confidence extraction: {validation_result['issues']}")
    else:
        return extracted_data

Production Infrastructure Setup

Dockerized Scraping Workers

# Dockerfile
FROM mcr.microsoft.com/playwright/python:v1.40.0-jammy

WORKDIR /app

# Install Python dependencies
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Install Playwright browsers
RUN playwright install chromium

# Copy application
COPY . .

# Environment
ENV PYTHONUNBUFFERED=1
ENV PLAYWRIGHT_BROWSERS_PATH=/ms-playwright

CMD ["python", "worker.py"]

# docker-compose.yml
version: '3.8'

services:
  scraper-worker:
    build: .
    image: ai-scraper-worker
    restart: always
    deploy:
      replicas: 4
    environment:
      - REDIS_URL=redis://redis:6379
      - ANTHROPIC_API_KEY=${ANTHROPIC_API_KEY}
      - OPENAI_API_KEY=${OPENAI_API_KEY}
      - FIRECRAWL_API_KEY=${FIRECRAWL_API_KEY}
      - WORKER_CONCURRENCY=5
    mem_limit: 3g
    networks:
      - scraping-network
    depends_on:
      - redis
      - postgres

  redis:
    image: redis:7-alpine
    restart: always
    command: redis-server --maxmemory 2gb --maxmemory-policy allkeys-lru
    volumes:
      - redis_data:/data
    networks:
      - scraping-network

  postgres:
    image: postgres:15-alpine
    restart: always
    environment:
      - POSTGRES_DB=scraping
      - POSTGRES_USER=scraper
      - POSTGRES_PASSWORD=${DB_PASSWORD}
    volumes:
      - postgres_data:/var/lib/postgresql/data
    networks:
      - scraping-network

  scheduler:
    build: .
    image: ai-scraper-worker
    command: python scheduler.py
    restart: always
    environment:
      - REDIS_URL=redis://redis:6379
    networks:
      - scraping-network
    depends_on:
      - redis

volumes:
  redis_data:
  postgres_data:

networks:
  scraping-network:
    driver: bridge
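
The compose file's worker service runs worker.py against the Redis queue (Redis + RQ in the stack summary). A minimal version of that worker, with an illustrative queue name, might look like this:

# worker.py - consume scraping jobs from Redis via RQ
import os

from redis import Redis
from rq import Queue, Worker

redis_conn = Redis.from_url(os.environ["REDIS_URL"])

if __name__ == "__main__":
    # Jobs enqueued by the scheduler call a synchronous wrapper around the
    # waterfall dispatcher; one process handles one queue here.
    worker = Worker([Queue("scraping", connection=redis_conn)], connection=redis_conn)
    worker.work()

On the scheduler side, jobs can be enqueued with Queue("scraping", connection=redis_conn).enqueue(run_scrape, url), where run_scrape is a synchronous wrapper around the async dispatcher.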

Lessons Learned

1. AI Models Have Different Strengths

GPT-4o Vision:

Claude 3.5 Sonnet:

GPT-4o-mini:
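
One way to encode these differences is a small routing table, reflecting how each model is used elsewhere in this article (illustrative, not the exact production mapping):

MODEL_BY_TASK = {
    "vision_extraction": "gpt-4o",                     # screenshots, visual-first pages
    "html_extraction": "claude-3-5-sonnet-20241022",   # long HTML + structured output
    "schema_generation": "claude-3-5-sonnet-20241022",
    "validation": "gpt-4o-mini",                       # cheap, high-volume checks
}

def select_model(task: str) -> str:
    """Pick a model for a task, defaulting to the cheapest option."""
    return MODEL_BY_TASK.get(task, "gpt-4o-mini")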

2. Proxies Are Essential

Learned the hard way:

Cost vs. Success:

3. Caching Saves Significant Costs

Without caching:

With 1-hour cache:
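
The cache layer itself is small. A sketch of the idea, assuming Redis via redis-py, URL-keyed entries, and a one-hour TTL:

import hashlib
import json

from redis import Redis

redis_cache = Redis.from_url("redis://redis:6379")
CACHE_TTL_SECONDS = 3600  # 1-hour cache

async def scrape_cached(url: str, extraction_prompt: str, target_schema: dict) -> dict:
    """Return a cached extraction when available; otherwise scrape and cache it."""
    key = "scrape:" + hashlib.sha256(url.encode()).hexdigest()

    cached = redis_cache.get(key)
    if cached:
        return json.loads(cached)

    result = await scrape_waterfall(url, extraction_prompt, target_schema)
    redis_cache.setex(key, CACHE_TTL_SECONDS, json.dumps(result))
    return result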

Conclusion

Modern AI-powered web scraping represents a paradigm shift from brittle, maintenance-heavy scrapers to intelligent, adaptive systems. The key achievements:

Technical:

Business:

Scalability:

The future of web scraping isn't about parsing HTMLβ€”it's about teaching AI to understand web content like humans do. This infrastructure proves that combining traditional reliability with modern AI capabilities creates scraping systems that are both powerful and maintainable.


Tech Stack Summary

Component        | Technology                      | Purpose
Orchestration    | N8N / Airflow                   | Job scheduling & workflow
Scraping         | Firecrawl, Playwright           | Content extraction
AI Models        | Claude 3.5, GPT-4o, GPT-4o-mini | Intelligent extraction
Queue            | Redis + RQ                      | Job distribution
Storage          | PostgreSQL                      | Structured data
Caching          | Redis                           | Performance optimization
Proxies          | Residential proxy pool          | Anti-detection
Monitoring       | Prometheus + Grafana            | Metrics & alerts
Containerization | Docker Compose                  | Deployment

Total infrastructure cost: ~$500-800/month for 1M pages/month

Full code examples available on request.