
Scraper V2 - Scrapy-Inspired Web Scraping Framework

A robust, production-ready web scraping framework inspired by Scrapy's architecture, built with TypeScript and Puppeteer.

Architecture

┌─────────────────────────────────────────────────────────────┐
│                      Scraper Engine                          │
│  (Main orchestrator - controls data flow)                    │
└──────┬──────────────────────────────────────────────────────┘
       │
       ├──> Request Scheduler (Priority Queue + Deduplication)
       │    │
       │    └──> Middleware Engine
       │         ├── User Agent Rotation
       │         ├── Proxy Rotation
       │         ├── Rate Limiting (Adaptive)
       │         ├── Retry Logic (Exponential Backoff)
       │         ├── Bot Detection
       │         └── Stealth Mode
       │
       ├──> Downloader (HTTP + Browser Hybrid)
       │    ├── Tries HTTP first (fast)
       │    └── Falls back to Puppeteer (JS-heavy sites)
       │
       ├──> Spider (Parsing Logic)
       │    ├── Parse Category Pages
       │    ├── Parse Product Pages
       │    └── Extract Data
       │
       └──> Pipeline Engine
            ├── Validation Pipeline
            ├── Sanitization Pipeline
            ├── Deduplication Pipeline
            ├── Image Processing Pipeline
            ├── Stats Pipeline
            └── Database Pipeline

Key Features

1. Request Scheduling

  • Priority queue with deduplication
  • Request fingerprinting
  • Automatic retry queue management
  • Prevents duplicate requests
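
As a reference point, here is a minimal sketch of fingerprint-based deduplication on top of a priority queue. The ScrapeRequest shape and fingerprint helper are illustrative, not the framework's actual types:

import { createHash } from 'crypto';

interface ScrapeRequest {
  url: string;
  method?: string;
  priority: number; // higher values are scheduled first
}

// Fingerprint = stable hash over the parts that make two requests "the same".
function fingerprint(req: ScrapeRequest): string {
  return createHash('sha1')
    .update(`${req.method ?? 'GET'} ${req.url}`)
    .digest('hex');
}

class RequestScheduler {
  private seen = new Set<string>();
  private queue: ScrapeRequest[] = [];

  // Returns false when the request was already seen (duplicate dropped).
  enqueue(req: ScrapeRequest): boolean {
    const fp = fingerprint(req);
    if (this.seen.has(fp)) return false;
    this.seen.add(fp);
    this.queue.push(req);
    this.queue.sort((a, b) => b.priority - a.priority); // naive priority queue
    return true;
  }

  next(): ScrapeRequest | undefined {
    return this.queue.shift();
  }
}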

2. Middleware System

  • User Agent Rotation: Rotates through realistic user agents
  • Proxy Rotation: Uses database-stored proxies (activates on retries)
  • Adaptive Rate Limiting: Adjusts delay based on error rate
    • Base delay: 2s
    • Increases on errors (up to 30s)
    • Decreases on success
    • Adds random jitter to avoid patterns
  • Exponential Backoff: Smart retry with increasing delays
  • Bot Detection: Monitors for captchas/blocks
  • Stealth Mode: Hides automation markers
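
These middlewares wrap every download. A sketch of what such a hook interface could look like, in the Scrapy style the framework borrows from (the hook names are assumptions, not the real API):

// Hook names are illustrative; the actual middleware engine may differ.
interface Middleware {
  // Mutate the outgoing request, e.g. set a rotated User-Agent or proxy.
  processRequest?(req: { url: string; headers: Record<string, string> }): Promise<void>;
  // Inspect the response, e.g. scan the body for captcha/block markers.
  processResponse?(req: { url: string }, status: number, body: string): Promise<void>;
  // Decide what to do with a failed request.
  processError?(req: { url: string; retries: number }, err: Error): Promise<'retry' | 'drop'>;
}

// The engine runs each middleware in order around every request.
async function applyRequestMiddlewares(
  middlewares: Middleware[],
  req: { url: string; headers: Record<string, string> },
): Promise<void> {
  for (const mw of middlewares) {
    if (mw.processRequest) await mw.processRequest(req);
  }
}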

3. Hybrid Downloader

  • Tries lightweight HTTP requests first
  • Automatically falls back to Puppeteer if needed
  • Auto-scrolling for lazy-loaded content
  • Single browser instance with page reuse
  • Proper resource cleanup
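
A simplified sketch of that strategy (the looksRendered heuristic and the scroll step size are invented for illustration; the real checks are site-specific):

import puppeteer, { Browser } from 'puppeteer';

// Heuristic: if the static HTML already contains usable markup, skip the browser.
function looksRendered(html: string): boolean {
  return html.includes('product');
}

let browser: Browser | null = null;

async function download(url: string): Promise<string> {
  // 1. Cheap path: plain HTTP request.
  const res = await fetch(url, { headers: { 'User-Agent': 'Mozilla/5.0' } });
  const html = await res.text();
  if (res.ok && looksRendered(html)) return html;

  // 2. Fallback: render with Puppeteer, reusing a single browser instance.
  browser ??= await puppeteer.launch({ headless: true });
  const page = await browser.newPage();
  try {
    await page.goto(url, { waitUntil: 'networkidle2' });
    // Auto-scroll so lazy-loaded content is requested before reading the DOM.
    await page.evaluate(async () => {
      for (let y = 0; y < document.body.scrollHeight; y += 500) {
        window.scrollTo(0, y);
        await new Promise((resolve) => setTimeout(resolve, 200));
      }
    });
    return await page.content();
  } finally {
    await page.close(); // release the page; the browser stays warm for reuse
  }
}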

4. Item Pipelines

  • Validation: Ensures data quality
  • Sanitization: Cleans and normalizes data
  • Deduplication: Prevents duplicate items
  • Image Processing: Rewrites extracted image URLs to their full-size variants
  • Stats: Tracks scraping statistics
  • Database: Upserts products with conflict resolution
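
Conceptually, each scraped item flows through these stages in order, and any stage can drop it. A minimal sketch with stub stage bodies (the Product shape is illustrative):

// Each stage returns the (possibly transformed) item, or null to drop it.
type Product = { name: string; price?: number; imageUrl?: string };
type Pipeline = (item: Product) => Promise<Product | null>;

const validate: Pipeline = async (item) =>
  item.name && item.price !== undefined ? item : null; // drop incomplete items

const sanitize: Pipeline = async (item) => ({
  ...item,
  name: item.name.trim(), // normalize whitespace
});

async function runPipelines(item: Product, stages: Pipeline[]): Promise<Product | null> {
  let current: Product | null = item;
  for (const stage of stages) {
    if (current === null) return null; // an earlier stage dropped the item
    current = await stage(current);
  }
  return current;
}

// Usage: const saved = await runPipelines(scrapedItem, [validate, sanitize]);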

5. Navigation Discovery

  • Automatically detects Dutchie vs custom menus
  • Extracts navigation structure
  • Creates category hierarchy
  • Builds proper category URLs
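
A toy version of the detection step (the dutchie.com marker scan is an assumption for illustration; the real discovery logic inspects more signals than this):

// Dutchie-embedded menus typically load assets from dutchie.com, so a simple
// marker scan can separate them from custom menus.
function detectMenuType(html: string): 'dutchie' | 'custom' {
  return html.includes('dutchie.com') ? 'dutchie' : 'custom';
}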

Usage

Scrape a Single Category

import { scrapeCategory } from './scraper-v2';

await scrapeCategory(storeId, categoryId);

Scrape Entire Store

import { scrapeStore } from './scraper-v2';

await scrapeStore(storeId);

Discover Categories

import { discoverCategories } from './scraper-v2';

await discoverCategories(storeId);

Advanced Usage

import { ScraperEngine, DutchieSpider } from './scraper-v2';

// Create engine with custom concurrency
const engine = new ScraperEngine({ concurrency: 1 });
const spider = new DutchieSpider(engine);

// Scrape with monitoring
await spider.scrapeStore(storeId);

// Get statistics
const stats = engine.getStats();
console.log(`Success rate: ${stats.requestsSuccess}/${stats.requestsTotal}`);

Error Handling

Retryable Errors

  • Network timeouts
  • Connection errors
  • 5xx server errors

Non-Retryable Errors

  • 404 Not Found
  • Parse errors
  • Validation failures

Error Flow

  1. Error occurs during request
  2. Middleware processes error
  3. If retryable and under max retries:
    • Wait with exponential backoff
    • Requeue request with lower priority
    • Try again
  4. If non-retryable or max retries exceeded:
    • Log error
    • Call error handler if provided
    • Continue with next request
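
The same flow, sketched in TypeScript (the error classification and the backoff constants are assumptions where the list above does not pin them down):

type RetryableRequest = { url: string; retries: number; maxRetries: number; priority: number };

// 5xx and network-level failures retry; 404s and parse errors do not.
function isRetryable(err: { status?: number; code?: string }): boolean {
  if (err.status !== undefined) return err.status >= 500;
  return err.code === 'ETIMEDOUT' || err.code === 'ECONNRESET';
}

async function handleError(
  req: RetryableRequest,
  err: { status?: number; code?: string },
  requeue: (r: RetryableRequest) => void,
): Promise<void> {
  if (isRetryable(err) && req.retries < req.maxRetries) {
    const delayMs = 2000 * Math.pow(2, req.retries); // exponential backoff
    await new Promise((resolve) => setTimeout(resolve, delayMs));
    // Requeue with lower priority so fresh requests are not starved.
    requeue({ ...req, retries: req.retries + 1, priority: req.priority - 1 });
  } else {
    console.error(`Giving up on ${req.url}`, err); // log and continue
  }
}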

Rate Limiting

The adaptive rate limiter adjusts delays based on server behavior:

Base Delay: 2 seconds
Error Multiplier: 1.5^(error_count)
Max Delay: 30 seconds
Jitter: ±20% random variation

Example delays:

  • 0 errors: ~2s
  • 1 error: ~3s
  • 2 errors: ~4.5s
  • 3 errors: ~6.75s
  • 5+ errors: ~15-30s
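
The formula translates directly into code:

// delay = min(base * 1.5^errors, max), then ±20% jitter.
function adaptiveDelayMs(errorCount: number): number {
  const base = 2_000; // base delay: 2s
  const capped = Math.min(base * Math.pow(1.5, errorCount), 30_000); // cap at 30s
  const jitter = 1 + (Math.random() * 0.4 - 0.2); // ±20% random variation
  return capped * jitter;
}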

Statistics

The engine tracks:

  • Total requests / Success / Failed
  • Items scraped / Saved / Dropped
  • Error count
  • Duration
  • Data quality metrics (images, THC, descriptions)
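
The stats object implied by this list and by engine.getStats() in the usage examples might look like the following (field names beyond requestsTotal/requestsSuccess are assumptions):

interface ScrapeStats {
  requestsTotal: number;
  requestsSuccess: number;
  requestsFailed: number;
  itemsScraped: number;
  itemsSaved: number;
  itemsDropped: number;
  errorCount: number;
  durationMs: number;
  // Data-quality counters
  withImages: number;
  withThc: number;
  withDescriptions: number;
}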

Migration from V1

The new scraper fixes these V1 issues:

  1. No more stopping mid-scrape - Robust error handling
  2. Gets all images - Better image extraction + full-size URLs
  3. Follows all links - Proper request scheduling
  4. Builds navigation correctly - Improved category discovery
  5. No duplicate requests - Request fingerprinting
  6. Adaptive to server load - Smart rate limiting
  7. Better retry logic - Exponential backoff
  8. Proxy support - Automatic proxy rotation on errors

Configuration

Rate Limiting

const rateLimitMiddleware = new RateLimitMiddleware();
rateLimitMiddleware.setBaseDelay(3000); // 3 seconds

Concurrency

const engine = new ScraperEngine({ concurrency: 1 }); // Sequential

Retries

engine.enqueue({
  url: 'https://example.com',
  maxRetries: 5, // Override default of 3
  // ...
});

Troubleshooting

Scraper stops unexpectedly

  • Check logs for error patterns
  • Review bot detection warnings
  • Ensure proxies are working

Missing images

  • Check StatsPipeline output for image extraction rate
  • Verify image URLs in product pages
  • Check MinIO connectivity

Navigation not building

  • Run POST /stores/:id/discover-categories
  • Check if site is Dutchie or custom
  • Review navigation link extraction logs

Performance

Typical performance (1 category):

  • 50 products: ~2-3 minutes
  • 100 products: ~4-6 minutes
  • 200 products: ~8-12 minutes

Time includes:

  • Rate limiting delays
  • Page rendering waits
  • Auto-scrolling
  • Product detail fetching

Future Enhancements

  • Multi-threaded browser instances
  • Distributed scraping across servers
  • Screenshot capture on errors
  • HTML caching for debugging
  • Webhook notifications on completion
  • GraphQL API integration for Dutchie