# Scraper V2 - Scrapy-Inspired Web Scraping Framework
A robust, production-ready web scraping framework inspired by Scrapy's architecture, built with TypeScript and Puppeteer.
## Architecture

```
┌─────────────────────────────────────────────────────────────┐
│                        Scraper Engine                       │
│           (Main orchestrator - controls data flow)          │
└──────┬──────────────────────────────────────────────────────┘
       │
       ├──> Request Scheduler (Priority Queue + Deduplication)
       │       │
       │       └──> Middleware Engine
       │               ├── User Agent Rotation
       │               ├── Proxy Rotation
       │               ├── Rate Limiting (Adaptive)
       │               ├── Retry Logic (Exponential Backoff)
       │               ├── Bot Detection
       │               └── Stealth Mode
       │
       ├──> Downloader (HTTP + Browser Hybrid)
       │       ├── Tries HTTP first (fast)
       │       └── Falls back to Puppeteer (JS-heavy sites)
       │
       ├──> Spider (Parsing Logic)
       │       ├── Parse Category Pages
       │       ├── Parse Product Pages
       │       └── Extract Data
       │
       └──> Pipeline Engine
               ├── Validation Pipeline
               ├── Sanitization Pipeline
               ├── Deduplication Pipeline
               ├── Image Processing Pipeline
               ├── Stats Pipeline
               └── Database Pipeline
```
## Key Features

### 1. Request Scheduling

- Priority queue with deduplication
- Request fingerprinting
- Automatic retry queue management
- Prevents duplicate requests
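
A minimal sketch of how fingerprint-based deduplication can work. The `QueuedRequest` shape, the SHA-1 choice, and the array-backed queue are illustrative assumptions, not the framework's actual internals (a real scheduler would use a heap rather than re-sorting):

```typescript
import { createHash } from 'crypto';

// Hypothetical request shape -- field names are assumptions for illustration.
interface QueuedRequest {
  url: string;
  method?: string;
  priority: number;
}

// Fingerprint = stable hash of the parts that make two requests "the same".
// Normalizing the URL keeps http://x/a and http://x/a/ from queueing twice.
function fingerprint(req: QueuedRequest): string {
  const normalized = new URL(req.url).toString().replace(/\/$/, '');
  return createHash('sha1')
    .update(`${req.method ?? 'GET'}:${normalized}`)
    .digest('hex');
}

class RequestScheduler {
  private seen = new Set<string>();
  private queue: QueuedRequest[] = [];

  // Returns false (and drops the request) if it was already scheduled.
  enqueue(req: QueuedRequest): boolean {
    const fp = fingerprint(req);
    if (this.seen.has(fp)) return false;
    this.seen.add(fp);
    this.queue.push(req);
    this.queue.sort((a, b) => b.priority - a.priority); // highest priority first
    return true;
  }

  next(): QueuedRequest | undefined {
    return this.queue.shift();
  }
}
```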
### 2. Middleware System

- **User Agent Rotation**: Rotates through realistic user agents
- **Proxy Rotation**: Uses database-stored proxies (activates on retries)
- **Adaptive Rate Limiting**: Adjusts delay based on error rate
  - Base delay: 2s
  - Increases on errors (up to 30s)
  - Decreases on success
  - Adds random jitter to avoid patterns
- **Exponential Backoff**: Smart retry with increasing delays
- **Bot Detection**: Monitors for captchas/blocks
- **Stealth Mode**: Hides automation markers
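
To make the chain concrete, here is a minimal Scrapy-style middleware sketch. The interface and function names are illustrative assumptions, not the framework's actual API:

```typescript
// Hypothetical middleware contract -- names are illustrative only.
interface RequestOptions {
  url: string;
  headers: Record<string, string>;
}

interface DownloadResult {
  status: number;
  body: string;
}

interface Middleware {
  // Runs before download: rotate User-Agent, attach a proxy, delay, etc.
  processRequest?(req: RequestOptions): Promise<RequestOptions>;
  // Runs after download: e.g. bot-detection checks on the body.
  processResponse?(res: DownloadResult): Promise<DownloadResult>;
}

const USER_AGENTS = [
  'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 ...',
  'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 ...',
];

// Example: a User-Agent rotation middleware.
const userAgentRotation: Middleware = {
  async processRequest(req) {
    const ua = USER_AGENTS[Math.floor(Math.random() * USER_AGENTS.length)];
    return { ...req, headers: { ...req.headers, 'User-Agent': ua } };
  },
};

// The engine threads each request through the chain in order.
async function applyRequestMiddleware(
  chain: Middleware[],
  req: RequestOptions,
): Promise<RequestOptions> {
  for (const mw of chain) {
    if (mw.processRequest) req = await mw.processRequest(req);
  }
  return req;
}
```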
### 3. Hybrid Downloader

- Tries lightweight HTTP requests first
- Automatically falls back to Puppeteer if needed
- Auto-scrolling for lazy-loaded content
- Single browser instance with page reuse
- Proper resource cleanup
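
A sketch of the HTTP-first, browser-fallback strategy (assumes Node 18+ for global `fetch`). The "needs a browser" heuristic shown here -- a near-empty `<body>` -- is an assumption for illustration, not the framework's actual check:

```typescript
import puppeteer, { Browser } from 'puppeteer';

let browser: Browser | null = null;

async function getBrowser(): Promise<Browser> {
  // Single shared browser instance, as described above.
  if (!browser) browser = await puppeteer.launch({ headless: true });
  return browser;
}

async function download(url: string): Promise<string> {
  // 1. Cheap path: plain HTTP request.
  try {
    const res = await fetch(url, { redirect: 'follow' });
    const html = await res.text();
    // Crude heuristic (an assumption): a near-empty <body> usually
    // means the content is rendered client-side.
    if (res.ok && /<body[\s\S]{500,}/i.test(html)) return html;
  } catch {
    // Network failure: fall through to the browser path.
  }

  // 2. Fallback: render with Puppeteer, scrolling for lazy-loaded content.
  const page = await (await getBrowser()).newPage();
  try {
    await page.goto(url, { waitUntil: 'networkidle2' });
    await page.evaluate(() => window.scrollTo(0, document.body.scrollHeight));
    return await page.content();
  } finally {
    await page.close(); // Close the page, keep the browser for reuse.
  }
}
```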
### 4. Item Pipelines

- **Validation**: Ensures data quality
- **Sanitization**: Cleans and normalizes data
- **Deduplication**: Prevents duplicate items
- **Image Processing**: Converts to full-size URLs
- **Stats**: Tracks scraping statistics
- **Database**: Upserts products with conflict resolution
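
Pipelines are easiest to picture as a chain where each stage can transform an item or drop it. A minimal sketch with hypothetical names (the real item shape is richer):

```typescript
// Hypothetical item and pipeline shapes for illustration.
interface ScrapedProduct {
  name: string;
  price?: number;
  imageUrl?: string;
}

interface ItemPipeline {
  // Return the (possibly transformed) item, or null to drop it.
  process(item: ScrapedProduct): Promise<ScrapedProduct | null>;
}

const validationPipeline: ItemPipeline = {
  async process(item) {
    // Drop items that fail basic quality checks.
    return item.name && item.name.trim().length > 0 ? item : null;
  },
};

const sanitizationPipeline: ItemPipeline = {
  async process(item) {
    return { ...item, name: item.name.trim().replace(/\s+/g, ' ') };
  },
};

// The engine runs every item through the stages in order,
// stopping early as soon as one stage drops it.
async function runPipelines(
  stages: ItemPipeline[],
  item: ScrapedProduct,
): Promise<ScrapedProduct | null> {
  let current: ScrapedProduct | null = item;
  for (const stage of stages) {
    if (current === null) break;
    current = await stage.process(current);
  }
  return current;
}
```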
### 5. Navigation Discovery

- Automatically detects Dutchie vs custom menus
- Extracts navigation structure
- Creates category hierarchy
- Builds proper category URLs
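
One plausible shape for the Dutchie-vs-custom check, purely illustrative (the framework's real detection logic is not documented here; the embed markers are assumptions):

```typescript
// Purely illustrative heuristic -- not the framework's actual logic.
function detectMenuType(html: string): 'dutchie' | 'custom' {
  const isDutchie =
    /dutchie\.com/i.test(html) ||           // embed script / iframe source
    /id=["']dutchie--embed/i.test(html);    // hypothetical embed container id
  return isDutchie ? 'dutchie' : 'custom';
}
```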
## Usage

### Scrape a Single Category

```typescript
import { scrapeCategory } from './scraper-v2';

await scrapeCategory(storeId, categoryId);
```

### Scrape Entire Store

```typescript
import { scrapeStore } from './scraper-v2';

await scrapeStore(storeId);
```

### Discover Categories

```typescript
import { discoverCategories } from './scraper-v2';

await discoverCategories(storeId);
```
### Advanced Usage

```typescript
import { ScraperEngine, DutchieSpider } from './scraper-v2';

// Create engine with custom concurrency
const engine = new ScraperEngine({ concurrency: 1 });
const spider = new DutchieSpider(engine);

// Scrape with monitoring
await spider.scrapeStore(storeId);

// Get statistics
const stats = engine.getStats();
console.log(`Success rate: ${stats.requestsSuccess}/${stats.requestsTotal}`);
```
## Error Handling

### Retryable Errors

- Network timeouts
- Connection errors
- 5xx server errors

### Non-Retryable Errors

- 404 Not Found
- Parse errors
- Validation failures
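
A sketch of how this split could be encoded. The error shape is an assumption; real code would inspect its own error types:

```typescript
// Hypothetical error shape -- fields are assumptions for illustration.
interface ScrapeError {
  code?: string;        // e.g. 'ETIMEDOUT', 'ECONNRESET'
  statusCode?: number;  // HTTP status, when the server responded
}

function isRetryable(err: ScrapeError): boolean {
  // Network-level failures are worth retrying.
  if (err.code === 'ETIMEDOUT' || err.code === 'ECONNRESET') return true;
  // 5xx means the server hiccupped; 4xx (e.g. 404) will not improve.
  if (err.statusCode !== undefined) return err.statusCode >= 500;
  // Parse/validation errors carry neither: not retryable.
  return false;
}
```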
### Error Flow

1. Error occurs during a request
2. Middleware processes the error
3. If retryable and under max retries:
   - Wait with exponential backoff
   - Requeue the request with lower priority
   - Try again
4. If non-retryable or max retries exceeded:
   - Log the error
   - Call the error handler if provided
   - Continue with the next request
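
Put together, the flow might look like this. Names are hypothetical; `ScrapeError` and `isRetryable` are from the classifier sketched above:

```typescript
// Hypothetical shapes tying the error flow together.
interface FailedRequest {
  url: string;
  retries: number;
  maxRetries: number;
  priority: number;
}

async function handleFailure(
  req: FailedRequest,
  err: ScrapeError,
  requeue: (req: FailedRequest) => void,
  onError?: (err: ScrapeError, req: FailedRequest) => void,
): Promise<void> {
  if (isRetryable(err) && req.retries < req.maxRetries) {
    // Wait with exponential backoff (base 2s, factor 1.5, capped at 30s).
    const delay = Math.min(2000 * 1.5 ** req.retries, 30_000);
    await new Promise((resolve) => setTimeout(resolve, delay));
    // Requeue with lower priority so fresh requests aren't starved.
    requeue({ ...req, retries: req.retries + 1, priority: req.priority - 1 });
  } else {
    console.error(`Giving up on ${req.url}:`, err);
    onError?.(err, req); // Call the error handler if one was provided.
  }
  // Either way, the engine continues with the next request.
}
```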
## Rate Limiting

The adaptive rate limiter adjusts delays based on server behavior:

- **Base Delay**: 2 seconds
- **Error Multiplier**: 1.5^(error_count)
- **Max Delay**: 30 seconds
- **Jitter**: ±20% random variation

Example delays:

- 0 errors: ~2s
- 1 error: ~3s
- 2 errors: ~4.5s
- 3 errors: ~6.75s
- 5+ errors: ~15-30s
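
The formula translates directly into code; a sketch reproducing the numbers above:

```typescript
const BASE_DELAY_MS = 2_000;
const MAX_DELAY_MS = 30_000;

// delay = min(base * 1.5^errors, max), then ±20% jitter.
function computeDelay(errorCount: number): number {
  const backoff = Math.min(BASE_DELAY_MS * 1.5 ** errorCount, MAX_DELAY_MS);
  const jitter = 1 + (Math.random() * 0.4 - 0.2); // 0.8 .. 1.2
  return Math.round(backoff * jitter);
}

// computeDelay(0) ~ 2000ms, computeDelay(2) ~ 4500ms,
// computeDelay(3) ~ 6750ms -- matching the table above.
```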
## Statistics

The engine tracks:

- Total requests / successes / failures
- Items scraped / saved / dropped
- Error count
- Duration
- Data quality metrics (images, THC, descriptions)
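
`engine.getStats()` (see Advanced Usage above) returns an object along these lines. Only `requestsTotal` and `requestsSuccess` appear in the usage example; the remaining field names are assumptions:

```typescript
// Field names beyond requestsTotal/requestsSuccess are assumptions.
interface ScraperStats {
  requestsTotal: number;
  requestsSuccess: number;
  requestsFailed: number;
  itemsScraped: number;
  itemsSaved: number;
  itemsDropped: number;
  errorCount: number;
  durationMs: number;
  // Data quality: counts of items with images, THC values, descriptions.
  withImages: number;
  withThc: number;
  withDescription: number;
}
```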
## Migration from V1

The new scraper fixes these V1 issues:

- ✅ **No more stopping mid-scrape** - robust error handling
- ✅ **Gets all images** - better image extraction + full-size URLs
- ✅ **Follows all links** - proper request scheduling
- ✅ **Builds navigation correctly** - improved category discovery
- ✅ **No duplicate requests** - request fingerprinting
- ✅ **Adaptive to server load** - smart rate limiting
- ✅ **Better retry logic** - exponential backoff
- ✅ **Proxy support** - automatic proxy rotation on errors
## Configuration

### Rate Limiting

```typescript
const rateLimitMiddleware = new RateLimitMiddleware();
rateLimitMiddleware.setBaseDelay(3000); // 3 seconds
```

### Concurrency

```typescript
const engine = new ScraperEngine({ concurrency: 1 }); // Sequential
```

### Retries

```typescript
engine.enqueue({
  url: 'https://example.com',
  maxRetries: 5, // Override default of 3
  // ...
});
```
## Troubleshooting

### Scraper stops unexpectedly

- Check logs for error patterns
- Review bot detection warnings
- Ensure proxies are working

### Missing images

- Check StatsPipeline output for the image extraction rate
- Verify image URLs on product pages
- Check MinIO connectivity

### Navigation not building

- Run `POST /stores/:id/discover-categories`
- Check whether the site is Dutchie or custom
- Review navigation link extraction logs
## Performance

Typical performance (1 category):

- 50 products: ~2-3 minutes
- 100 products: ~4-6 minutes
- 200 products: ~8-12 minutes

Time includes:

- Rate limiting delays
- Page rendering waits
- Auto-scrolling
- Product detail fetching
## Future Enhancements
- Multi-threaded browser instances
- Distributed scraping across servers
- Screenshot capture on errors
- HTML caching for debugging
- Webhook notifications on completion
- GraphQL API integration for Dutchie