# Scraper V2 - Scrapy-Inspired Web Scraping Framework
A robust, production-ready web scraping framework inspired by Scrapy's architecture, built with TypeScript and Puppeteer.
## Architecture

```
┌─────────────────────────────────────────────────────────────┐
│                        Scraper Engine                       │
│           (Main orchestrator - controls data flow)          │
└──────┬──────────────────────────────────────────────────────┘
       │
       ├──> Request Scheduler (Priority Queue + Deduplication)
       │       │
       │       └──> Middleware Engine
       │               ├── User Agent Rotation
       │               ├── Proxy Rotation
       │               ├── Rate Limiting (Adaptive)
       │               ├── Retry Logic (Exponential Backoff)
       │               ├── Bot Detection
       │               └── Stealth Mode
       │
       ├──> Downloader (HTTP + Browser Hybrid)
       │       ├── Tries HTTP first (fast)
       │       └── Falls back to Puppeteer (JS-heavy sites)
       │
       ├──> Spider (Parsing Logic)
       │       ├── Parse Category Pages
       │       ├── Parse Product Pages
       │       └── Extract Data
       │
       └──> Pipeline Engine
               ├── Validation Pipeline
               ├── Sanitization Pipeline
               ├── Deduplication Pipeline
               ├── Image Processing Pipeline
               ├── Stats Pipeline
               └── Database Pipeline
```
## Key Features

### 1. Request Scheduling

- Priority queue with deduplication
- Request fingerprinting
- Automatic retry queue management
- Prevents duplicate requests
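
A minimal sketch of how fingerprint-based deduplication can work. The `QueuedRequest` shape, the SHA-1 choice, and the array-backed queue are illustrative assumptions, not the framework's actual internals (a real scheduler would use a heap rather than re-sorting):

```typescript
import { createHash } from 'crypto';

// Hypothetical request shape -- field names are assumptions for illustration.
interface QueuedRequest {
  url: string;
  method?: string;
  priority: number;
}

// Fingerprint = stable hash of the parts that make two requests "the same".
// Normalizing the URL keeps http://x/a and http://x/a/ from queueing twice.
function fingerprint(req: QueuedRequest): string {
  const normalized = new URL(req.url).toString().replace(/\/$/, '');
  return createHash('sha1')
    .update(`${req.method ?? 'GET'}:${normalized}`)
    .digest('hex');
}

class RequestScheduler {
  private seen = new Set<string>();
  private queue: QueuedRequest[] = [];

  // Returns false (and drops the request) if it was already scheduled.
  enqueue(req: QueuedRequest): boolean {
    const fp = fingerprint(req);
    if (this.seen.has(fp)) return false;
    this.seen.add(fp);
    this.queue.push(req);
    this.queue.sort((a, b) => b.priority - a.priority); // highest priority first
    return true;
  }

  next(): QueuedRequest | undefined {
    return this.queue.shift();
  }
}
```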
### 2. Middleware System

- **User Agent Rotation**: Rotates through realistic user agents
- **Proxy Rotation**: Uses database-stored proxies (activates on retries)
- **Adaptive Rate Limiting**: Adjusts delay based on error rate
  - Base delay: 2s
  - Increases on errors (up to 30s)
  - Decreases on success
  - Adds random jitter to avoid patterns
- **Exponential Backoff**: Smart retry with increasing delays
- **Bot Detection**: Monitors for captchas/blocks
- **Stealth Mode**: Hides automation markers
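
To make the chain concrete, here is a minimal Scrapy-style middleware sketch. The interface and function names are illustrative assumptions, not the framework's actual API:

```typescript
// Hypothetical middleware contract -- names are illustrative only.
interface RequestOptions {
  url: string;
  headers: Record<string, string>;
}

interface DownloadResult {
  status: number;
  body: string;
}

interface Middleware {
  // Runs before download: rotate User-Agent, attach a proxy, delay, etc.
  processRequest?(req: RequestOptions): Promise<RequestOptions>;
  // Runs after download: e.g. bot-detection checks on the body.
  processResponse?(res: DownloadResult): Promise<DownloadResult>;
}

const USER_AGENTS = [
  'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 ...',
  'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 ...',
];

// Example: a User-Agent rotation middleware.
const userAgentRotation: Middleware = {
  async processRequest(req) {
    const ua = USER_AGENTS[Math.floor(Math.random() * USER_AGENTS.length)];
    return { ...req, headers: { ...req.headers, 'User-Agent': ua } };
  },
};

// The engine threads each request through the chain in order.
async function applyRequestMiddleware(
  chain: Middleware[],
  req: RequestOptions,
): Promise<RequestOptions> {
  for (const mw of chain) {
    if (mw.processRequest) req = await mw.processRequest(req);
  }
  return req;
}
```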
### 3. Hybrid Downloader

- Tries lightweight HTTP requests first
- Automatically falls back to Puppeteer if needed
- Auto-scrolling for lazy-loaded content
- Single browser instance with page reuse
- Proper resource cleanup
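
A sketch of the HTTP-first, browser-fallback strategy (assumes Node 18+ for global `fetch`). The "needs a browser" heuristic shown here -- a near-empty `<body>` -- is an assumption for illustration, not the framework's actual check:

```typescript
import puppeteer, { Browser } from 'puppeteer';

let browser: Browser | null = null;

async function getBrowser(): Promise<Browser> {
  // Single shared browser instance, as described above.
  if (!browser) browser = await puppeteer.launch({ headless: true });
  return browser;
}

async function download(url: string): Promise<string> {
  // 1. Cheap path: plain HTTP request.
  try {
    const res = await fetch(url, { redirect: 'follow' });
    const html = await res.text();
    // Crude heuristic (an assumption): a near-empty <body> usually
    // means the content is rendered client-side.
    if (res.ok && /<body[\s\S]{500,}/i.test(html)) return html;
  } catch {
    // Network failure: fall through to the browser path.
  }

  // 2. Fallback: render with Puppeteer, scrolling for lazy-loaded content.
  const page = await (await getBrowser()).newPage();
  try {
    await page.goto(url, { waitUntil: 'networkidle2' });
    await page.evaluate(() => window.scrollTo(0, document.body.scrollHeight));
    return await page.content();
  } finally {
    await page.close(); // Close the page, keep the browser for reuse.
  }
}
```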
### 4. Item Pipelines

- **Validation**: Ensures data quality
- **Sanitization**: Cleans and normalizes data
- **Deduplication**: Prevents duplicate items
- **Image Processing**: Converts to full-size URLs
- **Stats**: Tracks scraping statistics
- **Database**: Upserts products with conflict resolution
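
Pipelines are easiest to picture as a chain where each stage can transform an item or drop it. A minimal sketch with hypothetical names (the real item shape is richer):

```typescript
// Hypothetical item and pipeline shapes for illustration.
interface ScrapedProduct {
  name: string;
  price?: number;
  imageUrl?: string;
}

interface ItemPipeline {
  // Return the (possibly transformed) item, or null to drop it.
  process(item: ScrapedProduct): Promise<ScrapedProduct | null>;
}

const validationPipeline: ItemPipeline = {
  async process(item) {
    // Drop items that fail basic quality checks.
    return item.name && item.name.trim().length > 0 ? item : null;
  },
};

const sanitizationPipeline: ItemPipeline = {
  async process(item) {
    return { ...item, name: item.name.trim().replace(/\s+/g, ' ') };
  },
};

// The engine runs every item through the stages in order,
// stopping early as soon as one stage drops it.
async function runPipelines(
  stages: ItemPipeline[],
  item: ScrapedProduct,
): Promise<ScrapedProduct | null> {
  let current: ScrapedProduct | null = item;
  for (const stage of stages) {
    if (current === null) break;
    current = await stage.process(current);
  }
  return current;
}
```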
### 5. Navigation Discovery

- Automatically detects Dutchie vs custom menus
- Extracts navigation structure
- Creates category hierarchy
- Builds proper category URLs
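
One plausible shape for the Dutchie-vs-custom check, purely illustrative (the framework's real detection logic is not documented here; the embed markers are assumptions):

```typescript
// Purely illustrative heuristic -- not the framework's actual logic.
function detectMenuType(html: string): 'dutchie' | 'custom' {
  const isDutchie =
    /dutchie\.com/i.test(html) ||           // embed script / iframe source
    /id=["']dutchie--embed/i.test(html);    // hypothetical embed container id
  return isDutchie ? 'dutchie' : 'custom';
}
```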
## Usage

### Scrape a Single Category

```typescript
import { scrapeCategory } from './scraper-v2';

await scrapeCategory(storeId, categoryId);
```

### Scrape Entire Store

```typescript
import { scrapeStore } from './scraper-v2';

await scrapeStore(storeId);
```

### Discover Categories

```typescript
import { discoverCategories } from './scraper-v2';

await discoverCategories(storeId);
```
### Advanced Usage

```typescript
import { ScraperEngine, DutchieSpider } from './scraper-v2';

// Create engine with custom concurrency
const engine = new ScraperEngine({ concurrency: 1 });
const spider = new DutchieSpider(engine);

// Scrape with monitoring
await spider.scrapeStore(storeId);

// Get statistics
const stats = engine.getStats();
console.log(`Success rate: ${stats.requestsSuccess}/${stats.requestsTotal}`);
```
## Error Handling

### Retryable Errors

- Network timeouts
- Connection errors
- 5xx server errors

### Non-Retryable Errors

- 404 Not Found
- Parse errors
- Validation failures
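
A sketch of how this split could be encoded. The error shape is an assumption; real code would inspect its own error types:

```typescript
// Hypothetical error shape -- fields are assumptions for illustration.
interface ScrapeError {
  code?: string;        // e.g. 'ETIMEDOUT', 'ECONNRESET'
  statusCode?: number;  // HTTP status, when the server responded
}

function isRetryable(err: ScrapeError): boolean {
  // Network-level failures are worth retrying.
  if (err.code === 'ETIMEDOUT' || err.code === 'ECONNRESET') return true;
  // 5xx means the server hiccupped; 4xx (e.g. 404) will not improve.
  if (err.statusCode !== undefined) return err.statusCode >= 500;
  // Parse/validation errors carry neither: not retryable.
  return false;
}
```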
### Error Flow

1. Error occurs during a request
2. Middleware processes the error
3. If retryable and under max retries:
   - Wait with exponential backoff
   - Requeue the request with lower priority
   - Try again
4. If non-retryable or max retries exceeded:
   - Log the error
   - Call the error handler if provided
   - Continue with the next request
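
Put together, the flow might look like this. Names are hypothetical; `ScrapeError` and `isRetryable` are from the classifier sketched above:

```typescript
// Hypothetical shapes tying the error flow together.
interface FailedRequest {
  url: string;
  retries: number;
  maxRetries: number;
  priority: number;
}

async function handleFailure(
  req: FailedRequest,
  err: ScrapeError,
  requeue: (req: FailedRequest) => void,
  onError?: (err: ScrapeError, req: FailedRequest) => void,
): Promise<void> {
  if (isRetryable(err) && req.retries < req.maxRetries) {
    // Wait with exponential backoff (base 2s, factor 1.5, capped at 30s).
    const delay = Math.min(2000 * 1.5 ** req.retries, 30_000);
    await new Promise((resolve) => setTimeout(resolve, delay));
    // Requeue with lower priority so fresh requests aren't starved.
    requeue({ ...req, retries: req.retries + 1, priority: req.priority - 1 });
  } else {
    console.error(`Giving up on ${req.url}:`, err);
    onError?.(err, req); // Call the error handler if one was provided.
  }
  // Either way, the engine continues with the next request.
}
```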
## Rate Limiting

The adaptive rate limiter adjusts delays based on server behavior:

- **Base Delay**: 2 seconds
- **Error Multiplier**: 1.5^(error_count)
- **Max Delay**: 30 seconds
- **Jitter**: ±20% random variation

Example delays:

- 0 errors: ~2s
- 1 error: ~3s
- 2 errors: ~4.5s
- 3 errors: ~6.75s
- 5+ errors: ~15-30s
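
The formula translates directly into code; a sketch reproducing the numbers above:

```typescript
const BASE_DELAY_MS = 2_000;
const MAX_DELAY_MS = 30_000;

// delay = min(base * 1.5^errors, max), then ±20% jitter.
function computeDelay(errorCount: number): number {
  const backoff = Math.min(BASE_DELAY_MS * 1.5 ** errorCount, MAX_DELAY_MS);
  const jitter = 1 + (Math.random() * 0.4 - 0.2); // 0.8 .. 1.2
  return Math.round(backoff * jitter);
}

// computeDelay(0) ~ 2000ms, computeDelay(2) ~ 4500ms,
// computeDelay(3) ~ 6750ms -- matching the table above.
```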
## Statistics

The engine tracks:

- Total requests / successes / failures
- Items scraped / saved / dropped
- Error count
- Duration
- Data quality metrics (images, THC, descriptions)
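
`engine.getStats()` (see Advanced Usage above) returns an object along these lines. Only `requestsTotal` and `requestsSuccess` appear in the usage example; the remaining field names are assumptions:

```typescript
// Field names beyond requestsTotal/requestsSuccess are assumptions.
interface ScraperStats {
  requestsTotal: number;
  requestsSuccess: number;
  requestsFailed: number;
  itemsScraped: number;
  itemsSaved: number;
  itemsDropped: number;
  errorCount: number;
  durationMs: number;
  // Data quality: counts of items with images, THC values, descriptions.
  withImages: number;
  withThc: number;
  withDescription: number;
}
```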
## Migration from V1

The new scraper fixes these V1 issues:

- ✅ **No more stopping mid-scrape** - robust error handling
- ✅ **Gets all images** - better image extraction + full-size URLs
- ✅ **Follows all links** - proper request scheduling
- ✅ **Builds navigation correctly** - improved category discovery
- ✅ **No duplicate requests** - request fingerprinting
- ✅ **Adaptive to server load** - smart rate limiting
- ✅ **Better retry logic** - exponential backoff
- ✅ **Proxy support** - automatic proxy rotation on errors
## Configuration

### Rate Limiting

```typescript
const rateLimitMiddleware = new RateLimitMiddleware();
rateLimitMiddleware.setBaseDelay(3000); // 3 seconds
```

### Concurrency

```typescript
const engine = new ScraperEngine({ concurrency: 1 }); // Sequential
```

### Retries

```typescript
engine.enqueue({
  url: 'https://example.com',
  maxRetries: 5, // Override default of 3
  // ...
});
```
## Troubleshooting

### Scraper stops unexpectedly

- Check logs for error patterns
- Review bot detection warnings
- Ensure proxies are working

### Missing images

- Check StatsPipeline output for the image extraction rate
- Verify image URLs on product pages
- Check MinIO connectivity

### Navigation not building

- Run `POST /stores/:id/discover-categories`
- Check whether the site is Dutchie or custom
- Review navigation link extraction logs
## Performance

Typical performance (1 category):

- 50 products: ~2-3 minutes
- 100 products: ~4-6 minutes
- 200 products: ~8-12 minutes

Time includes:

- Rate limiting delays
- Page rendering waits
- Auto-scrolling
- Product detail fetching
## Future Enhancements
- Multi-threaded browser instances
- Distributed scraping across servers
- Screenshot capture on errors
- HTML caching for debugging
- Webhook notifications on completion
- GraphQL API integration for Dutchie