# Scraper V2 - Scrapy-Inspired Web Scraping Framework

A robust, production-ready web scraping framework inspired by Scrapy's architecture, built with TypeScript and Puppeteer.

## Architecture

```
┌─────────────────────────────────────────────────────────────┐
│                       Scraper Engine                        │
│           (Main orchestrator - controls data flow)          │
└──────┬──────────────────────────────────────────────────────┘
       │
       ├──> Request Scheduler (Priority Queue + Deduplication)
       │
       ├──> Middleware Engine
       │    ├── User Agent Rotation
       │    ├── Proxy Rotation
       │    ├── Rate Limiting (Adaptive)
       │    ├── Retry Logic (Exponential Backoff)
       │    ├── Bot Detection
       │    └── Stealth Mode
       │
       ├──> Downloader (HTTP + Browser Hybrid)
       │    ├── Tries HTTP first (fast)
       │    └── Falls back to Puppeteer (JS-heavy sites)
       │
       ├──> Spider (Parsing Logic)
       │    ├── Parse Category Pages
       │    ├── Parse Product Pages
       │    └── Extract Data
       │
       └──> Pipeline Engine
            ├── Validation Pipeline
            ├── Sanitization Pipeline
            ├── Deduplication Pipeline
            ├── Image Processing Pipeline
            ├── Stats Pipeline
            └── Database Pipeline
```

## Key Features

### 1. **Request Scheduling**
- Priority queue with deduplication
- Request fingerprinting
- Automatic retry queue management
- Prevents duplicate requests

### 2. **Middleware System**
- **User Agent Rotation**: Rotates through realistic user agents
- **Proxy Rotation**: Uses database-stored proxies (activates on retries)
- **Adaptive Rate Limiting**: Adjusts delay based on error rate
  - Base delay: 2s
  - Increases on errors (up to 30s)
  - Decreases on success
  - Adds random jitter to avoid patterns
- **Exponential Backoff**: Smart retry with increasing delays
- **Bot Detection**: Monitors for captchas and blocks
- **Stealth Mode**: Hides automation markers

### 3. **Hybrid Downloader**
- Tries lightweight HTTP requests first
- Automatically falls back to Puppeteer if needed
- Auto-scrolling for lazy-loaded content
- Single browser instance with page reuse
- Proper resource cleanup

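The fingerprint-based deduplication described under Request Scheduling above can be sketched roughly as follows. All names here (`QueuedRequest`, `DedupScheduler`, `fingerprint`) are illustrative assumptions, not the framework's actual types:

```typescript
import { createHash } from 'crypto';

// Hypothetical shape of a queued request; not the framework's real type.
interface QueuedRequest {
  url: string;
  method?: string;
  priority?: number;
}

// Fingerprint a request by normalizing its significant parts and hashing
// them, so the same logical request queued twice maps to one key.
function fingerprint(req: QueuedRequest): string {
  const url = new URL(req.url);
  url.hash = ''; // the fragment never reaches the server
  url.searchParams.sort(); // ?a=1&b=2 and ?b=2&a=1 are the same request
  return createHash('sha256')
    .update(`${(req.method ?? 'GET').toUpperCase()} ${url.toString()}`)
    .digest('hex');
}

// Deduplicating priority queue: enqueue returns false for already-seen requests.
class DedupScheduler {
  private seen = new Set<string>();
  private queue: QueuedRequest[] = [];

  enqueue(req: QueuedRequest): boolean {
    const fp = fingerprint(req);
    if (this.seen.has(fp)) return false;
    this.seen.add(fp);
    this.queue.push(req);
    // Higher priority dequeues first; a sort-on-insert is enough for a sketch.
    this.queue.sort((a, b) => (b.priority ?? 0) - (a.priority ?? 0));
    return true;
  }

  dequeue(): QueuedRequest | undefined {
    return this.queue.shift();
  }
}
```

A real implementation would use a heap rather than sorting on every insert, but the normalization step is the important part: without it, trivially different URLs for the same page defeat deduplication.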
### 4. **Item Pipelines**
- **Validation**: Ensures data quality
- **Sanitization**: Cleans and normalizes data
- **Deduplication**: Prevents duplicate items
- **Image Processing**: Converts to full-size URLs
- **Stats**: Tracks scraping statistics
- **Database**: Upserts products with conflict resolution

### 5. **Navigation Discovery**
- Automatically detects Dutchie vs custom menus
- Extracts navigation structure
- Creates category hierarchy
- Builds proper category URLs

## Usage

### Scrape a Single Category

```typescript
import { scrapeCategory } from './scraper-v2';

await scrapeCategory(storeId, categoryId);
```

### Scrape Entire Store

```typescript
import { scrapeStore } from './scraper-v2';

await scrapeStore(storeId);
```

### Discover Categories

```typescript
import { discoverCategories } from './scraper-v2';

await discoverCategories(storeId);
```

### Advanced Usage

```typescript
import { ScraperEngine, DutchieSpider } from './scraper-v2';

// Create engine with custom concurrency
const engine = new ScraperEngine({ concurrency: 1 });
const spider = new DutchieSpider(engine);

// Scrape with monitoring
await spider.scrapeStore(storeId);

// Get statistics
const stats = engine.getStats();
console.log(`Success rate: ${stats.requestsSuccess}/${stats.requestsTotal}`);
```

## Error Handling

### Retryable Errors
- Network timeouts
- Connection errors
- 5xx server errors

### Non-Retryable Errors
- 404 Not Found
- Parse errors
- Validation failures

### Error Flow
1. Error occurs during request
2. Middleware processes error
3. If retryable and under max retries:
   - Wait with exponential backoff
   - Requeue request with lower priority
   - Try again
4.
   If non-retryable or max retries exceeded:
   - Log error
   - Call error handler if provided
   - Continue with next request

## Rate Limiting

The adaptive rate limiter adjusts delays based on server behavior:

```
Base Delay:       2 seconds
Error Multiplier: 1.5^(error_count)
Max Delay:        30 seconds
Jitter:           ±20% random variation
```

Example delays:
- 0 errors: ~2s
- 1 error: ~3s
- 2 errors: ~4.5s
- 3 errors: ~6.75s
- 5+ errors: ~15-30s

## Statistics

The engine tracks:
- Total requests / Success / Failed
- Items scraped / Saved / Dropped
- Error count
- Duration
- Data quality metrics (images, THC, descriptions)

## Migration from V1

The new scraper fixes these V1 issues:

1. ✅ **No more stopping mid-scrape** - Robust error handling
2. ✅ **Gets all images** - Better image extraction + full-size URLs
3. ✅ **Follows all links** - Proper request scheduling
4. ✅ **Builds navigation correctly** - Improved category discovery
5. ✅ **No duplicate requests** - Request fingerprinting
6. ✅ **Adaptive to server load** - Smart rate limiting
7. ✅ **Better retry logic** - Exponential backoff
8. ✅ **Proxy support** - Automatic proxy rotation on errors

## Configuration

### Rate Limiting

```typescript
const rateLimitMiddleware = new RateLimitMiddleware();
rateLimitMiddleware.setBaseDelay(3000); // 3 seconds
```

### Concurrency

```typescript
const engine = new ScraperEngine({ concurrency: 1 }); // Sequential
```

### Retries

```typescript
engine.enqueue({
  url: 'https://example.com',
  maxRetries: 5, // Override default of 3
  // ...
});
```

## Troubleshooting

### Scraper stops unexpectedly
- Check logs for error patterns
- Review bot detection warnings
- Ensure proxies are working

### Missing images
- Check StatsPipeline output for image extraction rate
- Verify image URLs in product pages
- Check MinIO connectivity

### Navigation not building
- Run `POST /stores/:id/discover-categories`
- Check if site is Dutchie or custom
- Review navigation link extraction logs

## Performance

Typical performance (1 category):
- 50 products: ~2-3 minutes
- 100 products: ~4-6 minutes
- 200 products: ~8-12 minutes

Time includes:
- Rate limiting delays
- Page rendering waits
- Auto-scrolling
- Product detail fetching

## Future Enhancements

- [ ] Multi-threaded browser instances
- [ ] Distributed scraping across servers
- [ ] Screenshot capture on errors
- [ ] HTML caching for debugging
- [ ] Webhook notifications on completion
- [ ] GraphQL API integration for Dutchie
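For reference, the adaptive rate limiting formula documented above (base delay 2s, 1.5× multiplier per error, 30s cap, ±20% jitter) can be sketched as a standalone function. The function name and the injectable random source are illustrative assumptions, not the framework's actual API:

```typescript
// Constants taken from the Rate Limiting section of this README.
const BASE_DELAY_MS = 2000;
const MAX_DELAY_MS = 30000;
const JITTER = 0.2; // ±20%

// Compute the next delay in milliseconds from the consecutive error count.
// `rand` is injectable so the jitter can be pinned in tests.
function nextDelay(errorCount: number, rand: () => number = Math.random): number {
  // Exponential backoff: 2000 * 1.5^errors, capped at 30s.
  const backoff = Math.min(BASE_DELAY_MS * Math.pow(1.5, errorCount), MAX_DELAY_MS);
  // Jitter factor uniform in [0.8, 1.2] to avoid a detectable request rhythm.
  const jitter = 1 + (rand() * 2 - 1) * JITTER;
  return Math.min(backoff * jitter, MAX_DELAY_MS);
}
```

With `rand` pinned to 0.5 (zero jitter) this reproduces the example delays in the table: 2s at 0 errors, 3s at 1, 6.75s at 3, and the 30s cap once the uncapped backoff exceeds it.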