# Scraper V2 - Scrapy-Inspired Web Scraping Framework

A robust, production-ready web scraping framework inspired by Scrapy's architecture, built with TypeScript and Puppeteer.

## Architecture

```
┌─────────────────────────────────────────────────────────────┐
│                       Scraper Engine                        │
│           (Main orchestrator - controls data flow)          │
└──────┬──────────────────────────────────────────────────────┘
       │
       ├──> Request Scheduler (Priority Queue + Deduplication)
       │
       ├──> Middleware Engine
       │    ├── User Agent Rotation
       │    ├── Proxy Rotation
       │    ├── Rate Limiting (Adaptive)
       │    ├── Retry Logic (Exponential Backoff)
       │    ├── Bot Detection
       │    └── Stealth Mode
       │
       ├──> Downloader (HTTP + Browser Hybrid)
       │    ├── Tries HTTP first (fast)
       │    └── Falls back to Puppeteer (JS-heavy sites)
       │
       ├──> Spider (Parsing Logic)
       │    ├── Parse Category Pages
       │    ├── Parse Product Pages
       │    └── Extract Data
       │
       └──> Pipeline Engine
            ├── Validation Pipeline
            ├── Sanitization Pipeline
            ├── Deduplication Pipeline
            ├── Image Processing Pipeline
            ├── Stats Pipeline
            └── Database Pipeline
```

## Key Features

### 1. **Request Scheduling**
- Priority queue with deduplication
- Request fingerprinting
- Automatic retry queue management
- Prevents duplicate requests

### 2. **Middleware System**
- **User Agent Rotation**: Rotates through realistic user agents
- **Proxy Rotation**: Uses database-stored proxies (activates on retries)
- **Adaptive Rate Limiting**: Adjusts delay based on error rate
  - Base delay: 2s
  - Increases on errors (up to 30s)
  - Decreases on success
  - Adds random jitter to avoid patterns
- **Exponential Backoff**: Smart retry with increasing delays
- **Bot Detection**: Monitors for captchas and blocks
- **Stealth Mode**: Hides automation markers

### 3. **Hybrid Downloader**
- Tries lightweight HTTP requests first
- Automatically falls back to Puppeteer if needed
- Auto-scrolling for lazy-loaded content
- Single browser instance with page reuse
- Proper resource cleanup

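The fingerprint-based deduplication described under Request Scheduling above can be sketched roughly as follows. All names here (`QueuedRequest`, `DedupScheduler`, `fingerprint`) are illustrative assumptions, not the framework's actual types:

```typescript
import { createHash } from 'crypto';

// Hypothetical shape of a queued request; not the framework's real type.
interface QueuedRequest {
  url: string;
  method?: string;
  priority?: number;
}

// Fingerprint a request by normalizing its significant parts and hashing
// them, so the same logical request queued twice maps to one key.
function fingerprint(req: QueuedRequest): string {
  const url = new URL(req.url);
  url.hash = ''; // the fragment never reaches the server
  url.searchParams.sort(); // ?a=1&b=2 and ?b=2&a=1 are the same request
  return createHash('sha256')
    .update(`${(req.method ?? 'GET').toUpperCase()} ${url.toString()}`)
    .digest('hex');
}

// Deduplicating priority queue: enqueue returns false for already-seen requests.
class DedupScheduler {
  private seen = new Set<string>();
  private queue: QueuedRequest[] = [];

  enqueue(req: QueuedRequest): boolean {
    const fp = fingerprint(req);
    if (this.seen.has(fp)) return false;
    this.seen.add(fp);
    this.queue.push(req);
    // Higher priority dequeues first; a sort-on-insert is enough for a sketch.
    this.queue.sort((a, b) => (b.priority ?? 0) - (a.priority ?? 0));
    return true;
  }

  dequeue(): QueuedRequest | undefined {
    return this.queue.shift();
  }
}
```

A real implementation would use a heap rather than sorting on every insert, but the normalization step is the important part: without it, trivially different URLs for the same page defeat deduplication.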
### 4. **Item Pipelines**
- **Validation**: Ensures data quality
- **Sanitization**: Cleans and normalizes data
- **Deduplication**: Prevents duplicate items
- **Image Processing**: Converts to full-size URLs
- **Stats**: Tracks scraping statistics
- **Database**: Upserts products with conflict resolution

### 5. **Navigation Discovery**
- Automatically detects Dutchie vs custom menus
- Extracts navigation structure
- Creates category hierarchy
- Builds proper category URLs

## Usage

### Scrape a Single Category

```typescript
import { scrapeCategory } from './scraper-v2';

await scrapeCategory(storeId, categoryId);
```

### Scrape Entire Store

```typescript
import { scrapeStore } from './scraper-v2';

await scrapeStore(storeId);
```

### Discover Categories

```typescript
import { discoverCategories } from './scraper-v2';

await discoverCategories(storeId);
```

### Advanced Usage

```typescript
import { ScraperEngine, DutchieSpider } from './scraper-v2';

// Create engine with custom concurrency
const engine = new ScraperEngine({ concurrency: 1 });
const spider = new DutchieSpider(engine);

// Scrape with monitoring
await spider.scrapeStore(storeId);

// Get statistics
const stats = engine.getStats();
console.log(`Success rate: ${stats.requestsSuccess}/${stats.requestsTotal}`);
```

## Error Handling

### Retryable Errors
- Network timeouts
- Connection errors
- 5xx server errors

### Non-Retryable Errors
- 404 Not Found
- Parse errors
- Validation failures

### Error Flow
1. Error occurs during request
2. Middleware processes error
3. If retryable and under max retries:
   - Wait with exponential backoff
   - Requeue request with lower priority
   - Try again
4.
   If non-retryable or max retries exceeded:
   - Log error
   - Call error handler if provided
   - Continue with next request

## Rate Limiting

The adaptive rate limiter adjusts delays based on server behavior:

```
Base Delay:       2 seconds
Error Multiplier: 1.5^(error_count)
Max Delay:        30 seconds
Jitter:           ±20% random variation
```

Example delays:
- 0 errors: ~2s
- 1 error: ~3s
- 2 errors: ~4.5s
- 3 errors: ~6.75s
- 5+ errors: ~15-30s

## Statistics

The engine tracks:
- Total requests / Success / Failed
- Items scraped / Saved / Dropped
- Error count
- Duration
- Data quality metrics (images, THC, descriptions)

## Migration from V1

The new scraper fixes these V1 issues:

1. ✅ **No more stopping mid-scrape** - Robust error handling
2. ✅ **Gets all images** - Better image extraction + full-size URLs
3. ✅ **Follows all links** - Proper request scheduling
4. ✅ **Builds navigation correctly** - Improved category discovery
5. ✅ **No duplicate requests** - Request fingerprinting
6. ✅ **Adaptive to server load** - Smart rate limiting
7. ✅ **Better retry logic** - Exponential backoff
8. ✅ **Proxy support** - Automatic proxy rotation on errors

## Configuration

### Rate Limiting

```typescript
const rateLimitMiddleware = new RateLimitMiddleware();
rateLimitMiddleware.setBaseDelay(3000); // 3 seconds
```

### Concurrency

```typescript
const engine = new ScraperEngine({ concurrency: 1 }); // Sequential
```

### Retries

```typescript
engine.enqueue({
  url: 'https://example.com',
  maxRetries: 5, // Override default of 3
  // ...
});
```

## Troubleshooting

### Scraper stops unexpectedly
- Check logs for error patterns
- Review bot detection warnings
- Ensure proxies are working

### Missing images
- Check StatsPipeline output for image extraction rate
- Verify image URLs in product pages
- Check MinIO connectivity

### Navigation not building
- Run `POST /stores/:id/discover-categories`
- Check if site is Dutchie or custom
- Review navigation link extraction logs

## Performance

Typical performance (1 category):
- 50 products: ~2-3 minutes
- 100 products: ~4-6 minutes
- 200 products: ~8-12 minutes

Time includes:
- Rate limiting delays
- Page rendering waits
- Auto-scrolling
- Product detail fetching

## Future Enhancements

- [ ] Multi-threaded browser instances
- [ ] Distributed scraping across servers
- [ ] Screenshot capture on errors
- [ ] HTML caching for debugging
- [ ] Webhook notifications on completion
- [ ] GraphQL API integration for Dutchie
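For reference, the adaptive rate limiting formula documented above (base delay 2s, 1.5× multiplier per error, 30s cap, ±20% jitter) can be sketched as a standalone function. The function name and the injectable random source are illustrative assumptions, not the framework's actual API:

```typescript
// Constants taken from the Rate Limiting section of this README.
const BASE_DELAY_MS = 2000;
const MAX_DELAY_MS = 30000;
const JITTER = 0.2; // ±20%

// Compute the next delay in milliseconds from the consecutive error count.
// `rand` is injectable so the jitter can be pinned in tests.
function nextDelay(errorCount: number, rand: () => number = Math.random): number {
  // Exponential backoff: 2000 * 1.5^errors, capped at 30s.
  const backoff = Math.min(BASE_DELAY_MS * Math.pow(1.5, errorCount), MAX_DELAY_MS);
  // Jitter factor uniform in [0.8, 1.2] to avoid a detectable request rhythm.
  const jitter = 1 + (rand() * 2 - 1) * JITTER;
  return Math.min(backoff * jitter, MAX_DELAY_MS);
}
```

With `rand` pinned to 0.5 (zero jitter) this reproduces the example delays in the table: 2s at 0 errors, 3s at 1, 6.75s at 3, and the 30s cap once the uncapped backoff exceeds it.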