feat(workers): Add dual-transport preflight system
Workers now run both curl and http (Puppeteer) preflights on startup:

- curl-preflight.ts: Tests axios + proxy via httpbin.org
- puppeteer-preflight.ts: Tests browser + StealthPlugin via fingerprint.com (with amiunique.org fallback)
- Migration 084: Adds preflight columns to worker_registry and method column to worker_tasks
- Workers report preflight status, IP, fingerprint, and response time
- Tasks can require specific transport method (curl/http)
- Dashboard shows Transport column with preflight status badges

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
106
CLAUDE.md
@@ -99,6 +99,112 @@ Bump `wordpress-plugin/VERSION` on changes:

---

## Puppeteer Scraping (Browser-Based)

### Age Gate Bypass

Most dispensary sites require age verification. The browser scraper handles this automatically:

**Utility File**: `src/utils/age-gate.ts`

**Key Functions**:
- `setAgeGateCookies(page, url, state)` - Set cookies BEFORE navigation to prevent gate
- `hasAgeGate(page)` - Detect if page shows age verification
- `bypassAgeGate(page, state)` - Click through age gate if displayed
- `detectStateFromUrl(url)` - Extract state from URL (e.g., `-az-` → Arizona)

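The URL-based state detection listed above could look roughly like the following sketch. It is an illustration only: the name `detectStateFromUrlSketch`, the `STATE_BY_CODE` table, and the exact matching rule (a two-letter code between hyphens) are assumptions, not the contents of `src/utils/age-gate.ts`.

```typescript
// Hypothetical lookup table; the real utility may cover more states.
const STATE_BY_CODE: Record<string, string> = {
  az: 'Arizona',
  ca: 'California',
  co: 'Colorado',
};

// Returns the state name for the first known two-letter code that appears
// between hyphens in the URL (e.g. "-az-"), or null when none is found.
function detectStateFromUrlSketch(url: string): string | null {
  // The lookahead leaves the trailing hyphen unconsumed so that it can
  // open the next candidate segment.
  const pattern = /-([a-z]{2})(?=-)/g;
  const lower = url.toLowerCase();
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(lower)) !== null) {
    const code = match[1] ?? '';
    const state = STATE_BY_CODE[code];
    if (state !== undefined) return state;
  }
  return null;
}
```

A URL without a recognizable code yields `null`, in which case the caller would have to supply the state explicitly.
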
**Cookie Names Set**:
- `age_gate_passed: 'true'`
- `selected_state: '<state>'`
- `age_verified: 'true'`

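As a minimal sketch, the three cookies above can be expressed as plain descriptors before they are handed to the browser. The `buildAgeGateCookies` helper and the `AgeGateCookieSketch` type are hypothetical names; only the cookie names and values come from the list above.

```typescript
// Shape loosely modeled on Puppeteer cookie parameters (name, value, url).
interface AgeGateCookieSketch {
  name: string;
  value: string;
  url: string;
}

// Builds the three documented cookies, scoped to the menu URL.
function buildAgeGateCookies(url: string, state: string): AgeGateCookieSketch[] {
  return [
    { name: 'age_gate_passed', value: 'true', url },
    { name: 'selected_state', value: state, url },
    { name: 'age_verified', value: 'true', url },
  ];
}
```

With Puppeteer, descriptors like these would typically be applied via `page.setCookie(...cookies)` before `page.goto(url)`. Whether `setAgeGateCookies` does exactly this is an assumption.
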
**Bypass Methods** (tried in order):
1. Custom dropdown (shadcn/radix style) - Curaleaf pattern
2. Standard `<select>` dropdown
3. State button/card click
4. Direct "Yes"/"Enter" button

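The "tried in order" behaviour can be sketched as a simple strategy chain. This is an assumption about the control flow, not the actual `bypassAgeGate` code; `BypassStrategySketch`, `runBypassChain`, and the stubbed strategies are hypothetical.

```typescript
// One entry per bypass method; `attempt` resolves to true when the gate was cleared.
interface BypassStrategySketch {
  name: string;
  attempt: () => Promise<boolean>;
}

// Tries each strategy in order and returns the name of the first that succeeds,
// or null when none of them clears the gate.
async function runBypassChain(strategies: BypassStrategySketch[]): Promise<string | null> {
  for (const strategy of strategies) {
    try {
      if (await strategy.attempt()) return strategy.name;
    } catch {
      // A thrown error (e.g. selector not found) counts as "did not apply".
    }
  }
  return null;
}

// Stubbed strategies in the documented order; real ones would drive the page.
const demoChain: BypassStrategySketch[] = [
  { name: 'custom-dropdown', attempt: async () => false },
  { name: 'select-dropdown', attempt: async () => { throw new Error('no <select> found'); } },
  { name: 'state-button', attempt: async () => true },
  { name: 'yes-button', attempt: async () => true },
];
```

Returning the name of the winning strategy makes it easy to log which site pattern was encountered.
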
**Usage Pattern**:
```typescript
import { setAgeGateCookies, bypassAgeGate } from '../utils/age-gate';

// `page` is an open Puppeteer Page; `menuUrl` is the dispensary menu URL.
// Set cookies BEFORE navigation
await setAgeGateCookies(page, menuUrl, 'Arizona');
await page.goto(menuUrl);

// If gate still appears, bypass it
await bypassAgeGate(page, 'Arizona');
```

**Note**: Deeply-Rooted (AZ) does NOT use an age gate, so it is a good target for preflight testing.

### Dual-Transport Preflight

Workers run BOTH preflight checks on startup:

| Transport | Test Method | Use Case |
|-----------|-------------|----------|
| `curl` | axios + proxy → httpbin.org | Fast API requests |
| `http` | Puppeteer + proxy + StealthPlugin | Anti-detect, browser fingerprint |

**HTTP Preflight Steps**:
1. Get proxy from pool (CrawlRotator)
2. Visit fingerprint.com demo (or amiunique.org fallback) to verify IP and anti-detect
3. Visit Dutchie embedded menu to establish session
4. Make GraphQL request from browser context

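Step 2 (primary check with a fallback) together with the reported status, IP, and response time could be modeled as below. This is a sketch under assumptions: `PreflightResultSketch`, `ProbeSketch`, and `runWithFallback` are hypothetical names, and the probes are stubs rather than real browser visits.

```typescript
// Hypothetical shape of what a worker might report after a preflight.
interface PreflightResultSketch {
  status: 'passed' | 'failed';
  source: string | null; // which probe answered
  ip: string | null;
  responseTimeMs: number;
}

interface ProbeSketch {
  name: string;
  run: () => Promise<{ ip: string }>;
}

// Runs probes in order; the first one that resolves decides the result.
async function runWithFallback(probes: ProbeSketch[]): Promise<PreflightResultSketch> {
  const startedAt = Date.now();
  for (const probe of probes) {
    try {
      const { ip } = await probe.run();
      return { status: 'passed', source: probe.name, ip, responseTimeMs: Date.now() - startedAt };
    } catch {
      // Probe failed (blocked, timed out, ...); fall through to the next one.
    }
  }
  return { status: 'failed', source: null, ip: null, responseTimeMs: Date.now() - startedAt };
}

// Stub probes: the primary fails, the fallback answers with a documentation IP.
const demoProbes: ProbeSketch[] = [
  { name: 'fingerprint.com', run: async () => { throw new Error('blocked'); } },
  { name: 'amiunique.org', run: async () => ({ ip: '203.0.113.7' }) },
];
```
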
**Files**:
- `src/services/curl-preflight.ts`
- `src/services/puppeteer-preflight.ts`
- `migrations/084_dual_transport_preflight.sql`

**Task Method Column**: Tasks have `method` column ('curl' | 'http' | null):
- `null` = any worker can claim
- `'curl'` = only workers with passed curl preflight
- `'http'` = only workers with passed http preflight

Currently ALL crawl tasks require `method = 'http'`.

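The claim rule for the `method` column can be written as a small predicate. The sketch below assumes a worker exposes two booleans for its preflight outcomes; `WorkerPreflightSketch` and `canClaimTask` are hypothetical names, and the real check may live in SQL rather than TypeScript.

```typescript
type TaskMethod = 'curl' | 'http' | null;

// Hypothetical view of a worker's preflight outcome.
interface WorkerPreflightSketch {
  curlPassed: boolean;
  httpPassed: boolean;
}

// null means any worker may claim; otherwise the matching preflight must have passed.
function canClaimTask(worker: WorkerPreflightSketch, method: TaskMethod): boolean {
  if (method === null) return true;
  return method === 'curl' ? worker.curlPassed : worker.httpPassed;
}
```

For example, a worker whose browser preflight failed can still claim `null` and `'curl'` tasks, but not the `'http'` tasks that all crawls currently use.
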
### Anti-Detect Fingerprint Distribution

Browser fingerprints are randomized using realistic market share distributions:

**Files**:
- `src/services/crawl-rotator.ts` - Device/browser selection
- `src/services/http-fingerprint.ts` - HTTP header fingerprinting

**Device Weights** (matches real traffic patterns):
| Device | Weight | Percentage |
|--------|--------|------------|
| Mobile | 62 | 62% |
| Desktop | 36 | 36% |
| Tablet | 2 | 2% |

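Weighted selection over the device table can be sketched as follows. The weights come from the table above; `DEVICE_WEIGHTS_SKETCH` and `pickWeighted` are hypothetical names, and the actual logic in `crawl-rotator.ts` may differ. The random roll is passed in so the function stays deterministic and testable.

```typescript
const DEVICE_WEIGHTS_SKETCH = [
  { device: 'mobile', weight: 62 },
  { device: 'desktop', weight: 36 },
  { device: 'tablet', weight: 2 },
] as const;

// Picks an item with probability proportional to its weight.
// `roll` is expected to be in [0, 1), e.g. the result of Math.random().
function pickWeighted<T extends { weight: number }>(items: readonly T[], roll: number): T {
  const total = items.reduce((sum, item) => sum + item.weight, 0);
  let threshold = roll * total;
  let last: T | undefined;
  for (const item of items) {
    last = item;
    threshold -= item.weight;
    if (threshold < 0) return item;
  }
  // Only reached for roll >= 1 or floating-point edge cases.
  if (last === undefined) throw new Error('items must not be empty');
  return last;
}
```

In production code the call would be `pickWeighted(DEVICE_WEIGHTS_SKETCH, Math.random())`.
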
**Allowed Browsers** (only realistic ones):
- Chrome (67% market share)
- Safari (20% market share)
- Edge (6% market share)
- Firefox (3% market share)

All other browsers are filtered out. User agents are generated with the `intoli/user-agents` library for realistic UA strings.

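The listed shares sum to 96, not 100, so if they are used as relative weights the effective probabilities shift slightly (Chrome becomes 67/96, about 69.8%). The sketch below shows the filter and that normalization. Whether the rotator normalizes this way is an assumption, and `BROWSER_SHARE_SKETCH`, `isAllowedBrowser`, and `normalizedShares` are hypothetical names.

```typescript
// Market shares from the list above, used here as relative weights.
const BROWSER_SHARE_SKETCH: Record<string, number> = {
  Chrome: 67,
  Safari: 20,
  Edge: 6,
  Firefox: 3,
};

// True only for browsers in the allowed list; everything else is filtered out.
function isAllowedBrowser(name: string): boolean {
  return Object.prototype.hasOwnProperty.call(BROWSER_SHARE_SKETCH, name);
}

// Rescales the shares so that they sum to 1.
function normalizedShares(shares: Record<string, number>): Record<string, number> {
  const names = Object.keys(shares);
  const total = names.reduce((sum, name) => sum + shares[name], 0);
  const result: Record<string, number> = {};
  for (const name of names) {
    result[name] = shares[name] / total;
  }
  return result;
}
```
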
**HTTP Header Fingerprinting**:
- DNT (Do Not Track): 30% probability of sending
- Accept headers: Browser-specific variations
- Header ordering: Matches real browser behavior (Chrome, Firefox, Safari, Edge each have unique order)

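A minimal sketch of the DNT probability and per-browser ordering follows. The orderings in `HEADER_ORDER_SKETCH` are placeholders invented for illustration, not captured from real browsers, and `buildHeaderList` is a hypothetical helper rather than the API of `http-fingerprint.ts`.

```typescript
type BrowserName = 'chrome' | 'firefox' | 'safari' | 'edge';

// Placeholder orderings for illustration only; not captured from real browsers.
const HEADER_ORDER_SKETCH: Record<BrowserName, string[]> = {
  chrome: ['user-agent', 'accept', 'accept-language', 'dnt'],
  edge: ['user-agent', 'accept', 'dnt', 'accept-language'],
  firefox: ['user-agent', 'accept-language', 'accept', 'dnt'],
  safari: ['accept', 'user-agent', 'accept-language', 'dnt'],
};

// Returns headers as an ordered list of [name, value] pairs.
// `roll` is expected to be in [0, 1); DNT is added when roll < dntProbability.
function buildHeaderList(
  browser: BrowserName,
  base: Record<string, string>,
  roll: number,
  dntProbability = 0.3,
): Array<[string, string]> {
  const entries = Object.keys(base).map((name): [string, string] => [name, base[name]]);
  if (roll < dntProbability) entries.push(['DNT', '1']);
  const order = HEADER_ORDER_SKETCH[browser];
  // Headers missing from the order list sort after all known ones.
  const rank = (name: string): number => {
    const index = order.indexOf(name.toLowerCase());
    return index === -1 ? order.length : index;
  };
  return entries.sort((a, b) => rank(a[0]) - rank(b[0]));
}
```

An ordered list of pairs is used instead of a plain object because the point of the exercise is to control the order in which headers go on the wire.
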
**curl-impersonate Binaries** (for curl transport):
| Browser | Binary |
|---------|--------|
| Chrome | `curl_chrome131` |
| Edge | `curl_chrome131` |
| Firefox | `curl_ff133` |
| Safari | `curl_safari17` |

These binaries mimic real browser TLS fingerprints to avoid detection.

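The table above maps directly onto a lookup, sketched here together with a helper that assembles an argument list. The binary names are taken from the table; `CURL_BINARY_SKETCH` and `buildCurlCommand` are hypothetical, and the only flags used are curl's standard `--silent` and `--proxy`.

```typescript
type ImpersonatedBrowser = 'chrome' | 'edge' | 'firefox' | 'safari';

const CURL_BINARY_SKETCH: Record<ImpersonatedBrowser, string> = {
  chrome: 'curl_chrome131',
  edge: 'curl_chrome131', // same binary as Chrome, per the table above
  firefox: 'curl_ff133',
  safari: 'curl_safari17',
};

// Builds [binary, ...args] for the chosen browser; the proxy is optional.
function buildCurlCommand(browser: ImpersonatedBrowser, url: string, proxyUrl?: string): string[] {
  const command = [CURL_BINARY_SKETCH[browser], '--silent'];
  if (proxyUrl) command.push('--proxy', proxyUrl);
  command.push(url);
  return command;
}
```

The resulting array could be passed to Node's `child_process.execFile(command[0], command.slice(1))`, which avoids shell quoting issues.
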
---

## Documentation

| Doc | Purpose |