# Workflow Documentation - December 10, 2025
## Purpose
This document captures the intended behavior for the CannaiQ crawl system, specifically around proxy rotation, fingerprinting, and anti-detection.
---
## Stealth & Anti-Detection Requirements
### 1. Task Determines Work, Proxy Determines Identity
The task payload contains:
- `dispensary_id` - which store to crawl
- `role` - what type of work (product_resync, entry_point_discovery, etc.)
The **proxy** determines the session identity:
- Proxy location (city, state, timezone) → drives timezone-consistent session behavior (there is no timezone HTTP header; `Accept-Language` stays fixed)
- Language is always English (`en-US`)
**Flow:**
```
Task claimed
└─► Get proxy from rotation
    └─► Proxy has location (city, state, timezone)
        └─► Build headers using proxy's timezone
            - Accept-Language: en-US,en;q=0.9
            - Timezone-consistent behavior
```
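
The split between task and proxy can be sketched as below. This is a minimal illustration, not the project's code: the `ProxyLocation`, `TaskPayload`, and `SessionIdentity` shapes and the `buildSessionIdentity` name are hypothetical, and `dispensary_id` is assumed to be numeric.

```typescript
// Hypothetical shapes for illustration; the real types live in the codebase.
interface ProxyLocation {
  city: string | null;
  state: string | null;
  timezone: string | null; // e.g. "America/Phoenix"
}

interface TaskPayload {
  dispensary_id: number; // ASSUMPTION: numeric id
  role: string; // e.g. "product_resync"
}

interface SessionIdentity {
  timezone: string | null;
  headers: Record<string, string>;
}

// Identity is built from the proxy alone; the task is not an input.
function buildSessionIdentity(proxy: ProxyLocation): SessionIdentity {
  return {
    timezone: proxy.timezone,
    headers: { "Accept-Language": "en-US,en;q=0.9" },
  };
}

const task: TaskPayload = { dispensary_id: 42, role: "product_resync" };
const identity = buildSessionIdentity({
  city: "Phoenix",
  state: "AZ",
  timezone: "America/Phoenix",
});
console.log(task.role, identity.timezone, identity.headers["Accept-Language"]);
```

Keeping the task out of the function signature makes it hard to leak task data into the session identity by accident.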
### 2. On 403 Block - Immediate Backoff
When a 403 is received:
1. **Immediately** stop using current IP
2. Get a new proxy (new IP)
3. Get a new UA/fingerprint
4. Retry the request
**Per-proxy failure tracking:**
- Track consecutive 403s per proxy; the count persists while the proxy is out of rotation
- After 3 consecutive 403s on the same proxy, each with a different UA/fingerprint → disable that proxy
- Rationale: if three different fingerprints all get a 403, the IP itself is burned
### 3. Fingerprint Rotation Rules
Each request uses:
- Proxy (IP)
- User-Agent
- sec-ch-ua headers (Client Hints)
- Accept-Language (always `en-US,en;q=0.9`; the proxy location supplies the timezone)
On 403:
1. Record failure on current proxy
2. Rotate to new proxy
3. Pick new random fingerprint
4. If same proxy fails 3 times with different fingerprints → disable proxy
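
The 403 steps above amount to a small retry loop. The sketch below shows only the control flow under stated assumptions: the `Rotator` interface, `Send` type, and `requestWithRotation` name are hypothetical, `send` is synchronous for readability (a real HTTP call would be async), and the backoff delay is omitted.

```typescript
// Hypothetical interfaces; method names echo the document but the
// signatures are assumptions made for this sketch.
interface Attempt {
  proxyId: number;
  fingerprintId: number;
}

interface Rotator {
  current(): Attempt;
  recordBlock(proxyId: number): void; // per-proxy 403 bookkeeping
  rotateBoth(): boolean; // new proxy + new fingerprint; false if none left
}

// Stand-in for the HTTP call; returns the response status code.
type Send = (attempt: Attempt) => number;

function requestWithRotation(rotator: Rotator, send: Send, maxAttempts = 5): number {
  for (let i = 0; i < maxAttempts; i++) {
    const attempt = rotator.current();
    const status = send(attempt);
    if (status !== 403) return status;
    // On 403: record the failure, then drop this IP and fingerprint at once.
    rotator.recordBlock(attempt.proxyId);
    if (!rotator.rotateBoth()) throw new Error("no proxies available");
  }
  throw new Error("still blocked after max attempts");
}
```

The `maxAttempts` cap is an addition for safety in the sketch; the document itself only specifies the per-proxy threshold.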
### 4. Proxy Table Schema
```sql
CREATE TABLE proxies (
id SERIAL PRIMARY KEY,
host VARCHAR(255) NOT NULL,
port INTEGER NOT NULL,
username VARCHAR(100),
password VARCHAR(100),
protocol VARCHAR(10) DEFAULT 'http',
active BOOLEAN DEFAULT true,
-- Location (determines session headers)
city VARCHAR(100),
state VARCHAR(50),
country VARCHAR(100),
country_code VARCHAR(10),
timezone VARCHAR(50),
-- Health tracking
failure_count INTEGER DEFAULT 0,
consecutive_403_count INTEGER DEFAULT 0, -- Track 403s specifically
last_used_at TIMESTAMPTZ,
last_failure_at TIMESTAMPTZ,
last_error TEXT,
-- Performance
response_time_ms INTEGER,
max_connections INTEGER DEFAULT 1
);
```
### 5. Failure Threshold
- **3 consecutive 403s** with different fingerprints → disable proxy
- Reset `consecutive_403_count` to 0 on successful request
- General `failure_count` tracks all errors (timeouts, connection errors, etc.)
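
The threshold rules can be sketched as pure functions over a proxy's health counters. The names `markBlocked` and `markSuccess` come from this document, but the function form and the `ProxyHealth` shape are hypothetical. Two behaviors are assumptions: a 403 also increments the general failure counter, and non-403 errors leave the 403 streak untouched.

```typescript
const MAX_CONSECUTIVE_403 = 3;

// Hypothetical in-memory view of the health columns in the proxies table.
interface ProxyHealth {
  active: boolean;
  failureCount: number;
  consecutive403Count: number;
}

// A 403 extends the streak; the proxy is disabled once the streak hits 3.
function markBlocked(p: ProxyHealth): ProxyHealth {
  const consecutive403Count = p.consecutive403Count + 1;
  return {
    active: p.active && consecutive403Count < MAX_CONSECUTIVE_403,
    failureCount: p.failureCount + 1, // ASSUMPTION: 403s count as general failures too
    consecutive403Count,
  };
}

// Any successful request resets the 403 streak.
function markSuccess(p: ProxyHealth): ProxyHealth {
  return { ...p, consecutive403Count: 0 };
}

// Timeouts, connection errors, etc.
// ASSUMPTION: these do not touch the 403 streak.
function markFailure(p: ProxyHealth): ProxyHealth {
  return { ...p, failureCount: p.failureCount + 1 };
}
```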
---
## Implementation Status
### COMPLETED - December 10, 2025
All code changes have been implemented per this specification:
#### 1. crawl-rotator.ts ✅
- [x] Added `consecutive403Count` to Proxy interface
- [x] Added `markBlocked()` method that increments `consecutive_403_count` and disables proxy at 3
- [x] Added `getProxyTimezone()` to return current proxy's timezone
- [x] `markSuccess()` now resets `consecutive_403_count` to 0
- [x] Replaced hardcoded UA list with `intoli/user-agents` library for realistic fingerprints
- [x] `BrowserFingerprint` interface includes full fingerprint data (UA, platform, screen size, viewport, sec-ch-ua headers)
#### 2. client.ts ✅
- [x] `startSession()` no longer takes state/timezone params
- [x] `startSession()` gets identity from proxy via `crawlRotator.getProxyLocation()`
- [x] Added `handle403Block()` that:
  - Calls `crawlRotator.recordBlock()` (tracks consecutive 403s)
  - Immediately rotates both proxy and fingerprint via `rotateBoth()`
  - Returns false if no more proxies available
- [x] `executeGraphQL()` calls `handle403Block()` on 403 (not `rotateProxyOn403`)
- [x] `fetchPage()` uses same 403 handling
- [x] 500ms backoff after rotation (not linear delay)
#### 3. Task Handlers ✅
- [x] `entry-point-discovery.ts`: `startSession()` called with no params
- [x] `product-refresh.ts`: `startSession()` called with no params
#### 4. Dependencies ✅
- [x] Added `user-agents` npm package for realistic UA generation
---
## Files Changed
| File | Changes |
|------|---------|
| `backend/src/services/crawl-rotator.ts` | Complete rewrite with `consecutive403Count`, `markBlocked()`, `intoli/user-agents` |
| `backend/src/platforms/dutchie/client.ts` | `startSession()` uses proxy location, `handle403Block()` for 403 handling |
| `backend/src/tasks/handlers/entry-point-discovery.ts` | `startSession()` no params |
| `backend/src/tasks/handlers/product-refresh.ts` | `startSession()` no params |
| `backend/package.json` | Added `user-agents` dependency |
---
## Migration Required
The `proxies` table needs a `consecutive_403_count` column if it is not already present:
```sql
ALTER TABLE proxies ADD COLUMN IF NOT EXISTS consecutive_403_count INTEGER DEFAULT 0;
```
---
## Key Behaviors Summary
| Behavior | Implementation |
|----------|----------------|
| Session identity | From proxy location (`getProxyLocation()`) |
| Language | Always `en-US,en;q=0.9` |
| 403 handling | `handle403Block()` → `recordBlock()` → `rotateBoth()` |
| Proxy disable | After 3 consecutive 403s (`consecutive403Count >= 3`) |
| Success reset | `markSuccess()` resets `consecutive403Count` to 0 |
| UA generation | `intoli/user-agents` library (daily updated, realistic fingerprints) |
| Fingerprint data | Full: UA, platform, screen size, viewport, sec-ch-ua headers |
---
## User-Agent Generation
### Data Source
The `user-agents` npm package (GitHub: `intoli/user-agents`) provides daily-updated market share data collected from Intoli's residential proxy network (millions of real users). The package auto-releases new versions daily to npm.
### Device Category Distribution (hardcoded)
| Category | Share |
|----------|-------|
| Mobile | 62% |
| Desktop | 36% |
| Tablet | 2% |
### Browser Filter (whitelist only)
Only these browsers are allowed:
- Chrome (67%)
- Safari (20%)
- Edge (6%)
- Firefox (3%)
Samsung Internet, Opera, and other niche browsers are filtered out.
### Desktop OS Distribution (from library)
| OS | Share |
|----|-------|
| Windows | 72% |
| macOS | 17% |
| Linux | 4% |
### UA Lifecycle
1. **Session start** (new proxy IP obtained) → Roll device category (62/36/2) → Generate UA filtered to device + top 4 browsers → Store on session
2. **UA sticks** until IP rotates (403 block or manual rotation)
3. **IP rotation** triggers new UA generation
### Failure Handling
- If UA generation fails → Alert admin dashboard, **stop crawl immediately**
- No fallback to static UA list
- This forces investigation rather than silent degradation
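
A fail-fast wrapper along these lines would enforce the "no fallback" rule. This is a sketch only: `generateUserAgentOrStop`, `generate`, and `alertAdmin` are hypothetical names, with the generator and the alert hook injected so the control flow stands on its own.

```typescript
// Hypothetical wrapper; `generate` and `alertAdmin` are injected stand-ins,
// not functions from the project.
function generateUserAgentOrStop(
  generate: () => string,
  alertAdmin: (message: string) => void
): string {
  let userAgent = "";
  let reason = "generator returned an empty UA";
  try {
    userAgent = generate();
  } catch (err) {
    reason = err instanceof Error ? err.message : String(err);
  }
  if (userAgent === "") {
    alertAdmin(`UA generation failed: ${reason}`);
    // Deliberately no static fallback: throw so the crawl stops.
    throw new Error(`UA generation failed: ${reason}`);
  }
  return userAgent;
}
```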
### Session Logging
Each session logs:
- Device category (mobile/desktop/tablet)
- Full UA string
- Browser name (Chrome/Safari/Edge/Firefox)
- IP address (from proxy)
- Session start timestamp
Logs are rotated monthly.
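
One possible shape for a session log record, assuming one JSON object per line. The `SessionLogEntry` field names and `formatSessionLog` are hypothetical; only the five logged items come from the list above.

```typescript
// Hypothetical record shape covering the five logged fields.
interface SessionLogEntry {
  deviceCategory: "mobile" | "desktop" | "tablet";
  userAgent: string;
  browser: "Chrome" | "Safari" | "Edge" | "Firefox";
  proxyIp: string;
  startedAt: string; // ISO 8601 timestamp
}

// ASSUMPTION: logs are written as one JSON object per line.
function formatSessionLog(entry: SessionLogEntry): string {
  return JSON.stringify(entry);
}

const line = formatSessionLog({
  deviceCategory: "mobile",
  userAgent: "Mozilla/5.0 (example)",
  browser: "Chrome",
  proxyIp: "203.0.113.7", // documentation-range address
  startedAt: new Date(0).toISOString(),
});
console.log(line);
```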
### Implementation
Located in `backend/src/services/crawl-rotator.ts`:
```typescript
// Per workflow-12102025.md: Device category distribution
const DEVICE_WEIGHTS = { mobile: 62, desktop: 36, tablet: 2 };
// Per workflow-12102025.md: Browser whitelist
const ALLOWED_BROWSERS = ['Chrome', 'Safari', 'Edge', 'Firefox'];
```
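
A weighted roll over these constants could look like the sketch below. The `rollDeviceCategory` and `isAllowedBrowser` helpers are hypothetical, and the random source is injectable so the roll can be tested deterministically. The step that asks the `user-agents` package for a UA matching the rolled category is left out here.

```typescript
type DeviceCategory = "mobile" | "desktop" | "tablet";

const DEVICE_WEIGHTS: Record<DeviceCategory, number> = { mobile: 62, desktop: 36, tablet: 2 };
const DEVICE_ORDER: DeviceCategory[] = ["mobile", "desktop", "tablet"];
const ALLOWED_BROWSERS = ["Chrome", "Safari", "Edge", "Firefox"];

// Weighted pick: walk the categories, subtracting each weight from the roll.
function rollDeviceCategory(rand: () => number = Math.random): DeviceCategory {
  let total = 0;
  for (const category of DEVICE_ORDER) total += DEVICE_WEIGHTS[category];
  let roll = rand() * total;
  for (const category of DEVICE_ORDER) {
    const weight = DEVICE_WEIGHTS[category];
    if (roll < weight) return category;
    roll -= weight;
  }
  return "tablet"; // only reachable if rand() returns 1 or more
}

// Whitelist check used to reject Samsung Internet, Opera, and others.
function isAllowedBrowser(name: string): boolean {
  return ALLOWED_BROWSERS.indexOf(name) !== -1;
}
```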
---
## HTTP Fingerprinting
### Goal
Make HTTP requests indistinguishable from real browser traffic. No repeatable footprint.
### Components
1. **Full Header Set** - All headers a real browser sends
2. **Header Ordering** - Browser-specific order (Chrome vs Firefox vs Safari)
3. **TLS Fingerprint** - Use `curl-impersonate` to match browser TLS signature
4. **Dynamic Referer** - Set per dispensary being crawled
5. **Natural Randomization** - Vary optional headers like real users
### Required Headers
| Header | Chrome | Firefox | Safari | Notes |
|--------|--------|---------|--------|-------|
| `User-Agent` | ✅ | ✅ | ✅ | From UA generation |
| `Accept` | ✅ | ✅ | ✅ | Content types |
| `Accept-Language` | ✅ | ✅ | ✅ | Always `en-US,en;q=0.9` |
| `Accept-Encoding` | ✅ | ✅ | ✅ | `gzip, deflate, br` |
| `Connection` | ✅ | ✅ | ✅ | `keep-alive` |
| `Origin` | ✅ | ✅ | ✅ | `https://dutchie.com` (POST only) |
| `Referer` | ✅ | ✅ | ✅ | Dynamic per dispensary |
| `sec-ch-ua` | ✅ | ❌ | ❌ | Chromium only |
| `sec-ch-ua-mobile` | ✅ | ❌ | ❌ | Chromium only |
| `sec-ch-ua-platform` | ✅ | ❌ | ❌ | Chromium only |
| `sec-fetch-dest` | ✅ | ✅ | ❌ | `empty` for XHR |
| `sec-fetch-mode` | ✅ | ✅ | ❌ | `cors` for XHR |
| `sec-fetch-site` | ✅ | ✅ | ❌ | `same-origin` |
| `Upgrade-Insecure-Requests` | ✅ | ✅ | ✅ | `1` (page loads only) |
| `DNT` | ~30% | ~30% | ~30% | Randomized per session |
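
The table can be turned into a per-browser header builder for XHR/GraphQL requests, roughly as follows. This is a sketch under assumptions: `buildHeaders` and `RequestContext` are hypothetical, the `Accept: */*` value is a placeholder, and the `sec-ch-ua` values are supplied by the caller from the fingerprint rather than invented here. `Upgrade-Insecure-Requests` is omitted because it applies to page loads only.

```typescript
type Browser = "Chrome" | "Firefox" | "Safari";

// Hypothetical input describing one request within a session.
interface RequestContext {
  browser: Browser;
  userAgent: string;
  referer: string;
  isPost: boolean;
  sendDnt: boolean; // rolled once per session
  clientHints?: { brands: string; mobile: string; platform: string };
}

function buildHeaders(ctx: RequestContext): Record<string, string> {
  const headers: Record<string, string> = {
    "User-Agent": ctx.userAgent,
    "Accept": "*/*", // ASSUMPTION: placeholder value for XHR requests
    "Accept-Language": "en-US,en;q=0.9",
    "Accept-Encoding": "gzip, deflate, br",
    "Connection": "keep-alive",
    "Referer": ctx.referer,
  };
  if (ctx.isPost) headers["Origin"] = "https://dutchie.com";
  // Client Hints: Chromium only.
  if (ctx.browser === "Chrome" && ctx.clientHints) {
    headers["sec-ch-ua"] = ctx.clientHints.brands;
    headers["sec-ch-ua-mobile"] = ctx.clientHints.mobile;
    headers["sec-ch-ua-platform"] = ctx.clientHints.platform;
  }
  // Fetch metadata: Chrome and Firefox per the table above.
  if (ctx.browser !== "Safari") {
    headers["sec-fetch-dest"] = "empty";
    headers["sec-fetch-mode"] = "cors";
    headers["sec-fetch-site"] = "same-origin";
  }
  if (ctx.sendDnt) headers["DNT"] = "1";
  return headers;
}
```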
### Header Ordering
Each browser sends headers in a specific order. Fingerprinting services detect mismatches.
**Chrome order (GraphQL request):**
1. Host
2. Connection
3. Content-Length (POST)
4. sec-ch-ua
5. DNT (if enabled)
6. sec-ch-ua-mobile
7. User-Agent
8. sec-ch-ua-platform
9. Content-Type (POST)
10. Accept
11. Origin (POST)
12. sec-fetch-site
13. sec-fetch-mode
14. sec-fetch-dest
15. Referer
16. Accept-Encoding
17. Accept-Language
**Firefox order (GraphQL request):**
1. Host
2. User-Agent
3. Accept
4. Accept-Language
5. Accept-Encoding
6. Content-Type (POST)
7. Content-Length (POST)
8. Origin (POST)
9. DNT (if enabled)
10. Connection
11. Referer
12. sec-fetch-dest
13. sec-fetch-mode
14. sec-fetch-site
**Safari order (GraphQL request):**
1. Host
2. Connection
3. Content-Length (POST)
4. Accept
5. User-Agent
6. Content-Type (POST)
7. Origin (POST)
8. Referer
9. Accept-Encoding
10. Accept-Language
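
The three orderings above can be applied with a lookup table. This is a sketch of what the header-ordering step might do; `HEADER_ORDER` and `orderHeaders` are hypothetical names. Headers not listed for the browser are dropped in this version, and the HTTP client must preserve the order of the returned pairs for the ordering to have any effect.

```typescript
type Browser = "Chrome" | "Firefox" | "Safari";

// Orders copied from the lists above (GraphQL request).
const HEADER_ORDER: Record<Browser, string[]> = {
  Chrome: [
    "Host", "Connection", "Content-Length", "sec-ch-ua", "DNT",
    "sec-ch-ua-mobile", "User-Agent", "sec-ch-ua-platform", "Content-Type",
    "Accept", "Origin", "sec-fetch-site", "sec-fetch-mode", "sec-fetch-dest",
    "Referer", "Accept-Encoding", "Accept-Language",
  ],
  Firefox: [
    "Host", "User-Agent", "Accept", "Accept-Language", "Accept-Encoding",
    "Content-Type", "Content-Length", "Origin", "DNT", "Connection",
    "Referer", "sec-fetch-dest", "sec-fetch-mode", "sec-fetch-site",
  ],
  Safari: [
    "Host", "Connection", "Content-Length", "Accept", "User-Agent",
    "Content-Type", "Origin", "Referer", "Accept-Encoding", "Accept-Language",
  ],
};

// Returns [name, value] pairs in the browser's order.
// Lookup is case-insensitive; headers absent from the list are dropped.
function orderHeaders(browser: Browser, headers: Record<string, string>): Array<[string, string]> {
  const byLowerName: Record<string, string> = {};
  for (const name of Object.keys(headers)) {
    const value = headers[name];
    if (value !== undefined) byLowerName[name.toLowerCase()] = value;
  }
  const ordered: Array<[string, string]> = [];
  for (const name of HEADER_ORDER[browser]) {
    const value = byLowerName[name.toLowerCase()];
    if (value !== undefined) ordered.push([name, value]);
  }
  return ordered;
}
```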
### TLS Fingerprinting
Use `curl-impersonate` instead of standard curl:
- `curl_chrome131` - Mimics Chrome 131 TLS handshake
- `curl_ff133` - Mimics Firefox 133 TLS handshake
- `curl_safari17` - Mimics Safari 17 TLS handshake
Match the TLS binary to the browser named in the session's User-Agent.
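
Picking the binary can be a plain lookup on the session's browser name, as in this sketch. `TLS_BINARY` and `tlsBinaryFor` are hypothetical names, and the binary names are copied from the list above, so they should be checked against the installed curl-impersonate build. The Edge mapping is an assumption, since no Edge binary is listed.

```typescript
type BrowserName = "Chrome" | "Safari" | "Edge" | "Firefox";

// Binary names copied from the list above; confirm they match the
// curl-impersonate build that is actually installed.
const TLS_BINARY: Record<BrowserName, string> = {
  Chrome: "curl_chrome131",
  Firefox: "curl_ff133",
  Safari: "curl_safari17",
  // ASSUMPTION: Edge is Chromium-based, so it reuses the Chrome binary.
  Edge: "curl_chrome131",
};

function tlsBinaryFor(browser: BrowserName): string {
  return TLS_BINARY[browser];
}
```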
### Dynamic Referer
Set Referer to the dispensary's actual page URL:
```
Crawling "harvest-of-tempe" → Referer: https://dutchie.com/dispensary/harvest-of-tempe
Crawling "zen-leaf-mesa" → Referer: https://dutchie.com/dispensary/zen-leaf-mesa
```
The Referer is derived from the dispensary's `menu_url` field.
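
A minimal sketch of the derivation, assuming `menu_url` contains a `/dispensary/<slug>` path segment. That format is an assumption (it is not specified here) and `refererFromMenuUrl` is a hypothetical helper; URLs in another shape return `null` so the caller can decide what to do.

```typescript
// ASSUMPTION: menu_url contains "/dispensary/<slug>".
// Returns null when no slug can be extracted.
function refererFromMenuUrl(menuUrl: string): string | null {
  const match = /\/dispensary\/([A-Za-z0-9-]+)/.exec(menuUrl);
  if (match === null) return null;
  // Query strings and deeper paths are dropped; only the slug is kept.
  return `https://dutchie.com/dispensary/${match[1]}`;
}

console.log(refererFromMenuUrl("https://dutchie.com/dispensary/harvest-of-tempe/products"));
```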
### Natural Randomization
Per-session randomization (set once when session starts, consistent for session):
| Feature | Distribution | Implementation |
|---------|--------------|----------------|
| DNT header | 30% have it | `Math.random() < 0.30` |
| Accept quality values | Slight variation | `q=0.9` vs `q=0.8` |
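
Per-session traits can be rolled once and stored with the session, as in the sketch below. `SessionTraits` and `rollSessionTraits` are hypothetical names. The 50/50 split between the two quality values is an assumption, as is leaving open which header the quality value applies to.

```typescript
// Hypothetical per-session traits, rolled once at session start.
interface SessionTraits {
  sendDnt: boolean;
  acceptQuality: "0.9" | "0.8";
}

function rollSessionTraits(rand: () => number = Math.random): SessionTraits {
  return {
    sendDnt: rand() < 0.30, // 30% of sessions send DNT
    acceptQuality: rand() < 0.5 ? "0.9" : "0.8", // ASSUMPTION: even split
  };
}

// Roll once, then reuse the same traits for every request in the session.
const traits = rollSessionTraits();
console.log(traits.sendDnt, traits.acceptQuality);
```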
### Implementation Files
| File | Purpose |
|------|---------|
| `backend/src/services/crawl-rotator.ts` | `BrowserFingerprint` includes full header config |
| `backend/src/platforms/dutchie/client.ts` | Build headers from fingerprint, use curl-impersonate |
| `backend/src/services/http-fingerprint.ts` | Header ordering per browser (NEW) |