# Workflow Documentation - December 10, 2025

## Purpose

This document captures the intended behavior for the CannaiQ crawl system, specifically around proxy rotation, fingerprinting, and anti-detection.

---

## Stealth & Anti-Detection Requirements

### 1. Task Determines Work, Proxy Determines Identity

The task payload contains:
- `dispensary_id` - which store to crawl
- `role` - what type of work (product_resync, entry_point_discovery, etc.)

The **proxy** determines the session identity:
- Proxy location (city, state, timezone) → sets Accept-Language and timezone headers
- Language is always English (`en-US`)

**Flow:**
```
Task claimed
    │
    └─► Get proxy from rotation
            │
            └─► Proxy has location (city, state, timezone)
                    │
                    └─► Build headers using proxy's timezone
                            - Accept-Language: en-US,en;q=0.9
                            - Timezone-consistent behavior
```

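
The flow above can be sketched as a small pure function. This is only an illustration under stated assumptions: `ProxyLocation`, `SessionIdentity`, and `buildSessionIdentity` are hypothetical names, not the actual `crawl-rotator.ts` API, and the only values taken from this document are the fixed `en-US,en;q=0.9` language and the idea that the timezone comes from the proxy.

```typescript
// Hypothetical shape of the location columns on a proxy row (city, state, timezone).
interface ProxyLocation {
  city: string;
  state: string;
  timezone: string; // assumed to be an IANA name such as "America/Phoenix"
}

// Hypothetical session identity derived from the proxy, never from the task payload.
interface SessionIdentity {
  headers: Record<string, string>;
  timezone: string;
  region: string;
}

// Language is always English, regardless of where the proxy is located.
const ACCEPT_LANGUAGE = 'en-US,en;q=0.9';

function buildSessionIdentity(location: ProxyLocation): SessionIdentity {
  return {
    headers: { 'Accept-Language': ACCEPT_LANGUAGE },
    // Timezone-consistent behavior follows the proxy, not the worker host.
    timezone: location.timezone,
    region: `${location.city}, ${location.state}`,
  };
}
```

Keeping this derivation free of task data mirrors the rule in the heading: the task decides what to crawl, the proxy decides who the session appears to be.
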
### 2. On 403 Block - Immediate Backoff

When a 403 is received:

1. **Immediately** stop using current IP
2. Get a new proxy (new IP)
3. Get a new UA/fingerprint
4. Retry the request

**Per-proxy failure tracking:**
- Track UA rotation attempts per proxy
- After 3 UA/fingerprint rotations on the same proxy → disable that proxy
- This means: if we rotate UA 3 times and still get 403, the proxy is burned

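
The four steps above amount to a retry loop around each request. The sketch below is a minimal illustration, not the code in `client.ts`: the `Rotator` interface, `requestWithRotation`, the `maxAttempts` cap, and the injected `send` callback are all assumptions made for the example. The 500 ms default pause mirrors the backoff mentioned under Implementation Status.

```typescript
// Hypothetical identity handed to the HTTP layer for one attempt.
interface Identity {
  proxyId: number;
  userAgent: string;
}

// Hypothetical subset of a rotator; the real crawl-rotator API may differ.
interface Rotator {
  current(): Identity;
  recordBlock(proxyId: number): void; // tracks consecutive 403s per proxy
  rotateBoth(): Identity | null;      // new proxy + new fingerprint, null if none left
}

async function requestWithRotation(
  rotator: Rotator,
  send: (identity: Identity) => Promise<number>, // resolves to the HTTP status code
  maxAttempts = 5,   // assumption: the document does not specify a retry cap
  backoffMs = 500,
): Promise<number> {
  let identity = rotator.current();
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const status = await send(identity);
    if (status !== 403) return status;

    // 1. Stop using the current IP and record the block against it.
    rotator.recordBlock(identity.proxyId);
    // 2-3. Switch to a new proxy and a new UA/fingerprint together.
    const next = rotator.rotateBoth();
    if (next === null) throw new Error('No proxies available after 403');
    identity = next;
    // 4. Short fixed pause, then retry.
    await new Promise<void>((resolve) => setTimeout(resolve, backoffMs));
  }
  throw new Error(`Still blocked after ${maxAttempts} attempts`);
}
```

Injecting `send` keeps the loop independent of the transport, so the same logic could wrap both GraphQL calls and page fetches.
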
### 3. Fingerprint Rotation Rules

Each request uses:
- Proxy (IP)
- User-Agent
- sec-ch-ua headers (Client Hints)
- Accept-Language (from proxy location)

On 403:
1. Record failure on current proxy
2. Rotate to new proxy
3. Pick new random fingerprint
4. If same proxy fails 3 times with different fingerprints → disable proxy

### 4. Proxy Table Schema

```sql
CREATE TABLE proxies (
  id SERIAL PRIMARY KEY,
  host VARCHAR(255) NOT NULL,
  port INTEGER NOT NULL,
  username VARCHAR(100),
  password VARCHAR(100),
  protocol VARCHAR(10) DEFAULT 'http',
  active BOOLEAN DEFAULT true,

  -- Location (determines session headers)
  city VARCHAR(100),
  state VARCHAR(50),
  country VARCHAR(100),
  country_code VARCHAR(10),
  timezone VARCHAR(50),

  -- Health tracking
  failure_count INTEGER DEFAULT 0,
  consecutive_403_count INTEGER DEFAULT 0,  -- Track 403s specifically
  last_used_at TIMESTAMPTZ,
  last_failure_at TIMESTAMPTZ,
  last_error TEXT,

  -- Performance
  response_time_ms INTEGER,
  max_connections INTEGER DEFAULT 1
);
```

### 5. Failure Threshold

- **3 consecutive 403s** with different fingerprints → disable proxy
- Reset `consecutive_403_count` to 0 on successful request
- General `failure_count` tracks all errors (timeouts, connection errors, etc.)

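
The threshold rules can be expressed as two small state transitions. This is a sketch only: `ProxyHealth` is a hypothetical in-memory view of the health columns, the function bodies are illustrative rather than the actual `markBlocked()` / `markSuccess()` implementations, and counting a 403 toward the general `failureCount` is an assumption the document does not state.

```typescript
const MAX_CONSECUTIVE_403 = 3;

// Hypothetical in-memory view of the health columns on the proxies table.
interface ProxyHealth {
  active: boolean;
  failureCount: number;
  consecutive403Count: number;
  lastError?: string;
}

// A 403 increments the consecutive counter and disables the proxy at the threshold.
function markBlocked(proxy: ProxyHealth): ProxyHealth {
  const consecutive403Count = proxy.consecutive403Count + 1;
  return {
    ...proxy,
    failureCount: proxy.failureCount + 1, // assumption: 403s also count as general failures
    consecutive403Count,
    active: proxy.active && consecutive403Count < MAX_CONSECUTIVE_403,
    lastError: '403 Forbidden',
  };
}

// Any successful request resets only the 403 streak, not the general failure count.
function markSuccess(proxy: ProxyHealth): ProxyHealth {
  return { ...proxy, consecutive403Count: 0 };
}
```
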

---

## Implementation Status

### COMPLETED - December 10, 2025

All code changes have been implemented per this specification:

#### 1. crawl-rotator.ts ✅

- [x] Added `consecutive403Count` to Proxy interface
- [x] Added `markBlocked()` method that increments `consecutive_403_count` and disables proxy at 3
- [x] Added `getProxyTimezone()` to return current proxy's timezone
- [x] `markSuccess()` now resets `consecutive_403_count` to 0
- [x] Replaced hardcoded UA list with `intoli/user-agents` library for realistic fingerprints
- [x] `BrowserFingerprint` interface includes full fingerprint data (UA, platform, screen size, viewport, sec-ch-ua headers)

#### 2. client.ts ✅

- [x] `startSession()` no longer takes state/timezone params
- [x] `startSession()` gets identity from proxy via `crawlRotator.getProxyLocation()`
- [x] Added `handle403Block()` that:
  - Calls `crawlRotator.recordBlock()` (tracks consecutive 403s)
  - Immediately rotates both proxy and fingerprint via `rotateBoth()`
  - Returns false if no more proxies available
- [x] `executeGraphQL()` calls `handle403Block()` on 403 (not `rotateProxyOn403`)
- [x] `fetchPage()` uses same 403 handling
- [x] 500ms backoff after rotation (not linear delay)

#### 3. Task Handlers ✅

- [x] `entry-point-discovery.ts`: `startSession()` called with no params
- [x] `product-refresh.ts`: `startSession()` called with no params

#### 4. Dependencies ✅

- [x] Added `user-agents` npm package for realistic UA generation

---

## Files Changed

| File | Changes |
|------|---------|
| `backend/src/services/crawl-rotator.ts` | Complete rewrite with `consecutive403Count`, `markBlocked()`, `intoli/user-agents` |
| `backend/src/platforms/dutchie/client.ts` | `startSession()` uses proxy location, `handle403Block()` for 403 handling |
| `backend/src/tasks/handlers/entry-point-discovery.ts` | `startSession()` no params |
| `backend/src/tasks/handlers/product-refresh.ts` | `startSession()` no params |
| `backend/package.json` | Added `user-agents` dependency |

---

## Migration Required

The `proxies` table needs a `consecutive_403_count` column if it is not already present:

```sql
ALTER TABLE proxies ADD COLUMN IF NOT EXISTS consecutive_403_count INTEGER DEFAULT 0;
```

---

## Key Behaviors Summary

| Behavior | Implementation |
|----------|----------------|
| Session identity | From proxy location (`getProxyLocation()`) |
| Language | Always `en-US,en;q=0.9` |
| 403 handling | `handle403Block()` → `recordBlock()` → `rotateBoth()` |
| Proxy disable | After 3 consecutive 403s (`consecutive403Count >= 3`) |
| Success reset | `markSuccess()` resets `consecutive403Count` to 0 |
| UA generation | `intoli/user-agents` library (daily updated, realistic fingerprints) |
| Fingerprint data | Full: UA, platform, screen size, viewport, sec-ch-ua headers |

---

## User-Agent Generation

### Data Source

The `user-agents` npm package (GitHub: `intoli/user-agents`) provides daily-updated market share data collected from Intoli's residential proxy network (millions of real users). New versions of the package are released to npm automatically every day.

### Device Category Distribution (hardcoded)

| Category | Share |
|----------|-------|
| Mobile | 62% |
| Desktop | 36% |
| Tablet | 2% |

### Browser Filter (whitelist only)

Only these browsers are allowed:
- Chrome (67%)
- Safari (20%)
- Edge (6%)
- Firefox (3%)

Samsung Internet, Opera, and other niche browsers are filtered out.

### Desktop OS Distribution (from library)

| OS | Share |
|----|-------|
| Windows | 72% |
| macOS | 17% |
| Linux | 4% |

### UA Lifecycle

1. **Session start** (new proxy IP obtained) → Roll device category (62/36/2) → Generate UA filtered to device + top 4 browsers → Store on session
2. **UA sticks** until IP rotates (403 block or manual rotation)
3. **IP rotation** triggers new UA generation

### Failure Handling

- If UA generation fails → Alert admin dashboard, **stop crawl immediately**
- No fallback to static UA list
- This forces investigation rather than silent degradation

### Session Logging

Each session logs:
- Device category (mobile/desktop/tablet)
- Full UA string
- Browser name (Chrome/Safari/Edge/Firefox)
- IP address (from proxy)
- Session start timestamp

Logs are rotated monthly.

### Implementation

Located in `backend/src/services/crawl-rotator.ts`:

```typescript
// Per workflow-12102025.md: Device category distribution
const DEVICE_WEIGHTS = { mobile: 62, desktop: 36, tablet: 2 };

// Per workflow-12102025.md: Browser whitelist
const ALLOWED_BROWSERS = ['Chrome', 'Safari', 'Edge', 'Firefox'];
```

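
The "roll device category (62/36/2)" step from the UA lifecycle could look like the weighted pick below. This is a minimal sketch, assuming a hypothetical `rollDeviceCategory` helper with an injectable random source. It does not call the `user-agents` package; the rolled category would then be handed to that library as its device-category filter, together with the browser whitelist.

```typescript
type DeviceCategory = 'mobile' | 'desktop' | 'tablet';

// Same weights as the constants above, repeated so the sketch is self-contained.
const DEVICE_WEIGHTS: Record<DeviceCategory, number> = { mobile: 62, desktop: 36, tablet: 2 };

// Weighted random pick; `rng` is injectable so the roll can be made deterministic in tests.
function rollDeviceCategory(rng: () => number = Math.random): DeviceCategory {
  const entries = Object.entries(DEVICE_WEIGHTS) as [DeviceCategory, number][];
  const total = entries.reduce((sum, [, weight]) => sum + weight, 0);
  let roll = rng() * total;
  for (const [category, weight] of entries) {
    if (roll < weight) return category;
    roll -= weight;
  }
  // Only reachable through floating-point edge cases; fall back to the last category.
  return entries[entries.length - 1][0];
}
```
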

---

## HTTP Fingerprinting

### Goal

Make HTTP requests indistinguishable from real browser traffic. No repeatable footprint.

### Components

1. **Full Header Set** - All headers a real browser sends
2. **Header Ordering** - Browser-specific order (Chrome vs Firefox vs Safari)
3. **TLS Fingerprint** - Use `curl-impersonate` to match browser TLS signature
4. **Dynamic Referer** - Set per dispensary being crawled
5. **Natural Randomization** - Vary optional headers like real users

### Required Headers

| Header | Chrome | Firefox | Safari | Notes |
|--------|--------|---------|--------|-------|
| `User-Agent` | ✅ | ✅ | ✅ | From UA generation |
| `Accept` | ✅ | ✅ | ✅ | Content types |
| `Accept-Language` | ✅ | ✅ | ✅ | Always `en-US,en;q=0.9` |
| `Accept-Encoding` | ✅ | ✅ | ✅ | `gzip, deflate, br` |
| `Connection` | ✅ | ✅ | ✅ | `keep-alive` |
| `Origin` | ✅ | ✅ | ✅ | `https://dutchie.com` (POST only) |
| `Referer` | ✅ | ✅ | ✅ | Dynamic per dispensary |
| `sec-ch-ua` | ✅ | ❌ | ❌ | Chromium only |
| `sec-ch-ua-mobile` | ✅ | ❌ | ❌ | Chromium only |
| `sec-ch-ua-platform` | ✅ | ❌ | ❌ | Chromium only |
| `sec-fetch-dest` | ✅ | ✅ | ❌ | `empty` for XHR |
| `sec-fetch-mode` | ✅ | ✅ | ❌ | `cors` for XHR |
| `sec-fetch-site` | ✅ | ✅ | ❌ | `same-origin` |
| `Upgrade-Insecure-Requests` | ✅ | ✅ | ✅ | `1` (page loads only) |
| `DNT` | ~30% | ~30% | ~30% | Randomized per session |

### Header Ordering

Each browser sends headers in a specific order. Fingerprinting services detect mismatches.

**Chrome order (GraphQL request):**
1. Host
2. Connection
3. Content-Length (POST)
4. sec-ch-ua
5. DNT (if enabled)
6. sec-ch-ua-mobile
7. User-Agent
8. sec-ch-ua-platform
9. Content-Type (POST)
10. Accept
11. Origin (POST)
12. sec-fetch-site
13. sec-fetch-mode
14. sec-fetch-dest
15. Referer
16. Accept-Encoding
17. Accept-Language

**Firefox order (GraphQL request):**
1. Host
2. User-Agent
3. Accept
4. Accept-Language
5. Accept-Encoding
6. Content-Type (POST)
7. Content-Length (POST)
8. Origin (POST)
9. DNT (if enabled)
10. Connection
11. Referer
12. sec-fetch-dest
13. sec-fetch-mode
14. sec-fetch-site

**Safari order (GraphQL request):**
1. Host
2. Connection
3. Content-Length (POST)
4. Accept
5. User-Agent
6. Content-Type (POST)
7. Origin (POST)
8. Referer
9. Accept-Encoding
10. Accept-Language

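
One way to apply these orderings is to keep each list as data and sort an unordered header map against it. The sketch below assumes hypothetical names (`HEADER_ORDER`, `orderHeaders`) and is not the contents of `http-fingerprint.ts`; the three lists are copied from the orderings above.

```typescript
type OrderedBrowser = 'Chrome' | 'Firefox' | 'Safari';

// Header order per browser for a GraphQL POST, copied from the lists above.
const HEADER_ORDER: Record<OrderedBrowser, string[]> = {
  Chrome: [
    'Host', 'Connection', 'Content-Length', 'sec-ch-ua', 'DNT', 'sec-ch-ua-mobile',
    'User-Agent', 'sec-ch-ua-platform', 'Content-Type', 'Accept', 'Origin',
    'sec-fetch-site', 'sec-fetch-mode', 'sec-fetch-dest', 'Referer',
    'Accept-Encoding', 'Accept-Language',
  ],
  Firefox: [
    'Host', 'User-Agent', 'Accept', 'Accept-Language', 'Accept-Encoding',
    'Content-Type', 'Content-Length', 'Origin', 'DNT', 'Connection', 'Referer',
    'sec-fetch-dest', 'sec-fetch-mode', 'sec-fetch-site',
  ],
  Safari: [
    'Host', 'Connection', 'Content-Length', 'Accept', 'User-Agent', 'Content-Type',
    'Origin', 'Referer', 'Accept-Encoding', 'Accept-Language',
  ],
};

// Returns [name, value] pairs in the browser's order. Headers that are absent
// (e.g. DNT when disabled) are skipped, and headers the browser never sends are dropped.
function orderHeaders(
  browser: OrderedBrowser,
  headers: Record<string, string>,
): [string, string][] {
  const byLowerName = new Map<string, string>();
  for (const [name, value] of Object.entries(headers)) {
    byLowerName.set(name.toLowerCase(), value);
  }
  const ordered: [string, string][] = [];
  for (const name of HEADER_ORDER[browser]) {
    const value = byLowerName.get(name.toLowerCase());
    if (value !== undefined) ordered.push([name, value]);
  }
  return ordered;
}
```

Returning an array of pairs rather than an object makes the order explicit, which matters when the pairs are later turned into repeated `-H` arguments for a curl-style client.
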
### TLS Fingerprinting

Use `curl-impersonate` instead of standard curl:
- `curl_chrome131` - Mimics Chrome 131 TLS handshake
- `curl_ff133` - Mimics Firefox 133 TLS handshake
- `curl_safari17` - Mimics Safari 17 TLS handshake

Match the TLS binary to the browser named in the session's UA, so the TLS handshake and the `User-Agent` header describe the same browser.

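
Selecting the binary from the UA string might be sketched as follows. The binary names are the ones listed above and depend on which `curl-impersonate` build is installed. `detectBrowser` and `tlsBinaryFor` are hypothetical helpers, and routing Edge to the Chrome binary is an assumption (Edge is Chromium-based, but this document does not list a binary for it).

```typescript
type BrowserName = 'Chrome' | 'Edge' | 'Firefox' | 'Safari';

// Binary names as listed in this document; verify them against the installed build.
const TLS_BINARY: Record<BrowserName, string> = {
  Chrome: 'curl_chrome131',
  Edge: 'curl_chrome131', // ASSUMPTION: no Edge binary is listed, so reuse the Chromium one
  Firefox: 'curl_ff133',
  Safari: 'curl_safari17',
};

// Order matters: Edge UAs also contain "Chrome/" and "Safari/",
// and Chrome UAs also contain "Safari/".
function detectBrowser(userAgent: string): BrowserName {
  if (/Edg(A|iOS)?\//.test(userAgent)) return 'Edge';
  if (/Firefox\//.test(userAgent)) return 'Firefox';
  if (/Chrome\//.test(userAgent)) return 'Chrome';
  if (/Safari\//.test(userAgent)) return 'Safari';
  throw new Error(`Unsupported browser in UA: ${userAgent}`);
}

function tlsBinaryFor(userAgent: string): string {
  return TLS_BINARY[detectBrowser(userAgent)];
}
```

Throwing on an unrecognized UA, rather than defaulting to one binary, follows the same "stop and investigate" stance the document takes for UA generation failures.
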
### Dynamic Referer

Set Referer to the dispensary's actual page URL:

```
Crawling "harvest-of-tempe" → Referer: https://dutchie.com/dispensary/harvest-of-tempe
Crawling "zen-leaf-mesa"    → Referer: https://dutchie.com/dispensary/zen-leaf-mesa
```

Derived from dispensary's `menu_url` field.

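
A possible derivation is sketched below. It assumes that `menu_url` contains a `/dispensary/<slug>` path segment, which this document does not confirm; `buildReferer` is a hypothetical helper, and other `menu_url` shapes would need their own handling.

```typescript
const DUTCHIE_ORIGIN = 'https://dutchie.com';

// ASSUMPTION: menu_url looks like https://dutchie.com/dispensary/<slug>[/...].
function buildReferer(menuUrl: string): string {
  const match = /\/dispensary\/([^/?#]+)/.exec(menuUrl);
  if (match === null) {
    throw new Error(`Cannot derive Referer from menu_url: ${menuUrl}`);
  }
  // Drop any trailing path, query string, or fragment so the Referer is the store page itself.
  return `${DUTCHIE_ORIGIN}/dispensary/${match[1]}`;
}
```

For example, `buildReferer('https://dutchie.com/dispensary/harvest-of-tempe/products')` yields `https://dutchie.com/dispensary/harvest-of-tempe`.
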
### Natural Randomization

Per-session randomization (set once when session starts, consistent for session):

| Feature | Distribution | Implementation |
|---------|--------------|----------------|
| DNT header | 30% have it | `Math.random() < 0.30` |
| Accept quality values | Slight variation | `q=0.9` vs `q=0.8` |

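
The "set once, consistent for session" rule can be sketched by rolling the optional traits a single time and storing the result on the session. `SessionTraits`, `rollSessionTraits`, and `applyTraits` are hypothetical names. The quality-value variation is left out because the document gives no distribution for it.

```typescript
// Hypothetical per-session traits, rolled once at session start.
interface SessionTraits {
  sendDnt: boolean;
}

const DNT_PROBABILITY = 0.30;

// `rng` is injectable so the roll can be made deterministic in tests.
function rollSessionTraits(rng: () => number = Math.random): SessionTraits {
  return { sendDnt: rng() < DNT_PROBABILITY };
}

// Every request in the session reuses the stored traits instead of re-rolling,
// so the DNT header never flips between requests from the same identity.
function applyTraits(
  headers: Record<string, string>,
  traits: SessionTraits,
): Record<string, string> {
  return traits.sendDnt ? { ...headers, DNT: '1' } : { ...headers };
}
```

Re-rolling per request would make DNT appear and disappear within one session, which is itself a detectable pattern.
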
### Implementation Files

| File | Purpose |
|------|---------|
| `src/services/crawl-rotator.ts` | `BrowserFingerprint` includes full header config |
| `src/platforms/dutchie/client.ts` | Build headers from fingerprint, use curl-impersonate |
| `src/services/http-fingerprint.ts` | Header ordering per browser (NEW) |