Workflow Documentation - December 10, 2025
Purpose
This document captures the intended behavior for the CannaiQ crawl system, specifically around proxy rotation, fingerprinting, and anti-detection.
Stealth & Anti-Detection Requirements
1. Task Determines Work, Proxy Determines Identity
The task payload contains:
- dispensary_id - which store to crawl
- role - what type of work (product_resync, entry_point_discovery, etc.)
The proxy determines the session identity:
- Proxy location (city, state, timezone) → sets Accept-Language and timezone headers
- Language is always English (en-US)
Flow:
Task claimed
│
└─► Get proxy from rotation
│
└─► Proxy has location (city, state, timezone)
│
└─► Build headers using proxy's timezone
- Accept-Language: en-US,en;q=0.9
- Timezone-consistent behavior
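The flow above can be sketched as a small pure function. This is a minimal illustration, not the code in client.ts: the ProxyLocation and SessionIdentity shapes and the buildSessionIdentity name are assumptions made for the example.

```typescript
// Hypothetical shapes for illustration; the real proxy type lives in crawl-rotator.ts.
interface ProxyLocation {
  city: string;
  state: string;
  timezone: string; // IANA name, e.g. "America/Phoenix"
}

interface SessionIdentity {
  timezone: string;
  headers: Record<string, string>;
}

// Language is fixed by the spec; only the timezone comes from the proxy.
const ACCEPT_LANGUAGE = "en-US,en;q=0.9";

function buildSessionIdentity(location: ProxyLocation): SessionIdentity {
  return {
    timezone: location.timezone,
    headers: { "Accept-Language": ACCEPT_LANGUAGE },
  };
}

// The task payload never appears here: identity depends on the proxy alone.
const identity = buildSessionIdentity({
  city: "Phoenix",
  state: "AZ",
  timezone: "America/Phoenix",
});
console.log(identity.timezone, identity.headers["Accept-Language"]);
```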
2. On 403 Block - Immediate Backoff
When a 403 is received:
- Immediately stop using current IP
- Get a new proxy (new IP)
- Get a new UA/fingerprint
- Retry the request
Per-proxy failure tracking:
- Track UA rotation attempts per proxy
- After 3 UA/fingerprint rotations on the same proxy → disable that proxy
- This means: if we rotate UA 3 times and still get 403, the proxy is burned
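A minimal sketch of the 403 handling described above, with the fetch and the rotator injected so the loop can be read (and tested) in isolation. The Rotator interface, the fetchWithRotation name, and the maxAttempts cap are assumptions for illustration; the 500 ms pause matches the fixed backoff noted under Implementation Status.

```typescript
// Hypothetical interfaces; names mirror the spec but are not the actual crawl-rotator API.
interface Rotator {
  recordBlock(): void;   // count a 403 against the current proxy
  rotateBoth(): boolean; // new proxy and new fingerprint; false when no proxies remain
}

interface FetchResult {
  status: number;
  body: string;
}

const BACKOFF_MS = 500; // fixed pause after each rotation

async function fetchWithRotation(
  doFetch: () => Promise<FetchResult>,
  rotator: Rotator,
  maxAttempts: number = 5, // assumed cap, not from the spec
  sleep: (ms: number) => Promise<void> = (ms) =>
    new Promise<void>((resolve) => setTimeout(resolve, ms)),
): Promise<FetchResult | null> {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const result = await doFetch();
    if (result.status !== 403) return result; // other statuses are left to the caller
    rotator.recordBlock();                     // stop using this IP immediately
    if (!rotator.rotateBoth()) return null;    // pool exhausted
    await sleep(BACKOFF_MS);
  }
  return null;
}
```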
3. Fingerprint Rotation Rules
Each request uses:
- Proxy (IP)
- User-Agent
- sec-ch-ua headers (Client Hints)
- Accept-Language (from proxy location)
On 403:
- Record failure on current proxy
- Rotate to new proxy
- Pick new random fingerprint
- If same proxy fails 3 times with different fingerprints → disable proxy
4. Proxy Table Schema
    CREATE TABLE proxies (
        id SERIAL PRIMARY KEY,
        host VARCHAR(255) NOT NULL,
        port INTEGER NOT NULL,
        username VARCHAR(100),
        password VARCHAR(100),
        protocol VARCHAR(10) DEFAULT 'http',
        active BOOLEAN DEFAULT true,
        -- Location (determines session headers)
        city VARCHAR(100),
        state VARCHAR(50),
        country VARCHAR(100),
        country_code VARCHAR(10),
        timezone VARCHAR(50),
        -- Health tracking
        failure_count INTEGER DEFAULT 0,
        consecutive_403_count INTEGER DEFAULT 0, -- Track 403s specifically
        last_used_at TIMESTAMPTZ,
        last_failure_at TIMESTAMPTZ,
        last_error TEXT,
        -- Performance
        response_time_ms INTEGER,
        max_connections INTEGER DEFAULT 1
    );
5. Failure Threshold
- 3 consecutive 403s with different fingerprints → disable proxy
- Reset consecutive_403_count to 0 on successful request
- General failure_count tracks all errors (timeouts, connection errors, etc.)
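The counter rules above can be expressed as three pure state transitions. This is a sketch under assumptions: the ProxyHealth shape is hypothetical, and counting a 403 toward failureCount as well is an interpretation of "tracks all errors", not something the spec states outright.

```typescript
// Hypothetical in-memory view of the health columns in the proxies table.
interface ProxyHealth {
  active: boolean;
  failureCount: number;
  consecutive403Count: number;
}

const MAX_CONSECUTIVE_403 = 3;

// A 403: bump both counters and disable the proxy once the threshold is reached.
function markBlocked(p: ProxyHealth): ProxyHealth {
  const consecutive403Count = p.consecutive403Count + 1;
  return {
    active: p.active && consecutive403Count < MAX_CONSECUTIVE_403,
    failureCount: p.failureCount + 1, // assumption: 403s also count as general failures
    consecutive403Count,
  };
}

// Timeouts, connection errors, etc.: only the general counter moves.
function markFailure(p: ProxyHealth): ProxyHealth {
  return { ...p, failureCount: p.failureCount + 1 };
}

// Any successful request clears the 403 streak.
function markSuccess(p: ProxyHealth): ProxyHealth {
  return { ...p, consecutive403Count: 0 };
}
```

Because a success resets the streak, a proxy that sees 403, 403, success, 403 stays active; only three blocks in a row disable it.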
Implementation Status
COMPLETED - December 10, 2025
All code changes have been implemented per this specification:
1. crawl-rotator.ts ✅
- Added consecutive403Count to Proxy interface
- Added markBlocked() method that increments consecutive_403_count and disables proxy at 3
- Added getProxyTimezone() to return current proxy's timezone
- markSuccess() now resets consecutive_403_count to 0
- Replaced hardcoded UA list with intoli/user-agents library for realistic fingerprints
- BrowserFingerprint interface includes full fingerprint data (UA, platform, screen size, viewport, sec-ch-ua headers)
2. client.ts ✅
- startSession() no longer takes state/timezone params
- startSession() gets identity from proxy via crawlRotator.getProxyLocation()
- Added handle403Block() that:
  - Calls crawlRotator.recordBlock() (tracks consecutive 403s)
  - Immediately rotates both proxy and fingerprint via rotateBoth()
  - Returns false if no more proxies available
- executeGraphQL() calls handle403Block() on 403 (not rotateProxyOn403)
- fetchPage() uses same 403 handling
- 500ms backoff after rotation (not linear delay)
3. Task Handlers ✅
- entry-point-discovery.ts: startSession() called with no params
- product-refresh.ts: startSession() called with no params
4. Dependencies ✅
- Added user-agents npm package for realistic UA generation
Files Changed
| File | Changes |
|---|---|
| backend/src/services/crawl-rotator.ts | Complete rewrite with consecutive403Count, markBlocked(), intoli/user-agents |
| backend/src/platforms/dutchie/client.ts | startSession() uses proxy location, handle403Block() for 403 handling |
| backend/src/tasks/handlers/entry-point-discovery.ts | startSession() no params |
| backend/src/tasks/handlers/product-refresh.ts | startSession() no params |
| backend/package.json | Added user-agents dependency |
Migration Required
The proxies table needs a consecutive_403_count column if one is not already present:

    ALTER TABLE proxies ADD COLUMN IF NOT EXISTS consecutive_403_count INTEGER DEFAULT 0;
Key Behaviors Summary
| Behavior | Implementation |
|---|---|
| Session identity | From proxy location (getProxyLocation()) |
| Language | Always en-US,en;q=0.9 |
| 403 handling | handle403Block() → recordBlock() → rotateBoth() |
| Proxy disable | After 3 consecutive 403s (consecutive403Count >= 3) |
| Success reset | markSuccess() resets consecutive403Count to 0 |
| UA generation | intoli/user-agents library (daily updated, realistic fingerprints) |
| Fingerprint data | Full: UA, platform, screen size, viewport, sec-ch-ua headers |
User-Agent Generation
Data Source
The intoli/user-agents npm library provides daily-updated market share data collected from Intoli's residential proxy network (millions of real users). The package auto-releases new versions daily to npm.
Device Category Distribution (hardcoded)
| Category | Share |
|---|---|
| Mobile | 62% |
| Desktop | 36% |
| Tablet | 2% |
Browser Filter (whitelist only)
Only these browsers are allowed:
- Chrome (67%)
- Safari (20%)
- Edge (6%)
- Firefox (3%)
Samsung Internet, Opera, and other niche browsers are filtered out.
Desktop OS Distribution (from library)
| OS | Share |
|---|---|
| Windows | 72% |
| macOS | 17% |
| Linux | 4% |
UA Lifecycle
- Session start (new proxy IP obtained) → Roll device category (62/36/2) → Generate UA filtered to device + top 4 browsers → Store on session
- UA sticks until IP rotates (403 block or manual rotation)
- IP rotation triggers new UA generation
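The lifecycle above can be sketched as a weighted roll followed by a one-time UA assignment. The CrawlSession shape, the startSession signature shown here, and the injected generateUa callback are assumptions for the example; in the real code the UA would come from the user-agents library.

```typescript
type DeviceCategory = "mobile" | "desktop" | "tablet";

// Weights from the device category table; they sum to 100.
const DEVICE_WEIGHTS: Record<DeviceCategory, number> = { mobile: 62, desktop: 36, tablet: 2 };

// rand is injectable so the roll can be checked deterministically.
function rollDeviceCategory(rand: () => number = Math.random): DeviceCategory {
  const categories = Object.keys(DEVICE_WEIGHTS) as DeviceCategory[];
  const total = categories.reduce((sum, c) => sum + DEVICE_WEIGHTS[c], 0);
  let roll = rand() * total;
  for (const category of categories) {
    if (roll < DEVICE_WEIGHTS[category]) return category;
    roll -= DEVICE_WEIGHTS[category];
  }
  return "tablet"; // only reachable through floating-point rounding
}

// Hypothetical session holder: the UA is set once and reused until the IP rotates.
interface CrawlSession {
  proxyIp: string;
  deviceCategory: DeviceCategory;
  userAgent: string;
}

function startSession(
  proxyIp: string,
  generateUa: (category: DeviceCategory) => string, // stand-in for the UA library call
  rand: () => number = Math.random,
): CrawlSession {
  const deviceCategory = rollDeviceCategory(rand);
  return { proxyIp, deviceCategory, userAgent: generateUa(deviceCategory) };
}
```

On IP rotation the caller would simply call startSession again with the new proxy IP, which rolls a fresh category and UA.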
Failure Handling
- If UA generation fails → Alert admin dashboard, stop crawl immediately
- No fallback to static UA list
- This forces investigation rather than silent degradation
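One way to enforce the no-fallback rule is to wrap UA generation so that any failure alerts and then rethrows. A minimal sketch: UaGenerationError, generateUaOrHalt, and the alertAdmin callback are hypothetical names, and how the alert reaches the admin dashboard is left out.

```typescript
class UaGenerationError extends Error {
  constructor(message: string) {
    super(message);
    this.name = "UaGenerationError";
  }
}

function generateUaOrHalt(
  generate: () => string,                // stand-in for the UA library call
  alertAdmin: (message: string) => void, // stand-in for the dashboard alert
): string {
  try {
    const ua = generate();
    if (ua.trim() === "") throw new Error("generator returned an empty UA");
    return ua;
  } catch (err) {
    const reason = err instanceof Error ? err.message : String(err);
    alertAdmin("UA generation failed: " + reason);
    // No static fallback: the error propagates and the crawl stops here.
    throw new UaGenerationError(reason);
  }
}
```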
Session Logging
Each session logs:
- Device category (mobile/desktop/tablet)
- Full UA string
- Browser name (Chrome/Safari/Edge/Firefox)
- IP address (from proxy)
- Session start timestamp
Logs are rotated monthly.
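A sketch of what one log record and a monthly file name could look like. The field names, the JSON-lines format, and the sessions-YYYY-MM.log naming are all assumptions; the spec only lists which values are logged and that rotation is monthly.

```typescript
// Hypothetical record shape covering the five logged values.
interface SessionLogEntry {
  deviceCategory: "mobile" | "desktop" | "tablet";
  userAgent: string;
  browser: "Chrome" | "Safari" | "Edge" | "Firefox";
  ip: string;
  startedAt: string; // ISO 8601 timestamp
}

// One JSON object per line keeps the log easy to grep and parse.
function formatSessionLog(entry: SessionLogEntry): string {
  return JSON.stringify(entry);
}

// Monthly rotation: one file per calendar month (UTC).
function logFileFor(date: Date): string {
  const m = date.getUTCMonth() + 1;
  const month = m < 10 ? "0" + m : String(m);
  return "sessions-" + date.getUTCFullYear() + "-" + month + ".log";
}
```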
Implementation
Located in backend/src/services/crawl-rotator.ts:
    // Per workflow-12102025.md: Device category distribution
    const DEVICE_WEIGHTS = { mobile: 62, desktop: 36, tablet: 2 };
    // Per workflow-12102025.md: Browser whitelist
    const ALLOWED_BROWSERS = ['Chrome', 'Safari', 'Edge', 'Firefox'];
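To illustrate how the whitelist might be applied, here is a sketch that classifies a UA string by its product tokens and rejects anything outside the four allowed browsers. The detectBrowser and isAllowedUa helpers are hypothetical, and token matching is a heuristic, not necessarily how crawl-rotator.ts filters.

```typescript
type BrowserName = "Chrome" | "Safari" | "Edge" | "Firefox";

const ALLOWED_BROWSERS: BrowserName[] = ["Chrome", "Safari", "Edge", "Firefox"];

// Order matters: Edge, Opera, and Samsung Internet UAs also contain "Chrome/",
// and Chrome UAs also contain "Safari/", so the more specific tokens go first.
function detectBrowser(ua: string): BrowserName | null {
  if (/SamsungBrowser\/|OPR\/|Opera/.test(ua)) return null; // filtered out
  if (/Edg(e|A|iOS)?\//.test(ua)) return "Edge";
  if (/Firefox\/|FxiOS\//.test(ua)) return "Firefox";
  if (/Chrome\/|CriOS\//.test(ua)) return "Chrome";
  if (/Safari\//.test(ua) && /Version\//.test(ua)) return "Safari";
  return null;
}

function isAllowedUa(ua: string): boolean {
  const browser = detectBrowser(ua);
  return browser !== null && ALLOWED_BROWSERS.indexOf(browser) !== -1;
}
```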
HTTP Fingerprinting
Goal
Make HTTP requests indistinguishable from real browser traffic. No repeatable footprint.
Components
- Full Header Set - All headers a real browser sends
- Header Ordering - Browser-specific order (Chrome vs Firefox vs Safari)
- TLS Fingerprint - Use curl-impersonate to match browser TLS signature
- Dynamic Referer - Set per dispensary being crawled
- Natural Randomization - Vary optional headers like real users
Required Headers
| Header | Chrome | Firefox | Safari | Notes |
|---|---|---|---|---|
| User-Agent | ✅ | ✅ | ✅ | From UA generation |
| Accept | ✅ | ✅ | ✅ | Content types |
| Accept-Language | ✅ | ✅ | ✅ | Always en-US,en;q=0.9 |
| Accept-Encoding | ✅ | ✅ | ✅ | gzip, deflate, br |
| Connection | ✅ | ✅ | ✅ | keep-alive |
| Origin | ✅ | ✅ | ✅ | https://dutchie.com (POST only) |
| Referer | ✅ | ✅ | ✅ | Dynamic per dispensary |
| sec-ch-ua | ✅ | ❌ | ❌ | Chromium only |
| sec-ch-ua-mobile | ✅ | ❌ | ❌ | Chromium only |
| sec-ch-ua-platform | ✅ | ❌ | ❌ | Chromium only |
| sec-fetch-dest | ✅ | ✅ | ❌ | empty for XHR |
| sec-fetch-mode | ✅ | ✅ | ❌ | cors for XHR |
| sec-fetch-site | ✅ | ✅ | ❌ | same-origin |
| Upgrade-Insecure-Requests | ✅ | ✅ | ✅ | 1 (page loads only) |
| DNT | ~30% | ~30% | ~30% | Randomized per session |
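The table translates fairly directly into a header builder. This is a sketch, not the code in client.ts: the HeaderOptions shape and buildHeaders name are hypothetical, the Accept value of */* is an assumed placeholder, and sec-fetch values for page loads are omitted because the table only specifies them for XHR.

```typescript
type Browser = "Chrome" | "Firefox" | "Safari";

interface HeaderOptions {
  browser: Browser;
  userAgent: string;
  referer: string;
  kind: "xhr" | "page"; // GraphQL POST vs. page load
  sendDnt: boolean;     // rolled once per session
  clientHints?: { ua: string; mobile: string; platform: string }; // Chromium only
}

function buildHeaders(o: HeaderOptions): Record<string, string> {
  const h: Record<string, string> = {
    "User-Agent": o.userAgent,
    "Accept": "*/*", // assumed value; the table only says "content types"
    "Accept-Language": "en-US,en;q=0.9",
    "Accept-Encoding": "gzip, deflate, br",
    "Connection": "keep-alive",
    "Referer": o.referer,
  };
  if (o.kind === "xhr") {
    h["Origin"] = "https://dutchie.com"; // POST only
    if (o.browser !== "Safari") {
      h["sec-fetch-dest"] = "empty";
      h["sec-fetch-mode"] = "cors";
      h["sec-fetch-site"] = "same-origin";
    }
  } else {
    h["Upgrade-Insecure-Requests"] = "1"; // page loads only
  }
  if (o.browser === "Chrome" && o.clientHints) {
    h["sec-ch-ua"] = o.clientHints.ua;
    h["sec-ch-ua-mobile"] = o.clientHints.mobile;
    h["sec-ch-ua-platform"] = o.clientHints.platform;
  }
  if (o.sendDnt) h["DNT"] = "1";
  return h;
}
```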
Header Ordering
Each browser sends headers in a specific order. Fingerprinting services detect mismatches.
Chrome order (GraphQL request):
- Host
- Connection
- Content-Length (POST)
- sec-ch-ua
- DNT (if enabled)
- sec-ch-ua-mobile
- User-Agent
- sec-ch-ua-platform
- Content-Type (POST)
- Accept
- Origin (POST)
- sec-fetch-site
- sec-fetch-mode
- sec-fetch-dest
- Referer
- Accept-Encoding
- Accept-Language
Firefox order (GraphQL request):
- Host
- User-Agent
- Accept
- Accept-Language
- Accept-Encoding
- Content-Type (POST)
- Content-Length (POST)
- Origin (POST)
- DNT (if enabled)
- Connection
- Referer
- sec-fetch-dest
- sec-fetch-mode
- sec-fetch-site
Safari order (GraphQL request):
- Host
- Connection
- Content-Length (POST)
- Accept
- User-Agent
- Content-Type (POST)
- Origin (POST)
- Referer
- Accept-Encoding
- Accept-Language
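The three orderings above can be stored as data and applied to any header map. The following is a sketch of what http-fingerprint.ts might do, not its actual contents: HEADER_ORDER and orderHeaders are hypothetical names, and appending unlisted headers at the end is an assumption.

```typescript
type Browser = "Chrome" | "Firefox" | "Safari";

// Orders copied from the lists above (GraphQL request).
const HEADER_ORDER: Record<Browser, string[]> = {
  Chrome: [
    "Host", "Connection", "Content-Length", "sec-ch-ua", "DNT", "sec-ch-ua-mobile",
    "User-Agent", "sec-ch-ua-platform", "Content-Type", "Accept", "Origin",
    "sec-fetch-site", "sec-fetch-mode", "sec-fetch-dest", "Referer",
    "Accept-Encoding", "Accept-Language",
  ],
  Firefox: [
    "Host", "User-Agent", "Accept", "Accept-Language", "Accept-Encoding",
    "Content-Type", "Content-Length", "Origin", "DNT", "Connection", "Referer",
    "sec-fetch-dest", "sec-fetch-mode", "sec-fetch-site",
  ],
  Safari: [
    "Host", "Connection", "Content-Length", "Accept", "User-Agent",
    "Content-Type", "Origin", "Referer", "Accept-Encoding", "Accept-Language",
  ],
};

// Returns [name, value] pairs in the browser's order; absent headers are skipped.
function orderHeaders(browser: Browser, headers: Record<string, string>): Array<[string, string]> {
  // Index by lowercase name so lookups are case-insensitive.
  const remaining: Record<string, string> = {};
  for (const name of Object.keys(headers)) {
    const value = headers[name];
    if (value !== undefined) remaining[name.toLowerCase()] = value;
  }
  const ordered: Array<[string, string]> = [];
  for (const name of HEADER_ORDER[browser]) {
    const value = remaining[name.toLowerCase()];
    if (value !== undefined) {
      ordered.push([name, value]);
      delete remaining[name.toLowerCase()];
    }
  }
  // Headers outside the list go last, with lowercased names (an assumption).
  for (const key of Object.keys(remaining)) {
    const value = remaining[key];
    if (value !== undefined) ordered.push([key, value]);
  }
  return ordered;
}
```

The ordered pairs only help if the HTTP client preserves insertion order when writing the request, which is worth verifying for whichever client sends them.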
TLS Fingerprinting
Use curl-impersonate instead of standard curl:
- curl_chrome131 - Mimics Chrome 131 TLS handshake
- curl_ff133 - Mimics Firefox 133 TLS handshake
- curl_safari17 - Mimics Safari 17 TLS handshake
Match TLS binary to browser in UA.
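Matching the binary to the UA can be a simple lookup. In this sketch the binary names are the ones listed above; mapping Edge to the Chrome binary is an assumption (Edge is Chromium-based, but the spec does not say which binary to use), and buildCurlCommand is a hypothetical helper that relies only on the standard curl flags -x (proxy) and -H (header).

```typescript
type UaBrowser = "Chrome" | "Edge" | "Firefox" | "Safari";

const TLS_BINARY: Record<UaBrowser, string> = {
  Chrome: "curl_chrome131",
  Edge: "curl_chrome131", // assumption: reuse the Chrome handshake for Edge
  Firefox: "curl_ff133",
  Safari: "curl_safari17",
};

// Builds an argv array; headers are passed in their final, browser-specific order.
function buildCurlCommand(
  browser: UaBrowser,
  url: string,
  headers: Array<[string, string]>,
  proxyUrl?: string,
): string[] {
  const args = [TLS_BINARY[browser]];
  if (proxyUrl) args.push("-x", proxyUrl);
  for (const [name, value] of headers) args.push("-H", name + ": " + value);
  args.push(url);
  return args;
}

// Placeholder URL and proxy address, for illustration only.
console.log(
  buildCurlCommand("Firefox", "https://dutchie.com/", [["Accept", "*/*"]], "http://127.0.0.1:8080"),
);
```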
Dynamic Referer
Set Referer to the dispensary's actual page URL:
Crawling "harvest-of-tempe" → Referer: https://dutchie.com/dispensary/harvest-of-tempe
Crawling "zen-leaf-mesa" → Referer: https://dutchie.com/dispensary/zen-leaf-mesa
Derived from dispensary's menu_url field.
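A sketch of the derivation, assuming menu_url contains a /dispensary/<slug> path segment as in the examples above. The refererFromMenuUrl name is hypothetical, and menu URLs in other shapes would need their own handling.

```typescript
const DUTCHIE_ORIGIN = "https://dutchie.com";

// Returns the Referer for a dispensary, or null when no slug can be found.
function refererFromMenuUrl(menuUrl: string): string | null {
  // Capture the path segment right after /dispensary/, stopping at /, ?, or #.
  const match = /\/dispensary\/([^/?#]+)/.exec(menuUrl);
  if (match === null) return null;
  return DUTCHIE_ORIGIN + "/dispensary/" + match[1];
}

console.log(refererFromMenuUrl("https://dutchie.com/dispensary/harvest-of-tempe"));
```

Returning null instead of a default keeps a malformed menu_url from silently producing a Referer that points at the wrong store.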
Natural Randomization
Per-session randomization (set once when session starts, consistent for session):
| Feature | Distribution | Implementation |
|---|---|---|
| DNT header | 30% have it | Math.random() < 0.30 |
| Accept quality values | Slight variation | q=0.9 vs q=0.8 |
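Both rows can be rolled once at session start and stored with the session. A minimal sketch: the SessionTraits shape and rollSessionTraits name are hypothetical, and the 50/50 split between q=0.9 and q=0.8 is an assumption, since the table only says "slight variation".

```typescript
interface SessionTraits {
  sendDnt: boolean;
  acceptQuality: "0.9" | "0.8";
}

const DNT_RATE = 0.30;

// rand is injectable so the roll can be checked deterministically.
function rollSessionTraits(rand: () => number = Math.random): SessionTraits {
  const sendDnt = rand() < DNT_RATE;
  const acceptQuality = rand() < 0.5 ? "0.9" : "0.8"; // assumed 50/50 split
  return { sendDnt, acceptQuality };
}

// Rolled once; every request in the session reuses the same traits.
const sessionTraits = rollSessionTraits();
console.log(sessionTraits);
```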
Implementation Files
| File | Purpose |
|---|---|
| src/services/crawl-rotator.ts | BrowserFingerprint includes full header config |
| src/platforms/dutchie/client.ts | Build headers from fingerprint, use curl-impersonate |
| src/services/http-fingerprint.ts | Header ordering per browser (NEW) |