# Workflow Documentation - December 10, 2025

## Purpose

This document captures the intended behavior for the CannaiQ crawl system, specifically around proxy rotation, fingerprinting, and anti-detection.

---

## Stealth & Anti-Detection Requirements

### 1. Task Determines Work, Proxy Determines Identity

The task payload contains:
- `dispensary_id` - which store to crawl
- `role` - what type of work (product_resync, entry_point_discovery, etc.)

The **proxy** determines the session identity:
- Proxy location (city, state, timezone) → sets Accept-Language and timezone headers
- Language is always English (`en-US`)

**Flow:**
```
Task claimed
    │
    └─► Get proxy from rotation
            │
            └─► Proxy has location (city, state, timezone)
                    │
                    └─► Build headers using proxy's timezone
                        - Accept-Language: en-US,en;q=0.9
                        - Timezone-consistent behavior
```

### 2. On 403 Block - Immediate Backoff

When a 403 is received:
1. **Immediately** stop using current IP
2. Get a new proxy (new IP)
3. Get a new UA/fingerprint
4. Retry the request

**Per-proxy failure tracking:**
- Track UA rotation attempts per proxy
- After 3 UA/fingerprint rotations on the same proxy → disable that proxy
- This means: if we rotate UA 3 times and still get 403, the proxy is burned

### 3. Fingerprint Rotation Rules

Each request uses:
- Proxy (IP)
- User-Agent
- sec-ch-ua headers (Client Hints)
- Accept-Language (from proxy location)

On 403:
1. Record failure on current proxy
2. Rotate to new proxy
3. Pick new random fingerprint
4. If same proxy fails 3 times with different fingerprints → disable proxy

### 4. Proxy Table Schema

```sql
CREATE TABLE proxies (
  id SERIAL PRIMARY KEY,
  host VARCHAR(255) NOT NULL,
  port INTEGER NOT NULL,
  username VARCHAR(100),
  password VARCHAR(100),
  protocol VARCHAR(10) DEFAULT 'http',
  active BOOLEAN DEFAULT true,

  -- Location (determines session headers)
  city VARCHAR(100),
  state VARCHAR(50),
  country VARCHAR(100),
  country_code VARCHAR(10),
  timezone VARCHAR(50),

  -- Health tracking
  failure_count INTEGER DEFAULT 0,
  consecutive_403_count INTEGER DEFAULT 0,  -- Track 403s specifically
  last_used_at TIMESTAMPTZ,
  last_failure_at TIMESTAMPTZ,
  last_error TEXT,

  -- Performance
  response_time_ms INTEGER,
  max_connections INTEGER DEFAULT 1
);
```

### 5. Failure Threshold

- **3 consecutive 403s** with different fingerprints → disable proxy
- Reset `consecutive_403_count` to 0 on successful request
- General `failure_count` tracks all errors (timeouts, connection errors, etc.)

---

## Implementation Status

### COMPLETED - December 10, 2025

All code changes have been implemented per this specification:

#### 1. crawl-rotator.ts ✅

- [x] Added `consecutive403Count` to Proxy interface
- [x] Added `markBlocked()` method that increments `consecutive_403_count` and disables proxy at 3
- [x] Added `getProxyTimezone()` to return current proxy's timezone
- [x] `markSuccess()` now resets `consecutive_403_count` to 0
- [x] Replaced hardcoded UA list with `intoli/user-agents` library for realistic fingerprints
- [x] `BrowserFingerprint` interface includes full fingerprint data (UA, platform, screen size, viewport, sec-ch-ua headers)

#### 2. client.ts ✅

- [x] `startSession()` no longer takes state/timezone params
- [x] `startSession()` gets identity from proxy via `crawlRotator.getProxyLocation()`
- [x] Added `handle403Block()` that:
  - Calls `crawlRotator.recordBlock()` (tracks consecutive 403s)
  - Immediately rotates both proxy and fingerprint via `rotateBoth()`
  - Returns false if no more proxies available
- [x] `executeGraphQL()` calls `handle403Block()` on 403 (not `rotateProxyOn403`)
- [x] `fetchPage()` uses same 403 handling
- [x] 500ms backoff after rotation (not linear delay)

#### 3. Task Handlers ✅

- [x] `entry-point-discovery.ts`: `startSession()` called with no params
- [x] `product-refresh.ts`: `startSession()` called with no params

#### 4. Dependencies ✅

- [x] Added `user-agents` npm package for realistic UA generation

---

## Files Changed

| File | Changes |
|------|---------|
| `backend/src/services/crawl-rotator.ts` | Complete rewrite with `consecutive403Count`, `markBlocked()`, `intoli/user-agents` |
| `backend/src/platforms/dutchie/client.ts` | `startSession()` uses proxy location, `handle403Block()` for 403 handling |
| `backend/src/tasks/handlers/entry-point-discovery.ts` | `startSession()` no params |
| `backend/src/tasks/handlers/product-refresh.ts` | `startSession()` no params |
| `backend/package.json` | Added `user-agents` dependency |

---

## Migration Required

The `proxies` table needs `consecutive_403_count` column if not already present:

```sql
ALTER TABLE proxies ADD COLUMN IF NOT EXISTS consecutive_403_count INTEGER DEFAULT 0;
```

---

## Key Behaviors Summary

| Behavior | Implementation |
|----------|----------------|
| Session identity | From proxy location (`getProxyLocation()`) |
| Language | Always `en-US,en;q=0.9` |
| 403 handling | `handle403Block()` → `recordBlock()` → `rotateBoth()` |
| Proxy disable | After 3 consecutive 403s (`consecutive403Count >= 3`) |
| Success reset | `markSuccess()` resets `consecutive403Count` to 0 |
| UA generation | `intoli/user-agents` library (daily updated, realistic fingerprints) |
| Fingerprint data | Full: UA, platform, screen size, viewport, sec-ch-ua headers |

---

## User-Agent Generation

### Data Source

The `intoli/user-agents` npm library provides daily-updated market share data collected from Intoli's residential proxy network (millions of real users). The package auto-releases new versions daily to npm.

### Device Category Distribution (hardcoded)

| Category | Share |
|----------|-------|
| Mobile | 62% |
| Desktop | 36% |
| Tablet | 2% |

### Browser Filter (whitelist only)

Only these browsers are allowed:
- Chrome (67%)
- Safari (20%)
- Edge (6%)
- Firefox (3%)

Samsung Internet, Opera, and other niche browsers are filtered out.

### Desktop OS Distribution (from library)

| OS | Share |
|----|-------|
| Windows | 72% |
| macOS | 17% |
| Linux | 4% |

### UA Lifecycle

1. **Session start** (new proxy IP obtained)
   → Roll device category (62/36/2)
   → Generate UA filtered to device + top 4 browsers
   → Store on session
2. **UA sticks** until IP rotates (403 block or manual rotation)
3. **IP rotation** triggers new UA generation

### Failure Handling

- If UA generation fails → Alert admin dashboard, **stop crawl immediately**
- No fallback to static UA list
- This forces investigation rather than silent degradation

### Session Logging

Each session logs:
- Device category (mobile/desktop/tablet)
- Full UA string
- Browser name (Chrome/Safari/Edge/Firefox)
- IP address (from proxy)
- Session start timestamp

Logs are rotated monthly.

### Implementation

Located in `backend/src/services/crawl-rotator.ts`:

```typescript
// Per workflow-12102025.md: Device category distribution
const DEVICE_WEIGHTS = { mobile: 62, desktop: 36, tablet: 2 };

// Per workflow-12102025.md: Browser whitelist
const ALLOWED_BROWSERS = ['Chrome', 'Safari', 'Edge', 'Firefox'];
```

---

## HTTP Fingerprinting

### Goal

Make HTTP requests indistinguishable from real browser traffic. No repeatable footprint.

### Components

1. **Full Header Set** - All headers a real browser sends
2. **Header Ordering** - Browser-specific order (Chrome vs Firefox vs Safari)
3. **TLS Fingerprint** - Use `curl-impersonate` to match browser TLS signature
4. **Dynamic Referer** - Set per dispensary being crawled
5. **Natural Randomization** - Vary optional headers like real users

### Required Headers

| Header | Chrome | Firefox | Safari | Notes |
|--------|--------|---------|--------|-------|
| `User-Agent` | ✅ | ✅ | ✅ | From UA generation |
| `Accept` | ✅ | ✅ | ✅ | Content types |
| `Accept-Language` | ✅ | ✅ | ✅ | Always `en-US,en;q=0.9` |
| `Accept-Encoding` | ✅ | ✅ | ✅ | `gzip, deflate, br` |
| `Connection` | ✅ | ✅ | ✅ | `keep-alive` |
| `Origin` | ✅ | ✅ | ✅ | `https://dutchie.com` (POST only) |
| `Referer` | ✅ | ✅ | ✅ | Dynamic per dispensary |
| `sec-ch-ua` | ✅ | ❌ | ❌ | Chromium only |
| `sec-ch-ua-mobile` | ✅ | ❌ | ❌ | Chromium only |
| `sec-ch-ua-platform` | ✅ | ❌ | ❌ | Chromium only |
| `sec-fetch-dest` | ✅ | ✅ | ❌ | `empty` for XHR |
| `sec-fetch-mode` | ✅ | ✅ | ❌ | `cors` for XHR |
| `sec-fetch-site` | ✅ | ✅ | ❌ | `same-origin` |
| `Upgrade-Insecure-Requests` | ✅ | ✅ | ✅ | `1` (page loads only) |
| `DNT` | ~30% | ~30% | ~30% | Randomized per session |

### Header Ordering

Each browser sends headers in a specific order. Fingerprinting services detect mismatches.

**Chrome order (GraphQL request):**
1. Host
2. Connection
3. Content-Length (POST)
4. sec-ch-ua
5. DNT (if enabled)
6. sec-ch-ua-mobile
7. User-Agent
8. sec-ch-ua-platform
9. Content-Type (POST)
10. Accept
11. Origin (POST)
12. sec-fetch-site
13. sec-fetch-mode
14. sec-fetch-dest
15. Referer
16. Accept-Encoding
17. Accept-Language

**Firefox order (GraphQL request):**
1. Host
2. User-Agent
3. Accept
4. Accept-Language
5. Accept-Encoding
6. Content-Type (POST)
7. Content-Length (POST)
8. Origin (POST)
9. DNT (if enabled)
10. Connection
11. Referer
12. sec-fetch-dest
13. sec-fetch-mode
14. sec-fetch-site

**Safari order (GraphQL request):**
1. Host
2. Connection
3. Content-Length (POST)
4. Accept
5. User-Agent
6. Content-Type (POST)
7. Origin (POST)
8. Referer
9. Accept-Encoding
10. Accept-Language

### TLS Fingerprinting

Use `curl-impersonate` instead of standard curl:
- `curl_chrome131` - Mimics Chrome 131 TLS handshake
- `curl_ff133` - Mimics Firefox 133 TLS handshake
- `curl_safari17` - Mimics Safari 17 TLS handshake

Match TLS binary to browser in UA.

### Dynamic Referer

Set Referer to the dispensary's actual page URL:

```
Crawling "harvest-of-tempe" → Referer: https://dutchie.com/dispensary/harvest-of-tempe
Crawling "zen-leaf-mesa" → Referer: https://dutchie.com/dispensary/zen-leaf-mesa
```

Derived from dispensary's `menu_url` field.

### Natural Randomization

Per-session randomization (set once when session starts, consistent for session):

| Feature | Distribution | Implementation |
|---------|--------------|----------------|
| DNT header | 30% have it | `Math.random() < 0.30` |
| Accept quality values | Slight variation | `q=0.9` vs `q=0.8` |

### Implementation Files

| File | Purpose |
|------|---------|
| `src/services/crawl-rotator.ts` | `BrowserFingerprint` includes full header config |
| `src/platforms/dutchie/client.ts` | Build headers from fingerprint, use curl-impersonate |
| `src/services/http-fingerprint.ts` | Header ordering per browser (NEW) |
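
The per-browser ordering from the Header Ordering section could be sketched roughly as follows. This is a minimal illustration and not the actual contents of `http-fingerprint.ts`: the names `BrowserName`, `HEADER_ORDER`, and `orderHeaders` are hypothetical, and the choice to append unlisted headers at the end is an assumption. Only the three order lists are taken from this document.

```typescript
// Hypothetical sketch of per-browser header ordering (names are illustrative).
type BrowserName = 'Chrome' | 'Firefox' | 'Safari';

// Orders copied from the "Header Ordering" section (GraphQL request).
const HEADER_ORDER: Record<BrowserName, string[]> = {
  Chrome: [
    'Host', 'Connection', 'Content-Length', 'sec-ch-ua', 'DNT',
    'sec-ch-ua-mobile', 'User-Agent', 'sec-ch-ua-platform', 'Content-Type',
    'Accept', 'Origin', 'sec-fetch-site', 'sec-fetch-mode', 'sec-fetch-dest',
    'Referer', 'Accept-Encoding', 'Accept-Language',
  ],
  Firefox: [
    'Host', 'User-Agent', 'Accept', 'Accept-Language', 'Accept-Encoding',
    'Content-Type', 'Content-Length', 'Origin', 'DNT', 'Connection',
    'Referer', 'sec-fetch-dest', 'sec-fetch-mode', 'sec-fetch-site',
  ],
  Safari: [
    'Host', 'Connection', 'Content-Length', 'Accept', 'User-Agent',
    'Content-Type', 'Origin', 'Referer', 'Accept-Encoding', 'Accept-Language',
  ],
};

// Returns [name, value] pairs sorted into the given browser's order.
// Header names are matched case-insensitively and emitted with the casing
// used in HEADER_ORDER. Headers present in the input but absent from the
// order list are skipped in the first pass.
function orderHeaders(
  browser: BrowserName,
  headers: Record<string, string>,
): Array<[string, string]> {
  const names = Object.keys(headers);
  const lower = names.map((n) => n.toLowerCase());
  const used = names.map(() => false);
  const ordered: Array<[string, string]> = [];

  for (const wanted of HEADER_ORDER[browser]) {
    const i = lower.indexOf(wanted.toLowerCase());
    if (i !== -1) {
      ordered.push([wanted, headers[names[i]]]);
      used[i] = true;
    }
  }

  // Assumption: headers not in the order list are appended last, in their
  // original insertion order. A stricter implementation might drop them.
  names.forEach((n, i) => {
    if (!used[i]) ordered.push([n, headers[n]]);
  });
  return ordered;
}
```

Returning an array of pairs rather than an object keeps the order explicit, so the caller can emit the headers one by one (for example as repeated `-H` arguments to a curl-style binary) without relying on how an HTTP client serializes a header map.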