Kelly 824d48fd85 fix: Add curl to Docker, add active flag to worker_tasks

- Install curl in Docker container for Dutchie HTTP requests
- Add 'active' column to worker_tasks (default false) to prevent
  accidental task execution on startup
- Update task-service to only claim tasks where active=true

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

2025-12-10 23:12:09 -07:00

Workflow Documentation - December 10, 2025

Purpose

This document captures the intended behavior for the CannaiQ crawl system, specifically around proxy rotation, fingerprinting, and anti-detection.

Stealth & Anti-Detection Requirements

1. Task Determines Work, Proxy Determines Identity

The task payload contains:

dispensary_id - which store to crawl
role - what type of work (product_resync, entry_point_discovery, etc.)

The proxy determines the session identity:

Proxy location (city, state, timezone) → sets the session's timezone so behavior stays consistent with that location
Language is always English (en-US); Accept-Language does not vary with proxy location

Flow:

Task claimed
    │
    └─► Get proxy from rotation
            │
            └─► Proxy has location (city, state, timezone)
                    │
                    └─► Build headers using proxy's timezone
                            - Accept-Language: en-US,en;q=0.9
                            - Timezone-consistent behavior
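
A minimal sketch of the flow above, assuming a hypothetical ProxyLocation shape whose fields mirror the proxies table columns. buildSessionIdentity and SessionIdentity are illustrative names, not the actual client.ts API:

```typescript
// Hypothetical shape; field names mirror the location columns of the proxies table.
interface ProxyLocation {
  city: string;
  state: string;
  timezone: string; // IANA zone name, e.g. "America/Phoenix"
}

// Hypothetical container for everything the session derives from its proxy.
interface SessionIdentity {
  timezone: string;
  headers: Record<string, string>;
}

// Language is fixed by the spec; only the timezone follows the proxy.
const ACCEPT_LANGUAGE = "en-US,en;q=0.9";

function buildSessionIdentity(proxyLocation: ProxyLocation): SessionIdentity {
  return {
    timezone: proxyLocation.timezone,
    headers: { "Accept-Language": ACCEPT_LANGUAGE },
  };
}
```

The task payload never appears in this function: the task decides which store to crawl, while the proxy alone decides who the session appears to be.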

2. On 403 Block - Immediate Backoff

When a 403 is received:

Immediately stop using current IP
Get a new proxy (new IP)
Get a new UA/fingerprint
Retry the request

Per-proxy failure tracking:

Track consecutive 403s per proxy (consecutive_403_count)
After 3 consecutive 403s on the same proxy, each with a different UA/fingerprint → disable that proxy
This means: if a proxy is blocked 3 times in a row despite a fresh UA each time, the proxy is burned

3. Fingerprint Rotation Rules

Each request uses:

Proxy (IP)
User-Agent
sec-ch-ua headers (Client Hints)
Accept-Language (always en-US,en;q=0.9)

On 403:

Record failure on current proxy
Rotate to new proxy
Pick new random fingerprint
If same proxy fails 3 times with different fingerprints → disable proxy
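
The rotation step can be sketched as a pure function. This is a simplification under stated assumptions: PoolProxy, Attempt, and rotateAfter403 are hypothetical names, and fingerprints are modeled as indexes into a fixed list, whereas the real rotator generates them from the user-agents library.

```typescript
// Hypothetical, simplified view of a proxy in the rotation pool.
interface PoolProxy {
  id: number;
  active: boolean;
}

// One proxy/fingerprint pairing used for a request.
interface Attempt {
  proxyId: number;
  fingerprintId: number;
}

// After a 403: choose a different active proxy and a different fingerprint.
// Returns null when no other active proxy remains, so the caller can stop.
function rotateAfter403(
  pool: PoolProxy[],
  blocked: Attempt,
  fingerprintCount: number,
  rand: () => number = Math.random,
): Attempt | null {
  const candidates = pool.filter((p) => p.active && p.id !== blocked.proxyId);
  const proxy = candidates[Math.floor(rand() * candidates.length)];
  if (!proxy) return null;

  let fingerprintId = Math.floor(rand() * fingerprintCount);
  // Never reuse the fingerprint that was just blocked when an alternative exists.
  if (fingerprintCount > 1 && fingerprintId === blocked.fingerprintId) {
    fingerprintId = (fingerprintId + 1) % fingerprintCount;
  }
  return { proxyId: proxy.id, fingerprintId };
}
```

Passing rand as a parameter keeps the choice reproducible in tests. Recording the failure on the blocked proxy is a separate step and is not shown here.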

4. Proxy Table Schema

CREATE TABLE proxies (
  id SERIAL PRIMARY KEY,
  host VARCHAR(255) NOT NULL,
  port INTEGER NOT NULL,
  username VARCHAR(100),
  password VARCHAR(100),
  protocol VARCHAR(10) DEFAULT 'http',
  active BOOLEAN DEFAULT true,

  -- Location (determines session headers)
  city VARCHAR(100),
  state VARCHAR(50),
  country VARCHAR(100),
  country_code VARCHAR(10),
  timezone VARCHAR(50),

  -- Health tracking
  failure_count INTEGER DEFAULT 0,
  consecutive_403_count INTEGER DEFAULT 0,  -- Track 403s specifically
  last_used_at TIMESTAMPTZ,
  last_failure_at TIMESTAMPTZ,
  last_error TEXT,

  -- Performance
  response_time_ms INTEGER,
  max_connections INTEGER DEFAULT 1
);

5. Failure Threshold

3 consecutive 403s with different fingerprints → disable proxy
Reset consecutive_403_count to 0 on successful request
General failure_count tracks all errors (timeouts, connection errors, etc.)
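
A sketch of the threshold logic, assuming an in-memory ProxyHealth record (hypothetical) rather than the real database row. The function names follow the markBlocked() and markSuccess() methods mentioned under Implementation Status, but the bodies here are illustrative. It also assumes that a 403 counts toward the general failure_count, which the spec does not state explicitly.

```typescript
const MAX_CONSECUTIVE_403 = 3;

// Hypothetical in-memory mirror of the health columns on the proxies table.
interface ProxyHealth {
  active: boolean;
  failureCount: number;
  consecutive403Count: number;
}

// Record a 403: bump both counters and disable the proxy at the threshold.
function markBlocked(health: ProxyHealth): ProxyHealth {
  const consecutive403Count = health.consecutive403Count + 1;
  return {
    ...health,
    failureCount: health.failureCount + 1, // assumption: 403s also count as general failures
    consecutive403Count,
    active: health.active && consecutive403Count < MAX_CONSECUTIVE_403,
  };
}

// Record a success: only the consecutive-403 streak is reset.
function markSuccess(health: ProxyHealth): ProxyHealth {
  return { ...health, consecutive403Count: 0 };
}
```

Because the streak resets on any success, a proxy is only disabled when three blocks occur back to back.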

Implementation Status

COMPLETED - December 10, 2025

All code changes have been implemented per this specification:

1. crawl-rotator.ts ✅

Added consecutive403Count to Proxy interface
Added markBlocked() method that increments consecutive_403_count and disables proxy at 3
Added getProxyTimezone() to return current proxy's timezone
markSuccess() now resets consecutive_403_count to 0
Replaced hardcoded UA list with intoli/user-agents library for realistic fingerprints
BrowserFingerprint interface includes full fingerprint data (UA, platform, screen size, viewport, sec-ch-ua headers)

2. client.ts ✅

startSession() no longer takes state/timezone params
startSession() gets identity from proxy via crawlRotator.getProxyLocation()
Added handle403Block() that:
- Calls crawlRotator.recordBlock() (tracks consecutive 403s)
- Immediately rotates both proxy and fingerprint via rotateBoth()
- Returns false if no more proxies available
executeGraphQL() calls handle403Block() on 403 (not rotateProxyOn403)
fetchPage() uses same 403 handling
Fixed 500ms backoff after rotation (not a linear delay)

3. Task Handlers ✅

entry-point-discovery.ts: startSession() called with no params
product-refresh.ts: startSession() called with no params

4. Dependencies ✅

Added user-agents npm package for realistic UA generation

Files Changed

| File | Changes |
| --- | --- |
| `backend/src/services/crawl-rotator.ts` | Complete rewrite with `consecutive403Count`, `markBlocked()`, `intoli/user-agents` |
| `backend/src/platforms/dutchie/client.ts` | `startSession()` uses proxy location, `handle403Block()` for 403 handling |
| `backend/src/tasks/handlers/entry-point-discovery.ts` | `startSession()` no params |
| `backend/src/tasks/handlers/product-refresh.ts` | `startSession()` no params |
| `backend/package.json` | Added `user-agents` dependency |

Migration Required

The proxies table needs a consecutive_403_count column if one is not already present:

ALTER TABLE proxies ADD COLUMN IF NOT EXISTS consecutive_403_count INTEGER DEFAULT 0;

Key Behaviors Summary

| Behavior | Implementation |
| --- | --- |
| Session identity | From proxy location (`getProxyLocation()`) |
| Language | Always `en-US,en;q=0.9` |
| 403 handling | `handle403Block()` → `recordBlock()` → `rotateBoth()` |
| Proxy disable | After 3 consecutive 403s (`consecutive403Count >= 3`) |
| Success reset | `markSuccess()` resets `consecutive403Count` to 0 |
| UA generation | `intoli/user-agents` library (daily updated, realistic fingerprints) |
| Fingerprint data | Full: UA, platform, screen size, viewport, sec-ch-ua headers |

User-Agent Generation

Data Source

The intoli/user-agents npm library provides daily-updated market share data collected from Intoli's residential proxy network (millions of real users). The package auto-releases new versions daily to npm.

Device Category Distribution (hardcoded)

| Category | Share |
| --- | --- |
| Mobile | 62% |
| Desktop | 36% |
| Tablet | 2% |

Browser Filter (whitelist only)

Only these browsers are allowed:

Chrome (67%)
Safari (20%)
Edge (6%)
Firefox (3%)

Samsung Internet, Opera, and other niche browsers are filtered out.

Desktop OS Distribution (from library)

| OS | Share |
| --- | --- |
| Windows | 72% |
| macOS | 17% |
| Linux | 4% |

UA Lifecycle

Session start (new proxy IP obtained) → Roll device category (62/36/2) → Generate UA filtered to device + top 4 browsers → Store on session
UA sticks until IP rotates (403 block or manual rotation)
IP rotation triggers new UA generation

Failure Handling

If UA generation fails → Alert admin dashboard, stop crawl immediately
No fallback to static UA list
This forces investigation rather than silent degradation

Session Logging

Each session logs:

Device category (mobile/desktop/tablet)
Full UA string
Browser name (Chrome/Safari/Edge/Firefox)
IP address (from proxy)
Session start timestamp

Logs are rotated monthly.

Implementation

Located in backend/src/services/crawl-rotator.ts:

// Per workflow-12102025.md: Device category distribution
const DEVICE_WEIGHTS = { mobile: 62, desktop: 36, tablet: 2 };

// Per workflow-12102025.md: Browser whitelist
const ALLOWED_BROWSERS = ['Chrome', 'Safari', 'Edge', 'Firefox'];
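
The device-category roll from the UA Lifecycle can be sketched as a weighted pick over the 62/36/2 split. rollDeviceCategory and DEVICE_WEIGHT_TABLE are hypothetical names, and the rand parameter is an assumption added so the roll can be tested deterministically:

```typescript
type DeviceCategory = "mobile" | "desktop" | "tablet";

// Same 62/36/2 split as DEVICE_WEIGHTS, held as ordered pairs for the roll.
const DEVICE_WEIGHT_TABLE: [DeviceCategory, number][] = [
  ["mobile", 62],
  ["desktop", 36],
  ["tablet", 2],
];

// Weighted random pick: walk the table, subtracting each weight until the
// remaining value falls inside the current bucket.
function rollDeviceCategory(rand: () => number = Math.random): DeviceCategory {
  const total = DEVICE_WEIGHT_TABLE.reduce((sum, [, weight]) => sum + weight, 0);
  let remaining = rand() * total;
  for (const [category, weight] of DEVICE_WEIGHT_TABLE) {
    if (remaining < weight) return category;
    remaining -= weight;
  }
  return "tablet"; // only reached if rand() returns a value >= 1
}
```

The rolled category would then be passed to the UA generator as a filter, together with the browser whitelist. That call is omitted here because its exact shape in crawl-rotator.ts is not shown in this document.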

HTTP Fingerprinting

Goal

Make HTTP requests indistinguishable from real browser traffic. No repeatable footprint.

Components

Full Header Set - All headers a real browser sends
Header Ordering - Browser-specific order (Chrome vs Firefox vs Safari)
TLS Fingerprint - Use curl-impersonate to match browser TLS signature
Dynamic Referer - Set per dispensary being crawled
Natural Randomization - Vary optional headers like real users

Required Headers

| Header | Chrome | Firefox | Safari | Notes |
| --- | --- | --- | --- | --- |
| `User-Agent` | ✅ | ✅ | ✅ | From UA generation |
| `Accept` | ✅ | ✅ | ✅ | Content types |
| `Accept-Language` | ✅ | ✅ | ✅ | Always `en-US,en;q=0.9` |
| `Accept-Encoding` | ✅ | ✅ | ✅ | `gzip, deflate, br` |
| `Connection` | ✅ | ✅ | ✅ | `keep-alive` |
| `Origin` | ✅ | ✅ | ✅ | `https://dutchie.com` (POST only) |
| `Referer` | ✅ | ✅ | ✅ | Dynamic per dispensary |
| `sec-ch-ua` | ✅ | ❌ | ❌ | Chromium only |
| `sec-ch-ua-mobile` | ✅ | ❌ | ❌ | Chromium only |
| `sec-ch-ua-platform` | ✅ | ❌ | ❌ | Chromium only |
| `sec-fetch-dest` | ✅ | ✅ | ❌ | `empty` for XHR |
| `sec-fetch-mode` | ✅ | ✅ | ❌ | `cors` for XHR |
| `sec-fetch-site` | ✅ | ✅ | ❌ | `same-origin` |
| `Upgrade-Insecure-Requests` | ✅ | ✅ | ✅ | `1` (page loads only) |
| `DNT` | ~30% | ~30% | ~30% | Randomized per session |

Header Ordering

Each browser sends headers in a specific order. Fingerprinting services detect mismatches.

Chrome order (GraphQL request):

Host
Connection
Content-Length (POST)
sec-ch-ua
DNT (if enabled)
sec-ch-ua-mobile
User-Agent
sec-ch-ua-platform
Content-Type (POST)
Accept
Origin (POST)
sec-fetch-site
sec-fetch-mode
sec-fetch-dest
Referer
Accept-Encoding
Accept-Language

Firefox order (GraphQL request):

Host
User-Agent
Accept
Accept-Language
Accept-Encoding
Content-Type (POST)
Content-Length (POST)
Origin (POST)
DNT (if enabled)
Connection
Referer
sec-fetch-dest
sec-fetch-mode
sec-fetch-site

Safari order (GraphQL request):

Host
Connection
Content-Length (POST)
Accept
User-Agent
Content-Type (POST)
Origin (POST)
Referer
Accept-Encoding
Accept-Language
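
One way to apply these orders, sketched under assumptions: BrowserFamily, HEADER_ORDER, and orderHeaders are hypothetical names and do not reflect the actual contents of http-fingerprint.ts. The function returns ordered name/value pairs and drops any header that is absent from the browser's list, so a Firefox session never emits sec-ch-ua.

```typescript
type BrowserFamily = "chrome" | "firefox" | "safari";

// Header orders copied from the lists above (GraphQL request).
const HEADER_ORDER: Record<BrowserFamily, string[]> = {
  chrome: [
    "Host", "Connection", "Content-Length", "sec-ch-ua", "DNT",
    "sec-ch-ua-mobile", "User-Agent", "sec-ch-ua-platform", "Content-Type",
    "Accept", "Origin", "sec-fetch-site", "sec-fetch-mode", "sec-fetch-dest",
    "Referer", "Accept-Encoding", "Accept-Language",
  ],
  firefox: [
    "Host", "User-Agent", "Accept", "Accept-Language", "Accept-Encoding",
    "Content-Type", "Content-Length", "Origin", "DNT", "Connection",
    "Referer", "sec-fetch-dest", "sec-fetch-mode", "sec-fetch-site",
  ],
  safari: [
    "Host", "Connection", "Content-Length", "Accept", "User-Agent",
    "Content-Type", "Origin", "Referer", "Accept-Encoding", "Accept-Language",
  ],
};

// Returns headers as ordered [name, value] pairs for the given browser.
// Lookup is case-insensitive; headers not in the browser's list are dropped.
function orderHeaders(
  browser: BrowserFamily,
  headers: Record<string, string>,
): [string, string][] {
  const byLowerName = new Map<string, string>();
  for (const key of Object.keys(headers)) {
    const value = headers[key];
    if (value !== undefined) byLowerName.set(key.toLowerCase(), value);
  }
  const ordered: [string, string][] = [];
  for (const headerName of HEADER_ORDER[browser]) {
    const value = byLowerName.get(headerName.toLowerCase());
    if (value !== undefined) ordered.push([headerName, value]);
  }
  return ordered;
}
```

Returning an array of pairs rather than an object makes the order explicit, since the result is meant to be written to the wire in sequence.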

TLS Fingerprinting

Use curl-impersonate instead of standard curl:

curl_chrome131 - Mimics Chrome 131 TLS handshake
curl_ff133 - Mimics Firefox 133 TLS handshake
curl_safari17 - Mimics Safari 17 TLS handshake

Match the TLS binary to the browser named in the session's UA.
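
A sketch of that matching step. The binary names are the ones listed above; pickTlsBinary and the UA sniffing rules are assumptions. In particular, routing Edge to the Chrome binary is a guess based on Edge being Chromium-based, not something this spec states.

```typescript
type TlsBrowser = "chrome" | "firefox" | "safari";

// Binary names as listed in this document.
const IMPERSONATE_BINARY: Record<TlsBrowser, string> = {
  chrome: "curl_chrome131",
  firefox: "curl_ff133",
  safari: "curl_safari17",
};

// Picks the curl-impersonate binary whose TLS handshake matches the UA.
// Order matters: Chrome and Edge UAs also contain "Safari/", so Safari is
// checked last. Throws on an unrecognized UA instead of silently defaulting.
function pickTlsBinary(userAgent: string): string {
  if (/Firefox\//.test(userAgent)) return IMPERSONATE_BINARY.firefox;
  // Assumption: Edge ("Edg/") is Chromium-based and uses the Chrome handshake.
  if (/Chrome\/|Edg\//.test(userAgent)) return IMPERSONATE_BINARY.chrome;
  if (/Safari\//.test(userAgent)) return IMPERSONATE_BINARY.safari;
  throw new Error("No curl-impersonate binary for UA: " + userAgent);
}
```

Throwing on an unknown UA follows the same reasoning as the UA failure handling above: a mismatch between UA and TLS signature is better surfaced than hidden.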

Dynamic Referer

Set Referer to the dispensary's actual page URL:

Crawling "harvest-of-tempe" → Referer: https://dutchie.com/dispensary/harvest-of-tempe
Crawling "zen-leaf-mesa" → Referer: https://dutchie.com/dispensary/zen-leaf-mesa

Derived from the dispensary's menu_url field.
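
A sketch of the derivation, assuming menu_url contains a /dispensary/<slug> path segment (the stored format is not specified here). refererFromMenuUrl is a hypothetical helper name:

```typescript
// Builds the Referer from a dispensary's menu_url.
// Assumption: the URL contains "/dispensary/<slug>"; anything after the slug
// (sub-paths, query strings) is dropped. Returns null if no slug is found.
function refererFromMenuUrl(menuUrl: string): string | null {
  const match = /\/dispensary\/([A-Za-z0-9-]+)/.exec(menuUrl);
  if (!match) return null;
  return `https://dutchie.com/dispensary/${match[1]}`;
}
```

Returning null leaves it to the caller to decide what to do with a dispensary whose menu_url does not fit this pattern.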

Natural Randomization

Per-session randomization (set once when session starts, consistent for session):

| Feature | Distribution | Implementation |
| --- | --- | --- |
| DNT header | 30% have it | `Math.random() < 0.30` |
| Accept quality values | Slight variation | `q=0.9` vs `q=0.8` |
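
A sketch of the per-session DNT roll, with hypothetical names (SessionTraits, rollSessionTraits, applyTraits). The roll happens once at session start and the result is reused for every request in that session. The Accept quality variation is left out because its distribution is not specified.

```typescript
// Traits rolled once per session and kept until the session ends.
interface SessionTraits {
  sendDnt: boolean;
}

// 30% of sessions send DNT, per the table above.
function rollSessionTraits(rand: () => number = Math.random): SessionTraits {
  return { sendDnt: rand() < 0.30 };
}

// Applies the stored traits to a request's headers without re-rolling.
function applyTraits(
  headers: Record<string, string>,
  traits: SessionTraits,
): Record<string, string> {
  return traits.sendDnt ? { ...headers, DNT: "1" } : { ...headers };
}
```

Keeping the roll separate from the apply step is what makes the header consistent: a real user's DNT setting does not flip between requests.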

Implementation Files

| File | Purpose |
| --- | --- |
| `src/services/crawl-rotator.ts` | `BrowserFingerprint` includes full header config |
| `src/platforms/dutchie/client.ts` | Build headers from fingerprint, use curl-impersonate |
| `src/services/http-fingerprint.ts` | Header ordering per browser (NEW) |
