Workflow Documentation - December 10, 2025

Purpose

This document captures the intended behavior for the CannaiQ crawl system, specifically around proxy rotation, fingerprinting, and anti-detection.


Stealth & Anti-Detection Requirements

1. Task Determines Work, Proxy Determines Identity

The task payload contains:

  • dispensary_id - which store to crawl
  • role - what type of work (product_resync, entry_point_discovery, etc.)

The proxy determines the session identity:

  • Proxy location (city, state, timezone) → sets Accept-Language and timezone headers
  • Language is always English (en-US)

Flow:

Task claimed
    │
    └─► Get proxy from rotation
            │
            └─► Proxy has location (city, state, timezone)
                    │
                    └─► Build headers using proxy's timezone
                            - Accept-Language: en-US,en;q=0.9
                            - Timezone-consistent behavior
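
The flow above can be sketched roughly as follows. This is a minimal illustration, not the code in client.ts: the `ProxyLocation` and `SessionIdentity` shapes and the `buildSessionIdentity` name are hypothetical, and the choice to throw when the proxy has no timezone is an assumption.

```typescript
// Hypothetical shapes for illustration; the real proxy and session types
// live in crawl-rotator.ts / client.ts and may differ.
interface ProxyLocation {
  city: string;
  state: string;
  timezone: string; // IANA name, e.g. "America/Phoenix"
}

interface SessionIdentity {
  timezone: string;
  acceptLanguage: string;
}

// Language is fixed by this spec; only the timezone varies with the proxy.
const ACCEPT_LANGUAGE = 'en-US,en;q=0.9';

function buildSessionIdentity(location: ProxyLocation): SessionIdentity {
  if (!location.timezone) {
    // Assumption: a proxy without a timezone cannot supply a consistent identity.
    throw new Error('Proxy has no timezone; cannot build session identity');
  }
  return { timezone: location.timezone, acceptLanguage: ACCEPT_LANGUAGE };
}

const identity = buildSessionIdentity({
  city: 'Phoenix',
  state: 'AZ',
  timezone: 'America/Phoenix',
});
console.log(identity.acceptLanguage, identity.timezone);
```

Note that the task payload never appears here: the identity is a function of the proxy alone.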

2. On 403 Block - Immediate Backoff

When a 403 is received:

  1. Immediately stop using current IP
  2. Get a new proxy (new IP)
  3. Get a new UA/fingerprint
  4. Retry the request

Per-proxy failure tracking:

  • Track UA rotation attempts per proxy
  • After 3 UA/fingerprint rotations on the same proxy → disable that proxy
  • This means: if we rotate UA 3 times and still get 403, the proxy is burned

3. Fingerprint Rotation Rules

Each request uses:

  • Proxy (IP)
  • User-Agent
  • sec-ch-ua headers (Client Hints)
  • Accept-Language (from proxy location)

On 403:

  1. Record failure on current proxy
  2. Rotate to new proxy
  3. Pick new random fingerprint
  4. If same proxy fails 3 times with different fingerprints → disable proxy
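
A rough sketch of the rotate-and-retry loop described above. The method names `recordBlock()` and `rotateBoth()` come from the implementation notes in this document, but their signatures here are assumptions, as are the `RotatorLike` interface, the `requestWithRotation` helper, and the `maxAttempts` cap. The request is modelled as a synchronous callback returning a status code purely to keep the example short; a real client would presumably be async.

```typescript
// Assumed minimal surface of the rotator; signatures are illustrative only.
interface RotatorLike {
  recordBlock(): void;   // record the 403 against the current proxy
  rotateBoth(): boolean; // switch to a new proxy AND a new fingerprint; false if none left
}

function requestWithRotation(
  send: () => number, // stands in for the real HTTP call; returns the status code
  rotator: RotatorLike,
  maxAttempts: number,
): { ok: boolean; attempts: number } {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const statusCode = send();
    if (statusCode !== 403) {
      return { ok: statusCode >= 200 && statusCode < 300, attempts: attempt };
    }
    // 403: record the failure first, then rotate proxy and fingerprint together.
    rotator.recordBlock();
    if (!rotator.rotateBoth()) {
      return { ok: false, attempts: attempt }; // proxy pool exhausted
    }
  }
  return { ok: false, attempts: maxAttempts };
}

// Demo with a fake rotator: two 403s, then a 200.
const statuses = [403, 403, 200];
let sendCount = 0;
let blocks = 0;
let proxiesLeft = 2;
const fakeRotator: RotatorLike = {
  recordBlock: () => { blocks++; },
  rotateBoth: () => proxiesLeft-- > 0,
};
const outcome = requestWithRotation(() => statuses[sendCount++] ?? 200, fakeRotator, 5);
console.log(outcome, blocks);
```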

4. Proxy Table Schema

CREATE TABLE proxies (
  id SERIAL PRIMARY KEY,
  host VARCHAR(255) NOT NULL,
  port INTEGER NOT NULL,
  username VARCHAR(100),
  password VARCHAR(100),
  protocol VARCHAR(10) DEFAULT 'http',
  active BOOLEAN DEFAULT true,

  -- Location (determines session headers)
  city VARCHAR(100),
  state VARCHAR(50),
  country VARCHAR(100),
  country_code VARCHAR(10),
  timezone VARCHAR(50),

  -- Health tracking
  failure_count INTEGER DEFAULT 0,
  consecutive_403_count INTEGER DEFAULT 0,  -- Track 403s specifically
  last_used_at TIMESTAMPTZ,
  last_failure_at TIMESTAMPTZ,
  last_error TEXT,

  -- Performance
  response_time_ms INTEGER,
  max_connections INTEGER DEFAULT 1
);

5. Failure Threshold

  • 3 consecutive 403s with different fingerprints → disable proxy
  • Reset consecutive_403_count to 0 on successful request
  • General failure_count tracks all errors (timeouts, connection errors, etc.)
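
The threshold rules can be illustrated as pure state transitions. This is a sketch, not the code in crawl-rotator.ts: `ProxyHealth`, `onBlocked`, and `onSuccess` are hypothetical names, and two behaviors are assumptions: that a 403 also increments the general failure count, and that a success leaves that general count untouched.

```typescript
// Hypothetical in-memory mirror of the health columns in the proxies table.
interface ProxyHealth {
  active: boolean;
  failureCount: number;
  consecutive403Count: number;
}

const MAX_CONSECUTIVE_403 = 3;

// Called when a request through this proxy returns 403.
function onBlocked(health: ProxyHealth): ProxyHealth {
  const consecutive403Count = health.consecutive403Count + 1;
  return {
    // Disable once the streak reaches the threshold.
    active: health.active && consecutive403Count < MAX_CONSECUTIVE_403,
    failureCount: health.failureCount + 1, // assumption: 403s count as general failures too
    consecutive403Count,
  };
}

// Called when a request through this proxy succeeds.
function onSuccess(health: ProxyHealth): ProxyHealth {
  return {
    active: health.active,
    failureCount: health.failureCount, // assumption: not reset on success
    consecutive403Count: 0,
  };
}

let health: ProxyHealth = { active: true, failureCount: 0, consecutive403Count: 0 };
health = onBlocked(onBlocked(health)); // two 403s in a row
health = onSuccess(health);            // streak resets to 0
health = onBlocked(onBlocked(health)); // streak back to 2, proxy still active
const stillActive = health.active;
health = onBlocked(health);            // third consecutive 403 disables the proxy
console.log(stillActive, health.active, health.consecutive403Count);
```

Because the success in the middle resets the streak, it takes five 403s in total before this proxy is disabled.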

Implementation Status

COMPLETED - December 10, 2025

All code changes have been implemented per this specification:

1. crawl-rotator.ts

  • Added consecutive403Count to Proxy interface
  • Added markBlocked() method that increments consecutive_403_count and disables proxy at 3
  • Added getProxyTimezone() to return current proxy's timezone
  • markSuccess() now resets consecutive_403_count to 0
  • Replaced hardcoded UA list with intoli/user-agents library for realistic fingerprints
  • BrowserFingerprint interface includes full fingerprint data (UA, platform, screen size, viewport, sec-ch-ua headers)

2. client.ts

  • startSession() no longer takes state/timezone params
  • startSession() gets identity from proxy via crawlRotator.getProxyLocation()
  • Added handle403Block() that:
    • Calls crawlRotator.recordBlock() (tracks consecutive 403s)
    • Immediately rotates both proxy and fingerprint via rotateBoth()
    • Returns false if no more proxies available
  • executeGraphQL() calls handle403Block() on 403 (not rotateProxyOn403)
  • fetchPage() uses same 403 handling
  • Fixed 500ms backoff after rotation (not a linearly increasing delay)

3. Task Handlers

  • entry-point-discovery.ts: startSession() called with no params
  • product-refresh.ts: startSession() called with no params

4. Dependencies

  • Added user-agents npm package for realistic UA generation

Files Changed

| File | Changes |
|------|---------|
| backend/src/services/crawl-rotator.ts | Complete rewrite with consecutive403Count, markBlocked(), intoli/user-agents |
| backend/src/platforms/dutchie/client.ts | startSession() uses proxy location, handle403Block() for 403 handling |
| backend/src/tasks/handlers/entry-point-discovery.ts | startSession() no params |
| backend/src/tasks/handlers/product-refresh.ts | startSession() no params |
| backend/package.json | Added user-agents dependency |

Migration Required

The proxies table needs a consecutive_403_count column if one is not already present:

ALTER TABLE proxies ADD COLUMN IF NOT EXISTS consecutive_403_count INTEGER DEFAULT 0;

Key Behaviors Summary

| Behavior | Implementation |
|----------|----------------|
| Session identity | From proxy location (getProxyLocation()) |
| Language | Always en-US,en;q=0.9 |
| 403 handling | handle403Block() → recordBlock() → rotateBoth() |
| Proxy disable | After 3 consecutive 403s (consecutive403Count >= 3) |
| Success reset | markSuccess() resets consecutive403Count to 0 |
| UA generation | intoli/user-agents library (daily updated, realistic fingerprints) |
| Fingerprint data | Full: UA, platform, screen size, viewport, sec-ch-ua headers |

User-Agent Generation

Data Source

The intoli/user-agents npm library provides daily-updated market share data collected from Intoli's residential proxy network (millions of real users). The package auto-releases new versions daily to npm.

Device Category Distribution (hardcoded)

| Category | Share |
|----------|-------|
| Mobile | 62% |
| Desktop | 36% |
| Tablet | 2% |

Browser Filter (whitelist only)

Only these browsers are allowed:

  • Chrome (67%)
  • Safari (20%)
  • Edge (6%)
  • Firefox (3%)

Samsung Internet, Opera, and other niche browsers are filtered out.

Desktop OS Distribution (from library)

| OS | Share |
|----|-------|
| Windows | 72% |
| macOS | 17% |
| Linux | 4% |

UA Lifecycle

  1. Session start (new proxy IP obtained) → Roll device category (62/36/2) → Generate UA filtered to device + top 4 browsers → Store on session
  2. UA sticks until IP rotates (403 block or manual rotation)
  3. IP rotation triggers new UA generation

Failure Handling

  • If UA generation fails → Alert admin dashboard, stop crawl immediately
  • No fallback to static UA list
  • This forces investigation rather than silent degradation

Session Logging

Each session logs:

  • Device category (mobile/desktop/tablet)
  • Full UA string
  • Browser name (Chrome/Safari/Edge/Firefox)
  • IP address (from proxy)
  • Session start timestamp

Logs are rotated monthly.

Implementation

Located in backend/src/services/crawl-rotator.ts:

// Per workflow-12102025.md: Device category distribution
const DEVICE_WEIGHTS = { mobile: 62, desktop: 36, tablet: 2 };

// Per workflow-12102025.md: Browser whitelist
const ALLOWED_BROWSERS = ['Chrome', 'Safari', 'Edge', 'Firefox'];
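
As an illustration of how the 62/36/2 roll might work, here is a small weighted-pick sketch. `DEVICE_WEIGHTS` is repeated from above so the block is self-contained; `DeviceCategory`, `CATEGORY_ORDER`, and `rollDeviceCategory` are hypothetical names, and the injectable `rand` parameter is an assumption added so the roll can be tested deterministically.

```typescript
type DeviceCategory = 'mobile' | 'desktop' | 'tablet';

// Same weights as DEVICE_WEIGHTS in this document, typed for the sketch.
const DEVICE_WEIGHTS: Record<DeviceCategory, number> = { mobile: 62, desktop: 36, tablet: 2 };

// Fixed iteration order so the roll is deterministic for a given random value.
const CATEGORY_ORDER: DeviceCategory[] = ['mobile', 'desktop', 'tablet'];

// `rand` must return a value in [0, 1); it defaults to Math.random.
function rollDeviceCategory(rand: () => number = Math.random): DeviceCategory {
  const total = CATEGORY_ORDER.reduce((sum, category) => sum + DEVICE_WEIGHTS[category], 0);
  let roll = rand() * total;
  for (const category of CATEGORY_ORDER) {
    roll -= DEVICE_WEIGHTS[category];
    if (roll < 0) {
      return category;
    }
  }
  return 'tablet'; // only reachable through floating-point edge cases
}

console.log(rollDeviceCategory());
```

The chosen category would then be passed as a filter to the UA generator, together with the browser whitelist.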

HTTP Fingerprinting

Goal

Make HTTP requests indistinguishable from real browser traffic. No repeatable footprint.

Components

  1. Full Header Set - All headers a real browser sends
  2. Header Ordering - Browser-specific order (Chrome vs Firefox vs Safari)
  3. TLS Fingerprint - Use curl-impersonate to match browser TLS signature
  4. Dynamic Referer - Set per dispensary being crawled
  5. Natural Randomization - Vary optional headers like real users

Required Headers

| Header | Chrome | Firefox | Safari | Notes |
|--------|--------|---------|--------|-------|
| User-Agent | | | | From UA generation |
| Accept | | | | Content types |
| Accept-Language | | | | Always en-US,en;q=0.9 |
| Accept-Encoding | | | | gzip, deflate, br |
| Connection | | | | keep-alive |
| Origin | | | | https://dutchie.com (POST only) |
| Referer | | | | Dynamic per dispensary |
| sec-ch-ua | | | | Chromium only |
| sec-ch-ua-mobile | | | | Chromium only |
| sec-ch-ua-platform | | | | Chromium only |
| sec-fetch-dest | | | | empty for XHR |
| sec-fetch-mode | | | | cors for XHR |
| sec-fetch-site | | | | same-origin |
| Upgrade-Insecure-Requests | | | | 1 (page loads only) |
| DNT | ~30% | ~30% | ~30% | Randomized per session |

Header Ordering

Each browser sends headers in a specific order. Fingerprinting services detect mismatches.

Chrome order (GraphQL request):

  1. Host
  2. Connection
  3. Content-Length (POST)
  4. sec-ch-ua
  5. DNT (if enabled)
  6. sec-ch-ua-mobile
  7. User-Agent
  8. sec-ch-ua-platform
  9. Content-Type (POST)
  10. Accept
  11. Origin (POST)
  12. sec-fetch-site
  13. sec-fetch-mode
  14. sec-fetch-dest
  15. Referer
  16. Accept-Encoding
  17. Accept-Language

Firefox order (GraphQL request):

  1. Host
  2. User-Agent
  3. Accept
  4. Accept-Language
  5. Accept-Encoding
  6. Content-Type (POST)
  7. Content-Length (POST)
  8. Origin (POST)
  9. DNT (if enabled)
  10. Connection
  11. Referer
  12. sec-fetch-dest
  13. sec-fetch-mode
  14. sec-fetch-site

Safari order (GraphQL request):

  1. Host
  2. Connection
  3. Content-Length (POST)
  4. Accept
  5. User-Agent
  6. Content-Type (POST)
  7. Origin (POST)
  8. Referer
  9. Accept-Encoding
  10. Accept-Language
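
One way the three orderings above could be applied is sketched below. The `HEADER_ORDER` lists are copied from this section; `BrowserFamily` and `orderHeaders` are hypothetical names, not the contents of http-fingerprint.ts. Appending headers that are not in the documented list at the end is an assumption.

```typescript
type BrowserFamily = 'chrome' | 'firefox' | 'safari';

// Orderings copied from the lists in this section.
const HEADER_ORDER: Record<BrowserFamily, string[]> = {
  chrome: [
    'Host', 'Connection', 'Content-Length', 'sec-ch-ua', 'DNT', 'sec-ch-ua-mobile',
    'User-Agent', 'sec-ch-ua-platform', 'Content-Type', 'Accept', 'Origin',
    'sec-fetch-site', 'sec-fetch-mode', 'sec-fetch-dest', 'Referer',
    'Accept-Encoding', 'Accept-Language',
  ],
  firefox: [
    'Host', 'User-Agent', 'Accept', 'Accept-Language', 'Accept-Encoding',
    'Content-Type', 'Content-Length', 'Origin', 'DNT', 'Connection', 'Referer',
    'sec-fetch-dest', 'sec-fetch-mode', 'sec-fetch-site',
  ],
  safari: [
    'Host', 'Connection', 'Content-Length', 'Accept', 'User-Agent',
    'Content-Type', 'Origin', 'Referer', 'Accept-Encoding', 'Accept-Language',
  ],
};

// Returns [name, value] pairs in the browser's order. Names are matched
// case-insensitively; headers missing from the input are simply skipped.
function orderHeaders(
  browser: BrowserFamily,
  headers: Record<string, string>,
): Array<[string, string]> {
  const remaining: Record<string, [string, string]> = {};
  for (const key of Object.keys(headers)) {
    remaining[key.toLowerCase()] = [key, headers[key] ?? ''];
  }
  const ordered: Array<[string, string]> = [];
  for (const headerName of HEADER_ORDER[browser]) {
    const lower = headerName.toLowerCase();
    const entry = remaining[lower];
    if (entry !== undefined) {
      ordered.push([headerName, entry[1]]);
      delete remaining[lower];
    }
  }
  // Assumption: headers outside the documented order go last, in input order.
  for (const lower of Object.keys(remaining)) {
    const entry = remaining[lower];
    if (entry !== undefined) {
      ordered.push(entry);
    }
  }
  return ordered;
}

const sampleHeaders: Record<string, string> = {
  'accept-language': 'en-US,en;q=0.9',
  Connection: 'keep-alive',
  'User-Agent': 'example-ua',
  Accept: '*/*',
  Host: 'dutchie.com',
};
console.log(orderHeaders('firefox', sampleHeaders).map((pair) => pair[0]).join(', '));
console.log(orderHeaders('chrome', sampleHeaders).map((pair) => pair[0]).join(', '));
```

The pairs are returned as an array rather than an object because the ordering only helps if the HTTP client emits headers in exactly the sequence it is given.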

TLS Fingerprinting

Use curl-impersonate instead of standard curl:

  • curl_chrome131 - Mimics Chrome 131 TLS handshake
  • curl_ff133 - Mimics Firefox 133 TLS handshake
  • curl_safari17 - Mimics Safari 17 TLS handshake

Match the TLS binary to the browser named in the session's UA.
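
A minimal sketch of that matching step. The binary names are the ones listed above and may not match the names shipped by a given curl-impersonate build. `TlsBrowser`, `tlsBrowserFromUserAgent`, and `impersonateBinaryFor` are hypothetical names, and routing Edge to the Chrome binary is an assumption, since no Edge binary is listed.

```typescript
type TlsBrowser = 'chrome' | 'firefox' | 'safari';

// Binary names as listed in this document.
const IMPERSONATE_BINARY: Record<TlsBrowser, string> = {
  chrome: 'curl_chrome131',
  firefox: 'curl_ff133',
  safari: 'curl_safari17',
};

function tlsBrowserFromUserAgent(userAgent: string): TlsBrowser {
  if (userAgent.indexOf('Firefox/') !== -1) return 'firefox';
  // Desktop Chrome and Edge UAs contain both "Chrome/" and "Safari/",
  // so Chrome must be checked before Safari. Edge therefore maps to the
  // Chrome binary (assumption: both are Chromium-based).
  if (userAgent.indexOf('Chrome/') !== -1) return 'chrome';
  if (userAgent.indexOf('Safari/') !== -1) return 'safari';
  throw new Error('Unrecognized browser in UA: ' + userAgent);
}

function impersonateBinaryFor(userAgent: string): string {
  return IMPERSONATE_BINARY[tlsBrowserFromUserAgent(userAgent)];
}

const chromeUa =
  'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36';
console.log(impersonateBinaryFor(chromeUa));
```

Throwing on an unrecognized UA keeps a mismatched TLS signature from going out silently, in line with the no-fallback rule for UA generation.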

Dynamic Referer

Set Referer to the dispensary's actual page URL:

Crawling "harvest-of-tempe" → Referer: https://dutchie.com/dispensary/harvest-of-tempe
Crawling "zen-leaf-mesa" → Referer: https://dutchie.com/dispensary/zen-leaf-mesa

Derived from dispensary's menu_url field.
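
A sketch of deriving the Referer, assuming the dispensary slug is the last path segment of menu_url. That assumption, the `refererFromMenuUrl` name, and the slug validation pattern are all illustrative; the real menu_url format may need different handling.

```typescript
const DISPENSARY_BASE = 'https://dutchie.com/dispensary/';

// Assumption: the slug is the final path segment of menu_url.
function refererFromMenuUrl(menuUrl: string): string {
  // Drop any query string or fragment, then trailing slashes.
  const withoutQuery = menuUrl.split(/[?#]/)[0] ?? '';
  const path = withoutQuery.replace(/\/+$/, '');
  const slug = path.substring(path.lastIndexOf('/') + 1);
  // Reject anything that does not look like a slug (e.g. a bare hostname).
  if (!/^[a-z0-9-]+$/i.test(slug)) {
    throw new Error('Cannot derive dispensary slug from menu_url: ' + menuUrl);
  }
  return DISPENSARY_BASE + slug;
}

console.log(refererFromMenuUrl('https://dutchie.com/dispensary/harvest-of-tempe'));
console.log(refererFromMenuUrl('https://dutchie.com/dispensary/zen-leaf-mesa/?utm=x'));
```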

Natural Randomization

Per-session randomization (set once when session starts, consistent for session):

| Feature | Distribution | Implementation |
|---------|--------------|----------------|
| DNT header | 30% have it | Math.random() < 0.30 |
| Accept quality values | Slight variation | q=0.9 vs q=0.8 |
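
The once-per-session roll could look something like the sketch below. `SessionTraits`, `rollSessionTraits`, and `withOptionalHeaders` are hypothetical names. The 30% DNT rate comes from this section; the 50/50 split between q=0.9 and q=0.8 is an assumption, since the source only says the variation is slight.

```typescript
// Traits are rolled once at session start and reused for every request.
interface SessionTraits {
  sendDnt: boolean;
  acceptQuality: '0.9' | '0.8';
}

// `rand` must return a value in [0, 1); injectable for deterministic tests.
function rollSessionTraits(rand: () => number = Math.random): SessionTraits {
  const sendDnt = rand() < 0.30;
  // Assumption: an even split between the two quality values.
  const acceptQuality = rand() < 0.5 ? '0.9' : '0.8';
  return { sendDnt, acceptQuality };
}

// Adds the optional headers for this session without mutating the input.
function withOptionalHeaders(
  base: Record<string, string>,
  traits: SessionTraits,
): Record<string, string> {
  const result: Record<string, string> = {};
  for (const key of Object.keys(base)) {
    result[key] = base[key] ?? '';
  }
  if (traits.sendDnt) {
    result['DNT'] = '1';
  }
  return result;
}

const traits = rollSessionTraits();
console.log(withOptionalHeaders({ Accept: '*/*' }, traits));
```

Rolling inside the request path instead would make DNT flicker on and off within one session, which is itself a detectable pattern.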

Implementation Files

| File | Purpose |
|------|---------|
| src/services/crawl-rotator.ts | BrowserFingerprint includes full header config |
| src/platforms/dutchie/client.ts | Build headers from fingerprint, use curl-impersonate |
| src/services/http-fingerprint.ts | Header ordering per browser (NEW) |