cannaiq/docs/CRAWL_SYSTEM_V2.md
Kelly 56cc171287 feat: Stealth worker system with mandatory proxy rotation
## Worker System
- Role-agnostic workers that can handle any task type
- Pod-based architecture with StatefulSet (5-15 pods, 5 workers each)
- Custom pod names (Aethelgard, Xylos, Kryll, etc.)
- Worker registry with friendly names and resource monitoring
- Hub-and-spoke visualization on JobQueue page

## Stealth & Anti-Detection (REQUIRED)
- Proxies are MANDATORY - workers fail to start without active proxies
- CrawlRotator initializes on worker startup
- Loads proxies from `proxies` table
- Auto-rotates proxy + fingerprint on 403 errors
- 12 browser fingerprints (Chrome, Firefox, Safari, Edge)
- Locale/timezone matching for geographic consistency

## Task System
- Renamed product_resync → product_refresh
- Task chaining: store_discovery → entry_point → product_discovery
- Priority-based claiming with FOR UPDATE SKIP LOCKED
- Heartbeat and stale task recovery

## UI Updates
- JobQueue: Pod visualization, resource monitoring on hover
- WorkersDashboard: Simplified worker list
- Removed unused filters from task list

## Other
- IP2Location service for visitor analytics
- Findagram consumer features scaffolding
- Documentation updates

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-10 00:44:59 -07:00


CannaiQ Crawl System V2

Overview

The CannaiQ Crawl System is a GraphQL-based data pipeline that discovers and monitors cannabis dispensaries that use the Dutchie platform. It operates in two phases:

  1. Phase 1: Store Discovery - Weekly discovery of Dutchie-powered dispensaries
  2. Phase 2: Product Crawling - Regular product/price/stock updates (documented separately)

Phase 1: Store Discovery

Purpose

Automatically discover and maintain a database of dispensaries that use Dutchie menus across all US states.

Schedule

  • Frequency: Weekly (typically Sunday night)
  • Duration: ~2-4 hours for full US coverage

Flow Diagram

┌─────────────────────────────────────────────────────────────────────┐
│                        PHASE 1: STORE DISCOVERY                     │
└─────────────────────────────────────────────────────────────────────┘

1. IDENTITY SETUP
   ┌──────────────────┐
   │ getRandomProxy() │ ──► Random IP from proxy pool
   └──────────────────┘
            │
            ▼
   ┌──────────────────┐
   │  startSession()  │ ──► Random UA + fingerprint + locale matching proxy location
   └──────────────────┘

2. CITY DISCOVERY (per state)
   ┌──────────────────────────────┐
   │ GraphQL: getAllCitiesByState │ ──► Returns cities with active dispensaries
   └──────────────────────────────┘
            │
            ▼
   ┌──────────────────────────────┐
   │  Upsert dutchie_discovery_   │
   │  cities table                │
   └──────────────────────────────┘

3. STORE DISCOVERY (per city)
   ┌───────────────────────────────┐
   │ GraphQL: ConsumerDispensaries │ ──► Returns store data for city
   └───────────────────────────────┘
            │
            ▼
   ┌───────────────────────────────┐
   │  Upsert dutchie_discovery_    │
   │  locations table              │
   └───────────────────────────────┘

4. VALIDATION & PROMOTION
   ┌──────────────────────────┐
   │ validateForPromotion()   │ ──► Check required fields
   └──────────────────────────┘
            │
            ▼
   ┌──────────────────────────┐
   │   promoteLocation()      │ ──► Upsert to dispensaries table
   └──────────────────────────┘
            │
            ▼
   ┌──────────────────────────┐
   │ ensureCrawlerProfile()   │ ──► Create profile with status='sandbox'
   └──────────────────────────┘

5. DROPPED STORE DETECTION
   ┌──────────────────────────┐
   │  detectDroppedStores()   │ ──► Find stores missing from discovery
   └──────────────────────────┘
            │
            ▼
   ┌──────────────────────────┐
   │  Mark status='dropped'   │ ──► Dashboard alert for review
   └──────────────────────────┘

Key Files

| File | Purpose |
| --- | --- |
| backend/src/platforms/dutchie/client.ts | HTTP client with proxy/fingerprint rotation |
| backend/src/discovery/discovery-crawler.ts | Main discovery orchestrator |
| backend/src/discovery/location-discovery.ts | City/store GraphQL fetching |
| backend/src/discovery/promotion.ts | Validation and promotion logic |
| backend/src/scripts/run-discovery.ts | CLI entry point |

Identity Masking

Before any GraphQL queries, the system establishes a masked identity:

1. Proxy Selection

// backend/src/platforms/dutchie/client.ts

// Get random proxy from active pool (NOT state-specific)
const proxy = await getRandomProxy();
setProxy(proxy.url);

The proxy is selected randomly from the active proxy pool. It is NOT geo-targeted to the state being crawled.
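As an illustration of state-agnostic selection, the sketch below picks one active proxy uniformly at random. The `ProxyRecord` shape and the `pickRandomProxy` name are hypothetical; the real `getRandomProxy()` and the columns it reads are not shown in this document.

```typescript
// Hypothetical row shape; the actual proxy columns are not shown in this document.
interface ProxyRecord {
  url: string;
  active: boolean;
}

// Pick one active proxy uniformly at random, without regard to the target state.
// `rng` is injectable so the choice can be made deterministic in tests.
function pickRandomProxy(
  pool: ProxyRecord[],
  rng: () => number = Math.random,
): ProxyRecord {
  const active = pool.filter((p) => p.active);
  const chosen = active[Math.floor(rng() * active.length)];
  if (chosen === undefined) {
    // Mirrors the "proxies are mandatory" rule: no active proxy, no crawl.
    throw new Error('No active proxies available');
  }
  return chosen;
}

const samplePool: ProxyRecord[] = [
  { url: 'http://proxy-a.example:8080', active: true },
  { url: 'http://proxy-b.example:8080', active: false },
  { url: 'http://proxy-c.example:8080', active: true },
];

console.log(pickRandomProxy(samplePool).url);
```

Failing loudly when the pool is empty keeps a crawl from silently falling back to a direct, unmasked connection.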

2. Fingerprint + Locale Harmonization

// backend/src/platforms/dutchie/client.ts

function startSession(stateCode: string, timezone: string) {
  // 1. Random browser fingerprint (Chrome/Firefox/Safari/Edge variants)
  const fingerprint = getRandomFingerprint();

  // 2. Match Accept-Language to proxy's timezone/location
  const locale = getLocaleForTimezone(timezone);

  // 3. Set headers for this session
  currentSession = {
    userAgent: fingerprint.ua,
    acceptLanguage: locale,
    secChUa: fingerprint.secChUa,
    // ... other fingerprint headers
  };
}

Fingerprint Pool

The pool contains 6 browser fingerprints. A new one is selected for each session and again after a 403 error:

| Browser | Version | Platform |
| --- | --- | --- |
| Chrome | 120 | Windows |
| Chrome | 120 | macOS |
| Firefox | 121 | Windows |
| Firefox | 121 | macOS |
| Safari | 17.2 | macOS |
| Edge | 120 | Windows |
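One way to rotate after a 403 is to pick a random fingerprint other than the one that was just blocked. The sketch below assumes that behavior; the document only states that fingerprints are chosen randomly, so the exclusion of the current entry and the `nextFingerprint` name are illustrative.

```typescript
interface Fingerprint {
  browser: string;
  version: string;
  platform: string;
}

// The six entries from the fingerprint pool table.
const FINGERPRINT_POOL: Fingerprint[] = [
  { browser: 'Chrome', version: '120', platform: 'Windows' },
  { browser: 'Chrome', version: '120', platform: 'macOS' },
  { browser: 'Firefox', version: '121', platform: 'Windows' },
  { browser: 'Firefox', version: '121', platform: 'macOS' },
  { browser: 'Safari', version: '17.2', platform: 'macOS' },
  { browser: 'Edge', version: '120', platform: 'Windows' },
];

// Choose a random fingerprint that differs from the current one (assumption:
// reusing the fingerprint that was just rejected is never useful).
function nextFingerprint(
  current: Fingerprint | null,
  rng: () => number = Math.random,
): Fingerprint {
  const candidates = FINGERPRINT_POOL.filter((f) => f !== current);
  const chosen = candidates[Math.floor(rng() * candidates.length)];
  if (chosen === undefined) {
    throw new Error('Fingerprint pool is empty');
  }
  return chosen;
}

const first = nextFingerprint(null);
const afterBlock = nextFingerprint(first);
console.log(first, afterBlock);
```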

Timezone → Locale Mapping

const TIMEZONE_TO_LOCALE: Record<string, string> = {
  'America/New_York': 'en-US,en;q=0.9',
  'America/Chicago': 'en-US,en;q=0.9',
  'America/Denver': 'en-US,en;q=0.9',
  'America/Los_Angeles': 'en-US,en;q=0.9',
  'America/Phoenix': 'en-US,en;q=0.9',
  // ...
};
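To show how the fingerprint and the timezone mapping could combine into request headers, here is a minimal sketch. The header names are standard HTTP headers, but `buildSession`, `toHeaders`, the fallback locale, and the sample User-Agent strings are assumptions for illustration, not the contents of `client.ts`. Firefox and Safari do not send `sec-ch-ua`, so the sketch omits that header for them.

```typescript
interface SessionFingerprint {
  ua: string;
  secChUa: string | null; // null for browsers that do not send Client Hints
}

interface Session {
  userAgent: string;
  acceptLanguage: string;
  secChUa: string | null;
}

const TIMEZONE_LOCALES: Record<string, string> = {
  'America/New_York': 'en-US,en;q=0.9',
  'America/Phoenix': 'en-US,en;q=0.9',
};

// Assumed fallback for timezones missing from the map.
const FALLBACK_LOCALE = 'en-US,en;q=0.9';

function buildSession(fp: SessionFingerprint, timezone: string): Session {
  return {
    userAgent: fp.ua,
    acceptLanguage: TIMEZONE_LOCALES[timezone] ?? FALLBACK_LOCALE,
    secChUa: fp.secChUa,
  };
}

// Convert a session into the headers attached to each request.
function toHeaders(session: Session): Record<string, string> {
  const headers: Record<string, string> = {
    'User-Agent': session.userAgent,
    'Accept-Language': session.acceptLanguage,
  };
  if (session.secChUa !== null) {
    headers['sec-ch-ua'] = session.secChUa;
  }
  return headers;
}

// Illustrative fingerprint values.
const chromeWindows: SessionFingerprint = {
  ua: 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
  secChUa: '"Not_A Brand";v="8", "Chromium";v="120", "Google Chrome";v="120"',
};
const firefoxWindows: SessionFingerprint = {
  ua: 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:121.0) Gecko/20100101 Firefox/121.0',
  secChUa: null,
};

console.log(toHeaders(buildSession(chromeWindows, 'America/Phoenix')));
```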

GraphQL Queries

1. getAllCitiesByState

Fetches cities with active dispensaries for a state.

// backend/src/discovery/location-discovery.ts

const response = await executeGraphQL({
  operationName: 'getAllCitiesByState',
  variables: {
    state: 'AZ',
    countryCode: 'US'
  }
});
// Returns: { cities: [{ name: 'Phoenix', slug: 'phoenix' }, ...] }

Hash: ae547a0466ace5a48f91e55bf6699eacd87e3a42841560f0c0eabed5a0a920e6

2. ConsumerDispensaries

Fetches store data for a city/state.

// backend/src/discovery/location-discovery.ts

const response = await executeGraphQL({
  operationName: 'ConsumerDispensaries',
  variables: {
    dispensaryFilter: {
      city: 'Phoenix',
      state: 'AZ',
      activeOnly: true
    }
  }
});
// Returns: [{ id, name, address, coords, menuUrl, ... }, ...]

Hash: 0a5bfa6ca1d64ae47bcccb7c8077c87147cbc4e6982c17ceec97a2a4948b311b
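The document does not say how the hashes are used. Assuming they are SHA-256 identifiers for Apollo-style persisted queries, a request URL could be assembled as in the sketch below. The `buildPersistedQueryUrl` helper and the `https://example.com/graphql` endpoint are placeholders, not the real `executeGraphQL` implementation or the real endpoint.

```typescript
interface PersistedQuery {
  operationName: string;
  variables: Record<string, unknown>;
  sha256Hash: string;
}

// Encode the operation as query-string parameters in the persisted-query
// convention: the server looks the query text up by its hash.
function buildPersistedQueryUrl(endpoint: string, query: PersistedQuery): string {
  const params = new URLSearchParams({
    operationName: query.operationName,
    variables: JSON.stringify(query.variables),
    extensions: JSON.stringify({
      persistedQuery: { version: 1, sha256Hash: query.sha256Hash },
    }),
  });
  return `${endpoint}?${params.toString()}`;
}

const citiesQuery: PersistedQuery = {
  operationName: 'getAllCitiesByState',
  variables: { state: 'AZ', countryCode: 'US' },
  sha256Hash: 'ae547a0466ace5a48f91e55bf6699eacd87e3a42841560f0c0eabed5a0a920e6',
};

// Placeholder endpoint; substitute the real GraphQL URL.
console.log(buildPersistedQueryUrl('https://example.com/graphql', citiesQuery));
```

If the hashes are persisted-query identifiers, they would need to be refreshed whenever the platform changes the underlying query text.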


Database Tables

Discovery Tables (Staging)

| Table | Purpose |
| --- | --- |
| dutchie_discovery_cities | Cities known to have dispensaries |
| dutchie_discovery_locations | Raw discovered store data |

Canonical Tables

| Table | Purpose |
| --- | --- |
| dispensaries | Promoted stores ready for crawling |
| dispensary_crawler_profiles | Crawler configuration per store |
| dutchie_promotion_log | Audit trail for all discovery actions |

Validation Rules

A discovery location must have these fields to be promoted:

| Field | Requirement |
| --- | --- |
| platform_location_id | MongoDB ObjectId (24 hex chars) |
| name | Non-empty string |
| city | Non-empty string |
| state_code | Non-empty string |
| platform_menu_url | Valid URL |

Invalid records are marked status='rejected' with errors logged.
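The rules above can be expressed as a small pure function. This is a sketch rather than the code in `promotion.ts`: the `validateLocation` name, the result shape, the error messages, and the restriction of "valid URL" to http/https are assumptions.

```typescript
interface DiscoveryLocation {
  platform_location_id?: string | null;
  name?: string | null;
  city?: string | null;
  state_code?: string | null;
  platform_menu_url?: string | null;
}

interface ValidationResult {
  valid: boolean;
  errors: string[];
}

// A MongoDB ObjectId is 24 hexadecimal characters.
const OBJECT_ID_PATTERN = /^[0-9a-f]{24}$/i;

function isHttpUrl(value: string): boolean {
  try {
    const url = new URL(value);
    // Assumption: only http(s) menu URLs count as valid.
    return url.protocol === 'http:' || url.protocol === 'https:';
  } catch {
    return false;
  }
}

function validateLocation(loc: DiscoveryLocation): ValidationResult {
  const errors: string[] = [];
  if (!loc.platform_location_id || !OBJECT_ID_PATTERN.test(loc.platform_location_id)) {
    errors.push('platform_location_id must be a 24-character hex ObjectId');
  }
  for (const field of ['name', 'city', 'state_code'] as const) {
    const value = loc[field];
    if (!value || value.trim() === '') {
      errors.push(`${field} must be a non-empty string`);
    }
  }
  if (!loc.platform_menu_url || !isHttpUrl(loc.platform_menu_url)) {
    errors.push('platform_menu_url must be a valid URL');
  }
  return { valid: errors.length === 0, errors };
}

const sampleLocation: DiscoveryLocation = {
  platform_location_id: '5f1b2c3d4e5f6a7b8c9d0e1f',
  name: 'Example Dispensary',
  city: 'Phoenix',
  state_code: 'AZ',
  platform_menu_url: 'https://example.com/dispensary/example',
};

console.log(validateLocation(sampleLocation));
```

Collecting every error instead of stopping at the first makes the rejected record easier to review in one pass.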


Dropped Store Detection

After discovery, the system identifies stores that may have left the Dutchie platform:

Detection Criteria

A store is marked as "dropped" if:

  1. It has a platform_dispensary_id (was previously verified)
  2. It currently has status='open' and crawl_enabled=true
  3. It was NOT seen in the latest discovery (no matching row in dutchie_discovery_locations with last_seen_at within the last 24 hours)

Implementation

// backend/src/discovery/discovery-crawler.ts

export async function detectDroppedStores(pool: Pool, stateCode?: string) {
  // 1. Find dispensaries not in recent discovery
  // 2. Mark status='dropped'
  // 3. Log to dutchie_promotion_log
  // 4. Return list for dashboard alert
}
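The three detection criteria can be sketched as an in-memory filter, which is easier to reason about than the SQL the real function presumably runs. The row shape, the `findDroppedStores` name, and the use of the platform ID as the join key between `dispensaries` and `dutchie_discovery_locations` are assumptions.

```typescript
interface DispensaryRow {
  id: number;
  platform_dispensary_id: string | null;
  status: string;
  crawl_enabled: boolean;
}

const SEEN_WINDOW_MS = 24 * 60 * 60 * 1000; // 24 hours

// `lastSeenByPlatformId` maps a platform ID to the newest last_seen_at
// recorded in the discovery staging table (assumed join key).
function findDroppedStores(
  dispensaries: DispensaryRow[],
  lastSeenByPlatformId: Map<string, Date>,
  now: Date,
): DispensaryRow[] {
  const cutoff = now.getTime() - SEEN_WINDOW_MS;
  return dispensaries.filter((d) => {
    // Criterion 1: previously verified.
    if (d.platform_dispensary_id === null) return false;
    // Criterion 2: currently open and enabled for crawling.
    if (d.status !== 'open' || !d.crawl_enabled) return false;
    // Criterion 3: never seen, or last seen before the 24-hour window.
    const lastSeen = lastSeenByPlatformId.get(d.platform_dispensary_id);
    return lastSeen === undefined || lastSeen.getTime() < cutoff;
  });
}

const runTime = new Date('2025-01-02T00:00:00Z');
const seen = new Map<string, Date>([
  ['store-a', new Date('2025-01-01T12:00:00Z')], // seen 12 hours ago
  ['store-b', new Date('2024-12-30T00:00:00Z')], // seen 3 days ago
]);
const rows: DispensaryRow[] = [
  { id: 1, platform_dispensary_id: 'store-a', status: 'open', crawl_enabled: true },
  { id: 2, platform_dispensary_id: 'store-b', status: 'open', crawl_enabled: true },
];

console.log(findDroppedStores(rows, seen, runTime).map((d) => d.id));
```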

Admin UI

  • Dashboard: Red alert banner when dropped stores exist
  • Dispensaries page: Filter by status=dropped to review

CLI Usage

# Discover all stores in a state
npx tsx src/scripts/run-discovery.ts discover:state AZ

# Discover all US states
npx tsx src/scripts/run-discovery.ts discover:all

# Dry run (no DB writes)
npx tsx src/scripts/run-discovery.ts discover:state CA --dry-run

# Check stats
npx tsx src/scripts/run-discovery.ts stats
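For readers extending the CLI, the commands above could be parsed along the following lines. This is a sketch, not the contents of `run-discovery.ts`: the `parseCliArgs` name and the `Command` type are hypothetical, and accepting `--dry-run` with `discover:all` is an assumption, since the examples only show it with `discover:state`.

```typescript
type Command =
  | { name: 'discover:state'; stateCode: string; dryRun: boolean }
  | { name: 'discover:all'; dryRun: boolean }
  | { name: 'stats' };

// Parse the arguments that follow the script path,
// e.g. ['discover:state', 'CA', '--dry-run'].
function parseCliArgs(argv: string[]): Command {
  const dryRun = argv.includes('--dry-run');
  const positional = argv.filter((arg) => !arg.startsWith('--'));
  const name = positional[0];
  const stateCode = positional[1];

  switch (name) {
    case 'discover:state':
      if (stateCode === undefined || !/^[A-Z]{2}$/.test(stateCode)) {
        throw new Error('discover:state requires a two-letter state code');
      }
      return { name: 'discover:state', stateCode, dryRun };
    case 'discover:all':
      return { name: 'discover:all', dryRun };
    case 'stats':
      return { name: 'stats' };
    default:
      throw new Error(`Unknown command: ${name ?? '(none)'}`);
  }
}

console.log(parseCliArgs(['discover:state', 'CA', '--dry-run']));
```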

Rate Limiting

  • 2 seconds between city requests
  • Exponential backoff on 429/403 responses
  • Fingerprint rotation on 403 errors
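The backoff schedule itself is not specified here. As a rough sketch, exponential backoff doubles the wait after each failed attempt up to a cap; the 2-second base (borrowed from the inter-city delay) and the 60-second cap are assumed values, and `backoffDelayMs` is a hypothetical helper.

```typescript
const BASE_DELAY_MS = 2000; // assumed: same as the delay between city requests
const MAX_DELAY_MS = 60000; // assumed cap

// Delay before retry number `attempt` (0-based): base * 2^attempt, capped.
function backoffDelayMs(
  attempt: number,
  baseMs: number = BASE_DELAY_MS,
  maxMs: number = MAX_DELAY_MS,
): number {
  if (!Number.isInteger(attempt) || attempt < 0) {
    throw new Error('attempt must be a non-negative integer');
  }
  return Math.min(baseMs * 2 ** attempt, maxMs);
}

for (let attempt = 0; attempt < 6; attempt++) {
  console.log(`attempt ${attempt}: wait ${backoffDelayMs(attempt)} ms`);
}
```

In practice a small random jitter is often added to each delay so that parallel workers do not retry in lockstep.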

Error Handling

| Error | Action |
| --- | --- |
| 403 Forbidden | Rotate fingerprint, retry |
| 429 Rate Limited | Wait 30s, retry |
| Network timeout | Retry up to 3 times |
| GraphQL error | Log and continue to next city |
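The error table maps naturally onto a single decision function. The sketch below is one possible encoding; the `CrawlError` and `Action` types are hypothetical, and skipping the city after three failed timeout retries or on any other HTTP status is an assumption, since the table does not cover those cases.

```typescript
type CrawlError =
  | { kind: 'http'; status: number }
  | { kind: 'timeout' }
  | { kind: 'graphql'; message: string };

type Action =
  | { type: 'rotate_fingerprint_and_retry' }
  | { type: 'wait_and_retry'; delayMs: number }
  | { type: 'retry' }
  | { type: 'skip_city' };

const RATE_LIMIT_WAIT_MS = 30000; // "Wait 30s, retry"
const MAX_TIMEOUT_RETRIES = 3; // "Retry up to 3 times"

// `retriesSoFar` counts retries already made for the current request.
function decideAction(error: CrawlError, retriesSoFar: number): Action {
  switch (error.kind) {
    case 'http':
      if (error.status === 403) {
        return { type: 'rotate_fingerprint_and_retry' };
      }
      if (error.status === 429) {
        return { type: 'wait_and_retry', delayMs: RATE_LIMIT_WAIT_MS };
      }
      // Assumption: other HTTP failures are treated like a GraphQL error.
      return { type: 'skip_city' };
    case 'timeout':
      // Assumption: once retries are exhausted, move on to the next city.
      return retriesSoFar < MAX_TIMEOUT_RETRIES
        ? { type: 'retry' }
        : { type: 'skip_city' };
    case 'graphql':
      return { type: 'skip_city' };
  }
}

console.log(decideAction({ kind: 'http', status: 429 }, 0));
```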

Monitoring

Logs

Discovery progress is logged to stdout:

[Discovery] Starting discovery for state: AZ
[Discovery] Step 1: Initializing proxy...
[Discovery] Step 2: Fetching cities...
[Discovery] Found 45 cities for AZ
[Discovery] Step 3: Discovering locations...
[Discovery] City 1/45: Phoenix - found 28 stores
...
[Discovery] Step 4: Auto-promoting discovered locations...
[Discovery] Created: 5 new dispensaries
[Discovery] Updated: 40 existing dispensaries
[Discovery] Step 5: Detecting dropped stores...
[Discovery] Found 2 dropped stores

Audit Log

All actions logged to dutchie_promotion_log:

| Action | Description |
| --- | --- |
| promoted_create | New dispensary created |
| promoted_update | Existing dispensary updated |
| rejected | Validation failed |
| dropped | Store not found in discovery |

Next: Phase 2

See docs/PRODUCT_CRAWL_V2.md for the product crawling phase (coming next).