
Crawl Pipeline Documentation

Overview

The crawl pipeline fetches product data from Dutchie dispensary menus and stores it in the canonical database. This document covers the complete flow from task scheduling to data storage.


Pipeline Stages

┌─────────────────────┐
│  store_discovery    │  Find new dispensaries
└─────────┬───────────┘
          │
          ▼
┌─────────────────────┐
│entry_point_discovery│  Resolve slug → platform_dispensary_id
└─────────┬───────────┘
          │
          ▼
┌─────────────────────┐
│  product_discovery  │  Initial product crawl
└─────────┬───────────┘
          │
          ▼
┌─────────────────────┐
│   product_resync    │  Recurring crawl (every 4 hours)
└─────────────────────┘

Stage Details

1. Store Discovery

Purpose: Find new dispensaries to crawl

Handler: src/tasks/handlers/store-discovery.ts

Flow:

  1. Query Dutchie ConsumerDispensaries GraphQL for cities/states
  2. Extract dispensary info (name, address, menu_url)
  3. Insert into dutchie_discovery_locations
  4. Queue entry_point_discovery for each new location
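A hedged sketch of this flow (not the actual handler in src/tasks/handlers/store-discovery.ts): fetchConsumerDispensaries and upsertDiscoveryLocation are hypothetical helpers, while createEntryPointTask is the convenience method shown later in this document.

```typescript
// Hypothetical helpers standing in for the real GraphQL call and DB upsert:
declare function fetchConsumerDispensaries(state: string): Promise<
  Array<{ name: string; address: string; menuUrl: string }>
>;
declare function upsertDiscoveryLocation(loc: {
  name: string; address: string; menu_url: string;
}): Promise<{ id: number; isNew: boolean }>;
declare const taskService: {
  createEntryPointTask(dispensaryId: number, platform: string): Promise<void>;
};

async function handleStoreDiscovery(state: string): Promise<void> {
  const locations = await fetchConsumerDispensaries(state); // step 1
  for (const loc of locations) {
    // steps 2-3: extract fields, insert into dutchie_discovery_locations
    const { id, isNew } = await upsertDiscoveryLocation({
      name: loc.name,
      address: loc.address,
      menu_url: loc.menuUrl,
    });
    // step 4: queue entry_point_discovery only for new locations
    if (isNew) await taskService.createEntryPointTask(id, 'dutchie');
  }
}
```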

2. Entry Point Discovery

Purpose: Resolve menu URL slug to platform_dispensary_id (MongoDB ObjectId)

Handler: src/tasks/handlers/entry-point-discovery.ts

Flow:

  1. Load dispensary from database
  2. Extract slug from menu_url:
    • /embedded-menu/<slug> or /dispensary/<slug>
  3. Start stealth session (fingerprint + proxy)
  4. Query resolveDispensaryIdWithDetails(slug) via GraphQL
  5. Update dispensary with platform_dispensary_id
  6. Queue product_discovery task

Example:

menu_url: https://dutchie.com/embedded-menu/deeply-rooted
slug: deeply-rooted
platform_dispensary_id: 6405ef617056e8014d79101b
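A minimal sketch of the slug extraction described in step 2 (the helper name is illustrative):

```typescript
// Matches both /embedded-menu/<slug> and /dispensary/<slug> URL shapes
function extractSlug(menuUrl: string): string | null {
  const match = menuUrl.match(/\/(?:embedded-menu|dispensary)\/([^/?#]+)/);
  return match ? match[1] : null;
}

extractSlug('https://dutchie.com/embedded-menu/deeply-rooted'); // 'deeply-rooted'
```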

3. Product Discovery

Purpose: Initial crawl of a new dispensary

Handler: src/tasks/handlers/product-discovery.ts

Same as product_resync but for first-time crawls.


4. Product Resync

Purpose: Recurring crawl to capture price/stock changes

Handler: src/tasks/handlers/product-resync.ts

Flow:

Step 1: Load Dispensary Info

SELECT id, name, platform_dispensary_id, menu_url, state
FROM dispensaries
WHERE id = $1 AND crawl_enabled = true

Step 2: Start Stealth Session

  • Generate random browser fingerprint
  • Set locale/timezone matching state
  • Optional proxy rotation

Step 3: Fetch Products via GraphQL

Endpoint: https://dutchie.com/api-3/graphql

Variables:

{
  includeEnterpriseSpecials: false,
  productsFilter: {
    dispensaryId: "<platform_dispensary_id>",
    pricingType: "rec",
    Status: "All",
    types: [],
    useCache: false,
    isDefaultSort: true,
    sortBy: "popularSortIdx",
    sortDirection: 1,
    bypassOnlineThresholds: true,
    isKioskMenu: false,
    removeProductsBelowOptionThresholds: false
  },
  page: 0,
  perPage: 100
}

Key Notes:

  • Status: "All" returns all products (Active returns same count)
  • Status: null returns 0 products (broken)
  • pricingType: "rec" returns BOTH rec and med prices
  • Paginate until products.length < perPage or allProducts.length >= totalCount
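A sketch of that pagination loop, assuming a fetchProductsPage() wrapper around the GraphQL query above:

```typescript
// Hypothetical wrapper around the filteredProducts query shown above
declare function fetchProductsPage(vars: {
  dispensaryId: string; page: number; perPage: number;
}): Promise<{ products: unknown[]; totalCount: number }>;

async function fetchAllProducts(dispensaryId: string): Promise<unknown[]> {
  const perPage = 100;
  const all: unknown[] = [];
  let page = 0;
  for (;;) {
    const { products, totalCount } = await fetchProductsPage({ dispensaryId, page, perPage });
    all.push(...products);
    // Stop when a page comes back short or we already have every product
    if (products.length < perPage || all.length >= totalCount) break;
    page += 1;
  }
  return all;
}
```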

Step 4: Normalize Data

Transform raw Dutchie payload to canonical format via DutchieNormalizer.

Step 5: Upsert Products

Insert/update store_products table with normalized data.

Step 6: Create Snapshots

Insert point-in-time record to store_product_snapshots.

Step 7: Track Missing Products (OOS Detection)

-- Reset consecutive_misses for products IN the feed
UPDATE store_products
SET consecutive_misses = 0, last_seen_at = NOW()
WHERE dispensary_id = $1
  AND provider = 'dutchie'
  AND provider_product_id = ANY($2)

-- Increment for products NOT in feed
UPDATE store_products
SET consecutive_misses = consecutive_misses + 1
WHERE dispensary_id = $1
  AND provider = 'dutchie'
  AND provider_product_id != ALL($2)
  AND consecutive_misses < 3

-- Mark OOS at 3 consecutive misses
UPDATE store_products
SET stock_status = 'oos', is_in_stock = false
WHERE dispensary_id = $1
  AND consecutive_misses >= 3
  AND stock_status != 'oos'

Step 8: Download Images

For new products, download and store images locally.

Step 9: Update Dispensary

UPDATE dispensaries SET last_crawl_at = NOW() WHERE id = $1

GraphQL Payload Structure

Product Fields (from filteredProducts.products[])

| Field | Type | Description |
|-------|------|-------------|
| _id / id | string | MongoDB ObjectId (24 hex chars) |
| Name | string | Product display name |
| brandName | string | Brand name |
| brand.name | string | Brand name (nested) |
| brand.description | string | Brand description |
| type | string | Category (Flower, Edible, Concentrate, etc.) |
| subcategory | string | Subcategory |
| strainType | string | Hybrid, Indica, Sativa, N/A |
| Status | string | Always "Active" in feed |
| Image | string | Primary image URL |
| images[] | array | All product images |

Pricing Fields

| Field | Type | Description |
|-------|------|-------------|
| Prices[] | number[] | Rec prices per option |
| recPrices[] | number[] | Rec prices |
| medicalPrices[] | number[] | Medical prices |
| recSpecialPrices[] | number[] | Rec sale prices |
| medicalSpecialPrices[] | number[] | Medical sale prices |
| Options[] | string[] | Size options ("1/8oz", "1g", etc.) |
| rawOptions[] | string[] | Raw weight options ("3.5g") |

Inventory Fields (POSMetaData.children[])

| Field | Type | Description |
|-------|------|-------------|
| quantity | number | Total inventory count |
| quantityAvailable | number | Available for online orders |
| kioskQuantityAvailable | number | Available for kiosk orders |
| option | string | Which size option this is for |

Potency Fields

| Field | Type | Description |
|-------|------|-------------|
| THCContent.range[] | number[] | THC percentage |
| CBDContent.range[] | number[] | CBD percentage |
| cannabinoidsV2[] | array | Detailed cannabinoid breakdown |

Specials (specialData.bogoSpecials[])

| Field | Type | Description |
|-------|------|-------------|
| specialName | string | Deal name |
| specialType | string | "bogo", "sale", etc. |
| itemsForAPrice.value | string | Bundle price |
| bogoRewards[].totalQuantity.quantity | number | Required quantity |

OOS Detection Logic

Products disappear from the Dutchie feed when they go out of stock. We track this via consecutive_misses:

| Scenario | Action |
|----------|--------|
| Product in feed | consecutive_misses = 0 |
| Product missing 1st time | consecutive_misses = 1 |
| Product missing 2nd time | consecutive_misses = 2 |
| Product missing 3rd time | consecutive_misses = 3, mark stock_status = 'oos' |
| Product returns to feed | consecutive_misses = 0, update stock_status |

Why 3 misses?

  • Protects against false positives from crawl failures
  • Single bad crawl doesn't trigger mass OOS alerts
  • Balances detection speed vs accuracy

Database Tables

store_products

Current state of each product:

  • provider_product_id - Dutchie's MongoDB ObjectId
  • name_raw, brand_name_raw - Raw values from feed
  • price_rec, price_med - Current prices
  • is_in_stock, stock_status - Availability
  • consecutive_misses - OOS detection counter
  • last_seen_at - Last time product was in feed
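For illustration, the row shape implied by these columns (the real table has more fields):

```typescript
// Illustrative subset of a store_products row, limited to the columns
// called out above
interface StoreProductRow {
  provider_product_id: string;  // Dutchie's MongoDB ObjectId
  name_raw: string;             // raw name from feed
  brand_name_raw: string | null;
  price_rec: number | null;
  price_med: number | null;
  is_in_stock: boolean;
  stock_status: string;         // e.g. 'oos'
  consecutive_misses: number;   // OOS detection counter
  last_seen_at: Date;           // last time product was in feed
}
```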

store_product_snapshots

Point-in-time records for historical analysis:

  • One row per product per crawl
  • Captures price, stock, potency at that moment
  • Used for price history, analytics

dispensaries

Store metadata:

  • platform_dispensary_id - MongoDB ObjectId for GraphQL
  • menu_url - Source URL
  • last_crawl_at - Last successful crawl
  • crawl_enabled - Whether to crawl

Worker Roles

Workers pull tasks from the worker_tasks queue based on their assigned role.

| Role | Name | Description | Handler |
|------|------|-------------|---------|
| product_resync | Product Resync | Re-crawl dispensary products for price/stock changes | handleProductResync |
| product_discovery | Product Discovery | Initial product discovery for new dispensaries | handleProductDiscovery |
| store_discovery | Store Discovery | Discover new dispensary locations | handleStoreDiscovery |
| entry_point_discovery | Entry Point Discovery | Resolve platform IDs from menu URLs | handleEntryPointDiscovery |
| analytics_refresh | Analytics Refresh | Refresh materialized views and analytics | handleAnalyticsRefresh |

API Endpoint: GET /api/worker-registry/roles


Scheduling

Crawls are scheduled via worker_tasks table:

| Role | Frequency | Description |
|------|-----------|-------------|
| product_resync | Every 4 hours | Regular product refresh |
| product_discovery | On-demand | First crawl for new stores |
| entry_point_discovery | On-demand | New store setup |
| store_discovery | Daily | Find new stores |
| analytics_refresh | Daily | Refresh analytics materialized views |
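A hypothetical sketch of how the 4-hour batch could be generated; the worker_tasks column names are assumptions inferred from fields used elsewhere in this document:

```typescript
import { Pool } from 'pg';

const pool = new Pool();

// Enqueue a batch-priority (0) resync task for every enabled dispensary
// whose last crawl is older than the 4-hour cadence
async function enqueueDueResyncs(): Promise<void> {
  await pool.query(`
    INSERT INTO worker_tasks (role, dispensary_id, platform, priority)
    SELECT 'product_resync', d.id, 'dutchie', 0
    FROM dispensaries d
    WHERE d.crawl_enabled = true
      AND (d.last_crawl_at IS NULL
           OR d.last_crawl_at < NOW() - INTERVAL '4 hours')
  `);
}
```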

Priority & On-Demand Tasks

Tasks are claimed by workers in order of priority DESC, created_at ASC.

Priority Levels

| Priority | Use Case | Example |
|----------|----------|---------|
| 0 | Scheduled/batch tasks | Daily product_resync generation |
| 10 | On-demand/chained tasks | entry_point → product_discovery |
| Higher | Urgent/manual triggers | Admin-triggered immediate crawl |

Task Chaining

When a task completes, the system automatically creates follow-up tasks:

store_discovery (completed)
    └─► entry_point_discovery (priority: 10) for each new store

entry_point_discovery (completed, success)
    └─► product_discovery (priority: 10) for that store

product_discovery (completed)
    └─► [no chain] Store enters regular resync schedule

On-Demand Task Creation

Use the task service to create high-priority tasks:

// Create immediate product resync for a store
await taskService.createTask({
  role: 'product_resync',
  dispensary_id: 123,
  platform: 'dutchie',
  priority: 20, // Higher than batch tasks
});

// Convenience methods with default high priority (10)
await taskService.createEntryPointTask(dispensaryId, 'dutchie');
await taskService.createProductDiscoveryTask(dispensaryId, 'dutchie');
await taskService.createStoreDiscoveryTask('dutchie', 'AZ');

Claim Function

The claim_task() SQL function atomically claims tasks:

  • Respects priority ordering (higher = first)
  • Uses FOR UPDATE SKIP LOCKED for concurrency
  • Prevents multiple active tasks per store
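For illustration, the claim logic inlined as a single query (status values and column names are assumptions; the real implementation lives in the claim_task() SQL function):

```typescript
import { Pool } from 'pg';

const pool = new Pool();

// Atomically claim the highest-priority pending task for a role.
// SKIP LOCKED lets concurrent workers pass over rows another worker
// is already claiming instead of blocking on them.
async function claimNextTask(role: string) {
  const { rows } = await pool.query(
    `UPDATE worker_tasks
     SET status = 'running', claimed_at = NOW()
     WHERE id = (
       SELECT id FROM worker_tasks
       WHERE role = $1 AND status = 'pending'
       ORDER BY priority DESC, created_at ASC
       FOR UPDATE SKIP LOCKED
       LIMIT 1
     )
     RETURNING *`,
    [role]
  );
  return rows[0] ?? null;
}
```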

Image Storage

Images are downloaded from Dutchie's AWS S3 and stored locally with on-demand resizing.

Storage Path

/storage/images/products/<state>/<store>/<brand>/<product_id>/image-<hash>.webp
/storage/images/brands/<brand>/logo-<hash>.webp

Example:

/storage/images/products/az/az-deeply-rooted/bud-bros/6913e3cd444eac3935e928b9/image-ae38b1f9.webp

Image Proxy API

Served via /img/* with on-demand resizing using sharp:

GET /img/products/az/az-deeply-rooted/bud-bros/6913e3cd444eac3935e928b9/image-ae38b1f9.webp?w=200

| Param | Description |
|-------|-------------|
| w | Width in pixels (max 4000) |
| h | Height in pixels (max 4000) |
| q | Quality 1-100 (default 80) |
| fit | cover, contain, fill, inside, outside |
| blur | Blur sigma (0.3-1000) |
| gray | Grayscale (1 = enabled) |
| format | webp, jpeg, png, avif (default webp) |
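A simplified sketch of the transform pipeline using sharp (the real route in src/routes/image-proxy.ts also handles path validation, caching, and format selection):

```typescript
import sharp from 'sharp';

// Apply the query params above to a source image and emit webp
async function transformImage(
  input: Buffer,
  opts: { w?: number; h?: number; q?: number; fit?: keyof sharp.FitEnum; blur?: number; gray?: boolean }
): Promise<Buffer> {
  let img = sharp(input);
  if (opts.w || opts.h) {
    img = img.resize({ width: opts.w, height: opts.h, fit: opts.fit ?? 'cover' });
  }
  if (opts.blur) img = img.blur(opts.blur); // sigma 0.3-1000
  if (opts.gray) img = img.grayscale();
  return img.webp({ quality: opts.q ?? 80 }).toBuffer();
}
```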

Key Files

| File | Purpose |
|------|---------|
| src/utils/image-storage.ts | Download & save images to local filesystem |
| src/routes/image-proxy.ts | On-demand resize/transform at /img/* |

Download Rules

| Scenario | Image Action |
|----------|--------------|
| New product (first crawl) | Download if primaryImageUrl exists |
| Existing product (refresh) | Download only if local_image_path is NULL (backfill) |
| Product already has local image | Skip download entirely |

Logic:

  • Images are downloaded once and never re-downloaded on subsequent crawls
  • skipIfExists: true - filesystem check prevents re-download even if queued
  • First crawl: all products get images
  • Refresh crawl: only new products or products missing local images
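A minimal sketch of the skipIfExists guard (downloadToFile is a hypothetical helper, not the real image-storage API):

```typescript
import { existsSync } from 'node:fs';

declare function downloadToFile(url: string, dest: string): Promise<void>;

// Returns true if a download happened, false if the file was already on disk
async function maybeDownloadImage(
  url: string,
  localPath: string,
  skipIfExists = true
): Promise<boolean> {
  if (skipIfExists && existsSync(localPath)) return false; // never re-download
  await downloadToFile(url, localPath);
  return true;
}
```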

Storage Rules

  • NO MinIO - local filesystem only (STORAGE_DRIVER=local)
  • Store full resolution, resize on-demand via /img proxy
  • Convert to webp for consistency using sharp
  • Preserve original Dutchie URL as fallback in image_url column
  • Local path stored in local_image_path column

Stealth & Anti-Detection

PROXIES ARE REQUIRED - Workers will fail to start if no active proxies are available in the database. All HTTP requests to Dutchie go through a proxy.

Workers automatically initialize anti-detection systems on startup.

Components

| Component | Purpose | Source |
|-----------|---------|--------|
| CrawlRotator | Coordinates proxy + UA rotation | src/services/crawl-rotator.ts |
| ProxyRotator | Round-robin proxy selection, health tracking | src/services/crawl-rotator.ts |
| UserAgentRotator | Cycles through realistic browser fingerprints | src/services/crawl-rotator.ts |
| Dutchie Client | Curl-based HTTP with auto-retry on 403 | src/platforms/dutchie/client.ts |

Initialization Flow

Worker Start
    │
    ├─► initializeStealth()
    │       │
    │       ├─► CrawlRotator.initialize()
    │       │       └─► Load proxies from `proxies` table
    │       │
    │       └─► setCrawlRotator(rotator)
    │               └─► Wire to Dutchie client
    │
    └─► Process tasks...

Stealth Session (per task)

Each crawl task starts a stealth session:

// In product-resync.ts, entry-point-discovery.ts
const session = startSession(dispensary.state || 'AZ', 'America/Phoenix');

This creates a new identity with:

  • Random fingerprint: Chrome/Firefox/Safari/Edge on Win/Mac/Linux
  • Accept-Language: Matches timezone (e.g., America/Phoenix → en-US,en;q=0.9)
  • sec-ch-ua headers: Proper Client Hints for the browser profile

On 403 Block

When Dutchie returns 403, the client automatically:

  1. Records failure on current proxy (increments failure_count)
  2. If proxy has 5+ failures, deactivates it
  3. Rotates to next healthy proxy
  4. Rotates fingerprint
  5. Retries the request
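A sketch of that retry loop; the rotator and client method names are assumptions based on the components table above:

```typescript
// Hypothetical stand-ins for the curl-based client and the CrawlRotator
declare function proxiedFetch(url: string): Promise<{ status: number; body: string }>;
declare const rotator: {
  recordFailure(): void;     // increments failure_count; deactivates proxy at 5+
  rotateProxy(): void;       // move to next healthy proxy
  rotateFingerprint(): void; // new UA + matching sec-ch-ua headers
};

async function fetchWithRotation(url: string, maxRetries = 3) {
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    const res = await proxiedFetch(url);
    if (res.status !== 403) return res;
    // Blocked: burn this identity and try again with a fresh one
    rotator.recordFailure();
    rotator.rotateProxy();
    rotator.rotateFingerprint();
  }
  throw new Error(`Still blocked after ${maxRetries} retries: ${url}`);
}
```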

Proxy Table Schema

CREATE TABLE proxies (
  id SERIAL PRIMARY KEY,
  host VARCHAR(255) NOT NULL,
  port INTEGER NOT NULL,
  username VARCHAR(100),
  password VARCHAR(100),
  protocol VARCHAR(10) DEFAULT 'http',  -- http, https, socks5
  is_active BOOLEAN DEFAULT true,
  last_used_at TIMESTAMPTZ,
  failure_count INTEGER DEFAULT 0,
  success_count INTEGER DEFAULT 0,
  avg_response_time_ms INTEGER,
  last_failure_at TIMESTAMPTZ,
  last_error TEXT
);

Configuration

Proxies are mandatory. There is no environment variable to disable them. Workers will refuse to start without active proxies in the database.

User-Agent Generation

See workflow-12102025.md for full specification.

Summary:

  • Uses intoli/user-agents library (daily-updated market share data)
  • Device distribution: Mobile 62%, Desktop 36%, Tablet 2%
  • Browser whitelist: Chrome, Safari, Edge, Firefox only
  • UA sticks until IP rotates (403 or manual rotation)
  • Failure = alert admin + stop crawl (no fallback)

Each fingerprint includes proper sec-ch-ua, sec-ch-ua-platform, and sec-ch-ua-mobile headers.
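A sketch of how UA generation with the user-agents package could look, mirroring the distribution and whitelist above (the weighting mechanics and function names are illustrative, not the real implementation):

```typescript
import UserAgent from 'user-agents';

// Device distribution from the summary above
const DEVICE_WEIGHTS: Array<[string, number]> = [
  ['mobile', 0.62],
  ['desktop', 0.36],
  ['tablet', 0.02],
];
// "Edg" matches Edge UAs; Chrome UAs also contain "Safari", which is fine
const BROWSER_RE = /Chrome|Safari|Edg|Firefox/;

function pickDeviceCategory(): string {
  let r = Math.random();
  for (const [category, weight] of DEVICE_WEIGHTS) {
    if ((r -= weight) <= 0) return category;
  }
  return 'desktop';
}

// Throws if no UA matches the filters; per the rules above that means
// alert the admin and stop the crawl, not fall back to a canned UA
function generateUserAgent(): string {
  const category = pickDeviceCategory();
  return new UserAgent([BROWSER_RE, { deviceCategory: category }]).toString();
}
```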


Error Handling

  • GraphQL errors: Logged, task marked failed, retried later
  • Normalization errors: Logged as warnings, continue with valid products
  • Image download errors: Non-fatal, logged, continue
  • Database errors: Task fails, will be retried
  • 403 blocks: Auto-rotate proxy + fingerprint, retry (up to 3 retries)

Files

| File | Purpose |
|------|---------|
| src/tasks/handlers/product-resync.ts | Main crawl handler |
| src/tasks/handlers/entry-point-discovery.ts | Slug → ID resolution |
| src/platforms/dutchie/index.ts | GraphQL client, session management |
| src/hydration/normalizers/dutchie.ts | Payload normalization |
| src/hydration/canonical-upsert.ts | Database upsert logic |
| src/utils/image-storage.ts | Image download and local storage |
| src/routes/image-proxy.ts | On-demand image resizing |
| migrations/075_consecutive_misses.sql | OOS tracking column |