cannaiq/backend/docs/CRAWL_PIPELINE.md
Kelly 0295637ed6 fix: Public API column mappings and OOS detection
- Fix store_products column references (name_raw, brand_name_raw, category_raw)
- Fix v_product_snapshots column references (crawled_at, *_cents pricing)
- Fix dispensaries column references (zipcode, logo_image, remove hours/amenities)
- Add services and license_type to dispensary API response
- Add consecutive_misses OOS tracking to product-resync handler
- Add migration 075 for consecutive_misses column
- Add CRAWL_PIPELINE.md documentation

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-09 20:44:53 -07:00


Crawl Pipeline Documentation

Overview

The crawl pipeline fetches product data from Dutchie dispensary menus and stores it in the canonical database. This document covers the complete flow from task scheduling to data storage.


Pipeline Stages

┌───────────────────────┐
│    store_discovery    │  Find new dispensaries
└───────────┬───────────┘
            │
            ▼
┌───────────────────────┐
│ entry_point_discovery │  Resolve slug → platform_dispensary_id
└───────────┬───────────┘
            │
            ▼
┌───────────────────────┐
│   product_discovery   │  Initial product crawl
└───────────┬───────────┘
            │
            ▼
┌───────────────────────┐
│    product_resync     │  Recurring crawl (every 4 hours)
└───────────────────────┘

Stage Details

1. Store Discovery

Purpose: Find new dispensaries to crawl

Handler: src/tasks/handlers/store-discovery.ts

Flow:

  1. Query Dutchie ConsumerDispensaries GraphQL for cities/states
  2. Extract dispensary info (name, address, menu_url)
  3. Insert into dutchie_discovery_locations
  4. Queue entry_point_discovery for each new location

2. Entry Point Discovery

Purpose: Resolve menu URL slug to platform_dispensary_id (MongoDB ObjectId)

Handler: src/tasks/handlers/entry-point-discovery.ts

Flow:

  1. Load dispensary from database
  2. Extract slug from menu_url:
    • /embedded-menu/<slug> or /dispensary/<slug>
  3. Start stealth session (fingerprint + proxy)
  4. Query resolveDispensaryIdWithDetails(slug) via GraphQL
  5. Update dispensary with platform_dispensary_id
  6. Queue product_discovery task

Example:

menu_url: https://dutchie.com/embedded-menu/deeply-rooted
slug: deeply-rooted
platform_dispensary_id: 6405ef617056e8014d79101b
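A minimal sketch of the slug extraction in step 2 (extractSlug is an illustrative name; the real handler lives in src/tasks/handlers/entry-point-discovery.ts and may differ):

```typescript
// Extract the menu slug from either URL form:
//   /embedded-menu/<slug>  or  /dispensary/<slug>
// Stops at '/', '?', or '#' so query strings don't leak into the slug.
function extractSlug(menuUrl: string): string | null {
  const match = menuUrl.match(/\/(?:embedded-menu|dispensary)\/([^/?#]+)/);
  return match ? match[1] : null;
}

console.log(extractSlug("https://dutchie.com/embedded-menu/deeply-rooted")); // "deeply-rooted"
```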

3. Product Discovery

Purpose: Initial crawl of a new dispensary

Handler: src/tasks/handlers/product-discovery.ts

Same as product_resync but for first-time crawls.


4. Product Resync

Purpose: Recurring crawl to capture price/stock changes

Handler: src/tasks/handlers/product-resync.ts

Flow:

Step 1: Load Dispensary Info

SELECT id, name, platform_dispensary_id, menu_url, state
FROM dispensaries
WHERE id = $1 AND crawl_enabled = true

Step 2: Start Stealth Session

  • Generate random browser fingerprint
  • Set locale/timezone matching state
  • Optional proxy rotation

Step 3: Fetch Products via GraphQL

Endpoint: https://dutchie.com/api-3/graphql

Variables:

{
  includeEnterpriseSpecials: false,
  productsFilter: {
    dispensaryId: "<platform_dispensary_id>",
    pricingType: "rec",
    Status: "All",
    types: [],
    useCache: false,
    isDefaultSort: true,
    sortBy: "popularSortIdx",
    sortDirection: 1,
    bypassOnlineThresholds: true,
    isKioskMenu: false,
    removeProductsBelowOptionThresholds: false
  },
  page: 0,
  perPage: 100
}

Key Notes:

  • Status: "All" returns all products (Status: "Active" returns the same count)
  • Status: null returns 0 products (broken)
  • pricingType: "rec" returns BOTH rec and med prices
  • Paginate until products.length < perPage or allProducts.length >= totalCount
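The pagination rule above can be sketched as a loop. This is illustrative only: fetchProductsPage stands in for the real GraphQL client in src/platforms/dutchie/index.ts, and the names here are not the actual implementation's.

```typescript
interface ProductsPage {
  products: unknown[];
  totalCount: number;
}

// Fetch pages until a short page arrives or totalCount is reached.
async function fetchAllProducts(
  fetchProductsPage: (page: number, perPage: number) => Promise<ProductsPage>,
  perPage = 100,
): Promise<unknown[]> {
  const allProducts: unknown[] = [];
  let page = 0;
  for (;;) {
    const { products, totalCount } = await fetchProductsPage(page, perPage);
    allProducts.push(...products);
    // Stop condition from the notes above: short page, or everything collected.
    if (products.length < perPage || allProducts.length >= totalCount) break;
    page += 1;
  }
  return allProducts;
}
```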

Step 4: Normalize Data

Transform raw Dutchie payload to canonical format via DutchieNormalizer.

Step 5: Upsert Products

Insert/update store_products table with normalized data.

Step 6: Create Snapshots

Insert point-in-time record to store_product_snapshots.

Step 7: Track Missing Products (OOS Detection)

-- Reset consecutive_misses for products IN the feed
UPDATE store_products
SET consecutive_misses = 0, last_seen_at = NOW()
WHERE dispensary_id = $1
  AND provider = 'dutchie'
  AND provider_product_id = ANY($2)

-- Increment for products NOT in feed
UPDATE store_products
SET consecutive_misses = consecutive_misses + 1
WHERE dispensary_id = $1
  AND provider = 'dutchie'
  AND provider_product_id NOT IN (...)
  AND consecutive_misses < 3

-- Mark OOS at 3 consecutive misses
UPDATE store_products
SET stock_status = 'oos', is_in_stock = false
WHERE dispensary_id = $1
  AND consecutive_misses >= 3
  AND stock_status != 'oos'

Step 8: Download Images

For new products, download and store images locally.

Step 9: Update Dispensary

UPDATE dispensaries SET last_crawl_at = NOW() WHERE id = $1

GraphQL Payload Structure

Product Fields (from filteredProducts.products[])

| Field | Type | Description |
|-------|------|-------------|
| _id / id | string | MongoDB ObjectId (24 hex chars) |
| Name | string | Product display name |
| brandName | string | Brand name |
| brand.name | string | Brand name (nested) |
| brand.description | string | Brand description |
| type | string | Category (Flower, Edible, Concentrate, etc.) |
| subcategory | string | Subcategory |
| strainType | string | Hybrid, Indica, Sativa, N/A |
| Status | string | Always "Active" in feed |
| Image | string | Primary image URL |
| images[] | array | All product images |

Pricing Fields

| Field | Type | Description |
|-------|------|-------------|
| Prices[] | number[] | Rec prices per option |
| recPrices[] | number[] | Rec prices |
| medicalPrices[] | number[] | Medical prices |
| recSpecialPrices[] | number[] | Rec sale prices |
| medicalSpecialPrices[] | number[] | Medical sale prices |
| Options[] | string[] | Size options ("1/8oz", "1g", etc.) |
| rawOptions[] | string[] | Raw weight options ("3.5g") |
Inventory Fields (POSMetaData.children[])

| Field | Type | Description |
|-------|------|-------------|
| quantity | number | Total inventory count |
| quantityAvailable | number | Available for online orders |
| kioskQuantityAvailable | number | Available for kiosk orders |
| option | string | Which size option this is for |
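Per-option availability can be rolled up into a single stock figure. The sketch below is illustrative, using the field names from the table above (the PosChild wrapper type and totalAvailable are hypothetical names, not from the codebase):

```typescript
// One entry of POSMetaData.children[], per the fields documented above.
interface PosChild {
  option: string;
  quantity: number;
  quantityAvailable: number;
}

// Sum availability across all size options of a product.
function totalAvailable(children: PosChild[]): number {
  return children.reduce((sum, c) => sum + c.quantityAvailable, 0);
}
```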

Potency Fields

| Field | Type | Description |
|-------|------|-------------|
| THCContent.range[] | number[] | THC percentage |
| CBDContent.range[] | number[] | CBD percentage |
| cannabinoidsV2[] | array | Detailed cannabinoid breakdown |
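As a hedged illustration, a display helper for these ranges might look like the following. It assumes a one-element range is a point value and two elements are a min/max pair, which is an assumption about the feed rather than documented behavior; potencyLabel is a hypothetical name:

```typescript
// Render THCContent.range[] / CBDContent.range[] for display.
// Assumption: [x] is a point value, [min, max] is a range.
function potencyLabel(range: number[] | undefined): string {
  if (!range || range.length === 0) return "N/A";
  if (range.length === 1) return `${range[0]}%`;
  return `${range[0]}-${range[1]}%`;
}
```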

Specials (specialData.bogoSpecials[])

| Field | Type | Description |
|-------|------|-------------|
| specialName | string | Deal name |
| specialType | string | "bogo", "sale", etc. |
| itemsForAPrice.value | string | Bundle price |
| bogoRewards[].totalQuantity.quantity | number | Required quantity |

OOS Detection Logic

Products disappear from the Dutchie feed when they go out of stock. We track this via consecutive_misses:

| Scenario | Action |
|----------|--------|
| Product in feed | consecutive_misses = 0 |
| Product missing 1st time | consecutive_misses = 1 |
| Product missing 2nd time | consecutive_misses = 2 |
| Product missing 3rd time | consecutive_misses = 3, mark stock_status = 'oos' |
| Product returns to feed | consecutive_misses = 0, update stock_status |

Why 3 misses?

  • Protects against false positives from crawl failures
  • Single bad crawl doesn't trigger mass OOS alerts
  • Balances detection speed vs accuracy
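The per-product transitions described above can be modeled as a small pure function. This is a sketch only: in the pipeline this logic runs as the SQL updates in Step 7, and the names here are illustrative.

```typescript
interface ProductState {
  consecutiveMisses: number;
  stockStatus: "in_stock" | "oos";
}

const OOS_THRESHOLD = 3;

// Apply one crawl's outcome for a single product.
function applyCrawlResult(state: ProductState, seenInFeed: boolean): ProductState {
  if (seenInFeed) {
    // Product is back in the feed: reset the counter, restore stock status.
    return { consecutiveMisses: 0, stockStatus: "in_stock" };
  }
  // Counter is capped at the threshold, matching `consecutive_misses < 3` in the SQL.
  const misses = Math.min(state.consecutiveMisses + 1, OOS_THRESHOLD);
  return {
    consecutiveMisses: misses,
    stockStatus: misses >= OOS_THRESHOLD ? "oos" : state.stockStatus,
  };
}
```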

Database Tables

store_products

Current state of each product:

  • provider_product_id - Dutchie's MongoDB ObjectId
  • name_raw, brand_name_raw - Raw values from feed
  • price_rec, price_med - Current prices
  • is_in_stock, stock_status - Availability
  • consecutive_misses - OOS detection counter
  • last_seen_at - Last time product was in feed

store_product_snapshots

Point-in-time records for historical analysis:

  • One row per product per crawl
  • Captures price, stock, potency at that moment
  • Used for price history, analytics

dispensaries

Store metadata:

  • platform_dispensary_id - MongoDB ObjectId for GraphQL
  • menu_url - Source URL
  • last_crawl_at - Last successful crawl
  • crawl_enabled - Whether to crawl

Scheduling

Crawls are scheduled via worker_tasks table:

| Role | Frequency | Description |
|------|-----------|-------------|
| product_resync | Every 4 hours | Regular product refresh |
| entry_point_discovery | On-demand | New store setup |
| store_discovery | Daily | Find new stores |

Error Handling

  • GraphQL errors: Logged, task marked failed, retried later
  • Normalization errors: Logged as warnings, continue with valid products
  • Image download errors: Non-fatal, logged, continue
  • Database errors: Task fails, will be retried

Files

| File | Purpose |
|------|---------|
| src/tasks/handlers/product-resync.ts | Main crawl handler |
| src/tasks/handlers/entry-point-discovery.ts | Slug → ID resolution |
| src/platforms/dutchie/index.ts | GraphQL client, session management |
| src/hydration/normalizers/dutchie.ts | Payload normalization |
| src/hydration/canonical-upsert.ts | Database upsert logic |
| migrations/075_consecutive_misses.sql | OOS tracking column |