Crawl Pipeline Documentation
Overview
The crawl pipeline fetches product data from Dutchie dispensary menus and stores it in the canonical database. This document covers the complete flow from task scheduling to data storage.
Pipeline Stages
┌───────────────────────┐
│ store_discovery       │  Find new dispensaries
└───────────┬───────────┘
            │
            ▼
┌───────────────────────┐
│ entry_point_discovery │  Resolve slug → platform_dispensary_id
└───────────┬───────────┘
            │
            ▼
┌───────────────────────┐
│ product_discovery     │  Initial product crawl
└───────────┬───────────┘
            │
            ▼
┌───────────────────────┐
│ product_resync        │  Recurring crawl (every 4 hours)
└───────────────────────┘
Stage Details
1. Store Discovery
Purpose: Find new dispensaries to crawl
Handler: src/tasks/handlers/store-discovery.ts
Flow:
- Query Dutchie ConsumerDispensaries GraphQL for cities/states
- Extract dispensary info (name, address, menu_url)
- Insert into dutchie_discovery_locations
- Queue entry_point_discovery for each new location
2. Entry Point Discovery
Purpose: Resolve menu URL slug to platform_dispensary_id (MongoDB ObjectId)
Handler: src/tasks/handlers/entry-point-discovery.ts
Flow:
- Load dispensary from database
- Extract slug from menu_url: /embedded-menu/<slug> or /dispensary/<slug>
- Start stealth session (fingerprint + proxy)
- Query resolveDispensaryIdWithDetails(slug) via GraphQL
- Update dispensary with platform_dispensary_id
- Queue product_discovery task
Example:
menu_url: https://dutchie.com/embedded-menu/deeply-rooted
slug: deeply-rooted
platform_dispensary_id: 6405ef617056e8014d79101b
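The slug-extraction step can be sketched as a small pure function. This is a minimal illustration, not the handler's actual code: the name extractSlug is hypothetical, and it recognizes only the two path shapes listed in the flow above.

```typescript
// Hypothetical helper (not taken from entry-point-discovery.ts).
// Returns the slug for /embedded-menu/<slug> or /dispensary/<slug>,
// or null when the URL has neither shape.
function extractSlug(menuUrl: string): string | null {
  const match = menuUrl.match(
    /^https?:\/\/[^/]+\/(?:embedded-menu|dispensary)\/([^/?#]+)/,
  );
  return match?.[1] ?? null;
}

// Usage with the example above:
console.log(extractSlug("https://dutchie.com/embedded-menu/deeply-rooted")); // "deeply-rooted"
```

Returning null rather than throwing lets the caller decide whether an unrecognized menu_url should fail the task or just be logged.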
3. Product Discovery
Purpose: Initial crawl of a new dispensary
Handler: src/tasks/handlers/product-discovery.ts
Runs the same flow as product_resync (described in the next section), but for a dispensary's first crawl.
4. Product Resync
Purpose: Recurring crawl to capture price/stock changes
Handler: src/tasks/handlers/product-resync.ts
Flow:
Step 1: Load Dispensary Info
SELECT id, name, platform_dispensary_id, menu_url, state
FROM dispensaries
WHERE id = $1 AND crawl_enabled = true
Step 2: Start Stealth Session
- Generate random browser fingerprint
- Set locale/timezone matching state
- Optional proxy rotation
Step 3: Fetch Products via GraphQL
Endpoint: https://dutchie.com/api-3/graphql
Variables:
{
  includeEnterpriseSpecials: false,
  productsFilter: {
    dispensaryId: "<platform_dispensary_id>",
    pricingType: "rec",
    Status: "All",
    types: [],
    useCache: false,
    isDefaultSort: true,
    sortBy: "popularSortIdx",
    sortDirection: 1,
    bypassOnlineThresholds: true,
    isKioskMenu: false,
    removeProductsBelowOptionThresholds: false
  },
  page: 0,
  perPage: 100
}
Key Notes:
- Status: "All" returns all products (Active returns same count)
- Status: null returns 0 products (broken)
- pricingType: "rec" returns BOTH rec and med prices
- Paginate until products.length < perPage or allProducts.length >= totalCount
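The pagination rule in the last note can be sketched as the loop below. It is a minimal sketch under stated assumptions: fetchPage is a hypothetical stand-in for the real GraphQL call, the page result is assumed to expose products and totalCount together, and the maxPages cap is an added safety guard that the source does not mention.

```typescript
// Only the fields the loop needs; the real response shape may differ.
interface ProductPage<T> {
  products: T[];
  totalCount: number;
}

// Hypothetical signature standing in for the GraphQL request.
type FetchPage<T> = (page: number, perPage: number) => Promise<ProductPage<T>>;

async function fetchAllProducts<T>(
  fetchPage: FetchPage<T>,
  perPage = 100,
  maxPages = 200, // assumption: guard against a feed that never terminates
): Promise<T[]> {
  const allProducts: T[] = [];
  for (let page = 0; page < maxPages; page++) {
    const { products, totalCount } = await fetchPage(page, perPage);
    allProducts.push(...products);
    // Stop on a short page, or once everything reported has been collected.
    if (products.length < perPage || allProducts.length >= totalCount) break;
  }
  return allProducts;
}
```

Checking both conditions avoids one extra request when the total is an exact multiple of perPage.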
Step 4: Normalize Data
Transform raw Dutchie payload to canonical format via DutchieNormalizer.
Step 5: Upsert Products
Insert/update store_products table with normalized data.
Step 6: Create Snapshots
Insert point-in-time record to store_product_snapshots.
Step 7: Track Missing Products (OOS Detection)
-- Reset consecutive_misses for products IN the feed
UPDATE store_products
SET consecutive_misses = 0, last_seen_at = NOW()
WHERE dispensary_id = $1
  AND provider = 'dutchie'
  AND provider_product_id = ANY($2);

-- Increment for products NOT in feed
UPDATE store_products
SET consecutive_misses = consecutive_misses + 1
WHERE dispensary_id = $1
  AND provider = 'dutchie'
  AND provider_product_id NOT IN (...)
  AND consecutive_misses < 3;

-- Mark OOS at 3 consecutive misses
UPDATE store_products
SET stock_status = 'oos', is_in_stock = false
WHERE dispensary_id = $1
  AND consecutive_misses >= 3
  AND stock_status != 'oos';
Step 8: Download Images
For new products, download and store images locally.
Step 9: Update Dispensary
UPDATE dispensaries SET last_crawl_at = NOW() WHERE id = $1
GraphQL Payload Structure
Product Fields (from filteredProducts.products[])
| Field | Type | Description |
|---|---|---|
| _id / id | string | MongoDB ObjectId (24 hex chars) |
| Name | string | Product display name |
| brandName | string | Brand name |
| brand.name | string | Brand name (nested) |
| brand.description | string | Brand description |
| type | string | Category (Flower, Edible, Concentrate, etc.) |
| subcategory | string | Subcategory |
| strainType | string | Hybrid, Indica, Sativa, N/A |
| Status | string | Always "Active" in feed |
| Image | string | Primary image URL |
| images[] | array | All product images |
Pricing Fields
| Field | Type | Description |
|---|---|---|
| Prices[] | number[] | Rec prices per option |
| recPrices[] | number[] | Rec prices |
| medicalPrices[] | number[] | Medical prices |
| recSpecialPrices[] | number[] | Rec sale prices |
| medicalSpecialPrices[] | number[] | Medical sale prices |
| Options[] | string[] | Size options ("1/8oz", "1g", etc.) |
| rawOptions[] | string[] | Raw weight options ("3.5g") |
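To make the pricing fields concrete, the sketch below pairs each size option with its prices. It assumes the price arrays are index-aligned with Options[] (suggested by "Rec prices per option" but not confirmed here); the type and function names are hypothetical, and only a subset of the payload fields is modeled.

```typescript
// Partial, illustrative view of one entry in filteredProducts.products[].
interface DutchieProductSketch {
  _id: string;
  Name: string;
  Options?: string[];
  recPrices?: number[];
  medicalPrices?: number[];
  recSpecialPrices?: number[];
}

interface OptionPrice {
  option: string;
  rec: number | null;
  med: number | null;
  recSpecial: number | null;
}

// Assumption: element i of each price array belongs to Options[i].
// Missing entries become null instead of being guessed.
function pricesByOption(p: DutchieProductSketch): OptionPrice[] {
  return (p.Options ?? []).map((option, i) => ({
    option,
    rec: p.recPrices?.[i] ?? null,
    med: p.medicalPrices?.[i] ?? null,
    recSpecial: p.recSpecialPrices?.[i] ?? null,
  }));
}
```

If the arrays turn out not to be aligned for some products, a normalizer would need a different join key, for example the option field in POSMetaData.children[].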
Inventory Fields (POSMetaData.children[])
| Field | Type | Description |
|---|---|---|
| quantity | number | Total inventory count |
| quantityAvailable | number | Available for online orders |
| kioskQuantityAvailable | number | Available for kiosk orders |
| option | string | Which size option this is for |
Potency Fields
| Field | Type | Description |
|---|---|---|
| THCContent.range[] | number[] | THC percentage |
| CBDContent.range[] | number[] | CBD percentage |
| cannabinoidsV2[] | array | Detailed cannabinoid breakdown |
Specials (specialData.bogoSpecials[])
| Field | Type | Description |
|---|---|---|
| specialName | string | Deal name |
| specialType | string | "bogo", "sale", etc. |
| itemsForAPrice.value | string | Bundle price |
| bogoRewards[].totalQuantity.quantity | number | Required quantity |
OOS Detection Logic
Products disappear from the Dutchie feed when they go out of stock. We track this via consecutive_misses:
| Scenario | Action |
|---|---|
| Product in feed | consecutive_misses = 0 |
| Product missing 1st time | consecutive_misses = 1 |
| Product missing 2nd time | consecutive_misses = 2 |
| Product missing 3rd time | consecutive_misses = 3, mark stock_status = 'oos' |
| Product returns to feed | consecutive_misses = 0, update stock_status |
Why 3 misses?
- Protects against false positives from crawl failures
- Single bad crawl doesn't trigger mass OOS alerts
- Balances detection speed vs accuracy
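The counter logic can also be modeled in memory, which makes the state transitions easy to test. This is a sketch mirroring the SQL in Step 7, not the handler's implementation: the type, the function name, and the "in_stock" status value used in examples are assumptions.

```typescript
interface TrackedProduct {
  providerProductId: string;
  consecutiveMisses: number;
  stockStatus: string; // only 'oos' is confirmed by the source
  isInStock: boolean;
}

const OOS_THRESHOLD = 3;

// Apply one crawl's result: seenIds holds the product IDs present in the feed.
function applyCrawlResult(
  products: TrackedProduct[],
  seenIds: Set<string>,
): TrackedProduct[] {
  return products.map((p) => {
    if (seenIds.has(p.providerProductId)) {
      // In feed: reset the counter. Stock status itself is refreshed
      // by the upsert step, so it is left untouched here.
      return { ...p, consecutiveMisses: 0 };
    }
    // Missing: increment, capped at the threshold like the SQL guard.
    const misses = Math.min(p.consecutiveMisses + 1, OOS_THRESHOLD);
    if (misses >= OOS_THRESHOLD) {
      return { ...p, consecutiveMisses: misses, stockStatus: "oos", isInStock: false };
    }
    return { ...p, consecutiveMisses: misses };
  });
}
```

With a 4-hour resync interval, a product that truly sells out is marked OOS on its third consecutive missed crawl, roughly 8 to 12 hours after it was last seen.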
Database Tables
store_products
Current state of each product:
- provider_product_id - Dutchie's MongoDB ObjectId
- name_raw, brand_name_raw - Raw values from feed
- price_rec, price_med - Current prices
- is_in_stock, stock_status - Availability
- consecutive_misses - OOS detection counter
- last_seen_at - Last time product was in feed
store_product_snapshots
Point-in-time records for historical analysis:
- One row per product per crawl
- Captures price, stock, potency at that moment
- Used for price history, analytics
dispensaries
Store metadata:
- platform_dispensary_id - MongoDB ObjectId for GraphQL
- menu_url - Source URL
- last_crawl_at - Last successful crawl
- crawl_enabled - Whether to crawl
Scheduling
Crawls are scheduled via worker_tasks table:
| Role | Frequency | Description |
|---|---|---|
| product_resync | Every 4 hours | Regular product refresh |
| entry_point_discovery | On-demand | New store setup |
| store_discovery | Daily | Find new stores |
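As an illustration of the 4-hour cadence, the check below decides whether a dispensary is due for a product_resync based on its last_crawl_at. It is a hypothetical sketch only: the real scheduler works through worker_tasks rows, and how it selects due stores is not described here.

```typescript
// "Every 4 hours" from the table above, in milliseconds.
const RESYNC_INTERVAL_MS = 4 * 60 * 60 * 1000;

// Hypothetical helper: lastCrawlAt corresponds to dispensaries.last_crawl_at,
// with null meaning the store has never been crawled.
function isResyncDue(lastCrawlAt: Date | null, now: Date): boolean {
  if (lastCrawlAt === null) return true;
  return now.getTime() - lastCrawlAt.getTime() >= RESYNC_INTERVAL_MS;
}
```

Passing now as a parameter, rather than reading the clock inside the function, keeps the check deterministic in tests.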
Error Handling
- GraphQL errors: Logged, task marked failed, retried later
- Normalization errors: Logged as warnings, continue with valid products
- Image download errors: Non-fatal, logged, continue
- Database errors: Task fails, will be retried
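The "log a warning and continue" policy for normalization errors can be sketched as follows. The function and type names are hypothetical, and normalizeOne stands in for whatever per-product logic DutchieNormalizer applies.

```typescript
interface NormalizeOutcome<T> {
  valid: T[];
  warnings: string[];
}

// Normalize each raw product independently so that one malformed item
// does not discard the rest of the crawl.
function normalizeAll<R, T>(
  raw: R[],
  normalizeOne: (item: R) => T, // hypothetical per-product normalizer
): NormalizeOutcome<T> {
  const valid: T[] = [];
  const warnings: string[] = [];
  raw.forEach((item, index) => {
    try {
      valid.push(normalizeOne(item));
    } catch (err) {
      const reason = err instanceof Error ? err.message : String(err);
      warnings.push(`product[${index}] skipped: ${reason}`);
    }
  });
  return { valid, warnings };
}
```

GraphQL and database errors are deliberately not caught at this level, since per the list above they should fail the whole task and trigger a retry.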
Files
| File | Purpose |
|---|---|
| src/tasks/handlers/product-resync.ts | Main crawl handler |
| src/tasks/handlers/entry-point-discovery.ts | Slug → ID resolution |
| src/platforms/dutchie/index.ts | GraphQL client, session management |
| src/hydration/normalizers/dutchie.ts | Payload normalization |
| src/hydration/canonical-upsert.ts | Database upsert logic |
| migrations/075_consecutive_misses.sql | OOS tracking column |