# Crawl Pipeline Documentation

## Overview

The crawl pipeline fetches product data from Dutchie dispensary menus and stores it in the canonical database. This document covers the complete flow from task scheduling to data storage.

---

## Pipeline Stages

```
┌───────────────────────┐
│    store_discovery    │  Find new dispensaries
└───────────┬───────────┘
            │
            ▼
┌───────────────────────┐
│ entry_point_discovery │  Resolve slug → platform_dispensary_id
└───────────┬───────────┘
            │
            ▼
┌───────────────────────┐
│   product_discovery   │  Initial product crawl
└───────────┬───────────┘
            │
            ▼
┌───────────────────────┐
│    product_resync     │  Recurring crawl (every 4 hours)
└───────────────────────┘
```

---

## Stage Details

### 1. Store Discovery

**Purpose:** Find new dispensaries to crawl

**Handler:** `src/tasks/handlers/store-discovery.ts`

**Flow:**

1. Query the Dutchie `ConsumerDispensaries` GraphQL operation for cities/states
2. Extract dispensary info (name, address, menu_url)
3. Insert into `dutchie_discovery_locations`
4. Queue `entry_point_discovery` for each new location

---

### 2. Entry Point Discovery

**Purpose:** Resolve the menu URL slug to a `platform_dispensary_id` (MongoDB ObjectId)

**Handler:** `src/tasks/handlers/entry-point-discovery.ts`

**Flow:**

1. Load the dispensary from the database
2. Extract the slug from `menu_url` (`/embedded-menu/` or `/dispensary/` paths)
3. Start a stealth session (fingerprint + proxy)
4. Query `resolveDispensaryIdWithDetails(slug)` via GraphQL
5. Update the dispensary with `platform_dispensary_id`
6. Queue a `product_discovery` task

**Example:**

```
menu_url:               https://dutchie.com/embedded-menu/deeply-rooted
slug:                   deeply-rooted
platform_dispensary_id: 6405ef617056e8014d79101b
```

---

### 3. Product Discovery

**Purpose:** Initial crawl of a new dispensary

**Handler:** `src/tasks/handlers/product-discovery.ts`

Same flow as `product_resync`, but for first-time crawls.

---

### 4. Product Resync

**Purpose:** Recurring crawl to capture price and stock changes

**Handler:** `src/tasks/handlers/product-resync.ts`

**Flow:**

#### Step 1: Load Dispensary Info

```sql
SELECT id, name, platform_dispensary_id, menu_url, state
FROM dispensaries
WHERE id = $1 AND crawl_enabled = true
```

#### Step 2: Start Stealth Session

- Generate a random browser fingerprint
- Set locale/timezone matching the dispensary's state
- Optional proxy rotation

#### Step 3: Fetch Products via GraphQL

**Endpoint:** `https://dutchie.com/api-3/graphql`

**Variables:**

```javascript
{
  includeEnterpriseSpecials: false,
  productsFilter: {
    dispensaryId: "",
    pricingType: "rec",
    Status: "All",
    types: [],
    useCache: false,
    isDefaultSort: true,
    sortBy: "popularSortIdx",
    sortDirection: 1,
    bypassOnlineThresholds: true,
    isKioskMenu: false,
    removeProductsBelowOptionThresholds: false
  },
  page: 0,
  perPage: 100
}
```

**Key Notes:**

- `Status: "All"` returns all products (`"Active"` returns the same count)
- `Status: null` returns 0 products (broken)
- `pricingType: "rec"` returns BOTH rec and med prices
- Paginate until `products.length < perPage` or `allProducts.length >= totalCount`

#### Step 4: Normalize Data

Transform the raw Dutchie payload to canonical format via `DutchieNormalizer`.

#### Step 5: Upsert Products

Insert/update the `store_products` table with the normalized data.

#### Step 6: Create Snapshots

Insert a point-in-time record into `store_product_snapshots`.

#### Step 7: Track Missing Products (OOS Detection)

```sql
-- Reset consecutive_misses for products IN the feed
UPDATE store_products
SET consecutive_misses = 0, last_seen_at = NOW()
WHERE dispensary_id = $1
  AND provider = 'dutchie'
  AND provider_product_id = ANY($2)

-- Increment for products NOT in feed
UPDATE store_products
SET consecutive_misses = consecutive_misses + 1
WHERE dispensary_id = $1
  AND provider = 'dutchie'
  AND provider_product_id NOT IN (...)
  AND consecutive_misses < 3

-- Mark OOS at 3 consecutive misses
UPDATE store_products
SET stock_status = 'oos', is_in_stock = false
WHERE dispensary_id = $1
  AND consecutive_misses >= 3
  AND stock_status != 'oos'
```

#### Step 8: Download Images

For new products, download and store images locally.

#### Step 9: Update Dispensary

```sql
UPDATE dispensaries SET last_crawl_at = NOW() WHERE id = $1
```

---

## GraphQL Payload Structure

### Product Fields (from `filteredProducts.products[]`)

| Field | Type | Description |
|-------|------|-------------|
| `_id` / `id` | string | MongoDB ObjectId (24 hex chars) |
| `Name` | string | Product display name |
| `brandName` | string | Brand name |
| `brand.name` | string | Brand name (nested) |
| `brand.description` | string | Brand description |
| `type` | string | Category (Flower, Edible, Concentrate, etc.) |
| `subcategory` | string | Subcategory |
| `strainType` | string | Hybrid, Indica, Sativa, N/A |
| `Status` | string | Always "Active" in feed |
| `Image` | string | Primary image URL |
| `images[]` | array | All product images |

### Pricing Fields

| Field | Type | Description |
|-------|------|-------------|
| `Prices[]` | number[] | Rec prices per option |
| `recPrices[]` | number[] | Rec prices |
| `medicalPrices[]` | number[] | Medical prices |
| `recSpecialPrices[]` | number[] | Rec sale prices |
| `medicalSpecialPrices[]` | number[] | Medical sale prices |
| `Options[]` | string[] | Size options ("1/8oz", "1g", etc.) |
| `rawOptions[]` | string[] | Raw weight options ("3.5g") |

### Inventory Fields (`POSMetaData.children[]`)

| Field | Type | Description |
|-------|------|-------------|
| `quantity` | number | Total inventory count |
| `quantityAvailable` | number | Available for online orders |
| `kioskQuantityAvailable` | number | Available for kiosk orders |
| `option` | string | Which size option this is for |

### Potency Fields

| Field | Type | Description |
|-------|------|-------------|
| `THCContent.range[]` | number[] | THC percentage |
| `CBDContent.range[]` | number[] | CBD percentage |
| `cannabinoidsV2[]` | array | Detailed cannabinoid breakdown |

### Specials (`specialData.bogoSpecials[]`)

| Field | Type | Description |
|-------|------|-------------|
| `specialName` | string | Deal name |
| `specialType` | string | "bogo", "sale", etc. |
| `itemsForAPrice.value` | string | Bundle price |
| `bogoRewards[].totalQuantity.quantity` | number | Required quantity |

---

## OOS Detection Logic

Products disappear from the Dutchie feed when they go out of stock.
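The miss-counting rules applied by the SQL in Step 7 can be sketched in TypeScript. This is an illustrative in-memory model, not the actual handler; `TrackedProduct`, `applyCrawl`, and `MISS_THRESHOLD` are hypothetical names, and the real pipeline applies the same logic in SQL against `store_products`.

```typescript
const MISS_THRESHOLD = 3; // matches the "3 consecutive misses" rule

interface TrackedProduct {
  providerProductId: string;
  consecutiveMisses: number;
  stockStatus: "in_stock" | "oos";
}

// Apply one crawl's feed to the tracked products and return the updated set.
function applyCrawl(
  products: TrackedProduct[],
  feedIds: Set<string>,
): TrackedProduct[] {
  return products.map((p) => {
    if (feedIds.has(p.providerProductId)) {
      // Product seen in the feed: reset the counter, restore stock status.
      return { ...p, consecutiveMisses: 0, stockStatus: "in_stock" as const };
    }
    const misses = p.consecutiveMisses + 1;
    return {
      ...p,
      consecutiveMisses: misses,
      // Mark OOS only once the threshold is reached.
      stockStatus: misses >= MISS_THRESHOLD ? ("oos" as const) : p.stockStatus,
    };
  });
}
```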
We track this via `consecutive_misses`:

| Scenario | Action |
|----------|--------|
| Product in feed | `consecutive_misses = 0` |
| Product missing 1st time | `consecutive_misses = 1` |
| Product missing 2nd time | `consecutive_misses = 2` |
| Product missing 3rd time | `consecutive_misses = 3`, mark `stock_status = 'oos'` |
| Product returns to feed | `consecutive_misses = 0`, update `stock_status` |

**Why 3 misses?**

- Protects against false positives from crawl failures
- A single bad crawl doesn't trigger mass OOS alerts
- Balances detection speed vs accuracy

---

## Database Tables

### store_products

Current state of each product:

- `provider_product_id` - Dutchie's MongoDB ObjectId
- `name_raw`, `brand_name_raw` - Raw values from the feed
- `price_rec`, `price_med` - Current prices
- `is_in_stock`, `stock_status` - Availability
- `consecutive_misses` - OOS detection counter
- `last_seen_at` - Last time the product was in the feed

### store_product_snapshots

Point-in-time records for historical analysis:

- One row per product per crawl
- Captures price, stock, and potency at that moment
- Used for price history and analytics

### dispensaries

Store metadata:

- `platform_dispensary_id` - MongoDB ObjectId used for GraphQL
- `menu_url` - Source URL
- `last_crawl_at` - Last successful crawl
- `crawl_enabled` - Whether to crawl

---

## Scheduling

Crawls are scheduled via the `worker_tasks` table:

| Role | Frequency | Description |
|------|-----------|-------------|
| `product_resync` | Every 4 hours | Regular product refresh |
| `entry_point_discovery` | On-demand | New store setup |
| `store_discovery` | Daily | Find new stores |

---

## Error Handling

- **GraphQL errors:** Logged, task marked failed, retried later
- **Normalization errors:** Logged as warnings; continue with valid products
- **Image download errors:** Non-fatal; logged, continue
- **Database errors:** Task fails and will be retried

---

## Files

| File | Purpose |
|------|---------|
| `src/tasks/handlers/product-resync.ts` | Main crawl handler |
| `src/tasks/handlers/entry-point-discovery.ts` | Slug → ID resolution |
| `src/platforms/dutchie/index.ts` | GraphQL client, session management |
| `src/hydration/normalizers/dutchie.ts` | Payload normalization |
| `src/hydration/canonical-upsert.ts` | Database upsert logic |
| `migrations/075_consecutive_misses.sql` | OOS tracking column |
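---

## Appendix: Pagination Sketch

The Step 3 pagination rule ("paginate until `products.length < perPage` or `allProducts.length >= totalCount`") can be sketched as follows. `fetchPage` is a hypothetical stand-in for the real GraphQL client in `src/platforms/dutchie/index.ts`; this is a minimal sketch, not the actual implementation.

```typescript
interface ProductPage {
  products: { id: string }[];
  totalCount: number;
}

// Hypothetical page fetcher: in the real client this issues the
// filteredProducts GraphQL query with the variables shown in Step 3.
type FetchPage = (page: number, perPage: number) => Promise<ProductPage>;

async function fetchAllProducts(
  fetchPage: FetchPage,
  perPage = 100,
): Promise<{ id: string }[]> {
  const allProducts: { id: string }[] = [];
  for (let page = 0; ; page++) {
    const { products, totalCount } = await fetchPage(page, perPage);
    allProducts.push(...products);
    // Stop on a short page, or once the reported total has been collected.
    if (products.length < perPage || allProducts.length >= totalCount) break;
  }
  return allProducts;
}
```

The short-page check also terminates the loop when the feed is empty or when `totalCount` over-reports, so a single bad page count cannot cause an infinite loop.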