# Crawl Pipeline Documentation

## Overview

The crawl pipeline fetches product data from Dutchie dispensary menus and stores it in the canonical database. This document covers the complete flow from task scheduling to data storage.

---

## Pipeline Stages

```
┌─────────────────────┐
│   store_discovery   │  Find new dispensaries
└─────────┬───────────┘
          │
          ▼
┌─────────────────────┐
│entry_point_discovery│  Resolve slug → platform_dispensary_id
└─────────┬───────────┘
          │
          ▼
┌─────────────────────┐
│  product_discovery  │  Initial product crawl
└─────────┬───────────┘
          │
          ▼
┌─────────────────────┐
│   product_resync    │  Recurring crawl (every 4 hours)
└─────────────────────┘
```

---

## Stage Details

### 1. Store Discovery

**Purpose:** Find new dispensaries to crawl
**Handler:** `src/tasks/handlers/store-discovery.ts`

**Flow:**
1. Query the Dutchie `ConsumerDispensaries` GraphQL operation for cities/states
2. Extract dispensary info (name, address, menu_url)
3. Insert into `dutchie_discovery_locations`
4. Queue `entry_point_discovery` for each new location

---

### 2. Entry Point Discovery

**Purpose:** Resolve the menu URL slug to a `platform_dispensary_id` (MongoDB ObjectId)
**Handler:** `src/tasks/handlers/entry-point-discovery.ts`

**Flow:**
1. Load the dispensary from the database
2. Extract the slug from `menu_url`:
   - `/embedded-menu/` or `/dispensary/` path prefixes
3. Start a stealth session (fingerprint + proxy)
4. Query `resolveDispensaryIdWithDetails(slug)` via GraphQL
5. Update the dispensary with `platform_dispensary_id`
6. Queue a `product_discovery` task

**Example:**
```
menu_url:                https://dutchie.com/embedded-menu/deeply-rooted
slug:                    deeply-rooted
platform_dispensary_id:  6405ef617056e8014d79101b
```

---

### 3. Product Discovery

**Purpose:** Initial crawl of a new dispensary
**Handler:** `src/tasks/handlers/product-discovery.ts`

Same as `product_resync`, but for first-time crawls.

---

### 4. Product Resync

**Purpose:** Recurring crawl to capture price/stock changes
**Handler:** `src/tasks/handlers/product-resync.ts`

**Flow:**

#### Step 1: Load Dispensary Info

```sql
SELECT id, name, platform_dispensary_id, menu_url, state
FROM dispensaries
WHERE id = $1 AND crawl_enabled = true
```

#### Step 2: Start Stealth Session

- Generate a random browser fingerprint
- Set locale/timezone matching the state
- Optional proxy rotation

#### Step 3: Fetch Products via GraphQL

**Endpoint:** `https://dutchie.com/api-3/graphql`

**Variables:**
```javascript
{
  includeEnterpriseSpecials: false,
  productsFilter: {
    dispensaryId: "<platform_dispensary_id>",
    pricingType: "rec",
    Status: "All",
    types: [],
    useCache: false,
    isDefaultSort: true,
    sortBy: "popularSortIdx",
    sortDirection: 1,
    bypassOnlineThresholds: true,
    isKioskMenu: false,
    removeProductsBelowOptionThresholds: false
  },
  page: 0,
  perPage: 100
}
```

**Key Notes:**
- `Status: "All"` returns all products (`"Active"` returns the same count)
- `Status: null` returns 0 products (broken)
- `pricingType: "rec"` returns BOTH rec and med prices
- Paginate until `products.length < perPage` or `allProducts.length >= totalCount`

#### Step 4: Normalize Data

Transform the raw Dutchie payload to the canonical format via `DutchieNormalizer`.

#### Step 5: Upsert Products

Insert/update the `store_products` table with normalized data.

#### Step 6: Create Snapshots

Insert a point-in-time record into `store_product_snapshots`.

#### Step 7: Track Missing Products (OOS Detection)

```sql
-- Reset consecutive_misses for products IN the feed
UPDATE store_products
SET consecutive_misses = 0, last_seen_at = NOW()
WHERE dispensary_id = $1
  AND provider = 'dutchie'
  AND provider_product_id = ANY($2);

-- Increment for products NOT in the feed
UPDATE store_products
SET consecutive_misses = consecutive_misses + 1
WHERE dispensary_id = $1
  AND provider = 'dutchie'
  AND provider_product_id NOT IN (...)
  AND consecutive_misses < 3;

-- Mark OOS at 3 consecutive misses
UPDATE store_products
SET stock_status = 'oos', is_in_stock = false
WHERE dispensary_id = $1
  AND consecutive_misses >= 3
  AND stock_status != 'oos';
```

#### Step 8: Download Images

For new products, download and store images locally.

#### Step 9: Update Dispensary

```sql
UPDATE dispensaries SET last_crawl_at = NOW() WHERE id = $1
```

---

## GraphQL Payload Structure

### Product Fields (from `filteredProducts.products[]`)

| Field | Type | Description |
|-------|------|-------------|
| `_id` / `id` | string | MongoDB ObjectId (24 hex chars) |
| `Name` | string | Product display name |
| `brandName` | string | Brand name |
| `brand.name` | string | Brand name (nested) |
| `brand.description` | string | Brand description |
| `type` | string | Category (Flower, Edible, Concentrate, etc.) |
| `subcategory` | string | Subcategory |
| `strainType` | string | Hybrid, Indica, Sativa, N/A |
| `Status` | string | Always "Active" in feed |
| `Image` | string | Primary image URL |
| `images[]` | array | All product images |

### Pricing Fields

| Field | Type | Description |
|-------|------|-------------|
| `Prices[]` | number[] | Rec prices per option |
| `recPrices[]` | number[] | Rec prices |
| `medicalPrices[]` | number[] | Medical prices |
| `recSpecialPrices[]` | number[] | Rec sale prices |
| `medicalSpecialPrices[]` | number[] | Medical sale prices |
| `Options[]` | string[] | Size options ("1/8oz", "1g", etc.) |
| `rawOptions[]` | string[] | Raw weight options ("3.5g") |

### Inventory Fields (`POSMetaData.children[]`)

| Field | Type | Description |
|-------|------|-------------|
| `quantity` | number | Total inventory count |
| `quantityAvailable` | number | Available for online orders |
| `kioskQuantityAvailable` | number | Available for kiosk orders |
| `option` | string | Which size option this row is for |

### Potency Fields

| Field | Type | Description |
|-------|------|-------------|
| `THCContent.range[]` | number[] | THC percentage |
| `CBDContent.range[]` | number[] | CBD percentage |
| `cannabinoidsV2[]` | array | Detailed cannabinoid breakdown |

### Specials (`specialData.bogoSpecials[]`)

| Field | Type | Description |
|-------|------|-------------|
| `specialName` | string | Deal name |
| `specialType` | string | "bogo", "sale", etc. |
| `itemsForAPrice.value` | string | Bundle price |
| `bogoRewards[].totalQuantity.quantity` | number | Required quantity |

---

## OOS Detection Logic

Products disappear from the Dutchie feed when they go out of stock.
We track this via `consecutive_misses`:

| Scenario | Action |
|----------|--------|
| Product in feed | `consecutive_misses = 0` |
| Product missing 1st time | `consecutive_misses = 1` |
| Product missing 2nd time | `consecutive_misses = 2` |
| Product missing 3rd time | `consecutive_misses = 3`, mark `stock_status = 'oos'` |
| Product returns to feed | `consecutive_misses = 0`, update `stock_status` |

**Why 3 misses?**
- Protects against false positives from crawl failures
- A single bad crawl doesn't trigger mass OOS alerts
- Balances detection speed vs. accuracy

---

## Database Tables

### store_products

Current state of each product:
- `provider_product_id` - Dutchie's MongoDB ObjectId
- `name_raw`, `brand_name_raw` - Raw values from the feed
- `price_rec`, `price_med` - Current prices
- `is_in_stock`, `stock_status` - Availability
- `consecutive_misses` - OOS detection counter
- `last_seen_at` - Last time the product was in the feed

### store_product_snapshots

Point-in-time records for historical analysis:
- One row per product per crawl
- Captures price, stock, and potency at that moment
- Used for price history and analytics

### dispensaries

Store metadata:
- `platform_dispensary_id` - MongoDB ObjectId for GraphQL
- `menu_url` - Source URL
- `last_crawl_at` - Last successful crawl
- `crawl_enabled` - Whether to crawl

---

## Worker Roles

Workers pull tasks from the `worker_tasks` queue based on their assigned role.

| Role | Name | Description | Handler |
|------|------|-------------|---------|
| `product_resync` | Product Resync | Re-crawl dispensary products for price/stock changes | `handleProductResync` |
| `product_discovery` | Product Discovery | Initial product discovery for new dispensaries | `handleProductDiscovery` |
| `store_discovery` | Store Discovery | Discover new dispensary locations | `handleStoreDiscovery` |
| `entry_point_discovery` | Entry Point Discovery | Resolve platform IDs from menu URLs | `handleEntryPointDiscovery` |
| `analytics_refresh` | Analytics Refresh | Refresh materialized views and analytics | `handleAnalyticsRefresh` |

**API Endpoint:** `GET /api/worker-registry/roles`

---

## Scheduling

Crawls are scheduled via the `worker_tasks` table:

| Role | Frequency | Description |
|------|-----------|-------------|
| `product_resync` | Every 4 hours | Regular product refresh |
| `product_discovery` | On-demand | First crawl for new stores |
| `entry_point_discovery` | On-demand | New store setup |
| `store_discovery` | Daily | Find new stores |
| `analytics_refresh` | Daily | Refresh analytics materialized views |

---

## Priority & On-Demand Tasks

Tasks are claimed by workers in order of **priority DESC, created_at ASC**.
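
That ordering can be expressed as a comparator. This is a sketch only; the `Task` shape below is illustrative and much smaller than the real `worker_tasks` row:

```typescript
// Illustrative task shape -- the real worker_tasks row has more columns.
interface Task {
  id: number;
  priority: number;   // higher claims first
  created_at: string; // ISO timestamp; earlier claims first within a priority
}

// Comparator matching the claim order: priority DESC, created_at ASC.
function claimOrder(a: Task, b: Task): number {
  if (a.priority !== b.priority) return b.priority - a.priority;
  return Date.parse(a.created_at) - Date.parse(b.created_at);
}

const queue: Task[] = [
  { id: 1, priority: 0,  created_at: '2025-01-01T00:00:00Z' },
  { id: 2, priority: 10, created_at: '2025-01-01T02:00:00Z' },
  { id: 3, priority: 10, created_at: '2025-01-01T01:00:00Z' },
];

// Task 3 wins: highest priority, earliest created_at within that priority.
const next = [...queue].sort(claimOrder)[0];
console.log(next.id); // 3
```

In production this ordering lives in SQL (the `claim_task()` function below), not in application code; the comparator just makes the tie-breaking explicit.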
### Priority Levels

| Priority | Use Case | Example |
|----------|----------|---------|
| 0 | Scheduled/batch tasks | Daily product_resync generation |
| 10 | On-demand/chained tasks | entry_point → product_discovery |
| Higher | Urgent/manual triggers | Admin-triggered immediate crawl |

### Task Chaining

When a task completes, the system automatically creates follow-up tasks:

```
store_discovery (completed)
  └─► entry_point_discovery (priority: 10) for each new store

entry_point_discovery (completed, success)
  └─► product_discovery (priority: 10) for that store

product_discovery (completed)
  └─► [no chain] Store enters the regular resync schedule
```

### On-Demand Task Creation

Use the task service to create high-priority tasks:

```typescript
// Create an immediate product resync for a store
await taskService.createTask({
  role: 'product_resync',
  dispensary_id: 123,
  platform: 'dutchie',
  priority: 20, // Higher than batch tasks
});

// Convenience methods with default high priority (10)
await taskService.createEntryPointTask(dispensaryId, 'dutchie');
await taskService.createProductDiscoveryTask(dispensaryId, 'dutchie');
await taskService.createStoreDiscoveryTask('dutchie', 'AZ');
```

### Claim Function

The `claim_task()` SQL function atomically claims tasks:
- Respects priority ordering (higher = first)
- Uses `FOR UPDATE SKIP LOCKED` for concurrency
- Prevents multiple active tasks per store

---

## Image Storage

Images are downloaded from Dutchie's AWS S3 and stored locally with on-demand resizing.
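
As a sketch of how a caller might request a resized variant, the helper below builds an `/img` proxy URL from a stored `local_image_path`. The `buildImageUrl` function is hypothetical (not part of the codebase); the parameter names follow the proxy API documented below:

```typescript
// Hypothetical helper -- builds an /img proxy URL from a local_image_path
// (stored relative to /storage/images) plus resize options.
type ImageOpts = { w?: number; h?: number; q?: number; format?: 'webp' | 'jpeg' | 'png' | 'avif' };

function buildImageUrl(localPath: string, opts: ImageOpts = {}): string {
  const params = new URLSearchParams();
  if (opts.w) params.set('w', String(opts.w));
  if (opts.h) params.set('h', String(opts.h));
  if (opts.q) params.set('q', String(opts.q));
  if (opts.format) params.set('format', opts.format);
  const query = params.toString();
  return `/img/${localPath}${query ? `?${query}` : ''}`;
}

console.log(buildImageUrl(
  'products/az/az-deeply-rooted/bud-bros/6913e3cd444eac3935e928b9/image-ae38b1f9.webp',
  { w: 200 },
));
// → /img/products/az/az-deeply-rooted/bud-bros/6913e3cd444eac3935e928b9/image-ae38b1f9.webp?w=200
```
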
### Storage Path

```
/storage/images/products/<state>/<store>/<brand>/<product_id>/image-<hash>.webp
/storage/images/brands/<brand>/logo-<hash>.webp
```

**Example:**
```
/storage/images/products/az/az-deeply-rooted/bud-bros/6913e3cd444eac3935e928b9/image-ae38b1f9.webp
```

### Image Proxy API

Served via `/img/*` with on-demand resizing using **sharp**:

```
GET /img/products/az/az-deeply-rooted/bud-bros/6913e3cd444eac3935e928b9/image-ae38b1f9.webp?w=200
```

| Param | Description |
|-------|-------------|
| `w` | Width in pixels (max 4000) |
| `h` | Height in pixels (max 4000) |
| `q` | Quality 1-100 (default 80) |
| `fit` | cover, contain, fill, inside, outside |
| `blur` | Blur sigma (0.3-1000) |
| `gray` | Grayscale (1 = enabled) |
| `format` | webp, jpeg, png, avif (default webp) |

### Key Files

| File | Purpose |
|------|---------|
| `src/utils/image-storage.ts` | Download & save images to the local filesystem |
| `src/routes/image-proxy.ts` | On-demand resize/transform at `/img/*` |

### Download Rules

| Scenario | Image Action |
|----------|--------------|
| **New product (first crawl)** | Download if `primaryImageUrl` exists |
| **Existing product (refresh)** | Download only if `local_image_path` is NULL (backfill) |
| **Product already has local image** | Skip download entirely |

**Logic:**
- Images are downloaded **once** and never re-downloaded on subsequent crawls
- `skipIfExists: true` - a filesystem check prevents re-download even if queued
- First crawl: all products get images
- Refresh crawl: only new products or products missing local images

### Storage Rules

- **NO MinIO** - local filesystem only (`STORAGE_DRIVER=local`)
- Store full resolution; resize on-demand via the `/img` proxy
- Convert to webp for consistency using **sharp**
- Preserve the original Dutchie URL as a fallback in the `image_url` column
- Store the local path in the `local_image_path` column

---

## Stealth & Anti-Detection

**PROXIES ARE REQUIRED** - Workers will fail to start if no active proxies are available in the database.

All HTTP requests to Dutchie go through a proxy. Workers automatically initialize the anti-detection systems on startup.

### Components

| Component | Purpose | Source |
|-----------|---------|--------|
| **CrawlRotator** | Coordinates proxy + UA rotation | `src/services/crawl-rotator.ts` |
| **ProxyRotator** | Round-robin proxy selection, health tracking | `src/services/crawl-rotator.ts` |
| **UserAgentRotator** | Cycles through realistic browser fingerprints | `src/services/crawl-rotator.ts` |
| **Dutchie Client** | Curl-based HTTP with auto-retry on 403 | `src/platforms/dutchie/client.ts` |

### Initialization Flow

```
Worker Start
  │
  ├─► initializeStealth()
  │     │
  │     ├─► CrawlRotator.initialize()
  │     │     └─► Load proxies from `proxies` table
  │     │
  │     └─► setCrawlRotator(rotator)
  │           └─► Wire to Dutchie client
  │
  └─► Process tasks...
```

### Stealth Session (per task)

Each crawl task starts a stealth session:

```typescript
// In product-resync.ts, entry-point-discovery.ts
const session = startSession(dispensary.state || 'AZ', 'America/Phoenix');
```

This creates a new identity with:
- **Random fingerprint:** Chrome/Firefox/Safari/Edge on Win/Mac/Linux
- **Accept-Language:** Matches the timezone (e.g., `America/Phoenix` → `en-US,en;q=0.9`)
- **sec-ch-ua headers:** Proper Client Hints for the browser profile

### On 403 Block

When Dutchie returns 403, the client automatically:
1. Records a failure on the current proxy (increments `failure_count`)
2. Deactivates the proxy if it has 5+ failures
3. Rotates to the next healthy proxy
4. Rotates the fingerprint
5. Retries the request

### Proxy Table Schema

```sql
CREATE TABLE proxies (
  id SERIAL PRIMARY KEY,
  host VARCHAR(255) NOT NULL,
  port INTEGER NOT NULL,
  username VARCHAR(100),
  password VARCHAR(100),
  protocol VARCHAR(10) DEFAULT 'http', -- http, https, socks5
  is_active BOOLEAN DEFAULT true,
  last_used_at TIMESTAMPTZ,
  failure_count INTEGER DEFAULT 0,
  success_count INTEGER DEFAULT 0,
  avg_response_time_ms INTEGER,
  last_failure_at TIMESTAMPTZ,
  last_error TEXT
);
```

### Configuration

Proxies are mandatory. There is no environment variable to disable them, and workers will refuse to start without active proxies in the database.

### User-Agent Generation

See `workflow-12102025.md` for the full specification.

**Summary:**
- Uses the `intoli/user-agents` library (daily-updated market-share data)
- Device distribution: Mobile 62%, Desktop 36%, Tablet 2%
- Browser whitelist: Chrome, Safari, Edge, Firefox only
- The UA sticks until the IP rotates (403 or manual rotation)
- Failure = alert admin + stop crawl (no fallback)

Each fingerprint includes proper `sec-ch-ua`, `sec-ch-ua-platform`, and `sec-ch-ua-mobile` headers.
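
A minimal in-memory sketch of the round-robin rotation and 5-failure deactivation described above. This is illustrative only; the real `ProxyRotator` in `src/services/crawl-rotator.ts` also persists health counters to the `proxies` table:

```typescript
// Simplified in-memory model of proxy rotation with the
// 5-failure deactivation threshold.
interface ProxyState {
  id: number;
  failure_count: number;
  is_active: boolean;
}

class SimpleProxyRotator {
  private cursor = 0;
  constructor(private proxies: ProxyState[]) {}

  // Round-robin over active proxies only.
  next(): ProxyState {
    const active = this.proxies.filter((p) => p.is_active);
    if (active.length === 0) throw new Error('No active proxies -- worker must stop');
    const proxy = active[this.cursor % active.length];
    this.cursor++;
    return proxy;
  }

  // Called on a 403: record the failure, deactivate at 5+ failures.
  recordFailure(proxy: ProxyState): void {
    proxy.failure_count++;
    if (proxy.failure_count >= 5) proxy.is_active = false;
  }
}

const rotator = new SimpleProxyRotator([
  { id: 1, failure_count: 4, is_active: true },
  { id: 2, failure_count: 0, is_active: true },
]);

const p = rotator.next();       // proxy 1 (round-robin start)
rotator.recordFailure(p);       // 5th failure -> deactivated
console.log(p.is_active);       // false
console.log(rotator.next().id); // 2 -- the only healthy proxy left
```
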

---

## Error Handling

- **GraphQL errors:** Logged, task marked failed, retried later
- **Normalization errors:** Logged as warnings; continue with valid products
- **Image download errors:** Non-fatal; logged, continue
- **Database errors:** Task fails and will be retried
- **403 blocks:** Auto-rotate proxy + fingerprint, retry (up to 3 retries)

---

## Files

| File | Purpose |
|------|---------|
| `src/tasks/handlers/product-resync.ts` | Main crawl handler |
| `src/tasks/handlers/entry-point-discovery.ts` | Slug → ID resolution |
| `src/platforms/dutchie/index.ts` | GraphQL client, session management |
| `src/hydration/normalizers/dutchie.ts` | Payload normalization |
| `src/hydration/canonical-upsert.ts` | Database upsert logic |
| `src/utils/image-storage.ts` | Image download and local storage |
| `src/routes/image-proxy.ts` | On-demand image resizing |
| `migrations/075_consecutive_misses.sql` | OOS tracking column |
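
The 403 handling above (rotate proxy + fingerprint, then retry, capped at 3 retries) can be sketched as a generic wrapper. `doRequest` and `rotateIdentity` are illustrative stand-ins, not actual client functions; the real logic lives in `src/platforms/dutchie/client.ts`:

```typescript
// Sketch of the 403 retry loop: try, rotate identity on a block, retry.
// maxRetries = 3 means up to 4 attempts total (1 initial + 3 retries).
async function fetchWithRotation<T>(
  doRequest: () => Promise<{ status: number; body?: T }>,
  rotateIdentity: () => void, // rotate proxy + fingerprint
  maxRetries = 3,
): Promise<T> {
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    const res = await doRequest();
    if (res.status !== 403) return res.body as T;
    rotateIdentity(); // blocked: pick a new proxy and fingerprint, then retry
  }
  throw new Error(`Still blocked after ${maxRetries} retries`);
}

// Example: first attempt is blocked, second succeeds after one rotation.
async function demo(): Promise<{ result: string; rotations: number }> {
  let rotations = 0;
  let calls = 0;
  const result = await fetchWithRotation<string>(
    async () => (++calls === 1 ? { status: 403 } : { status: 200, body: 'ok' }),
    () => rotations++,
  );
  return { result, rotations };
}

demo().then(({ result, rotations }) => console.log(result, rotations)); // ok 1
```
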