- Fix store_products column references (name_raw, brand_name_raw, category_raw)
- Fix v_product_snapshots column references (crawled_at, *_cents pricing)
- Fix dispensaries column references (zipcode, logo_image, remove hours/amenities)
- Add services and license_type to dispensary API response
- Add consecutive_misses OOS tracking to product-resync handler
- Add migration 075 for consecutive_misses column
- Add CRAWL_PIPELINE.md documentation

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
309 lines
9.1 KiB
Markdown
# Crawl Pipeline Documentation

## Overview

The crawl pipeline fetches product data from Dutchie dispensary menus and stores it in the canonical database. This document covers the complete flow from task scheduling to data storage.

---

## Pipeline Stages

```
┌───────────────────────┐
│ store_discovery       │  Find new dispensaries
└───────────┬───────────┘
            │
            ▼
┌───────────────────────┐
│ entry_point_discovery │  Resolve slug → platform_dispensary_id
└───────────┬───────────┘
            │
            ▼
┌───────────────────────┐
│ product_discovery     │  Initial product crawl
└───────────┬───────────┘
            │
            ▼
┌───────────────────────┐
│ product_resync        │  Recurring crawl (every 4 hours)
└───────────────────────┘
```
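
The stage order in the diagram can be written down as a small lookup table. This is only an illustrative sketch: the stage names come from the diagram, while `NEXT_STAGE` and `stagesFrom` are hypothetical names that are not part of the codebase.

```typescript
// Stage names are taken from the diagram; the mapping itself is illustrative.
type Stage =
  | "store_discovery"
  | "entry_point_discovery"
  | "product_discovery"
  | "product_resync";

// Hypothetical lookup: which stage follows each stage in the diagram.
// product_resync has no successor because it recurs on a schedule.
const NEXT_STAGE: Record<Stage, Stage | null> = {
  store_discovery: "entry_point_discovery",
  entry_point_discovery: "product_discovery",
  product_discovery: "product_resync",
  product_resync: null,
};

// Walk the chain from a starting stage to the end of the pipeline.
function stagesFrom(start: Stage): Stage[] {
  const order: Stage[] = [];
  let current: Stage | null = start;
  while (current !== null) {
    order.push(current);
    current = NEXT_STAGE[current];
  }
  return order;
}
```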

---

## Stage Details

### 1. Store Discovery
**Purpose:** Find new dispensaries to crawl

**Handler:** `src/tasks/handlers/store-discovery.ts`

**Flow:**
1. Query Dutchie `ConsumerDispensaries` GraphQL for cities/states
2. Extract dispensary info (name, address, menu_url)
3. Insert into `dutchie_discovery_locations`
4. Queue `entry_point_discovery` for each new location

---

### 2. Entry Point Discovery
**Purpose:** Resolve menu URL slug to platform_dispensary_id (MongoDB ObjectId)

**Handler:** `src/tasks/handlers/entry-point-discovery.ts`

**Flow:**
1. Load dispensary from database
2. Extract slug from `menu_url`:
   - `/embedded-menu/<slug>` or `/dispensary/<slug>`
3. Start stealth session (fingerprint + proxy)
4. Query `resolveDispensaryIdWithDetails(slug)` via GraphQL
5. Update dispensary with `platform_dispensary_id`
6. Queue `product_discovery` task

**Example:**
```
menu_url: https://dutchie.com/embedded-menu/deeply-rooted
slug: deeply-rooted
platform_dispensary_id: 6405ef617056e8014d79101b
```
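
The slug extraction in step 2 could look roughly like the sketch below. `extractSlug` is a hypothetical helper, not the handler's actual code, and it assumes the slug is the first path segment after `/embedded-menu/` or `/dispensary/`.

```typescript
// Hypothetical helper: pull the dispensary slug out of a Dutchie menu URL.
// Handles the two path shapes listed in step 2; returns null for anything else.
function extractSlug(menuUrl: string): string | null {
  // Capture everything after the prefix up to the next "/", "?" or "#".
  const match = menuUrl.match(/\/(?:embedded-menu|dispensary)\/([^/?#]+)/);
  return match?.[1] ?? null;
}
```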

---

### 3. Product Discovery
**Purpose:** Initial crawl of a new dispensary

**Handler:** `src/tasks/handlers/product-discovery.ts`

Runs the same flow as `product_resync` (section 4), but for a dispensary's first crawl.

---

### 4. Product Resync
**Purpose:** Recurring crawl to capture price/stock changes

**Handler:** `src/tasks/handlers/product-resync.ts`

**Flow:**

#### Step 1: Load Dispensary Info
```sql
SELECT id, name, platform_dispensary_id, menu_url, state
FROM dispensaries
WHERE id = $1 AND crawl_enabled = true
```

#### Step 2: Start Stealth Session
- Generate a random browser fingerprint
- Set locale and timezone to match the dispensary's state
- Optionally rotate the proxy

#### Step 3: Fetch Products via GraphQL
**Endpoint:** `https://dutchie.com/api-3/graphql`

**Variables:**
```javascript
// Variables sent with the product query
const variables = {
  includeEnterpriseSpecials: false,
  productsFilter: {
    dispensaryId: "<platform_dispensary_id>",
    pricingType: "rec",
    Status: "All",
    types: [],
    useCache: false,
    isDefaultSort: true,
    sortBy: "popularSortIdx",
    sortDirection: 1,
    bypassOnlineThresholds: true,
    isKioskMenu: false,
    removeProductsBelowOptionThresholds: false
  },
  page: 0, // first page; increases while paginating
  perPage: 100
};
```

**Key Notes:**
- `Status: "All"` returns all products (`Status: "Active"` returns the same count)
- `Status: null` returns 0 products (broken)
- `pricingType: "rec"` returns both rec and med prices
- Paginate until `products.length < perPage` or `allProducts.length >= totalCount`
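
The pagination rule in the last note can be sketched as a loop. This is a minimal sketch under assumptions: `fetchPage` is a hypothetical callback standing in for the GraphQL request, the `ProductPage` shape is illustrative, and the `maxPages` cap is an added safeguard that the handler may not have.

```typescript
// Shape of one page of results; field names here are assumptions.
interface ProductPage<T> {
  products: T[];
  totalCount: number;
}

// Hypothetical page fetcher: a real one would issue the GraphQL request
// with the variables documented for this step, varying only `page`.
type FetchPage<T> = (page: number, perPage: number) => Promise<ProductPage<T>>;

async function fetchAllProducts<T>(
  fetchPage: FetchPage<T>,
  perPage = 100,
  maxPages = 100, // safety cap (assumption) so a bad totalCount cannot loop forever
): Promise<T[]> {
  const allProducts: T[] = [];
  for (let page = 0; page < maxPages; page++) {
    const { products, totalCount } = await fetchPage(page, perPage);
    allProducts.push(...products);
    // Stop on a short page or once everything reported has been collected.
    if (products.length < perPage || allProducts.length >= totalCount) break;
  }
  return allProducts;
}
```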

#### Step 4: Normalize Data
Transform raw Dutchie payload to canonical format via `DutchieNormalizer`.
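
To make the idea of normalization concrete, here is a hedged sketch of mapping a few raw fields onto `store_products` columns. It is not the `DutchieNormalizer` implementation: `normalizeProduct`, both interfaces, and the choice of the lowest price per product are assumptions made for illustration.

```typescript
// Subset of the raw Dutchie product fields documented in this file.
interface RawDutchieProduct {
  _id?: string;
  id?: string;
  Name: string;
  brandName?: string;
  brand?: { name?: string };
  type?: string;
  recPrices?: number[];
  medicalPrices?: number[];
}

// Hypothetical canonical row; column names follow store_products,
// but the real table has more columns than shown here.
interface NormalizedProduct {
  provider_product_id: string;
  name_raw: string;
  brand_name_raw: string | null;
  category_raw: string | null;
  price_rec: number | null;
  price_med: number | null;
}

// Assumption: when a product has several size options, keep the lowest price.
function lowestPrice(prices?: number[]): number | null {
  return prices && prices.length > 0 ? Math.min(...prices) : null;
}

function normalizeProduct(raw: RawDutchieProduct): NormalizedProduct {
  const providerProductId = raw._id ?? raw.id;
  if (!providerProductId) throw new Error("product has no _id or id");
  return {
    provider_product_id: providerProductId,
    name_raw: raw.Name,
    // Prefer the flat brandName, fall back to the nested brand.name.
    brand_name_raw: raw.brandName ?? raw.brand?.name ?? null,
    category_raw: raw.type ?? null,
    price_rec: lowestPrice(raw.recPrices),
    price_med: lowestPrice(raw.medicalPrices),
  };
}
```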

#### Step 5: Upsert Products
Insert or update rows in the `store_products` table with the normalized data.

#### Step 6: Create Snapshots
Insert a point-in-time record into `store_product_snapshots`.

#### Step 7: Track Missing Products (OOS Detection)
```sql
-- Reset consecutive_misses for products IN the feed
UPDATE store_products
SET consecutive_misses = 0, last_seen_at = NOW()
WHERE dispensary_id = $1
  AND provider = 'dutchie'
  AND provider_product_id = ANY($2);

-- Increment for products NOT in feed
UPDATE store_products
SET consecutive_misses = consecutive_misses + 1
WHERE dispensary_id = $1
  AND provider = 'dutchie'
  AND provider_product_id NOT IN (...)
  AND consecutive_misses < 3;

-- Mark OOS at 3 consecutive misses
UPDATE store_products
SET stock_status = 'oos', is_in_stock = false
WHERE dispensary_id = $1
  AND consecutive_misses >= 3
  AND stock_status != 'oos';
```

#### Step 8: Download Images
For new products, download and store images locally.

#### Step 9: Update Dispensary
```sql
UPDATE dispensaries SET last_crawl_at = NOW() WHERE id = $1
```

---

## GraphQL Payload Structure

### Product Fields (from filteredProducts.products[])

| Field | Type | Description |
|-------|------|-------------|
| `_id` / `id` | string | MongoDB ObjectId (24 hex chars) |
| `Name` | string | Product display name |
| `brandName` | string | Brand name |
| `brand.name` | string | Brand name (nested) |
| `brand.description` | string | Brand description |
| `type` | string | Category (Flower, Edible, Concentrate, etc.) |
| `subcategory` | string | Subcategory |
| `strainType` | string | Hybrid, Indica, Sativa, N/A |
| `Status` | string | Always "Active" in feed |
| `Image` | string | Primary image URL |
| `images[]` | array | All product images |
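
Since `_id` is described as a 24-character hex string, a cheap format check can catch malformed IDs before they reach the database. `isObjectId` below is a hypothetical helper; whether the handlers validate IDs at all is not stated in this document.

```typescript
// Matches exactly 24 hexadecimal characters, the textual form of a MongoDB ObjectId.
const OBJECT_ID_PATTERN = /^[0-9a-f]{24}$/i;

// Type guard: true only for strings that look like an ObjectId.
function isObjectId(value: unknown): value is string {
  return typeof value === "string" && OBJECT_ID_PATTERN.test(value);
}
```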

### Pricing Fields

| Field | Type | Description |
|-------|------|-------------|
| `Prices[]` | number[] | Rec prices per option |
| `recPrices[]` | number[] | Rec prices |
| `medicalPrices[]` | number[] | Medical prices |
| `recSpecialPrices[]` | number[] | Rec sale prices |
| `medicalSpecialPrices[]` | number[] | Medical sale prices |
| `Options[]` | string[] | Size options ("1/8oz", "1g", etc.) |
| `rawOptions[]` | string[] | Raw weight options ("3.5g") |

### Inventory Fields (POSMetaData.children[])

| Field | Type | Description |
|-------|------|-------------|
| `quantity` | number | Total inventory count |
| `quantityAvailable` | number | Available for online orders |
| `kioskQuantityAvailable` | number | Available for kiosk orders |
| `option` | string | Which size option this is for |
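
Because inventory is reported per size option, a product-level figure has to be aggregated across `POSMetaData.children[]`. The sketch below sums `quantityAvailable`; `totalAvailable` is a hypothetical helper, and treating missing or negative values as zero is an assumption, not documented pipeline behavior.

```typescript
// One entry of POSMetaData.children[], limited to the fields in the table.
interface InventoryChild {
  option?: string;
  quantity?: number;
  quantityAvailable?: number;
  kioskQuantityAvailable?: number;
}

// Sum online availability across all size options.
// Assumption: a missing or negative quantityAvailable counts as zero.
function totalAvailable(children: InventoryChild[] = []): number {
  return children.reduce(
    (sum, child) => sum + Math.max(0, child.quantityAvailable ?? 0),
    0,
  );
}
```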

### Potency Fields

| Field | Type | Description |
|-------|------|-------------|
| `THCContent.range[]` | number[] | THC percentage |
| `CBDContent.range[]` | number[] | CBD percentage |
| `cannabinoidsV2[]` | array | Detailed cannabinoid breakdown |

### Specials (specialData.bogoSpecials[])

| Field | Type | Description |
|-------|------|-------------|
| `specialName` | string | Deal name |
| `specialType` | string | "bogo", "sale", etc. |
| `itemsForAPrice.value` | string | Bundle price |
| `bogoRewards[].totalQuantity.quantity` | number | Required quantity |

---

## OOS Detection Logic

Products disappear from the Dutchie feed when they go out of stock. We track this via `consecutive_misses`:

| Scenario | Action |
|----------|--------|
| Product in feed | `consecutive_misses = 0` |
| Product missing 1st time | `consecutive_misses = 1` |
| Product missing 2nd time | `consecutive_misses = 2` |
| Product missing 3rd time | `consecutive_misses = 3`, mark `stock_status = 'oos'` |
| Product returns to feed | `consecutive_misses = 0`, update stock_status |

**Why 3 misses?**
- Protects against false positives from crawl failures
- Single bad crawl doesn't trigger mass OOS alerts
- Balances detection speed vs accuracy: with a resync every 4 hours, a product is marked OOS roughly 8 to 12 hours after it leaves the feed
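
The scenario table can be modeled in memory to see how the counter behaves across crawls. This is a sketch, not the handler's code (which does this in SQL): `applyCrawl` and `TrackedProduct` are hypothetical names, and setting `stock_status` to `"in_stock"` when a product returns is an assumption, since the table only says the status is updated.

```typescript
const OOS_THRESHOLD = 3; // the three-miss rule

interface TrackedProduct {
  provider_product_id: string;
  consecutive_misses: number;
  stock_status: string;
}

// Apply one crawl's result to the tracked products.
// `seenIds` holds the product IDs present in this crawl's feed.
function applyCrawl(
  products: TrackedProduct[],
  seenIds: Set<string>,
): TrackedProduct[] {
  return products.map((p) => {
    if (seenIds.has(p.provider_product_id)) {
      // Assumption: a product back in the feed is treated as in stock.
      return { ...p, consecutive_misses: 0, stock_status: "in_stock" };
    }
    // The counter stops growing once it reaches the threshold.
    const misses = Math.min(p.consecutive_misses + 1, OOS_THRESHOLD);
    return {
      ...p,
      consecutive_misses: misses,
      stock_status: misses >= OOS_THRESHOLD ? "oos" : p.stock_status,
    };
  });
}
```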

---

## Database Tables

### store_products
Current state of each product:
- `provider_product_id` - Dutchie's MongoDB ObjectId
- `name_raw`, `brand_name_raw` - Raw values from feed
- `price_rec`, `price_med` - Current prices
- `is_in_stock`, `stock_status` - Availability
- `consecutive_misses` - OOS detection counter
- `last_seen_at` - Last time product was in feed

### store_product_snapshots
Point-in-time records for historical analysis:
- One row per product per crawl
- Captures price, stock, potency at that moment
- Used for price history, analytics

### dispensaries
Store metadata:
- `platform_dispensary_id` - MongoDB ObjectId for GraphQL
- `menu_url` - Source URL
- `last_crawl_at` - Last successful crawl
- `crawl_enabled` - Whether to crawl

---

## Scheduling

Crawls are scheduled via `worker_tasks` table:

| Role | Frequency | Description |
|------|-----------|-------------|
| `product_resync` | Every 4 hours | Regular product refresh |
| `entry_point_discovery` | On-demand | New store setup |
| `store_discovery` | Daily | Find new stores |

---

## Error Handling

- **GraphQL errors:** Logged, task marked failed, retried later
- **Normalization errors:** Logged as warnings, continue with valid products
- **Image download errors:** Non-fatal, logged, continue
- **Database errors:** Task fails, will be retried
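
The "log and continue" policy for normalization errors might be structured like the sketch below. `normalizeBatch` and `BatchResult` are hypothetical names, and `normalize` stands in for whatever per-product function the pipeline actually uses.

```typescript
// Result of normalizing a batch: valid rows plus a count of skipped items.
interface BatchResult<Out> {
  valid: Out[];
  skipped: number;
}

// Run a normalizer over every raw item, logging failures as warnings and
// continuing with the items that succeed. `normalize` is any function that
// throws on bad input; the name is illustrative.
function normalizeBatch<In, Out>(
  rawItems: In[],
  normalize: (raw: In) => Out,
  warn: (message: string) => void = console.warn,
): BatchResult<Out> {
  const valid: Out[] = [];
  let skipped = 0;
  for (const raw of rawItems) {
    try {
      valid.push(normalize(raw));
    } catch (err) {
      // One bad product must not fail the whole crawl.
      skipped++;
      const reason = err instanceof Error ? err.message : String(err);
      warn(`normalization failed: ${reason}`);
    }
  }
  return { valid, skipped };
}
```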

---

## Files

| File | Purpose |
|------|---------|
| `src/tasks/handlers/product-resync.ts` | Main crawl handler |
| `src/tasks/handlers/entry-point-discovery.ts` | Slug → ID resolution |
| `src/platforms/dutchie/index.ts` | GraphQL client, session management |
| `src/hydration/normalizers/dutchie.ts` | Payload normalization |
| `src/hydration/canonical-upsert.ts` | Database upsert logic |
| `migrations/075_consecutive_misses.sql` | OOS tracking column |