cannaiq/backend/docs/CRAWL_PIPELINE.md

# Crawl Pipeline Documentation

## Overview

The crawl pipeline fetches product data from Dutchie dispensary menus and stores it in the canonical database. This document covers the complete flow from task scheduling to data storage.

---

## Pipeline Stages

```
┌─────────────────────┐
│  store_discovery    │  Find new dispensaries
└─────────┬───────────┘
          │
          ▼
┌─────────────────────┐
│ entry_point_discovery│  Resolve slug → platform_dispensary_id
└─────────┬───────────┘
          │
          ▼
┌─────────────────────┐
│  product_discovery  │  Initial product crawl
└─────────┬───────────┘
          │
          ▼
┌─────────────────────┐
│   product_resync    │  Recurring crawl (every 4 hours)
└─────────────────────┘
```

---

## Stage Details

### 1. Store Discovery
**Purpose:** Find new dispensaries to crawl

**Handler:** `src/tasks/handlers/store-discovery.ts`

**Flow:**
1. Query Dutchie `ConsumerDispensaries` GraphQL for cities/states
2. Extract dispensary info (name, address, menu_url)
3. Insert into `dutchie_discovery_locations`
4. Queue `entry_point_discovery` for each new location

---

### 2. Entry Point Discovery
**Purpose:** Resolve menu URL slug to platform_dispensary_id (MongoDB ObjectId)

**Handler:** `src/tasks/handlers/entry-point-discovery.ts`

**Flow:**
1. Load dispensary from database
2. Extract slug from `menu_url`:
   - `/embedded-menu/<slug>` or `/dispensary/<slug>`
3. Start stealth session (fingerprint + proxy)
4. Query `resolveDispensaryIdWithDetails(slug)` via GraphQL
5. Update dispensary with `platform_dispensary_id`
6. Queue `product_discovery` task

**Example:**
```
menu_url: https://dutchie.com/embedded-menu/deeply-rooted
slug: deeply-rooted
platform_dispensary_id: 6405ef617056e8014d79101b
```

---

### 3. Product Discovery
**Purpose:** Initial crawl of a new dispensary

**Handler:** `src/tasks/handlers/product-discovery.ts`

Same as product_resync but for first-time crawls.

---

### 4. Product Resync
**Purpose:** Recurring crawl to capture price/stock changes

**Handler:** `src/tasks/handlers/product-resync.ts`

**Flow:**

#### Step 1: Load Dispensary Info
```sql
SELECT id, name, platform_dispensary_id, menu_url, state
FROM dispensaries
WHERE id = $1 AND crawl_enabled = true
```

#### Step 2: Start Stealth Session
- Generate random browser fingerprint
- Set locale/timezone matching state
- Optional proxy rotation

#### Step 3: Fetch Products via GraphQL
**Endpoint:** `https://dutchie.com/api-3/graphql`

**Variables:**
```javascript
{
  includeEnterpriseSpecials: false,
  productsFilter: {
    dispensaryId: "<platform_dispensary_id>",
    pricingType: "rec",
    Status: "All",
    types: [],
    useCache: false,
    isDefaultSort: true,
    sortBy: "popularSortIdx",
    sortDirection: 1,
    bypassOnlineThresholds: true,
    isKioskMenu: false,
    removeProductsBelowOptionThresholds: false
  },
  page: 0,
  perPage: 100
}
```

**Key Notes:**
- `Status: "All"` returns all products (Active returns same count)
- `Status: null` returns 0 products (broken)
- `pricingType: "rec"` returns BOTH rec and med prices
- Paginate until `products.length < perPage` or `allProducts.length >= totalCount`

#### Step 4: Normalize Data
Transform raw Dutchie payload to canonical format via `DutchieNormalizer`.

#### Step 5: Upsert Products
Insert/update `store_products` table with normalized data.

#### Step 6: Create Snapshots
Insert point-in-time record to `store_product_snapshots`.

#### Step 7: Track Missing Products (OOS Detection)
```sql
-- Reset consecutive_misses for products IN the feed
UPDATE store_products
SET consecutive_misses = 0, last_seen_at = NOW()
WHERE dispensary_id = $1
  AND provider = 'dutchie'
  AND provider_product_id = ANY($2)

-- Increment for products NOT in feed
UPDATE store_products
SET consecutive_misses = consecutive_misses + 1
WHERE dispensary_id = $1
  AND provider = 'dutchie'
  AND provider_product_id NOT IN (...)
  AND consecutive_misses < 3

-- Mark OOS at 3 consecutive misses
UPDATE store_products
SET stock_status = 'oos', is_in_stock = false
WHERE dispensary_id = $1
  AND consecutive_misses >= 3
  AND stock_status != 'oos'
```

#### Step 8: Download Images
For new products, download and store images locally.

#### Step 9: Update Dispensary
```sql
UPDATE dispensaries SET last_crawl_at = NOW() WHERE id = $1
```

---

## GraphQL Payload Structure

### Product Fields (from filteredProducts.products[])

| Field | Type | Description |
|-------|------|-------------|
| `_id` / `id` | string | MongoDB ObjectId (24 hex chars) |
| `Name` | string | Product display name |
| `brandName` | string | Brand name |
| `brand.name` | string | Brand name (nested) |
| `brand.description` | string | Brand description |
| `type` | string | Category (Flower, Edible, Concentrate, etc.) |
| `subcategory` | string | Subcategory |
| `strainType` | string | Hybrid, Indica, Sativa, N/A |
| `Status` | string | Always "Active" in feed |
| `Image` | string | Primary image URL |
| `images[]` | array | All product images |

### Pricing Fields

| Field | Type | Description |
|-------|------|-------------|
| `Prices[]` | number[] | Rec prices per option |
| `recPrices[]` | number[] | Rec prices |
| `medicalPrices[]` | number[] | Medical prices |
| `recSpecialPrices[]` | number[] | Rec sale prices |
| `medicalSpecialPrices[]` | number[] | Medical sale prices |
| `Options[]` | string[] | Size options ("1/8oz", "1g", etc.) |
| `rawOptions[]` | string[] | Raw weight options ("3.5g") |

### Inventory Fields (POSMetaData.children[])

| Field | Type | Description |
|-------|------|-------------|
| `quantity` | number | Total inventory count |
| `quantityAvailable` | number | Available for online orders |
| `kioskQuantityAvailable` | number | Available for kiosk orders |
| `option` | string | Which size option this is for |

### Potency Fields

| Field | Type | Description |
|-------|------|-------------|
| `THCContent.range[]` | number[] | THC percentage |
| `CBDContent.range[]` | number[] | CBD percentage |
| `cannabinoidsV2[]` | array | Detailed cannabinoid breakdown |

### Specials (specialData.bogoSpecials[])

| Field | Type | Description |
|-------|------|-------------|
| `specialName` | string | Deal name |
| `specialType` | string | "bogo", "sale", etc. |
| `itemsForAPrice.value` | string | Bundle price |
| `bogoRewards[].totalQuantity.quantity` | number | Required quantity |

---

## OOS Detection Logic

Products disappear from the Dutchie feed when they go out of stock. We track this via `consecutive_misses`:

| Scenario | Action |
|----------|--------|
| Product in feed | `consecutive_misses = 0` |
| Product missing 1st time | `consecutive_misses = 1` |
| Product missing 2nd time | `consecutive_misses = 2` |
| Product missing 3rd time | `consecutive_misses = 3`, mark `stock_status = 'oos'` |
| Product returns to feed | `consecutive_misses = 0`, update stock_status |

**Why 3 misses?**
- Protects against false positives from crawl failures
- Single bad crawl doesn't trigger mass OOS alerts
- Balances detection speed vs accuracy

---

## Database Tables

### store_products
Current state of each product:
- `provider_product_id` - Dutchie's MongoDB ObjectId
- `name_raw`, `brand_name_raw` - Raw values from feed
- `price_rec`, `price_med` - Current prices
- `is_in_stock`, `stock_status` - Availability
- `consecutive_misses` - OOS detection counter
- `last_seen_at` - Last time product was in feed

### store_product_snapshots
Point-in-time records for historical analysis:
- One row per product per crawl
- Captures price, stock, potency at that moment
- Used for price history, analytics

### dispensaries
Store metadata:
- `platform_dispensary_id` - MongoDB ObjectId for GraphQL
- `menu_url` - Source URL
- `last_crawl_at` - Last successful crawl
- `crawl_enabled` - Whether to crawl

---

## Worker Roles

Workers pull tasks from the `worker_tasks` queue based on their assigned role.

| Role | Name | Description | Handler |
|------|------|-------------|---------|
| `product_resync` | Product Resync | Re-crawl dispensary products for price/stock changes | `handleProductResync` |
| `product_discovery` | Product Discovery | Initial product discovery for new dispensaries | `handleProductDiscovery` |
| `store_discovery` | Store Discovery | Discover new dispensary locations | `handleStoreDiscovery` |
| `entry_point_discovery` | Entry Point Discovery | Resolve platform IDs from menu URLs | `handleEntryPointDiscovery` |
| `analytics_refresh` | Analytics Refresh | Refresh materialized views and analytics | `handleAnalyticsRefresh` |

**API Endpoint:** `GET /api/worker-registry/roles`

---

## Scheduling

Crawls are scheduled via `worker_tasks` table:

| Role | Frequency | Description |
|------|-----------|-------------|
| `product_resync` | Every 4 hours | Regular product refresh |
| `product_discovery` | On-demand | First crawl for new stores |
| `entry_point_discovery` | On-demand | New store setup |
| `store_discovery` | Daily | Find new stores |
| `analytics_refresh` | Daily | Refresh analytics materialized views |

---

## Priority & On-Demand Tasks

Tasks are claimed by workers in order of **priority DESC, created_at ASC**.

### Priority Levels

| Priority | Use Case | Example |
|----------|----------|---------|
| 0 | Scheduled/batch tasks | Daily product_resync generation |
| 10 | On-demand/chained tasks | entry_point → product_discovery |
| Higher | Urgent/manual triggers | Admin-triggered immediate crawl |

### Task Chaining

When a task completes, the system automatically creates follow-up tasks:

```
store_discovery (completed)
    └─► entry_point_discovery (priority: 10) for each new store

entry_point_discovery (completed, success)
    └─► product_discovery (priority: 10) for that store

product_discovery (completed)
    └─► [no chain] Store enters regular resync schedule
```

### On-Demand Task Creation

Use the task service to create high-priority tasks:

```typescript
// Create immediate product resync for a store
await taskService.createTask({
  role: 'product_resync',
  dispensary_id: 123,
  platform: 'dutchie',
  priority: 20, // Higher than batch tasks
});

// Convenience methods with default high priority (10)
await taskService.createEntryPointTask(dispensaryId, 'dutchie');
await taskService.createProductDiscoveryTask(dispensaryId, 'dutchie');
await taskService.createStoreDiscoveryTask('dutchie', 'AZ');
```

### Claim Function

The `claim_task()` SQL function atomically claims tasks:
- Respects priority ordering (higher = first)
- Uses `FOR UPDATE SKIP LOCKED` for concurrency
- Prevents multiple active tasks per store

---

## Image Storage

Images are downloaded from Dutchie's AWS S3 and stored locally with on-demand resizing.

### Storage Path
```
/storage/images/products/<state>/<store>/<brand>/<product_id>/image-<hash>.webp
/storage/images/brands/<brand>/logo-<hash>.webp
```

**Example:**
```
/storage/images/products/az/az-deeply-rooted/bud-bros/6913e3cd444eac3935e928b9/image-ae38b1f9.webp
```

### Image Proxy API
Served via `/img/*` with on-demand resizing using **sharp**:

```
GET /img/products/az/az-deeply-rooted/bud-bros/6913e3cd444eac3935e928b9/image-ae38b1f9.webp?w=200
```

| Param | Description |
|-------|-------------|
| `w` | Width in pixels (max 4000) |
| `h` | Height in pixels (max 4000) |
| `q` | Quality 1-100 (default 80) |
| `fit` | cover, contain, fill, inside, outside |
| `blur` | Blur sigma (0.3-1000) |
| `gray` | Grayscale (1 = enabled) |
| `format` | webp, jpeg, png, avif (default webp) |

### Key Files
| File | Purpose |
|------|---------|
| `src/utils/image-storage.ts` | Download & save images to local filesystem |
| `src/routes/image-proxy.ts` | On-demand resize/transform at `/img/*` |

### Download Rules

| Scenario | Image Action |
|----------|--------------|
| **New product (first crawl)** | Download if `primaryImageUrl` exists |
| **Existing product (refresh)** | Download only if `local_image_path` is NULL (backfill) |
| **Product already has local image** | Skip download entirely |

**Logic:**
- Images are downloaded **once** and never re-downloaded on subsequent crawls
- `skipIfExists: true` - filesystem check prevents re-download even if queued
- First crawl: all products get images
- Refresh crawl: only new products or products missing local images

### Storage Rules
- **NO MinIO** - local filesystem only (`STORAGE_DRIVER=local`)
- Store full resolution, resize on-demand via `/img` proxy
- Convert to webp for consistency using **sharp**
- Preserve original Dutchie URL as fallback in `image_url` column
- Local path stored in `local_image_path` column

---

## Stealth & Anti-Detection

**PROXIES ARE REQUIRED** - Workers will fail to start if no active proxies are available in the database. All HTTP requests to Dutchie go through a proxy.

Workers automatically initialize anti-detection systems on startup.

### Components

| Component | Purpose | Source |
|-----------|---------|--------|
| **CrawlRotator** | Coordinates proxy + UA rotation | `src/services/crawl-rotator.ts` |
| **ProxyRotator** | Round-robin proxy selection, health tracking | `src/services/crawl-rotator.ts` |
| **UserAgentRotator** | Cycles through realistic browser fingerprints | `src/services/crawl-rotator.ts` |
| **Dutchie Client** | Curl-based HTTP with auto-retry on 403 | `src/platforms/dutchie/client.ts` |

### Initialization Flow

```
Worker Start
    │
    ├─► initializeStealth()
    │       │
    │       ├─► CrawlRotator.initialize()
    │       │       └─► Load proxies from `proxies` table
    │       │
    │       └─► setCrawlRotator(rotator)
    │               └─► Wire to Dutchie client
    │
    └─► Process tasks...
```

### Stealth Session (per task)

Each crawl task starts a stealth session:

```typescript
// In product-refresh.ts, entry-point-discovery.ts
const session = startSession(dispensary.state || 'AZ', 'America/Phoenix');
```

This creates a new identity with:
- **Random fingerprint:** Chrome/Firefox/Safari/Edge on Win/Mac/Linux
- **Accept-Language:** Matches timezone (e.g., `America/Phoenix` → `en-US,en;q=0.9`)
- **sec-ch-ua headers:** Proper Client Hints for the browser profile

### On 403 Block

When Dutchie returns 403, the client automatically:

1. Records failure on current proxy (increments `failure_count`)
2. If proxy has 5+ failures, deactivates it
3. Rotates to next healthy proxy
4. Rotates fingerprint
5. Retries the request

### Proxy Table Schema

```sql
CREATE TABLE proxies (
  id SERIAL PRIMARY KEY,
  host VARCHAR(255) NOT NULL,
  port INTEGER NOT NULL,
  username VARCHAR(100),
  password VARCHAR(100),
  protocol VARCHAR(10) DEFAULT 'http',  -- http, https, socks5
  is_active BOOLEAN DEFAULT true,
  last_used_at TIMESTAMPTZ,
  failure_count INTEGER DEFAULT 0,
  success_count INTEGER DEFAULT 0,
  avg_response_time_ms INTEGER,
  last_failure_at TIMESTAMPTZ,
  last_error TEXT
);
```

### Configuration

Proxies are mandatory. There is no environment variable to disable them. Workers will refuse to start without active proxies in the database.

### Fingerprints Available

The client includes 6 browser fingerprints:
- Chrome 131 on Windows
- Chrome 131 on macOS
- Chrome 120 on Windows
- Firefox 133 on Windows
- Safari 17.2 on macOS
- Edge 131 on Windows

Each includes proper `sec-ch-ua`, `sec-ch-ua-platform`, and `sec-ch-ua-mobile` headers.

---

## Error Handling

- **GraphQL errors:** Logged, task marked failed, retried later
- **Normalization errors:** Logged as warnings, continue with valid products
- **Image download errors:** Non-fatal, logged, continue
- **Database errors:** Task fails, will be retried
- **403 blocks:** Auto-rotate proxy + fingerprint, retry (up to 3 retries)

---

## Files

| File | Purpose |
|------|---------|
| `src/tasks/handlers/product-resync.ts` | Main crawl handler |
| `src/tasks/handlers/entry-point-discovery.ts` | Slug → ID resolution |
| `src/platforms/dutchie/index.ts` | GraphQL client, session management |
| `src/hydration/normalizers/dutchie.ts` | Payload normalization |
| `src/hydration/canonical-upsert.ts` | Database upsert logic |
| `src/utils/image-storage.ts` | Image download and local storage |
| `src/routes/image-proxy.ts` | On-demand image resizing |
| `migrations/075_consecutive_misses.sql` | OOS tracking column |