# Crawl Operations & Data Philosophy

This document defines the operational constraints, scheduling requirements, and data integrity philosophy for the dispensary scraper system.

---

## 1. Frozen Crawler Policy

> **CRITICAL CONSTRAINT**: The crawler code is FROZEN. Do NOT modify any crawler logic.

### What Is Frozen

The following components are read-only and must not be modified:

- **Selectors**: All CSS/XPath selectors for extracting data from Dutchie pages
- **Parsing Logic**: Functions that transform raw HTML into structured data
- **Request Patterns**: URL construction, pagination, API calls to Dutchie
- **Browser Configuration**: Puppeteer settings, user agents, viewport sizes
- **Rate Limiting**: Request delays, retry logic, concurrent request limits

### What CAN Be Modified

You may build around the crawler's output:

| Layer | Allowed Changes |
|-------|-----------------|
| **Scheduling** | CronJobs, run frequency, store queuing |
| **Ingestion** | Post-processing of crawler output before DB insert |
| **API Layer** | Query logic, computed fields, response transformations |
| **Intelligence** | Aggregation tables, metrics computation |
| **Infrastructure** | K8s resources, scaling, monitoring |

### Rationale

The crawler has been stabilized through extensive testing. Changes to selectors or parsing risk:

- Breaking data extraction if Dutchie changes their UI
- Introducing regressions that are hard to detect
- Requiring re-validation across all store types

All improvements must happen in **downstream processing**, not in the crawler itself.

---

## 2. Crawl Scheduling

### Standard Schedule: Every 4 Hours

Run a full crawl for each store every 4 hours, 24/7.

```yaml
# K8s CronJob: Every 4 hours
apiVersion: batch/v1
kind: CronJob
metadata:
  name: scraper-4h-cycle
  namespace: dispensary-scraper
spec:
  schedule: "0 */4 * * *"  # 00:00, 04:00, 08:00, 12:00, 16:00, 20:00 UTC
  concurrencyPolicy: Forbid
  jobTemplate:
    spec:
      template:
        spec:
          containers:
            - name: scraper
              image: code.cannabrands.app/creationshop/dispensary-scraper:latest
              command: ["node", "dist/scripts/run-all-stores.js"]
              env:
                - name: DATABASE_URL
                  valueFrom:
                    secretKeyRef:
                      name: scraper-secrets
                      key: database-url
          restartPolicy: OnFailure
```

### Daily Specials Crawl: 12:01 AM Store Local Time

Dispensaries often update their daily specials at midnight. We ensure a crawl happens at 12:01 AM in each store's local timezone.

```yaml
# K8s CronJob: Daily specials at store midnight (example for MST/Arizona)
apiVersion: batch/v1
kind: CronJob
metadata:
  name: scraper-daily-specials-mst
  namespace: dispensary-scraper
spec:
  schedule: "1 7 * * *"  # 12:01 AM MST = 07:01 UTC
  concurrencyPolicy: Forbid
  jobTemplate:
    spec:
      template:
        spec:
          containers:
            - name: scraper
              image: code.cannabrands.app/creationshop/dispensary-scraper:latest
              command: ["node", "dist/scripts/run-stores-by-timezone.js", "America/Phoenix"]
          restartPolicy: OnFailure
```

### Timezone-Aware Scheduling

Stores table includes timezone information:

```sql
ALTER TABLE stores ADD COLUMN IF NOT EXISTS timezone VARCHAR(50) DEFAULT 'America/Phoenix';

-- Lookup table for common dispensary timezones
-- America/Phoenix     (Arizona, no DST)
-- America/Los_Angeles (California)
-- America/Denver      (Colorado)
-- America/Chicago     (Illinois)
```

### Scripts Required

```
/backend/src/scripts/
├── run-all-stores.ts           # Run crawl for all enabled stores
├── run-stores-by-timezone.ts   # Run crawl for stores in a specific timezone
└── scheduler.ts                # Orchestrates CronJob dispatch
```
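
As a rough sketch of what `run-stores-by-timezone.ts` might look like (the `stores.enabled` flag, `db` module path, and `runCrawlForStore` entry point into the frozen crawler are illustrative assumptions, not confirmed names), it selects the stores in the requested timezone and crawls them sequentially:

```typescript
// run-stores-by-timezone.ts -- sketch only; column/module/function names are assumptions
import { db } from '../db';                     // shared Postgres client (assumed)
import { runCrawlForStore } from '../crawler';  // frozen crawler entry point (assumed, unchanged)

async function main(): Promise<void> {
  const timezone = process.argv[2];
  if (!timezone) {
    console.error('Usage: run-stores-by-timezone.js <IANA timezone>');
    process.exit(1);
  }

  // Select enabled stores in the requested timezone
  const { rows: stores } = await db.query(
    `SELECT id, store_key FROM stores WHERE enabled = TRUE AND timezone = $1 ORDER BY id`,
    [timezone]
  );

  // Crawl sequentially so the crawler's own rate limiting stays in effect
  for (const store of stores) {
    try {
      await runCrawlForStore(store.id);
    } catch (err) {
      console.error(`Crawl failed for store ${store.store_key}:`, err);
      // Continue with the next store; failures are recorded in crawl_runs
    }
  }
}

main().then(() => process.exit(0));
```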

---

## 3. Specials Detection Logic

> **Problem**: The Specials tab in the frontend is EMPTY even though products have discounts.

### Root Cause Analysis

Database investigation reveals:

| Metric | Count |
|--------|-------|
| Total products | 1,414 |
| `is_special = true` | 0 |
| Has "Special Offer" in name | 325 |
| Has `sale_price < regular_price` | 4 |

The crawler captures "Special Offer" **embedded in the product name** but doesn't set `is_special = true`.
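
The counts above can be reproduced with a query along these lines (a sketch; it assumes only the `products` columns already referenced in this document):

```sql
-- Sketch: reproduce the root-cause counts across the catalog
SELECT
  COUNT(*)                                             AS total_products,
  COUNT(*) FILTER (WHERE is_special = TRUE)            AS flagged_is_special,
  COUNT(*) FILTER (WHERE name ILIKE '%Special Offer%') AS special_offer_in_name,
  COUNT(*) FILTER (WHERE sale_price IS NOT NULL
                     AND regular_price IS NOT NULL
                     AND sale_price::numeric < regular_price::numeric) AS discounted_by_price
FROM products;
```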

### Solution: API-Layer Specials Detection

Since the crawler is frozen, detect specials at query time:

```sql
-- Computed is_on_special in API queries
SELECT
  p.*,
  CASE
    WHEN p.name ILIKE '%Special Offer%' THEN TRUE
    WHEN p.sale_price IS NOT NULL
         AND p.regular_price IS NOT NULL
         AND p.sale_price::numeric < p.regular_price::numeric THEN TRUE
    WHEN p.price IS NOT NULL
         AND p.original_price IS NOT NULL
         AND p.price::numeric < p.original_price::numeric THEN TRUE
    ELSE FALSE
  END AS is_on_special,

  -- Compute special type
  CASE
    WHEN p.name ILIKE '%Special Offer%' THEN 'special_offer'
    WHEN p.sale_price IS NOT NULL
         AND p.regular_price IS NOT NULL
         AND p.sale_price::numeric < p.regular_price::numeric THEN 'percent_off'
    ELSE NULL
  END AS computed_special_type,

  -- Compute discount percentage
  CASE
    WHEN p.sale_price IS NOT NULL
         AND p.regular_price IS NOT NULL
         AND p.regular_price::numeric > 0
    THEN ROUND((1 - p.sale_price::numeric / p.regular_price::numeric) * 100, 0)
    ELSE NULL
  END AS computed_discount_percent

FROM products p
WHERE p.store_id = :store_id;
```

### Special Detection Rules (Priority Order)

1. **Name Contains "Special Offer"**: `name ILIKE '%Special Offer%'`
   - Type: `special_offer`
   - Badge: "Special"

2. **Price Discount (sale < regular)**: `sale_price < regular_price`
   - Type: `percent_off`
   - Badge: Computed as "X% OFF"

3. **Price Discount (current < original)**: `price < original_price`
   - Type: `percent_off`
   - Badge: Computed as "X% OFF"

4. **Metadata Offers** (future): `metadata->'offers' IS NOT NULL`
   - Parse offer type from metadata JSON
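
Expressed in the API layer, the rules above might look like the following sketch (the `ProductRow` shape and badge strings are illustrative; rules 1–3 mirror the SQL, and rule 4 is left as a placeholder):

```typescript
// Sketch: apply the detection rules in priority order (types/names are illustrative)
interface ProductRow {
  name: string;
  price?: string | null;
  original_price?: string | null;
  sale_price?: string | null;
  regular_price?: string | null;
  metadata?: { offers?: unknown } | null;
}

interface SpecialInfo {
  isOnSpecial: boolean;
  specialType: 'special_offer' | 'percent_off' | null;
  badge: string | null;
}

function classifySpecial(p: ProductRow): SpecialInfo {
  // Helper: "X% OFF" badge if the discounted price is actually lower
  const pctBadge = (discounted: string, base: string): string | null => {
    const d = parseFloat(discounted);
    const b = parseFloat(base);
    if (!(b > 0) || !(d < b)) return null;
    return `${Math.round((1 - d / b) * 100)}% OFF`;
  };

  // Rule 1: name contains "Special Offer"
  if (/special offer/i.test(p.name)) {
    return { isOnSpecial: true, specialType: 'special_offer', badge: 'Special' };
  }

  // Rule 2: sale_price < regular_price
  if (p.sale_price && p.regular_price) {
    const badge = pctBadge(p.sale_price, p.regular_price);
    if (badge) return { isOnSpecial: true, specialType: 'percent_off', badge };
  }

  // Rule 3: price < original_price
  if (p.price && p.original_price) {
    const badge = pctBadge(p.price, p.original_price);
    if (badge) return { isOnSpecial: true, specialType: 'percent_off', badge };
  }

  // Rule 4 (future): metadata->'offers' -- not implemented yet

  return { isOnSpecial: false, specialType: null, badge: null };
}
```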

### Clean Product Name

Strip "Special Offer" from display name:

```typescript
function cleanProductName(rawName: string): string {
  return rawName
    .replace(/Special Offer$/i, '')
    .replace(/\s+$/, '')  // Trim trailing whitespace
    .trim();
}
```

### API Specials Endpoint

```typescript
// GET /api/stores/:store_key/specials
async function getStoreSpecials(storeKey: string, options: SpecialsOptions) {
  const query = `
    WITH specials AS (
      SELECT
        p.*,
        -- Detect special
        CASE
          WHEN p.name ILIKE '%Special Offer%' THEN TRUE
          WHEN p.sale_price::numeric < p.regular_price::numeric THEN TRUE
          ELSE FALSE
        END AS is_on_special,

        -- Compute discount
        CASE
          WHEN p.sale_price IS NOT NULL AND p.regular_price IS NOT NULL
          THEN ROUND((1 - p.sale_price::numeric / p.regular_price::numeric) * 100)
          ELSE NULL
        END AS discount_percent

      FROM products p
      JOIN stores s ON p.store_id = s.id
      WHERE s.store_key = $1
        AND p.in_stock = TRUE
    )
    SELECT * FROM specials
    WHERE is_on_special = TRUE
    ORDER BY discount_percent DESC NULLS LAST
    LIMIT $2 OFFSET $3
  `;

  return db.query(query, [storeKey, options.limit, options.offset]);
}
```
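
For context, wiring this into a route might look like the sketch below (it assumes an Express app, node-postgres-style query results, and a `SpecialsOptions` shape of `{ limit, offset }`; none of these are confirmed by this document):

```typescript
import express from 'express';
// getStoreSpecials is the function defined above

interface SpecialsOptions {
  limit: number;
  offset: number;
}

const router = express.Router();

// GET /api/stores/:store_key/specials?limit=50&offset=0
router.get('/api/stores/:store_key/specials', async (req, res) => {
  const options: SpecialsOptions = {
    limit: Math.min(parseInt(String(req.query.limit ?? '50'), 10), 200),
    offset: parseInt(String(req.query.offset ?? '0'), 10),
  };

  try {
    const result = await getStoreSpecials(req.params.store_key, options);
    res.json({ specials: result.rows, count: result.rowCount });
  } catch (err) {
    res.status(500).json({ error: 'Failed to load specials' });
  }
});

export default router;
```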

---

## 4. Append-Only Data Philosophy

> **Principle**: Every crawl should ADD information, never LOSE it.

### What Append-Only Means

| Action | Allowed | Not Allowed |
|--------|---------|-------------|
| Insert new product | ✅ | - |
| Update product price | ✅ | - |
| Mark product out-of-stock | ✅ | - |
| DELETE product row | ❌ | Never delete |
| TRUNCATE table | ❌ | Never truncate |
| UPDATE to remove data | ❌ | Never null-out existing data |

### Product Lifecycle States

```sql
-- Products are never deleted, only state changes
ALTER TABLE products ADD COLUMN IF NOT EXISTS status VARCHAR(20) DEFAULT 'active';

-- Statuses:
-- 'active'       - Currently in stock or recently seen
-- 'out_of_stock' - Seen but marked out of stock
-- 'stale'        - Not seen in last 3 crawls (likely discontinued)
-- 'archived'     - Manually marked as discontinued

CREATE INDEX idx_products_status ON products(status);
```

### Marking Products Stale (NOT Deleting)

```typescript
// After crawl completes, mark unseen products as stale
async function markStaleProducts(storeId: number, crawlRunId: number) {
  await db.query(`
    UPDATE products
    SET
      status = 'stale',
      updated_at = NOW()
    WHERE store_id = $1
      AND id NOT IN (
        SELECT DISTINCT product_id
        FROM store_product_snapshots
        WHERE crawl_run_id = $2
      )
      AND status = 'active'
      AND last_seen_at < NOW() - INTERVAL '3 days'
  `, [storeId, crawlRunId]);
}
```

### Store Product Snapshots: True Append-Only

The `store_product_snapshots` table is strictly append-only:

```sql
CREATE TABLE store_product_snapshots (
  id SERIAL PRIMARY KEY,
  store_id INTEGER NOT NULL REFERENCES stores(id),
  product_id INTEGER NOT NULL REFERENCES products(id),
  crawl_run_id INTEGER NOT NULL REFERENCES crawl_runs(id),

  -- Snapshot of data at crawl time
  price_cents INTEGER,
  regular_price_cents INTEGER,
  sale_price_cents INTEGER,
  in_stock BOOLEAN NOT NULL,

  -- Computed at crawl time
  is_on_special BOOLEAN NOT NULL DEFAULT FALSE,
  special_type VARCHAR(50),
  discount_percent INTEGER,

  captured_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),

  -- Composite unique: one snapshot per product per crawl
  CONSTRAINT uq_snapshot_product_crawl UNIQUE (product_id, crawl_run_id)
);

-- NO UPDATE or DELETE triggers - this table is INSERT-only
-- For data corrections, insert a new snapshot with corrected flag

CREATE INDEX idx_snapshots_crawl ON store_product_snapshots(crawl_run_id);
CREATE INDEX idx_snapshots_product_time ON store_product_snapshots(product_id, captured_at DESC);
```
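
Because snapshots are never rewritten, historical questions become simple reads. For example, a product's price history over the last 30 days might be queried like this (a sketch, using only the columns defined above):

```sql
-- Sketch: price history for one product over the last 30 days
SELECT
  captured_at,
  price_cents / 100.0      AS price,
  sale_price_cents / 100.0 AS sale_price,
  is_on_special,
  discount_percent
FROM store_product_snapshots
WHERE product_id = :product_id
  AND captured_at >= NOW() - INTERVAL '30 days'
ORDER BY captured_at ASC;
```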

### Crawl Runs Table

Track every crawl execution:

```sql
CREATE TABLE crawl_runs (
  id SERIAL PRIMARY KEY,
  store_id INTEGER NOT NULL REFERENCES stores(id),
  started_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
  completed_at TIMESTAMPTZ,
  status VARCHAR(20) NOT NULL DEFAULT 'running',
  products_found INTEGER,
  products_new INTEGER,
  products_updated INTEGER,
  error_message TEXT,

  -- Scheduling metadata
  trigger_type VARCHAR(20) NOT NULL DEFAULT 'scheduled',  -- 'scheduled', 'manual', 'daily_specials'

  CONSTRAINT chk_crawl_status CHECK (status IN ('running', 'completed', 'failed'))
);

CREATE INDEX idx_crawl_runs_store_time ON crawl_runs(store_id, started_at DESC);
```
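
In the ingestion layer, each crawl would typically open a run before scraping and close it afterwards. A minimal sketch (the function names are illustrative; the columns match the table above):

```typescript
// Sketch: bracket every crawl with a crawl_runs row (function names are assumptions)
async function startCrawlRun(storeId: number, triggerType: string): Promise<number> {
  const result = await db.query(
    `INSERT INTO crawl_runs (store_id, trigger_type) VALUES ($1, $2) RETURNING id`,
    [storeId, triggerType]
  );
  return result.rows[0].id;
}

async function finishCrawlRun(
  crawlRunId: number,
  counts: { found: number; created: number; updated: number }
): Promise<void> {
  await db.query(
    `UPDATE crawl_runs
     SET status = 'completed',
         completed_at = NOW(),
         products_found = $2,
         products_new = $3,
         products_updated = $4
     WHERE id = $1`,
    [crawlRunId, counts.found, counts.created, counts.updated]
  );
}

async function failCrawlRun(crawlRunId: number, err: Error): Promise<void> {
  await db.query(
    `UPDATE crawl_runs SET status = 'failed', completed_at = NOW(), error_message = $2 WHERE id = $1`,
    [crawlRunId, err.message]
  );
}
```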

### Data Correction Pattern

If data needs correction, don't UPDATE the original row; insert a correction record instead:

```sql
CREATE TABLE data_corrections (
  id SERIAL PRIMARY KEY,
  table_name VARCHAR(50) NOT NULL,
  record_id INTEGER NOT NULL,
  field_name VARCHAR(100) NOT NULL,
  old_value JSONB,
  new_value JSONB,
  reason TEXT NOT NULL,
  corrected_by VARCHAR(100) NOT NULL,
  corrected_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
```
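
As an illustration of the pattern (all values below are hypothetical), a mis-parsed price would be recorded as a correction rather than silently overwritten:

```sql
-- Hypothetical example: record a correction for a mis-parsed price
INSERT INTO data_corrections
  (table_name, record_id, field_name, old_value, new_value, reason, corrected_by)
VALUES
  ('products', 1042, 'regular_price',
   '"350.00"'::jsonb, '"35.00"'::jsonb,
   'Decimal point dropped during parsing; verified against store menu',
   'ops@example.com');
```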

---

## 5. Safe Ingestion Patterns

### Upsert Products (Preserving History)

```typescript
async function upsertProduct(storeId: number, crawlRunId: number, product: ScrapedProduct) {
  // 1. Find or create product
  const existing = await db.query(
    `SELECT id, price, regular_price, sale_price FROM products
     WHERE store_id = $1 AND dutchie_product_id = $2`,
    [storeId, product.dutchieId]
  );

  let productId: number;

  if (existing.rows.length === 0) {
    // INSERT new product
    const result = await db.query(`
      INSERT INTO products (
        store_id, dutchie_product_id, name, slug, price, regular_price, sale_price,
        in_stock, first_seen_at, last_seen_at, status
      ) VALUES ($1, $2, $3, $4, $5, $6, $7, TRUE, NOW(), NOW(), 'active')
      RETURNING id
    `, [storeId, product.dutchieId, product.name, product.slug,
        product.price, product.regularPrice, product.salePrice]);
    productId = result.rows[0].id;
  } else {
    // UPDATE existing - only update if values changed, never null-out
    productId = existing.rows[0].id;
    await db.query(`
      UPDATE products SET
        name = COALESCE($2, name),
        price = COALESCE($3, price),
        regular_price = COALESCE($4, regular_price),
        sale_price = COALESCE($5, sale_price),
        in_stock = TRUE,
        last_seen_at = NOW(),
        status = 'active',
        updated_at = NOW()
      WHERE id = $1
    `, [productId, product.name, product.price, product.regularPrice, product.salePrice]);
  }

  // 2. Always create snapshot (append-only)
  const isOnSpecial = detectSpecial(product);
  const discountPercent = computeDiscount(product);

  await db.query(`
    INSERT INTO store_product_snapshots (
      store_id, product_id, crawl_run_id,
      price_cents, regular_price_cents, sale_price_cents,
      in_stock, is_on_special, special_type, discount_percent
    ) VALUES ($1, $2, $3, $4, $5, $6, TRUE, $7, $8, $9)
    ON CONFLICT (product_id, crawl_run_id) DO NOTHING
  `, [
    storeId, productId, crawlRunId,
    toCents(product.price), toCents(product.regularPrice), toCents(product.salePrice),
    isOnSpecial, isOnSpecial ? 'percent_off' : null, discountPercent
  ]);

  return productId;
}

function detectSpecial(product: ScrapedProduct): boolean {
  // Check name for "Special Offer"
  if (product.name?.includes('Special Offer')) return true;

  // Check price discount
  if (product.salePrice && product.regularPrice) {
    return parseFloat(product.salePrice) < parseFloat(product.regularPrice);
  }

  return false;
}

function computeDiscount(product: ScrapedProduct): number | null {
  if (!product.salePrice || !product.regularPrice) return null;

  const sale = parseFloat(product.salePrice);
  const regular = parseFloat(product.regularPrice);

  if (regular <= 0) return null;

  return Math.round((1 - sale / regular) * 100);
}
```
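
The snippet above references a `toCents` helper and a `ScrapedProduct` shape that are not defined in this document; one plausible sketch of both (assumptions, not the repo's actual definitions):

```typescript
// Sketch: assumed supporting type/helper for the upsert above
interface ScrapedProduct {
  dutchieId: string;
  name: string;
  slug: string;
  price?: string | null;          // crawler emits prices as decimal strings, e.g. "35.00"
  regularPrice?: string | null;
  salePrice?: string | null;
}

// Convert a decimal price string to integer cents; null/undefined/unparseable stays null
function toCents(price: string | null | undefined): number | null {
  if (price == null) return null;
  const value = parseFloat(price);
  if (Number.isNaN(value)) return null;
  return Math.round(value * 100);
}
```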

---

## 6. K8s Deployment Configuration

### CronJobs Overview

```yaml
# All CronJobs for scheduling
apiVersion: v1
kind: List
items:
  # 1. Standard 4-hour crawl cycle
  - apiVersion: batch/v1
    kind: CronJob
    metadata:
      name: scraper-4h-00
      namespace: dispensary-scraper
    spec:
      schedule: "0 0,4,8,12,16,20 * * *"
      concurrencyPolicy: Forbid
      successfulJobsHistoryLimit: 3
      failedJobsHistoryLimit: 3
      jobTemplate:
        spec:
          activeDeadlineSeconds: 3600  # 1 hour timeout
          template:
            spec:
              containers:
                - name: scraper
                  image: code.cannabrands.app/creationshop/dispensary-scraper:latest
                  command: ["node", "dist/scripts/run-all-stores.js"]
                  resources:
                    requests:
                      memory: "512Mi"
                      cpu: "250m"
                    limits:
                      memory: "2Gi"
                      cpu: "1000m"
              restartPolicy: OnFailure

  # 2. Daily specials crawl - Arizona (MST, no DST)
  - apiVersion: batch/v1
    kind: CronJob
    metadata:
      name: scraper-daily-mst
      namespace: dispensary-scraper
    spec:
      schedule: "1 7 * * *"  # 12:01 AM MST = 07:01 UTC
      concurrencyPolicy: Forbid
      jobTemplate:
        spec:
          template:
            spec:
              containers:
                - name: scraper
                  command: ["node", "dist/scripts/run-stores-by-timezone.js", "America/Phoenix"]

  # 3. Daily specials crawl - California (PST/PDT)
  - apiVersion: batch/v1
    kind: CronJob
    metadata:
      name: scraper-daily-pst
      namespace: dispensary-scraper
    spec:
      schedule: "1 8 * * *"  # 12:01 AM PST = 08:01 UTC (adjust for DST)
      concurrencyPolicy: Forbid
      jobTemplate:
        spec:
          template:
            spec:
              containers:
                - name: scraper
                  command: ["node", "dist/scripts/run-stores-by-timezone.js", "America/Los_Angeles"]
```

### Monitoring and Alerts

```yaml
# PrometheusRule for scraper monitoring
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: scraper-alerts
  namespace: dispensary-scraper
spec:
  groups:
    - name: scraper.rules
      rules:
        - alert: ScraperJobFailed
          expr: kube_job_status_failed{namespace="dispensary-scraper"} > 0
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Scraper job failed"

        - alert: ScraperMissedSchedule
          expr: time() - kube_cronjob_status_last_successful_time{namespace="dispensary-scraper"} > 18000
          for: 10m
          labels:
            severity: critical
          annotations:
            summary: "Scraper hasn't run successfully in 5+ hours"
```

---

## 7. Summary

| Constraint | Implementation |
|------------|----------------|
| **Frozen Crawler** | No changes to selectors, parsing, or request logic |
| **4-Hour Schedule** | K8s CronJob at 0,4,8,12,16,20 UTC |
| **12:01 AM Specials** | Timezone-specific CronJobs for store local midnight |
| **Specials Detection** | API-layer detection via name pattern + price comparison |
| **Append-Only Data** | Never DELETE; use status flags; `store_product_snapshots` is INSERT-only |
| **Historical Preservation** | All crawls create snapshots; stale products marked, never deleted |

This design ensures we maximize the value of crawler data without risking breakage from crawler modifications.