
Crawl Operations & Data Philosophy

This document defines the operational constraints, scheduling requirements, and data integrity philosophy for the dispensary scraper system.


1. Frozen Crawler Policy

CRITICAL CONSTRAINT: The crawler code is FROZEN. Do NOT modify any crawler logic.

What Is Frozen

The following components are read-only and must not be modified:

  • Selectors: All CSS/XPath selectors for extracting data from Dutchie pages
  • Parsing Logic: Functions that transform raw HTML into structured data
  • Request Patterns: URL construction, pagination, API calls to Dutchie
  • Browser Configuration: Puppeteer settings, user agents, viewport sizes
  • Rate Limiting: Request delays, retry logic, concurrent request limits

What CAN Be Modified

You may build around the crawler's output:

| Layer | Allowed Changes |
|---|---|
| Scheduling | CronJobs, run frequency, store queuing |
| Ingestion | Post-processing of crawler output before DB insert |
| API Layer | Query logic, computed fields, response transformations |
| Intelligence | Aggregation tables, metrics computation |
| Infrastructure | K8s resources, scaling, monitoring |

Rationale

The crawler has been stabilized through extensive testing. Changes to selectors or parsing risk:

  • Breaking data extraction if Dutchie changes their UI
  • Introducing regressions that are hard to detect
  • Requiring re-validation across all store types

All improvements must happen in downstream processing, not in the crawler itself.
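As a concrete illustration, something as small as whitespace normalization belongs in an ingestion-side wrapper that consumes the crawler's output untouched. A minimal TypeScript sketch (the ScrapedProduct shape mirrors the fields used in Section 5; the wrapper name is illustrative):

// Ingestion-layer wrapper: the crawler's output is read-only input, and any
// normalization happens here, downstream of the frozen code.
interface ScrapedProduct {
  dutchieId: string;
  name: string;
  price?: string;
  regularPrice?: string;
  salePrice?: string;
}

function normalizeScrapedProduct(raw: ScrapedProduct): ScrapedProduct {
  return {
    ...raw,
    name: raw.name.replace(/\s+/g, ' ').trim(),  // collapse stray whitespace only
  };
}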


2. Crawl Scheduling

Standard Schedule: Every 4 Hours

Run a full crawl for each store every 4 hours, 24/7.

# K8s CronJob: Every 4 hours
apiVersion: batch/v1
kind: CronJob
metadata:
  name: scraper-4h-cycle
  namespace: dispensary-scraper
spec:
  schedule: "0 */4 * * *"  # 00:00, 04:00, 08:00, 12:00, 16:00, 20:00 UTC
  concurrencyPolicy: Forbid
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: scraper
            image: code.cannabrands.app/creationshop/dispensary-scraper:latest
            command: ["node", "dist/scripts/run-all-stores.js"]
            env:
            - name: DATABASE_URL
              valueFrom:
                secretKeyRef:
                  name: scraper-secrets
                  key: database-url
          restartPolicy: OnFailure

Daily Specials Crawl: 12:01 AM Store Local Time

Dispensaries often update their daily specials at midnight. We ensure a crawl happens at 12:01 AM in each store's local timezone.

# K8s CronJob: Daily specials at store midnight (example for MST/Arizona)
apiVersion: batch/v1
kind: CronJob
metadata:
  name: scraper-daily-specials-mst
  namespace: dispensary-scraper
spec:
  schedule: "1 7 * * *"  # 12:01 AM MST = 07:01 UTC
  concurrencyPolicy: Forbid
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: scraper
            image: code.cannabrands.app/creationshop/dispensary-scraper:latest
            command: ["node", "dist/scripts/run-stores-by-timezone.js", "America/Phoenix"]
          restartPolicy: OnFailure

Timezone-Aware Scheduling

The stores table includes timezone information:

ALTER TABLE stores ADD COLUMN IF NOT EXISTS timezone VARCHAR(50) DEFAULT 'America/Phoenix';

-- Lookup table for common dispensary timezones
-- America/Phoenix (Arizona, no DST)
-- America/Los_Angeles (California)
-- America/Denver (Colorado)
-- America/Chicago (Illinois)

Scripts Required

/backend/src/scripts/
├── run-all-stores.ts       # Run crawl for all enabled stores
├── run-stores-by-timezone.ts  # Run crawl for stores in a specific timezone
└── scheduler.ts            # Orchestrates CronJob dispatch
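A minimal sketch of run-stores-by-timezone.ts, assuming the timezone column added above, an enabled flag on stores, and a crawlStore() entry point exported by the frozen crawler (the import paths and names are illustrative):

// run-stores-by-timezone.ts - crawl every enabled store in one timezone, sequentially.
import { db } from '../db';               // shared pg pool (assumed module)
import { crawlStore } from '../crawler';  // frozen crawler entry point (assumed name)

async function main(): Promise<void> {
  const timezone = process.argv[2];
  if (!timezone) {
    console.error('Usage: node run-stores-by-timezone.js <IANA timezone>');
    process.exit(1);
  }

  const { rows: stores } = await db.query(
    `SELECT id, store_key FROM stores WHERE enabled = TRUE AND timezone = $1 ORDER BY id`,
    [timezone]
  );

  // One store at a time: the crawler's own rate limiting stays in control.
  for (const store of stores) {
    try {
      await crawlStore(store.id);
    } catch (err) {
      console.error(`Crawl failed for ${store.store_key}`, err);
    }
  }
}

main().then(() => process.exit(0), (err) => { console.error(err); process.exit(1); });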

3. Specials Detection Logic

Problem: The Specials tab in the frontend is EMPTY even though products have discounts.

Root Cause Analysis

Database investigation reveals:

| Metric | Count |
|---|---|
| Total products | 1,414 |
| is_special = true | 0 |
| Has "Special Offer" in name | 325 |
| Has sale_price < regular_price | 4 |

The crawler captures "Special Offer" embedded in the product name but doesn't set is_special = true.

Solution: API-Layer Specials Detection

Since the crawler is frozen, detect specials at query time:

-- Computed is_on_special in API queries
SELECT
  p.*,
  CASE
    WHEN p.name ILIKE '%Special Offer%' THEN TRUE
    WHEN p.sale_price IS NOT NULL
         AND p.regular_price IS NOT NULL
         AND p.sale_price::numeric < p.regular_price::numeric THEN TRUE
    WHEN p.price IS NOT NULL
         AND p.original_price IS NOT NULL
         AND p.price::numeric < p.original_price::numeric THEN TRUE
    ELSE FALSE
  END AS is_on_special,

  -- Compute special type
  CASE
    WHEN p.name ILIKE '%Special Offer%' THEN 'special_offer'
    WHEN p.sale_price IS NOT NULL
         AND p.regular_price IS NOT NULL
         AND p.sale_price::numeric < p.regular_price::numeric THEN 'percent_off'
    ELSE NULL
  END AS computed_special_type,

  -- Compute discount percentage
  CASE
    WHEN p.sale_price IS NOT NULL
         AND p.regular_price IS NOT NULL
         AND p.regular_price::numeric > 0
    THEN ROUND((1 - p.sale_price::numeric / p.regular_price::numeric) * 100, 0)
    ELSE NULL
  END AS computed_discount_percent

FROM products p
WHERE p.store_id = :store_id;

Special Detection Rules (Priority Order)

  1. Name Contains "Special Offer": name ILIKE '%Special Offer%'
    • Type: special_offer
    • Badge: "Special"
  2. Price Discount (sale < regular): sale_price < regular_price
    • Type: percent_off
    • Badge: Computed as "X% OFF"
  3. Price Discount (current < original): price < original_price
    • Type: percent_off
    • Badge: Computed as "X% OFF"
  4. Metadata Offers (future): metadata->'offers' IS NOT NULL
    • Parse offer type from metadata JSON
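Applying the rules above in priority order translates into a small classifier. This is a sketch only, using the same scraped fields referenced in Section 5; the metadata rule is noted but not implemented here:

type ComputedSpecialType = 'special_offer' | 'percent_off' | null;

function classifySpecial(p: {
  name: string;
  price?: string;
  originalPrice?: string;
  regularPrice?: string;
  salePrice?: string;
}): ComputedSpecialType {
  // Rule 1: name contains "Special Offer"
  if (/special offer/i.test(p.name)) return 'special_offer';

  // Rule 2: sale price below regular price
  if (p.salePrice && p.regularPrice &&
      parseFloat(p.salePrice) < parseFloat(p.regularPrice)) return 'percent_off';

  // Rule 3: current price below original price
  if (p.price && p.originalPrice &&
      parseFloat(p.price) < parseFloat(p.originalPrice)) return 'percent_off';

  // Rule 4 (metadata offers) is future work and intentionally not handled here.
  return null;
}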

Clean Product Name

Strip "Special Offer" from display name:

function cleanProductName(rawName: string): string {
  return rawName
    .replace(/Special Offer$/i, '')
    .replace(/\s+$/, '')  // Trim trailing whitespace
    .trim();
}
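For example, a scraped name like "Blue Dream 3.5g Special Offer" becomes "Blue Dream 3.5g".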

API Specials Endpoint

// GET /api/stores/:store_key/specials
async function getStoreSpecials(storeKey: string, options: SpecialsOptions) {
  const query = `
    WITH specials AS (
      SELECT
        p.*,
        -- Detect special
        CASE
          WHEN p.name ILIKE '%Special Offer%' THEN TRUE
          WHEN p.sale_price::numeric < p.regular_price::numeric THEN TRUE
          ELSE FALSE
        END AS is_on_special,

        -- Compute discount
        CASE
          WHEN p.sale_price IS NOT NULL AND p.regular_price IS NOT NULL
               AND p.regular_price::numeric > 0
          THEN ROUND((1 - p.sale_price::numeric / p.regular_price::numeric) * 100)
          ELSE NULL
        END AS discount_percent

      FROM products p
      JOIN stores s ON p.store_id = s.id
      WHERE s.store_key = $1
        AND p.in_stock = TRUE
    )
    SELECT * FROM specials
    WHERE is_on_special = TRUE
    ORDER BY discount_percent DESC NULLS LAST
    LIMIT $2 OFFSET $3
  `;

  return db.query(query, [storeKey, options.limit, options.offset]);
}
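A route handler wrapping this query might look like the sketch below. Express is assumed here purely for illustration; adapt it to whatever HTTP framework the API layer actually uses:

import express from 'express';

const app = express();

// GET /api/stores/:store_key/specials?limit=50&offset=0
app.get('/api/stores/:store_key/specials', async (req, res) => {
  const limit = Math.min(Number(req.query.limit ?? 50), 200);
  const offset = Number(req.query.offset ?? 0);

  try {
    const result = await getStoreSpecials(req.params.store_key, { limit, offset });
    res.json({ specials: result.rows });
  } catch (err) {
    console.error('specials query failed', err);
    res.status(500).json({ error: 'internal_error' });
  }
});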

4. Append-Only Data Philosophy

Principle: Every crawl should ADD information, never LOSE it.

What Append-Only Means

| Action | Allowed | Not Allowed |
|---|---|---|
| Insert new product | Yes | - |
| Update product price | Yes | - |
| Mark product out-of-stock | Yes | - |
| DELETE product row | - | Never delete |
| TRUNCATE table | - | Never truncate |
| UPDATE to remove data | - | Never null-out existing data |

Product Lifecycle States

-- Products are never deleted, only state changes
ALTER TABLE products ADD COLUMN IF NOT EXISTS status VARCHAR(20) DEFAULT 'active';

-- Statuses:
-- 'active'     - Currently in stock or recently seen
-- 'out_of_stock' - Seen but marked out of stock
-- 'stale'      - Not seen in last 3 crawls (likely discontinued)
-- 'archived'   - Manually marked as discontinued

CREATE INDEX idx_products_status ON products(status);

Marking Products Stale (NOT Deleting)

// After crawl completes, mark unseen products as stale
async function markStaleProducts(storeId: number, crawlRunId: number) {
  await db.query(`
    UPDATE products
    SET
      status = 'stale',
      updated_at = NOW()
    WHERE store_id = $1
      AND id NOT IN (
        SELECT DISTINCT product_id
        FROM store_product_snapshots
        WHERE crawl_run_id = $2
      )
      AND status = 'active'
      AND last_seen_at < NOW() - INTERVAL '3 days'
  `, [storeId, crawlRunId]);
}

Store Product Snapshots: True Append-Only

The store_product_snapshots table is strictly append-only:

CREATE TABLE store_product_snapshots (
  id                SERIAL PRIMARY KEY,
  store_id          INTEGER NOT NULL REFERENCES stores(id),
  product_id        INTEGER NOT NULL REFERENCES products(id),
  crawl_run_id      INTEGER NOT NULL REFERENCES crawl_runs(id),

  -- Snapshot of data at crawl time
  price_cents       INTEGER,
  regular_price_cents INTEGER,
  sale_price_cents  INTEGER,
  in_stock          BOOLEAN NOT NULL,

  -- Computed at crawl time
  is_on_special     BOOLEAN NOT NULL DEFAULT FALSE,
  special_type      VARCHAR(50),
  discount_percent  INTEGER,

  captured_at       TIMESTAMPTZ NOT NULL DEFAULT NOW(),

  -- Composite unique: one snapshot per product per crawl
  CONSTRAINT uq_snapshot_product_crawl UNIQUE (product_id, crawl_run_id)
);

-- NO UPDATE or DELETE triggers - this table is INSERT-only
-- For data corrections, insert a new snapshot with corrected flag

CREATE INDEX idx_snapshots_crawl ON store_product_snapshots(crawl_run_id);
CREATE INDEX idx_snapshots_product_time ON store_product_snapshots(product_id, captured_at DESC);
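Because snapshots are never updated or deleted, per-product price history is a plain read over this table. A sketch using the same db helper as the other snippets:

// Price history for one product, newest first - possible only because
// store_product_snapshots is INSERT-only.
async function getPriceHistory(productId: number, limit = 50) {
  const result = await db.query(`
    SELECT captured_at, price_cents, sale_price_cents, is_on_special, discount_percent
    FROM store_product_snapshots
    WHERE product_id = $1
    ORDER BY captured_at DESC
    LIMIT $2
  `, [productId, limit]);
  return result.rows;
}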

Crawl Runs Table

Track every crawl execution:

CREATE TABLE crawl_runs (
  id              SERIAL PRIMARY KEY,
  store_id        INTEGER NOT NULL REFERENCES stores(id),
  started_at      TIMESTAMPTZ NOT NULL DEFAULT NOW(),
  completed_at    TIMESTAMPTZ,
  status          VARCHAR(20) NOT NULL DEFAULT 'running',
  products_found  INTEGER,
  products_new    INTEGER,
  products_updated INTEGER,
  error_message   TEXT,

  -- Scheduling metadata
  trigger_type    VARCHAR(20) NOT NULL DEFAULT 'scheduled',  -- 'scheduled', 'manual', 'daily_specials'

  CONSTRAINT chk_crawl_status CHECK (status IN ('running', 'completed', 'failed'))
);

CREATE INDEX idx_crawl_runs_store_time ON crawl_runs(store_id, started_at DESC);
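Crawl scripts are expected to open a run before crawling and close it afterwards. A minimal sketch of that bookkeeping (function names are illustrative; they are not part of the frozen crawler):

async function startCrawlRun(storeId: number, triggerType: string = 'scheduled'): Promise<number> {
  const result = await db.query(
    `INSERT INTO crawl_runs (store_id, trigger_type) VALUES ($1, $2) RETURNING id`,
    [storeId, triggerType]
  );
  return result.rows[0].id;
}

async function completeCrawlRun(crawlRunId: number, counts: { found: number; created: number; updated: number }) {
  await db.query(`
    UPDATE crawl_runs
    SET status = 'completed', completed_at = NOW(),
        products_found = $2, products_new = $3, products_updated = $4
    WHERE id = $1
  `, [crawlRunId, counts.found, counts.created, counts.updated]);
}

async function failCrawlRun(crawlRunId: number, err: Error) {
  await db.query(
    `UPDATE crawl_runs SET status = 'failed', completed_at = NOW(), error_message = $2 WHERE id = $1`,
    [crawlRunId, err.message]
  );
}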

Data Correction Pattern

If data needs correction, don't UPDATE - insert a correction record:

CREATE TABLE data_corrections (
  id              SERIAL PRIMARY KEY,
  table_name      VARCHAR(50) NOT NULL,
  record_id       INTEGER NOT NULL,
  field_name      VARCHAR(100) NOT NULL,
  old_value       JSONB,
  new_value       JSONB,
  reason          TEXT NOT NULL,
  corrected_by    VARCHAR(100) NOT NULL,
  corrected_at    TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
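A small helper for recording a correction (a sketch; the function name and option shape are illustrative):

async function recordCorrection(correction: {
  tableName: string;
  recordId: number;
  fieldName: string;
  oldValue: unknown;
  newValue: unknown;
  reason: string;
  correctedBy: string;
}) {
  await db.query(`
    INSERT INTO data_corrections
      (table_name, record_id, field_name, old_value, new_value, reason, corrected_by)
    VALUES ($1, $2, $3, $4, $5, $6, $7)
  `, [
    correction.tableName, correction.recordId, correction.fieldName,
    JSON.stringify(correction.oldValue), JSON.stringify(correction.newValue),
    correction.reason, correction.correctedBy,
  ]);
}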

5. Safe Ingestion Patterns

Upsert Products (Preserving History)

async function upsertProduct(storeId: number, crawlRunId: number, product: ScrapedProduct) {
  // 1. Find or create product
  const existing = await db.query(
    `SELECT id, price, regular_price, sale_price FROM products
     WHERE store_id = $1 AND dutchie_product_id = $2`,
    [storeId, product.dutchieId]
  );

  let productId: number;

  if (existing.rows.length === 0) {
    // INSERT new product
    const result = await db.query(`
      INSERT INTO products (
        store_id, dutchie_product_id, name, slug, price, regular_price, sale_price,
        in_stock, first_seen_at, last_seen_at, status
      ) VALUES ($1, $2, $3, $4, $5, $6, $7, TRUE, NOW(), NOW(), 'active')
      RETURNING id
    `, [storeId, product.dutchieId, product.name, product.slug,
        product.price, product.regularPrice, product.salePrice]);
    productId = result.rows[0].id;
  } else {
    // UPDATE existing - only update if values changed, never null-out
    productId = existing.rows[0].id;
    await db.query(`
      UPDATE products SET
        name = COALESCE($2, name),
        price = COALESCE($3, price),
        regular_price = COALESCE($4, regular_price),
        sale_price = COALESCE($5, sale_price),
        in_stock = TRUE,
        last_seen_at = NOW(),
        status = 'active',
        updated_at = NOW()
      WHERE id = $1
    `, [productId, product.name, product.price, product.regularPrice, product.salePrice]);
  }

  // 2. Always create snapshot (append-only)
  const isOnSpecial = detectSpecial(product);
  const discountPercent = computeDiscount(product);

  await db.query(`
    INSERT INTO store_product_snapshots (
      store_id, product_id, crawl_run_id,
      price_cents, regular_price_cents, sale_price_cents,
      in_stock, is_on_special, special_type, discount_percent
    ) VALUES ($1, $2, $3, $4, $5, $6, TRUE, $7, $8, $9)
    ON CONFLICT (product_id, crawl_run_id) DO NOTHING
  `, [
    storeId, productId, crawlRunId,
    toCents(product.price), toCents(product.regularPrice), toCents(product.salePrice),
    isOnSpecial, isOnSpecial ? 'percent_off' : null, discountPercent
  ]);

  return productId;
}

function detectSpecial(product: ScrapedProduct): boolean {
  // Check name for "Special Offer"
  if (product.name?.includes('Special Offer')) return true;

  // Check price discount
  if (product.salePrice && product.regularPrice) {
    return parseFloat(product.salePrice) < parseFloat(product.regularPrice);
  }

  return false;
}

function computeDiscount(product: ScrapedProduct): number | null {
  if (!product.salePrice || !product.regularPrice) return null;

  const sale = parseFloat(product.salePrice);
  const regular = parseFloat(product.regularPrice);

  if (regular <= 0) return null;

  return Math.round((1 - sale / regular) * 100);
}
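upsertProduct() above calls toCents(), which isn't defined in this document. A minimal version consistent with storing prices as integer cents (mapping null or invalid input to NULL is an assumption):

function toCents(price: string | number | null | undefined): number | null {
  if (price === null || price === undefined || price === '') return null;
  const value = typeof price === 'number' ? price : parseFloat(price);
  if (Number.isNaN(value)) return null;
  return Math.round(value * 100);
}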

6. K8s Deployment Configuration

CronJobs Overview

# All CronJobs for scheduling
apiVersion: v1
kind: List
items:
  # 1. Standard 4-hour crawl cycle
  - apiVersion: batch/v1
    kind: CronJob
    metadata:
      name: scraper-4h-00
      namespace: dispensary-scraper
    spec:
      schedule: "0 0,4,8,12,16,20 * * *"
      concurrencyPolicy: Forbid
      successfulJobsHistoryLimit: 3
      failedJobsHistoryLimit: 3
      jobTemplate:
        spec:
          activeDeadlineSeconds: 3600  # 1 hour timeout
          template:
            spec:
              containers:
              - name: scraper
                image: code.cannabrands.app/creationshop/dispensary-scraper:latest
                command: ["node", "dist/scripts/run-all-stores.js"]
                resources:
                  requests:
                    memory: "512Mi"
                    cpu: "250m"
                  limits:
                    memory: "2Gi"
                    cpu: "1000m"
              restartPolicy: OnFailure

  # 2. Daily specials crawl - Arizona (MST, no DST)
  - apiVersion: batch/v1
    kind: CronJob
    metadata:
      name: scraper-daily-mst
      namespace: dispensary-scraper
    spec:
      schedule: "1 7 * * *"  # 12:01 AM MST = 07:01 UTC
      concurrencyPolicy: Forbid
      jobTemplate:
        spec:
          template:
            spec:
              containers:
              - name: scraper
                image: code.cannabrands.app/creationshop/dispensary-scraper:latest
                command: ["node", "dist/scripts/run-stores-by-timezone.js", "America/Phoenix"]
              restartPolicy: OnFailure

  # 3. Daily specials crawl - California (PST/PDT)
  - apiVersion: batch/v1
    kind: CronJob
    metadata:
      name: scraper-daily-pst
      namespace: dispensary-scraper
    spec:
      schedule: "1 8 * * *"  # 12:01 AM PST = 08:01 UTC (adjust for DST)
      concurrencyPolicy: Forbid
      jobTemplate:
        spec:
          template:
            spec:
              containers:
              - name: scraper
                image: code.cannabrands.app/creationshop/dispensary-scraper:latest
                command: ["node", "dist/scripts/run-stores-by-timezone.js", "America/Los_Angeles"]
              restartPolicy: OnFailure

Monitoring and Alerts

# PrometheusRule for scraper monitoring
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: scraper-alerts
  namespace: dispensary-scraper
spec:
  groups:
  - name: scraper.rules
    rules:
    - alert: ScraperJobFailed
      expr: kube_job_status_failed{namespace="dispensary-scraper"} > 0
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "Scraper job failed"

    - alert: ScraperMissedSchedule
      expr: time() - kube_cronjob_status_last_successful_time{namespace="dispensary-scraper"} > 18000
      for: 10m
      labels:
        severity: critical
      annotations:
        summary: "Scraper hasn't run successfully in 5+ hours"

7. Summary

| Constraint | Implementation |
|---|---|
| Frozen Crawler | No changes to selectors, parsing, or request logic |
| 4-Hour Schedule | K8s CronJob at 0,4,8,12,16,20 UTC |
| 12:01 AM Specials | Timezone-specific CronJobs for store local midnight |
| Specials Detection | API-layer detection via name pattern + price comparison |
| Append-Only Data | Never DELETE; use status flags; store_product_snapshots is INSERT-only |
| Historical Preservation | All crawls create snapshots; stale products marked, never deleted |

This design ensures we maximize the value of crawler data without risking breakage from crawler modifications.