Fix category-crawler-jobs store lookup query

- Fix column name from s.dutchie_plus_url to s.dutchie_url
- Add availability tracking and product freshness APIs
- Add crawl script for sequential dispensary processing

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Author: Kelly
Date: 2025-12-01 00:07:00 -07:00
Parent: 20a7b69537
Commit: 9d8972aa86
15 changed files with 11604 additions and 42 deletions

docs/CRAWL_OPERATIONS.md (new file, 592 lines)
# Crawl Operations & Data Philosophy
This document defines the operational constraints, scheduling requirements, and data integrity philosophy for the dispensary scraper system.
---
## 1. Frozen Crawler Policy
> **CRITICAL CONSTRAINT**: The crawler code is FROZEN. Do NOT modify any crawler logic.
### What Is Frozen
The following components are read-only and must not be modified:
- **Selectors**: All CSS/XPath selectors for extracting data from Dutchie pages
- **Parsing Logic**: Functions that transform raw HTML into structured data
- **Request Patterns**: URL construction, pagination, API calls to Dutchie
- **Browser Configuration**: Puppeteer settings, user agents, viewport sizes
- **Rate Limiting**: Request delays, retry logic, concurrent request limits
### What CAN Be Modified
You may build around the crawler's output:
| Layer | Allowed Changes |
|-------|-----------------|
| **Scheduling** | CronJobs, run frequency, store queuing |
| **Ingestion** | Post-processing of crawler output before DB insert |
| **API Layer** | Query logic, computed fields, response transformations |
| **Intelligence** | Aggregation tables, metrics computation |
| **Infrastructure** | K8s resources, scaling, monitoring |
### Rationale
The crawler has been stabilized through extensive testing. Changes to selectors or parsing risk:
- Breaking data extraction if Dutchie changes their UI
- Introducing regressions that are hard to detect
- Requiring re-validation across all store types
All improvements must happen in **downstream processing**, not in the crawler itself.
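As a concrete example of the downstream-only rule, derived fields are computed on top of the crawler's output rather than inside it. A minimal sketch, assuming a simplified `ScrapedProduct` shape (the real type ships with the frozen crawler) and an illustrative `enrichProduct` helper that does not exist in the repo:
```typescript
// Simplified stand-in for the crawler's output type (the real ScrapedProduct
// lives in the crawler code and is frozen along with it).
interface ScrapedProduct {
  dutchieId: string;
  name: string;
  price: string | null;
  regularPrice: string | null;
  salePrice: string | null;
}

// Downstream enrichment: derived fields are computed here, never by
// editing the crawler. `EnrichedProduct` is an illustrative name.
interface EnrichedProduct extends ScrapedProduct {
  isOnSpecial: boolean;
  cleanName: string;
}

function enrichProduct(raw: ScrapedProduct): EnrichedProduct {
  const onSale =
    raw.salePrice != null &&
    raw.regularPrice != null &&
    parseFloat(raw.salePrice) < parseFloat(raw.regularPrice);
  return {
    ...raw, // crawler output passes through untouched
    isOnSpecial: raw.name.includes('Special Offer') || onSale,
    cleanName: raw.name.replace(/Special Offer$/i, '').trim(),
  };
}
```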
---
## 2. Crawl Scheduling
### Standard Schedule: Every 4 Hours
Run a full crawl for each store every 4 hours, 24/7.
```yaml
# K8s CronJob: Every 4 hours
apiVersion: batch/v1
kind: CronJob
metadata:
  name: scraper-4h-cycle
  namespace: dispensary-scraper
spec:
  schedule: "0 */4 * * *" # 00:00, 04:00, 08:00, 12:00, 16:00, 20:00 UTC
  concurrencyPolicy: Forbid
  jobTemplate:
    spec:
      template:
        spec:
          containers:
            - name: scraper
              image: code.cannabrands.app/creationshop/dispensary-scraper:latest
              command: ["node", "dist/scripts/run-all-stores.js"]
              env:
                - name: DATABASE_URL
                  valueFrom:
                    secretKeyRef:
                      name: scraper-secrets
                      key: database-url
          restartPolicy: OnFailure
```
### Daily Specials Crawl: 12:01 AM Store Local Time
Dispensaries often update their daily specials at midnight. We ensure a crawl happens at 12:01 AM in each store's local timezone.
```yaml
# K8s CronJob: Daily specials at store midnight (example for MST/Arizona)
apiVersion: batch/v1
kind: CronJob
metadata:
  name: scraper-daily-specials-mst
  namespace: dispensary-scraper
spec:
  schedule: "1 7 * * *" # 12:01 AM MST = 07:01 UTC
  concurrencyPolicy: Forbid
  jobTemplate:
    spec:
      template:
        spec:
          containers:
            - name: scraper
              image: code.cannabrands.app/creationshop/dispensary-scraper:latest
              command: ["node", "dist/scripts/run-stores-by-timezone.js", "America/Phoenix"]
          restartPolicy: OnFailure
```
### Timezone-Aware Scheduling
The `stores` table includes timezone information:
```sql
ALTER TABLE stores ADD COLUMN IF NOT EXISTS timezone VARCHAR(50) DEFAULT 'America/Phoenix';
-- Lookup table for common dispensary timezones
-- America/Phoenix (Arizona, no DST)
-- America/Los_Angeles (California)
-- America/Denver (Colorado)
-- America/Chicago (Illinois)
```
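For reference, the UTC hour corresponding to a store's local 12:01 AM can be derived from its IANA timezone with the built-in `Intl` API; this is one way the per-timezone schedules could be generated. A sketch (the helper name is illustrative, and the result should be re-evaluated per date for DST zones):
```typescript
// Find the UTC hour at which a store's local clock reads 00:xx, so a
// per-timezone CronJob schedule like "1 7 * * *" can be generated.
// Uses only the built-in Intl API; utcHourForLocalMidnight is an
// illustrative helper, not an existing repo function.
function utcHourForLocalMidnight(timeZone: string, onDate = new Date()): number {
  const fmt = new Intl.DateTimeFormat('en-US', {
    timeZone,
    hour: '2-digit',
    hourCycle: 'h23', // force 00-23 so midnight formats as "00"
  });
  for (let utcHour = 0; utcHour < 24; utcHour++) {
    const probe = new Date(Date.UTC(
      onDate.getUTCFullYear(), onDate.getUTCMonth(), onDate.getUTCDate(),
      utcHour, 1, // minute 1 -> 12:01 AM local
    ));
    if (fmt.format(probe) === '00') return utcHour;
  }
  throw new Error(`No UTC hour maps to local midnight for ${timeZone}`);
}

// America/Phoenix (UTC-7, no DST) -> 7, matching schedule "1 7 * * *"
console.log(utcHourForLocalMidnight('America/Phoenix'));
```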
### Scripts Required
```
/backend/src/scripts/
├── run-all-stores.ts # Run crawl for all enabled stores
├── run-stores-by-timezone.ts # Run crawl for stores in a specific timezone
└── scheduler.ts # Orchestrates CronJob dispatch
```
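A sketch of what `run-stores-by-timezone.ts` amounts to, assuming the `stores` columns above plus an `enabled` flag; `db` and `crawlStore` stand in for the project's pool and the frozen crawler's entry point, so their exact signatures are assumptions:
```typescript
// run-stores-by-timezone.ts (sketch) - crawls every enabled store in one
// IANA timezone, sequentially so stores don't compete for browser resources.
async function runStoresByTimezone(timezone: string): Promise<void> {
  const { rows: stores } = await db.query(
    `SELECT id, store_key FROM stores
     WHERE enabled = TRUE AND timezone = $1
     ORDER BY id`,
    [timezone]
  );
  for (const store of stores) {
    try {
      await crawlStore(store.id); // frozen crawler invoked as-is
    } catch (err) {
      console.error(`Crawl failed for ${store.store_key}:`, err);
      // Continue with remaining stores rather than aborting the batch
    }
  }
}

runStoresByTimezone(process.argv[2] ?? 'America/Phoenix');
```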
---
## 3. Specials Detection Logic
> **Problem**: The Specials tab in the frontend is EMPTY even though products have discounts.
### Root Cause Analysis
Database investigation reveals:
| Metric | Count |
|--------|-------|
| Total products | 1,414 |
| `is_special = true` | 0 |
| Has "Special Offer" in name | 325 |
| Has `sale_price < regular_price` | 4 |
The crawler captures "Special Offer" **embedded in the product name** but doesn't set `is_special = true`.
### Solution: API-Layer Specials Detection
Since the crawler is frozen, detect specials at query time:
```sql
-- Computed is_on_special in API queries
SELECT
  p.*,
  CASE
    WHEN p.name ILIKE '%Special Offer%' THEN TRUE
    WHEN p.sale_price IS NOT NULL
     AND p.regular_price IS NOT NULL
     AND p.sale_price::numeric < p.regular_price::numeric THEN TRUE
    WHEN p.price IS NOT NULL
     AND p.original_price IS NOT NULL
     AND p.price::numeric < p.original_price::numeric THEN TRUE
    ELSE FALSE
  END AS is_on_special,
  -- Compute special type
  CASE
    WHEN p.name ILIKE '%Special Offer%' THEN 'special_offer'
    WHEN p.sale_price IS NOT NULL
     AND p.regular_price IS NOT NULL
     AND p.sale_price::numeric < p.regular_price::numeric THEN 'percent_off'
    ELSE NULL
  END AS computed_special_type,
  -- Compute discount percentage
  CASE
    WHEN p.sale_price IS NOT NULL
     AND p.regular_price IS NOT NULL
     AND p.regular_price::numeric > 0
      THEN ROUND((1 - p.sale_price::numeric / p.regular_price::numeric) * 100, 0)
    ELSE NULL
  END AS computed_discount_percent
FROM products p
WHERE p.store_id = :store_id;
```
### Special Detection Rules (Priority Order)
1. **Name Contains "Special Offer"**: `name ILIKE '%Special Offer%'`
- Type: `special_offer`
- Badge: "Special"
2. **Price Discount (sale < regular)**: `sale_price < regular_price`
- Type: `percent_off`
- Badge: Computed as "X% OFF"
3. **Price Discount (current < original)**: `price < original_price`
- Type: `percent_off`
- Badge: Computed as "X% OFF"
4. **Metadata Offers** (future): `metadata->'offers' IS NOT NULL`
- Parse offer type from metadata JSON
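The badge column above reduces to a small display helper; a sketch, where the `DetectedSpecial` shape is illustrative rather than an existing repo type:
```typescript
// Maps the detected special type onto the badge text from the rules above.
interface DetectedSpecial {
  specialType: 'special_offer' | 'percent_off' | null;
  discountPercent: number | null;
}

function specialBadge(s: DetectedSpecial): string | null {
  switch (s.specialType) {
    case 'special_offer':
      return 'Special';
    case 'percent_off':
      // Fall back to the generic badge when no percentage could be computed
      return s.discountPercent != null ? `${s.discountPercent}% OFF` : 'Special';
    default:
      return null; // Not on special - no badge
  }
}
```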
### Clean Product Name
Strip "Special Offer" from display name:
```typescript
function cleanProductName(rawName: string): string {
  return rawName
    .replace(/Special Offer$/i, '') // Drop the marketing suffix
    .trim();                        // Trim surrounding whitespace
}
```
### API Specials Endpoint
```typescript
// GET /api/stores/:store_key/specials
async function getStoreSpecials(storeKey: string, options: SpecialsOptions) {
  const query = `
    WITH specials AS (
      SELECT
        p.*,
        -- Detect special
        CASE
          WHEN p.name ILIKE '%Special Offer%' THEN TRUE
          WHEN p.sale_price::numeric < p.regular_price::numeric THEN TRUE
          ELSE FALSE
        END AS is_on_special,
        -- Compute discount
        CASE
          WHEN p.sale_price IS NOT NULL AND p.regular_price IS NOT NULL
            THEN ROUND((1 - p.sale_price::numeric / p.regular_price::numeric) * 100)
          ELSE NULL
        END AS discount_percent
      FROM products p
      JOIN stores s ON p.store_id = s.id
      WHERE s.store_key = $1
        AND p.in_stock = TRUE
    )
    SELECT * FROM specials
    WHERE is_on_special = TRUE
    ORDER BY discount_percent DESC NULLS LAST
    LIMIT $2 OFFSET $3
  `;
  return db.query(query, [storeKey, options.limit, options.offset]);
}
```
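Wiring this into the HTTP layer is a thin wrapper around `getStoreSpecials`; a sketch assuming an Express-style router (the registration below is illustrative, not the repo's actual server code):
```typescript
import express from 'express';

const router = express.Router();

// GET /api/stores/:store_key/specials?limit=50&offset=0
router.get('/api/stores/:store_key/specials', async (req, res) => {
  try {
    // Clamp pagination inputs so a bad query can't scan the whole table
    const limit = Math.min(parseInt(String(req.query.limit ?? '50'), 10), 100);
    const offset = parseInt(String(req.query.offset ?? '0'), 10);
    const result = await getStoreSpecials(req.params.store_key, { limit, offset });
    res.json({ specials: result.rows, limit, offset });
  } catch (err) {
    console.error('specials endpoint failed:', err);
    res.status(500).json({ error: 'internal_error' });
  }
});
```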
---
## 4. Append-Only Data Philosophy
> **Principle**: Every crawl should ADD information, never LOSE it.
### What Append-Only Means
| Action | Allowed | Not Allowed |
|--------|---------|-------------|
| Insert new product | ✅ | - |
| Update product price | ✅ | - |
| Mark product out-of-stock | ✅ | - |
| DELETE product row | ❌ | Never delete |
| TRUNCATE table | ❌ | Never truncate |
| UPDATE to remove data | ❌ | Never null-out existing data |
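One way to make the policy hard to violate from application code is to expose only the allowed operations; a sketch of a hypothetical repository wrapper (not existing repo code):
```typescript
interface SnapshotRow {
  storeId: number;
  productId: number;
  crawlRunId: number;
  priceCents: number | null;
  inStock: boolean;
}

// Append-only guard at the application layer: the wrapper exposes inserts,
// but no delete/truncate surface at all. `db` is a stand-in for the pool.
class SnapshotRepository {
  constructor(private db: { query: (sql: string, params?: unknown[]) => Promise<unknown> }) {}

  // Allowed: insert a new snapshot row
  insert(row: SnapshotRow): Promise<unknown> {
    return this.db.query(
      `INSERT INTO store_product_snapshots
         (store_id, product_id, crawl_run_id, price_cents, in_stock)
       VALUES ($1, $2, $3, $4, $5)
       ON CONFLICT (product_id, crawl_run_id) DO NOTHING`,
      [row.storeId, row.productId, row.crawlRunId, row.priceCents, row.inStock]
    );
  }
  // Deliberately no delete(), truncate(), or raw update() methods.
}
```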
### Product Lifecycle States
```sql
-- Products are never deleted, only state changes
ALTER TABLE products ADD COLUMN IF NOT EXISTS status VARCHAR(20) DEFAULT 'active';
-- Statuses:
--   'active'       - Currently in stock or recently seen
--   'out_of_stock' - Seen but marked out of stock
--   'stale'        - Not seen recently (3+ days / several crawls; likely discontinued)
--   'archived'     - Manually marked as discontinued
CREATE INDEX idx_products_status ON products(status);
```
### Marking Products Stale (NOT Deleting)
```typescript
// After crawl completes, mark unseen products as stale
async function markStaleProducts(storeId: number, crawlRunId: number) {
  await db.query(`
    UPDATE products
    SET
      status = 'stale',
      updated_at = NOW()
    WHERE store_id = $1
      AND id NOT IN (
        SELECT DISTINCT product_id
        FROM store_product_snapshots
        WHERE crawl_run_id = $2
      )
      AND status = 'active'
      AND last_seen_at < NOW() - INTERVAL '3 days'
  `, [storeId, crawlRunId]);
}
```
### Store Product Snapshots: True Append-Only
The `store_product_snapshots` table is strictly append-only:
```sql
CREATE TABLE store_product_snapshots (
  id SERIAL PRIMARY KEY,
  store_id INTEGER NOT NULL REFERENCES stores(id),
  product_id INTEGER NOT NULL REFERENCES products(id),
  crawl_run_id INTEGER NOT NULL REFERENCES crawl_runs(id),
  -- Snapshot of data at crawl time
  price_cents INTEGER,
  regular_price_cents INTEGER,
  sale_price_cents INTEGER,
  in_stock BOOLEAN NOT NULL,
  -- Computed at crawl time
  is_on_special BOOLEAN NOT NULL DEFAULT FALSE,
  special_type VARCHAR(50),
  discount_percent INTEGER,
  captured_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
  -- Composite unique: one snapshot per product per crawl
  CONSTRAINT uq_snapshot_product_crawl UNIQUE (product_id, crawl_run_id)
);
-- No UPDATEs or DELETEs are ever issued against this table - it is INSERT-only.
-- For data corrections, insert a new snapshot with a corrected flag.
CREATE INDEX idx_snapshots_crawl ON store_product_snapshots(crawl_run_id);
CREATE INDEX idx_snapshots_product_time ON store_product_snapshots(product_id, captured_at DESC);
```
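Because snapshots are never rewritten, price history is a straight indexed read; a sketch with an illustrative helper name:
```typescript
// Price history for one product, newest first - served entirely from the
// INSERT-only snapshots table via idx_snapshots_product_time.
async function getPriceHistory(productId: number, limit = 30) {
  const { rows } = await db.query(
    `SELECT price_cents, sale_price_cents, in_stock, is_on_special, captured_at
     FROM store_product_snapshots
     WHERE product_id = $1
     ORDER BY captured_at DESC
     LIMIT $2`,
    [productId, limit]
  );
  return rows;
}
```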
### Crawl Runs Table
Track every crawl execution:
```sql
CREATE TABLE crawl_runs (
  id SERIAL PRIMARY KEY,
  store_id INTEGER NOT NULL REFERENCES stores(id),
  started_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
  completed_at TIMESTAMPTZ,
  status VARCHAR(20) NOT NULL DEFAULT 'running',
  products_found INTEGER,
  products_new INTEGER,
  products_updated INTEGER,
  error_message TEXT,
  -- Scheduling metadata
  trigger_type VARCHAR(20) NOT NULL DEFAULT 'scheduled', -- 'scheduled', 'manual', 'daily_specials'
  CONSTRAINT chk_crawl_status CHECK (status IN ('running', 'completed', 'failed'))
);
CREATE INDEX idx_crawl_runs_store_time ON crawl_runs(store_id, started_at DESC);
```
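Each crawl brackets its work with a row here: open as `running`, close as `completed` or `failed`, then run the stale-marking pass from above. A sketch, where `crawlAllProducts` stands in for the frozen crawler's entry point and `upsertProduct` is the ingestion function from section 5 below:
```typescript
// Crawl run lifecycle around one store's crawl.
async function runCrawl(storeId: number, triggerType = 'scheduled'): Promise<void> {
  const { rows } = await db.query(
    `INSERT INTO crawl_runs (store_id, trigger_type) VALUES ($1, $2) RETURNING id`,
    [storeId, triggerType]
  );
  const crawlRunId: number = rows[0].id;
  try {
    const products = await crawlAllProducts(storeId); // frozen crawler, as-is
    for (const product of products) {
      await upsertProduct(storeId, crawlRunId, product);
    }
    await db.query(
      `UPDATE crawl_runs
       SET status = 'completed', completed_at = NOW(), products_found = $2
       WHERE id = $1`,
      [crawlRunId, products.length]
    );
    await markStaleProducts(storeId, crawlRunId); // mark, never delete
  } catch (err) {
    await db.query(
      `UPDATE crawl_runs
       SET status = 'failed', completed_at = NOW(), error_message = $2
       WHERE id = $1`,
      [crawlRunId, String(err)]
    );
    throw err;
  }
}
```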
### Data Correction Pattern
If data needs correction, don't UPDATE - insert a correction record:
```sql
CREATE TABLE data_corrections (
  id SERIAL PRIMARY KEY,
  table_name VARCHAR(50) NOT NULL,
  record_id INTEGER NOT NULL,
  field_name VARCHAR(100) NOT NULL,
  old_value JSONB,
  new_value JSONB,
  reason TEXT NOT NULL,
  corrected_by VARCHAR(100) NOT NULL,
  corrected_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
```
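Recording a correction is then a plain insert into the audit table; a sketch with an illustrative helper name:
```typescript
// Record a correction in the audit table instead of silently rewriting history.
async function recordCorrection(opts: {
  tableName: string;
  recordId: number;
  fieldName: string;
  oldValue: unknown;
  newValue: unknown;
  reason: string;
  correctedBy: string;
}): Promise<void> {
  await db.query(
    `INSERT INTO data_corrections
       (table_name, record_id, field_name, old_value, new_value, reason, corrected_by)
     VALUES ($1, $2, $3, $4, $5, $6, $7)`,
    [opts.tableName, opts.recordId, opts.fieldName,
     JSON.stringify(opts.oldValue), JSON.stringify(opts.newValue),
     opts.reason, opts.correctedBy]
  );
}
```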
---
## 5. Safe Ingestion Patterns
### Upsert Products (Preserving History)
```typescript
async function upsertProduct(storeId: number, crawlRunId: number, product: ScrapedProduct) {
  // 1. Find or create product
  const existing = await db.query(
    `SELECT id, price, regular_price, sale_price FROM products
     WHERE store_id = $1 AND dutchie_product_id = $2`,
    [storeId, product.dutchieId]
  );
  let productId: number;
  if (existing.rows.length === 0) {
    // INSERT new product
    const result = await db.query(`
      INSERT INTO products (
        store_id, dutchie_product_id, name, slug, price, regular_price, sale_price,
        in_stock, first_seen_at, last_seen_at, status
      ) VALUES ($1, $2, $3, $4, $5, $6, $7, TRUE, NOW(), NOW(), 'active')
      RETURNING id
    `, [storeId, product.dutchieId, product.name, product.slug,
        product.price, product.regularPrice, product.salePrice]);
    productId = result.rows[0].id;
  } else {
    // UPDATE existing - COALESCE keeps prior values when the crawl
    // returns null, so existing data is never nulled out
    productId = existing.rows[0].id;
    await db.query(`
      UPDATE products SET
        name = COALESCE($2, name),
        price = COALESCE($3, price),
        regular_price = COALESCE($4, regular_price),
        sale_price = COALESCE($5, sale_price),
        in_stock = TRUE,
        last_seen_at = NOW(),
        status = 'active',
        updated_at = NOW()
      WHERE id = $1
    `, [productId, product.name, product.price, product.regularPrice, product.salePrice]);
  }
  // 2. Always create snapshot (append-only)
  const isOnSpecial = detectSpecial(product);
  const specialType = detectSpecialType(product);
  const discountPercent = computeDiscount(product);
  await db.query(`
    INSERT INTO store_product_snapshots (
      store_id, product_id, crawl_run_id,
      price_cents, regular_price_cents, sale_price_cents,
      in_stock, is_on_special, special_type, discount_percent
    ) VALUES ($1, $2, $3, $4, $5, $6, TRUE, $7, $8, $9)
    ON CONFLICT (product_id, crawl_run_id) DO NOTHING
  `, [
    storeId, productId, crawlRunId,
    toCents(product.price), toCents(product.regularPrice), toCents(product.salePrice),
    isOnSpecial, specialType, discountPercent
  ]);
  return productId;
}

function detectSpecial(product: ScrapedProduct): boolean {
  // Check name for "Special Offer"
  if (product.name?.includes('Special Offer')) return true;
  // Check price discount
  if (product.salePrice && product.regularPrice) {
    return parseFloat(product.salePrice) < parseFloat(product.regularPrice);
  }
  return false;
}

function detectSpecialType(product: ScrapedProduct): string | null {
  // Follows the priority order from section 3: name match wins over price discount
  if (product.name?.includes('Special Offer')) return 'special_offer';
  if (detectSpecial(product)) return 'percent_off';
  return null;
}

function computeDiscount(product: ScrapedProduct): number | null {
  if (!product.salePrice || !product.regularPrice) return null;
  const sale = parseFloat(product.salePrice);
  const regular = parseFloat(product.regularPrice);
  if (regular <= 0) return null;
  return Math.round((1 - sale / regular) * 100);
}

function toCents(price: string | null | undefined): number | null {
  // Store prices as integer cents; null stays null rather than becoming 0
  if (price == null) return null;
  const n = parseFloat(price);
  return Number.isNaN(n) ? null : Math.round(n * 100);
}
```
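A quick check of the helpers on a representative row (values invented for illustration):
```typescript
// Worked example: "Special Offer" in the name plus a $50 -> $40 price cut.
const sample = {
  dutchieId: 'abc123', // illustrative values, not real data
  name: 'Blue Dream 3.5g Special Offer',
  slug: 'blue-dream-3-5g',
  price: '40.00',
  regularPrice: '50.00',
  salePrice: '40.00',
};

console.log(detectSpecial(sample as ScrapedProduct));     // true - name match
console.log(detectSpecialType(sample as ScrapedProduct)); // 'special_offer' - name wins over price
console.log(computeDiscount(sample as ScrapedProduct));   // 20 - (1 - 40/50) * 100
```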
---
## 6. K8s Deployment Configuration
### CronJobs Overview
```yaml
# All CronJobs for scheduling
apiVersion: v1
kind: List
items:
  # 1. Standard 4-hour crawl cycle
  - apiVersion: batch/v1
    kind: CronJob
    metadata:
      name: scraper-4h-00
      namespace: dispensary-scraper
    spec:
      schedule: "0 0,4,8,12,16,20 * * *"
      concurrencyPolicy: Forbid
      successfulJobsHistoryLimit: 3
      failedJobsHistoryLimit: 3
      jobTemplate:
        spec:
          activeDeadlineSeconds: 3600 # 1 hour timeout
          template:
            spec:
              containers:
                - name: scraper
                  image: code.cannabrands.app/creationshop/dispensary-scraper:latest
                  command: ["node", "dist/scripts/run-all-stores.js"]
                  resources:
                    requests:
                      memory: "512Mi"
                      cpu: "250m"
                    limits:
                      memory: "2Gi"
                      cpu: "1000m"
              restartPolicy: OnFailure
  # 2. Daily specials crawl - Arizona (MST, no DST)
  - apiVersion: batch/v1
    kind: CronJob
    metadata:
      name: scraper-daily-mst
      namespace: dispensary-scraper
    spec:
      schedule: "1 7 * * *" # 12:01 AM MST = 07:01 UTC
      concurrencyPolicy: Forbid
      jobTemplate:
        spec:
          template:
            spec:
              containers:
                - name: scraper
                  command: ["node", "dist/scripts/run-stores-by-timezone.js", "America/Phoenix"]
  # 3. Daily specials crawl - California (PST/PDT)
  - apiVersion: batch/v1
    kind: CronJob
    metadata:
      name: scraper-daily-pst
      namespace: dispensary-scraper
    spec:
      schedule: "1 8 * * *" # 12:01 AM PST = 08:01 UTC (adjust for DST)
      concurrencyPolicy: Forbid
      jobTemplate:
        spec:
          template:
            spec:
              containers:
                - name: scraper
                  command: ["node", "dist/scripts/run-stores-by-timezone.js", "America/Los_Angeles"]
```
### Monitoring and Alerts
```yaml
# PrometheusRule for scraper monitoring
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: scraper-alerts
  namespace: dispensary-scraper
spec:
  groups:
    - name: scraper.rules
      rules:
        - alert: ScraperJobFailed
          expr: kube_job_status_failed{namespace="dispensary-scraper"} > 0
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Scraper job failed"
        - alert: ScraperMissedSchedule
          expr: time() - kube_cronjob_status_last_successful_time{namespace="dispensary-scraper"} > 18000
          for: 10m
          labels:
            severity: critical
          annotations:
            summary: "Scraper hasn't run successfully in 5+ hours"
```
---
## 7. Summary
| Constraint | Implementation |
|------------|----------------|
| **Frozen Crawler** | No changes to selectors, parsing, or request logic |
| **4-Hour Schedule** | K8s CronJob at 0,4,8,12,16,20 UTC |
| **12:01 AM Specials** | Timezone-specific CronJobs for store local midnight |
| **Specials Detection** | API-layer detection via name pattern + price comparison |
| **Append-Only Data** | Never DELETE; use status flags; `store_product_snapshots` is INSERT-only |
| **Historical Preservation** | All crawls create snapshots; stale products marked, never deleted |
This design ensures we maximize the value of crawler data without risking breakage from crawler modifications.