# Crawl Operations & Data Philosophy

This document defines the operational constraints, scheduling requirements, and data integrity philosophy for the dispensary scraper system.

---

## 1. Frozen Crawler Policy

> **CRITICAL CONSTRAINT**: The crawler code is FROZEN. Do NOT modify any crawler logic.

### What Is Frozen

The following components are read-only and must not be modified:

- **Selectors**: All CSS/XPath selectors for extracting data from Dutchie pages
- **Parsing Logic**: Functions that transform raw HTML into structured data
- **Request Patterns**: URL construction, pagination, API calls to Dutchie
- **Browser Configuration**: Puppeteer settings, user agents, viewport sizes
- **Rate Limiting**: Request delays, retry logic, concurrent request limits

### What CAN Be Modified

You may build around the crawler's output:

| Layer | Allowed Changes |
|-------|-----------------|
| **Scheduling** | CronJobs, run frequency, store queuing |
| **Ingestion** | Post-processing of crawler output before DB insert |
| **API Layer** | Query logic, computed fields, response transformations |
| **Intelligence** | Aggregation tables, metrics computation |
| **Infrastructure** | K8s resources, scaling, monitoring |

### Rationale

The crawler has been stabilized through extensive testing. Changes to selectors or parsing risk:

- Breaking data extraction if Dutchie changes their UI
- Introducing regressions that are hard to detect
- Requiring re-validation across all store types

All improvements must happen in **downstream processing**, not in the crawler itself.

---

## 2. Crawl Scheduling

### Standard Schedule: Every 4 Hours

Run a full crawl for each store every 4 hours, 24/7.

```yaml
# K8s CronJob: Every 4 hours
apiVersion: batch/v1
kind: CronJob
metadata:
  name: scraper-4h-cycle
  namespace: dispensary-scraper
spec:
  schedule: "0 */4 * * *"  # 00:00, 04:00, 08:00, 12:00, 16:00, 20:00 UTC
  concurrencyPolicy: Forbid
  jobTemplate:
    spec:
      template:
        spec:
          containers:
            - name: scraper
              image: code.cannabrands.app/creationshop/dispensary-scraper:latest
              command: ["node", "dist/scripts/run-all-stores.js"]
              env:
                - name: DATABASE_URL
                  valueFrom:
                    secretKeyRef:
                      name: scraper-secrets
                      key: database-url
          restartPolicy: OnFailure
```

### Daily Specials Crawl: 12:01 AM Store Local Time

Dispensaries often update their daily specials at midnight. We ensure a crawl happens at 12:01 AM in each store's local timezone.

```yaml
# K8s CronJob: Daily specials at store midnight (example for MST/Arizona)
apiVersion: batch/v1
kind: CronJob
metadata:
  name: scraper-daily-specials-mst
  namespace: dispensary-scraper
spec:
  schedule: "1 7 * * *"  # 12:01 AM MST = 07:01 UTC
  concurrencyPolicy: Forbid
  jobTemplate:
    spec:
      template:
        spec:
          containers:
            - name: scraper
              image: code.cannabrands.app/creationshop/dispensary-scraper:latest
              command: ["node", "dist/scripts/run-stores-by-timezone.js", "America/Phoenix"]
          restartPolicy: OnFailure
```

### Timezone-Aware Scheduling

The stores table includes timezone information:

```sql
ALTER TABLE stores ADD COLUMN IF NOT EXISTS timezone VARCHAR(50) DEFAULT 'America/Phoenix';

-- Common dispensary timezones:
--   America/Phoenix      (Arizona, no DST)
--   America/Los_Angeles  (California)
--   America/Denver       (Colorado)
--   America/Chicago      (Illinois)
```

### Scripts Required

```
/backend/src/scripts/
├── run-all-stores.ts          # Run crawl for all enabled stores
├── run-stores-by-timezone.ts  # Run crawl for stores in a specific timezone
└── scheduler.ts               # Orchestrates CronJob dispatch
```
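The scripts themselves are not shown in this document. As a minimal sketch, `run-stores-by-timezone.ts` might look like the following, assuming a shared `db` helper and that the frozen crawler exposes a callable entry point (`runCrawlForStore` is a hypothetical name for it):

```typescript
// run-stores-by-timezone.ts (sketch)
// Assumptions: a `db` helper module and a `runCrawlForStore` entry point
// exported by the frozen crawler; the real names/paths may differ.
import { db } from '../lib/db';
import { runCrawlForStore } from '../crawler';

async function main() {
  const timezone = process.argv[2];
  if (!timezone) {
    console.error('Usage: run-stores-by-timezone <IANA timezone>');
    process.exit(1);
  }

  // Select only enabled stores in the requested timezone
  const { rows } = await db.query(
    `SELECT id, store_key FROM stores WHERE enabled = TRUE AND timezone = $1`,
    [timezone]
  );

  // Run sequentially so the frozen crawler's own rate limiting stays in charge
  for (const store of rows) {
    await runCrawlForStore(store.id, { triggerType: 'daily_specials' });
  }
}

main().catch((err) => {
  console.error(err);
  process.exit(1);
});
```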
---

## 3. Specials Detection Logic

> **Problem**: The Specials tab in the frontend is EMPTY even though products have discounts.

### Root Cause Analysis

Database investigation reveals:

| Metric | Count |
|--------|-------|
| Total products | 1,414 |
| `is_special = true` | 0 |
| Has "Special Offer" in name | 325 |
| Has `sale_price < regular_price` | 4 |

The crawler captures "Special Offer" **embedded in the product name** but doesn't set `is_special = true`.

### Solution: API-Layer Specials Detection

Since the crawler is frozen, detect specials at query time:

```sql
-- Computed is_on_special in API queries
SELECT
  p.*,
  CASE
    WHEN p.name ILIKE '%Special Offer%' THEN TRUE
    WHEN p.sale_price IS NOT NULL AND p.regular_price IS NOT NULL
         AND p.sale_price::numeric < p.regular_price::numeric THEN TRUE
    WHEN p.price IS NOT NULL AND p.original_price IS NOT NULL
         AND p.price::numeric < p.original_price::numeric THEN TRUE
    ELSE FALSE
  END AS is_on_special,
  -- Compute special type
  CASE
    WHEN p.name ILIKE '%Special Offer%' THEN 'special_offer'
    WHEN p.sale_price IS NOT NULL AND p.regular_price IS NOT NULL
         AND p.sale_price::numeric < p.regular_price::numeric THEN 'percent_off'
    ELSE NULL
  END AS computed_special_type,
  -- Compute discount percentage
  CASE
    WHEN p.sale_price IS NOT NULL AND p.regular_price IS NOT NULL
         AND p.regular_price::numeric > 0
    THEN ROUND((1 - p.sale_price::numeric / p.regular_price::numeric) * 100, 0)
    ELSE NULL
  END AS computed_discount_percent
FROM products p
WHERE p.store_id = :store_id;
```

### Special Detection Rules (Priority Order)

1. **Name Contains "Special Offer"**: `name ILIKE '%Special Offer%'`
   - Type: `special_offer`
   - Badge: "Special"
2. **Price Discount (sale < regular)**: `sale_price < regular_price`
   - Type: `percent_off`
   - Badge: Computed as "X% OFF"
3. **Price Discount (current < original)**: `price < original_price`
   - Type: `percent_off`
   - Badge: Computed as "X% OFF"
4. **Metadata Offers** (future): `metadata->'offers' IS NOT NULL`
   - Parse offer type from metadata JSON

### Clean Product Name

Strip "Special Offer" from the display name:

```typescript
function cleanProductName(rawName: string): string {
  // Strip a trailing "Special Offer", plus any separator left behind
  // (e.g. "Blue Dream - Special Offer" -> "Blue Dream")
  return rawName.replace(/\s*[-–|]?\s*Special Offer\s*$/i, '').trim();
}
```

### API Specials Endpoint

```typescript
// GET /api/stores/:store_key/specials
async function getStoreSpecials(storeKey: string, options: SpecialsOptions) {
  const query = `
    WITH specials AS (
      SELECT
        p.*,
        -- Detect special
        CASE
          WHEN p.name ILIKE '%Special Offer%' THEN TRUE
          WHEN p.sale_price::numeric < p.regular_price::numeric THEN TRUE
          ELSE FALSE
        END AS is_on_special,
        -- Compute discount
        CASE
          WHEN p.sale_price IS NOT NULL AND p.regular_price IS NOT NULL
          THEN ROUND((1 - p.sale_price::numeric / p.regular_price::numeric) * 100)
          ELSE NULL
        END AS discount_percent
      FROM products p
      JOIN stores s ON p.store_id = s.id
      WHERE s.store_key = $1
        AND p.in_stock = TRUE
    )
    SELECT * FROM specials
    WHERE is_on_special = TRUE
    ORDER BY discount_percent DESC NULLS LAST
    LIMIT $2 OFFSET $3
  `;
  return db.query(query, [storeKey, options.limit, options.offset]);
}
```
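A minimal sketch of how this handler might be wired into a route. Express is an assumption here, since this document doesn't name an HTTP framework:

```typescript
import express from 'express';
// getStoreSpecials is the function defined above

const app = express();

// GET /api/stores/:store_key/specials?limit=50&offset=0
app.get('/api/stores/:store_key/specials', async (req, res) => {
  // Clamp pagination inputs so a bad query string can't scan the whole table
  const limit = Math.min(Number(req.query.limit) || 50, 200);
  const offset = Math.max(Number(req.query.offset) || 0, 0);
  try {
    const result = await getStoreSpecials(req.params.store_key, { limit, offset });
    res.json({ specials: result.rows, limit, offset });
  } catch (err) {
    res.status(500).json({ error: 'Failed to load specials' });
  }
});

app.listen(3000);
```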
---

## 4. Append-Only Data Philosophy

> **Principle**: Every crawl should ADD information, never LOSE it.

### What Append-Only Means

| Action | Allowed | Not Allowed |
|--------|---------|-------------|
| Insert new product | ✅ | - |
| Update product price | ✅ | - |
| Mark product out-of-stock | ✅ | - |
| DELETE product row | ❌ | Never delete |
| TRUNCATE table | ❌ | Never truncate |
| UPDATE to remove data | ❌ | Never null-out existing data |

### Product Lifecycle States

```sql
-- Products are never deleted, only state changes
ALTER TABLE products ADD COLUMN IF NOT EXISTS status VARCHAR(20) DEFAULT 'active';

-- Statuses:
--   'active'        - Currently in stock or recently seen
--   'out_of_stock'  - Seen but marked out of stock
--   'stale'         - Not seen in recent crawls for 3+ days (likely discontinued)
--   'archived'      - Manually marked as discontinued

CREATE INDEX idx_products_status ON products(status);
```

### Marking Products Stale (NOT Deleting)

```typescript
// After a crawl completes, mark unseen products as stale
async function markStaleProducts(storeId: number, crawlRunId: number) {
  await db.query(`
    UPDATE products
    SET status = 'stale', updated_at = NOW()
    WHERE store_id = $1
      AND id NOT IN (
        SELECT DISTINCT product_id
        FROM store_product_snapshots
        WHERE crawl_run_id = $2
      )
      AND status = 'active'
      AND last_seen_at < NOW() - INTERVAL '3 days'
  `, [storeId, crawlRunId]);
}
```

### Store Product Snapshots: True Append-Only

The `store_product_snapshots` table is strictly append-only:

```sql
CREATE TABLE store_product_snapshots (
  id SERIAL PRIMARY KEY,
  store_id INTEGER NOT NULL REFERENCES stores(id),
  product_id INTEGER NOT NULL REFERENCES products(id),
  crawl_run_id INTEGER NOT NULL REFERENCES crawl_runs(id),

  -- Snapshot of data at crawl time
  price_cents INTEGER,
  regular_price_cents INTEGER,
  sale_price_cents INTEGER,
  in_stock BOOLEAN NOT NULL,

  -- Computed at crawl time
  is_on_special BOOLEAN NOT NULL DEFAULT FALSE,
  special_type VARCHAR(50),
  discount_percent INTEGER,

  captured_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),

  -- Composite unique: one snapshot per product per crawl
  CONSTRAINT uq_snapshot_product_crawl UNIQUE (product_id, crawl_run_id)
);

-- NO UPDATE or DELETE triggers - this table is INSERT-only.
-- For data corrections, insert a new snapshot with a corrected flag.
CREATE INDEX idx_snapshots_crawl ON store_product_snapshots(crawl_run_id);
CREATE INDEX idx_snapshots_product_time ON store_product_snapshots(product_id, captured_at DESC);
```

### Crawl Runs Table

Track every crawl execution:

```sql
CREATE TABLE crawl_runs (
  id SERIAL PRIMARY KEY,
  store_id INTEGER NOT NULL REFERENCES stores(id),
  started_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
  completed_at TIMESTAMPTZ,
  status VARCHAR(20) NOT NULL DEFAULT 'running',
  products_found INTEGER,
  products_new INTEGER,
  products_updated INTEGER,
  error_message TEXT,

  -- Scheduling metadata: 'scheduled', 'manual', 'daily_specials'
  trigger_type VARCHAR(20) NOT NULL DEFAULT 'scheduled',

  CONSTRAINT chk_crawl_status CHECK (status IN ('running', 'completed', 'failed'))
);

CREATE INDEX idx_crawl_runs_store_time ON crawl_runs(store_id, started_at DESC);
```

### Data Correction Pattern

If data needs correction, don't UPDATE in place - insert a correction record:

```sql
CREATE TABLE data_corrections (
  id SERIAL PRIMARY KEY,
  table_name VARCHAR(50) NOT NULL,
  record_id INTEGER NOT NULL,
  field_name VARCHAR(100) NOT NULL,
  old_value JSONB,
  new_value JSONB,
  reason TEXT NOT NULL,
  corrected_by VARCHAR(100) NOT NULL,
  corrected_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
```
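One payoff of the append-only snapshot table is that price history falls out of a single query. A sketch, using a hypothetical helper name; column names follow the DDL above:

```typescript
// Price history for one product, read straight from the append-only snapshots.
// Assumes the same `db` helper used in the other examples in this document.
async function getPriceHistory(productId: number, days: number = 30) {
  const { rows } = await db.query(`
    SELECT captured_at, price_cents, sale_price_cents, is_on_special
    FROM store_product_snapshots
    WHERE product_id = $1
      AND captured_at > NOW() - make_interval(days => $2)
    ORDER BY captured_at ASC
  `, [productId, days]);
  return rows;  // one row per crawl that saw this product
}
```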
---

## 5. Safe Ingestion Patterns

### Upsert Products (Preserving History)

```typescript
async function upsertProduct(storeId: number, crawlRunId: number, product: ScrapedProduct) {
  // 1. Find or create product
  const existing = await db.query(
    `SELECT id, price, regular_price, sale_price
     FROM products
     WHERE store_id = $1 AND dutchie_product_id = $2`,
    [storeId, product.dutchieId]
  );

  let productId: number;

  if (existing.rows.length === 0) {
    // INSERT new product
    const result = await db.query(`
      INSERT INTO products (
        store_id, dutchie_product_id, name, slug,
        price, regular_price, sale_price,
        in_stock, first_seen_at, last_seen_at, status
      )
      VALUES ($1, $2, $3, $4, $5, $6, $7, TRUE, NOW(), NOW(), 'active')
      RETURNING id
    `, [storeId, product.dutchieId, product.name, product.slug,
        product.price, product.regularPrice, product.salePrice]);
    productId = result.rows[0].id;
  } else {
    // UPDATE existing - COALESCE keeps the prior value when the crawler
    // sends NULL, so existing data is never nulled out
    productId = existing.rows[0].id;
    await db.query(`
      UPDATE products
      SET name = COALESCE($2, name),
          price = COALESCE($3, price),
          regular_price = COALESCE($4, regular_price),
          sale_price = COALESCE($5, sale_price),
          in_stock = TRUE,
          last_seen_at = NOW(),
          status = 'active',
          updated_at = NOW()
      WHERE id = $1
    `, [productId, product.name, product.price, product.regularPrice, product.salePrice]);
  }

  // 2. Always create snapshot (append-only)
  const isOnSpecial = detectSpecial(product);
  const discountPercent = computeDiscount(product);

  await db.query(`
    INSERT INTO store_product_snapshots (
      store_id, product_id, crawl_run_id,
      price_cents, regular_price_cents, sale_price_cents,
      in_stock, is_on_special, special_type, discount_percent
    )
    VALUES ($1, $2, $3, $4, $5, $6, TRUE, $7, $8, $9)
    ON CONFLICT (product_id, crawl_run_id) DO NOTHING
  `, [
    storeId, productId, crawlRunId,
    toCents(product.price), toCents(product.regularPrice), toCents(product.salePrice),
    isOnSpecial, isOnSpecial ? 'percent_off' : null, discountPercent
  ]);

  return productId;
}

function detectSpecial(product: ScrapedProduct): boolean {
  // Check name for "Special Offer"
  if (product.name?.includes('Special Offer')) return true;

  // Check price discount
  if (product.salePrice && product.regularPrice) {
    return parseFloat(product.salePrice) < parseFloat(product.regularPrice);
  }
  return false;
}

function computeDiscount(product: ScrapedProduct): number | null {
  if (!product.salePrice || !product.regularPrice) return null;
  const sale = parseFloat(product.salePrice);
  const regular = parseFloat(product.regularPrice);
  if (regular <= 0) return null;
  return Math.round((1 - sale / regular) * 100);
}
```
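Tying sections 4 and 5 together, a sketch of a full ingestion pass for one store: open a `crawl_runs` row, upsert each scraped product, mark unseen products stale, then close the run. `ingestCrawl` is a hypothetical name; `ScrapedProduct` is whatever the frozen crawler emits, and the helpers are the ones defined above.

```typescript
async function ingestCrawl(storeId: number, products: ScrapedProduct[]) {
  // Open the run first so every snapshot can reference crawl_run_id
  const run = await db.query(
    `INSERT INTO crawl_runs (store_id, status, trigger_type)
     VALUES ($1, 'running', 'scheduled') RETURNING id`,
    [storeId]
  );
  const crawlRunId: number = run.rows[0].id;

  try {
    for (const product of products) {
      await upsertProduct(storeId, crawlRunId, product);
    }
    // Demote products this crawl didn't see; never delete them
    await markStaleProducts(storeId, crawlRunId);

    await db.query(
      `UPDATE crawl_runs
       SET status = 'completed', completed_at = NOW(), products_found = $2
       WHERE id = $1`,
      [crawlRunId, products.length]
    );
  } catch (err) {
    // Record the failure on the run row instead of losing it
    await db.query(
      `UPDATE crawl_runs
       SET status = 'failed', completed_at = NOW(), error_message = $2
       WHERE id = $1`,
      [crawlRunId, String(err)]
    );
    throw err;
  }
}
```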
---

## 6. K8s Deployment Configuration

### CronJobs Overview

```yaml
# All CronJobs for scheduling
apiVersion: v1
kind: List
items:
  # 1. Standard 4-hour crawl cycle
  - apiVersion: batch/v1
    kind: CronJob
    metadata:
      name: scraper-4h-00
      namespace: dispensary-scraper
    spec:
      schedule: "0 0,4,8,12,16,20 * * *"
      concurrencyPolicy: Forbid
      successfulJobsHistoryLimit: 3
      failedJobsHistoryLimit: 3
      jobTemplate:
        spec:
          activeDeadlineSeconds: 3600  # 1 hour timeout
          template:
            spec:
              containers:
                - name: scraper
                  image: code.cannabrands.app/creationshop/dispensary-scraper:latest
                  command: ["node", "dist/scripts/run-all-stores.js"]
                  resources:
                    requests:
                      memory: "512Mi"
                      cpu: "250m"
                    limits:
                      memory: "2Gi"
                      cpu: "1000m"
              restartPolicy: OnFailure

  # 2. Daily specials crawl - Arizona (MST, no DST)
  - apiVersion: batch/v1
    kind: CronJob
    metadata:
      name: scraper-daily-mst
      namespace: dispensary-scraper
    spec:
      schedule: "1 7 * * *"  # 12:01 AM MST = 07:01 UTC
      concurrencyPolicy: Forbid
      jobTemplate:
        spec:
          template:
            spec:
              containers:
                - name: scraper
                  image: code.cannabrands.app/creationshop/dispensary-scraper:latest
                  command: ["node", "dist/scripts/run-stores-by-timezone.js", "America/Phoenix"]
              restartPolicy: OnFailure

  # 3. Daily specials crawl - California (PST/PDT)
  - apiVersion: batch/v1
    kind: CronJob
    metadata:
      name: scraper-daily-pst
      namespace: dispensary-scraper
    spec:
      schedule: "1 8 * * *"  # 12:01 AM PST = 08:01 UTC (07:01 UTC during PDT)
      concurrencyPolicy: Forbid
      jobTemplate:
        spec:
          template:
            spec:
              containers:
                - name: scraper
                  image: code.cannabrands.app/creationshop/dispensary-scraper:latest
                  command: ["node", "dist/scripts/run-stores-by-timezone.js", "America/Los_Angeles"]
              restartPolicy: OnFailure
```

CronJob schedules are evaluated in UTC by default. If the cluster runs Kubernetes 1.27 or newer, setting `spec.timeZone: "America/Los_Angeles"` on the CronJob lets the cluster handle DST instead of hand-adjusting the UTC schedule.

### Monitoring and Alerts

```yaml
# PrometheusRule for scraper monitoring
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: scraper-alerts
  namespace: dispensary-scraper
spec:
  groups:
    - name: scraper.rules
      rules:
        - alert: ScraperJobFailed
          expr: kube_job_status_failed{namespace="dispensary-scraper"} > 0
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Scraper job failed"

        - alert: ScraperMissedSchedule
          expr: time() - kube_cronjob_status_last_successful_time{namespace="dispensary-scraper"} > 18000
          for: 10m
          labels:
            severity: critical
          annotations:
            summary: "Scraper hasn't run successfully in 5+ hours"
```

---

## 7. Summary

| Constraint | Implementation |
|------------|----------------|
| **Frozen Crawler** | No changes to selectors, parsing, or request logic |
| **4-Hour Schedule** | K8s CronJob at 0,4,8,12,16,20 UTC |
| **12:01 AM Specials** | Timezone-specific CronJobs for store local midnight |
| **Specials Detection** | API-layer detection via name pattern + price comparison |
| **Append-Only Data** | Never DELETE; use status flags; `store_product_snapshots` is INSERT-only |
| **Historical Preservation** | All crawls create snapshots; stale products marked, never deleted |

This design maximizes the value of crawler data without risking breakage from crawler modifications.