Fix category-crawler-jobs store lookup query

- Fix column name from s.dutchie_plus_url to s.dutchie_url
- Add availability tracking and product freshness APIs
- Add crawl script for sequential dispensary processing

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Author: Kelly
Date: 2025-12-01 00:07:00 -07:00
Parent: 20a7b69537
Commit: 9d8972aa86
15 changed files with 11604 additions and 42 deletions

docs/CRAWL_OPERATIONS.md (new file, 592 lines)
# Crawl Operations & Data Philosophy
This document defines the operational constraints, scheduling requirements, and data integrity philosophy for the dispensary scraper system.
---
## 1. Frozen Crawler Policy
> **CRITICAL CONSTRAINT**: The crawler code is FROZEN. Do NOT modify any crawler logic.
### What Is Frozen
The following components are read-only and must not be modified:
- **Selectors**: All CSS/XPath selectors for extracting data from Dutchie pages
- **Parsing Logic**: Functions that transform raw HTML into structured data
- **Request Patterns**: URL construction, pagination, API calls to Dutchie
- **Browser Configuration**: Puppeteer settings, user agents, viewport sizes
- **Rate Limiting**: Request delays, retry logic, concurrent request limits
### What CAN Be Modified
You may build around the crawler's output:
| Layer | Allowed Changes |
|-------|-----------------|
| **Scheduling** | CronJobs, run frequency, store queuing |
| **Ingestion** | Post-processing of crawler output before DB insert |
| **API Layer** | Query logic, computed fields, response transformations |
| **Intelligence** | Aggregation tables, metrics computation |
| **Infrastructure** | K8s resources, scaling, monitoring |
### Rationale
The crawler has been stabilized through extensive testing. Changes to selectors or parsing risk:
- Breaking data extraction if Dutchie changes their UI
- Introducing regressions that are hard to detect
- Requiring re-validation across all store types
All improvements must happen in **downstream processing**, not in the crawler itself.
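As a concrete example of the downstream-only rule, derived fields are computed on top of the crawler's output rather than inside it. A minimal sketch, assuming a simplified `ScrapedProduct` shape (the real type ships with the frozen crawler) and an illustrative `enrichProduct` helper that does not exist in the repo:
```typescript
// Simplified stand-in for the crawler's output type (the real ScrapedProduct
// lives in the crawler code and is frozen along with it).
interface ScrapedProduct {
  dutchieId: string;
  name: string;
  price: string | null;
  regularPrice: string | null;
  salePrice: string | null;
}

// Downstream enrichment: derived fields are computed here, never by
// editing the crawler. `EnrichedProduct` is an illustrative name.
interface EnrichedProduct extends ScrapedProduct {
  isOnSpecial: boolean;
  cleanName: string;
}

function enrichProduct(raw: ScrapedProduct): EnrichedProduct {
  const onSale =
    raw.salePrice != null &&
    raw.regularPrice != null &&
    parseFloat(raw.salePrice) < parseFloat(raw.regularPrice);
  return {
    ...raw, // crawler output passes through untouched
    isOnSpecial: raw.name.includes('Special Offer') || onSale,
    cleanName: raw.name.replace(/Special Offer$/i, '').trim(),
  };
}
```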
---
## 2. Crawl Scheduling
### Standard Schedule: Every 4 Hours
Run a full crawl for each store every 4 hours, 24/7.
```yaml
# K8s CronJob: Every 4 hours
apiVersion: batch/v1
kind: CronJob
metadata:
  name: scraper-4h-cycle
  namespace: dispensary-scraper
spec:
  schedule: "0 */4 * * *" # 00:00, 04:00, 08:00, 12:00, 16:00, 20:00 UTC
  concurrencyPolicy: Forbid
  jobTemplate:
    spec:
      template:
        spec:
          containers:
            - name: scraper
              image: code.cannabrands.app/creationshop/dispensary-scraper:latest
              command: ["node", "dist/scripts/run-all-stores.js"]
              env:
                - name: DATABASE_URL
                  valueFrom:
                    secretKeyRef:
                      name: scraper-secrets
                      key: database-url
          restartPolicy: OnFailure
```
### Daily Specials Crawl: 12:01 AM Store Local Time
Dispensaries often update their daily specials at midnight. We ensure a crawl happens at 12:01 AM in each store's local timezone.
```yaml
# K8s CronJob: Daily specials at store midnight (example for MST/Arizona)
apiVersion: batch/v1
kind: CronJob
metadata:
  name: scraper-daily-specials-mst
  namespace: dispensary-scraper
spec:
  schedule: "1 7 * * *" # 12:01 AM MST = 07:01 UTC
  concurrencyPolicy: Forbid
  jobTemplate:
    spec:
      template:
        spec:
          containers:
            - name: scraper
              image: code.cannabrands.app/creationshop/dispensary-scraper:latest
              command: ["node", "dist/scripts/run-stores-by-timezone.js", "America/Phoenix"]
          restartPolicy: OnFailure
```
### Timezone-Aware Scheduling
The `stores` table includes timezone information:
```sql
ALTER TABLE stores ADD COLUMN IF NOT EXISTS timezone VARCHAR(50) DEFAULT 'America/Phoenix';
-- Lookup table for common dispensary timezones
-- America/Phoenix (Arizona, no DST)
-- America/Los_Angeles (California)
-- America/Denver (Colorado)
-- America/Chicago (Illinois)
```
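For reference, the UTC hour corresponding to a store's local 12:01 AM can be derived from its IANA timezone with the built-in `Intl` API; this is one way the per-timezone schedules could be generated. A sketch (the helper name is illustrative, and the result should be re-evaluated per date for DST zones):
```typescript
// Find the UTC hour at which a store's local clock reads 00:xx, so a
// per-timezone CronJob schedule like "1 7 * * *" can be generated.
// Uses only the built-in Intl API; utcHourForLocalMidnight is an
// illustrative helper, not an existing repo function.
function utcHourForLocalMidnight(timeZone: string, onDate = new Date()): number {
  const fmt = new Intl.DateTimeFormat('en-US', {
    timeZone,
    hour: '2-digit',
    hourCycle: 'h23', // force 00-23 so midnight formats as "00"
  });
  for (let utcHour = 0; utcHour < 24; utcHour++) {
    const probe = new Date(Date.UTC(
      onDate.getUTCFullYear(), onDate.getUTCMonth(), onDate.getUTCDate(),
      utcHour, 1, // minute 1 -> 12:01 AM local
    ));
    if (fmt.format(probe) === '00') return utcHour;
  }
  throw new Error(`No UTC hour maps to local midnight for ${timeZone}`);
}

// America/Phoenix (UTC-7, no DST) -> 7, matching schedule "1 7 * * *"
console.log(utcHourForLocalMidnight('America/Phoenix'));
```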
### Scripts Required
```
/backend/src/scripts/
├── run-all-stores.ts # Run crawl for all enabled stores
├── run-stores-by-timezone.ts # Run crawl for stores in a specific timezone
└── scheduler.ts # Orchestrates CronJob dispatch
```
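A sketch of what `run-stores-by-timezone.ts` amounts to, assuming the `stores` columns above plus an `enabled` flag; `db` and `crawlStore` stand in for the project's pool and the frozen crawler's entry point, so their exact signatures are assumptions:
```typescript
// run-stores-by-timezone.ts (sketch) - crawls every enabled store in one
// IANA timezone, sequentially so stores don't compete for browser resources.
async function runStoresByTimezone(timezone: string): Promise<void> {
  const { rows: stores } = await db.query(
    `SELECT id, store_key FROM stores
     WHERE enabled = TRUE AND timezone = $1
     ORDER BY id`,
    [timezone]
  );
  for (const store of stores) {
    try {
      await crawlStore(store.id); // frozen crawler invoked as-is
    } catch (err) {
      console.error(`Crawl failed for ${store.store_key}:`, err);
      // Continue with remaining stores rather than aborting the batch
    }
  }
}

runStoresByTimezone(process.argv[2] ?? 'America/Phoenix');
```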
---
## 3. Specials Detection Logic
> **Problem**: The Specials tab in the frontend is EMPTY even though products have discounts.
### Root Cause Analysis
Database investigation reveals:
| Metric | Count |
|--------|-------|
| Total products | 1,414 |
| `is_special = true` | 0 |
| Has "Special Offer" in name | 325 |
| Has `sale_price < regular_price` | 4 |
The crawler captures "Special Offer" **embedded in the product name** but doesn't set `is_special = true`.
### Solution: API-Layer Specials Detection
Since the crawler is frozen, detect specials at query time:
```sql
-- Computed is_on_special in API queries
SELECT
  p.*,
  CASE
    WHEN p.name ILIKE '%Special Offer%' THEN TRUE
    WHEN p.sale_price IS NOT NULL
     AND p.regular_price IS NOT NULL
     AND p.sale_price::numeric < p.regular_price::numeric THEN TRUE
    WHEN p.price IS NOT NULL
     AND p.original_price IS NOT NULL
     AND p.price::numeric < p.original_price::numeric THEN TRUE
    ELSE FALSE
  END AS is_on_special,
  -- Compute special type
  CASE
    WHEN p.name ILIKE '%Special Offer%' THEN 'special_offer'
    WHEN p.sale_price IS NOT NULL
     AND p.regular_price IS NOT NULL
     AND p.sale_price::numeric < p.regular_price::numeric THEN 'percent_off'
    ELSE NULL
  END AS computed_special_type,
  -- Compute discount percentage
  CASE
    WHEN p.sale_price IS NOT NULL
     AND p.regular_price IS NOT NULL
     AND p.regular_price::numeric > 0
      THEN ROUND((1 - p.sale_price::numeric / p.regular_price::numeric) * 100, 0)
    ELSE NULL
  END AS computed_discount_percent
FROM products p
WHERE p.store_id = :store_id;
```
### Special Detection Rules (Priority Order)
1. **Name Contains "Special Offer"**: `name ILIKE '%Special Offer%'`
- Type: `special_offer`
- Badge: "Special"
2. **Price Discount (sale < regular)**: `sale_price < regular_price`
- Type: `percent_off`
- Badge: Computed as "X% OFF"
3. **Price Discount (current < original)**: `price < original_price`
- Type: `percent_off`
- Badge: Computed as "X% OFF"
4. **Metadata Offers** (future): `metadata->'offers' IS NOT NULL`
- Parse offer type from metadata JSON
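The badge column above reduces to a small display helper; a sketch, where the `DetectedSpecial` shape is illustrative rather than an existing repo type:
```typescript
// Maps the detected special type onto the badge text from the rules above.
interface DetectedSpecial {
  specialType: 'special_offer' | 'percent_off' | null;
  discountPercent: number | null;
}

function specialBadge(s: DetectedSpecial): string | null {
  switch (s.specialType) {
    case 'special_offer':
      return 'Special';
    case 'percent_off':
      // Fall back to the generic badge when no percentage could be computed
      return s.discountPercent != null ? `${s.discountPercent}% OFF` : 'Special';
    default:
      return null; // Not on special - no badge
  }
}
```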
### Clean Product Name
Strip "Special Offer" from display name:
```typescript
function cleanProductName(rawName: string): string {
  return rawName
    .replace(/Special Offer$/i, '') // Drop the marketing suffix
    .trim();                        // Trim surrounding whitespace
}
```
### API Specials Endpoint
```typescript
// GET /api/stores/:store_key/specials
async function getStoreSpecials(storeKey: string, options: SpecialsOptions) {
  const query = `
    WITH specials AS (
      SELECT
        p.*,
        -- Detect special
        CASE
          WHEN p.name ILIKE '%Special Offer%' THEN TRUE
          WHEN p.sale_price::numeric < p.regular_price::numeric THEN TRUE
          ELSE FALSE
        END AS is_on_special,
        -- Compute discount
        CASE
          WHEN p.sale_price IS NOT NULL AND p.regular_price IS NOT NULL
            THEN ROUND((1 - p.sale_price::numeric / p.regular_price::numeric) * 100)
          ELSE NULL
        END AS discount_percent
      FROM products p
      JOIN stores s ON p.store_id = s.id
      WHERE s.store_key = $1
        AND p.in_stock = TRUE
    )
    SELECT * FROM specials
    WHERE is_on_special = TRUE
    ORDER BY discount_percent DESC NULLS LAST
    LIMIT $2 OFFSET $3
  `;
  return db.query(query, [storeKey, options.limit, options.offset]);
}
```
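Wiring this into the HTTP layer is a thin wrapper around `getStoreSpecials`; a sketch assuming an Express-style router (the registration below is illustrative, not the repo's actual server code):
```typescript
import express from 'express';

const router = express.Router();

// GET /api/stores/:store_key/specials?limit=50&offset=0
router.get('/api/stores/:store_key/specials', async (req, res) => {
  try {
    // Clamp pagination inputs so a bad query can't scan the whole table
    const limit = Math.min(parseInt(String(req.query.limit ?? '50'), 10), 100);
    const offset = parseInt(String(req.query.offset ?? '0'), 10);
    const result = await getStoreSpecials(req.params.store_key, { limit, offset });
    res.json({ specials: result.rows, limit, offset });
  } catch (err) {
    console.error('specials endpoint failed:', err);
    res.status(500).json({ error: 'internal_error' });
  }
});
```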
---
## 4. Append-Only Data Philosophy
> **Principle**: Every crawl should ADD information, never LOSE it.
### What Append-Only Means
| Action | Allowed | Not Allowed |
|--------|---------|-------------|
| Insert new product | ✅ | - |
| Update product price | ✅ | - |
| Mark product out-of-stock | ✅ | - |
| DELETE product row | ❌ | Never delete |
| TRUNCATE table | ❌ | Never truncate |
| UPDATE to remove data | ❌ | Never null-out existing data |
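One way to make the policy hard to violate from application code is to expose only the allowed operations; a sketch of a hypothetical repository wrapper (not existing repo code):
```typescript
interface SnapshotRow {
  storeId: number;
  productId: number;
  crawlRunId: number;
  priceCents: number | null;
  inStock: boolean;
}

// Append-only guard at the application layer: the wrapper exposes inserts,
// but no delete/truncate surface at all. `db` is a stand-in for the pool.
class SnapshotRepository {
  constructor(private db: { query: (sql: string, params?: unknown[]) => Promise<unknown> }) {}

  // Allowed: insert a new snapshot row
  insert(row: SnapshotRow): Promise<unknown> {
    return this.db.query(
      `INSERT INTO store_product_snapshots
         (store_id, product_id, crawl_run_id, price_cents, in_stock)
       VALUES ($1, $2, $3, $4, $5)
       ON CONFLICT (product_id, crawl_run_id) DO NOTHING`,
      [row.storeId, row.productId, row.crawlRunId, row.priceCents, row.inStock]
    );
  }
  // Deliberately no delete(), truncate(), or raw update() methods.
}
```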
### Product Lifecycle States
```sql
-- Products are never deleted, only state changes
ALTER TABLE products ADD COLUMN IF NOT EXISTS status VARCHAR(20) DEFAULT 'active';
-- Statuses:
--   'active'       - Currently in stock or recently seen
--   'out_of_stock' - Seen but marked out of stock
--   'stale'        - Not seen recently (3+ days / several crawls; likely discontinued)
--   'archived'     - Manually marked as discontinued
CREATE INDEX idx_products_status ON products(status);
```
### Marking Products Stale (NOT Deleting)
```typescript
// After crawl completes, mark unseen products as stale
async function markStaleProducts(storeId: number, crawlRunId: number) {
  await db.query(`
    UPDATE products
    SET
      status = 'stale',
      updated_at = NOW()
    WHERE store_id = $1
      AND id NOT IN (
        SELECT DISTINCT product_id
        FROM store_product_snapshots
        WHERE crawl_run_id = $2
      )
      AND status = 'active'
      AND last_seen_at < NOW() - INTERVAL '3 days'
  `, [storeId, crawlRunId]);
}
```
### Store Product Snapshots: True Append-Only
The `store_product_snapshots` table is strictly append-only:
```sql
CREATE TABLE store_product_snapshots (
  id SERIAL PRIMARY KEY,
  store_id INTEGER NOT NULL REFERENCES stores(id),
  product_id INTEGER NOT NULL REFERENCES products(id),
  crawl_run_id INTEGER NOT NULL REFERENCES crawl_runs(id),
  -- Snapshot of data at crawl time
  price_cents INTEGER,
  regular_price_cents INTEGER,
  sale_price_cents INTEGER,
  in_stock BOOLEAN NOT NULL,
  -- Computed at crawl time
  is_on_special BOOLEAN NOT NULL DEFAULT FALSE,
  special_type VARCHAR(50),
  discount_percent INTEGER,
  captured_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
  -- Composite unique: one snapshot per product per crawl
  CONSTRAINT uq_snapshot_product_crawl UNIQUE (product_id, crawl_run_id)
);
-- No UPDATEs or DELETEs are ever issued against this table - it is INSERT-only.
-- For data corrections, insert a new snapshot with a corrected flag.
CREATE INDEX idx_snapshots_crawl ON store_product_snapshots(crawl_run_id);
CREATE INDEX idx_snapshots_product_time ON store_product_snapshots(product_id, captured_at DESC);
```
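Because snapshots are never rewritten, price history is a straight indexed read; a sketch with an illustrative helper name:
```typescript
// Price history for one product, newest first - served entirely from the
// INSERT-only snapshots table via idx_snapshots_product_time.
async function getPriceHistory(productId: number, limit = 30) {
  const { rows } = await db.query(
    `SELECT price_cents, sale_price_cents, in_stock, is_on_special, captured_at
     FROM store_product_snapshots
     WHERE product_id = $1
     ORDER BY captured_at DESC
     LIMIT $2`,
    [productId, limit]
  );
  return rows;
}
```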
### Crawl Runs Table
Track every crawl execution:
```sql
CREATE TABLE crawl_runs (
  id SERIAL PRIMARY KEY,
  store_id INTEGER NOT NULL REFERENCES stores(id),
  started_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
  completed_at TIMESTAMPTZ,
  status VARCHAR(20) NOT NULL DEFAULT 'running',
  products_found INTEGER,
  products_new INTEGER,
  products_updated INTEGER,
  error_message TEXT,
  -- Scheduling metadata
  trigger_type VARCHAR(20) NOT NULL DEFAULT 'scheduled', -- 'scheduled', 'manual', 'daily_specials'
  CONSTRAINT chk_crawl_status CHECK (status IN ('running', 'completed', 'failed'))
);
CREATE INDEX idx_crawl_runs_store_time ON crawl_runs(store_id, started_at DESC);
```
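Each crawl brackets its work with a row here: open as `running`, close as `completed` or `failed`, then run the stale-marking pass from above. A sketch, where `crawlAllProducts` stands in for the frozen crawler's entry point and `upsertProduct` is the ingestion function from section 5 below:
```typescript
// Crawl run lifecycle around one store's crawl.
async function runCrawl(storeId: number, triggerType = 'scheduled'): Promise<void> {
  const { rows } = await db.query(
    `INSERT INTO crawl_runs (store_id, trigger_type) VALUES ($1, $2) RETURNING id`,
    [storeId, triggerType]
  );
  const crawlRunId: number = rows[0].id;
  try {
    const products = await crawlAllProducts(storeId); // frozen crawler, as-is
    for (const product of products) {
      await upsertProduct(storeId, crawlRunId, product);
    }
    await db.query(
      `UPDATE crawl_runs
       SET status = 'completed', completed_at = NOW(), products_found = $2
       WHERE id = $1`,
      [crawlRunId, products.length]
    );
    await markStaleProducts(storeId, crawlRunId); // mark, never delete
  } catch (err) {
    await db.query(
      `UPDATE crawl_runs
       SET status = 'failed', completed_at = NOW(), error_message = $2
       WHERE id = $1`,
      [crawlRunId, String(err)]
    );
    throw err;
  }
}
```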
### Data Correction Pattern
If data needs correction, don't UPDATE - insert a correction record:
```sql
CREATE TABLE data_corrections (
  id SERIAL PRIMARY KEY,
  table_name VARCHAR(50) NOT NULL,
  record_id INTEGER NOT NULL,
  field_name VARCHAR(100) NOT NULL,
  old_value JSONB,
  new_value JSONB,
  reason TEXT NOT NULL,
  corrected_by VARCHAR(100) NOT NULL,
  corrected_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
```
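Recording a correction is then a plain insert into the audit table; a sketch with an illustrative helper name:
```typescript
// Record a correction in the audit table instead of silently rewriting history.
async function recordCorrection(opts: {
  tableName: string;
  recordId: number;
  fieldName: string;
  oldValue: unknown;
  newValue: unknown;
  reason: string;
  correctedBy: string;
}): Promise<void> {
  await db.query(
    `INSERT INTO data_corrections
       (table_name, record_id, field_name, old_value, new_value, reason, corrected_by)
     VALUES ($1, $2, $3, $4, $5, $6, $7)`,
    [opts.tableName, opts.recordId, opts.fieldName,
     JSON.stringify(opts.oldValue), JSON.stringify(opts.newValue),
     opts.reason, opts.correctedBy]
  );
}
```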
---
## 5. Safe Ingestion Patterns
### Upsert Products (Preserving History)
```typescript
async function upsertProduct(storeId: number, crawlRunId: number, product: ScrapedProduct) {
  // 1. Find or create product
  const existing = await db.query(
    `SELECT id, price, regular_price, sale_price FROM products
     WHERE store_id = $1 AND dutchie_product_id = $2`,
    [storeId, product.dutchieId]
  );
  let productId: number;
  if (existing.rows.length === 0) {
    // INSERT new product
    const result = await db.query(`
      INSERT INTO products (
        store_id, dutchie_product_id, name, slug, price, regular_price, sale_price,
        in_stock, first_seen_at, last_seen_at, status
      ) VALUES ($1, $2, $3, $4, $5, $6, $7, TRUE, NOW(), NOW(), 'active')
      RETURNING id
    `, [storeId, product.dutchieId, product.name, product.slug,
        product.price, product.regularPrice, product.salePrice]);
    productId = result.rows[0].id;
  } else {
    // UPDATE existing - COALESCE keeps prior values when the crawl
    // returns null, so existing data is never nulled out
    productId = existing.rows[0].id;
    await db.query(`
      UPDATE products SET
        name = COALESCE($2, name),
        price = COALESCE($3, price),
        regular_price = COALESCE($4, regular_price),
        sale_price = COALESCE($5, sale_price),
        in_stock = TRUE,
        last_seen_at = NOW(),
        status = 'active',
        updated_at = NOW()
      WHERE id = $1
    `, [productId, product.name, product.price, product.regularPrice, product.salePrice]);
  }
  // 2. Always create snapshot (append-only)
  const isOnSpecial = detectSpecial(product);
  const specialType = detectSpecialType(product);
  const discountPercent = computeDiscount(product);
  await db.query(`
    INSERT INTO store_product_snapshots (
      store_id, product_id, crawl_run_id,
      price_cents, regular_price_cents, sale_price_cents,
      in_stock, is_on_special, special_type, discount_percent
    ) VALUES ($1, $2, $3, $4, $5, $6, TRUE, $7, $8, $9)
    ON CONFLICT (product_id, crawl_run_id) DO NOTHING
  `, [
    storeId, productId, crawlRunId,
    toCents(product.price), toCents(product.regularPrice), toCents(product.salePrice),
    isOnSpecial, specialType, discountPercent
  ]);
  return productId;
}

function detectSpecial(product: ScrapedProduct): boolean {
  // Check name for "Special Offer"
  if (product.name?.includes('Special Offer')) return true;
  // Check price discount
  if (product.salePrice && product.regularPrice) {
    return parseFloat(product.salePrice) < parseFloat(product.regularPrice);
  }
  return false;
}

function detectSpecialType(product: ScrapedProduct): string | null {
  // Follows the priority order from section 3: name match wins over price discount
  if (product.name?.includes('Special Offer')) return 'special_offer';
  if (detectSpecial(product)) return 'percent_off';
  return null;
}

function computeDiscount(product: ScrapedProduct): number | null {
  if (!product.salePrice || !product.regularPrice) return null;
  const sale = parseFloat(product.salePrice);
  const regular = parseFloat(product.regularPrice);
  if (regular <= 0) return null;
  return Math.round((1 - sale / regular) * 100);
}

function toCents(price: string | null | undefined): number | null {
  // Store prices as integer cents; null stays null rather than becoming 0
  if (price == null) return null;
  const n = parseFloat(price);
  return Number.isNaN(n) ? null : Math.round(n * 100);
}
```
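A quick check of the helpers on a representative row (values invented for illustration):
```typescript
// Worked example: "Special Offer" in the name plus a $50 -> $40 price cut.
const sample = {
  dutchieId: 'abc123', // illustrative values, not real data
  name: 'Blue Dream 3.5g Special Offer',
  slug: 'blue-dream-3-5g',
  price: '40.00',
  regularPrice: '50.00',
  salePrice: '40.00',
};

console.log(detectSpecial(sample as ScrapedProduct));     // true - name match
console.log(detectSpecialType(sample as ScrapedProduct)); // 'special_offer' - name wins over price
console.log(computeDiscount(sample as ScrapedProduct));   // 20 - (1 - 40/50) * 100
```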
---
## 6. K8s Deployment Configuration
### CronJobs Overview
```yaml
# All CronJobs for scheduling
apiVersion: v1
kind: List
items:
  # 1. Standard 4-hour crawl cycle
  - apiVersion: batch/v1
    kind: CronJob
    metadata:
      name: scraper-4h-00
      namespace: dispensary-scraper
    spec:
      schedule: "0 0,4,8,12,16,20 * * *"
      concurrencyPolicy: Forbid
      successfulJobsHistoryLimit: 3
      failedJobsHistoryLimit: 3
      jobTemplate:
        spec:
          activeDeadlineSeconds: 3600 # 1 hour timeout
          template:
            spec:
              containers:
                - name: scraper
                  image: code.cannabrands.app/creationshop/dispensary-scraper:latest
                  command: ["node", "dist/scripts/run-all-stores.js"]
                  resources:
                    requests:
                      memory: "512Mi"
                      cpu: "250m"
                    limits:
                      memory: "2Gi"
                      cpu: "1000m"
              restartPolicy: OnFailure
  # 2. Daily specials crawl - Arizona (MST, no DST)
  - apiVersion: batch/v1
    kind: CronJob
    metadata:
      name: scraper-daily-mst
      namespace: dispensary-scraper
    spec:
      schedule: "1 7 * * *" # 12:01 AM MST = 07:01 UTC
      concurrencyPolicy: Forbid
      jobTemplate:
        spec:
          template:
            spec:
              containers:
                - name: scraper
                  command: ["node", "dist/scripts/run-stores-by-timezone.js", "America/Phoenix"]
  # 3. Daily specials crawl - California (PST/PDT)
  - apiVersion: batch/v1
    kind: CronJob
    metadata:
      name: scraper-daily-pst
      namespace: dispensary-scraper
    spec:
      schedule: "1 8 * * *" # 12:01 AM PST = 08:01 UTC (adjust for DST)
      concurrencyPolicy: Forbid
      jobTemplate:
        spec:
          template:
            spec:
              containers:
                - name: scraper
                  command: ["node", "dist/scripts/run-stores-by-timezone.js", "America/Los_Angeles"]
```
### Monitoring and Alerts
```yaml
# PrometheusRule for scraper monitoring
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: scraper-alerts
  namespace: dispensary-scraper
spec:
  groups:
    - name: scraper.rules
      rules:
        - alert: ScraperJobFailed
          expr: kube_job_status_failed{namespace="dispensary-scraper"} > 0
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Scraper job failed"
        - alert: ScraperMissedSchedule
          expr: time() - kube_cronjob_status_last_successful_time{namespace="dispensary-scraper"} > 18000
          for: 10m
          labels:
            severity: critical
          annotations:
            summary: "Scraper hasn't run successfully in 5+ hours"
```
---
## 7. Summary
| Constraint | Implementation |
|------------|----------------|
| **Frozen Crawler** | No changes to selectors, parsing, or request logic |
| **4-Hour Schedule** | K8s CronJob at 0,4,8,12,16,20 UTC |
| **12:01 AM Specials** | Timezone-specific CronJobs for store local midnight |
| **Specials Detection** | API-layer detection via name pattern + price comparison |
| **Append-Only Data** | Never DELETE; use status flags; `store_product_snapshots` is INSERT-only |
| **Historical Preservation** | All crawls create snapshots; stale products marked, never deleted |
This design ensures we maximize the value of crawler data without risking breakage from crawler modifications.