Fix category-crawler-jobs store lookup query
- Fix column name from s.dutchie_plus_url to s.dutchie_url
- Add availability tracking and product freshness APIs
- Add crawl script for sequential dispensary processing

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
docs/CRAWL_OPERATIONS.md (new file, 592 lines)

# Crawl Operations & Data Philosophy

This document defines the operational constraints, scheduling requirements, and data integrity philosophy for the dispensary scraper system.

---

## 1. Frozen Crawler Policy

> **CRITICAL CONSTRAINT**: The crawler code is FROZEN. Do NOT modify any crawler logic.

### What Is Frozen

The following components are read-only and must not be modified:

- **Selectors**: All CSS/XPath selectors for extracting data from Dutchie pages
- **Parsing Logic**: Functions that transform raw HTML into structured data
- **Request Patterns**: URL construction, pagination, API calls to Dutchie
- **Browser Configuration**: Puppeteer settings, user agents, viewport sizes
- **Rate Limiting**: Request delays, retry logic, concurrent request limits

### What CAN Be Modified

You may build around the crawler's output:

| Layer | Allowed Changes |
|-------|-----------------|
| **Scheduling** | CronJobs, run frequency, store queuing |
| **Ingestion** | Post-processing of crawler output before DB insert |
| **API Layer** | Query logic, computed fields, response transformations |
| **Intelligence** | Aggregation tables, metrics computation |
| **Infrastructure** | K8s resources, scaling, monitoring |

### Rationale

The crawler has been stabilized through extensive testing. Changes to selectors or parsing risk:

- Breaking data extraction if Dutchie changes their UI
- Introducing regressions that are hard to detect
- Requiring re-validation across all store types

All improvements must happen in **downstream processing**, not in the crawler itself.

---

## 2. Crawl Scheduling

### Standard Schedule: Every 4 Hours

Run a full crawl for each store every 4 hours, 24/7.

```yaml
# K8s CronJob: Every 4 hours
apiVersion: batch/v1
kind: CronJob
metadata:
  name: scraper-4h-cycle
  namespace: dispensary-scraper
spec:
  schedule: "0 */4 * * *"  # 00:00, 04:00, 08:00, 12:00, 16:00, 20:00 UTC
  concurrencyPolicy: Forbid
  jobTemplate:
    spec:
      template:
        spec:
          containers:
            - name: scraper
              image: code.cannabrands.app/creationshop/dispensary-scraper:latest
              command: ["node", "dist/scripts/run-all-stores.js"]
              env:
                - name: DATABASE_URL
                  valueFrom:
                    secretKeyRef:
                      name: scraper-secrets
                      key: database-url
          restartPolicy: OnFailure
```

### Daily Specials Crawl: 12:01 AM Store Local Time

Dispensaries often update their daily specials at midnight. We ensure a crawl happens at 12:01 AM in each store's local timezone.

```yaml
# K8s CronJob: Daily specials at store midnight (example for MST/Arizona)
apiVersion: batch/v1
kind: CronJob
metadata:
  name: scraper-daily-specials-mst
  namespace: dispensary-scraper
spec:
  schedule: "1 7 * * *"  # 12:01 AM MST = 07:01 UTC
  concurrencyPolicy: Forbid
  jobTemplate:
    spec:
      template:
        spec:
          containers:
            - name: scraper
              image: code.cannabrands.app/creationshop/dispensary-scraper:latest
              command: ["node", "dist/scripts/run-stores-by-timezone.js", "America/Phoenix"]
          restartPolicy: OnFailure
```

### Timezone-Aware Scheduling

Stores table includes timezone information:

```sql
ALTER TABLE stores ADD COLUMN IF NOT EXISTS timezone VARCHAR(50) DEFAULT 'America/Phoenix';

-- Lookup table for common dispensary timezones
-- America/Phoenix     (Arizona, no DST)
-- America/Los_Angeles (California)
-- America/Denver      (Colorado)
-- America/Chicago     (Illinois)
```

### Scripts Required

```
/backend/src/scripts/
├── run-all-stores.ts            # Run crawl for all enabled stores
├── run-stores-by-timezone.ts    # Run crawl for stores in a specific timezone
└── scheduler.ts                 # Orchestrates CronJob dispatch
```
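
A minimal sketch of what `run-stores-by-timezone.ts` could look like, assuming a `pg` connection pool, an `enabled` flag on the `stores` table, and a `crawlStore(storeId)` entry point exported by the existing (frozen) crawler; the import path, `crawlStore`, and `enabled` are illustrative assumptions, not the actual module API:

```typescript
// run-stores-by-timezone.ts (sketch)
// Usage: node dist/scripts/run-stores-by-timezone.js America/Phoenix
import { Pool } from 'pg';
import { crawlStore } from '../crawler'; // hypothetical entry point into the frozen crawler

const pool = new Pool({ connectionString: process.env.DATABASE_URL });

async function main() {
  const timezone = process.argv[2];
  if (!timezone) {
    console.error('Usage: run-stores-by-timezone.js <IANA timezone>');
    process.exit(1);
  }

  // Select enabled stores in the requested timezone
  const { rows: stores } = await pool.query(
    `SELECT id, store_key FROM stores WHERE enabled = TRUE AND timezone = $1`,
    [timezone]
  );

  // Crawl sequentially so the frozen crawler's own rate limiting stays in control
  for (const store of stores) {
    console.log(`Crawling ${store.store_key} (${timezone})`);
    await crawlStore(store.id);
  }

  await pool.end();
}

main().catch((err) => {
  console.error(err);
  process.exit(1);
});
```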

---

## 3. Specials Detection Logic

> **Problem**: The Specials tab in the frontend is EMPTY even though products have discounts.

### Root Cause Analysis

Database investigation reveals:

| Metric | Count |
|--------|-------|
| Total products | 1,414 |
| `is_special = true` | 0 |
| Has "Special Offer" in name | 325 |
| Has `sale_price < regular_price` | 4 |

The crawler captures "Special Offer" **embedded in the product name** but doesn't set `is_special = true`.
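
These counts can be re-checked at any time with a quick audit query (a sketch using the same `db` helper as the ingestion examples later in this document):

```typescript
// Reproduce the specials audit across all products
async function auditSpecials() {
  const { rows } = await db.query(`
    SELECT
      COUNT(*)                                                              AS total_products,
      COUNT(*) FILTER (WHERE is_special = TRUE)                             AS flagged_special,
      COUNT(*) FILTER (WHERE name ILIKE '%Special Offer%')                  AS special_in_name,
      COUNT(*) FILTER (WHERE sale_price::numeric < regular_price::numeric)  AS sale_below_regular
    FROM products
  `);
  return rows[0];
}
```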

### Solution: API-Layer Specials Detection

Since the crawler is frozen, detect specials at query time:

```sql
-- Computed is_on_special in API queries
SELECT
  p.*,
  CASE
    WHEN p.name ILIKE '%Special Offer%' THEN TRUE
    WHEN p.sale_price IS NOT NULL
         AND p.regular_price IS NOT NULL
         AND p.sale_price::numeric < p.regular_price::numeric THEN TRUE
    WHEN p.price IS NOT NULL
         AND p.original_price IS NOT NULL
         AND p.price::numeric < p.original_price::numeric THEN TRUE
    ELSE FALSE
  END AS is_on_special,

  -- Compute special type
  CASE
    WHEN p.name ILIKE '%Special Offer%' THEN 'special_offer'
    WHEN p.sale_price IS NOT NULL
         AND p.regular_price IS NOT NULL
         AND p.sale_price::numeric < p.regular_price::numeric THEN 'percent_off'
    ELSE NULL
  END AS computed_special_type,

  -- Compute discount percentage
  CASE
    WHEN p.sale_price IS NOT NULL
         AND p.regular_price IS NOT NULL
         AND p.regular_price::numeric > 0
      THEN ROUND((1 - p.sale_price::numeric / p.regular_price::numeric) * 100, 0)
    ELSE NULL
  END AS computed_discount_percent

FROM products p
WHERE p.store_id = :store_id;
```

### Special Detection Rules (Priority Order)

1. **Name Contains "Special Offer"**: `name ILIKE '%Special Offer%'`
   - Type: `special_offer`
   - Badge: "Special"

2. **Price Discount (sale < regular)**: `sale_price < regular_price`
   - Type: `percent_off`
   - Badge: Computed as "X% OFF" (see the badge helper sketch after this list)

3. **Price Discount (current < original)**: `price < original_price`
   - Type: `percent_off`
   - Badge: Computed as "X% OFF"

4. **Metadata Offers** (future): `metadata->'offers' IS NOT NULL`
   - Parse offer type from metadata JSON
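
A small helper mirroring this priority order for the badge text (a sketch; the `ProductRow` shape is an assumption based on the columns used elsewhere in this document, and rule 4 is deliberately not handled):

```typescript
interface ProductRow {
  name: string;
  sale_price?: string | null;
  regular_price?: string | null;
  price?: string | null;
  original_price?: string | null;
}

function specialBadge(p: ProductRow): string | null {
  // Rule 1: "Special Offer" embedded in the name
  if (/special offer/i.test(p.name)) return 'Special';

  // Rules 2 & 3: a discounted price pair, in priority order
  const pairs: Array<[string | null | undefined, string | null | undefined]> = [
    [p.sale_price, p.regular_price],
    [p.price, p.original_price],
  ];
  for (const [current, original] of pairs) {
    if (current == null || original == null) continue;
    const cur = parseFloat(current);
    const orig = parseFloat(original);
    if (orig > 0 && cur < orig) {
      return `${Math.round((1 - cur / orig) * 100)}% OFF`;
    }
  }

  // Rule 4 (metadata offers) is a future addition and not handled here
  return null;
}
```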

### Clean Product Name

Strip "Special Offer" from display name:

```typescript
function cleanProductName(rawName: string): string {
  return rawName
    .replace(/\s*Special Offer\s*$/i, '') // Drop a trailing "Special Offer" and surrounding whitespace
    .trim();
}
```
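
For example, `cleanProductName('Blue Dream 3.5g Special Offer')` returns `'Blue Dream 3.5g'`.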

### API Specials Endpoint

```typescript
// GET /api/stores/:store_key/specials
async function getStoreSpecials(storeKey: string, options: SpecialsOptions) {
  const query = `
    WITH specials AS (
      SELECT
        p.*,
        -- Detect special
        CASE
          WHEN p.name ILIKE '%Special Offer%' THEN TRUE
          WHEN p.sale_price::numeric < p.regular_price::numeric THEN TRUE
          ELSE FALSE
        END AS is_on_special,

        -- Compute discount
        CASE
          WHEN p.sale_price IS NOT NULL AND p.regular_price IS NOT NULL
            THEN ROUND((1 - p.sale_price::numeric / p.regular_price::numeric) * 100)
          ELSE NULL
        END AS discount_percent

      FROM products p
      JOIN stores s ON p.store_id = s.id
      WHERE s.store_key = $1
        AND p.in_stock = TRUE
    )
    SELECT * FROM specials
    WHERE is_on_special = TRUE
    ORDER BY discount_percent DESC NULLS LAST
    LIMIT $2 OFFSET $3
  `;

  return db.query(query, [storeKey, options.limit, options.offset]);
}
```
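
The `SpecialsOptions` type is not defined in this document; a minimal shape consistent with how the query uses it would be:

```typescript
interface SpecialsOptions {
  limit: number;   // mapped to LIMIT $2
  offset: number;  // mapped to OFFSET $3
}
```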

---

## 4. Append-Only Data Philosophy

> **Principle**: Every crawl should ADD information, never LOSE it.

### What Append-Only Means

| Action | Allowed | Not Allowed |
|--------|---------|-------------|
| Insert new product | ✅ | - |
| Update product price | ✅ | - |
| Mark product out-of-stock | ✅ | - |
| DELETE product row | ❌ | Never delete |
| TRUNCATE table | ❌ | Never truncate |
| UPDATE to remove data | ❌ | Never null-out existing data |
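
For example, the "mark product out-of-stock" case above is a state change that preserves the row (a sketch using the same `db` helper as the sections below):

```typescript
// Allowed: flip stock state without deleting or nulling anything
async function markOutOfStock(productId: number) {
  await db.query(`
    UPDATE products
    SET in_stock = FALSE,
        status = 'out_of_stock',
        updated_at = NOW()
    WHERE id = $1
  `, [productId]);
}
```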

### Product Lifecycle States

```sql
-- Products are never deleted, only state changes
ALTER TABLE products ADD COLUMN IF NOT EXISTS status VARCHAR(20) DEFAULT 'active';

-- Statuses:
-- 'active'       - Currently in stock or recently seen
-- 'out_of_stock' - Seen but marked out of stock
-- 'stale'        - Not seen in last 3 crawls (likely discontinued)
-- 'archived'     - Manually marked as discontinued

CREATE INDEX idx_products_status ON products(status);
```

### Marking Products Stale (NOT Deleting)

```typescript
// After crawl completes, mark unseen products as stale
async function markStaleProducts(storeId: number, crawlRunId: number) {
  await db.query(`
    UPDATE products
    SET
      status = 'stale',
      updated_at = NOW()
    WHERE store_id = $1
      AND id NOT IN (
        SELECT DISTINCT product_id
        FROM store_product_snapshots
        WHERE crawl_run_id = $2
      )
      AND status = 'active'
      AND last_seen_at < NOW() - INTERVAL '3 days'
  `, [storeId, crawlRunId]);
}
```

### Store Product Snapshots: True Append-Only

The `store_product_snapshots` table is strictly append-only:

```sql
CREATE TABLE store_product_snapshots (
  id SERIAL PRIMARY KEY,
  store_id INTEGER NOT NULL REFERENCES stores(id),
  product_id INTEGER NOT NULL REFERENCES products(id),
  crawl_run_id INTEGER NOT NULL REFERENCES crawl_runs(id),

  -- Snapshot of data at crawl time
  price_cents INTEGER,
  regular_price_cents INTEGER,
  sale_price_cents INTEGER,
  in_stock BOOLEAN NOT NULL,

  -- Computed at crawl time
  is_on_special BOOLEAN NOT NULL DEFAULT FALSE,
  special_type VARCHAR(50),
  discount_percent INTEGER,

  captured_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),

  -- Composite unique: one snapshot per product per crawl
  CONSTRAINT uq_snapshot_product_crawl UNIQUE (product_id, crawl_run_id)
);

-- NO UPDATE or DELETE triggers - this table is INSERT-only
-- For data corrections, insert a new snapshot with corrected flag

CREATE INDEX idx_snapshots_crawl ON store_product_snapshots(crawl_run_id);
CREATE INDEX idx_snapshots_product_time ON store_product_snapshots(product_id, captured_at DESC);
```
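
Because every crawl appends a snapshot, per-product price history falls out of a simple query over this table (a sketch; this helper is not part of the current API):

```typescript
// Price history for one product, most recent first
async function getPriceHistory(productId: number, limit = 50) {
  const { rows } = await db.query(`
    SELECT captured_at, price_cents, regular_price_cents, sale_price_cents,
           in_stock, is_on_special, discount_percent
    FROM store_product_snapshots
    WHERE product_id = $1
    ORDER BY captured_at DESC
    LIMIT $2
  `, [productId, limit]);
  return rows;
}
```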

### Crawl Runs Table

Track every crawl execution:

```sql
CREATE TABLE crawl_runs (
  id SERIAL PRIMARY KEY,
  store_id INTEGER NOT NULL REFERENCES stores(id),
  started_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
  completed_at TIMESTAMPTZ,
  status VARCHAR(20) NOT NULL DEFAULT 'running',
  products_found INTEGER,
  products_new INTEGER,
  products_updated INTEGER,
  error_message TEXT,

  -- Scheduling metadata
  trigger_type VARCHAR(20) NOT NULL DEFAULT 'scheduled',  -- 'scheduled', 'manual', 'daily_specials'

  CONSTRAINT chk_crawl_status CHECK (status IN ('running', 'completed', 'failed'))
);

CREATE INDEX idx_crawl_runs_store_time ON crawl_runs(store_id, started_at DESC);
```
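
Each execution is bracketed by a run record; a minimal sketch of how a run could be opened and closed around the ingestion loop (function names are illustrative):

```typescript
async function startCrawlRun(
  storeId: number,
  triggerType: 'scheduled' | 'manual' | 'daily_specials' = 'scheduled'
): Promise<number> {
  // status defaults to 'running', started_at to NOW()
  const { rows } = await db.query(`
    INSERT INTO crawl_runs (store_id, trigger_type)
    VALUES ($1, $2)
    RETURNING id
  `, [storeId, triggerType]);
  return rows[0].id;
}

async function completeCrawlRun(
  crawlRunId: number,
  stats: { found: number; created: number; updated: number }
): Promise<void> {
  await db.query(`
    UPDATE crawl_runs
    SET status = 'completed',
        completed_at = NOW(),
        products_found = $2,
        products_new = $3,
        products_updated = $4
    WHERE id = $1
  `, [crawlRunId, stats.found, stats.created, stats.updated]);
}
```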

### Data Correction Pattern

If data needs correction, don't UPDATE - insert a correction record:

```sql
CREATE TABLE data_corrections (
  id SERIAL PRIMARY KEY,
  table_name VARCHAR(50) NOT NULL,
  record_id INTEGER NOT NULL,
  field_name VARCHAR(100) NOT NULL,
  old_value JSONB,
  new_value JSONB,
  reason TEXT NOT NULL,
  corrected_by VARCHAR(100) NOT NULL,
  corrected_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
```
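
For example, correcting a mis-parsed price is recorded rather than silently overwritten (a sketch; the `corrected_by` value and field choice are illustrative):

```typescript
// Record a manual price correction instead of editing history in place
async function recordPriceCorrection(productId: number, oldPrice: string, newPrice: string, reason: string) {
  await db.query(`
    INSERT INTO data_corrections
      (table_name, record_id, field_name, old_value, new_value, reason, corrected_by)
    VALUES ($1, $2, $3, $4, $5, $6, $7)
  `, [
    'products',
    productId,
    'regular_price',
    JSON.stringify(oldPrice),  // old_value and new_value are JSONB
    JSON.stringify(newPrice),
    reason,
    'ops@example.com',         // illustrative corrected_by value
  ]);
}
```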

---

## 5. Safe Ingestion Patterns

### Upsert Products (Preserving History)

```typescript
async function upsertProduct(storeId: number, crawlRunId: number, product: ScrapedProduct) {
  // 1. Find or create product
  const existing = await db.query(
    `SELECT id, price, regular_price, sale_price FROM products
     WHERE store_id = $1 AND dutchie_product_id = $2`,
    [storeId, product.dutchieId]
  );

  let productId: number;

  if (existing.rows.length === 0) {
    // INSERT new product
    const result = await db.query(`
      INSERT INTO products (
        store_id, dutchie_product_id, name, slug, price, regular_price, sale_price,
        in_stock, first_seen_at, last_seen_at, status
      ) VALUES ($1, $2, $3, $4, $5, $6, $7, TRUE, NOW(), NOW(), 'active')
      RETURNING id
    `, [storeId, product.dutchieId, product.name, product.slug,
        product.price, product.regularPrice, product.salePrice]);
    productId = result.rows[0].id;
  } else {
    // UPDATE existing - only update if values changed, never null-out
    productId = existing.rows[0].id;
    await db.query(`
      UPDATE products SET
        name = COALESCE($2, name),
        price = COALESCE($3, price),
        regular_price = COALESCE($4, regular_price),
        sale_price = COALESCE($5, sale_price),
        in_stock = TRUE,
        last_seen_at = NOW(),
        status = 'active',
        updated_at = NOW()
      WHERE id = $1
    `, [productId, product.name, product.price, product.regularPrice, product.salePrice]);
  }

  // 2. Always create snapshot (append-only)
  const isOnSpecial = detectSpecial(product);
  const discountPercent = computeDiscount(product);

  await db.query(`
    INSERT INTO store_product_snapshots (
      store_id, product_id, crawl_run_id,
      price_cents, regular_price_cents, sale_price_cents,
      in_stock, is_on_special, special_type, discount_percent
    ) VALUES ($1, $2, $3, $4, $5, $6, TRUE, $7, $8, $9)
    ON CONFLICT (product_id, crawl_run_id) DO NOTHING
  `, [
    storeId, productId, crawlRunId,
    toCents(product.price), toCents(product.regularPrice), toCents(product.salePrice),
    isOnSpecial, isOnSpecial ? 'percent_off' : null, discountPercent
  ]);

  return productId;
}

function detectSpecial(product: ScrapedProduct): boolean {
  // Check name for "Special Offer"
  if (product.name?.includes('Special Offer')) return true;

  // Check price discount
  if (product.salePrice && product.regularPrice) {
    return parseFloat(product.salePrice) < parseFloat(product.regularPrice);
  }

  return false;
}

function computeDiscount(product: ScrapedProduct): number | null {
  if (!product.salePrice || !product.regularPrice) return null;

  const sale = parseFloat(product.salePrice);
  const regular = parseFloat(product.regularPrice);

  if (regular <= 0) return null;

  return Math.round((1 - sale / regular) * 100);
}
```
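
The `toCents` helper used above is not shown in this document; a minimal version consistent with the string prices used elsewhere would be:

```typescript
function toCents(price: string | number | null | undefined): number | null {
  if (price == null) return null;
  const value = typeof price === 'number' ? price : parseFloat(price);
  return Number.isNaN(value) ? null : Math.round(value * 100);
}
```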

---

## 6. K8s Deployment Configuration

### CronJobs Overview

```yaml
# All CronJobs for scheduling
apiVersion: v1
kind: List
items:
  # 1. Standard 4-hour crawl cycle
  - apiVersion: batch/v1
    kind: CronJob
    metadata:
      name: scraper-4h-00
      namespace: dispensary-scraper
    spec:
      schedule: "0 0,4,8,12,16,20 * * *"
      concurrencyPolicy: Forbid
      successfulJobsHistoryLimit: 3
      failedJobsHistoryLimit: 3
      jobTemplate:
        spec:
          activeDeadlineSeconds: 3600  # 1 hour timeout
          template:
            spec:
              containers:
                - name: scraper
                  image: code.cannabrands.app/creationshop/dispensary-scraper:latest
                  command: ["node", "dist/scripts/run-all-stores.js"]
                  resources:
                    requests:
                      memory: "512Mi"
                      cpu: "250m"
                    limits:
                      memory: "2Gi"
                      cpu: "1000m"
              restartPolicy: OnFailure

  # 2. Daily specials crawl - Arizona (MST, no DST)
  - apiVersion: batch/v1
    kind: CronJob
    metadata:
      name: scraper-daily-mst
      namespace: dispensary-scraper
    spec:
      schedule: "1 7 * * *"  # 12:01 AM MST = 07:01 UTC
      concurrencyPolicy: Forbid
      jobTemplate:
        spec:
          template:
            spec:
              containers:
                - name: scraper
                  image: code.cannabrands.app/creationshop/dispensary-scraper:latest
                  command: ["node", "dist/scripts/run-stores-by-timezone.js", "America/Phoenix"]
              restartPolicy: OnFailure

  # 3. Daily specials crawl - California (PST/PDT)
  - apiVersion: batch/v1
    kind: CronJob
    metadata:
      name: scraper-daily-pst
      namespace: dispensary-scraper
    spec:
      schedule: "1 8 * * *"  # 12:01 AM PST = 08:01 UTC (adjust for DST)
      concurrencyPolicy: Forbid
      jobTemplate:
        spec:
          template:
            spec:
              containers:
                - name: scraper
                  image: code.cannabrands.app/creationshop/dispensary-scraper:latest
                  command: ["node", "dist/scripts/run-stores-by-timezone.js", "America/Los_Angeles"]
              restartPolicy: OnFailure
```

### Monitoring and Alerts

```yaml
# PrometheusRule for scraper monitoring
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: scraper-alerts
  namespace: dispensary-scraper
spec:
  groups:
    - name: scraper.rules
      rules:
        - alert: ScraperJobFailed
          expr: kube_job_status_failed{namespace="dispensary-scraper"} > 0
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Scraper job failed"

        - alert: ScraperMissedSchedule
          expr: time() - kube_cronjob_status_last_successful_time{namespace="dispensary-scraper"} > 18000
          for: 10m
          labels:
            severity: critical
          annotations:
            summary: "Scraper hasn't run successfully in 5+ hours"
```

---

## 7. Summary

| Constraint | Implementation |
|------------|----------------|
| **Frozen Crawler** | No changes to selectors, parsing, or request logic |
| **4-Hour Schedule** | K8s CronJob at 0,4,8,12,16,20 UTC |
| **12:01 AM Specials** | Timezone-specific CronJobs for store local midnight |
| **Specials Detection** | API-layer detection via name pattern + price comparison |
| **Append-Only Data** | Never DELETE; use status flags; `store_product_snapshots` is INSERT-only |
| **Historical Preservation** | All crawls create snapshots; stale products marked, never deleted |

This design ensures we maximize the value of crawler data without risking breakage from crawler modifications.