Dutchie AZ Pipeline
Overview
The Dutchie AZ pipeline is the only authorized way to crawl Dutchie dispensary menus. It uses Dutchie's GraphQL API directly (no DOM scraping) and writes to its own isolated tables using an append-only snapshot model.
Key Principles
- GraphQL Only - All Dutchie data is fetched via their FilteredProducts GraphQL API
- Isolated Database - Data lives in dutchie_az_* tables, NOT the legacy products table
- Append-Only Snapshots - Every crawl creates snapshots, never overwrites historical data
- Stock Status Tracking - Derived from POSMetaData.children inventory data
- Missing Product Detection - Products not in feed are marked with isPresentInFeed=false
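The missing-product principle can be sketched in a few lines. This is only an illustration, assuming each known product carries a stable ID and the crawler holds the list of IDs returned by the latest feed; `KnownProduct`, `externalId`, and `markPresence` are hypothetical names, not taken from the module.

```typescript
// Hypothetical shape: only isPresentInFeed is named in this document.
interface KnownProduct {
  externalId: string; // assumed stable product identifier
  isPresentInFeed: boolean;
}

// Returns new records with isPresentInFeed recomputed against the latest feed.
// Products are never deleted; absent ones are only flagged.
function markPresence(known: KnownProduct[], feedIds: string[]): KnownProduct[] {
  const inFeed = new Set(feedIds);
  return known.map((p) => ({ ...p, isPresentInFeed: inFeed.has(p.externalId) }));
}
```

Flagging instead of deleting keeps the product row available for historical snapshot queries.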
Directory Structure
src/dutchie-az/
├── db/
│ ├── connection.ts # Database connection pool
│ └── schema.ts # Table definitions and migrations
├── routes/
│ └── index.ts # REST API endpoints
├── services/
│ ├── graphql-client.ts # Direct GraphQL fetch (Mode A + Mode B)
│ ├── product-crawler.ts # Main crawler orchestration
│ └── scheduler.ts # Jittered scheduling with wandering intervals
└── types/
└── index.ts # TypeScript interfaces
Data Model
Tables
- dispensaries - Arizona Dutchie stores with platform_dispensary_id
- dutchie_products - Canonical product identity (one row per product per store)
- dutchie_product_snapshots - Historical state per crawl (append-only)
- job_schedules - Scheduler configuration with jitter support
- job_run_logs - Execution history
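To make the append-only relationship between products and snapshots concrete, here is a minimal in-memory sketch. The column names are the ones used by the queries under Manual Testing; the numeric ID type and the `SnapshotLog` class are assumptions for illustration, not the real schema.

```typescript
type StockStatus = 'in_stock' | 'out_of_stock' | 'unknown';

// Subset of dutchie_product_snapshots columns; other columns are omitted.
interface SnapshotRow {
  dutchie_product_id: number; // assumed numeric key
  stock_status: StockStatus;
  is_present_in_feed: boolean;
  crawled_at: Date;
}

// In-memory stand-in for the snapshots table (hypothetical).
class SnapshotLog {
  private readonly rows: SnapshotRow[] = [];

  // Append-only: each crawl adds a row; existing rows are never updated.
  append(row: SnapshotRow): void {
    this.rows.push({ ...row });
  }

  // History for one product, newest first.
  historyFor(productId: number): SnapshotRow[] {
    return this.rows
      .filter((r) => r.dutchie_product_id === productId)
      .sort((a, b) => b.crawled_at.getTime() - a.crawled_at.getTime());
  }
}
```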
Stock Status
The stock_status field is derived from POSMetaData.children:
function deriveStockStatus(children?: POSChild[]): StockStatus {
  if (!children || children.length === 0) return 'unknown';
  const totalAvailable = children.reduce((sum, c) =>
    sum + (c.quantityAvailable || 0), 0);
  return totalAvailable > 0 ? 'in_stock' : 'out_of_stock';
}
Two-Mode Crawling
Mode A (UI Parity):
- Status: null - Returns what the UI shows
- Best for "current inventory" snapshot
Mode B (Max Coverage):
- Status: 'Active' - Returns all active products
- Catches items with isBelowThreshold: true
Both modes are merged to get maximum product coverage.
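One way the merge could work is de-duplication by product ID. The sketch below assumes that when a product appears in both result sets the Mode A copy is preferred; that precedence, the `FeedProduct` shape, and the `mergeModes` name are all assumptions, not the crawler's confirmed behavior.

```typescript
// Hypothetical minimal product shape; the real GraphQL payload has many more fields.
interface FeedProduct {
  id: string;
  name: string;
}

// Merges Mode A and Mode B results, de-duplicating by id.
// Assumption: when a product appears in both, the Mode A (UI parity) copy wins.
function mergeModes(modeA: FeedProduct[], modeB: FeedProduct[]): FeedProduct[] {
  const byId = new Map<string, FeedProduct>();
  for (const p of modeB) byId.set(p.id, p);
  for (const p of modeA) byId.set(p.id, p); // overwrite with the Mode A copy
  return Array.from(byId.values());
}
```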
API Endpoints
All endpoints are mounted at /api/dutchie-az/:
GET /api/dutchie-az/dispensaries - List all dispensaries
GET /api/dutchie-az/dispensaries/:id - Get dispensary details
GET /api/dutchie-az/products - List products (with filters)
GET /api/dutchie-az/products/:id - Get product with snapshots
GET /api/dutchie-az/products/:id/snapshots - Get product snapshot history
POST /api/dutchie-az/crawl/:dispensaryId - Trigger manual crawl
GET /api/dutchie-az/schedule - Get scheduler status
POST /api/dutchie-az/schedule/run - Manually run scheduled jobs
GET /api/dutchie-az/stats - Dashboard statistics
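As a rough client-side illustration, the helper below builds the snapshot-history URL and delegates the HTTP call to an injected function. The response shape is not specified in this document, so it is left as `unknown`; `GetJson`, `snapshotsUrl`, and `getSnapshotHistory` are hypothetical names.

```typescript
// Injected HTTP helper so the sketch stays runtime-agnostic (hypothetical type).
type GetJson = (url: string) => Promise<unknown>;

// Builds the snapshot-history URL for one product under the documented mount point.
function snapshotsUrl(baseUrl: string, productId: number): string {
  const root = baseUrl.replace(/\/+$/, ''); // drop trailing slashes
  return `${root}/api/dutchie-az/products/${productId}/snapshots`;
}

// Fetches snapshot history; the payload type is unknown because it is not documented here.
async function getSnapshotHistory(
  getJson: GetJson,
  baseUrl: string,
  productId: number,
): Promise<unknown> {
  return getJson(snapshotsUrl(baseUrl, productId));
}
```

On Node 18 or later, `getJson` could be as simple as `(url) => fetch(url).then((r) => r.json())`.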
Scheduler
The scheduler uses jitter to avoid detection patterns:
// Each job has independent "wandering" timing
interface JobSchedule {
  base_interval_minutes: number; // e.g., 240 (4 hours)
  jitter_minutes: number;        // e.g., 30 (±30 min)
  next_run_at: Date;             // Calculated with jitter after each run
}
Jobs run when next_run_at <= NOW(). After completion, the next run is calculated:
next_run_at = NOW() + base_interval + random(-jitter, +jitter)
This prevents crawls from clustering at predictable times.
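The formula above can be sketched as follows. This assumes the jitter is drawn uniformly; the function names and the injectable `rand` parameter are illustrative and not taken from scheduler.ts.

```typescript
interface JobSchedule {
  base_interval_minutes: number;
  jitter_minutes: number;
  next_run_at: Date;
}

// Computes the next run time: now + base interval + a uniform offset in [-jitter, +jitter).
// `rand` is injectable for testing and defaults to Math.random (range [0, 1)).
function computeNextRunAt(
  job: JobSchedule,
  now: Date = new Date(),
  rand: () => number = Math.random,
): Date {
  const offsetMinutes = (rand() * 2 - 1) * job.jitter_minutes;
  const totalMs = (job.base_interval_minutes + offsetMinutes) * 60_000;
  return new Date(now.getTime() + totalMs);
}

// A job is due once its next_run_at is at or before the current time.
function isDue(job: JobSchedule, now: Date = new Date()): boolean {
  return job.next_run_at.getTime() <= now.getTime();
}
```

With a 240-minute base and 30-minute jitter, each run lands somewhere between 210 and 270 minutes after the previous one, so the schedule drifts instead of repeating at fixed clock times.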
Manual Testing
Run a single dispensary crawl:
DATABASE_URL="..." npx tsx -e "
const { crawlDispensaryProducts } = require('./src/dutchie-az/services/product-crawler');
const { query } = require('./src/dutchie-az/db/connection');
async function test() {
  const { rows } = await query('SELECT * FROM dispensaries LIMIT 1');
  // Length check avoids the shell history-expansion character inside double quotes
  if (rows.length === 0) return console.log('No dispensaries found');
  const result = await crawlDispensaryProducts(rows[0], 'rec', { useBothModes: true });
  console.log(JSON.stringify(result, null, 2));
}
// Surface failures instead of leaving an unhandled rejection
test().catch((err) => { console.error(err); process.exitCode = 1; });
"
Check stock status distribution:
SELECT stock_status, COUNT(*)
FROM dutchie_products
GROUP BY stock_status;
View recent snapshots:
SELECT
  p.name,
  s.stock_status,
  s.is_present_in_feed,
  s.crawled_at
FROM dutchie_product_snapshots s
JOIN dutchie_products p ON p.id = s.dutchie_product_id
ORDER BY s.crawled_at DESC
LIMIT 20;
Deprecated Code
The following files are DEPRECATED and will throw errors if called:
- src/scrapers/dutchie-graphql.ts - Wrote to legacy products table
- src/scrapers/dutchie-graphql-direct.ts - Wrote to legacy products table
- src/scrapers/templates/dutchie.ts - HTML/DOM scraper (unreliable)
- src/scraper-v2/engine.ts DutchieSpider - DOM-based extraction
If store-crawl-orchestrator.ts detects provider='dutchie' with mode='production', it now routes to this dutchie-az pipeline automatically.
Integration with Legacy System
The store-crawl-orchestrator.ts bridges the legacy stores system with dutchie-az:
- When a store has product_provider='dutchie' and product_crawler_mode='production'
- The orchestrator looks up the corresponding dispensary in dutchie_az.dispensaries
- It calls crawlDispensaryProducts() from the dutchie-az pipeline
- Results are logged but data stays in the dutchie_az tables
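The routing condition in the first step might look like the predicate below. Only the two column names and their values come from this document; the `LegacyStore` type and the `routesToDutchieAz` name are hypothetical, and the real orchestrator may check more than this.

```typescript
// Hypothetical slice of a legacy store row; only the two columns used for routing.
interface LegacyStore {
  product_provider: string | null;
  product_crawler_mode: string | null;
}

// True when the orchestrator should hand the store to the dutchie-az pipeline.
function routesToDutchieAz(store: LegacyStore): boolean {
  return store.product_provider === 'dutchie' && store.product_crawler_mode === 'production';
}
```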
To use the dutchie-az pipeline independently:
- Navigate to /dutchie-az-schedule in the UI
- Use the REST API endpoints directly
- Run the scheduler service
Environment Variables
# Database connection for dutchie-az (same DB, separate tables)
DATABASE_URL=postgresql://user:pass@host:port/database
Troubleshooting
"Dispensary not found in dutchie-az database"
The dispensary must exist in dutchie_az.dispensaries before crawling. Either:
- Run discovery to populate dispensaries
- Manually insert the dispensary with platform_dispensary_id
GraphQL returns empty products
- Check platform_dispensary_id is correct (the internal Dutchie ID, not slug)
- Verify the dispensary is online and has menu data
- Try both rec and med pricing types
Snapshots show stock_status='unknown'
The product likely has no POSMetaData.children array. This happens for:
- Products without inventory tracking
- Manually managed inventory
Last updated: December 2025