# Dutchie AZ Pipeline

## Overview

The Dutchie AZ pipeline is the **only** authorized way to crawl Dutchie dispensary menus. It uses Dutchie's GraphQL API directly (no DOM scraping) and writes to an isolated database with a proper snapshot model.

## Key Principles

1. **GraphQL Only** - All Dutchie data is fetched via their FilteredProducts GraphQL API
2. **Isolated Database** - Data lives in `dutchie_az_*` tables, NOT the legacy `products` table
3. **Append-Only Snapshots** - Every crawl creates snapshots, never overwrites historical data
4. **Stock Status Tracking** - Derived from `POSMetaData.children` inventory data
5. **Missing Product Detection** - Products not in feed are marked with `isPresentInFeed=false`

## Directory Structure

```
src/dutchie-az/
├── db/
│   ├── connection.ts       # Database connection pool
│   └── schema.ts           # Table definitions and migrations
├── routes/
│   └── index.ts            # REST API endpoints
├── services/
│   ├── graphql-client.ts   # Direct GraphQL fetch (Mode A + Mode B)
│   ├── product-crawler.ts  # Main crawler orchestration
│   └── scheduler.ts        # Jittered scheduling with wandering intervals
└── types/
    └── index.ts            # TypeScript interfaces
```

## Data Model

### Tables

- **dispensaries** - Arizona Dutchie stores with `platform_dispensary_id`
- **dutchie_products** - Canonical product identity (one row per product per store)
- **dutchie_product_snapshots** - Historical state per crawl (append-only)
- **job_schedules** - Scheduler configuration with jitter support
- **job_run_logs** - Execution history

### Stock Status

The `stock_status` field is derived from `POSMetaData.children`:

```typescript
function deriveStockStatus(children?: POSChild[]): StockStatus {
  if (!children || children.length === 0) return 'unknown';
  const totalAvailable = children.reduce((sum, c) =>
    sum + (c.quantityAvailable || 0), 0);
  return totalAvailable > 0 ? 'in_stock' : 'out_of_stock';
}
```

### Two-Mode Crawling

Mode A (UI Parity):
- `Status: null` - Returns what the UI shows
- Best for "current inventory" snapshot

Mode B (Max Coverage):
- `Status: 'Active'` - Returns all active products
- Catches items with `isBelowThreshold: true`

Both modes are merged to get maximum product coverage.

## API Endpoints

All endpoints are mounted at `/api/dutchie-az/`:

```
GET  /api/dutchie-az/dispensaries            - List all dispensaries
GET  /api/dutchie-az/dispensaries/:id        - Get dispensary details
GET  /api/dutchie-az/products                - List products (with filters)
GET  /api/dutchie-az/products/:id            - Get product with snapshots
GET  /api/dutchie-az/products/:id/snapshots  - Get product snapshot history
POST /api/dutchie-az/crawl/:dispensaryId     - Trigger manual crawl
GET  /api/dutchie-az/schedule                - Get scheduler status
POST /api/dutchie-az/schedule/run            - Manually run scheduled jobs
GET  /api/dutchie-az/stats                   - Dashboard statistics
```

## Scheduler

The scheduler uses **jitter** to avoid detection patterns:

```typescript
// Each job has independent "wandering" timing
interface JobSchedule {
  base_interval_minutes: number;  // e.g., 240 (4 hours)
  jitter_minutes: number;         // e.g., 30 (±30 min)
  next_run_at: Date;              // Calculated with jitter after each run
}
```

Jobs run when `next_run_at <= NOW()`. After completion, the next run is calculated:

```
next_run_at = NOW() + base_interval + random(-jitter, +jitter)
```

This prevents crawls from clustering at predictable times.

## Manual Testing

### Run a single dispensary crawl:

```bash
DATABASE_URL="..." npx tsx -e "
const { crawlDispensaryProducts } = require('./src/dutchie-az/services/product-crawler');
const { query } = require('./src/dutchie-az/db/connection');

async function test() {
  const { rows } = await query('SELECT * FROM dispensaries LIMIT 1');
  if (!rows[0]) return console.log('No dispensaries found');

  const result = await crawlDispensaryProducts(rows[0], 'rec', { useBothModes: true });
  console.log(JSON.stringify(result, null, 2));
}
test();
"
```

### Check stock status distribution:

```sql
SELECT stock_status, COUNT(*)
FROM dutchie_products
GROUP BY stock_status;
```

### View recent snapshots:

```sql
SELECT
  p.name,
  s.stock_status,
  s.is_present_in_feed,
  s.crawled_at
FROM dutchie_product_snapshots s
JOIN dutchie_products p ON p.id = s.dutchie_product_id
ORDER BY s.crawled_at DESC
LIMIT 20;
```

## Deprecated Code

The following files are **DEPRECATED** and will throw errors if called:

- `src/scrapers/dutchie-graphql.ts` - Wrote to legacy `products` table
- `src/scrapers/dutchie-graphql-direct.ts` - Wrote to legacy `products` table
- `src/scrapers/templates/dutchie.ts` - HTML/DOM scraper (unreliable)
- `src/scraper-v2/engine.ts` DutchieSpider - DOM-based extraction

If `store-crawl-orchestrator.ts` detects `provider='dutchie'` with `mode='production'`, it now routes to this dutchie-az pipeline automatically.

## Integration with Legacy System

The `store-crawl-orchestrator.ts` bridges the legacy stores system with dutchie-az:

1. When a store has `product_provider='dutchie'` and `product_crawler_mode='production'`
2. The orchestrator looks up the corresponding dispensary in `dutchie_az.dispensaries`
3. It calls `crawlDispensaryProducts()` from the dutchie-az pipeline
4. Results are logged but data stays in the dutchie_az tables

To use the dutchie-az pipeline independently:
- Navigate to `/dutchie-az-schedule` in the UI
- Use the REST API endpoints directly
- Run the scheduler service

## Environment Variables

```bash
# Database connection for dutchie-az (same DB, separate tables)
DATABASE_URL=postgresql://user:pass@host:port/database
```

## Troubleshooting

### "Dispensary not found in dutchie-az database"

The dispensary must exist in `dutchie_az.dispensaries` before crawling. Either:
1. Run discovery to populate dispensaries
2. Manually insert the dispensary with `platform_dispensary_id`

### GraphQL returns empty products

1. Check `platform_dispensary_id` is correct (the internal Dutchie ID, not slug)
2. Verify the dispensary is online and has menu data
3. Try both `rec` and `med` pricing types

### Snapshots show `stock_status='unknown'`

The product likely has no `POSMetaData.children` array. This happens for:
- Products without inventory tracking
- Manually managed inventory

---

Last updated: December 2025
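
## Appendix: Jitter Calculation Sketch

The scheduling rule from the Scheduler section (`next_run_at = NOW() + base_interval + random(-jitter, +jitter)`) can be sketched as below. This is a minimal illustration, not the code in `scheduler.ts`: the names `JitterConfig`, `computeNextRunAt`, and `isDue` are hypothetical, and the injectable `rand` parameter is an assumption added so the result can be made deterministic when testing.

```typescript
// Hypothetical helpers; the real scheduler.ts may compute this differently.
interface JitterConfig {
  base_interval_minutes: number; // e.g., 240 (4 hours)
  jitter_minutes: number;        // e.g., 30 (±30 min)
}

const MS_PER_MINUTE = 60_000;

// Returns now + base_interval + random(-jitter, +jitter).
// `rand` must return a value in [0, 1), like Math.random.
function computeNextRunAt(
  config: JitterConfig,
  now: Date = new Date(),
  rand: () => number = Math.random,
): Date {
  // Map rand() in [0, 1) onto an offset in [-jitter, +jitter) minutes.
  const offsetMinutes = (rand() * 2 - 1) * config.jitter_minutes;
  const delayMinutes = config.base_interval_minutes + offsetMinutes;
  return new Date(now.getTime() + delayMinutes * MS_PER_MINUTE);
}

// A job is due when next_run_at <= now.
function isDue(nextRunAt: Date, now: Date = new Date()): boolean {
  return nextRunAt.getTime() <= now.getTime();
}

// Usage: with rand fixed at 0.5 the jitter offset is zero,
// so the next run lands exactly one base interval later.
const exampleNow = new Date('2025-12-01T00:00:00Z');
const exampleNext = computeNextRunAt(
  { base_interval_minutes: 240, jitter_minutes: 30 },
  exampleNow,
  () => 0.5,
);
console.log(exampleNext.toISOString()); // 2025-12-01T04:00:00.000Z
```

With the default `Math.random`, each job's next run falls somewhere in a window of `2 * jitter_minutes` centered on the base interval, which is what keeps independent jobs from drifting into lockstep.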