cannaiq/backend/src/dutchie-az/README_DUTCHIE_AZ.md
Dutchie AZ Pipeline

Overview

The Dutchie AZ pipeline is the only authorized way to crawl Dutchie dispensary menus. It queries Dutchie's GraphQL API directly (no DOM scraping) and writes to its own set of tables, isolated from the legacy products table, using an append-only snapshot model.

Key Principles

  1. GraphQL Only - All Dutchie data is fetched via their FilteredProducts GraphQL API
  2. Isolated Database - Data lives in dutchie_az_* tables, NOT the legacy products table
  3. Append-Only Snapshots - Every crawl creates snapshots, never overwrites historical data
  4. Stock Status Tracking - Derived from POSMetaData.children inventory data
  5. Missing Product Detection - Products absent from the feed are marked with isPresentInFeed=false
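
Principle 5 amounts to a set difference between the products already known for a store and the ids returned by the latest crawl. The sketch below illustrates the idea only; `KnownProduct`, `externalId`, and `markPresence` are hypothetical names, and only the isPresentInFeed flag comes from the list above.

```typescript
// Hypothetical shape for illustration; the real types are not shown in this README.
interface KnownProduct {
  externalId: string;
  isPresentInFeed: boolean;
}

// Returns new records with isPresentInFeed derived from the ids seen in the
// latest crawl. Products are flagged rather than deleted, so history is kept.
function markPresence(known: KnownProduct[], feedIds: Iterable<string>): KnownProduct[] {
  const seen = new Set(feedIds);
  return known.map((p) => ({ ...p, isPresentInFeed: seen.has(p.externalId) }));
}

const updated = markPresence(
  [
    { externalId: "a1", isPresentInFeed: true },
    { externalId: "b2", isPresentInFeed: true },
  ],
  ["a1"],
);
console.log(updated.map((p) => `${p.externalId}:${p.isPresentInFeed}`).join(", "));
// prints "a1:true, b2:false"
```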

Directory Structure

src/dutchie-az/
├── db/
│   ├── connection.ts     # Database connection pool
│   └── schema.ts         # Table definitions and migrations
├── routes/
│   └── index.ts          # REST API endpoints
├── services/
│   ├── graphql-client.ts # Direct GraphQL fetch (Mode A + Mode B)
│   ├── product-crawler.ts # Main crawler orchestration
│   └── scheduler.ts      # Jittered scheduling with wandering intervals
└── types/
    └── index.ts          # TypeScript interfaces

Data Model

Tables

  • dispensaries - Arizona Dutchie stores with platform_dispensary_id
  • dutchie_products - Canonical product identity (one row per product per store)
  • dutchie_product_snapshots - Historical state per crawl (append-only)
  • job_schedules - Scheduler configuration with jitter support
  • job_run_logs - Execution history

Stock Status

The stock_status field is derived from POSMetaData.children:

function deriveStockStatus(children?: POSChild[]): StockStatus {
  // No inventory rows at all: stock cannot be determined
  if (!children || children.length === 0) return 'unknown';
  // Sum availability across all POS children, treating missing values as 0
  const totalAvailable = children.reduce((sum, c) =>
    sum + (c.quantityAvailable || 0), 0);
  return totalAvailable > 0 ? 'in_stock' : 'out_of_stock';
}
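
The function relies on POSChild and StockStatus, which are defined elsewhere (presumably in types/index.ts). To try it in isolation, a self-contained version with assumed minimal types could look like this; the real POSChild likely carries more fields than the one used here.

```typescript
// Assumed minimal shapes, not copied from the codebase.
type StockStatus = "in_stock" | "out_of_stock" | "unknown";

interface POSChild {
  quantityAvailable?: number | null;
}

function deriveStockStatus(children?: POSChild[]): StockStatus {
  // No inventory rows at all: stock cannot be determined
  if (!children || children.length === 0) return "unknown";
  // Missing or null quantities count as 0
  const totalAvailable = children.reduce((sum, c) => sum + (c.quantityAvailable || 0), 0);
  return totalAvailable > 0 ? "in_stock" : "out_of_stock";
}

console.log(deriveStockStatus(undefined)); // unknown
console.log(deriveStockStatus([{ quantityAvailable: 0 }, { quantityAvailable: 3 }])); // in_stock
console.log(deriveStockStatus([{ quantityAvailable: 0 }, {}])); // out_of_stock
```

Note that an empty children array yields 'unknown' rather than 'out_of_stock': the absence of inventory rows is treated as missing information, not as zero stock.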

Two-Mode Crawling

Mode A (UI Parity):

  • Status: null - Returns what the UI shows
  • Best for "current inventory" snapshot

Mode B (Max Coverage):

  • Status: 'Active' - Returns all active products
  • Catches items with isBelowThreshold: true

Results from both modes are merged to maximize product coverage.
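
This README does not spell out how the two result sets are combined. One plausible approach is to de-duplicate by product id and let Mode A win when both modes return the same product. The sketch below assumes that precedence; `FeedProduct` and `mergeModes` are hypothetical names.

```typescript
// Hypothetical product shape; only the id is needed to de-duplicate.
interface FeedProduct {
  id: string;
  label: string;
}

// Merge two result sets by id. Mode B entries are inserted first and Mode A
// entries overwrite them, so Mode A wins on conflicts (an assumption here).
function mergeModes(modeA: FeedProduct[], modeB: FeedProduct[]): FeedProduct[] {
  const byId = new Map<string, FeedProduct>();
  for (const p of modeB) byId.set(p.id, p);
  for (const p of modeA) byId.set(p.id, p);
  return Array.from(byId.values());
}

const merged = mergeModes(
  [{ id: "1", label: "Flower (A)" }],
  [
    { id: "1", label: "Flower (B)" },
    { id: "2", label: "Edible (B)" },
  ],
);
console.log(merged.map((p) => p.label).join(", "));
// prints "Flower (A), Edible (B)"
```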

API Endpoints

All endpoints are mounted at /api/dutchie-az/:

GET  /api/dutchie-az/dispensaries           - List all dispensaries
GET  /api/dutchie-az/dispensaries/:id       - Get dispensary details
GET  /api/dutchie-az/products               - List products (with filters)
GET  /api/dutchie-az/products/:id           - Get product with snapshots
GET  /api/dutchie-az/products/:id/snapshots - Get product snapshot history
POST /api/dutchie-az/crawl/:dispensaryId    - Trigger manual crawl
GET  /api/dutchie-az/schedule               - Get scheduler status
POST /api/dutchie-az/schedule/run           - Manually run scheduled jobs
GET  /api/dutchie-az/stats                  - Dashboard statistics
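
As a rough illustration of calling these endpoints from TypeScript, the sketch below builds the snapshot-history URL and wraps it in a fetch call. The base URL is a placeholder, the helper names are hypothetical, and any authentication the routes require is not covered. The response shape is not documented here, so it is typed as unknown.

```typescript
const API_PREFIX = "/api/dutchie-az";

// Builds the snapshot-history URL for one product.
function snapshotsUrl(baseUrl: string, productId: number | string): string {
  const trimmed = baseUrl.replace(/\/+$/, ""); // drop trailing slashes
  return `${trimmed}${API_PREFIX}/products/${encodeURIComponent(String(productId))}/snapshots`;
}

// Fetches snapshot history for a product (requires a runtime with global fetch).
async function getSnapshots(baseUrl: string, productId: number | string): Promise<unknown> {
  const res = await fetch(snapshotsUrl(baseUrl, productId));
  if (!res.ok) throw new Error(`Request failed: ${res.status} ${res.statusText}`);
  return res.json();
}

// Placeholder host; substitute the address of your backend.
console.log(snapshotsUrl("http://localhost:3000/", 42));
// prints "http://localhost:3000/api/dutchie-az/products/42/snapshots"
```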

Scheduler

The scheduler adds random jitter to each job's interval so that crawls do not follow a detectable fixed pattern:

// Each job has independent "wandering" timing
interface JobSchedule {
  base_interval_minutes: number;  // e.g., 240 (4 hours)
  jitter_minutes: number;         // e.g., 30 (±30 min)
  next_run_at: Date;              // Calculated with jitter after each run
}

Jobs run when next_run_at <= NOW(). After completion, the next run is calculated:

next_run_at = NOW() + base_interval + random(-jitter, +jitter)

This prevents crawls from clustering at predictable times.
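
The formula can be sketched in TypeScript as follows. This is an illustration rather than the scheduler's actual code: `computeNextRunAt` and `isDue` are hypothetical names, and a uniform distribution for the jitter is an assumption. The random source is injectable so the calculation can be checked deterministically.

```typescript
interface JobSchedule {
  base_interval_minutes: number;
  jitter_minutes: number;
  next_run_at: Date;
}

// next_run_at = now + base_interval + random(-jitter, +jitter)
function computeNextRunAt(
  job: Pick<JobSchedule, "base_interval_minutes" | "jitter_minutes">,
  now: Date = new Date(),
  rand: () => number = Math.random,
): Date {
  // rand() is in [0, 1), so the offset is uniform in [-jitter, +jitter) minutes.
  const offsetMinutes = (rand() * 2 - 1) * job.jitter_minutes;
  const totalMinutes = job.base_interval_minutes + offsetMinutes;
  return new Date(now.getTime() + totalMinutes * 60000);
}

// A job is due when next_run_at <= now.
function isDue(job: JobSchedule, now: Date = new Date()): boolean {
  return job.next_run_at.getTime() <= now.getTime();
}

const startedAt = new Date("2025-12-02T00:00:00Z");
const nextRun = computeNextRunAt({ base_interval_minutes: 240, jitter_minutes: 30 }, startedAt, () => 0.5);
console.log(nextRun.toISOString());
// prints "2025-12-02T04:00:00.000Z" (rand = 0.5 gives zero offset)
```

With a 240-minute base and 30-minute jitter, each run lands somewhere between 3.5 and 4.5 hours after the previous one, so start times drift over successive runs.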

Manual Testing

Run a single dispensary crawl:

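# Single quotes around the script keep an interactive shell from expanding "!"
DATABASE_URL="..." npx tsx -e '
const { crawlDispensaryProducts } = require("./src/dutchie-az/services/product-crawler");
const { query } = require("./src/dutchie-az/db/connection");

async function test() {
  // Pick any one dispensary to crawl
  const { rows } = await query("SELECT * FROM dispensaries LIMIT 1");
  if (!rows[0]) return console.log("No dispensaries found");

  const result = await crawlDispensaryProducts(rows[0], "rec", { useBothModes: true });
  console.log(JSON.stringify(result, null, 2));
}
// Surface failures instead of leaving an unhandled rejection
test().catch((err) => { console.error(err); process.exitCode = 1; });
'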

Check stock status distribution:

SELECT stock_status, COUNT(*)
FROM dutchie_products
GROUP BY stock_status;

View recent snapshots:

SELECT
  p.name,
  s.stock_status,
  s.is_present_in_feed,
  s.crawled_at
FROM dutchie_product_snapshots s
JOIN dutchie_products p ON p.id = s.dutchie_product_id
ORDER BY s.crawled_at DESC
LIMIT 20;

Deprecated Code

The following files are DEPRECATED and will throw errors if called:

  • src/scrapers/dutchie-graphql.ts - Wrote to legacy products table
  • src/scrapers/dutchie-graphql-direct.ts - Wrote to legacy products table
  • src/scrapers/templates/dutchie.ts - HTML/DOM scraper (unreliable)
  • src/scraper-v2/engine.ts DutchieSpider - DOM-based extraction

If store-crawl-orchestrator.ts detects a store with product_provider='dutchie' and product_crawler_mode='production', it routes the crawl to the dutchie-az pipeline automatically.

Integration with Legacy System

The store-crawl-orchestrator.ts bridges the legacy stores system with dutchie-az:

  1. The orchestrator checks whether a store has product_provider='dutchie' and product_crawler_mode='production'.
  2. If so, it looks up the corresponding dispensary in dutchie_az.dispensaries.
  3. It calls crawlDispensaryProducts() from the dutchie-az pipeline.
  4. Results are logged, but the data stays in the dutchie_az tables.
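
The routing condition in step 1 can be expressed as a small predicate. This is a sketch only: `LegacyStore` and `shouldRouteToDutchieAz` are hypothetical names, and the interface lists just the two columns named above.

```typescript
// Hypothetical subset of a legacy store row.
interface LegacyStore {
  product_provider: string | null;
  product_crawler_mode: string | null;
}

// True when the store should be handed to the dutchie-az pipeline.
function shouldRouteToDutchieAz(store: LegacyStore): boolean {
  return store.product_provider === "dutchie" && store.product_crawler_mode === "production";
}

console.log(shouldRouteToDutchieAz({ product_provider: "dutchie", product_crawler_mode: "production" })); // true
console.log(shouldRouteToDutchieAz({ product_provider: "dutchie", product_crawler_mode: "sandbox" })); // false
```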

To use the dutchie-az pipeline independently:

  • Navigate to /dutchie-az-schedule in the UI
  • Use the REST API endpoints directly
  • Run the scheduler service

Environment Variables

# Database connection for dutchie-az (same DB, separate tables)
DATABASE_URL=postgresql://user:pass@host:port/database

Troubleshooting

"Dispensary not found in dutchie-az database"

The dispensary must exist in dutchie_az.dispensaries before crawling. Either:

  1. Run discovery to populate dispensaries
  2. Manually insert the dispensary with platform_dispensary_id

GraphQL returns empty products

  1. Check platform_dispensary_id is correct (the internal Dutchie ID, not slug)
  2. Verify the dispensary is online and has menu data
  3. Try both rec and med pricing types

Snapshots show stock_status='unknown'

The product likely has no POSMetaData.children array. This happens for:

  • Products without inventory tracking
  • Manually managed inventory

Last updated: December 2025