# Dutchie Crawl Workflow

Complete end-to-end documentation for the Dutchie GraphQL crawl pipeline, from store discovery to product management.

---

## Table of Contents

1. [Architecture Overview](#1-architecture-overview)
2. [Store Discovery](#2-store-discovery)
3. [Platform ID Resolution](#3-platform-id-resolution)
4. [Product Crawling](#4-product-crawling)
5. [Normalization Pipeline](#5-normalization-pipeline)
6. [Canonical Data Model](#6-canonical-data-model)
7. [Hydration (Writing to DB)](#7-hydration-writing-to-db)
8. [Key Files Reference](#8-key-files-reference)
9. [Common Issues & Solutions](#9-common-issues--solutions)
10. [Running Crawls](#10-running-crawls)

---

## 1. Architecture Overview

```
┌─────────────────────────────────────────────────────────────────────────────┐
│                           DUTCHIE CRAWL PIPELINE                            │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│  ┌──────────────┐    ┌──────────────┐    ┌──────────────┐    ┌───────────┐  │
│  │  Discovery   │ -> │  Resolution  │ -> │    Crawl     │ -> │  Hydrate  │  │
│  │ (find URLs)  │    │  (get IDs)   │    │ (fetch data) │    │  (to DB)  │  │
│  └──────────────┘    └──────────────┘    └──────────────┘    └───────────┘  │
│         │                   │                   │                  │        │
│         v                   v                   v                  v        │
│  ┌──────────────┐    ┌──────────────┐    ┌──────────────┐    ┌───────────┐  │
│  │ dispensaries │    │ dispensaries │    │ Raw JSON     │    │ store_    │  │
│  │ .menu_url    │    │ .platform_   │    │ Products     │    │ products  │  │
│  │              │    │ dispensary_id│    │              │    │ snapshots │  │
│  └──────────────┘    └──────────────┘    └──────────────┘    │ variants  │  │
│                                                              └───────────┘  │
└─────────────────────────────────────────────────────────────────────────────┘
```

### Key Principles

1. **GraphQL Only**: All Dutchie data comes from `https://dutchie.com/api-3/graphql`
2. **Curl-Based HTTP**: Uses curl via child_process to bypass TLS fingerprinting
3. **No Puppeteer**: The old DOM-based scraper is deprecated - DO NOT USE `scraper-v2/engine.ts` for Dutchie
4. **Historical Data**: Never delete products/snapshots - always append

---

## 2. Store Discovery

### How Stores Get Into the System

Stores are added to the `dispensaries` table with a `menu_url` pointing to their Dutchie menu.

**Menu URL Formats:**

```
https://dutchie.com/dispensary/
https://dutchie.com/embedded-menu/
https://.com/menu (redirects to Dutchie)
```

### Required Fields for Crawling

| Field | Required | Description |
|-------|----------|-------------|
| `menu_url` | Yes | URL to the Dutchie menu |
| `menu_type` | Yes | Must be `'dutchie'` |
| `platform_dispensary_id` | Yes | MongoDB ObjectId from Dutchie |

**A store CANNOT be crawled until `platform_dispensary_id` is resolved.**

---

## 3. Platform ID Resolution

### What is `platform_dispensary_id`?

Dutchie uses MongoDB ObjectIds internally (e.g., `6405ef617056e8014d79101b`). This ID is required for all GraphQL product queries.

### Resolution Process

```typescript
// File: src/platforms/dutchie/queries.ts
import { resolveDispensaryId } from '../platforms/dutchie';

// Extract slug from menu_url
const slug = menuUrl.match(/\/(?:embedded-menu|dispensary)\/([^/?]+)/)?.[1];

// Resolve to platform ID via GraphQL
const platformId = await resolveDispensaryId(slug);
// Returns: "6405ef617056e8014d79101b" or null
```

### GraphQL Query Used

```graphql
query GetAddressBasedDispensaryData($dispensaryFilter: dispensaryFilter!) {
  dispensary(filter: $dispensaryFilter) {
    id      # <-- This is the platform_dispensary_id
    name
    cName
    ...
  }
}
```

**Variables:**

```json
{
  "dispensaryFilter": {
    "cNameOrID": "AZ-Deeply-Rooted"
  }
}
```

### Persisted Query Hash

```typescript
GRAPHQL_HASHES.GetAddressBasedDispensaryData = '13461f73abf7268770dfd05fe7e10c523084b2bb916a929c08efe3d87531977b'
```

---

## 4. Product Crawling

### GraphQL Query: FilteredProducts

This is the main query for fetching products from a dispensary.
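
Because this query needs a resolved `platform_dispensary_id`, a crawler will usually check a store before sending it. The sketch below is illustrative only: `DispensaryRow`, `extractSlug`, and `isCrawlable` are hypothetical names, not functions from the codebase. It reuses the slug regex from section 3 and the three required fields from section 2.

```typescript
// Hypothetical row shape: only the three required fields from section 2.
interface DispensaryRow {
  menu_url: string | null;
  menu_type: string | null;
  platform_dispensary_id: string | null;
}

// Same regex as the resolution example in section 3.
// Returns null when the URL has no /dispensary/ or /embedded-menu/ segment.
function extractSlug(menuUrl: string): string | null {
  const match = menuUrl.match(/\/(?:embedded-menu|dispensary)\/([^/?]+)/);
  return match?.[1] ?? null;
}

// A store is crawlable only when all three required fields are set.
function isCrawlable(row: DispensaryRow): boolean {
  return (
    Boolean(row.menu_url) &&
    row.menu_type === 'dutchie' &&
    Boolean(row.platform_dispensary_id)
  );
}
```

A store whose URL yields no slug (for example a custom domain that only redirects to Dutchie) would need its slug found some other way before resolution can run.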

**Endpoint:** `https://dutchie.com/api-3/graphql`

**Method:** POST (via curl)

**Persisted Query Hash:**

```typescript
GRAPHQL_HASHES.FilteredProducts = 'ee29c060826dc41c527e470e9ae502c9b2c169720faa0a9f5d25e1b9a530a4a0'
```

### Query Variables

```typescript
const variables = {
  includeEnterpriseSpecials: false,
  productsFilter: {
    dispensaryId: '6405ef617056e8014d79101b', // platform_dispensary_id
    pricingType: 'rec',                       // 'rec' or 'med'
    Status: 'Active',                         // CRITICAL: Use 'Active', NOT null
    types: [],                                // empty = all categories
    useCache: true,
    isDefaultSort: true,
    sortBy: 'popularSortIdx',
    sortDirection: 1,
    bypassOnlineThresholds: true,
    isKioskMenu: false,
    removeProductsBelowOptionThresholds: false,
  },
  page: 0,       // 0-indexed pagination
  perPage: 100,  // max 100 per page
};
```

### CRITICAL: Status Parameter

| Value | Result |
|-------|--------|
| `'Active'` | Returns in-stock products WITH pricing |
| `null` | Returns 0 products (broken) |
| `'Inactive'` | Returns out-of-stock products only |

**Always use `Status: 'Active'` for product crawls.**

### Response Structure

```json
{
  "data": {
    "filteredProducts": {
      "products": [
        {
          "_id": "product-mongo-id",
          "Name": "Product Name",
          "brandName": "Brand Name",
          "type": "Flower",
          "subcategory": "Indica",
          "Status": "Active",
          "recPrices": [45.00, 90.00],
          "recSpecialPrices": [],
          "THCContent": { "unit": "PERCENTAGE", "range": [28.24] },
          "CBDContent": { "unit": "PERCENTAGE", "range": [0] },
          "Image": "https://images.dutchie.com/...",
          "POSMetaData": {
            "children": [
              { "option": "1/8oz", "recPrice": 45.00, "quantityAvailable": 10 },
              { "option": "1/4oz", "recPrice": 90.00, "quantityAvailable": 5 }
            ]
          }
        }
      ],
      "queryInfo": {
        "totalCount": 1009,
        "totalPages": 11
      }
    }
  }
}
```

### Pagination

```typescript
const DUTCHIE_CONFIG = {
  perPage: 100,      // Products per page
  maxPages: 200,     // Safety limit
  pageDelayMs: 500,  // Delay between pages
};

// Fetch all pages, stopping at the safety limit
const allProducts = [];
let page = 0;
let totalPages = 1;

while (page < totalPages && page < DUTCHIE_CONFIG.maxPages) {
  const result = await executeGraphQL('FilteredProducts', { ...variables, page });
  const data = result.data.filteredProducts;

  // Derive the page count from totalCount and the configured page size
  totalPages = Math.ceil(data.queryInfo.totalCount / DUTCHIE_CONFIG.perPage);
  allProducts.push(...data.products);

  page++;
  await sleep(DUTCHIE_CONFIG.pageDelayMs); // Rate limiting
}
```

---

## 5. Normalization Pipeline

### Purpose

Convert raw Dutchie JSON into a standardized format before database insertion.

### Key File: `src/hydration/normalizers/dutchie.ts`

```typescript
import { DutchieNormalizer } from '../hydration';

const normalizer = new DutchieNormalizer();

// Build RawPayload structure
const rawPayload = {
  id: 'unique-id',
  dispensary_id: 112,
  crawl_run_id: null,
  platform: 'dutchie',
  payload_version: 1,
  raw_json: { products: rawProducts }, // <-- Products go here
  product_count: rawProducts.length,
  pricing_type: 'rec',
  crawl_mode: 'active',
  fetched_at: new Date(),
  processed: false,
  normalized_at: null,
  hydration_error: null,
  hydration_attempts: 0,
  created_at: new Date(),
};

// Normalize
const result = normalizer.normalize(rawPayload);

// Result contains:
// - result.products: NormalizedProduct[]
// - result.pricing: Map
// - result.availability: Map
// - result.brands: NormalizedBrand[]
```

### Field Mappings

| Dutchie Field | Normalized Field |
|---------------|------------------|
| `_id` / `id` | `externalProductId` |
| `Name` | `name` |
| `brandName` | `brandName` |
| `type` | `category` |
| `subcategory` | `subcategory` |
| `Status` | `status`, `isActive` |
| `THCContent.range[0]` | `thcPercent` |
| `CBDContent.range[0]` | `cbdPercent` |
| `Image` | `primaryImageUrl` |
| `recPrices[0]` | `priceRec` (in cents) |
| `recSpecialPrices[0]` | `priceRecSpecial` (in cents) |

### Data Validation

The normalizer handles edge cases:

```typescript
// THC/CBD values > 100 are milligrams, not percentages - skip them
if (thcPercent > 100) thcPercent = null;

// Products without IDs are skipped
if (!externalId) return null;

// Products without names are skipped
if (!name) return null;
```

---

## 6. Canonical Data Model

### Tables

#### `store_products` - Current product state per store

```sql
CREATE TABLE store_products (
  id SERIAL PRIMARY KEY,
  dispensary_id INTEGER NOT NULL REFERENCES dispensaries(id),
  provider VARCHAR(50) NOT NULL DEFAULT 'dutchie',
  provider_product_id VARCHAR(100),  -- Dutchie's _id
  name_raw VARCHAR(500) NOT NULL,
  brand_name_raw VARCHAR(255),
  category_raw VARCHAR(100),
  subcategory_raw VARCHAR(100),
  price_rec NUMERIC(10,2),
  price_med NUMERIC(10,2),
  price_rec_special NUMERIC(10,2),
  price_med_special NUMERIC(10,2),
  is_on_special BOOLEAN DEFAULT false,
  discount_percent NUMERIC(5,2),
  is_in_stock BOOLEAN DEFAULT true,
  stock_quantity INTEGER,
  stock_status VARCHAR(50) DEFAULT 'in_stock',
  thc_percent NUMERIC(5,2),  -- Percent (0-100); larger values are nulled before insert
  cbd_percent NUMERIC(5,2),  -- Percent (0-100); larger values are nulled before insert
  image_url TEXT,
  first_seen_at TIMESTAMPTZ DEFAULT NOW(),
  last_seen_at TIMESTAMPTZ DEFAULT NOW(),
  UNIQUE(dispensary_id, provider, provider_product_id)
);
```

#### `store_product_snapshots` - Historical price/stock records

```sql
CREATE TABLE store_product_snapshots (
  id SERIAL PRIMARY KEY,
  dispensary_id INTEGER NOT NULL,
  store_product_id INTEGER REFERENCES store_products(id),
  provider VARCHAR(50) NOT NULL,
  provider_product_id VARCHAR(100),
  crawl_run_id INTEGER,
  captured_at TIMESTAMPTZ NOT NULL,
  name_raw VARCHAR(500),
  brand_name_raw VARCHAR(255),
  category_raw VARCHAR(100),
  price_rec NUMERIC(10,2),
  price_med NUMERIC(10,2),
  price_rec_special NUMERIC(10,2),
  price_med_special NUMERIC(10,2),
  is_on_special BOOLEAN,
  is_in_stock BOOLEAN,
  stock_quantity INTEGER,
  stock_status VARCHAR(50),
  thc_percent NUMERIC(5,2),
  cbd_percent NUMERIC(5,2),
  raw_data JSONB  -- Full raw product for debugging
);
```

#### `product_variants` - Per-weight pricing options

```sql
CREATE TABLE product_variants (
  id SERIAL PRIMARY KEY,
  store_product_id INTEGER NOT NULL REFERENCES store_products(id),
  dispensary_id INTEGER NOT NULL,
  option VARCHAR(50) NOT NULL,  -- "1/8oz", "1g", "100mg"
  price_rec NUMERIC(10,2),
  price_med NUMERIC(10,2),
  price_rec_special NUMERIC(10,2),
  price_med_special NUMERIC(10,2),
  quantity INTEGER,
  in_stock BOOLEAN,
  weight_value NUMERIC(10,4),  -- Parsed: 3.5
  weight_unit VARCHAR(10),     -- Parsed: "g"
  UNIQUE(store_product_id, option)
);
```

---

## 7. Hydration (Writing to DB)

### Key File: `src/hydration/canonical-upsert.ts`

### Function: `hydrateToCanonical()`

```typescript
import { hydrateToCanonical } from '../hydration';

const result = await hydrateToCanonical(
  pool,          // pg Pool
  dispensaryId,  // number
  normResult,    // NormalizationResult from normalizer
  crawlRunId     // number | null
);

// Result:
// {
//   productsUpserted: 1009,
//   productsNew: 50,
//   snapshotsCreated: 1009,
//   variantsUpserted: 1011,
//   brandsUpserted: 102,
// }
```

### Upsert Logic

**Products:** `ON CONFLICT (dispensary_id, provider, provider_product_id) DO UPDATE`
- Updates: name, prices, stock, THC/CBD, timestamps
- Preserves: `first_seen_at`, `id`

**Snapshots:** Always INSERT (append-only history)
- One snapshot per product per crawl
- Contains full state at capture time

**Variants:** `ON CONFLICT (store_product_id, option) DO UPDATE`
- Updates: prices, stock, quantity
- Tracks: `last_price_change_at`, `last_stock_change_at`

### Data Transformations

```typescript
// Prices: cents -> dollars
priceRec: productPricing.priceRec / 100

// THC/CBD: values above 100 are not percentages, so store null instead
thcPercent: product.thcPercent <= 100 ? product.thcPercent : null

// Stock status mapping
stockStatus: availability.stockStatus || 'unknown'
```

---

## 8. Key Files Reference

### HTTP Client

| File | Purpose |
|------|---------|
| `src/platforms/dutchie/client.ts` | Curl-based HTTP client (LOCKED) |
| `src/platforms/dutchie/queries.ts` | GraphQL query wrappers |
| `src/platforms/dutchie/index.ts` | Public exports |

### Normalization

| File | Purpose |
|------|---------|
| `src/hydration/normalizers/dutchie.ts` | Dutchie-specific normalization |
| `src/hydration/normalizers/base.ts` | Base normalizer class |
| `src/hydration/types.ts` | Type definitions |

### Database

| File | Purpose |
|------|---------|
| `src/hydration/canonical-upsert.ts` | Upsert functions for canonical tables |
| `src/hydration/index.ts` | Public exports |
| `src/db/pool.ts` | Database connection pool |

### Scripts

| File | Purpose |
|------|---------|
| `src/scripts/test-crawl-to-canonical.ts` | Test script for single dispensary |

---

## 9. Common Issues & Solutions

### Issue: GraphQL Returns 0 Products

**Cause:** Using `Status: null` instead of `Status: 'Active'`

**Solution:**

```typescript
productsFilter: {
  Status: 'Active', // NOT null
  ...
}
```

### Issue: Numeric Field Overflow

**Cause:** THC/CBD values in milligrams (e.g., 1400mg) stored in a percentage field

**Solution:** Replace values > 100 with null:

```typescript
thcPercent: value <= 100 ? value : null
```

### Issue: Column "name" Does Not Exist

**Cause:** Code uses `name` but table has `name_raw`

**Column Mapping:**

| Code | Database |
|------|----------|
| `name` | `name_raw` |
| `brand_name` | `brand_name_raw` |
| `category` | `category_raw` |
| `subcategory` | `subcategory_raw` |

### Issue: 403 Forbidden

**Cause:** TLS fingerprinting or rate limiting

**Solution:** The curl-based client handles this with:
- Browser fingerprint rotation
- Proper headers (Origin, Referer, User-Agent)
- Retry with exponential backoff

### Issue: Normalizer Returns 0 Products

**Cause:** Wrong payload structure passed to `normalize()`

**Solution:** Use `RawPayload` structure:

```typescript
const rawPayload = {
  raw_json: { products: [...] }, // Products in raw_json
  dispensary_id: 112,            // Required
  // ... other fields
};

normalizer.normalize(rawPayload); // NOT (payload, id)
```

---

## 10. Running Crawls

### Test Script (Single Dispensary)

```bash
cd backend
DATABASE_URL="postgresql://dutchie:dutchie_local_pass@localhost:54320/dutchie_menus" \
  npx tsx src/scripts/test-crawl-to-canonical.ts

# Example:
DATABASE_URL="postgresql://dutchie:dutchie_local_pass@localhost:54320/dutchie_menus" \
  npx tsx src/scripts/test-crawl-to-canonical.ts 112
```

### Expected Output

```
============================================================
Test Crawl to Canonical - Dispensary 112
============================================================

[Step 1] Getting dispensary info...
  Name: Deeply Rooted Boutique Cannabis Company
  Platform ID: 6405ef617056e8014d79101b
  Menu URL: https://azdeeplyrooted.com/home
  cName: dispensary

[Step 2] Fetching products from Dutchie GraphQL...
[Fetch] Starting fetch for 6405ef617056e8014d79101b (cName: dispensary)
[Dutchie Client] curl POST FilteredProducts (attempt 1/4)
[Dutchie Client] Response status: 200
[Fetch] Page 1/11: 100 products (total so far: 100)
...
  Total products fetched: 1009

[Step 3] Normalizing products...
  Validation: PASS
  Normalized products: 1009
  Brands extracted: 102

[Step 4] Writing to canonical tables via hydrateToCanonical...
  Products upserted: 1009
  Variants upserted: 1011

[Step 5] Verifying data in canonical tables...
  store_products count: 1060
  product_variants count: 1011
  store_product_snapshots count: 4315

============================================================
SUCCESS - Crawl and hydration complete!
============================================================
```

### Verification Queries

```sql
-- Check products for a dispensary
SELECT id, name_raw, brand_name_raw, price_rec, is_in_stock
FROM store_products
WHERE dispensary_id = 112
ORDER BY last_seen_at DESC
LIMIT 10;

-- Check variants
SELECT pv.option, pv.price_rec, pv.in_stock, sp.name_raw
FROM product_variants pv
JOIN store_products sp ON sp.id = pv.store_product_id
WHERE pv.dispensary_id = 112
LIMIT 10;

-- Check snapshot history
SELECT COUNT(*) as total, MAX(captured_at) as latest
FROM store_product_snapshots
WHERE dispensary_id = 112;
```

---

## Appendix: GraphQL Hashes

All Dutchie GraphQL queries use persisted queries identified by SHA-256 hashes:

```typescript
export const GRAPHQL_HASHES = {
  FilteredProducts: 'ee29c060826dc41c527e470e9ae502c9b2c169720faa0a9f5d25e1b9a530a4a0',
  GetAddressBasedDispensaryData: '13461f73abf7268770dfd05fe7e10c523084b2bb916a929c08efe3d87531977b',
  ConsumerDispensaries: '0a5bfa6ca1d64ae47bcccb7c8077c87147cbc4e6982c17ceec97a2a4948b311b',
  DispensaryInfo: '13461f73abf7268770dfd05fe7e10c523084b2bb916a929c08efe3d87531977b',
  GetAllCitiesByState: 'ae547a0466ace5a48f91e55bf6699eacd87e3a42841560f0c0eabed5a0a920e6',
};
```

These hashes are fixed and tied to Dutchie's API version. If Dutchie changes its API, they may need updating.

---

*Last updated: December 2024*