Dutchie Crawl Workflow

Complete end-to-end documentation for the Dutchie GraphQL crawl pipeline, from store discovery to product management.


Table of Contents

  1. Architecture Overview
  2. Store Discovery
  3. Platform ID Resolution
  4. Product Crawling
  5. Normalization Pipeline
  6. Canonical Data Model
  7. Hydration (Writing to DB)
  8. Key Files Reference
  9. Common Issues & Solutions
  10. Running Crawls

1. Architecture Overview

┌─────────────────────────────────────────────────────────────────────────────┐
│                           DUTCHIE CRAWL PIPELINE                            │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│  ┌──────────────┐    ┌──────────────┐    ┌──────────────┐    ┌───────────┐ │
│  │   Discovery  │ -> │  Resolution  │ -> │    Crawl     │ -> │  Hydrate  │ │
│  │  (find URLs) │    │ (get IDs)    │    │ (fetch data) │    │ (to DB)   │ │
│  └──────────────┘    └──────────────┘    └──────────────┘    └───────────┘ │
│         │                   │                   │                  │        │
│         v                   v                   v                  v        │
│  ┌──────────────┐    ┌──────────────┐    ┌──────────────┐    ┌───────────┐ │
│  │ dispensaries │    │ dispensaries │    │  Raw JSON    │    │ store_    │ │
│  │  .menu_url   │    │ .platform_   │    │  Products    │    │ products  │ │
│  │              │    │ dispensary_id│    │              │    │ snapshots │ │
│  └──────────────┘    └──────────────┘    └──────────────┘    │ variants  │ │
│                                                              └───────────┘ │
└─────────────────────────────────────────────────────────────────────────────┘

Key Principles

  1. GraphQL Only: All Dutchie data comes from https://dutchie.com/api-3/graphql
  2. Curl-Based HTTP: Uses curl via child_process to bypass TLS fingerprinting
  3. No Puppeteer: The old DOM-based scraper is deprecated - DO NOT USE scraper-v2/engine.ts for Dutchie
  4. Historical Data: Never delete products/snapshots - always append

2. Store Discovery

How Stores Get Into the System

Stores are added to the dispensaries table with a menu_url pointing to their Dutchie menu.

Menu URL Formats:

https://dutchie.com/dispensary/<slug>
https://dutchie.com/embedded-menu/<slug>
https://<custom-domain>.com/menu (redirects to Dutchie)

Required Fields for Crawling

| Field | Required | Description |
|-------|----------|-------------|
| menu_url | Yes | URL to the Dutchie menu |
| menu_type | Yes | Must be 'dutchie' |
| platform_dispensary_id | Yes | MongoDB ObjectId from Dutchie |

A store CANNOT be crawled until platform_dispensary_id is resolved.
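
The readiness rule above can be expressed as a small check. This is a minimal sketch, not code from the repository: the `DispensaryRow` shape and the `crawlBlockers` name are hypothetical, and only the three fields from the table are modelled.

```typescript
// Hypothetical row shape: only the three required fields from the table above.
interface DispensaryRow {
  menu_url: string | null;
  menu_type: string | null;
  platform_dispensary_id: string | null;
}

// Returns the reasons a store cannot be crawled yet (an empty array means ready).
function crawlBlockers(d: DispensaryRow): string[] {
  const blockers: string[] = [];
  if (!d.menu_url) blockers.push('menu_url is missing');
  if (d.menu_type !== 'dutchie') blockers.push("menu_type must be 'dutchie'");
  if (!d.platform_dispensary_id) blockers.push('platform_dispensary_id is not resolved');
  return blockers;
}
```

Returning a list of reasons rather than a boolean makes it easier to report why a store was skipped.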


3. Platform ID Resolution

What is platform_dispensary_id?

Dutchie uses MongoDB ObjectIds internally (e.g., 6405ef617056e8014d79101b). This ID is required for all GraphQL product queries.

Resolution Process

// resolveDispensaryId is defined in src/platforms/dutchie/queries.ts

import { resolveDispensaryId } from '../platforms/dutchie';

// Extract slug from menu_url (undefined when the URL contains no slug, e.g. a custom domain)
const slug = menuUrl.match(/\/(?:embedded-menu|dispensary)\/([^/?]+)/)?.[1];
if (!slug) throw new Error(`No Dutchie slug found in menu_url: ${menuUrl}`);

// Resolve to platform ID via GraphQL
const platformId = await resolveDispensaryId(slug);
// Returns: "6405ef617056e8014d79101b" or null
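
To show how the slug regex behaves across the three menu URL formats, here is a small standalone sketch. The `extractDutchieSlug` name is hypothetical, and the regex additionally stops at `#`, which is an assumption beyond the pattern shown above.

```typescript
// Hypothetical helper: pulls the slug out of a Dutchie menu URL, or returns
// null when the URL has no /dispensary/ or /embedded-menu/ segment.
function extractDutchieSlug(menuUrl: string): string | null {
  const match = menuUrl.match(/\/(?:embedded-menu|dispensary)\/([^/?#]+)/);
  return match?.[1] ?? null;
}

const examples = [
  'https://dutchie.com/dispensary/AZ-Deeply-Rooted',
  'https://dutchie.com/embedded-menu/AZ-Deeply-Rooted?menuType=rec',
  'https://azdeeplyrooted.com/menu',
];
for (const url of examples) {
  console.log(url, '->', extractDutchieSlug(url));
}
```

Custom-domain URLs yield `null` here, so they would need another route to a slug (for example, following the redirect) before the ID can be resolved.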

GraphQL Query Used

query GetAddressBasedDispensaryData($dispensaryFilter: dispensaryFilter!) {
  dispensary(filter: $dispensaryFilter) {
    id        # <-- This is the platform_dispensary_id
    name
    cName
    ...
  }
}

Variables:

{
  "dispensaryFilter": {
    "cNameOrID": "AZ-Deeply-Rooted"
  }
}

Persisted Query Hash

GRAPHQL_HASHES.GetAddressBasedDispensaryData = '13461f73abf7268770dfd05fe7e10c523084b2bb916a929c08efe3d87531977b'
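
For orientation, a persisted-query request typically sends the hash instead of the query text. The sketch below assumes the client follows Apollo's automatic persisted queries convention (`extensions.persistedQuery` with `version` and `sha256Hash`); the source does not show the actual request shape used by `client.ts`, and `buildPersistedQueryBody` is a hypothetical name.

```typescript
// Assumed request shape, following Apollo's persisted-query convention.
interface PersistedQueryBody {
  operationName: string;
  variables: Record<string, unknown>;
  extensions: { persistedQuery: { version: 1; sha256Hash: string } };
}

// Hash copied from this document; other operations would be added the same way.
const GRAPHQL_HASHES: Record<string, string> = {
  GetAddressBasedDispensaryData:
    '13461f73abf7268770dfd05fe7e10c523084b2bb916a929c08efe3d87531977b',
};

// Hypothetical helper: builds the JSON body for one operation.
function buildPersistedQueryBody(
  operationName: string,
  variables: Record<string, unknown>,
): PersistedQueryBody {
  const sha256Hash = GRAPHQL_HASHES[operationName];
  if (!sha256Hash) throw new Error(`No persisted hash for ${operationName}`);
  return {
    operationName,
    variables,
    extensions: { persistedQuery: { version: 1, sha256Hash } },
  };
}

const body = buildPersistedQueryBody('GetAddressBasedDispensaryData', {
  dispensaryFilter: { cNameOrID: 'AZ-Deeply-Rooted' },
});
console.log(JSON.stringify(body, null, 2));
```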

4. Product Crawling

GraphQL Query: FilteredProducts

This is the main query for fetching products from a dispensary.

Endpoint: https://dutchie.com/api-3/graphql

Method: POST (via curl)

Persisted Query Hash:

GRAPHQL_HASHES.FilteredProducts = 'ee29c060826dc41c527e470e9ae502c9b2c169720faa0a9f5d25e1b9a530a4a0'

Query Variables

const variables = {
  includeEnterpriseSpecials: false,
  productsFilter: {
    dispensaryId: '6405ef617056e8014d79101b',  // platform_dispensary_id
    pricingType: 'rec',                         // 'rec' or 'med'
    Status: 'Active',                           // CRITICAL: Use 'Active', NOT null
    types: [],                                  // empty = all categories
    useCache: true,
    isDefaultSort: true,
    sortBy: 'popularSortIdx',
    sortDirection: 1,
    bypassOnlineThresholds: true,
    isKioskMenu: false,
    removeProductsBelowOptionThresholds: false,
  },
  page: 0,        // 0-indexed pagination
  perPage: 100,   // max 100 per page
};

CRITICAL: Status Parameter

| Value | Result |
|-------|--------|
| 'Active' | Returns in-stock products WITH pricing |
| null | Returns 0 products (broken) |
| 'Inactive' | Returns out-of-stock products only |

Always use Status: 'Active' for product crawls.

Response Structure

{
  "data": {
    "filteredProducts": {
      "products": [
        {
          "_id": "product-mongo-id",
          "Name": "Product Name",
          "brandName": "Brand Name",
          "type": "Flower",
          "subcategory": "Indica",
          "Status": "Active",
          "recPrices": [45.00, 90.00],
          "recSpecialPrices": [],
          "THCContent": { "unit": "PERCENTAGE", "range": [28.24] },
          "CBDContent": { "unit": "PERCENTAGE", "range": [0] },
          "Image": "https://images.dutchie.com/...",
          "POSMetaData": {
            "children": [
              {
                "option": "1/8oz",
                "recPrice": 45.00,
                "quantityAvailable": 10
              },
              {
                "option": "1/4oz",
                "recPrice": 90.00,
                "quantityAvailable": 5
              }
            ]
          }
        }
      ],
      "queryInfo": {
        "totalCount": 1009,
        "totalPages": 11
      }
    }
  }
}
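
As an illustration of reading this response, the sketch below pulls the per-option entries out of `POSMetaData.children`. The interfaces cover only the fields shown in the sample above, `summarizeVariants` is a hypothetical name, and treating `quantityAvailable > 0` as "in stock" is an assumption rather than the pipeline's confirmed rule.

```typescript
// Partial types: only the fields visible in the sample response.
interface DutchieChild {
  option: string;
  recPrice?: number | null;
  quantityAvailable?: number | null;
}
interface DutchieProduct {
  _id: string;
  Name: string;
  POSMetaData?: { children?: DutchieChild[] } | null;
}
interface VariantSummary {
  option: string;
  recPrice: number | null;
  quantity: number | null;
  inStock: boolean;
}

// Hypothetical helper: one summary per weight/size option.
function summarizeVariants(product: DutchieProduct): VariantSummary[] {
  const children = product.POSMetaData?.children ?? [];
  return children.map((child) => {
    const quantity = child.quantityAvailable ?? null;
    return {
      option: child.option,
      recPrice: child.recPrice ?? null,
      quantity,
      inStock: quantity !== null && quantity > 0, // assumption, see lead-in
    };
  });
}

const sampleProduct: DutchieProduct = {
  _id: 'product-mongo-id',
  Name: 'Product Name',
  POSMetaData: {
    children: [
      { option: '1/8oz', recPrice: 45.0, quantityAvailable: 10 },
      { option: '1/4oz', recPrice: 90.0, quantityAvailable: 5 },
    ],
  },
};
console.log(summarizeVariants(sampleProduct));
```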

Pagination

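const DUTCHIE_CONFIG = {
  perPage: 100,      // Products per page
  maxPages: 200,     // Safety limit
  pageDelayMs: 500,  // Delay between pages
};

// Fetch all pages
const allProducts = [];
let page = 0;
let totalPages = 1;

// Stop at the last page or at the safety limit, whichever comes first
while (page < totalPages && page < DUTCHIE_CONFIG.maxPages) {
  const result = await executeGraphQL('FilteredProducts', {
    ...variables,
    page,
    perPage: DUTCHIE_CONFIG.perPage,
  });
  const data = result.data.filteredProducts;

  // Derive the page count from totalCount and the configured page size
  totalPages = Math.ceil(data.queryInfo.totalCount / DUTCHIE_CONFIG.perPage);
  allProducts.push(...data.products);

  page++;
  await sleep(DUTCHIE_CONFIG.pageDelayMs);  // Rate limiting
}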
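
The page-count arithmetic can be isolated into a pure function, which makes the safety limit easy to test. This is a sketch with a hypothetical name (`pagesToFetch`); the numbers in the usage lines come from the sample response and `DUTCHIE_CONFIG` above.

```typescript
// Number of pages to request: ceil(totalCount / perPage), capped at maxPages.
function pagesToFetch(totalCount: number, perPage: number, maxPages: number): number {
  if (perPage <= 0) throw new Error('perPage must be positive');
  return Math.min(Math.ceil(totalCount / perPage), maxPages);
}

console.log(pagesToFetch(1009, 100, 200));  // 1009 products at 100 per page
console.log(pagesToFetch(50000, 100, 200)); // capped by the safety limit
```

With `totalCount: 1009` and `perPage: 100` this gives 11 pages, which matches the `totalPages` value in the sample response.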

5. Normalization Pipeline

Purpose

Convert raw Dutchie JSON into a standardized format before database insertion.

Key File: src/hydration/normalizers/dutchie.ts

import { DutchieNormalizer } from '../hydration';

const normalizer = new DutchieNormalizer();

// Build RawPayload structure
const rawPayload = {
  id: 'unique-id',
  dispensary_id: 112,
  crawl_run_id: null,
  platform: 'dutchie',
  payload_version: 1,
  raw_json: { products: rawProducts },  // <-- Products go here
  product_count: rawProducts.length,
  pricing_type: 'rec',
  crawl_mode: 'active',
  fetched_at: new Date(),
  processed: false,
  normalized_at: null,
  hydration_error: null,
  hydration_attempts: 0,
  created_at: new Date(),
};

// Normalize
const result = normalizer.normalize(rawPayload);

// Result contains:
// - result.products: NormalizedProduct[]
// - result.pricing: Map<externalId, NormalizedPricing>
// - result.availability: Map<externalId, NormalizedAvailability>
// - result.brands: NormalizedBrand[]

Field Mappings

| Dutchie Field | Normalized Field |
|---------------|------------------|
| _id / id | externalProductId |
| Name | name |
| brandName | brandName |
| type | category |
| subcategory | subcategory |
| Status | status, isActive |
| THCContent.range[0] | thcPercent |
| CBDContent.range[0] | cbdPercent |
| Image | primaryImageUrl |
| recPrices[0] | priceRec (in cents) |
| recSpecialPrices[0] | priceRecSpecial (in cents) |

Data Validation

The normalizer handles edge cases:

// THC/CBD values > 100 are milligrams, not percentages - skip them
if (thcPercent > 100) thcPercent = null;

// Products without IDs are skipped
if (!externalId) return null;

// Products without names are skipped
if (!name) return null;
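
Taken together, the field mappings and validation rules might look like the function below. This is a simplified sketch, not the actual `DutchieNormalizer`: it flattens pricing into the product object (the real result keeps pricing in a separate map), and the type and function names are hypothetical.

```typescript
// Partial raw shape: only the fields listed in the mapping table.
interface RawDutchieProduct {
  _id?: string;
  id?: string;
  Name?: string;
  brandName?: string;
  type?: string;
  subcategory?: string;
  Status?: string;
  THCContent?: { range?: number[] } | null;
  CBDContent?: { range?: number[] } | null;
  Image?: string;
  recPrices?: number[];
  recSpecialPrices?: number[];
}

interface NormalizedSketch {
  externalProductId: string;
  name: string;
  brandName: string | null;
  category: string | null;
  subcategory: string | null;
  status: string | null;
  isActive: boolean;
  thcPercent: number | null;
  cbdPercent: number | null;
  primaryImageUrl: string | null;
  priceRec: number | null;        // cents
  priceRecSpecial: number | null; // cents
}

// Dollars -> integer cents; a missing price becomes null.
const toCents = (dollars: number | undefined): number | null =>
  dollars === undefined ? null : Math.round(dollars * 100);

// Values above 100 are treated as milligrams and dropped.
const toPercent = (value: number | undefined): number | null =>
  value === undefined || value > 100 ? null : value;

function normalizeProductSketch(raw: RawDutchieProduct): NormalizedSketch | null {
  const externalProductId = raw._id ?? raw.id;
  if (!externalProductId || !raw.Name) return null; // skip products without ID or name
  return {
    externalProductId,
    name: raw.Name,
    brandName: raw.brandName ?? null,
    category: raw.type ?? null,
    subcategory: raw.subcategory ?? null,
    status: raw.Status ?? null,
    isActive: raw.Status === 'Active',
    thcPercent: toPercent(raw.THCContent?.range?.[0]),
    cbdPercent: toPercent(raw.CBDContent?.range?.[0]),
    primaryImageUrl: raw.Image ?? null,
    priceRec: toCents(raw.recPrices?.[0]),
    priceRecSpecial: toCents(raw.recSpecialPrices?.[0]),
  };
}
```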

6. Canonical Data Model

Tables

store_products - Current product state per store

CREATE TABLE store_products (
  id SERIAL PRIMARY KEY,
  dispensary_id INTEGER NOT NULL REFERENCES dispensaries(id),
  provider VARCHAR(50) NOT NULL DEFAULT 'dutchie',
  provider_product_id VARCHAR(100),  -- Dutchie's _id

  name_raw VARCHAR(500) NOT NULL,
  brand_name_raw VARCHAR(255),
  category_raw VARCHAR(100),
  subcategory_raw VARCHAR(100),

  price_rec NUMERIC(10,2),
  price_med NUMERIC(10,2),
  price_rec_special NUMERIC(10,2),
  price_med_special NUMERIC(10,2),
  is_on_special BOOLEAN DEFAULT false,
  discount_percent NUMERIC(5,2),

  is_in_stock BOOLEAN DEFAULT true,
  stock_quantity INTEGER,
  stock_status VARCHAR(50) DEFAULT 'in_stock',

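  thc_percent NUMERIC(5,2),  -- Column max is 999.99; values > 100 are nulled before insert
  cbd_percent NUMERIC(5,2),  -- Column max is 999.99; values > 100 are nulled before insert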

  image_url TEXT,

  first_seen_at TIMESTAMPTZ DEFAULT NOW(),
  last_seen_at TIMESTAMPTZ DEFAULT NOW(),

  UNIQUE(dispensary_id, provider, provider_product_id)
);

store_product_snapshots - Historical price/stock records

CREATE TABLE store_product_snapshots (
  id SERIAL PRIMARY KEY,
  dispensary_id INTEGER NOT NULL,
  store_product_id INTEGER REFERENCES store_products(id),
  provider VARCHAR(50) NOT NULL,
  provider_product_id VARCHAR(100),
  crawl_run_id INTEGER,
  captured_at TIMESTAMPTZ NOT NULL,

  name_raw VARCHAR(500),
  brand_name_raw VARCHAR(255),
  category_raw VARCHAR(100),

  price_rec NUMERIC(10,2),
  price_med NUMERIC(10,2),
  price_rec_special NUMERIC(10,2),
  price_med_special NUMERIC(10,2),
  is_on_special BOOLEAN,

  is_in_stock BOOLEAN,
  stock_quantity INTEGER,
  stock_status VARCHAR(50),

  thc_percent NUMERIC(5,2),
  cbd_percent NUMERIC(5,2),

  raw_data JSONB  -- Full raw product for debugging
);

product_variants - Per-weight pricing options

CREATE TABLE product_variants (
  id SERIAL PRIMARY KEY,
  store_product_id INTEGER NOT NULL REFERENCES store_products(id),
  dispensary_id INTEGER NOT NULL,

  option VARCHAR(50) NOT NULL,  -- "1/8oz", "1g", "100mg"

  price_rec NUMERIC(10,2),
  price_med NUMERIC(10,2),
  price_rec_special NUMERIC(10,2),
  price_med_special NUMERIC(10,2),

  quantity INTEGER,
  in_stock BOOLEAN,

  weight_value NUMERIC(10,4),  -- Parsed: 3.5
  weight_unit VARCHAR(10),     -- Parsed: "g"

  UNIQUE(store_product_id, option)
);
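
The `weight_value` and `weight_unit` columns imply that option strings such as "1/8oz" are parsed. The source does not show the parser, so the sketch below is one possible approach under stated assumptions: ounces are converted at the retail convention of 28 g per ounce (so "1/8oz" becomes 3.5 g, matching the column comment), and `parseWeightOption` is a hypothetical name.

```typescript
interface ParsedWeight {
  value: number;
  unit: 'g' | 'mg';
}

// Assumption: retail convention of 28 g per ounce, not the exact 28.3495 g.
const GRAMS_PER_OUNCE = 28;

// Parses strings like "1/8oz", "1g", "3.5 g", "100mg"; returns null otherwise.
function parseWeightOption(option: string): ParsedWeight | null {
  const m = option
    .trim()
    .toLowerCase()
    .match(/^(\d+(?:\.\d+)?)(?:\/(\d+(?:\.\d+)?))?\s*(oz|mg|g)$/);
  if (!m) return null;

  const numerator = Number(m[1]);
  const denominator = m[2] !== undefined ? Number(m[2]) : 1;
  if (denominator === 0) return null;

  const amount = numerator / denominator;
  const unit = m[3];
  if (unit === 'oz') return { value: amount * GRAMS_PER_OUNCE, unit: 'g' };
  if (unit === 'mg') return { value: amount, unit: 'mg' };
  return { value: amount, unit: 'g' };
}

for (const option of ['1/8oz', '1g', '100mg', 'each']) {
  console.log(option, '->', parseWeightOption(option));
}
```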

7. Hydration (Writing to DB)

Key File: src/hydration/canonical-upsert.ts

Function: hydrateToCanonical()

import { hydrateToCanonical } from '../hydration';

const result = await hydrateToCanonical(
  pool,           // pg Pool
  dispensaryId,   // number
  normResult,     // NormalizationResult from normalizer
  crawlRunId      // number | null
);

// Result:
// {
//   productsUpserted: 1009,
//   productsNew: 50,
//   snapshotsCreated: 1009,
//   variantsUpserted: 1011,
//   brandsUpserted: 102,
// }

Upsert Logic

Products: ON CONFLICT (dispensary_id, provider, provider_product_id) DO UPDATE

  • Updates: name, prices, stock, THC/CBD, timestamps
  • Preserves: first_seen_at, id

Snapshots: Always INSERT (append-only history)

  • One snapshot per product per crawl
  • Contains full state at capture time

Variants: ON CONFLICT (store_product_id, option) DO UPDATE

  • Updates: prices, stock, quantity
  • Tracks: last_price_change_at, last_stock_change_at
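
The product upsert described above can be sketched as a parameterized query. This is an illustration only: it covers a handful of columns from the `store_products` schema, `buildStoreProductUpsert` and `StoreProductInput` are hypothetical names, and the real `canonical-upsert.ts` may be structured differently. The returned `{ text, values }` object is the query-config shape accepted by `pool.query` in node-postgres.

```typescript
// Hypothetical input: a small subset of the store_products columns.
interface StoreProductInput {
  dispensaryId: number;
  providerProductId: string;
  nameRaw: string;
  priceRec: number | null; // dollars
  isInStock: boolean;
}

function buildStoreProductUpsert(
  row: StoreProductInput,
): { text: string; values: unknown[] } {
  // The creation timestamp and id are left out of the SET list,
  // so existing rows keep their original values.
  const text = `
    INSERT INTO store_products
      (dispensary_id, provider, provider_product_id, name_raw, price_rec, is_in_stock)
    VALUES ($1, 'dutchie', $2, $3, $4, $5)
    ON CONFLICT (dispensary_id, provider, provider_product_id) DO UPDATE SET
      name_raw     = EXCLUDED.name_raw,
      price_rec    = EXCLUDED.price_rec,
      is_in_stock  = EXCLUDED.is_in_stock,
      last_seen_at = NOW()
    RETURNING id`;
  const values = [
    row.dispensaryId,
    row.providerProductId,
    row.nameRaw,
    row.priceRec,
    row.isInStock,
  ];
  return { text, values };
}

const query = buildStoreProductUpsert({
  dispensaryId: 112,
  providerProductId: 'product-mongo-id',
  nameRaw: 'Product Name',
  priceRec: 45,
  isInStock: true,
});
console.log(query.text, query.values);
```

Passing `query` to `pool.query(query)` would run the statement; the snapshot insert would follow using the returned `id`.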

Data Transformations

// Prices: cents -> dollars
priceRec: productPricing.priceRec / 100

// THC/CBD: values above 100 are milligrams, not percentages - store null instead
thcPercent: product.thcPercent <= 100 ? product.thcPercent : null

// Stock status mapping
stockStatus: availability.stockStatus || 'unknown'

8. Key Files Reference

HTTP Client

| File | Purpose |
|------|---------|
| src/platforms/dutchie/client.ts | Curl-based HTTP client (LOCKED) |
| src/platforms/dutchie/queries.ts | GraphQL query wrappers |
| src/platforms/dutchie/index.ts | Public exports |

Normalization

| File | Purpose |
|------|---------|
| src/hydration/normalizers/dutchie.ts | Dutchie-specific normalization |
| src/hydration/normalizers/base.ts | Base normalizer class |
| src/hydration/types.ts | Type definitions |

Database

| File | Purpose |
|------|---------|
| src/hydration/canonical-upsert.ts | Upsert functions for canonical tables |
| src/hydration/index.ts | Public exports |
| src/db/pool.ts | Database connection pool |

Scripts

| File | Purpose |
|------|---------|
| src/scripts/test-crawl-to-canonical.ts | Test script for single dispensary |

9. Common Issues & Solutions

Issue: GraphQL Returns 0 Products

Cause: Using Status: null instead of Status: 'Active'

Solution:

productsFilter: {
  Status: 'Active',  // NOT null
  ...
}

Issue: Numeric Field Overflow

Cause: THC/CBD values reported in milligrams (e.g., 1400mg) are written to a NUMERIC(5,2) percentage column, which overflows above 999.99

Solution: Replace values > 100 with null:

thcPercent: value <= 100 ? value : null

Issue: Column "name" Does Not Exist

Cause: Code uses name but table has name_raw

Column Mapping:

| Code | Database |
|------|----------|
| name | name_raw |
| brand_name | brand_name_raw |
| category | category_raw |
| subcategory | subcategory_raw |

Issue: 403 Forbidden

Cause: TLS fingerprinting or rate limiting

Solution: The curl-based client handles this with:

  • Browser fingerprint rotation
  • Proper headers (Origin, Referer, User-Agent)
  • Retry with exponential backoff
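
For reference, exponential backoff doubles the wait after each failed attempt. The sketch below is generic and not taken from `client.ts`: the base delay, cap, and function names are hypothetical, and only the attempt count of 4 is suggested by the "attempt 1/4" log line in this document.

```typescript
// Delay to wait after a failed attempt (1-based): base * 2^(attempt - 1), capped.
function backoffDelayMs(attempt: number, baseMs = 1000, maxMs = 30000): number {
  return Math.min(baseMs * Math.pow(2, attempt - 1), maxMs);
}

// Waits between attempts for a request retried up to maxAttempts times.
function retrySchedule(maxAttempts: number, baseMs = 1000, maxMs = 30000): number[] {
  const delays: number[] = [];
  for (let attempt = 1; attempt < maxAttempts; attempt++) {
    delays.push(backoffDelayMs(attempt, baseMs, maxMs));
  }
  return delays;
}

console.log(retrySchedule(4)); // waits before attempts 2, 3 and 4
```

A production client would usually add random jitter to these delays so that parallel workers do not retry in lockstep.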

Issue: Normalizer Returns 0 Products

Cause: Wrong payload structure passed to normalize()

Solution: Use RawPayload structure:

const rawPayload = {
  raw_json: { products: [...] },  // Products in raw_json
  dispensary_id: 112,             // Required
  // ... other fields
};
normalizer.normalize(rawPayload);  // NOT (payload, id)

10. Running Crawls

Test Script (Single Dispensary)

cd backend

DATABASE_URL="postgresql://dutchie:dutchie_local_pass@localhost:54320/dutchie_menus" \
npx tsx src/scripts/test-crawl-to-canonical.ts <dispensaryId>

# Example:
DATABASE_URL="postgresql://dutchie:dutchie_local_pass@localhost:54320/dutchie_menus" \
npx tsx src/scripts/test-crawl-to-canonical.ts 112

Expected Output

============================================================
Test Crawl to Canonical - Dispensary 112
============================================================

[Step 1] Getting dispensary info...
  Name: Deeply Rooted Boutique Cannabis Company
  Platform ID: 6405ef617056e8014d79101b
  Menu URL: https://azdeeplyrooted.com/home
  cName: dispensary

[Step 2] Fetching products from Dutchie GraphQL...
[Fetch] Starting fetch for 6405ef617056e8014d79101b (cName: dispensary)
[Dutchie Client] curl POST FilteredProducts (attempt 1/4)
[Dutchie Client] Response status: 200
[Fetch] Page 1/11: 100 products (total so far: 100)
...
  Total products fetched: 1009

[Step 3] Normalizing products...
  Validation: PASS
  Normalized products: 1009
  Brands extracted: 102

[Step 4] Writing to canonical tables via hydrateToCanonical...
  Products upserted: 1009
  Variants upserted: 1011

[Step 5] Verifying data in canonical tables...
  store_products count: 1060
  product_variants count: 1011
  store_product_snapshots count: 4315

============================================================
SUCCESS - Crawl and hydration complete!
============================================================

Verification Queries

-- Check products for a dispensary
SELECT id, name_raw, brand_name_raw, price_rec, is_in_stock
FROM store_products
WHERE dispensary_id = 112
ORDER BY last_seen_at DESC
LIMIT 10;

-- Check variants
SELECT pv.option, pv.price_rec, pv.in_stock, sp.name_raw
FROM product_variants pv
JOIN store_products sp ON sp.id = pv.store_product_id
WHERE pv.dispensary_id = 112
LIMIT 10;

-- Check snapshot history
SELECT COUNT(*) as total, MAX(captured_at) as latest
FROM store_product_snapshots
WHERE dispensary_id = 112;

Appendix: GraphQL Hashes

All Dutchie GraphQL requests are sent as persisted queries, identified by SHA-256 hashes:

export const GRAPHQL_HASHES = {
  FilteredProducts: 'ee29c060826dc41c527e470e9ae502c9b2c169720faa0a9f5d25e1b9a530a4a0',
  GetAddressBasedDispensaryData: '13461f73abf7268770dfd05fe7e10c523084b2bb916a929c08efe3d87531977b',
  ConsumerDispensaries: '0a5bfa6ca1d64ae47bcccb7c8077c87147cbc4e6982c17ceec97a2a4948b311b',
  DispensaryInfo: '13461f73abf7268770dfd05fe7e10c523084b2bb916a929c08efe3d87531977b',
  GetAllCitiesByState: 'ae547a0466ace5a48f91e55bf6699eacd87e3a42841560f0c0eabed5a0a920e6',
};

These hashes are fixed and tied to Dutchie's API version. If Dutchie changes their API, these may need updating.


Last updated: December 2024