Dutchie Crawl Workflow
Complete end-to-end documentation for the Dutchie GraphQL crawl pipeline, from store discovery to product management.
Table of Contents
- Architecture Overview
- Store Discovery
- Platform ID Resolution
- Product Crawling
- Normalization Pipeline
- Canonical Data Model
- Hydration (Writing to DB)
- Key Files Reference
- Common Issues & Solutions
- Running Crawls
1. Architecture Overview
┌─────────────────────────────────────────────────────────────────────────────┐
│                           DUTCHIE CRAWL PIPELINE                            │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│  ┌──────────────┐    ┌──────────────┐    ┌──────────────┐    ┌───────────┐  │
│  │  Discovery   │ -> │  Resolution  │ -> │    Crawl     │ -> │  Hydrate  │  │
│  │ (find URLs)  │    │  (get IDs)   │    │ (fetch data) │    │  (to DB)  │  │
│  └──────────────┘    └──────────────┘    └──────────────┘    └───────────┘  │
│          │                   │                   │                 │        │
│          v                   v                   v                 v        │
│  ┌──────────────┐    ┌──────────────┐    ┌──────────────┐    ┌───────────┐  │
│  │ dispensaries │    │ dispensaries │    │   Raw JSON   │    │  store_   │  │
│  │  .menu_url   │    │  .platform_  │    │   Products   │    │ products  │  │
│  │              │    │ dispensary_id│    │              │    │ snapshots │  │
│  └──────────────┘    └──────────────┘    └──────────────┘    │ variants  │  │
│                                                              └───────────┘  │
└─────────────────────────────────────────────────────────────────────────────┘
Key Principles
- GraphQL Only: All Dutchie data comes from https://dutchie.com/api-3/graphql
- Curl-Based HTTP: Uses curl via child_process to bypass TLS fingerprinting
- No Puppeteer: The old DOM-based scraper is deprecated - DO NOT USE scraper-v2/engine.ts for Dutchie
- Historical Data: Never delete products/snapshots - always append
2. Store Discovery
How Stores Get Into the System
Stores are added to the dispensaries table with a menu_url pointing to their Dutchie menu.
Menu URL Formats:
https://dutchie.com/dispensary/<slug>
https://dutchie.com/embedded-menu/<slug>
https://<custom-domain>.com/menu (redirects to Dutchie)
Required Fields for Crawling
| Field | Required | Description |
|---|---|---|
| menu_url | Yes | URL to the Dutchie menu |
| menu_type | Yes | Must be 'dutchie' |
| platform_dispensary_id | Yes | MongoDB ObjectId from Dutchie |
A store CANNOT be crawled until platform_dispensary_id is resolved.
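The three required fields can be checked with a small guard before a store is queued. This is a minimal sketch, assuming a hypothetical `DispensaryRow` shape that contains only the fields from the table above; the real row type likely carries more columns, and `crawlBlockers` is an illustrative name, not a function from the codebase.

```typescript
// Hypothetical row shape: only the three fields listed in the table above.
interface DispensaryRow {
  menu_url: string | null;
  menu_type: string | null;
  platform_dispensary_id: string | null;
}

// Returns the reasons a store cannot be crawled yet (empty array = ready).
function crawlBlockers(row: DispensaryRow): string[] {
  const blockers: string[] = [];
  if (!row.menu_url) blockers.push('menu_url is missing');
  if (row.menu_type !== 'dutchie') blockers.push("menu_type must be 'dutchie'");
  if (!row.platform_dispensary_id) {
    blockers.push('platform_dispensary_id is not resolved');
  }
  return blockers;
}
```

Returning the list of reasons rather than a boolean makes it easier to log why a store was skipped.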
3. Platform ID Resolution
What is platform_dispensary_id?
Dutchie uses MongoDB ObjectIds internally (e.g., 6405ef617056e8014d79101b). This ID is required for all GraphQL product queries.
Resolution Process
// File: src/platforms/dutchie/queries.ts
import { resolveDispensaryId } from '../platforms/dutchie';
// Extract slug from menu_url
const slug = menuUrl.match(/\/(?:embedded-menu|dispensary)\/([^/?]+)/)?.[1];
// Resolve to platform ID via GraphQL
const platformId = await resolveDispensaryId(slug);
// Returns: "6405ef617056e8014d79101b" or null
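The regex above only matches the two dutchie.com URL formats, so a custom-domain menu URL yields no slug. A minimal sketch of a null-safe wrapper follows; `extractSlug` is a hypothetical helper name, and how custom domains are resolved is not covered here.

```typescript
// Returns the slug for /dispensary/<slug> and /embedded-menu/<slug> URLs,
// or null when the URL uses another format (for example a custom domain).
function extractSlug(menuUrl: string): string | null {
  const match = menuUrl.match(/\/(?:embedded-menu|dispensary)\/([^/?#]+)/);
  return match?.[1] ?? null;
}

const slug = extractSlug('https://dutchie.com/embedded-menu/AZ-Deeply-Rooted?menuType=rec');
// slug is 'AZ-Deeply-Rooted'; check for null before calling resolveDispensaryId
```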
GraphQL Query Used
query GetAddressBasedDispensaryData($dispensaryFilter: dispensaryFilter!) {
dispensary(filter: $dispensaryFilter) {
id # <-- This is the platform_dispensary_id
name
cName
...
}
}
Variables:
{
"dispensaryFilter": {
"cNameOrID": "AZ-Deeply-Rooted"
}
}
Persisted Query Hash
GRAPHQL_HASHES.GetAddressBasedDispensaryData = '13461f73abf7268770dfd05fe7e10c523084b2bb916a929c08efe3d87531977b'
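To show how a persisted-query hash is typically sent, here is a minimal sketch of the request body. It assumes the payload follows the Apollo automatic-persisted-query convention (`extensions.persistedQuery` with `version` and `sha256Hash`); the actual shape used by src/platforms/dutchie/client.ts is not shown in this document and may differ. `buildPersistedQueryBody` is a hypothetical helper.

```typescript
// Assumption: Apollo-style persisted query envelope.
interface PersistedQueryBody {
  operationName: string;
  variables: Record<string, unknown>;
  extensions: { persistedQuery: { version: 1; sha256Hash: string } };
}

function buildPersistedQueryBody(
  operationName: string,
  variables: Record<string, unknown>,
  sha256Hash: string,
): PersistedQueryBody {
  // A SHA-256 digest is always 64 lowercase hex characters.
  if (!/^[0-9a-f]{64}$/.test(sha256Hash)) {
    throw new Error(`Invalid SHA-256 hash for ${operationName}`);
  }
  return {
    operationName,
    variables,
    extensions: { persistedQuery: { version: 1, sha256Hash } },
  };
}

const resolveBody = buildPersistedQueryBody(
  'GetAddressBasedDispensaryData',
  { dispensaryFilter: { cNameOrID: 'AZ-Deeply-Rooted' } },
  '13461f73abf7268770dfd05fe7e10c523084b2bb916a929c08efe3d87531977b',
);
```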
4. Product Crawling
GraphQL Query: FilteredProducts
This is the main query for fetching products from a dispensary.
Endpoint: https://dutchie.com/api-3/graphql
Method: POST (via curl)
Persisted Query Hash:
GRAPHQL_HASHES.FilteredProducts = 'ee29c060826dc41c527e470e9ae502c9b2c169720faa0a9f5d25e1b9a530a4a0'
Query Variables
const variables = {
includeEnterpriseSpecials: false,
productsFilter: {
dispensaryId: '6405ef617056e8014d79101b', // platform_dispensary_id
pricingType: 'rec', // 'rec' or 'med'
Status: 'Active', // CRITICAL: Use 'Active', NOT null
types: [], // empty = all categories
useCache: true,
isDefaultSort: true,
sortBy: 'popularSortIdx',
sortDirection: 1,
bypassOnlineThresholds: true,
isKioskMenu: false,
removeProductsBelowOptionThresholds: false,
},
page: 0, // 0-indexed pagination
perPage: 100, // max 100 per page
};
CRITICAL: Status Parameter
| Value | Result |
|---|---|
| 'Active' | Returns in-stock products WITH pricing |
| null | Returns 0 products (broken) |
| 'Inactive' | Returns out-of-stock products only |
Always use Status: 'Active' for product crawls.
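One way to make the rule hard to violate is to encode it in the type of a variables builder, so that `null` is rejected at compile time. This is a sketch only: `buildProductsVariables` and the two type aliases are hypothetical names, and the filter fields are copied from the Query Variables listing above.

```typescript
type PricingType = 'rec' | 'med';
type ProductStatus = 'Active' | 'Inactive'; // null is deliberately not allowed

function buildProductsVariables(
  dispensaryId: string,
  pricingType: PricingType,
  page: number,
  status: ProductStatus = 'Active',
) {
  return {
    includeEnterpriseSpecials: false,
    productsFilter: {
      dispensaryId,
      pricingType,
      Status: status,
      types: [] as string[], // empty = all categories
      useCache: true,
      isDefaultSort: true,
      sortBy: 'popularSortIdx',
      sortDirection: 1,
      bypassOnlineThresholds: true,
      isKioskMenu: false,
      removeProductsBelowOptionThresholds: false,
    },
    page,
    perPage: 100,
  };
}

const firstPage = buildProductsVariables('6405ef617056e8014d79101b', 'rec', 0);
```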
Response Structure
{
"data": {
"filteredProducts": {
"products": [
{
"_id": "product-mongo-id",
"Name": "Product Name",
"brandName": "Brand Name",
"type": "Flower",
"subcategory": "Indica",
"Status": "Active",
"recPrices": [45.00, 90.00],
"recSpecialPrices": [],
"THCContent": { "unit": "PERCENTAGE", "range": [28.24] },
"CBDContent": { "unit": "PERCENTAGE", "range": [0] },
"Image": "https://images.dutchie.com/...",
"POSMetaData": {
"children": [
{
"option": "1/8oz",
"recPrice": 45.00,
"quantityAvailable": 10
},
{
"option": "1/4oz",
"recPrice": 90.00,
"quantityAvailable": 5
}
]
}
}
],
"queryInfo": {
"totalCount": 1009,
"totalPages": 11
}
}
}
}
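The per-weight options live under `POSMetaData.children`. The sketch below types only the fields visible in the sample response and flattens the children into variant records. The interface names and `extractVariants` are hypothetical, and deriving `inStock` from `quantityAvailable > 0` is an assumption rather than documented behavior.

```typescript
// Partial types: only fields that appear in the sample response above.
interface DutchieChild {
  option: string;
  recPrice?: number;
  quantityAvailable?: number;
}

interface DutchieProduct {
  _id: string;
  Name: string;
  POSMetaData?: { children?: DutchieChild[] };
}

interface VariantSketch {
  option: string;
  priceRec: number | null;
  quantity: number | null;
  inStock: boolean;
}

function extractVariants(product: DutchieProduct): VariantSketch[] {
  const children = product.POSMetaData?.children ?? [];
  return children.map((child) => ({
    option: child.option,
    priceRec: child.recPrice ?? null,
    quantity: child.quantityAvailable ?? null,
    // Assumption: a variant counts as in stock when its quantity is positive.
    inStock: (child.quantityAvailable ?? 0) > 0,
  }));
}
```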
Pagination
const DUTCHIE_CONFIG = {
  perPage: 100,     // Products per page
  maxPages: 200,    // Safety limit
  pageDelayMs: 500, // Delay between pages
};
// Fetch all pages, using the config values rather than hard-coded numbers
const allProducts = [];
let page = 0;
let totalPages = 1;
while (page < totalPages && page < DUTCHIE_CONFIG.maxPages) {
  const result = await executeGraphQL('FilteredProducts', {
    ...variables,
    page,
    perPage: DUTCHIE_CONFIG.perPage,
  });
  const data = result.data.filteredProducts;
  totalPages = Math.ceil(data.queryInfo.totalCount / DUTCHIE_CONFIG.perPage);
  allProducts.push(...data.products);
  page++;
  await sleep(DUTCHIE_CONFIG.pageDelayMs); // Rate limiting
}
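The same loop can be written as a reusable helper with the page fetcher injected, which makes the stop conditions easy to test without network access. A minimal sketch, assuming hypothetical names (`fetchAllPages`, `PageResult`); it stops on the last page, on an empty page, or at `maxPages`, whichever comes first.

```typescript
interface PageResult<T> {
  items: T[];
  totalCount: number;
}

interface PagingOptions {
  perPage: number;
  maxPages: number;
  pageDelayMs: number;
}

async function fetchAllPages<T>(
  fetchPage: (page: number) => Promise<PageResult<T>>,
  opts: PagingOptions,
): Promise<T[]> {
  const all: T[] = [];
  let totalPages = 1;
  for (let page = 0; page < totalPages && page < opts.maxPages; page++) {
    const { items, totalCount } = await fetchPage(page);
    all.push(...items);
    totalPages = Math.ceil(totalCount / opts.perPage);
    if (items.length === 0) break; // defensive: avoid looping on empty pages
    if (page + 1 < totalPages) {
      // Delay only between pages, not after the last one.
      await new Promise<void>((resolve) => setTimeout(resolve, opts.pageDelayMs));
    }
  }
  return all;
}
```

With `perPage: 100`, a `totalCount` of 1009 gives `Math.ceil(1009 / 100) = 11` pages, which matches the `totalPages` value in the sample response.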
5. Normalization Pipeline
Purpose
Convert raw Dutchie JSON into a standardized format before database insertion.
Key File: src/hydration/normalizers/dutchie.ts
import { DutchieNormalizer } from '../hydration';
const normalizer = new DutchieNormalizer();
// Build RawPayload structure
const rawPayload = {
id: 'unique-id',
dispensary_id: 112,
crawl_run_id: null,
platform: 'dutchie',
payload_version: 1,
raw_json: { products: rawProducts }, // <-- Products go here
product_count: rawProducts.length,
pricing_type: 'rec',
crawl_mode: 'active',
fetched_at: new Date(),
processed: false,
normalized_at: null,
hydration_error: null,
hydration_attempts: 0,
created_at: new Date(),
};
// Normalize
const result = normalizer.normalize(rawPayload);
// Result contains:
// - result.products: NormalizedProduct[]
// - result.pricing: Map<externalId, NormalizedPricing>
// - result.availability: Map<externalId, NormalizedAvailability>
// - result.brands: NormalizedBrand[]
Field Mappings
| Dutchie Field | Normalized Field |
|---|---|
| _id / id | externalProductId |
| Name | name |
| brandName | brandName |
| type | category |
| subcategory | subcategory |
| Status | status, isActive |
| THCContent.range[0] | thcPercent |
| CBDContent.range[0] | cbdPercent |
| Image | primaryImageUrl |
| recPrices[0] | priceRec (in cents) |
| recSpecialPrices[0] | priceRecSpecial (in cents) |
Data Validation
The normalizer handles edge cases:
// THC/CBD values > 100 are milligrams, not percentages - skip them
if (thcPercent > 100) thcPercent = null;
// Products without IDs are skipped
if (!externalId) return null;
// Products without names are skipped
if (!name) return null;
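The field mappings and validation rules above can be combined into one function. The following is an illustrative sketch, not the DutchieNormalizer implementation: the interface names, `toCents`, `toPercent`, and `normalizeProduct` are hypothetical, and treating `isActive` as `Status === 'Active'` is an assumption.

```typescript
interface RawDutchieProduct {
  _id?: string;
  id?: string;
  Name?: string;
  brandName?: string;
  type?: string;
  subcategory?: string;
  Status?: string;
  THCContent?: { unit?: string; range?: number[] };
  CBDContent?: { unit?: string; range?: number[] };
  Image?: string;
  recPrices?: number[];
  recSpecialPrices?: number[];
}

interface NormalizedSketch {
  externalProductId: string;
  name: string;
  brandName: string | null;
  category: string | null;
  subcategory: string | null;
  status: string | null;
  isActive: boolean;
  thcPercent: number | null;
  cbdPercent: number | null;
  primaryImageUrl: string | null;
  priceRec: number | null; // cents
  priceRecSpecial: number | null; // cents
}

// Dollars -> integer cents; missing values become null.
const toCents = (dollars?: number): number | null =>
  dollars == null ? null : Math.round(dollars * 100);

// Values above 100 are treated as milligrams and dropped.
const toPercent = (value?: number): number | null =>
  value == null || value > 100 ? null : value;

function normalizeProduct(raw: RawDutchieProduct): NormalizedSketch | null {
  const externalProductId = raw._id ?? raw.id;
  if (!externalProductId || !raw.Name) return null; // skip incomplete products
  return {
    externalProductId,
    name: raw.Name,
    brandName: raw.brandName ?? null,
    category: raw.type ?? null,
    subcategory: raw.subcategory ?? null,
    status: raw.Status ?? null,
    isActive: raw.Status === 'Active',
    thcPercent: toPercent(raw.THCContent?.range?.[0]),
    cbdPercent: toPercent(raw.CBDContent?.range?.[0]),
    primaryImageUrl: raw.Image ?? null,
    priceRec: toCents(raw.recPrices?.[0]),
    priceRecSpecial: toCents(raw.recSpecialPrices?.[0]),
  };
}
```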
6. Canonical Data Model
Tables
store_products - Current product state per store
CREATE TABLE store_products (
id SERIAL PRIMARY KEY,
dispensary_id INTEGER NOT NULL REFERENCES dispensaries(id),
provider VARCHAR(50) NOT NULL DEFAULT 'dutchie',
provider_product_id VARCHAR(100), -- Dutchie's _id
name_raw VARCHAR(500) NOT NULL,
brand_name_raw VARCHAR(255),
category_raw VARCHAR(100),
subcategory_raw VARCHAR(100),
price_rec NUMERIC(10,2),
price_med NUMERIC(10,2),
price_rec_special NUMERIC(10,2),
price_med_special NUMERIC(10,2),
is_on_special BOOLEAN DEFAULT false,
discount_percent NUMERIC(5,2),
is_in_stock BOOLEAN DEFAULT true,
stock_quantity INTEGER,
stock_status VARCHAR(50) DEFAULT 'in_stock',
thc_percent NUMERIC(5,2), -- Percentage (0-100); column overflows above 999.99
cbd_percent NUMERIC(5,2), -- Percentage (0-100); column overflows above 999.99
image_url TEXT,
first_seen_at TIMESTAMPTZ DEFAULT NOW(),
last_seen_at TIMESTAMPTZ DEFAULT NOW(),
UNIQUE(dispensary_id, provider, provider_product_id)
);
store_product_snapshots - Historical price/stock records
CREATE TABLE store_product_snapshots (
id SERIAL PRIMARY KEY,
dispensary_id INTEGER NOT NULL,
store_product_id INTEGER REFERENCES store_products(id),
provider VARCHAR(50) NOT NULL,
provider_product_id VARCHAR(100),
crawl_run_id INTEGER,
captured_at TIMESTAMPTZ NOT NULL,
name_raw VARCHAR(500),
brand_name_raw VARCHAR(255),
category_raw VARCHAR(100),
price_rec NUMERIC(10,2),
price_med NUMERIC(10,2),
price_rec_special NUMERIC(10,2),
price_med_special NUMERIC(10,2),
is_on_special BOOLEAN,
is_in_stock BOOLEAN,
stock_quantity INTEGER,
stock_status VARCHAR(50),
thc_percent NUMERIC(5,2),
cbd_percent NUMERIC(5,2),
raw_data JSONB -- Full raw product for debugging
);
product_variants - Per-weight pricing options
CREATE TABLE product_variants (
id SERIAL PRIMARY KEY,
store_product_id INTEGER NOT NULL REFERENCES store_products(id),
dispensary_id INTEGER NOT NULL,
option VARCHAR(50) NOT NULL, -- "1/8oz", "1g", "100mg"
price_rec NUMERIC(10,2),
price_med NUMERIC(10,2),
price_rec_special NUMERIC(10,2),
price_med_special NUMERIC(10,2),
quantity INTEGER,
in_stock BOOLEAN,
weight_value NUMERIC(10,4), -- Parsed: 3.5
weight_unit VARCHAR(10), -- Parsed: "g"
UNIQUE(store_product_id, option)
);
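The `weight_value` and `weight_unit` columns hold a parsed form of the `option` label. The parsing rules are not documented here, so the sketch below is one plausible approach under stated assumptions: ounce fractions are converted to grams using the retail convention of 28 g per ounce (so "1/8oz" becomes 3.5 g, matching the schema comment), and gram or milligram labels keep their own unit. `parseWeight` is a hypothetical helper.

```typescript
// Assumption: retail convention of 28 g per ounce, not the exact 28.3495 g.
const GRAMS_PER_OZ = 28;

interface ParsedWeight {
  value: number;
  unit: string;
}

function parseWeight(option: string): ParsedWeight | null {
  const m = option
    .trim()
    .toLowerCase()
    .match(/^(\d+(?:\.\d+)?)(?:\/(\d+))?\s*(mg|g|oz)$/);
  if (!m) return null; // labels such as "each" are left unparsed
  const numerator = Number(m[1]);
  const denominator = m[2] ? Number(m[2]) : 1;
  if (denominator === 0) return null;
  const amount = numerator / denominator;
  const unit = m[3];
  if (unit === 'oz') return { value: amount * GRAMS_PER_OZ, unit: 'g' };
  return { value: amount, unit };
}
```

Returning null for unrecognized labels keeps the raw `option` text authoritative; the parsed columns are then simply left empty.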
7. Hydration (Writing to DB)
Key File: src/hydration/canonical-upsert.ts
Function: hydrateToCanonical()
import { hydrateToCanonical } from '../hydration';
const result = await hydrateToCanonical(
pool, // pg Pool
dispensaryId, // number
normResult, // NormalizationResult from normalizer
crawlRunId // number | null
);
// Result:
// {
// productsUpserted: 1009,
// productsNew: 50,
// snapshotsCreated: 1009,
// variantsUpserted: 1011,
// brandsUpserted: 102,
// }
Upsert Logic
Products: ON CONFLICT (dispensary_id, provider, provider_product_id) DO UPDATE
- Updates: name, prices, stock, THC/CBD, timestamps
- Preserves: first_seen_at, id
Snapshots: Always INSERT (append-only history)
- One snapshot per product per crawl
- Contains full state at capture time
Variants: ON CONFLICT (store_product_id, option) DO UPDATE
- Updates: prices, stock, quantity
- Tracks: last_price_change_at, last_stock_change_at
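To make the product upsert rule concrete, here is a sketch that builds a parameterized statement for a subset of the store_products columns. It is not the code in canonical-upsert.ts: `buildStoreProductUpsert` and its input type are hypothetical. The returned `{ text, values }` object has the shape that node-postgres accepts as a query config.

```typescript
interface StoreProductInput {
  dispensaryId: number;
  providerProductId: string;
  nameRaw: string;
  priceRec: number | null;
  isInStock: boolean;
}

function buildStoreProductUpsert(p: StoreProductInput): { text: string; values: unknown[] } {
  // first_seen_at and id are absent from the SET list, so an existing row
  // keeps them; only the mutable fields and last_seen_at are refreshed.
  const text = `
    INSERT INTO store_products
      (dispensary_id, provider, provider_product_id, name_raw, price_rec, is_in_stock)
    VALUES ($1, 'dutchie', $2, $3, $4, $5)
    ON CONFLICT (dispensary_id, provider, provider_product_id) DO UPDATE SET
      name_raw = EXCLUDED.name_raw,
      price_rec = EXCLUDED.price_rec,
      is_in_stock = EXCLUDED.is_in_stock,
      last_seen_at = NOW()
    RETURNING id`;
  const values = [p.dispensaryId, p.providerProductId, p.nameRaw, p.priceRec, p.isInStock];
  return { text, values };
}
```

The returned `id` can then be used as `store_product_id` when inserting the snapshot row and upserting variants.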
Data Transformations
// Prices: cents -> dollars
priceRec: productPricing.priceRec / 100
// THC/CBD: values above 100 are milligram readings, so store null instead
thcPercent: product.thcPercent <= 100 ? product.thcPercent : null
// Stock status mapping
stockStatus: availability.stockStatus || 'unknown'
8. Key Files Reference
HTTP Client
| File | Purpose |
|---|---|
| src/platforms/dutchie/client.ts | Curl-based HTTP client (LOCKED) |
| src/platforms/dutchie/queries.ts | GraphQL query wrappers |
| src/platforms/dutchie/index.ts | Public exports |
Normalization
| File | Purpose |
|---|---|
| src/hydration/normalizers/dutchie.ts | Dutchie-specific normalization |
| src/hydration/normalizers/base.ts | Base normalizer class |
| src/hydration/types.ts | Type definitions |
Database
| File | Purpose |
|---|---|
| src/hydration/canonical-upsert.ts | Upsert functions for canonical tables |
| src/hydration/index.ts | Public exports |
| src/db/pool.ts | Database connection pool |
Scripts
| File | Purpose |
|---|---|
| src/scripts/test-crawl-to-canonical.ts | Test script for single dispensary |
9. Common Issues & Solutions
Issue: GraphQL Returns 0 Products
Cause: Using Status: null instead of Status: 'Active'
Solution:
productsFilter: {
Status: 'Active', // NOT null
...
}
Issue: Numeric Field Overflow
Cause: THC/CBD values in milligrams (e.g., 1400mg) stored in percentage field
Solution: Replace values > 100 with null before writing:
thcPercent: value <= 100 ? value : null
Issue: Column "name" Does Not Exist
Cause: Code uses name but table has name_raw
Column Mapping:
| Code | Database |
|---|---|
| name | name_raw |
| brand_name | brand_name_raw |
| category | category_raw |
| subcategory | subcategory_raw |
Issue: 403 Forbidden
Cause: TLS fingerprinting or rate limiting
Solution: The curl-based client handles this with:
- Browser fingerprint rotation
- Proper headers (Origin, Referer, User-Agent)
- Retry with exponential backoff
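The retry behavior in the last bullet can be sketched as a generic wrapper. This is illustrative only and not the locked client's code: `withRetry` and `RetryOptions` are hypothetical names, the delay doubles on each failed attempt starting from `baseDelayMs`, and the `sleep` function is injectable so the schedule can be tested without waiting.

```typescript
interface RetryOptions {
  maxAttempts: number;
  baseDelayMs: number;
  sleep?: (ms: number) => Promise<void>;
}

const defaultSleep = (ms: number): Promise<void> =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

async function withRetry<T>(
  fn: (attempt: number) => Promise<T>,
  opts: RetryOptions,
): Promise<T> {
  const sleep = opts.sleep ?? defaultSleep;
  let lastError: unknown;
  for (let attempt = 1; attempt <= opts.maxAttempts; attempt++) {
    try {
      return await fn(attempt);
    } catch (err) {
      lastError = err;
      if (attempt < opts.maxAttempts) {
        // Exponential backoff: base, 2x base, 4x base, ...
        await sleep(opts.baseDelayMs * 2 ** (attempt - 1));
      }
    }
  }
  throw lastError;
}
```

With `maxAttempts: 4`, a request is tried at most four times, which is consistent with the "attempt 1/4" line in the client log output shown later in this document.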
Issue: Normalizer Returns 0 Products
Cause: Wrong payload structure passed to normalize()
Solution: Use RawPayload structure:
const rawPayload = {
raw_json: { products: [...] }, // Products in raw_json
dispensary_id: 112, // Required
// ... other fields
};
normalizer.normalize(rawPayload); // NOT (payload, id)
10. Running Crawls
Test Script (Single Dispensary)
cd backend
DATABASE_URL="postgresql://dutchie:dutchie_local_pass@localhost:54320/dutchie_menus" \
npx tsx src/scripts/test-crawl-to-canonical.ts <dispensaryId>
# Example:
DATABASE_URL="postgresql://dutchie:dutchie_local_pass@localhost:54320/dutchie_menus" \
npx tsx src/scripts/test-crawl-to-canonical.ts 112
Expected Output
============================================================
Test Crawl to Canonical - Dispensary 112
============================================================
[Step 1] Getting dispensary info...
Name: Deeply Rooted Boutique Cannabis Company
Platform ID: 6405ef617056e8014d79101b
Menu URL: https://azdeeplyrooted.com/home
cName: dispensary
[Step 2] Fetching products from Dutchie GraphQL...
[Fetch] Starting fetch for 6405ef617056e8014d79101b (cName: dispensary)
[Dutchie Client] curl POST FilteredProducts (attempt 1/4)
[Dutchie Client] Response status: 200
[Fetch] Page 1/11: 100 products (total so far: 100)
...
Total products fetched: 1009
[Step 3] Normalizing products...
Validation: PASS
Normalized products: 1009
Brands extracted: 102
[Step 4] Writing to canonical tables via hydrateToCanonical...
Products upserted: 1009
Variants upserted: 1011
[Step 5] Verifying data in canonical tables...
store_products count: 1060
product_variants count: 1011
store_product_snapshots count: 4315
============================================================
SUCCESS - Crawl and hydration complete!
============================================================
Verification Queries
-- Check products for a dispensary
SELECT id, name_raw, brand_name_raw, price_rec, is_in_stock
FROM store_products
WHERE dispensary_id = 112
ORDER BY last_seen_at DESC
LIMIT 10;
-- Check variants
SELECT pv.option, pv.price_rec, pv.in_stock, sp.name_raw
FROM product_variants pv
JOIN store_products sp ON sp.id = pv.store_product_id
WHERE pv.dispensary_id = 112
LIMIT 10;
-- Check snapshot history
SELECT COUNT(*) as total, MAX(captured_at) as latest
FROM store_product_snapshots
WHERE dispensary_id = 112;
Appendix: GraphQL Hashes
All Dutchie GraphQL queries use persisted queries with SHA256 hashes:
export const GRAPHQL_HASHES = {
FilteredProducts: 'ee29c060826dc41c527e470e9ae502c9b2c169720faa0a9f5d25e1b9a530a4a0',
GetAddressBasedDispensaryData: '13461f73abf7268770dfd05fe7e10c523084b2bb916a929c08efe3d87531977b',
ConsumerDispensaries: '0a5bfa6ca1d64ae47bcccb7c8077c87147cbc4e6982c17ceec97a2a4948b311b',
DispensaryInfo: '13461f73abf7268770dfd05fe7e10c523084b2bb916a929c08efe3d87531977b',
GetAllCitiesByState: 'ae547a0466ace5a48f91e55bf6699eacd87e3a42841560f0c0eabed5a0a920e6',
};
These hashes are fixed and tied to Dutchie's API version. If Dutchie changes their API, these may need updating.
Last updated: December 2024