cannaiq/docs/DUTCHIE_CRAWL_WORKFLOW.md
Kelly 2f483b3084 feat: SEO template library, discovery pipeline, and orchestrator enhancements
2025-12-09 00:05:34 -07:00


# Dutchie Crawl Workflow
Complete end-to-end documentation for the Dutchie GraphQL crawl pipeline, from store discovery to product management.
---
## Table of Contents
1. [Architecture Overview](#1-architecture-overview)
2. [Store Discovery](#2-store-discovery)
3. [Platform ID Resolution](#3-platform-id-resolution)
4. [Product Crawling](#4-product-crawling)
5. [Normalization Pipeline](#5-normalization-pipeline)
6. [Canonical Data Model](#6-canonical-data-model)
7. [Hydration (Writing to DB)](#7-hydration-writing-to-db)
8. [Key Files Reference](#8-key-files-reference)
9. [Common Issues & Solutions](#9-common-issues--solutions)
10. [Running Crawls](#10-running-crawls)
---
## 1. Architecture Overview
```
┌─────────────────────────────────────────────────────────────────────────────┐
│                           DUTCHIE CRAWL PIPELINE                            │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│  ┌──────────────┐    ┌──────────────┐    ┌──────────────┐    ┌───────────┐  │
│  │  Discovery   │ -> │  Resolution  │ -> │    Crawl     │ -> │  Hydrate  │  │
│  │ (find URLs)  │    │  (get IDs)   │    │ (fetch data) │    │  (to DB)  │  │
│  └──────────────┘    └──────────────┘    └──────────────┘    └───────────┘  │
│         │                   │                   │                  │        │
│         v                   v                   v                  v        │
│  ┌──────────────┐    ┌──────────────┐    ┌──────────────┐    ┌───────────┐  │
│  │ dispensaries │    │ dispensaries │    │   Raw JSON   │    │ store_    │  │
│  │ .menu_url    │    │ .platform_   │    │   Products   │    │ products  │  │
│  │              │    │ dispensary_id│    │              │    │ snapshots │  │
│  └──────────────┘    └──────────────┘    └──────────────┘    │ variants  │  │
│                                                              └───────────┘  │
└─────────────────────────────────────────────────────────────────────────────┘
```
### Key Principles
1. **GraphQL Only**: All Dutchie data comes from `https://dutchie.com/api-3/graphql`
2. **Curl-Based HTTP**: Requests are sent by spawning curl through `child_process`, which avoids being blocked by TLS fingerprinting
3. **No Puppeteer**: The old DOM-based scraper is deprecated. DO NOT USE `scraper-v2/engine.ts` for Dutchie
4. **Historical Data**: Never delete products or snapshots. Products are upserted in place; snapshots are append-only
---
## 2. Store Discovery
### How Stores Get Into the System
Stores are added to the `dispensaries` table with a `menu_url` pointing to their Dutchie menu.
**Menu URL Formats:**
```
https://dutchie.com/dispensary/<slug>
https://dutchie.com/embedded-menu/<slug>
https://<custom-domain>.com/menu (redirects to Dutchie)
```
### Required Fields for Crawling
| Field | Required | Description |
|-------|----------|-------------|
| `menu_url` | Yes | URL to the Dutchie menu |
| `menu_type` | Yes | Must be `'dutchie'` |
| `platform_dispensary_id` | Yes | MongoDB ObjectId from Dutchie |
**A store CANNOT be crawled until `platform_dispensary_id` is resolved.**
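The rule above can be expressed as a pre-flight check before a store is queued. This is a minimal sketch, not code from the repository: the `DispensaryRow` shape and the `crawlBlockers` helper are hypothetical names, and the 24-hex-character test relies only on the standard string form of a MongoDB ObjectId.

```typescript
// Hypothetical row shape: only the columns listed in the table above.
interface DispensaryRow {
  menu_url: string | null;
  menu_type: string | null;
  platform_dispensary_id: string | null;
}

// A MongoDB ObjectId is rendered as 24 hexadecimal characters.
const OBJECT_ID_RE = /^[0-9a-f]{24}$/i;

// Returns the reasons a store cannot be crawled (empty array = crawlable).
function crawlBlockers(row: DispensaryRow): string[] {
  const blockers: string[] = [];
  if (!row.menu_url) blockers.push('menu_url is missing');
  if (row.menu_type !== 'dutchie') blockers.push("menu_type is not 'dutchie'");
  if (!row.platform_dispensary_id) {
    blockers.push('platform_dispensary_id is not resolved');
  } else if (!OBJECT_ID_RE.test(row.platform_dispensary_id)) {
    blockers.push('platform_dispensary_id is not a 24-character hex ObjectId');
  }
  return blockers;
}
```

Returning a list of reasons rather than a boolean makes it easier to log why a store was skipped.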
---
## 3. Platform ID Resolution
### What is `platform_dispensary_id`?
Dutchie uses MongoDB ObjectIds internally (e.g., `6405ef617056e8014d79101b`). This ID is required for all GraphQL product queries.
### Resolution Process
```typescript
// File: src/platforms/dutchie/queries.ts
import { resolveDispensaryId } from '../platforms/dutchie';
// Extract slug from menu_url (custom-domain URLs will not match)
const slug = menuUrl.match(/\/(?:embedded-menu|dispensary)\/([^/?]+)/)?.[1];
if (!slug) throw new Error(`Cannot extract Dutchie slug from ${menuUrl}`);
// Resolve to platform ID via GraphQL
const platformId = await resolveDispensaryId(slug);
// Returns: "6405ef617056e8014d79101b" or null
```
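Slug extraction only works for the two `dutchie.com` URL formats; a custom-domain menu URL carries no slug in its path. The sketch below makes that distinction explicit. `extractDutchieSlug` is a hypothetical helper, and accepting an optional `www.` prefix is an assumption.

```typescript
// Matches https://dutchie.com/dispensary/<slug> and
// https://dutchie.com/embedded-menu/<slug>; the slug is capture group 1.
const DUTCHIE_MENU_RE =
  /^https?:\/\/(?:www\.)?dutchie\.com\/(?:embedded-menu|dispensary)\/([^/?#]+)/i;

// Returns the slug, or null when the URL is not a dutchie.com menu URL
// (for example a custom domain that redirects to Dutchie).
function extractDutchieSlug(menuUrl: string): string | null {
  const match = DUTCHIE_MENU_RE.exec(menuUrl.trim());
  return match?.[1] ?? null;
}
```

A `null` result signals that the store needs a different resolution path, such as following the redirect first, which is outside the scope of this sketch.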
### GraphQL Query Used
```graphql
query GetAddressBasedDispensaryData($dispensaryFilter: dispensaryFilter!) {
  dispensary(filter: $dispensaryFilter) {
    id    # <-- This is the platform_dispensary_id
    name
    cName
    ...
  }
}
```
**Variables:**
```json
{
  "dispensaryFilter": {
    "cNameOrID": "AZ-Deeply-Rooted"
  }
}
```
### Persisted Query Hash
```typescript
GRAPHQL_HASHES.GetAddressBasedDispensaryData = '13461f73abf7268770dfd05fe7e10c523084b2bb916a929c08efe3d87531977b'
```
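A persisted query sends the hash in place of the query text. The sketch below builds such a request payload, assuming Dutchie follows the Apollo automatic-persisted-queries convention (`extensions.persistedQuery` with `version` and `sha256Hash`). Whether the real client sends this as a JSON body or as URL parameters is not stated in this document, and `buildPersistedQueryBody` is a hypothetical name.

```typescript
interface PersistedQueryBody {
  operationName: string;
  variables: Record<string, unknown>;
  extensions: { persistedQuery: { version: 1; sha256Hash: string } };
}

// Builds the payload for a persisted query (Apollo APQ layout, assumed).
function buildPersistedQueryBody(
  operationName: string,
  variables: Record<string, unknown>,
  sha256Hash: string,
): PersistedQueryBody {
  // A SHA-256 digest is 64 lowercase hex characters.
  if (!/^[0-9a-f]{64}$/.test(sha256Hash)) {
    throw new Error(`Invalid SHA-256 hash for ${operationName}`);
  }
  return {
    operationName,
    variables,
    extensions: { persistedQuery: { version: 1, sha256Hash } },
  };
}
```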
---
## 4. Product Crawling
### GraphQL Query: FilteredProducts
This is the main query for fetching products from a dispensary.
**Endpoint:** `https://dutchie.com/api-3/graphql`
**Method:** POST (via curl)
**Persisted Query Hash:**
```typescript
GRAPHQL_HASHES.FilteredProducts = 'ee29c060826dc41c527e470e9ae502c9b2c169720faa0a9f5d25e1b9a530a4a0'
```
### Query Variables
```typescript
const variables = {
  includeEnterpriseSpecials: false,
  productsFilter: {
    dispensaryId: '6405ef617056e8014d79101b', // platform_dispensary_id
    pricingType: 'rec',                        // 'rec' or 'med'
    Status: 'Active',                          // CRITICAL: Use 'Active', NOT null
    types: [],                                 // empty = all categories
    useCache: true,
    isDefaultSort: true,
    sortBy: 'popularSortIdx',
    sortDirection: 1,
    bypassOnlineThresholds: true,
    isKioskMenu: false,
    removeProductsBelowOptionThresholds: false,
  },
  page: 0,      // 0-indexed pagination
  perPage: 100, // max 100 per page
};
```
### CRITICAL: Status Parameter
| Value | Result |
|-------|--------|
| `'Active'` | Returns in-stock products WITH pricing |
| `null` | Returns 0 products (broken) |
| `'Inactive'` | Returns out-of-stock products only |
**Always use `Status: 'Active'` for product crawls.**
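One way to make the `Status` rule hard to violate is to build the variables through a single function that never takes `Status` as an input. The following is a sketch under that idea; `buildFilteredProductsVariables` is a hypothetical helper, and its field values simply mirror the variables listed above.

```typescript
type PricingType = 'rec' | 'med';

// Builds FilteredProducts variables with Status fixed to 'Active'.
function buildFilteredProductsVariables(
  dispensaryId: string,
  pricingType: PricingType,
  page: number,
  perPage: number = 100,
) {
  if (!Number.isInteger(page) || page < 0) {
    throw new RangeError('page must be a non-negative integer (0-indexed)');
  }
  return {
    includeEnterpriseSpecials: false,
    productsFilter: {
      dispensaryId,
      pricingType,
      Status: 'Active' as const, // never null: null returns 0 products
      types: [] as string[],     // empty = all categories
      useCache: true,
      isDefaultSort: true,
      sortBy: 'popularSortIdx',
      sortDirection: 1,
      bypassOnlineThresholds: true,
      isKioskMenu: false,
      removeProductsBelowOptionThresholds: false,
    },
    page,
    perPage: Math.min(perPage, 100), // the documented maximum is 100
  };
}
```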
### Response Structure
```json
{
  "data": {
    "filteredProducts": {
      "products": [
        {
          "_id": "product-mongo-id",
          "Name": "Product Name",
          "brandName": "Brand Name",
          "type": "Flower",
          "subcategory": "Indica",
          "Status": "Active",
          "recPrices": [45.00, 90.00],
          "recSpecialPrices": [],
          "THCContent": { "unit": "PERCENTAGE", "range": [28.24] },
          "CBDContent": { "unit": "PERCENTAGE", "range": [0] },
          "Image": "https://images.dutchie.com/...",
          "POSMetaData": {
            "children": [
              {
                "option": "1/8oz",
                "recPrice": 45.00,
                "quantityAvailable": 10
              },
              {
                "option": "1/4oz",
                "recPrice": 90.00,
                "quantityAvailable": 5
              }
            ]
          }
        }
      ],
      "queryInfo": {
        "totalCount": 1009,
        "totalPages": 11
      }
    }
  }
}
```
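The per-weight options live under `POSMetaData.children`. As a sketch of how they might be read, the types below cover only the fields shown in the sample response. `RawVariant`, `RawProduct` and `summarizeVariants` are hypothetical names, and treating `quantityAvailable > 0` as "in stock" is an assumption.

```typescript
// Partial types covering only the fields in the sample response above.
interface RawVariant {
  option: string;
  recPrice?: number;
  quantityAvailable?: number;
}
interface RawProduct {
  _id: string;
  Name: string;
  POSMetaData?: { children?: RawVariant[] };
}
interface VariantSummary {
  option: string;
  priceRec: number | null; // dollars, as returned by the API
  quantity: number | null;
  inStock: boolean;
}

// Flattens POSMetaData.children, tolerating products that lack it.
function summarizeVariants(product: RawProduct): VariantSummary[] {
  const children = product.POSMetaData?.children ?? [];
  return children.map((child) => ({
    option: child.option,
    priceRec: child.recPrice ?? null,
    quantity: child.quantityAvailable ?? null,
    inStock: (child.quantityAvailable ?? 0) > 0, // assumption
  }));
}
```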
### Pagination
```typescript
const DUTCHIE_CONFIG = {
  perPage: 100,     // Products per page
  maxPages: 200,    // Safety limit
  pageDelayMs: 500, // Delay between pages
};
// Fetch all pages
const allProducts = [];
let page = 0;
let totalPages = 1; // Unknown until the first response arrives
while (page < totalPages && page < DUTCHIE_CONFIG.maxPages) {
  const result = await executeGraphQL('FilteredProducts', { ...variables, page });
  const data = result.data.filteredProducts;
  totalPages = Math.ceil(data.queryInfo.totalCount / DUTCHIE_CONFIG.perPage);
  allProducts.push(...data.products);
  page++;
  if (page < totalPages) await sleep(DUTCHIE_CONFIG.pageDelayMs); // Rate limiting
}
```
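The page arithmetic can be isolated into a pure function, which makes the `maxPages` safety limit easy to unit test. This is a sketch; `planPages` is a hypothetical helper rather than part of the crawler.

```typescript
// Returns the 0-indexed page numbers to fetch for a given total,
// capped at maxPages as a safety limit.
function planPages(totalCount: number, perPage: number, maxPages: number): number[] {
  if (perPage <= 0) throw new RangeError('perPage must be positive');
  const totalPages = Math.min(Math.ceil(totalCount / perPage), maxPages);
  const pages: number[] = [];
  for (let page = 0; page < totalPages; page++) {
    pages.push(page);
  }
  return pages;
}
```

With the sample response above (`totalCount: 1009`, 100 per page), this yields pages 0 through 10, matching `totalPages: 11`.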
---
## 5. Normalization Pipeline
### Purpose
Convert raw Dutchie JSON into a standardized format before database insertion.
### Key File: `src/hydration/normalizers/dutchie.ts`
```typescript
import { DutchieNormalizer } from '../hydration';
const normalizer = new DutchieNormalizer();
// Build RawPayload structure
const rawPayload = {
  id: 'unique-id',
  dispensary_id: 112,
  crawl_run_id: null,
  platform: 'dutchie',
  payload_version: 1,
  raw_json: { products: rawProducts }, // <-- Products go here
  product_count: rawProducts.length,
  pricing_type: 'rec',
  crawl_mode: 'active',
  fetched_at: new Date(),
  processed: false,
  normalized_at: null,
  hydration_error: null,
  hydration_attempts: 0,
  created_at: new Date(),
};
// Normalize
const result = normalizer.normalize(rawPayload);
// Result contains:
// - result.products: NormalizedProduct[]
// - result.pricing: Map<externalId, NormalizedPricing>
// - result.availability: Map<externalId, NormalizedAvailability>
// - result.brands: NormalizedBrand[]
```
### Field Mappings
| Dutchie Field | Normalized Field |
|---------------|------------------|
| `_id` / `id` | `externalProductId` |
| `Name` | `name` |
| `brandName` | `brandName` |
| `type` | `category` |
| `subcategory` | `subcategory` |
| `Status` | `status`, `isActive` |
| `THCContent.range[0]` | `thcPercent` |
| `CBDContent.range[0]` | `cbdPercent` |
| `Image` | `primaryImageUrl` |
| `recPrices[0]` | `priceRec` (in cents) |
| `recSpecialPrices[0]` | `priceRecSpecial` (in cents) |
### Data Validation
The normalizer handles edge cases:
```typescript
// THC/CBD values > 100 are milligrams, not percentages - set them to null
if (thcPercent > 100) thcPercent = null;
// Products without IDs are skipped
if (!externalId) return null;
// Products without names are skipped
if (!name) return null;
```
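Putting the field mappings and validation rules together, a single-product normalization could look like the sketch below. It is an illustration, not the actual `DutchieNormalizer`: the type names and `normalizeSketch` are hypothetical, and deriving `isActive` from `Status === 'Active'` is an assumption.

```typescript
interface RawDutchieProduct {
  _id?: string;
  id?: string;
  Name?: string;
  brandName?: string;
  type?: string;
  Status?: string;
  THCContent?: { unit?: string; range?: number[] };
  recPrices?: number[];
}
interface NormalizedProductSketch {
  externalProductId: string;
  name: string;
  brandName: string | null;
  category: string | null;
  isActive: boolean;
  thcPercent: number | null;
  priceRec: number | null; // cents
}

// Dollars -> integer cents. Math.round guards against float error
// (19.99 * 100 evaluates to 1998.9999999999998).
const toCents = (dollars: number | undefined): number | null =>
  dollars === undefined ? null : Math.round(dollars * 100);

// Values above 100 are milligrams, not percentages.
const toPercent = (value: number | undefined): number | null =>
  value === undefined || value > 100 ? null : value;

function normalizeSketch(raw: RawDutchieProduct): NormalizedProductSketch | null {
  const externalProductId = raw._id ?? raw.id;
  if (!externalProductId || !raw.Name) return null; // skipped
  return {
    externalProductId,
    name: raw.Name,
    brandName: raw.brandName ?? null,
    category: raw.type ?? null,
    isActive: raw.Status === 'Active', // assumption
    thcPercent: toPercent(raw.THCContent?.range?.[0]),
    priceRec: toCents(raw.recPrices?.[0]),
  };
}
```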
---
## 6. Canonical Data Model
### Tables
#### `store_products` - Current product state per store
```sql
CREATE TABLE store_products (
  id SERIAL PRIMARY KEY,
  dispensary_id INTEGER NOT NULL REFERENCES dispensaries(id),
  provider VARCHAR(50) NOT NULL DEFAULT 'dutchie',
  provider_product_id VARCHAR(100), -- Dutchie's _id
  name_raw VARCHAR(500) NOT NULL,
  brand_name_raw VARCHAR(255),
  category_raw VARCHAR(100),
  subcategory_raw VARCHAR(100),
  price_rec NUMERIC(10,2),
  price_med NUMERIC(10,2),
  price_rec_special NUMERIC(10,2),
  price_med_special NUMERIC(10,2),
  is_on_special BOOLEAN DEFAULT false,
  discount_percent NUMERIC(5,2),
  is_in_stock BOOLEAN DEFAULT true,
  stock_quantity INTEGER,
  stock_status VARCHAR(50) DEFAULT 'in_stock',
  thc_percent NUMERIC(5,2), -- Percentage; column overflows above 999.99
  cbd_percent NUMERIC(5,2), -- Percentage; column overflows above 999.99
  image_url TEXT,
  first_seen_at TIMESTAMPTZ DEFAULT NOW(),
  last_seen_at TIMESTAMPTZ DEFAULT NOW(),
  UNIQUE(dispensary_id, provider, provider_product_id)
);
```
#### `store_product_snapshots` - Historical price/stock records
```sql
CREATE TABLE store_product_snapshots (
  id SERIAL PRIMARY KEY,
  dispensary_id INTEGER NOT NULL,
  store_product_id INTEGER REFERENCES store_products(id),
  provider VARCHAR(50) NOT NULL,
  provider_product_id VARCHAR(100),
  crawl_run_id INTEGER,
  captured_at TIMESTAMPTZ NOT NULL,
  name_raw VARCHAR(500),
  brand_name_raw VARCHAR(255),
  category_raw VARCHAR(100),
  price_rec NUMERIC(10,2),
  price_med NUMERIC(10,2),
  price_rec_special NUMERIC(10,2),
  price_med_special NUMERIC(10,2),
  is_on_special BOOLEAN,
  is_in_stock BOOLEAN,
  stock_quantity INTEGER,
  stock_status VARCHAR(50),
  thc_percent NUMERIC(5,2),
  cbd_percent NUMERIC(5,2),
  raw_data JSONB -- Full raw product for debugging
);
```
#### `product_variants` - Per-weight pricing options
```sql
CREATE TABLE product_variants (
  id SERIAL PRIMARY KEY,
  store_product_id INTEGER NOT NULL REFERENCES store_products(id),
  dispensary_id INTEGER NOT NULL,
  option VARCHAR(50) NOT NULL, -- "1/8oz", "1g", "100mg"
  price_rec NUMERIC(10,2),
  price_med NUMERIC(10,2),
  price_rec_special NUMERIC(10,2),
  price_med_special NUMERIC(10,2),
  quantity INTEGER,
  in_stock BOOLEAN,
  weight_value NUMERIC(10,4), -- Parsed: 3.5
  weight_unit VARCHAR(10), -- Parsed: "g"
  UNIQUE(store_product_id, option)
);
```
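The `weight_value` and `weight_unit` columns are parsed from the `option` string. The actual parser is not shown in this document; the sketch below is one possible approach. `parseOptionWeight` is a hypothetical name, and converting ounces at 28 g per ounce is an assumption, chosen because it turns "1/8oz" into the 3.5 shown in the schema comment.

```typescript
interface ParsedWeight {
  value: number;
  unit: 'g' | 'mg';
}

// Retail convention (assumption): 1 oz = 28 g, so 1/8 oz = 3.5 g.
const GRAMS_PER_OUNCE = 28;

// Parses options such as "1/8oz", "1g" or "100mg"; returns null otherwise.
function parseOptionWeight(option: string): ParsedWeight | null {
  const text = option.trim().toLowerCase();

  // Fractional ounces: "1/8oz", "1/4 oz"
  const fraction = /^(\d+)\/(\d+)\s*oz$/.exec(text);
  if (fraction) {
    const numerator = Number(fraction[1]);
    const denominator = Number(fraction[2]);
    if (denominator === 0) return null;
    return { value: (numerator / denominator) * GRAMS_PER_OUNCE, unit: 'g' };
  }

  // Plain amounts: "1g", "3.5 g", "100mg", "1oz"
  const simple = /^(\d+(?:\.\d+)?)\s*(mg|g|oz)$/.exec(text);
  if (!simple) return null;
  const amount = Number(simple[1]);
  const unit = simple[2];
  if (unit === 'oz') return { value: amount * GRAMS_PER_OUNCE, unit: 'g' };
  return { value: amount, unit: unit === 'mg' ? 'mg' : 'g' };
}
```

Options that do not describe a weight (for example "each") return `null`, leaving both columns empty.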
---
## 7. Hydration (Writing to DB)
### Key File: `src/hydration/canonical-upsert.ts`
### Function: `hydrateToCanonical()`
```typescript
import { hydrateToCanonical } from '../hydration';
const result = await hydrateToCanonical(
  pool,         // pg Pool
  dispensaryId, // number
  normResult,   // NormalizationResult from normalizer
  crawlRunId    // number | null
);
// Result:
// {
//   productsUpserted: 1009,
//   productsNew: 50,
//   snapshotsCreated: 1009,
//   variantsUpserted: 1011,
//   brandsUpserted: 102,
// }
```
### Upsert Logic
**Products:** `ON CONFLICT (dispensary_id, provider, provider_product_id) DO UPDATE`
- Updates: name, prices, stock, THC/CBD, timestamps
- Preserves: `first_seen_at`, `id`
**Snapshots:** Always INSERT (append-only history)
- One snapshot per product per crawl
- Contains full state at capture time
**Variants:** `ON CONFLICT (store_product_id, option) DO UPDATE`
- Updates: prices, stock, quantity
- Tracks: `last_price_change_at`, `last_stock_change_at` (these columns are not listed in the schema excerpt in section 6)
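The product upsert described above can be sketched as a parameterized statement. This is an illustration limited to a few columns, not the contents of `canonical-upsert.ts`. `buildProductUpsert` and `ProductRow` are hypothetical names; the returned `{ text, values }` object is the query-config shape accepted by `pool.query` in `pg`.

```typescript
interface ProductRow {
  dispensaryId: number;
  providerProductId: string;
  nameRaw: string;
  priceRec: number | null; // dollars
  isInStock: boolean;
}

// Builds an upsert that refreshes mutable fields on conflict.
// first_seen_at is never assigned, so its original value is preserved.
function buildProductUpsert(row: ProductRow): { text: string; values: unknown[] } {
  const text = `
    INSERT INTO store_products
      (dispensary_id, provider, provider_product_id, name_raw, price_rec, is_in_stock)
    VALUES ($1, 'dutchie', $2, $3, $4, $5)
    ON CONFLICT (dispensary_id, provider, provider_product_id) DO UPDATE SET
      name_raw = EXCLUDED.name_raw,
      price_rec = EXCLUDED.price_rec,
      is_in_stock = EXCLUDED.is_in_stock,
      last_seen_at = NOW()
    RETURNING id`;
  const values = [
    row.dispensaryId,
    row.providerProductId,
    row.nameRaw,
    row.priceRec,
    row.isInStock,
  ];
  return { text, values };
}
```

The returned `id` can then be used as `store_product_id` when inserting the snapshot and upserting variants.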
### Data Transformations
```typescript
// Prices: cents -> dollars
priceRec: productPricing.priceRec / 100
// THC/CBD: Null out values above 100 (milligrams, not percentages)
thcPercent: product.thcPercent <= 100 ? product.thcPercent : null
// Stock status mapping
stockStatus: availability.stockStatus || 'unknown'
```
---
## 8. Key Files Reference
### HTTP Client
| File | Purpose |
|------|---------|
| `src/platforms/dutchie/client.ts` | Curl-based HTTP client (LOCKED) |
| `src/platforms/dutchie/queries.ts` | GraphQL query wrappers |
| `src/platforms/dutchie/index.ts` | Public exports |
### Normalization
| File | Purpose |
|------|---------|
| `src/hydration/normalizers/dutchie.ts` | Dutchie-specific normalization |
| `src/hydration/normalizers/base.ts` | Base normalizer class |
| `src/hydration/types.ts` | Type definitions |
### Database
| File | Purpose |
|------|---------|
| `src/hydration/canonical-upsert.ts` | Upsert functions for canonical tables |
| `src/hydration/index.ts` | Public exports |
| `src/db/pool.ts` | Database connection pool |
### Scripts
| File | Purpose |
|------|---------|
| `src/scripts/test-crawl-to-canonical.ts` | Test script for single dispensary |
---
## 9. Common Issues & Solutions
### Issue: GraphQL Returns 0 Products
**Cause:** Using `Status: null` instead of `Status: 'Active'`
**Solution:**
```typescript
productsFilter: {
  Status: 'Active', // NOT null
  ...
}
```
### Issue: Numeric Field Overflow
**Cause:** THC/CBD values reported in milligrams (e.g., 1400mg) are written to a `NUMERIC(5,2)` percentage column, which cannot hold values of 1000 or more
**Solution:** Set values > 100 to null:
```typescript
thcPercent: value <= 100 ? value : null
```
### Issue: Column "name" Does Not Exist
**Cause:** Code uses `name` but table has `name_raw`
**Column Mapping:**
| Code | Database |
|------|----------|
| `name` | `name_raw` |
| `brand_name` | `brand_name_raw` |
| `category` | `category_raw` |
| `subcategory` | `subcategory_raw` |
### Issue: 403 Forbidden
**Cause:** TLS fingerprinting or rate limiting
**Solution:** The curl-based client handles this with:
- Browser fingerprint rotation
- Proper headers (Origin, Referer, User-Agent)
- Retry with exponential backoff
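The retry policy can be pictured as two small pure functions. The numbers here are assumptions for illustration: the locked client's real delays are not documented in this file, and the limit of 4 attempts is taken only from the `attempt 1/4` log line in section 10. `backoffDelayMs` and `shouldRetry` are hypothetical names.

```typescript
// Exponential backoff: attempt 1 -> base, 2 -> 2x base, 3 -> 4x base,
// capped at maxMs. Production code would usually add random jitter.
function backoffDelayMs(attempt: number, baseMs: number = 1000, maxMs: number = 30000): number {
  const delay = baseMs * Math.pow(2, attempt - 1);
  return Math.min(delay, maxMs);
}

// Retry on 403 (fingerprint block), 429 (rate limit) and 5xx responses,
// as long as attempts remain.
function shouldRetry(status: number, attempt: number, maxAttempts: number = 4): boolean {
  const retryable = status === 403 || status === 429 || status >= 500;
  return retryable && attempt < maxAttempts;
}
```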
### Issue: Normalizer Returns 0 Products
**Cause:** Wrong payload structure passed to `normalize()`
**Solution:** Use `RawPayload` structure:
```typescript
const rawPayload = {
  raw_json: { products: [...] }, // Products in raw_json
  dispensary_id: 112,            // Required
  // ... other fields
};
normalizer.normalize(rawPayload); // NOT (payload, id)
```
---
## 10. Running Crawls
### Test Script (Single Dispensary)
```bash
cd backend
DATABASE_URL="postgresql://dutchie:dutchie_local_pass@localhost:54320/dutchie_menus" \
npx tsx src/scripts/test-crawl-to-canonical.ts <dispensaryId>
# Example:
DATABASE_URL="postgresql://dutchie:dutchie_local_pass@localhost:54320/dutchie_menus" \
npx tsx src/scripts/test-crawl-to-canonical.ts 112
```
### Expected Output
```
============================================================
Test Crawl to Canonical - Dispensary 112
============================================================
[Step 1] Getting dispensary info...
Name: Deeply Rooted Boutique Cannabis Company
Platform ID: 6405ef617056e8014d79101b
Menu URL: https://azdeeplyrooted.com/home
cName: dispensary
[Step 2] Fetching products from Dutchie GraphQL...
[Fetch] Starting fetch for 6405ef617056e8014d79101b (cName: dispensary)
[Dutchie Client] curl POST FilteredProducts (attempt 1/4)
[Dutchie Client] Response status: 200
[Fetch] Page 1/11: 100 products (total so far: 100)
...
Total products fetched: 1009
[Step 3] Normalizing products...
Validation: PASS
Normalized products: 1009
Brands extracted: 102
[Step 4] Writing to canonical tables via hydrateToCanonical...
Products upserted: 1009
Variants upserted: 1011
[Step 5] Verifying data in canonical tables...
store_products count: 1060
product_variants count: 1011
store_product_snapshots count: 4315
============================================================
SUCCESS - Crawl and hydration complete!
============================================================
```
### Verification Queries
```sql
-- Check products for a dispensary
SELECT id, name_raw, brand_name_raw, price_rec, is_in_stock
FROM store_products
WHERE dispensary_id = 112
ORDER BY last_seen_at DESC
LIMIT 10;
-- Check variants
SELECT pv.option, pv.price_rec, pv.in_stock, sp.name_raw
FROM product_variants pv
JOIN store_products sp ON sp.id = pv.store_product_id
WHERE pv.dispensary_id = 112
LIMIT 10;
-- Check snapshot history
SELECT COUNT(*) as total, MAX(captured_at) as latest
FROM store_product_snapshots
WHERE dispensary_id = 112;
```
---
## Appendix: GraphQL Hashes
All Dutchie GraphQL queries use persisted queries with SHA256 hashes:
```typescript
export const GRAPHQL_HASHES = {
  FilteredProducts: 'ee29c060826dc41c527e470e9ae502c9b2c169720faa0a9f5d25e1b9a530a4a0',
  GetAddressBasedDispensaryData: '13461f73abf7268770dfd05fe7e10c523084b2bb916a929c08efe3d87531977b',
  ConsumerDispensaries: '0a5bfa6ca1d64ae47bcccb7c8077c87147cbc4e6982c17ceec97a2a4948b311b',
  DispensaryInfo: '13461f73abf7268770dfd05fe7e10c523084b2bb916a929c08efe3d87531977b',
  GetAllCitiesByState: 'ae547a0466ace5a48f91e55bf6699eacd87e3a42841560f0c0eabed5a0a920e6',
};
```
These hashes are fixed and tied to Dutchie's API version. If Dutchie changes their API, these may need updating.
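When the hashes do need updating, a quick audit can catch copy-paste mistakes. The sketch below checks that every entry is a 64-character hex digest and reports operations that share a hash. `auditHashes` is a hypothetical helper. In the table above, `DispensaryInfo` and `GetAddressBasedDispensaryData` carry the same hash, so they would be reported as shared; whether that is intentional is not stated here.

```typescript
interface HashAudit {
  malformed: string[]; // operation names whose hash is not 64 hex chars
  shared: string[][];  // groups of operation names with an identical hash
}

function auditHashes(hashes: Record<string, string>): HashAudit {
  const malformed: string[] = [];
  const namesByHash: Record<string, string[]> = {};
  for (const name of Object.keys(hashes)) {
    const hash = hashes[name] ?? '';
    if (!/^[0-9a-f]{64}$/.test(hash)) malformed.push(name);
    const names = namesByHash[hash] ?? [];
    names.push(name);
    namesByHash[hash] = names;
  }
  const shared: string[][] = [];
  for (const hash of Object.keys(namesByHash)) {
    const names = namesByHash[hash] ?? [];
    if (names.length > 1) shared.push(names);
  }
  return { malformed, shared };
}
```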
---
*Last updated: December 2025*