
Crawl Pipeline Documentation

Overview

The crawl pipeline fetches product data from Dutchie dispensary menus and stores it in the canonical database. This document covers the complete flow from task scheduling to data storage.


Pipeline Stages

┌─────────────────────┐
│  store_discovery    │  Find new dispensaries
└─────────┬───────────┘
          │
          ▼
┌─────────────────────┐
│entry_point_discovery│  Resolve slug → platform_dispensary_id
└─────────┬───────────┘
          │
          ▼
┌─────────────────────┐
│  product_discovery  │  Initial product crawl
└─────────┬───────────┘
          │
          ▼
┌─────────────────────┐
│   product_resync    │  Recurring crawl (every 4 hours)
└─────────────────────┘

Stage Details

1. Store Discovery

Purpose: Find new dispensaries to crawl

Handler: src/tasks/handlers/store-discovery.ts

Flow:

  1. Query Dutchie ConsumerDispensaries GraphQL for cities/states
  2. Extract dispensary info (name, address, menu_url)
  3. Insert into dutchie_discovery_locations
  4. Queue entry_point_discovery for each new location
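A hedged sketch of this flow (not the actual handler in src/tasks/handlers/store-discovery.ts): fetchConsumerDispensaries and upsertDiscoveryLocation are hypothetical helpers, while createEntryPointTask is the convenience method shown later in this document.

```typescript
// Hypothetical helpers standing in for the real GraphQL call and DB upsert:
declare function fetchConsumerDispensaries(state: string): Promise<
  Array<{ name: string; address: string; menuUrl: string }>
>;
declare function upsertDiscoveryLocation(loc: {
  name: string; address: string; menu_url: string;
}): Promise<{ id: number; isNew: boolean }>;
declare const taskService: {
  createEntryPointTask(dispensaryId: number, platform: string): Promise<void>;
};

async function handleStoreDiscovery(state: string): Promise<void> {
  const locations = await fetchConsumerDispensaries(state); // step 1
  for (const loc of locations) {
    // steps 2-3: extract fields, insert into dutchie_discovery_locations
    const { id, isNew } = await upsertDiscoveryLocation({
      name: loc.name,
      address: loc.address,
      menu_url: loc.menuUrl,
    });
    // step 4: queue entry_point_discovery only for new locations
    if (isNew) await taskService.createEntryPointTask(id, 'dutchie');
  }
}
```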

2. Entry Point Discovery

Purpose: Resolve menu URL slug to platform_dispensary_id (MongoDB ObjectId)

Handler: src/tasks/handlers/entry-point-discovery.ts

Flow:

  1. Load dispensary from database
  2. Extract slug from menu_url:
    • /embedded-menu/<slug> or /dispensary/<slug>
  3. Start stealth session (fingerprint + proxy)
  4. Query resolveDispensaryIdWithDetails(slug) via GraphQL
  5. Update dispensary with platform_dispensary_id
  6. Queue product_discovery task

Example:

menu_url: https://dutchie.com/embedded-menu/deeply-rooted
slug: deeply-rooted
platform_dispensary_id: 6405ef617056e8014d79101b
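A minimal sketch of the slug extraction described in step 2 (the helper name is illustrative):

```typescript
// Matches both /embedded-menu/<slug> and /dispensary/<slug> URL shapes
function extractSlug(menuUrl: string): string | null {
  const match = menuUrl.match(/\/(?:embedded-menu|dispensary)\/([^/?#]+)/);
  return match ? match[1] : null;
}

extractSlug('https://dutchie.com/embedded-menu/deeply-rooted'); // 'deeply-rooted'
```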

3. Product Discovery

Purpose: Initial crawl of a new dispensary

Handler: src/tasks/handlers/product-discovery.ts

Same as product_resync but for first-time crawls.


4. Product Resync

Purpose: Recurring crawl to capture price/stock changes

Handler: src/tasks/handlers/product-resync.ts

Flow:

Step 1: Load Dispensary Info

SELECT id, name, platform_dispensary_id, menu_url, state
FROM dispensaries
WHERE id = $1 AND crawl_enabled = true

Step 2: Start Stealth Session

  • Generate random browser fingerprint
  • Set locale/timezone matching state
  • Optional proxy rotation

Step 3: Fetch Products via GraphQL

Endpoint: https://dutchie.com/api-3/graphql

Variables:

{
  includeEnterpriseSpecials: false,
  productsFilter: {
    dispensaryId: "<platform_dispensary_id>",
    pricingType: "rec",
    Status: "All",
    types: [],
    useCache: false,
    isDefaultSort: true,
    sortBy: "popularSortIdx",
    sortDirection: 1,
    bypassOnlineThresholds: true,
    isKioskMenu: false,
    removeProductsBelowOptionThresholds: false
  },
  page: 0,
  perPage: 100
}

Key Notes:

  • Status: "All" returns all products (Active returns same count)
  • Status: null returns 0 products (broken)
  • pricingType: "rec" returns BOTH rec and med prices
  • Paginate until products.length < perPage or allProducts.length >= totalCount
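A sketch of that pagination loop, assuming a fetchProductsPage() wrapper around the GraphQL query above:

```typescript
// Hypothetical wrapper around the filteredProducts query shown above
declare function fetchProductsPage(vars: {
  dispensaryId: string; page: number; perPage: number;
}): Promise<{ products: unknown[]; totalCount: number }>;

async function fetchAllProducts(dispensaryId: string): Promise<unknown[]> {
  const perPage = 100;
  const all: unknown[] = [];
  let page = 0;
  for (;;) {
    const { products, totalCount } = await fetchProductsPage({ dispensaryId, page, perPage });
    all.push(...products);
    // Stop when a page comes back short or we already have every product
    if (products.length < perPage || all.length >= totalCount) break;
    page += 1;
  }
  return all;
}
```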

Step 4: Normalize Data

Transform raw Dutchie payload to canonical format via DutchieNormalizer.

Step 5: Upsert Products

Insert/update store_products table with normalized data.

Step 6: Create Snapshots

Insert point-in-time record to store_product_snapshots.

Step 7: Track Missing Products (OOS Detection)

-- Reset consecutive_misses for products IN the feed
UPDATE store_products
SET consecutive_misses = 0, last_seen_at = NOW()
WHERE dispensary_id = $1
  AND provider = 'dutchie'
  AND provider_product_id = ANY($2)

-- Increment for products NOT in feed
UPDATE store_products
SET consecutive_misses = consecutive_misses + 1
WHERE dispensary_id = $1
  AND provider = 'dutchie'
  AND provider_product_id != ALL($2)
  AND consecutive_misses < 3

-- Mark OOS at 3 consecutive misses
UPDATE store_products
SET stock_status = 'oos', is_in_stock = false
WHERE dispensary_id = $1
  AND consecutive_misses >= 3
  AND stock_status != 'oos'

Step 8: Download Images

For new products, download and store images locally.

Step 9: Update Dispensary

UPDATE dispensaries SET last_crawl_at = NOW() WHERE id = $1

GraphQL Payload Structure

Product Fields (from filteredProducts.products[])

| Field | Type | Description |
|-------|------|-------------|
| _id / id | string | MongoDB ObjectId (24 hex chars) |
| Name | string | Product display name |
| brandName | string | Brand name |
| brand.name | string | Brand name (nested) |
| brand.description | string | Brand description |
| type | string | Category (Flower, Edible, Concentrate, etc.) |
| subcategory | string | Subcategory |
| strainType | string | Hybrid, Indica, Sativa, N/A |
| Status | string | Always "Active" in feed |
| Image | string | Primary image URL |
| images[] | array | All product images |

Pricing Fields

| Field | Type | Description |
|-------|------|-------------|
| Prices[] | number[] | Rec prices per option |
| recPrices[] | number[] | Rec prices |
| medicalPrices[] | number[] | Medical prices |
| recSpecialPrices[] | number[] | Rec sale prices |
| medicalSpecialPrices[] | number[] | Medical sale prices |
| Options[] | string[] | Size options ("1/8oz", "1g", etc.) |
| rawOptions[] | string[] | Raw weight options ("3.5g") |

Inventory Fields (POSMetaData.children[])

| Field | Type | Description |
|-------|------|-------------|
| quantity | number | Total inventory count |
| quantityAvailable | number | Available for online orders |
| kioskQuantityAvailable | number | Available for kiosk orders |
| option | string | Which size option this is for |

Potency Fields

| Field | Type | Description |
|-------|------|-------------|
| THCContent.range[] | number[] | THC percentage |
| CBDContent.range[] | number[] | CBD percentage |
| cannabinoidsV2[] | array | Detailed cannabinoid breakdown |

Specials (specialData.bogoSpecials[])

| Field | Type | Description |
|-------|------|-------------|
| specialName | string | Deal name |
| specialType | string | "bogo", "sale", etc. |
| itemsForAPrice.value | string | Bundle price |
| bogoRewards[].totalQuantity.quantity | number | Required quantity |

OOS Detection Logic

Products disappear from the Dutchie feed when they go out of stock. We track this via consecutive_misses:

| Scenario | Action |
|----------|--------|
| Product in feed | consecutive_misses = 0 |
| Product missing 1st time | consecutive_misses = 1 |
| Product missing 2nd time | consecutive_misses = 2 |
| Product missing 3rd time | consecutive_misses = 3, mark stock_status = 'oos' |
| Product returns to feed | consecutive_misses = 0, update stock_status |

Why 3 misses?

  • Protects against false positives from crawl failures
  • Single bad crawl doesn't trigger mass OOS alerts
  • Balances detection speed vs accuracy

Database Tables

store_products

Current state of each product:

  • provider_product_id - Dutchie's MongoDB ObjectId
  • name_raw, brand_name_raw - Raw values from feed
  • price_rec, price_med - Current prices
  • is_in_stock, stock_status - Availability
  • consecutive_misses - OOS detection counter
  • last_seen_at - Last time product was in feed
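For illustration, the row shape implied by these columns (the real table has more fields):

```typescript
// Illustrative subset of a store_products row, limited to the columns
// called out above
interface StoreProductRow {
  provider_product_id: string;  // Dutchie's MongoDB ObjectId
  name_raw: string;             // raw name from feed
  brand_name_raw: string | null;
  price_rec: number | null;
  price_med: number | null;
  is_in_stock: boolean;
  stock_status: string;         // e.g. 'oos'
  consecutive_misses: number;   // OOS detection counter
  last_seen_at: Date;           // last time product was in feed
}
```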

store_product_snapshots

Point-in-time records for historical analysis:

  • One row per product per crawl
  • Captures price, stock, potency at that moment
  • Used for price history, analytics

dispensaries

Store metadata:

  • platform_dispensary_id - MongoDB ObjectId for GraphQL
  • menu_url - Source URL
  • last_crawl_at - Last successful crawl
  • crawl_enabled - Whether to crawl

Worker Roles

Workers pull tasks from the worker_tasks queue based on their assigned role.

| Role | Name | Description | Handler |
|------|------|-------------|---------|
| product_resync | Product Resync | Re-crawl dispensary products for price/stock changes | handleProductResync |
| product_discovery | Product Discovery | Initial product discovery for new dispensaries | handleProductDiscovery |
| store_discovery | Store Discovery | Discover new dispensary locations | handleStoreDiscovery |
| entry_point_discovery | Entry Point Discovery | Resolve platform IDs from menu URLs | handleEntryPointDiscovery |
| analytics_refresh | Analytics Refresh | Refresh materialized views and analytics | handleAnalyticsRefresh |

API Endpoint: GET /api/worker-registry/roles


Scheduling

Crawls are scheduled via worker_tasks table:

| Role | Frequency | Description |
|------|-----------|-------------|
| product_resync | Every 4 hours | Regular product refresh |
| product_discovery | On-demand | First crawl for new stores |
| entry_point_discovery | On-demand | New store setup |
| store_discovery | Daily | Find new stores |
| analytics_refresh | Daily | Refresh analytics materialized views |
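A hypothetical sketch of how the 4-hour batch could be generated; the worker_tasks column names are assumptions inferred from fields used elsewhere in this document:

```typescript
import { Pool } from 'pg';

const pool = new Pool();

// Enqueue a batch-priority (0) resync task for every enabled dispensary
// whose last crawl is older than the 4-hour cadence
async function enqueueDueResyncs(): Promise<void> {
  await pool.query(`
    INSERT INTO worker_tasks (role, dispensary_id, platform, priority)
    SELECT 'product_resync', d.id, 'dutchie', 0
    FROM dispensaries d
    WHERE d.crawl_enabled = true
      AND (d.last_crawl_at IS NULL
           OR d.last_crawl_at < NOW() - INTERVAL '4 hours')
  `);
}
```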

Priority & On-Demand Tasks

Tasks are claimed by workers in order of priority DESC, created_at ASC.

Priority Levels

| Priority | Use Case | Example |
|----------|----------|---------|
| 0 | Scheduled/batch tasks | Daily product_resync generation |
| 10 | On-demand/chained tasks | entry_point → product_discovery |
| Higher | Urgent/manual triggers | Admin-triggered immediate crawl |

Task Chaining

When a task completes, the system automatically creates follow-up tasks:

store_discovery (completed)
    └─► entry_point_discovery (priority: 10) for each new store

entry_point_discovery (completed, success)
    └─► product_discovery (priority: 10) for that store

product_discovery (completed)
    └─► [no chain] Store enters regular resync schedule

On-Demand Task Creation

Use the task service to create high-priority tasks:

// Create immediate product resync for a store
await taskService.createTask({
  role: 'product_resync',
  dispensary_id: 123,
  platform: 'dutchie',
  priority: 20, // Higher than batch tasks
});

// Convenience methods with default high priority (10)
await taskService.createEntryPointTask(dispensaryId, 'dutchie');
await taskService.createProductDiscoveryTask(dispensaryId, 'dutchie');
await taskService.createStoreDiscoveryTask('dutchie', 'AZ');

Claim Function

The claim_task() SQL function atomically claims tasks:

  • Respects priority ordering (higher = first)
  • Uses FOR UPDATE SKIP LOCKED for concurrency
  • Prevents multiple active tasks per store
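For illustration, the claim logic inlined as a single query (status values and column names are assumptions; the real implementation lives in the claim_task() SQL function):

```typescript
import { Pool } from 'pg';

const pool = new Pool();

// Atomically claim the highest-priority pending task for a role.
// SKIP LOCKED lets concurrent workers pass over rows another worker
// is already claiming instead of blocking on them.
async function claimNextTask(role: string) {
  const { rows } = await pool.query(
    `UPDATE worker_tasks
     SET status = 'running', claimed_at = NOW()
     WHERE id = (
       SELECT id FROM worker_tasks
       WHERE role = $1 AND status = 'pending'
       ORDER BY priority DESC, created_at ASC
       FOR UPDATE SKIP LOCKED
       LIMIT 1
     )
     RETURNING *`,
    [role]
  );
  return rows[0] ?? null;
}
```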

Image Storage

Images are downloaded from Dutchie's AWS S3 and stored locally with on-demand resizing.

Storage Path

/storage/images/products/<state>/<store>/<brand>/<product_id>/image-<hash>.webp
/storage/images/brands/<brand>/logo-<hash>.webp

Example:

/storage/images/products/az/az-deeply-rooted/bud-bros/6913e3cd444eac3935e928b9/image-ae38b1f9.webp

Image Proxy API

Served via /img/* with on-demand resizing using sharp:

GET /img/products/az/az-deeply-rooted/bud-bros/6913e3cd444eac3935e928b9/image-ae38b1f9.webp?w=200

| Param | Description |
|-------|-------------|
| w | Width in pixels (max 4000) |
| h | Height in pixels (max 4000) |
| q | Quality 1-100 (default 80) |
| fit | cover, contain, fill, inside, outside |
| blur | Blur sigma (0.3-1000) |
| gray | Grayscale (1 = enabled) |
| format | webp, jpeg, png, avif (default webp) |
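A simplified sketch of the transform pipeline using sharp (the real route in src/routes/image-proxy.ts also handles path validation, caching, and format selection):

```typescript
import sharp from 'sharp';

// Apply the query params above to a source image and emit webp
async function transformImage(
  input: Buffer,
  opts: { w?: number; h?: number; q?: number; fit?: keyof sharp.FitEnum; blur?: number; gray?: boolean }
): Promise<Buffer> {
  let img = sharp(input);
  if (opts.w || opts.h) {
    img = img.resize({ width: opts.w, height: opts.h, fit: opts.fit ?? 'cover' });
  }
  if (opts.blur) img = img.blur(opts.blur); // sigma 0.3-1000
  if (opts.gray) img = img.grayscale();
  return img.webp({ quality: opts.q ?? 80 }).toBuffer();
}
```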

Key Files

| File | Purpose |
|------|---------|
| src/utils/image-storage.ts | Download & save images to local filesystem |
| src/routes/image-proxy.ts | On-demand resize/transform at /img/* |

Download Rules

| Scenario | Image Action |
|----------|--------------|
| New product (first crawl) | Download if primaryImageUrl exists |
| Existing product (refresh) | Download only if local_image_path is NULL (backfill) |
| Product already has local image | Skip download entirely |

Logic:

  • Images are downloaded once and never re-downloaded on subsequent crawls
  • skipIfExists: true - filesystem check prevents re-download even if queued
  • First crawl: all products get images
  • Refresh crawl: only new products or products missing local images
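A minimal sketch of the skipIfExists guard (downloadToFile is a hypothetical helper, not the real image-storage API):

```typescript
import { existsSync } from 'node:fs';

declare function downloadToFile(url: string, dest: string): Promise<void>;

// Returns true if a download happened, false if the file was already on disk
async function maybeDownloadImage(
  url: string,
  localPath: string,
  skipIfExists = true
): Promise<boolean> {
  if (skipIfExists && existsSync(localPath)) return false; // never re-download
  await downloadToFile(url, localPath);
  return true;
}
```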

Storage Rules

  • NO MinIO - local filesystem only (STORAGE_DRIVER=local)
  • Store full resolution, resize on-demand via /img proxy
  • Convert to webp for consistency using sharp
  • Preserve original Dutchie URL as fallback in image_url column
  • Local path stored in local_image_path column

Stealth & Anti-Detection

PROXIES ARE REQUIRED - Workers will fail to start if no active proxies are available in the database. All HTTP requests to Dutchie go through a proxy.

Workers automatically initialize anti-detection systems on startup.

Components

| Component | Purpose | Source |
|-----------|---------|--------|
| CrawlRotator | Coordinates proxy + UA rotation | src/services/crawl-rotator.ts |
| ProxyRotator | Round-robin proxy selection, health tracking | src/services/crawl-rotator.ts |
| UserAgentRotator | Cycles through realistic browser fingerprints | src/services/crawl-rotator.ts |
| Dutchie Client | Curl-based HTTP with auto-retry on 403 | src/platforms/dutchie/client.ts |

Initialization Flow

Worker Start
    │
    ├─► initializeStealth()
    │       │
    │       ├─► CrawlRotator.initialize()
    │       │       └─► Load proxies from `proxies` table
    │       │
    │       └─► setCrawlRotator(rotator)
    │               └─► Wire to Dutchie client
    │
    └─► Process tasks...

Stealth Session (per task)

Each crawl task starts a stealth session:

// In product-resync.ts, entry-point-discovery.ts
const session = startSession(dispensary.state || 'AZ', 'America/Phoenix');

This creates a new identity with:

  • Random fingerprint: Chrome/Firefox/Safari/Edge on Win/Mac/Linux
  • Accept-Language: Matches timezone (e.g., America/Phoenix → en-US,en;q=0.9)
  • sec-ch-ua headers: Proper Client Hints for the browser profile

On 403 Block

When Dutchie returns 403, the client automatically:

  1. Records failure on current proxy (increments failure_count)
  2. If proxy has 5+ failures, deactivates it
  3. Rotates to next healthy proxy
  4. Rotates fingerprint
  5. Retries the request
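A sketch of that retry loop; the rotator and client method names are assumptions based on the components table above:

```typescript
// Hypothetical stand-ins for the curl-based client and the CrawlRotator
declare function proxiedFetch(url: string): Promise<{ status: number; body: string }>;
declare const rotator: {
  recordFailure(): void;     // increments failure_count; deactivates proxy at 5+
  rotateProxy(): void;       // move to next healthy proxy
  rotateFingerprint(): void; // new UA + matching sec-ch-ua headers
};

async function fetchWithRotation(url: string, maxRetries = 3) {
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    const res = await proxiedFetch(url);
    if (res.status !== 403) return res;
    // Blocked: burn this identity and try again with a fresh one
    rotator.recordFailure();
    rotator.rotateProxy();
    rotator.rotateFingerprint();
  }
  throw new Error(`Still blocked after ${maxRetries} retries: ${url}`);
}
```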

Proxy Table Schema

CREATE TABLE proxies (
  id SERIAL PRIMARY KEY,
  host VARCHAR(255) NOT NULL,
  port INTEGER NOT NULL,
  username VARCHAR(100),
  password VARCHAR(100),
  protocol VARCHAR(10) DEFAULT 'http',  -- http, https, socks5
  is_active BOOLEAN DEFAULT true,
  last_used_at TIMESTAMPTZ,
  failure_count INTEGER DEFAULT 0,
  success_count INTEGER DEFAULT 0,
  avg_response_time_ms INTEGER,
  last_failure_at TIMESTAMPTZ,
  last_error TEXT
);

Configuration

Proxies are mandatory. There is no environment variable to disable them. Workers will refuse to start without active proxies in the database.

User-Agent Generation

See workflow-12102025.md for full specification.

Summary:

  • Uses intoli/user-agents library (daily-updated market share data)
  • Device distribution: Mobile 62%, Desktop 36%, Tablet 2%
  • Browser whitelist: Chrome, Safari, Edge, Firefox only
  • UA sticks until IP rotates (403 or manual rotation)
  • Failure = alert admin + stop crawl (no fallback)

Each fingerprint includes proper sec-ch-ua, sec-ch-ua-platform, and sec-ch-ua-mobile headers.
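A sketch of how UA generation with the user-agents package could look, mirroring the distribution and whitelist above (the weighting mechanics and function names are illustrative, not the real implementation):

```typescript
import UserAgent from 'user-agents';

// Device distribution from the summary above
const DEVICE_WEIGHTS: Array<[string, number]> = [
  ['mobile', 0.62],
  ['desktop', 0.36],
  ['tablet', 0.02],
];
// "Edg" matches Edge UAs; Chrome UAs also contain "Safari", which is fine
const BROWSER_RE = /Chrome|Safari|Edg|Firefox/;

function pickDeviceCategory(): string {
  let r = Math.random();
  for (const [category, weight] of DEVICE_WEIGHTS) {
    if ((r -= weight) <= 0) return category;
  }
  return 'desktop';
}

// Throws if no UA matches the filters; per the rules above that means
// alert the admin and stop the crawl, not fall back to a canned UA
function generateUserAgent(): string {
  const category = pickDeviceCategory();
  return new UserAgent([BROWSER_RE, { deviceCategory: category }]).toString();
}
```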


Error Handling

  • GraphQL errors: Logged, task marked failed, retried later
  • Normalization errors: Logged as warnings, continue with valid products
  • Image download errors: Non-fatal, logged, continue
  • Database errors: Task fails, will be retried
  • 403 blocks: Auto-rotate proxy + fingerprint, retry (up to 3 retries)

Files

| File | Purpose |
|------|---------|
| src/tasks/handlers/product-resync.ts | Main crawl handler |
| src/tasks/handlers/entry-point-discovery.ts | Slug → ID resolution |
| src/platforms/dutchie/index.ts | GraphQL client, session management |
| src/hydration/normalizers/dutchie.ts | Payload normalization |
| src/hydration/canonical-upsert.ts | Database upsert logic |
| src/utils/image-storage.ts | Image download and local storage |
| src/routes/image-proxy.ts | On-demand image resizing |
| migrations/075_consecutive_misses.sql | OOS tracking column |