Claude Guidelines for this Project
PERMANENT RULES (NEVER VIOLATE)
1. NO DELETION OF DATA — EVER
CannaiQ is a historical analytics system. Data retention is permanent by design.
NEVER delete:
- Product records
- Crawled snapshots
- Images
- Directories
- Logs
- Orchestrator traces
- Profiles
- Selector configs
- Crawl outcomes
- Store data
- Brand data
NEVER automate cleanup:
- No cron or scheduled job may `rm`, `unlink`, `delete`, `purge`, `prune`, `clean`, or `reset` any storage directory or DB row
- No migration may DELETE data — only add/update/alter columns
- If cleanup is required, ONLY the user may issue a manual command
Code enforcement:
- `local-storage.ts` must only: write files, create directories, read files
- No `deleteImage`, `deleteProductImages`, or similar functions
2. NO PROCESS KILLING — EVER
Claude must NEVER run process-killing commands:
- No `pkill`
- No `kill -9`
- No `xargs kill`
- No `lsof | kill`
- No `killall`
- No `fuser -k`
Claude must NOT manage host processes. Only user scripts manage the local environment.
Correct behavior:
- If backend is running on port 3010 → say: "Backend already running"
- If backend is NOT running → say: "Please run `./setup-local.sh`"
Process management is done ONLY by user scripts:
./setup-local.sh # Start local environment
./stop-local.sh # Stop local environment
3. NO MANUAL SERVER STARTUP — EVER
Claude must NEVER start the backend manually:
- No `npx tsx src/index.ts`
- No `node dist/index.js`
- No `npm run dev` with custom env vars
- No `DATABASE_URL=... npx tsx ...`
Claude must NEVER set DATABASE_URL in shell commands:
- DB connection uses `CANNAIQ_DB_*` env vars or `CANNAIQ_DB_URL` from the user's environment
- Never hardcode connection strings in bash commands
- Never override env vars to bypass the user's DB setup
If backend is not running:
- Say: "Please run `./setup-local.sh`"
- Do NOT attempt to start it yourself
If a dependency is missing:
- Add it to `package.json`
- Say: "Please run `cd backend && npm install`"
- Do NOT try to solve it by starting a custom dev server
The ONLY way to start local services:
cd backend
./setup-local.sh
4. DEPLOYMENT AUTHORIZATION REQUIRED
NEVER deploy to production unless the user explicitly says:
"CLAUDE — DEPLOYMENT IS NOW AUTHORIZED."
Until then:
- All work is LOCAL ONLY
- No `kubectl apply`, `docker push`, or remote operations
- No port-forwarding to production
- No connecting to Kubernetes clusters
5. DATABASE CONNECTION ARCHITECTURE
Migration code is CLI-only. Runtime code must NOT import src/db/migrate.ts.
| Module | Purpose | Import From |
|---|---|---|
| `src/db/migrate.ts` | CLI migrations only | NEVER import at runtime |
| `src/db/pool.ts` | Runtime database pool | `import { pool } from '../db/pool'` |
| `src/dutchie-az/db/connection.ts` | Canonical connection helper | Alternative for runtime |
Runtime gets DB connections ONLY via:
import { pool } from '../db/pool';
// or
import { getPool } from '../dutchie-az/db/connection';
To run migrations:
cd backend
npx tsx src/db/migrate.ts
Why this matters:
- `migrate.ts` validates env vars strictly and throws at module load time
- Importing it at runtime causes startup crashes if env vars aren't perfect
- `pool.ts` uses lazy initialization - only validates when the first query is made
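The lazy-initialization pattern described above can be sketched as follows. This is a simplified illustration (the real `pool.ts` wraps a `pg.Pool`; the stub config type and function name here are hypothetical): validation happens on first use, not at module load.

```typescript
// Sketch of lazy init: nothing is validated until getConfig() is first called,
// so merely importing this module can never crash the process.
type DbConfig = { host: string; port: number; database: string };

let cachedConfig: DbConfig | null = null;

function getConfig(): DbConfig {
  if (cachedConfig) return cachedConfig;
  const host = process.env.CANNAIQ_DB_HOST;
  const port = process.env.CANNAIQ_DB_PORT;
  const database = process.env.CANNAIQ_DB_NAME;
  // Strict validation is deferred to first use (unlike migrate.ts,
  // which throws at module load time).
  if (!host || !port || !database) {
    throw new Error('CANNAIQ_DB_* env vars are not set');
  }
  cachedConfig = { host, port: Number(port), database };
  return cachedConfig;
}
```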
6. ALL API ROUTES REQUIRE AUTHENTICATION — NO EXCEPTIONS
Every API router MUST apply authMiddleware at the router level.
import { authMiddleware } from '../auth/middleware';
const router = Router();
router.use(authMiddleware); // REQUIRED - first line after router creation
Authentication flow (see src/auth/middleware.ts):
- Check Bearer token (JWT or API token) → grant access if valid
- Check trusted origins (cannaiq.co, findadispo.com, localhost, etc.) → grant access
- Check trusted IPs (127.0.0.1, ::1, internal pod IPs) → grant access
- Return 401 Unauthorized if none of the above
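The fallback chain above can be sketched as a pure function (hypothetical shapes and abbreviated lists; the real logic and full `TRUSTED_ORIGINS`/`TRUSTED_IPS` lists live in `src/auth/middleware.ts`):

```typescript
// Sketch of the auth fallback chain: token → trusted origin → trusted IP → 401.
const TRUSTED_ORIGINS = ['https://cannaiq.co', 'https://findadispo.com', 'https://findagram.co'];
const TRUSTED_IPS = ['127.0.0.1', '::1'];

type AuthResult = { ok: true; role: string } | { ok: false; status: 401 };

function authorize(req: { bearerValid?: boolean; origin?: string; ip?: string }): AuthResult {
  if (req.bearerValid) return { ok: true, role: 'token' };       // 1. valid Bearer token
  if (req.origin && TRUSTED_ORIGINS.includes(req.origin)) {
    return { ok: true, role: 'internal' };                       // 2. trusted origin
  }
  if (req.ip && TRUSTED_IPS.includes(req.ip)) {
    return { ok: true, role: 'internal' };                       // 3. trusted IP
  }
  return { ok: false, status: 401 };                             // 4. reject
}
```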
NEVER create API routes without auth middleware:
- No "public" endpoints that bypass authentication
- No "read-only" exceptions
- No "analytics-only" exceptions
- If an endpoint exists under `/api/*`, it MUST be protected
When creating new route files:
- Import `authMiddleware` from `../auth/middleware`
- Add `router.use(authMiddleware)` immediately after creating the router
- Document security requirements in file header comments
Trusted origins (defined in middleware):
- `https://cannaiq.co`
- `https://findadispo.com`
- `https://findagram.co`
- `*.cannabrands.app` domains
- `localhost:*` for development
7. LOCAL DEVELOPMENT BY DEFAULT
Quick Start:
./setup-local.sh
Services (all started by setup-local.sh):
| Service | URL | Purpose |
|---|---|---|
| PostgreSQL | localhost:54320 | cannaiq-postgres container |
| Backend API | http://localhost:3010 | Express API server |
| CannaiQ Admin | http://localhost:8080/admin | B2B admin dashboard |
| FindADispo | http://localhost:3001 | Consumer dispensary finder |
| Findagram | http://localhost:3002 | Consumer delivery marketplace |
In local mode:
- Use `docker-compose.local.yml` (NO MinIO)
- Use local filesystem storage at `./storage`
- Connect to `cannaiq-postgres` at `localhost:54320`
- Backend runs at `localhost:3010`
- All three frontends run on separate ports (8080, 3001, 3002)
- NO remote connections, NO Kubernetes, NO MinIO
Environment:
- All DB config is in `backend/.env`
- STORAGE_DRIVER=local
- STORAGE_BASE_PATH=./storage
Local Admin Bootstrap:
cd backend
npx tsx src/scripts/bootstrap-local-admin.ts
Creates/resets a deterministic local admin user:
| Field | Value |
|---|---|
| Email | admin@local.test |
| Password | admin123 |
| Role | superadmin |
This is a LOCAL-DEV helper only. Never use these credentials in production.
Manual startup (if not using setup-local.sh):
# Terminal 1: Start PostgreSQL
docker-compose -f docker-compose.local.yml up -d
# Terminal 2: Start Backend
cd backend && npm run dev
# Terminal 3: Start Frontend
cd cannaiq && npm run dev:admin
Stop services:
./stop-local.sh
DATABASE MODEL (CRITICAL)
Database Architecture
CannaiQ has TWO databases with distinct purposes:
| Database | Purpose | Access |
|---|---|---|
| `dutchie_menus` | Canonical CannaiQ database - all schema, migrations, and application data | READ/WRITE |
| `dutchie_legacy` | Legacy read-only archive - historical data from old system | READ-ONLY |
Store vs Dispensary Terminology
"Store" and "Dispensary" are SYNONYMS in CannaiQ.
| Term | Usage | DB Table |
|---|---|---|
| Store | API routes (`/api/stores`) | `dispensaries` |
| Dispensary | DB table, internal code | `dispensaries` |

- `/api/stores` and `/api/dispensaries` both query the `dispensaries` table
- There is NO `stores` table in use - it's a legacy empty table
- Use these terms interchangeably in code and documentation
Canonical vs Legacy Tables
CANONICAL TABLES (USE THESE):
| Table | Purpose | Row Count |
|---|---|---|
| `dispensaries` | Store/dispensary records | ~188+ rows |
| `store_products` | Product catalog | ~37,000+ rows |
| `store_product_snapshots` | Price/stock history | millions |

LEGACY TABLES (EMPTY - DO NOT USE):
| Table | Status | Action |
|---|---|---|
| `stores` | EMPTY (0 rows) | Use `dispensaries` instead |
| `products` | EMPTY (0 rows) | Use `store_products` instead |
| `dutchie_products` | LEGACY (0 rows) | Use `store_products` instead |
| `dutchie_product_snapshots` | LEGACY (0 rows) | Use `store_product_snapshots` instead |
| `categories` | EMPTY (0 rows) | Categories stored in product records |
Code must NEVER:
- Query the `stores` table (use `dispensaries`)
- Query the `products` table (use `store_products`)
- Query the `dutchie_products` table (use `store_products`)
- Query the `categories` table (categories are in product records)
CRITICAL RULES:
- Migrations ONLY run on `dutchie_menus` - NEVER on `dutchie_legacy`
- Application code connects ONLY to `dutchie_menus`
- ETL scripts READ from `dutchie_legacy`, WRITE to `dutchie_menus`
- `dutchie_legacy` is frozen - NO writes, NO schema changes, NO migrations
Environment Variables
CannaiQ Database (dutchie_menus) - PRIMARY:
# All application/migration DB access uses these env vars:
CANNAIQ_DB_HOST=localhost # Database host
CANNAIQ_DB_PORT=54320 # Database port
CANNAIQ_DB_NAME=dutchie_menus # MUST be dutchie_menus
CANNAIQ_DB_USER=dutchie # Database user
CANNAIQ_DB_PASS=<password> # Database password
# OR use a full connection string:
CANNAIQ_DB_URL=postgresql://user:pass@host:port/dutchie_menus
Legacy Database (dutchie_legacy) - ETL ONLY:
# Only used by ETL scripts for reading legacy data:
LEGACY_DB_HOST=localhost
LEGACY_DB_PORT=54320
LEGACY_DB_NAME=dutchie_legacy # READ-ONLY - never migrated
LEGACY_DB_USER=dutchie
LEGACY_DB_PASS=<password>
# OR use a full connection string:
LEGACY_DB_URL=postgresql://user:pass@host:port/dutchie_legacy
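A small helper can make the "parts or full URL" convention concrete. This is a hedged sketch (the helper name and the assumption that a full `CANNAIQ_DB_URL` takes precedence over the individual parts are illustrative, not confirmed project behavior):

```typescript
// Sketch: prefer a full CANNAIQ_DB_URL if present, otherwise assemble the
// connection string from the CANNAIQ_DB_* parts.
function resolveDbUrl(env: Record<string, string | undefined>): string {
  if (env.CANNAIQ_DB_URL) return env.CANNAIQ_DB_URL;
  const host = env.CANNAIQ_DB_HOST;
  const port = env.CANNAIQ_DB_PORT;
  const name = env.CANNAIQ_DB_NAME;   // must be dutchie_menus
  const user = env.CANNAIQ_DB_USER;
  const pass = env.CANNAIQ_DB_PASS;
  if (!host || !port || !name || !user || !pass) {
    throw new Error('Missing CANNAIQ_DB_* configuration');
  }
  return `postgresql://${user}:${pass}@${host}:${port}/${name}`;
}
```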
Key Rules:
- `CANNAIQ_DB_NAME` MUST be `dutchie_menus` for application/migrations
- `LEGACY_DB_NAME` is `dutchie_legacy` - READ-ONLY for ETL only
- ALL application code MUST use `CANNAIQ_DB_*` environment variables
- No hardcoded database names anywhere in the codebase
- `backend/.env` controls all database access for local development
State Modeling:
- States (AZ, MI, CA, NV, etc.) are modeled via the `states` table + `state_id` on dispensaries
- NO separate databases per state
- Use `state_code` or `state_id` columns for filtering
Migration and ETL Procedure
Step 1: Run schema migration (on dutchie_menus ONLY):
cd backend
psql "postgresql://dutchie:dutchie_local_pass@localhost:54320/dutchie_menus" \
-f migrations/041_cannaiq_canonical_schema.sql
Step 2: Run ETL to copy legacy data:
cd backend
npx tsx src/scripts/etl/042_legacy_import.ts
# Reads from dutchie_legacy, writes to dutchie_menus
Database Access Rules
Claude MUST NOT:
- Connect to any database besides the canonical CannaiQ database
- Use raw connection strings in shell commands
- Run `psql` commands directly
- Construct database URLs manually
- Create or rename databases automatically
- Run `npm run migrate` without explicit user authorization
- Patch schema at runtime (no ALTER TABLE from scripts)
All data access MUST go through:
- LOCAL CannaiQ backend HTTP API endpoints
- Internal CannaiQ application code (using canonical connection pool)
- Ask user to run SQL manually if absolutely needed
Local service management:
- User starts services via `./setup-local.sh` (ONLY the user runs this)
- If port 3010 responds, assume backend is running
- If port 3010 does NOT respond, tell user: "Backend is not running; please run `./setup-local.sh`"
- Claude may only access the app via HTTP: `http://localhost:3010` (API), `http://localhost:8080/admin` (UI)
- Never restart, kill, or manage local processes — that is the user's responsibility
Migrations
Rules:
- Migrations may be WRITTEN but only the USER runs them after review
- Never execute migrations automatically
- Only additive migrations (no DROP/DELETE)
- Write schema-tolerant code that handles missing optional columns
If schema changes are needed:
- Generate a proper migration file in `backend/migrations/*.sql`
- Show the migration to the user
- Wait for explicit authorization before running
- Never run migrations automatically - only the user runs them after review
Schema tolerance:
- If a column is missing at runtime, prefer making the code tolerant (treat field as optional) instead of auto-creating the column
- Queries should gracefully handle missing columns by omitting them or using NULL defaults
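Schema tolerance can be sketched as a row mapper that treats a possibly-absent column as optional instead of ALTERing the table at runtime. The mapper name and field names here are illustrative; `thumbnail_url` is used because it was only added in a later migration:

```typescript
// Sketch: tolerate a missing optional column by checking for its presence
// on the returned row instead of auto-creating it.
type ProductRow = Record<string, unknown>;

function mapProduct(row: ProductRow) {
  return {
    id: row.id,
    name: row.name_raw ?? null,
    // thumbnail_url was added in migration 045; older schemas may lack it.
    thumbnailUrl: 'thumbnail_url' in row ? (row.thumbnail_url ?? null) : null,
  };
}
```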
Canonical Schema Migration (041/042)
Migration 041 (backend/migrations/041_cannaiq_canonical_schema.sql):
- Creates canonical CannaiQ tables: `states`, `chains`, `brands`, `store_products`, `store_product_snapshots`, `crawl_runs`
- Adds `state_id` and `chain_id` columns to `dispensaries`
- Adds status columns to `dispensary_crawler_profiles`
- SCHEMA ONLY - no data inserts from legacy tables
ETL Script 042 (backend/src/scripts/etl/042_legacy_import.ts):
- Copies data from legacy `dutchie_legacy.dutchie_products` → `store_products`
- Copies data from legacy `dutchie_legacy.dutchie_product_snapshots` → `store_product_snapshots`
- Extracts brands from product data into the `brands` table
- Links dispensaries to chains and states
- INSERT-ONLY and IDEMPOTENT (uses ON CONFLICT DO NOTHING)
- Run manually: `cd backend && npx tsx src/scripts/etl/042_legacy_import.ts`
Tables touched by ETL:
| Source Table (dutchie_legacy) | Target Table (dutchie_menus) |
|---|---|
| `dutchie_products` | `store_products` |
| `dutchie_product_snapshots` | `store_product_snapshots` |
| (brand names extracted) | `brands` |
| (state codes mapped) | `dispensaries.state_id` |
| (chain names matched) | `dispensaries.chain_id` |
Note: The legacy dutchie_products and dutchie_product_snapshots tables in dutchie_legacy are read-only sources. All new crawl data goes directly to store_products and store_product_snapshots.
Migration 045 (backend/migrations/045_add_image_columns.sql):
- Adds `thumbnail_url` to `store_products` and `store_product_snapshots`
- `image_url` already exists from migration 041
- ETL 042 populates `image_url` from legacy `primary_image_url` where present
- `thumbnail_url` is NULL for legacy data - future crawls can populate it
Deprecated Connection Module
The custom connection module at src/dutchie-az/db/connection is DEPRECATED.
All code using getClient from this module must be refactored to:
- Use the CannaiQ API endpoints instead
- Use the orchestrator through the API
- Use the canonical DB pool from the main application
PERFORMANCE REQUIREMENTS
Database Queries:
- NEVER write N+1 queries - always batch fetch related data before iterating
- NEVER run queries inside loops - batch them before the loop
- Avoid multiple queries when one JOIN or subquery works
- Dashboard/index pages should use MAX 5-10 queries total, not 50+
- Mentally trace query count - if a page would run 20+ queries, refactor
- Cache expensive aggregations (in-memory or Redis, 5-min TTL) instead of recalculating every request
- Use query logging during development to verify query count
Before submitting route/controller code, verify:
- No queries inside `forEach`/`map`/`for` loops
- All related data fetched in batches before iteration
- Aggregations done in SQL (`COUNT`, `SUM`, `AVG`, `GROUP BY`), not in JS
- Would this cause a 503 under load? If unsure, simplify.
Examples of BAD patterns:
// BAD: N+1 query - runs a query for each store
const stores = await getStores();
for (const store of stores) {
store.products = await getProductsByStoreId(store.id); // N queries!
}
// BAD: Query inside map
const results = await Promise.all(
storeIds.map(id => pool.query('SELECT * FROM products WHERE store_id = $1', [id]))
);
Examples of GOOD patterns:
// GOOD: Batch fetch all products, then group in JS
const stores = await getStores();
const storeIds = stores.map(s => s.id);
const allProducts = await pool.query(
'SELECT * FROM products WHERE store_id = ANY($1)', [storeIds]
);
const productsByStore = groupBy(allProducts.rows, 'store_id');
stores.forEach(s => s.products = productsByStore[s.id] || []);
// GOOD: Single query with JOIN
const result = await pool.query(`
SELECT s.*, COUNT(p.id) as product_count
FROM stores s
LEFT JOIN products p ON p.store_id = s.id
GROUP BY s.id
`);
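The good batch-fetch example above calls a `groupBy` helper. A minimal implementation is sketched below (any lodash-style `groupBy` works the same way; this version is just self-contained):

```typescript
// Minimal groupBy: index rows by the string value of a key, preserving order.
function groupBy<T extends Record<string, any>>(rows: T[], key: string): Record<string, T[]> {
  const out: Record<string, T[]> = {};
  for (const row of rows) {
    const k = String(row[key]);
    (out[k] ??= []).push(row);
  }
  return out;
}
```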
FORBIDDEN ACTIONS
- Deleting any data (products, snapshots, images, logs, traces)
- Deploying without explicit authorization
- Connecting to Kubernetes without authorization
- Port-forwarding to production without authorization
- Starting MinIO in local development
- Using S3/MinIO SDKs when STORAGE_DRIVER=local
- Automating cleanup of any kind
- Dropping database tables or columns
- Overwriting historical records (always append snapshots)
- Runtime schema patching (ALTER TABLE from scripts)
- Using `getClient` from the deprecated connection module
- Creating ad-hoc database connections outside the canonical pool
- Auto-adding missing columns at runtime
- Killing local processes (`pkill`, `kill`, `kill -9`, etc.)
- Starting backend/frontend directly with custom env vars
- Running `lsof -ti:PORT | xargs kill` or similar process-killing commands
- Using hardcoded database names in code or comments
- Creating or connecting to a second database
- Creating API routes without authMiddleware (all `/api/*` routes MUST be protected)
STORAGE BEHAVIOR
Local Storage Structure
/storage/images/products/{state}/{store}/{brand}/{product}/
image-{hash}.webp
/storage/images/brands/{brand}/
logo-{hash}.webp
Image Proxy API (On-Demand Resizing)
Images are stored at full resolution and resized on-demand via the /img endpoint.
Endpoint: GET /img/<path>?<params>
Parameters:
| Param | Description | Example |
|---|---|---|
| `w` | Width in pixels (max 4000) | `?w=200` |
| `h` | Height in pixels (max 4000) | `?h=200` |
| `q` | Quality 1-100 (default 80) | `?q=70` |
| `fit` | Resize mode: cover, contain, fill, inside, outside | `?fit=cover` |
| `blur` | Blur sigma 0.3-1000 | `?blur=5` |
| `gray` | Grayscale (1 = enabled) | `?gray=1` |
| `format` | Output: webp, jpeg, png, avif (default webp) | `?format=jpeg` |
Examples:
# Thumbnail (50px)
GET /img/products/az/store/brand/product/image-abc123.webp?w=50
# Card image (200px, cover fit)
GET /img/products/az/store/brand/product/image-abc123.webp?w=200&h=200&fit=cover
# JPEG at 70% quality
GET /img/products/az/store/brand/product/image-abc123.webp?w=400&format=jpeg&q=70
# Grayscale blur
GET /img/products/az/store/brand/product/image-abc123.webp?w=200&gray=1&blur=3
Frontend Usage:
import { getImageUrl, ImageSizes } from '../lib/images';
// Returns /img/products/.../image.webp?w=50 for local images
// Returns original URL for remote images (CDN, etc.)
const thumbUrl = getImageUrl(product.image_url, ImageSizes.thumb);
const cardUrl = getImageUrl(product.image_url, ImageSizes.medium);
const detailUrl = getImageUrl(product.image_url, ImageSizes.detail);
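The helper's behavior can be approximated as below. This is a hedged sketch only: the real implementation lives in `cannaiq/src/lib/images.ts`, and the path-rewriting rule and `ImageSizes` widths here are assumptions inferred from the examples above, not the actual code:

```typescript
// Sketch: remote URLs pass through untouched; local image paths are routed
// through the /img on-demand resize endpoint with a width parameter.
const ImageSizes = { thumb: 50, medium: 200, detail: 600 } as const;

function getImageUrl(url: string | null, width?: number): string | null {
  if (!url) return null;
  if (/^https?:\/\//.test(url)) return url;                 // remote CDN image
  const path = `/img/${url.replace(/^\/?(images\/)?/, '')}`; // assumed mapping
  return width ? `${path}?w=${width}` : path;
}
```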
Size Presets:
| Preset | Width | Use Case |
|---|---|---|
| `thumb` | 50px | Table thumbnails |
| `small` | 100px | Small cards |
| `medium` | 200px | Grid cards |
| `large` | 400px | Large cards |
| `detail` | 600px | Product detail |
| `full` | - | No resize |
Storage Adapter
import { saveImage, getImageUrl } from '../utils/storage-adapter';
// Automatically uses local storage when STORAGE_DRIVER=local
Files
| File | Purpose |
|---|---|
| `backend/src/utils/image-storage.ts` | Image download and storage |
| `backend/src/routes/image-proxy.ts` | On-demand image resizing endpoint |
| `cannaiq/src/lib/images.ts` | Frontend image URL helper |
| `docker-compose.local.yml` | Local stack without MinIO |
| `start-local.sh` | Convenience startup script |
UI ANONYMIZATION RULES
- No vendor names in forward-facing URLs
- No "dutchie", "treez", "jane", "weedmaps", "leafly" visible in consumer UIs
- Internal admin tools may show provider names for debugging
DUTCHIE DISCOVERY PIPELINE (Added 2025-01)
Overview
Automated discovery of Dutchie-powered dispensaries across all US states.
Flow
1. getAllCitiesByState GraphQL → Get all cities for a state
2. ConsumerDispensaries GraphQL → Get stores for each city
3. Upsert to dutchie_discovery_locations (keyed by platform_location_id)
4. AUTO-VALIDATE: Check required fields
5. AUTO-PROMOTE: Create/update dispensaries with crawl_enabled=true
6. Log all actions to dutchie_promotion_log
Tables
| Table | Purpose |
|---|---|
| `dutchie_discovery_cities` | Cities known to have dispensaries |
| `dutchie_discovery_locations` | Raw discovered store data |
| `dispensaries` | Canonical stores (promoted from discovery) |
| `dutchie_promotion_log` | Audit trail for validation/promotion |
Files
| File | Purpose |
|---|---|
| `src/discovery/discovery-crawler.ts` | Main orchestrator |
| `src/discovery/location-discovery.ts` | GraphQL fetching |
| `src/discovery/promotion.ts` | Validation & promotion logic |
| `src/scripts/run-discovery.ts` | CLI interface |
| `migrations/067_promotion_log.sql` | Audit log table |
GraphQL Hashes (in src/platforms/dutchie/client.ts)
| Query | Hash |
|---|---|
| `GetAllCitiesByState` | `ae547a0466ace5a48f91e55bf6699eacd87e3a42841560f0c0eabed5a0a920e6` |
| `ConsumerDispensaries` | `0a5bfa6ca1d64ae47bcccb7c8077c87147cbc4e6982c17ceec97a2a4948b311b` |
Usage
# Discover all stores in a state
npx tsx src/scripts/run-discovery.ts discover:state AZ
npx tsx src/scripts/run-discovery.ts discover:state CA
# Check stats
npx tsx src/scripts/run-discovery.ts stats
Validation Rules
A discovery location must have:
- `platform_location_id` (MongoDB ObjectId, 24 hex chars)
- `name`
- `city`
- `state_code`
- `platform_menu_url`
Invalid records are marked status='rejected' with errors logged.
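The validation rules above can be sketched as a pure function (the row shape and function name are hypothetical; the real logic lives in `src/discovery/promotion.ts`):

```typescript
// Sketch of discovery-location validation: ObjectId check plus required fields.
interface DiscoveryLocation {
  platform_location_id?: string;
  name?: string;
  city?: string;
  state_code?: string;
  platform_menu_url?: string;
}

function validateLocation(loc: DiscoveryLocation): string[] {
  const errors: string[] = [];
  // Must be a MongoDB ObjectId: exactly 24 hex chars, never a slug.
  if (!/^[0-9a-f]{24}$/i.test(loc.platform_location_id ?? '')) {
    errors.push('platform_location_id must be a 24-char hex ObjectId');
  }
  for (const field of ['name', 'city', 'state_code', 'platform_menu_url'] as const) {
    if (!loc[field]) errors.push(`${field} is required`);
  }
  return errors; // empty → valid; non-empty → mark status='rejected' and log
}
```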
Key Design Decisions
- `platform_location_id` MUST be a MongoDB ObjectId (not a slug)
- Old geo-based discovery stored slugs → deleted as garbage data
- Rate limit: 2 seconds between city requests to avoid API throttling
- Promotion is idempotent via `ON CONFLICT (platform_dispensary_id)`
FUTURE TODO / PENDING FEATURES
- Orchestrator observability dashboard
- Crawl profile management UI
- State machine sandbox (disabled until authorized)
- Multi-state expansion beyond AZ
Multi-Site Architecture (CRITICAL)
This project has 4 active locations (plus 1 deprecated) - always clarify which one before making changes:
| Folder | Domain | Type | Purpose |
|---|---|---|---|
| `backend/` | (shared) | Express API | Single backend serving all frontends |
| `frontend/` | (DEPRECATED) | React SPA (Vite) | DEPRECATED - was dispos.crawlsy.com, now removed |
| `cannaiq/` | cannaiq.co | React SPA + PWA | Admin dashboard / B2B analytics |
| `findadispo/` | findadispo.com | React SPA + PWA | Consumer dispensary finder |
| `findagram/` | findagram.co | React SPA + PWA | Consumer delivery marketplace |
NOTE: frontend/ folder is DEPRECATED:
- `frontend/` = OLD/legacy dashboard - NO LONGER DEPLOYED (removed from k8s)
- `cannaiq/` = Primary admin dashboard, deployed to cannaiq.co
- Do NOT use or modify the `frontend/` folder - it will be archived/removed
Before any frontend work, ASK: "Which site? cannaiq, findadispo, or findagram?"
All three active frontends share:
- Same backend API (port 3010)
- Same PostgreSQL database
- Same Kubernetes deployment for backend
Each frontend has:
- Its own folder, package.json, Dockerfile
- Its own domain and branding
- Its own PWA manifest and service worker (cannaiq, findadispo, findagram)
- Separate Docker containers in production
Multi-Domain Hosting Architecture
All three frontends are served from the same IP using host-based routing:
Kubernetes Ingress (Production):
# Each domain routes to its own frontend service
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
name: multi-site-ingress
spec:
rules:
- host: cannaiq.co
http:
paths:
- path: /
backend:
service:
name: cannaiq-frontend
port: 80
- path: /api
backend:
service:
name: scraper # shared backend
port: 3010
- host: findadispo.com
http:
paths:
- path: /
backend:
service:
name: findadispo-frontend
port: 80
- path: /api
backend:
service:
name: scraper
port: 3010
- host: findagram.co
http:
paths:
- path: /
backend:
service:
name: findagram-frontend
port: 80
- path: /api
backend:
service:
name: scraper
port: 3010
Key Points:
- DNS A records for all 3 domains point to same IP
- Ingress controller routes based on the `Host` header
- Each frontend is a separate Docker container (nginx serving static files)
- All frontends share the same backend API at `/api/*`
- SSL/TLS handled at ingress level (cert-manager)
PWA Setup Requirements
Each frontend is a Progressive Web App (PWA). Required files in each public/ folder:
- manifest.json - App metadata, icons, theme colors
- service-worker.js - Offline caching, background sync
- Icons - 192x192 and 512x512 PNG icons
Vite PWA Plugin Setup (in each frontend's vite.config.ts):
import { VitePWA } from 'vite-plugin-pwa'
export default defineConfig({
plugins: [
react(),
VitePWA({
registerType: 'autoUpdate',
manifest: {
name: 'Site Name',
short_name: 'Short',
theme_color: '#10b981',
icons: [
{ src: '/icon-192.png', sizes: '192x192', type: 'image/png' },
{ src: '/icon-512.png', sizes: '512x512', type: 'image/png' }
]
},
workbox: {
globPatterns: ['**/*.{js,css,html,ico,png,svg,woff2}']
}
})
]
})
Core Rules Summary
- DB: Use the single CannaiQ database via `CANNAIQ_DB_*` env vars. No hardcoded names.
- Images: No MinIO. Save to local `/images/products/<dispensary_id>/<product_id>-<hash>.webp` (and brands); preserve original URL; serve via backend static.
- Dutchie GraphQL: Endpoint `https://dutchie.com/api-3/graphql`. Variables must use `productsFilter.dispensaryId` (`platform_dispensary_id`). CRITICAL: Use `Status: 'Active'`, NOT `null` (null returns 0 products).
- cName/slug: Derive cName from each store's `menu_url` (`/embedded-menu/` or `/dispensary/`). No hardcoded defaults.
- Batch DB writes: Chunk products/snapshots/missing (100–200) to avoid OOM.
- API/Frontend: Use `/api/stores`, `/api/products`, `/api/workers`, `/api/pipeline` endpoints.
- Scheduling: Crawl only `menu_type='dutchie'` AND `platform_dispensary_id IS NOT NULL`. 4-hour crawl with jitter.
- THC/CBD values: Clamp to ≤100 - some products report milligrams as percentages.
- Column names: Use `name_raw`, `brand_name_raw`, `category_raw`, `subcategory_raw` (NOT `name`, `brand_name`, etc.)
- Monitor: `/api/workers` shows active/recent jobs from the job queue.
- No slug guessing: Never use defaults. Always derive per store from `menu_url` and resolve platform IDs per location.
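The potency clamp rule can be sketched as a tiny helper (name is hypothetical; the rule itself — treat values over 100 as mislabeled milligrams and cap them — is from this document):

```typescript
// Sketch: THC/CBD percentages over 100 are almost certainly milligram
// values reported in a percent field, so clamp to 100.
function clampPotency(value: number | null): number | null {
  if (value == null || Number.isNaN(value)) return null;
  return value > 100 ? 100 : value;
}
```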
📖 Full Documentation: See docs/DUTCHIE_CRAWL_WORKFLOW.md for complete pipeline documentation.
Detailed Rules
- Dispensary = Store (SAME THING)
  - "Dispensary" and "store" are synonyms in CannaiQ. Use interchangeably.
  - API endpoint: `/api/stores` (NOT `/api/dispensaries`)
  - DB table: `dispensaries`
  - When you need to create/query stores via API, use `/api/stores`
  - Use the record's `menu_url` and `platform_dispensary_id`.
- API Authentication
  - Trusted Origins (no auth needed):
    - IPs: `127.0.0.1`, `::1`, `::ffff:127.0.0.1`
    - Origins: `https://cannaiq.co`, `https://findadispo.com`, `https://findagram.co`
    - Also: `http://localhost:3010`, `http://localhost:8080`, `http://localhost:5173`
  - Requests from trusted IPs/origins get automatic admin access (`role: 'internal'`)
  - Remote (non-trusted): Use Bearer token (JWT or API token). NO username/password auth.
  - Never try to login with username/password via API - use tokens only.
  - See `src/auth/middleware.ts` for the `TRUSTED_ORIGINS` and `TRUSTED_IPS` lists.
- Menu detection and platform IDs
  - Set `menu_type` from `menu_url` detection; resolve `platform_dispensary_id` for `menu_type='dutchie'`.
  - Admin should have "refresh detection" and "resolve ID" actions; schedule/crawl only when `menu_type='dutchie'` AND `platform_dispensary_id` is set.
- Queries and mapping
  - The DB returns snake_case; code expects camelCase. Always alias/map: `platform_dispensary_id AS "platformDispensaryId"`
  - Map via `mapDbRowToDispensary` when loading dispensaries (scheduler, crawler, admin crawl).
  - Avoid `SELECT *`; explicitly select and/or map fields.
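A generic snake_case → camelCase mapper illustrates the aliasing rule above. Note this generic version is only a sketch; the project prefers explicit field mapping via `mapDbRowToDispensary`:

```typescript
// Sketch: convert snake_case DB row keys to the camelCase keys code expects,
// e.g. platform_dispensary_id → platformDispensaryId.
function mapRowToCamel<T extends Record<string, unknown>>(row: T): Record<string, unknown> {
  const out: Record<string, unknown> = {};
  for (const [key, value] of Object.entries(row)) {
    const camel = key.replace(/_([a-z])/g, (_, c: string) => c.toUpperCase());
    out[camel] = value;
  }
  return out;
}
```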
- Scheduling
  - `/scraper-schedule` should accept filters/search (All vs AZ-only, name).
  - "Run Now"/scheduler must skip or warn if `menu_type != 'dutchie'` or `platform_dispensary_id` is missing.
  - Use the `dispensary_crawl_status` view; show the reason when a store is not crawlable.
- Crawling
  - Trigger dutchie crawls by dispensary ID (e.g., `POST /api/admin/crawl/:id`).
  - Update existing products (by stable product ID), append snapshots for history (every 4h cadence), download images locally (`/images/...`), store local URLs.
  - Use the dutchie GraphQL pipeline only for `menu_type='dutchie'`.
- Frontend
  - Forward-facing URLs should not contain vendor names.
  - `/scraper-schedule`: add filters/search, keep as master view for all schedules; reflect platform ID/menu_type status and controls.
- No slug guessing
  - Do not guess slugs; use the DB record's `menu_url` and ID. Resolve the platform ID from the URL/cName; if set, crawl directly by ID.
- Image storage (no MinIO)
  - Save images to the local filesystem only. Do not create or use MinIO in Docker.
  - Product images: `/images/products/<dispensary_id>/<product_id>-<hash>.webp` (+medium/+thumb).
  - Brand images: `/images/brands/<brand_slug_or_sku>-<hash>.webp`.
  - Store local URLs in DB fields (keep original URLs as fallback only).
  - Serve `/images` via backend static middleware.
- Dutchie GraphQL fetch rules
  - Endpoint: `https://dutchie.com/api-3/graphql`
  - Variables: Use `productsFilter.dispensaryId` = `platform_dispensary_id` (MongoDB ObjectId).
  - Mode A: `Status: "Active"` - returns active products with pricing
  - Mode B: `Status: null` / `activeOnly: false` - returns all products including OOS/inactive
  - Headers (server-side axios only): Chrome UA, `Origin: https://dutchie.com`, `Referer: https://dutchie.com/embedded-menu/<cName>`.
- Batch DB writes to avoid OOM
  - Do NOT build one giant upsert/insert payload for products/snapshots/missing marks.
  - Chunk arrays (e.g., 100–200 items) and upsert/insert in a loop; drop references after each chunk.
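The chunking rule above can be sketched as a small helper (the helper name and the `upsertProducts` call in the usage comment are hypothetical):

```typescript
// Sketch: split a large array into batches so upserts never hold one giant
// payload in memory (default size within the 100–200 guideline).
function chunk<T>(items: T[], size = 150): T[][] {
  const out: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    out.push(items.slice(i, i + size));
  }
  return out;
}

// Usage sketch: upsert each batch in a loop, dropping references as you go.
// for (const batch of chunk(products)) await upsertProducts(batch);
```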
- Use dual-mode crawls by default
  - Always run with `useBothModes: true` to combine Mode A (pricing) + Mode B (full coverage).
  - Union/dedupe by product ID so you keep full coverage and pricing in one run.
- Capture OOS and missing items
  - GraphQL variables must include inactive/OOS (`Status: All` / `activeOnly: false`).
  - After unioning Mode A/B, upsert products and insert snapshots with `stock_status` from the feed.
  - If an existing product is absent from both modes, insert a snapshot with `is_present_in_feed=false` and `stock_status='missing_from_feed'`.
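The Mode A/B union step can be sketched as follows. The row shape and the choice to let Mode A fields win on conflict (since Mode A carries pricing) are illustrative assumptions:

```typescript
// Sketch: union Mode A (pricing) and Mode B (full coverage), deduped by
// product id, with Mode A fields taking precedence.
interface FeedProduct { id: string; price?: number; stock_status: string }

function unionModes(modeA: FeedProduct[], modeB: FeedProduct[]): FeedProduct[] {
  const byId = new Map<string, FeedProduct>();
  for (const p of modeB) byId.set(p.id, p);                        // coverage first
  for (const p of modeA) byId.set(p.id, { ...byId.get(p.id), ...p }); // pricing wins
  return [...byId.values()];
}
```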
- Preserve all stock statuses (including unknown)
  - Do not filter or drop `stock_status` values in API/UI; pass through whatever is stored.
  - Expected values: `in_stock`, `out_of_stock`, `missing_from_feed`, `unknown`.
- Never delete or overwrite historical data
  - Do not delete products/snapshots or overwrite historical records.
  - Always append snapshots for changes (price/stock/qty), and mark `missing_from_feed` instead of removing records.
- Per-location cName and platform_dispensary_id resolution
  - For each dispensary, `menu_url` and cName must be valid for that exact location.
  - Derive cName from `menu_url` per store: `/embedded-menu/<cName>` or `/dispensary/<cName>`.
  - Resolve `platform_dispensary_id` from that cName using the GraphQL `GetAddressBasedDispensaryData` query.
  - If the slug is invalid/missing, mark the store not crawlable and log it.
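The per-store cName derivation can be sketched with a single regex (function name is hypothetical; the two URL shapes are the ones named above):

```typescript
// Sketch: pull the cName slug out of a menu_url of either supported shape,
// /embedded-menu/<cName> or /dispensary/<cName>; null means not crawlable.
function deriveCName(menuUrl: string): string | null {
  const match = menuUrl.match(/\/(?:embedded-menu|dispensary)\/([^/?#]+)/);
  return match ? match[1] : null;
}
```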
- API Route Semantics
  - Route groups (as registered in `src/index.ts`):
    - `/api/stores` = Store/dispensary CRUD and listing
    - `/api/products` = Product listing and details
    - `/api/workers` = Job queue monitoring (replaces legacy `/api/dutchie-az/...`)
    - `/api/pipeline` = Crawl pipeline triggers
    - `/api/admin/orchestrator` = Orchestrator admin actions
    - `/api/discovery` = Platform discovery (Dutchie, etc.)
    - `/api/v1/...` = Public API for external consumers (WordPress, etc.)
  - Crawl trigger: Check `/api/pipeline` or `/api/admin/orchestrator` routes for crawl triggers. The legacy `POST /api/admin/crawl/:dispensaryId` does NOT exist.
- Monitoring and logging
  - `/api/workers` shows active/recent jobs from the job queue
  - Auto-refresh every 30 seconds
  - System Logs page should show real log data, not just startup messages
- Dashboard Architecture
  - Frontend: Rebuild the frontend with `VITE_API_URL` pointing to the correct backend and redeploy.
  - Backend: `/api/dashboard/stats` MUST use the canonical DB pool. Use the correct tables: `store_products`, `dispensaries`, and views like `v_dashboard_stats`, `v_latest_snapshots`.
- Deployment (Gitea + Kubernetes)
  - Registry: Gitea at `code.cannabrands.app/creationshop/dispensary-scraper`
  - Build and push (from backend directory):
    ```
    docker login code.cannabrands.app
    cd backend
    docker build -t code.cannabrands.app/creationshop/dispensary-scraper:latest .
    docker push code.cannabrands.app/creationshop/dispensary-scraper:latest
    ```
  - Deploy to Kubernetes:
    ```
    kubectl rollout restart deployment/scraper -n dispensary-scraper
    kubectl rollout restart deployment/scraper-worker -n dispensary-scraper
    kubectl rollout status deployment/scraper -n dispensary-scraper
    ```
  - K8s manifests are in the `/k8s/` folder (scraper.yaml, scraper-worker.yaml, etc.)
-
- **Crawler Architecture**
  - Scraper pod (1 replica): Runs the Express API server + scheduler.
  - Scraper-worker pods (25 replicas): Each runs `dist/tasks/task-worker.js`, polling the job queue.
  - Worker naming: Pods use fantasy names (Aethelgard, Xylos, Kryll, Coriolis, etc.) - see the `k8s/scraper-worker.yaml` ConfigMap. Worker IDs: `{PodName}-worker-{n}`
  - Job types: `menu_detection`, `menu_detection_single`, `dutchie_product_crawl`
  - Job schedules (managed in the `job_schedules` table):
    - `dutchie_az_menu_detection`: Runs daily with 60-min jitter
    - `dutchie_az_product_crawl`: Runs every 4 hours with 30-min jitter
  - Monitor jobs: `GET /api/workers`
  - Trigger crawls: Check `/api/pipeline` routes
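The `{PodName}-worker-{n}` naming convention above can be sketched as a small helper (illustrative only; the actual ID construction lives in the worker bootstrap code):

```typescript
// Builds a worker ID from the pod's fantasy name and a worker index,
// following the {PodName}-worker-{n} convention (sketch, not the real implementation).
function workerId(podName: string, n: number): string {
  return `${podName}-worker-${n}`;
}

workerId('Aethelgard', 1); // → 'Aethelgard-worker-1'
```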
**Frontend Architecture - AVOID OVER-ENGINEERING**

Key Principles:
- ONE BACKEND serves ALL domains (cannaiq.co, findadispo.com, findagram.co)
- Do NOT create separate backend services for each domain

Frontend Build Differences:
- `cannaiq/` uses Vite (outputs to `dist/`, uses `VITE_` env vars) → cannaiq.co
- `findadispo/` uses Create React App (outputs to `build/`, uses `REACT_APP_` env vars) → findadispo.com
- `findagram/` uses Create React App (outputs to `build/`, uses `REACT_APP_` env vars) → findagram.co

CRA vs Vite Dockerfile Differences:

```dockerfile
# Vite (cannaiq)
ENV VITE_API_URL=https://api.domain.com
RUN npm run build
COPY --from=builder /app/dist /usr/share/nginx/html

# CRA (findadispo, findagram)
ENV REACT_APP_API_URL=https://api.domain.com
RUN npm run build
COPY --from=builder /app/build /usr/share/nginx/html
```

Common Mistakes to AVOID:
- Creating a FastAPI/Express backend just for findagram or findadispo
- Creating separate Docker images per domain when one would work
- Using `npm ci` in Dockerfiles when package-lock.json doesn't exist (use `npm install`)
Admin UI Integration (Dutchie Discovery System)
The admin frontend includes a dedicated Discovery page located at:
cannaiq/src/pages/Discovery.tsx
This page is the operational interface that administrators use for managing the Dutchie discovery pipeline. While it does not define API features itself, it is the primary consumer of the Dutchie Discovery API.
Responsibilities of the Discovery UI
The UI enables administrators to:
- View all discovered Dutchie locations
- Filter by status:
- discovered
- verified
- merged (linked to an existing dispensary)
- rejected
- Inspect individual location details (metadata, raw address, menu URL)
- Verify & create a new canonical dispensary
- Verify & link to an existing canonical dispensary
- Reject or unreject discovered locations
- Promote verified/merged locations into full crawlers via the orchestrator
API Endpoints Consumed by the Discovery UI
The Discovery UI uses platform-agnostic routes with neutral slugs (see docs/platform-slug-mapping.md):
Platform Slug: dt = Dutchie (trademark-safe URL)
- `GET /api/discovery/platforms/dt/locations`
- `GET /api/discovery/platforms/dt/locations/:id`
- `POST /api/discovery/platforms/dt/locations/:id/verify-create`
- `POST /api/discovery/platforms/dt/locations/:id/verify-link`
- `POST /api/discovery/platforms/dt/locations/:id/reject`
- `POST /api/discovery/platforms/dt/locations/:id/unreject`
- `GET /api/discovery/platforms/dt/locations/:id/match-candidates`
- `GET /api/discovery/platforms/dt/cities`
- `GET /api/discovery/platforms/dt/summary`
- `POST /api/orchestrator/platforms/dt/promote/:id`
These endpoints are defined in:
- `backend/src/dutchie-az/discovery/routes.ts`
- `backend/src/dutchie-az/discovery/promoteDiscoveryLocation.ts`
Frontend API Helper
The file:
cannaiq/src/lib/api.ts
implements the client-side wrappers for calling these endpoints:
- `getPlatformDiscoverySummary(platformSlug)`
- `getPlatformDiscoveryLocations(platformSlug, params)`
- `getPlatformDiscoveryLocation(platformSlug, id)`
- `verifyCreatePlatformLocation(platformSlug, id, verifiedBy)`
- `verifyLinkPlatformLocation(platformSlug, id, dispensaryId, verifiedBy)`
- `rejectPlatformLocation(platformSlug, id, reason, verifiedBy)`
- `unrejectPlatformLocation(platformSlug, id)`
- `getPlatformLocationMatchCandidates(platformSlug, id)`
- `getPlatformDiscoveryCities(platformSlug, params)`
- `promotePlatformDiscoveryLocation(platformSlug, id)`
Where platformSlug is a neutral two-letter slug (e.g., 'dt' for Dutchie).
These helpers must be kept synchronized with backend routes.
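As a sketch of how the UI might use these wrappers in the verify-link flow — the candidate shape below is an assumption for illustration, not the actual `api.ts` types:

```typescript
// Assumed shape of a row returned by getPlatformLocationMatchCandidates
// (hypothetical; check the backend route for the real fields).
type MatchCandidate = { dispensaryId: number; score: number };

// Pick the highest-scoring candidate, or null when there is none. An operator
// must still confirm before verifyLinkPlatformLocation is called — discovery
// data stays non-canonical until verified.
function pickBestCandidate(candidates: MatchCandidate[]): MatchCandidate | null {
  if (candidates.length === 0) return null;
  return candidates.reduce((best, c) => (c.score > best.score ? c : best));
}
```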
UI/Backend Contract
The Discovery UI must always:
- Treat discovery data as non-canonical until verified.
- Not assume a discovery location is crawl-ready.
- Initiate promotion only after verification steps.
- Handle all statuses safely: discovered, verified, merged, rejected.
The backend must always:
- Preserve discovery data even if rejected.
- Never automatically merge or promote a location.
- Allow idempotent verification and linking actions.
- Expose complete metadata to help operators make verification decisions.
Coordinate Capture (Platform Discovery)
The DtLocationDiscoveryService captures geographic coordinates (latitude, longitude) whenever a platform's store payload provides them.
Behavior:

- On INSERT:
  - If the Dutchie API/GraphQL payload includes coordinates, they are saved into:
    - `dutchie_discovery_locations.latitude`
    - `dutchie_discovery_locations.longitude`
- On UPDATE:
  - Coordinates are only filled if the existing row has NULL values.
  - Coordinates are never overwritten once set (prevents pollution if later payloads omit or degrade coordinate accuracy).
- Logging:
  - When coordinates are detected and captured: `Extracted coordinates for <name>: <lat>, <lng>`
- Summary Statistics:
  - The discovery runner reports a count of:
    - locations with coordinates
    - locations without coordinates
Purpose:
Coordinate capture enables:
- City/state validation (cross-checking submitted address vs lat/lng)
- Distance-based duplicate detection
- Location clustering for analytics
- Mapping/front-end visualization
- Future multi-platform reconciliation
- Improved dispensary matching during verify-link flow
Coordinate capture is part of the discovery phase only.
Canonical dispensary entries may later be enriched with verified coordinates during promotion.
CannaiQ — Analytics V2 Examples & API Structure Extension
This section contains examples from backend/docs/ANALYTICS_V2_EXAMPLES.md and extends the Analytics V2 API definition to include:
- response payload formats
- time window semantics
- rec/med segmentation usage
- SQL/TS pseudo-code examples
- endpoint expectations
Analytics V2: Supported Endpoints
Base URL prefix: /api/analytics/v2
All endpoints accept ?window=7d|30d|90d unless noted otherwise.
1. Price Analytics
GET /api/analytics/v2/price/product/:storeProductId
Returns price history for a canonical store product.
Example response:

```json
{
  "storeProductId": 123,
  "window": "30d",
  "points": [
    { "date": "2025-02-01", "price": 32, "in_stock": true },
    { "date": "2025-02-02", "price": 30, "in_stock": true }
  ]
}
```
GET /api/analytics/v2/price/rec-vs-med?categoryId=XYZ
Compares category pricing between recreational and medical-only states.
Example response:

```json
{
  "categoryId": "flower",
  "rec": { "avg": 29.44, "median": 28.00, "states": ["CO", "WA", ...] },
  "med": { "avg": 33.10, "median": 31.00, "states": ["FL", "PA", ...] }
}
```
2. Brand Analytics
GET /api/analytics/v2/brand/:name/penetration
Returns penetration across states.
```json
{
  "brand": "Wyld",
  "window": "90d",
  "penetration": [
    { "state": "AZ", "stores": 28 },
    { "state": "MI", "stores": 34 }
  ]
}
```
GET /api/analytics/v2/brand/:name/rec-vs-med
Returns penetration split by rec vs med segmentation.
3. Category Analytics
GET /api/analytics/v2/category/:name/growth
7d/30d/90d snapshot comparison:
```json
{
  "category": "vape",
  "window": "30d",
  "growth": {
    "current_sku_count": 420,
    "previous_sku_count": 380,
    "delta": 40
  }
}
```
GET /api/analytics/v2/category/rec-vs-med
Category-level comparisons.
4. Store Analytics
GET /api/analytics/v2/store/:storeId/changes
Product-level changes:
```json
{
  "storeId": 88,
  "window": "30d",
  "added": [...],
  "removed": [...],
  "price_changes": [...],
  "restocks": [...],
  "oos_events": [...]
}
```
GET /api/analytics/v2/store/:storeId/summary
5. State Analytics
GET /api/analytics/v2/state/legal-breakdown
State rec/med/no-program segmentation summary.
GET /api/analytics/v2/state/rec-vs-med-pricing
State-level pricing comparison.
GET /api/analytics/v2/state/recreational
List rec-legal state codes.
GET /api/analytics/v2/state/medical-only
List med-only state codes.
Windowing Semantics
Definition: window is applied to canonical snapshots. Equivalent to:
```sql
WHERE snapshot_at >= NOW() - INTERVAL '<window>'
```
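A minimal sketch of mapping the `?window=` parameter onto that interval (assumed helper, not necessarily how the backend implements it):

```typescript
// Maps the ?window= query param to a SQL interval string (sketch).
function windowToInterval(window: string): string {
  const map: Record<string, string> = { '7d': '7 days', '30d': '30 days', '90d': '90 days' };
  const interval = map[window];
  if (!interval) throw new Error(`unsupported window: ${window}`);
  return interval;
}

// e.g. `WHERE snapshot_at >= NOW() - INTERVAL '${windowToInterval('30d')}'`
```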
Rec/Med Segmentation Rules
- `rec_states`: `states.recreational_legal = TRUE`
- `med_only_states`: `states.medical_legal = TRUE AND states.recreational_legal = FALSE`
- `no_program`: both flags FALSE or NULL
Analytics must use this segmentation consistently.
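The three rules can be expressed as a classifier over `states` rows (column names are from the rules above; the row type itself is an assumption):

```typescript
type StateRow = { recreational_legal: boolean | null; medical_legal: boolean | null };

// Classifies a state per the segmentation rules: rec wins, then med-only,
// otherwise no_program (both flags FALSE or NULL).
function segmentState(s: StateRow): 'rec' | 'med_only' | 'no_program' {
  if (s.recreational_legal === true) return 'rec';
  if (s.medical_legal === true) return 'med_only';
  return 'no_program';
}
```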
Response Structure Requirements
Every analytics v2 endpoint must:
- include the window used
- include segmentation if relevant
- include state codes when state-level grouping is used
- return safe empty arrays if no data
- NEVER throw on missing data
- be versionable (v2 must not break previous analytics APIs)
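A sketch of a response envelope that satisfies these requirements (the shape is illustrative, not the actual API contract):

```typescript
// Illustrative envelope: every v2 response carries its window, optional
// segmentation/state context, and a data array that is safely empty when
// there is nothing to report (never throw on missing data).
interface AnalyticsResponse<T> {
  window: '7d' | '30d' | '90d';
  segmentation?: 'rec' | 'med_only' | 'no_program';
  states?: string[];
  data: T[];
}

function emptyResponse<T>(window: '7d' | '30d' | '90d'): AnalyticsResponse<T> {
  return { window, data: [] };
}
```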
Service Responsibilities Summary
PriceAnalyticsService
- compute time-series price trends
- compute average/median price by state
- compute rec-vs-med price comparisons
BrandPenetrationService
- compute presence across stores and states
- rec-vs-med brand footprint
- detect expansion / contraction
CategoryAnalyticsService
- compute SKU count changes
- category pricing
- rec-vs-med category dynamics
StoreAnalyticsService
- detect SKU additions/drops
- price changes
- restocks & OOS events
StateAnalyticsService
- legal breakdown
- coverage gaps
- rec-vs-med scoring
END Analytics V2 spec extension
WordPress Plugin Versioning
The WordPress plugin version is tracked in wordpress-plugin/VERSION.
Current version: Check wordpress-plugin/VERSION for the latest version.
Versioning rules:
- Minor bumps (x.x.N): Bug fixes, small improvements - default for most changes
- Middle bumps (x.N.0): New features, significant improvements
- Major bumps (N.0.0): Breaking changes, major rewrites - only when user explicitly requests
When making WP plugin changes:

1. Read `wordpress-plugin/VERSION` to get the current version
2. Bump the version number (minor by default)
3. Update both files:
   - `wordpress-plugin/VERSION`
   - Plugin header `Version:` in `cannaiq-menus.php` and/or `crawlsy-menus.php`
   - The `define('..._VERSION', '...')` constant in each plugin file
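The default minor bump (x.x.N) could be sketched as the following helper (illustrative only; in practice the bump is done by hand in the files listed above):

```typescript
// Bumps the last segment of an x.x.N version string (the "minor bump" default).
function bumpMinor(version: string): string {
  const parts = version.trim().split('.').map(Number);
  if (parts.length !== 3 || parts.some(Number.isNaN)) {
    throw new Error(`bad version: ${version}`);
  }
  const [major, middle, minor] = parts;
  return `${major}.${middle}.${minor + 1}`;
}

bumpMinor('1.2.3'); // → '1.2.4'
```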
Plugin files:
| File | Brand | API URL |
|---|---|---|
| `cannaiq-menus.php` | CannaIQ | `https://cannaiq.co/api/v1` |
| `crawlsy-menus.php` | Crawlsy (legacy) | `https://cannaiq.co/api/v1` |
Both plugins use the same API endpoint. The Crawlsy version exists for backward compatibility with existing installations.