cannaiq/CLAUDE.md
Kelly 6cd1f55119 fix(workers): Preserve fantasy names on pod restart
- Re-registration no longer overwrites pod_name with K8s name
- New workers get fantasy name (Aethelgard, Xylos, etc.) as pod_name
- Document worker naming convention in CLAUDE.md

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-11 20:35:25 -07:00

## Claude Guidelines for this Project
---
## PERMANENT RULES (NEVER VIOLATE)
### 1. NO DELETION OF DATA — EVER
CannaiQ is a **historical analytics system**. Data retention is **permanent by design**.
**NEVER delete:**
- Product records
- Crawled snapshots
- Images
- Directories
- Logs
- Orchestrator traces
- Profiles
- Selector configs
- Crawl outcomes
- Store data
- Brand data
**NEVER automate cleanup:**
- No cron or scheduled job may `rm`, `unlink`, `delete`, `purge`, `prune`, `clean`, or `reset` any storage directory or DB row
- No migration may DELETE data — only add/update/alter columns
- If cleanup is required, ONLY the user may issue a manual command
**Code enforcement:**
- `local-storage.ts` must only: write files, create directories, read files
- No `deleteImage`, `deleteProductImages`, or similar functions
### 2. NO PROCESS KILLING — EVER
**Claude must NEVER run process-killing commands:**
- No `pkill`
- No `kill -9`
- No `xargs kill`
- No `lsof | kill`
- No `killall`
- No `fuser -k`
**Claude must NOT manage host processes.** Only user scripts manage the local environment.
**Correct behavior:**
- If backend is running on port 3010 → say: "Backend already running"
- If backend is NOT running → say: "Please run `./setup-local.sh`"
**Process management is done ONLY by user scripts:**
```bash
./setup-local.sh # Start local environment
./stop-local.sh # Stop local environment
```
### 3. NO MANUAL SERVER STARTUP — EVER
**Claude must NEVER start the backend manually:**
- No `npx tsx src/index.ts`
- No `node dist/index.js`
- No `npm run dev` with custom env vars
- No `DATABASE_URL=... npx tsx ...`
**Claude must NEVER set DATABASE_URL in shell commands:**
- DB connection uses `CANNAIQ_DB_*` env vars or `CANNAIQ_DB_URL` from the user's environment
- Never hardcode connection strings in bash commands
- Never override env vars to bypass the user's DB setup
**If backend is not running:**
- Say: "Please run `./setup-local.sh`"
- Do NOT attempt to start it yourself
**If a dependency is missing:**
- Add it to `package.json`
- Say: "Please run `cd backend && npm install`"
- Do NOT try to solve it by starting a custom dev server
**The ONLY way to start local services:**
```bash
cd backend
./setup-local.sh
```
### 4. DEPLOYMENT AUTHORIZATION REQUIRED
**NEVER deploy to production unless the user explicitly says:**
> "CLAUDE — DEPLOYMENT IS NOW AUTHORIZED."
Until then:
- All work is LOCAL ONLY
- No `kubectl apply`, `docker push`, or remote operations
- No port-forwarding to production
- No connecting to Kubernetes clusters
### 5. DATABASE CONNECTION ARCHITECTURE
**Migration code is CLI-only. Runtime code must NOT import `src/db/migrate.ts`.**
| Module | Purpose | Import From |
|--------|---------|-------------|
| `src/db/migrate.ts` | CLI migrations only | **NEVER import at runtime** |
| `src/db/pool.ts` | Runtime database pool | `import { pool } from '../db/pool'` |
| `src/dutchie-az/db/connection.ts` | Canonical connection helper | Alternative for runtime |
**Runtime gets DB connections ONLY via:**
```typescript
import { pool } from '../db/pool';
// or
import { getPool } from '../dutchie-az/db/connection';
```
**To run migrations:**
```bash
cd backend
npx tsx src/db/migrate.ts
```
**Why this matters:**
- `migrate.ts` validates env vars strictly and throws at module load time
- Importing it at runtime causes startup crashes if env vars aren't perfect
- `pool.ts` uses lazy initialization - only validates when first query is made
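The difference between validating at module load and validating on first use can be sketched as follows. This is a minimal illustration, not the contents of `pool.ts`: `createLazy` and `buildConfig` are hypothetical names, and a plain config object stands in for the real pool.

```typescript
// Hypothetical helper: defer a factory that may throw until its first use.
function createLazy<T>(factory: () => T): () => T {
  let cached: { value: T } | undefined;
  return () => {
    if (!cached) {
      cached = { value: factory() }; // validation happens here, not at import time
    }
    return cached.value;
  };
}

// Stand-in for strict env validation (assumed shape, not the real pool config).
function buildConfig(env: { [key: string]: string | undefined }): { host: string } {
  const host = env.CANNAIQ_DB_HOST;
  if (!host) throw new Error('CANNAIQ_DB_HOST is not set');
  return { host };
}

// Creating the getter never throws, even with an empty environment.
// The error only surfaces when something actually calls getConfig().
const getConfig = createLazy(() => buildConfig({}));
```

A module that ran `buildConfig` at its top level would instead crash every process that imports it, which is the failure mode described above for `migrate.ts`.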
### 6. ALL API ROUTES REQUIRE AUTHENTICATION — NO EXCEPTIONS
**Every API router MUST apply `authMiddleware` at the router level.**
```typescript
import { authMiddleware } from '../auth/middleware';
const router = Router();
router.use(authMiddleware); // REQUIRED - first line after router creation
```
**Authentication flow (see `src/auth/middleware.ts`):**
1. Check Bearer token (JWT or API token) → grant access if valid
2. Check trusted origins (cannaiq.co, findadispo.com, localhost, etc.) → grant access
3. Check trusted IPs (127.0.0.1, ::1, internal pod IPs) → grant access
4. **Return 401 Unauthorized** if none of the above
**NEVER create API routes without auth middleware:**
- No "public" endpoints that bypass authentication
- No "read-only" exceptions
- No "analytics-only" exceptions
- If an endpoint exists under `/api/*`, it MUST be protected
**When creating new route files:**
1. Import `authMiddleware` from `../auth/middleware`
2. Add `router.use(authMiddleware)` immediately after creating the router
3. Document security requirements in file header comments
**Trusted origins (defined in middleware):**
- `https://cannaiq.co`
- `https://findadispo.com`
- `https://findagram.co`
- `*.cannabrands.app` domains
- `localhost:*` for development
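The order of checks in the authentication flow can be sketched as a pure decision function. This is an illustration under stated assumptions, not the code in `src/auth/middleware.ts`: `decideAccess`, `AuthInput`, and the two lists below are hypothetical, token verification is reduced to a boolean, and the wildcard origin is assumed to be HTTPS-only.

```typescript
type AccessDecision = 'token' | 'trusted-origin' | 'trusted-ip' | 'unauthorized';

interface AuthInput {
  hasValidBearerToken: boolean; // outcome of JWT / API token verification
  origin?: string;              // value of the Origin header, if any
  ip?: string;                  // remote address as seen by the server
}

// Assumed lists for illustration; the real ones live in the middleware.
const EXACT_ORIGINS = ['https://cannaiq.co', 'https://findadispo.com', 'https://findagram.co'];
const TRUSTED_IPS = ['127.0.0.1', '::1'];

function isTrustedOrigin(origin: string): boolean {
  if (EXACT_ORIGINS.indexOf(origin) !== -1) return true;
  if (/^https?:\/\/localhost(:\d+)?$/.test(origin)) return true; // localhost:*
  return /^https:\/\/([a-z0-9-]+\.)+cannabrands\.app$/i.test(origin); // *.cannabrands.app
}

function decideAccess(input: AuthInput): AccessDecision {
  if (input.hasValidBearerToken) return 'token';
  if (input.origin !== undefined && isTrustedOrigin(input.origin)) return 'trusted-origin';
  if (input.ip !== undefined && TRUSTED_IPS.indexOf(input.ip) !== -1) return 'trusted-ip';
  return 'unauthorized'; // the caller responds with 401
}
```

Anchoring the patterns with `^` and `$` matters here: an origin such as `https://cannaiq.co.evil.example` must not pass as trusted.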
### 7. LOCAL DEVELOPMENT BY DEFAULT
**Quick Start:**
```bash
./setup-local.sh
```
**Services (all started by setup-local.sh):**
| Service | URL | Purpose |
|---------|-----|---------|
| PostgreSQL | localhost:54320 | cannaiq-postgres container |
| Backend API | http://localhost:3010 | Express API server |
| CannaiQ Admin | http://localhost:8080/admin | B2B admin dashboard |
| FindADispo | http://localhost:3001 | Consumer dispensary finder |
| Findagram | http://localhost:3002 | Consumer delivery marketplace |
**In local mode:**
- Use `docker-compose.local.yml` (NO MinIO)
- Use local filesystem storage at `./storage`
- Connect to `cannaiq-postgres` at `localhost:54320`
- Backend runs at `localhost:3010`
- All three frontends run on separate ports (8080, 3001, 3002)
- NO remote connections, NO Kubernetes, NO MinIO
**Environment:**
- All DB config is in `backend/.env`
- STORAGE_DRIVER=local
- STORAGE_BASE_PATH=./storage
**Local Admin Bootstrap:**
```bash
cd backend
npx tsx src/scripts/bootstrap-local-admin.ts
```
Creates/resets a deterministic local admin user:
| Field | Value |
|-------|-------|
| Email | `admin@local.test` |
| Password | `admin123` |
| Role | `superadmin` |
This is a LOCAL-DEV helper only. Never use these credentials in production.
**Manual startup (if not using setup-local.sh):**
```bash
# Terminal 1: Start PostgreSQL
docker-compose -f docker-compose.local.yml up -d
# Terminal 2: Start Backend
cd backend && npm run dev
# Terminal 3: Start Frontend
cd cannaiq && npm run dev:admin
```
**Stop services:**
```bash
./stop-local.sh
```
---
## DATABASE MODEL (CRITICAL)
### Database Architecture
CannaiQ has **TWO databases** with distinct purposes:
| Database | Purpose | Access |
|----------|---------|--------|
| `dutchie_menus` | **Canonical CannaiQ database** - All schema, migrations, and application data | READ/WRITE |
| `dutchie_legacy` | **Legacy read-only archive** - Historical data from old system | READ-ONLY |
### Store vs Dispensary Terminology
**"Store" and "Dispensary" are SYNONYMS in CannaiQ.**
| Term | Usage | DB Table |
|------|-------|----------|
| Store | API routes (`/api/stores`) | `dispensaries` |
| Dispensary | DB table, internal code | `dispensaries` |
- `/api/stores` and `/api/dispensaries` both query the `dispensaries` table
- There is NO `stores` table in use - it's a legacy empty table
- Use these terms interchangeably in code and documentation
### Canonical vs Legacy Tables
**CANONICAL TABLES (USE THESE):**
| Table | Purpose | Row Count |
|-------|---------|-----------|
| `dispensaries` | Store/dispensary records | ~188+ rows |
| `store_products` | Product catalog | ~37,000+ rows |
| `store_product_snapshots` | Price/stock history | ~millions |
**LEGACY TABLES (EMPTY - DO NOT USE):**
| Table | Status | Action |
|-------|--------|--------|
| `stores` | EMPTY (0 rows) | Use `dispensaries` instead |
| `products` | EMPTY (0 rows) | Use `store_products` instead |
| `dutchie_products` | LEGACY (0 rows) | Use `store_products` instead |
| `dutchie_product_snapshots` | LEGACY (0 rows) | Use `store_product_snapshots` instead |
| `categories` | EMPTY (0 rows) | Categories stored in product records |
**Code must NEVER:**
- Query the `stores` table (use `dispensaries`)
- Query the `products` table (use `store_products`)
- Query the `dutchie_products` table (use `store_products`)
- Query the `categories` table (categories are in product records)
**CRITICAL RULES:**
- **Migrations ONLY run on `dutchie_menus`** - NEVER on `dutchie_legacy`
- **Application code connects ONLY to `dutchie_menus`**
- **ETL scripts READ from `dutchie_legacy`, WRITE to `dutchie_menus`**
- `dutchie_legacy` is frozen - NO writes, NO schema changes, NO migrations
### Environment Variables
**CannaiQ Database (dutchie_menus) - PRIMARY:**
```bash
# All application/migration DB access uses these env vars:
CANNAIQ_DB_HOST=localhost # Database host
CANNAIQ_DB_PORT=54320 # Database port
CANNAIQ_DB_NAME=dutchie_menus # MUST be dutchie_menus
CANNAIQ_DB_USER=dutchie # Database user
CANNAIQ_DB_PASS=<password> # Database password
# OR use a full connection string:
CANNAIQ_DB_URL=postgresql://user:pass@host:port/dutchie_menus
```
**Legacy Database (dutchie_legacy) - ETL ONLY:**
```bash
# Only used by ETL scripts for reading legacy data:
LEGACY_DB_HOST=localhost
LEGACY_DB_PORT=54320
LEGACY_DB_NAME=dutchie_legacy # READ-ONLY - never migrated
LEGACY_DB_USER=dutchie
LEGACY_DB_PASS=<password>
# OR use a full connection string:
LEGACY_DB_URL=postgresql://user:pass@host:port/dutchie_legacy
```
**Key Rules:**
- `CANNAIQ_DB_NAME` MUST be `dutchie_menus` for application/migrations
- `LEGACY_DB_NAME` is `dutchie_legacy` - READ-ONLY for ETL only
- ALL application code MUST use `CANNAIQ_DB_*` environment variables
- No hardcoded database names anywhere in the codebase
- `backend/.env` controls all database access for local development
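One way to read these variables without hardcoding a database name is sketched below. The function name `resolveDbConfig`, the `DbConfig` shape, and the precedence of `CANNAIQ_DB_URL` over the discrete variables are assumptions for illustration; the project's actual pool module may behave differently.

```typescript
type DbConfig =
  | { connectionString: string }
  | { host: string; port: number; database: string; user: string; password: string };

type Env = { [key: string]: string | undefined };

function resolveDbConfig(env: Env): DbConfig {
  // Assumption: a full URL, when present, takes precedence over discrete vars.
  const url = env.CANNAIQ_DB_URL;
  if (url) return { connectionString: url };

  const names = ['CANNAIQ_DB_HOST', 'CANNAIQ_DB_PORT', 'CANNAIQ_DB_NAME', 'CANNAIQ_DB_USER', 'CANNAIQ_DB_PASS'];
  const missing = names.filter((name) => !env[name]);
  if (missing.length > 0) {
    throw new Error(`Missing database env vars: ${missing.join(', ')}`);
  }

  const port = parseInt(env.CANNAIQ_DB_PORT as string, 10);
  if (isNaN(port)) throw new Error('CANNAIQ_DB_PORT must be a number');

  // The database name comes only from the environment, never from a literal.
  return {
    host: env.CANNAIQ_DB_HOST as string,
    port,
    database: env.CANNAIQ_DB_NAME as string,
    user: env.CANNAIQ_DB_USER as string,
    password: env.CANNAIQ_DB_PASS as string,
  };
}
```

In application code the argument would be `process.env`; taking it as a parameter keeps the function easy to test.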
**State Modeling:**
- States (AZ, MI, CA, NV, etc.) are modeled via `states` table + `state_id` on dispensaries
- NO separate databases per state
- Use `state_code` or `state_id` columns for filtering
### Migration and ETL Procedure
**Step 1: Run schema migration (on dutchie_menus ONLY):**
```bash
cd backend
psql "postgresql://dutchie:dutchie_local_pass@localhost:54320/dutchie_menus" \
-f migrations/041_cannaiq_canonical_schema.sql
```
**Step 2: Run ETL to copy legacy data:**
```bash
cd backend
npx tsx src/scripts/etl/042_legacy_import.ts
# Reads from dutchie_legacy, writes to dutchie_menus
```
### Database Access Rules
**Claude MUST NOT:**
- Connect to any database besides the canonical CannaiQ database
- Use raw connection strings in shell commands
- Run `psql` commands directly
- Construct database URLs manually
- Create or rename databases automatically
- Run `npm run migrate` without explicit user authorization
- Patch schema at runtime (no ALTER TABLE from scripts)
**All data access MUST go through:**
- LOCAL CannaiQ backend HTTP API endpoints
- Internal CannaiQ application code (using canonical connection pool)
- Ask user to run SQL manually if absolutely needed
**Local service management:**
- User starts services via `./setup-local.sh` (ONLY the user runs this)
- If port 3010 responds, assume backend is running
- If port 3010 does NOT respond, tell user: "Backend is not running; please run `./setup-local.sh`"
- Claude may only access the app via HTTP: `http://localhost:3010` (API), `http://localhost:8080/admin` (UI)
- Never restart, kill, or manage local processes — that is the user's responsibility
### Migrations
**Rules:**
- Migrations may be WRITTEN but only the USER runs them after review
- Never execute migrations automatically
- Only additive migrations (no DROP/DELETE)
- Write schema-tolerant code that handles missing optional columns
**If schema changes are needed:**
1. Generate a proper migration file in `backend/migrations/*.sql`
2. Show the migration to the user
3. Wait for explicit authorization before running
4. Never run migrations automatically - only the user runs them after review
**Schema tolerance:**
- If a column is missing at runtime, prefer making the code tolerant (treat field as optional) instead of auto-creating the column
- Queries should gracefully handle missing columns by omitting them or using NULL defaults
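As a review aid for the additive-only rule, a migration file's text can be scanned for destructive statements before it is shown to the user. The helper below is a hypothetical sketch, not an existing project tool; it is a plain text check that strips `--` line comments but does not handle block comments or string literals.

```typescript
// Hypothetical patterns for statements a migration must not contain.
const DESTRUCTIVE_PATTERNS: { label: string; pattern: RegExp }[] = [
  { label: 'DROP', pattern: /\bDROP\s+(TABLE|COLUMN|SCHEMA|DATABASE|INDEX|VIEW)\b/i },
  { label: 'DELETE', pattern: /\bDELETE\s+FROM\b/i },
  { label: 'TRUNCATE', pattern: /\bTRUNCATE\b/i },
];

function findDestructiveStatements(sql: string): string[] {
  // Strip line comments so "-- never DROP TABLE" is not reported.
  const withoutComments = sql.replace(/--.*$/gm, '');
  return DESTRUCTIVE_PATTERNS
    .filter((entry) => entry.pattern.test(withoutComments))
    .map((entry) => entry.label);
}
```

An empty result does not prove a migration is safe; it only catches the obvious cases before human review.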
### Canonical Schema Migration (041/042)
**Migration 041** (`backend/migrations/041_cannaiq_canonical_schema.sql`):
- Creates canonical CannaiQ tables: `states`, `chains`, `brands`, `store_products`, `store_product_snapshots`, `crawl_runs`
- Adds `state_id` and `chain_id` columns to `dispensaries`
- Adds status columns to `dispensary_crawler_profiles`
- SCHEMA ONLY - no data inserts from legacy tables
**ETL Script 042** (`backend/src/scripts/etl/042_legacy_import.ts`):
- Copies data from legacy `dutchie_legacy.dutchie_products` → `store_products`
- Copies data from legacy `dutchie_legacy.dutchie_product_snapshots` → `store_product_snapshots`
- Extracts brands from product data into `brands` table
- Links dispensaries to chains and states
- INSERT-ONLY and IDEMPOTENT (uses ON CONFLICT DO NOTHING)
- Run manually: `cd backend && npx tsx src/scripts/etl/042_legacy_import.ts`
**Tables touched by ETL:**
| Source Table (dutchie_legacy) | Target Table (dutchie_menus) |
|-------------------------------|------------------------------|
| `dutchie_products` | `store_products` |
| `dutchie_product_snapshots` | `store_product_snapshots` |
| (brand names extracted) | `brands` |
| (state codes mapped) | `dispensaries.state_id` |
| (chain names matched) | `dispensaries.chain_id` |
**Note:** The legacy `dutchie_products` and `dutchie_product_snapshots` tables in `dutchie_legacy` are read-only sources. All new crawl data goes directly to `store_products` and `store_product_snapshots`.
**Migration 045** (`backend/migrations/045_add_image_columns.sql`):
- Adds `thumbnail_url` to `store_products` and `store_product_snapshots`
- `image_url` already exists from migration 041
- ETL 042 populates `image_url` from legacy `primary_image_url` where present
- `thumbnail_url` is NULL for legacy data - future crawls can populate it
### Deprecated Connection Module
The custom connection module at `src/dutchie-az/db/connection` is **DEPRECATED**.
**All code using `getClient` from this module must be refactored to:**
- Use the CannaiQ API endpoints instead
- Use the orchestrator through the API
- Use the canonical DB pool from the main application
---
## PERFORMANCE REQUIREMENTS
**Database Queries:**
- NEVER write N+1 queries - always batch fetch related data before iterating
- NEVER run queries inside loops - batch them before the loop
- Avoid multiple queries when one JOIN or subquery works
- Dashboard/index pages should use MAX 5-10 queries total, not 50+
- Mentally trace query count - if a page would run 20+ queries, refactor
- Cache expensive aggregations (in-memory or Redis, 5-min TTL) instead of recalculating every request
- Use query logging during development to verify query count
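The in-memory variant of the aggregation cache can be sketched as a small TTL cache. The name `createTtlCache` and its interface are assumptions for illustration; the clock is injectable so that expiry can be tested without waiting.

```typescript
interface CacheEntry<T> {
  value: T;
  expiresAt: number; // epoch milliseconds
}

function createTtlCache<T>(ttlMs: number, now: () => number = Date.now) {
  const entries: { [key: string]: CacheEntry<T> } = Object.create(null);
  return {
    // Return the cached value, or compute and store it when absent or expired.
    getOrCompute(key: string, compute: () => T): T {
      const hit = entries[key];
      if (hit && hit.expiresAt > now()) return hit.value;
      const value = compute();
      entries[key] = { value, expiresAt: now() + ttlMs };
      return value;
    },
  };
}
```

For the 5-minute TTL mentioned above, `ttlMs` would be `5 * 60 * 1000`. When the computation is an async query, `T` can be a promise type so that concurrent requests share one in-flight query, though a rejected promise would then stay cached until it expires.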
**Before submitting route/controller code, verify:**
1. No queries inside `forEach`/`map`/`for` loops
2. All related data fetched in batches before iteration
3. Aggregations done in SQL (`COUNT`, `SUM`, `AVG`, `GROUP BY`), not in JS
4. **Would this cause a 503 under load? If unsure, simplify.**
**Examples of BAD patterns:**
```typescript
// BAD: N+1 query - runs a query for each store
const stores = await getStores();
for (const store of stores) {
  store.products = await getProductsByStoreId(store.id); // N queries!
}

// BAD: Query inside map
const results = await Promise.all(
  storeIds.map(id => pool.query('SELECT * FROM products WHERE store_id = $1', [id]))
);
```
**Examples of GOOD patterns:**
```typescript
// NOTE: `stores` / `products` are illustrative table names; real queries
// target the canonical `dispensaries` and `store_products` tables.
// `groupBy` is any group-by-key helper (for example lodash's `groupBy`).

// GOOD: Batch fetch all products, then group in JS
const stores = await getStores();
const storeIds = stores.map(s => s.id);
const allProducts = await pool.query(
  'SELECT * FROM products WHERE store_id = ANY($1)', [storeIds]
);
const productsByStore = groupBy(allProducts.rows, 'store_id');
stores.forEach(s => s.products = productsByStore[s.id] || []);

// GOOD: Single query with JOIN
const result = await pool.query(`
  SELECT s.*, COUNT(p.id) as product_count
  FROM stores s
  LEFT JOIN products p ON p.store_id = s.id
  GROUP BY s.id
`);
```
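The batch-then-group pattern relies on a `groupBy` helper that the snippet does not define. A minimal version is sketched below as an assumption about what such a helper does; lodash's `groupBy` would serve equally well.

```typescript
// Group rows by the stringified value of one property.
function groupBy<T, K extends keyof T>(rows: T[], key: K): { [group: string]: T[] } {
  const groups: { [group: string]: T[] } = {};
  for (const row of rows) {
    const group = String(row[key]);
    if (!groups[group]) groups[group] = [];
    groups[group].push(row);
  }
  return groups;
}
```

Because the keys are strings, a lookup such as `productsByStore[s.id]` works for numeric IDs as well: JavaScript converts the index to a string before the lookup.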
---
## FORBIDDEN ACTIONS
1. **Deleting any data** (products, snapshots, images, logs, traces)
2. **Deploying without explicit authorization**
3. **Connecting to Kubernetes** without authorization
4. **Port-forwarding to production** without authorization
5. **Starting MinIO** in local development
6. **Using S3/MinIO SDKs** when `STORAGE_DRIVER=local`
7. **Automating cleanup** of any kind
8. **Dropping database tables or columns**
9. **Overwriting historical records** (always append snapshots)
10. **Runtime schema patching** (ALTER TABLE from scripts)
11. **Using `getClient` from deprecated connection module**
12. **Creating ad-hoc database connections** outside the canonical pool
13. **Auto-adding missing columns** at runtime
14. **Killing local processes** (`pkill`, `kill`, `kill -9`, etc.)
15. **Starting backend/frontend directly** with custom env vars
16. **Running `lsof -ti:PORT | xargs kill`** or similar process-killing commands
17. **Using hardcoded database names** in code or comments
18. **Creating or connecting to a second database**
19. **Creating API routes without authMiddleware** (all `/api/*` routes MUST be protected)
---
## STORAGE BEHAVIOR
### Local Storage Structure
```
/storage/images/products/{state}/{store}/{brand}/{product}/
  image-{hash}.webp
/storage/images/brands/{brand}/
  logo-{hash}.webp
```
### Image Proxy API (On-Demand Resizing)
Images are stored at full resolution and resized on-demand via the `/img` endpoint.
**Endpoint:** `GET /img/<path>?<params>`
**Parameters:**
| Param | Description | Example |
|-------|-------------|---------|
| `w` | Width in pixels (max 4000) | `?w=200` |
| `h` | Height in pixels (max 4000) | `?h=200` |
| `q` | Quality 1-100 (default 80) | `?q=70` |
| `fit` | Resize mode: cover, contain, fill, inside, outside | `?fit=cover` |
| `blur` | Blur sigma 0.3-1000 | `?blur=5` |
| `gray` | Grayscale (1 = enabled) | `?gray=1` |
| `format` | Output: webp, jpeg, png, avif (default webp) | `?format=jpeg` |
**Examples:**
```bash
# Thumbnail (50px)
GET /img/products/az/store/brand/product/image-abc123.webp?w=50
# Card image (200px, cover fit)
GET /img/products/az/store/brand/product/image-abc123.webp?w=200&h=200&fit=cover
# JPEG at 70% quality
GET /img/products/az/store/brand/product/image-abc123.webp?w=400&format=jpeg&q=70
# Grayscale blur
GET /img/products/az/store/brand/product/image-abc123.webp?w=200&gray=1&blur=3
```
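The parameter table can be turned into a small parsing step, sketched below. This is not the code in `backend/src/routes/image-proxy.ts`: `parseImageParams` is a hypothetical name, and the handling of bad input (clamping out-of-range sizes, ignoring unknown `fit` values or out-of-range `blur`) is an assumption, since the source only states ranges and defaults.

```typescript
interface ImageParams {
  w?: number;
  h?: number;
  q: number;       // default 80
  fit?: string;    // no default assumed; the source does not state one
  blur?: number;
  gray: boolean;
  format: string;  // default webp
}

type Query = { [key: string]: string | undefined };

const FIT_MODES = ['cover', 'contain', 'fill', 'inside', 'outside'];
const FORMATS = ['webp', 'jpeg', 'png', 'avif'];

// Parse a non-negative integer and clamp it into [min, max].
function intInRange(raw: string | undefined, min: number, max: number): number | undefined {
  if (raw === undefined || !/^\d+$/.test(raw)) return undefined;
  return Math.min(Math.max(parseInt(raw, 10), min), max);
}

function oneOf(raw: string | undefined, allowed: string[]): string | undefined {
  return raw !== undefined && allowed.indexOf(raw) !== -1 ? raw : undefined;
}

function parseImageParams(query: Query): ImageParams {
  const blur = query.blur === undefined ? NaN : parseFloat(query.blur);
  const q = intInRange(query.q, 1, 100);
  return {
    w: intInRange(query.w, 1, 4000),
    h: intInRange(query.h, 1, 4000),
    q: q === undefined ? 80 : q,
    fit: oneOf(query.fit, FIT_MODES),
    blur: blur >= 0.3 && blur <= 1000 ? blur : undefined, // NaN fails both checks
    gray: query.gray === '1',
    format: oneOf(query.format, FORMATS) || 'webp',
  };
}
```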
**Frontend Usage:**
```typescript
import { getImageUrl, ImageSizes } from '../lib/images';
// Returns /img/products/.../image.webp?w=50 for local images
// Returns original URL for remote images (CDN, etc.)
const thumbUrl = getImageUrl(product.image_url, ImageSizes.thumb);
const cardUrl = getImageUrl(product.image_url, ImageSizes.medium);
const detailUrl = getImageUrl(product.image_url, ImageSizes.detail);
```
**Size Presets:**
| Preset | Width | Use Case |
|--------|-------|----------|
| `thumb` | 50px | Table thumbnails |
| `small` | 100px | Small cards |
| `medium` | 200px | Grid cards |
| `large` | 400px | Large cards |
| `detail` | 600px | Product detail |
| `full` | - | No resize |
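The preset table and the local-versus-remote behaviour described above can be sketched as follows. This is not the code in `cannaiq/src/lib/images.ts`: `buildImageUrl` and `SIZE_PRESETS` are hypothetical names, and the mapping from a stored `/images/...` path to the `/img/...` proxy path is an assumption.

```typescript
// Widths in pixels, taken from the preset table; `full` means no resize.
const SIZE_PRESETS: { [preset: string]: number | undefined } = {
  thumb: 50,
  small: 100,
  medium: 200,
  large: 400,
  detail: 600,
  full: undefined,
};

function buildImageUrl(imageUrl: string | null | undefined, width?: number): string | null {
  if (!imageUrl) return null;
  // Remote images (CDN, etc.) are returned untouched.
  if (/^https?:\/\//i.test(imageUrl)) return imageUrl;
  // Assumption: local images are stored as /images/... and proxied at /img/...
  const path = imageUrl.replace(/^\/images\//, '/img/');
  return width === undefined ? path : `${path}?w=${width}`;
}
```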
### Storage Adapter
```typescript
import { saveImage, getImageUrl } from '../utils/storage-adapter';
// Automatically uses local storage when STORAGE_DRIVER=local
```
### Files
| File | Purpose |
|------|---------|
| `backend/src/utils/image-storage.ts` | Image download and storage |
| `backend/src/routes/image-proxy.ts` | On-demand image resizing endpoint |
| `cannaiq/src/lib/images.ts` | Frontend image URL helper |
| `docker-compose.local.yml` | Local stack without MinIO |
| `start-local.sh` | Convenience startup script |
---
## UI ANONYMIZATION RULES
- No vendor names in forward-facing URLs
- No "dutchie", "treez", "jane", "weedmaps", "leafly" visible in consumer UIs
- Internal admin tools may show provider names for debugging
---
## DUTCHIE DISCOVERY PIPELINE (Added 2025-01)
### Overview
Automated discovery of Dutchie-powered dispensaries across all US states.
### Flow
```
1. getAllCitiesByState GraphQL → Get all cities for a state
2. ConsumerDispensaries GraphQL → Get stores for each city
3. Upsert to dutchie_discovery_locations (keyed by platform_location_id)
4. AUTO-VALIDATE: Check required fields
5. AUTO-PROMOTE: Create/update dispensaries with crawl_enabled=true
6. Log all actions to dutchie_promotion_log
```
### Tables
| Table | Purpose |
|-------|---------|
| `dutchie_discovery_cities` | Cities known to have dispensaries |
| `dutchie_discovery_locations` | Raw discovered store data |
| `dispensaries` | Canonical stores (promoted from discovery) |
| `dutchie_promotion_log` | Audit trail for validation/promotion |
### Files
| File | Purpose |
|------|---------|
| `src/discovery/discovery-crawler.ts` | Main orchestrator |
| `src/discovery/location-discovery.ts` | GraphQL fetching |
| `src/discovery/promotion.ts` | Validation & promotion logic |
| `src/scripts/run-discovery.ts` | CLI interface |
| `migrations/067_promotion_log.sql` | Audit log table |
### GraphQL Hashes (in `src/platforms/dutchie/client.ts`)
| Query | Hash |
|-------|------|
| `GetAllCitiesByState` | `ae547a0466ace5a48f91e55bf6699eacd87e3a42841560f0c0eabed5a0a920e6` |
| `ConsumerDispensaries` | `0a5bfa6ca1d64ae47bcccb7c8077c87147cbc4e6982c17ceec97a2a4948b311b` |
### Usage
```bash
# Discover all stores in a state
npx tsx src/scripts/run-discovery.ts discover:state AZ
npx tsx src/scripts/run-discovery.ts discover:state CA
# Check stats
npx tsx src/scripts/run-discovery.ts stats
```
### Validation Rules
A discovery location must have:
- `platform_location_id` (MongoDB ObjectId, 24 hex chars)
- `name`
- `city`
- `state_code`
- `platform_menu_url`
Invalid records are marked `status='rejected'` with errors logged.
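These validation rules can be sketched as a function that returns a list of errors, where an empty list means the record may be promoted. `validateDiscoveryLocation` and the error wording are hypothetical; the actual logic lives in `src/discovery/promotion.ts` and may differ.

```typescript
interface DiscoveryLocation {
  platform_location_id?: string | null;
  name?: string | null;
  city?: string | null;
  state_code?: string | null;
  platform_menu_url?: string | null;
}

// MongoDB ObjectId: exactly 24 hexadecimal characters.
const OBJECT_ID = /^[0-9a-f]{24}$/i;

function validateDiscoveryLocation(loc: DiscoveryLocation): string[] {
  const errors: string[] = [];
  if (!loc.platform_location_id || !OBJECT_ID.test(loc.platform_location_id)) {
    errors.push('platform_location_id must be a 24-character hex ObjectId');
  }
  const requiredText: (keyof DiscoveryLocation)[] = ['name', 'city', 'state_code', 'platform_menu_url'];
  for (const field of requiredText) {
    const value = loc[field];
    if (!value || value.trim() === '') errors.push(`${field} is required`);
  }
  return errors;
}
```

A caller would mark the record `status='rejected'` and log the returned errors whenever the list is non-empty.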
### Key Design Decisions
- `platform_location_id` MUST be MongoDB ObjectId (not slug)
- Old geo-based discovery stored slugs → deleted as garbage data
- Rate limit: 2 seconds between city requests to avoid API throttling
- Promotion is idempotent via `ON CONFLICT (platform_dispensary_id)`
---
## FUTURE TODO / PENDING FEATURES
- [ ] Orchestrator observability dashboard
- [ ] Crawl profile management UI
- [ ] State machine sandbox (disabled until authorized)
- [ ] Multi-state expansion beyond AZ
---
### Multi-Site Architecture (CRITICAL)
This project has **4 active locations** (plus 1 deprecated) - always clarify which one before making changes:
| Folder | Domain | Type | Purpose |
|--------|--------|------|---------|
| `backend/` | (shared) | Express API | Single backend serving all frontends |
| `frontend/` | (DEPRECATED) | React SPA (Vite) | DEPRECATED - was dispos.crawlsy.com, now removed |
| `cannaiq/` | cannaiq.co | React SPA + PWA | Admin dashboard / B2B analytics |
| `findadispo/` | findadispo.com | React SPA + PWA | Consumer dispensary finder |
| `findagram/` | findagram.co | React SPA + PWA | Consumer delivery marketplace |
**NOTE: `frontend/` folder is DEPRECATED:**
- `frontend/` = OLD/legacy dashboard - NO LONGER DEPLOYED (removed from k8s)
- `cannaiq/` = Primary admin dashboard, deployed to `cannaiq.co`
- Do NOT use or modify `frontend/` folder - it will be archived/removed
**Before any frontend work, ASK: "Which site? cannaiq, findadispo, or findagram?"**
All three active frontends share:
- Same backend API (port 3010)
- Same PostgreSQL database
- Same Kubernetes deployment for backend
Each frontend has:
- Its own folder, package.json, Dockerfile
- Its own domain and branding
- Its own PWA manifest and service worker (cannaiq, findadispo, findagram)
- Separate Docker containers in production
---
### Multi-Domain Hosting Architecture
All three frontends are served from the **same IP** using **host-based routing**:
**Kubernetes Ingress (Production):**
```yaml
# Each domain routes to its own frontend service
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: multi-site-ingress
spec:
  rules:
    - host: cannaiq.co
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: cannaiq-frontend
                port:
                  number: 80
          - path: /api
            pathType: Prefix
            backend:
              service:
                name: scraper # shared backend
                port:
                  number: 3010
    - host: findadispo.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: findadispo-frontend
                port:
                  number: 80
          - path: /api
            pathType: Prefix
            backend:
              service:
                name: scraper
                port:
                  number: 3010
    - host: findagram.co
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: findagram-frontend
                port:
                  number: 80
          - path: /api
            pathType: Prefix
            backend:
              service:
                name: scraper
                port:
                  number: 3010
```
**Key Points:**
- DNS A records for all 3 domains point to same IP
- Ingress controller routes based on `Host` header
- Each frontend is a separate Docker container (nginx serving static files)
- All frontends share the same backend API at `/api/*`
- SSL/TLS handled at ingress level (cert-manager)
---
### PWA Setup Requirements
Each frontend is a **Progressive Web App (PWA)**. Required files in each `public/` folder:
1. **manifest.json** - App metadata, icons, theme colors
2. **service-worker.js** - Offline caching, background sync
3. **Icons** - 192x192 and 512x512 PNG icons
**Vite PWA Plugin Setup** (in each frontend's vite.config.ts):
```typescript
import { defineConfig } from 'vite'
import react from '@vitejs/plugin-react' // assumes the standard React plugin
import { VitePWA } from 'vite-plugin-pwa'

export default defineConfig({
  plugins: [
    react(),
    VitePWA({
      registerType: 'autoUpdate',
      manifest: {
        name: 'Site Name',
        short_name: 'Short',
        theme_color: '#10b981',
        icons: [
          { src: '/icon-192.png', sizes: '192x192', type: 'image/png' },
          { src: '/icon-512.png', sizes: '512x512', type: 'image/png' }
        ]
      },
      workbox: {
        // Precache built assets so the app shell works offline
        globPatterns: ['**/*.{js,css,html,ico,png,svg,woff2}']
      }
    })
  ]
})
```
---
### Core Rules Summary
- **DB**: Use the single CannaiQ database via `CANNAIQ_DB_*` env vars. No hardcoded names.
- **Images**: No MinIO. Save to local /images/products/<disp>/<prod>-<hash>.webp (and brands); preserve original URL; serve via backend static.
- **Dutchie GraphQL**: Endpoint https://dutchie.com/api-3/graphql. Variables must use productsFilter.dispensaryId (platform_dispensary_id). **CRITICAL: Use `Status: 'Active'`, NOT `null`** (null returns 0 products).
- **cName/slug**: Derive cName from each store's menu_url (/embedded-menu/<cName> or /dispensary/<slug>). No hardcoded defaults.
- **Batch DB writes**: Chunk products/snapshots/missing marks into batches of 100-200 items to avoid OOM.
- **API/Frontend**: Use `/api/stores`, `/api/products`, `/api/workers`, `/api/pipeline` endpoints.
- **Scheduling**: Crawl only menu_type='dutchie' AND platform_dispensary_id IS NOT NULL. 4-hour crawl with jitter.
- **THC/CBD values**: Clamp to ≤100 - some products report milligrams as percentages.
- **Column names**: Use `name_raw`, `brand_name_raw`, `category_raw`, `subcategory_raw` (NOT `name`, `brand_name`, etc.)
- **Monitor**: `/api/workers` shows active/recent jobs from job queue.
- **No slug guessing**: Never use defaults. Always derive per store from menu_url and resolve platform IDs per location.
**📖 Full Documentation: See `docs/DUTCHIE_CRAWL_WORKFLOW.md` for complete pipeline documentation.**
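The THC/CBD clamp from the summary above is small enough to show in full. `clampPercent` is a hypothetical name, and the lower bound of 0 and the `null` result for missing values are assumptions; the source only specifies the upper bound of 100.

```typescript
// Clamp a reported percentage into [0, 100]; missing or NaN values become null.
function clampPercent(value: number | null | undefined): number | null {
  if (value === null || value === undefined || isNaN(value)) return null;
  return Math.min(Math.max(value, 0), 100);
}
```

A feed value of `850` (milligrams reported as a percentage) is therefore stored as `100` rather than rejected.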
---
### Detailed Rules
1) **Dispensary = Store (SAME THING)**
- "Dispensary" and "store" are synonyms in CannaiQ. Use interchangeably.
- **API endpoint**: `/api/stores` (NOT `/api/dispensaries`)
- **DB table**: `dispensaries`
- When you need to create/query stores via API, use `/api/stores`
- Use the record's `menu_url` and `platform_dispensary_id`.
2) **API Authentication**
- **Trusted Origins (no auth needed)**:
- IPs: `127.0.0.1`, `::1`, `::ffff:127.0.0.1`
- Origins: `https://cannaiq.co`, `https://findadispo.com`, `https://findagram.co`
- Also: `http://localhost:3010`, `http://localhost:8080`, `http://localhost:5173`
- Requests from trusted IPs/origins get automatic admin access (`role: 'internal'`)
- **Remote (non-trusted)**: Use Bearer token (JWT or API token). NO username/password auth.
- Never try to login with username/password via API - use tokens only.
- See `src/auth/middleware.ts` for `TRUSTED_ORIGINS` and `TRUSTED_IPS` lists.
3) **Menu detection and platform IDs**
- Set `menu_type` from `menu_url` detection; resolve `platform_dispensary_id` for `menu_type='dutchie'`.
- Admin should have "refresh detection" and "resolve ID" actions; schedule/crawl only when `menu_type='dutchie'` AND `platform_dispensary_id` is set.
4) **Queries and mapping**
- The DB returns snake_case; code expects camelCase. Always alias/map:
- `platform_dispensary_id AS "platformDispensaryId"`
- Map via `mapDbRowToDispensary` when loading dispensaries (scheduler, crawler, admin crawl).
- Avoid `SELECT *`; explicitly select and/or map fields.
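The snake_case to camelCase mapping, together with the crawlability condition from rule 3, can be sketched as below. The real mapper is `mapDbRowToDispensary`, whose full field list is not shown here; `toDispensary`, `isCrawlable`, and the reduced row shape are assumptions for illustration.

```typescript
// Reduced row shape: only columns named in this document.
interface DispensaryRow {
  id: number;
  menu_type: string | null;
  menu_url: string | null;
  platform_dispensary_id: string | null;
}

interface Dispensary {
  id: number;
  menuType: string | null;
  menuUrl: string | null;
  platformDispensaryId: string | null;
}

function toDispensary(row: DispensaryRow): Dispensary {
  return {
    id: row.id,
    menuType: row.menu_type,
    menuUrl: row.menu_url,
    platformDispensaryId: row.platform_dispensary_id,
  };
}

// Crawl only when menu_type is 'dutchie' AND a platform ID has been resolved.
function isCrawlable(dispensary: Dispensary): boolean {
  return dispensary.menuType === 'dutchie' && !!dispensary.platformDispensaryId;
}
```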
4) **Scheduling**
- `/scraper-schedule` should accept filters/search (All vs AZ-only, name).
- "Run Now"/scheduler must skip or warn if `menu_type!='dutchie'` or `platform_dispensary_id` missing.
- Use `dispensary_crawl_status` view; show reason when not crawlable.
5) **Crawling**
- Trigger dutchie crawls by dispensary ID via the `/api/pipeline` or `/api/admin/orchestrator` routes (see rule 16; the legacy `POST /api/admin/crawl/:id` does NOT exist).
- Update existing products (by stable product ID), append snapshots for history (every 4h cadence), download images locally (`/images/...`), store local URLs.
- Use dutchie GraphQL pipeline only for `menu_type='dutchie'`.
6) **Frontend**
- Forward-facing URLs should not contain vendor names.
- `/scraper-schedule`: add filters/search, keep as master view for all schedules; reflect platform ID/menu_type status and controls.
7) **No slug guessing**
- Do not guess slugs; use the DB record's `menu_url` and ID. Resolve platform ID from the URL/cName; if set, crawl directly by ID.
8) **Image storage (no MinIO)**
- Save images to local filesystem only. Do not create or use MinIO in Docker.
- Product images: `/images/products/<dispensary_id>/<product_id>-<hash>.webp` (+medium/+thumb).
- Brand images: `/images/brands/<brand_slug_or_sku>-<hash>.webp`.
- Store local URLs in DB fields (keep original URLs as fallback only).
- Serve `/images` via backend static middleware.
9) **Dutchie GraphQL fetch rules**
- **Endpoint**: `https://dutchie.com/api-3/graphql`
- **Variables**: Use `productsFilter.dispensaryId` = `platform_dispensary_id` (MongoDB ObjectId).
- **Mode A**: `Status: "Active"` - returns active products with pricing
- **Mode B**: `Status: null` / `activeOnly: false` - returns all products including OOS/inactive
- **Headers** (server-side axios only): Chrome UA, `Origin: https://dutchie.com`, `Referer: https://dutchie.com/embedded-menu/<cName>`.
10) **Batch DB writes to avoid OOM**
- Do NOT build one giant upsert/insert payload for products/snapshots/missing marks.
- Chunk arrays (e.g., 100-200 items) and upsert/insert in a loop; drop references after each chunk.
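The chunking step can be sketched with a small generic helper. `chunk` is a hypothetical name, and the write call mentioned afterwards is a placeholder rather than a project function.

```typescript
// Split an array into consecutive slices of at most `size` items.
function chunk<T>(items: T[], size: number): T[][] {
  if (size < 1 || Math.floor(size) !== size) {
    throw new Error('size must be a positive integer');
  }
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}
```

A writer would then loop `for (const batch of chunk(products, 150))` and await one upsert per batch, so that only one batch's payload is built at a time.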
11) **Use dual-mode crawls by default**
- Always run with `useBothModes:true` to combine Mode A (pricing) + Mode B (full coverage).
- Union/dedupe by product ID so you keep full coverage and pricing in one run.
12) **Capture OOS and missing items**
- GraphQL variables must include inactive/OOS (Status: All / activeOnly:false).
- After unioning Mode A/B, upsert products and insert snapshots with stock_status from the feed.
- If an existing product is absent from both modes, insert a snapshot with is_present_in_feed=false and stock_status='missing_from_feed'.
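Rules 11 and 12 together amount to a union by product ID followed by a set difference against the products already stored. The sketch below assumes a reduced `FeedProduct` shape and that Mode A fields win when both modes return the same product; the function names are hypothetical.

```typescript
interface FeedProduct {
  id: string;
  price?: number;
  stockStatus?: string;
}

interface MissingSnapshot {
  productId: string;
  is_present_in_feed: false;
  stock_status: 'missing_from_feed';
}

// Mode B gives full coverage; Mode A fields are layered on top (assumption).
function unionByProductId(modeA: FeedProduct[], modeB: FeedProduct[]): FeedProduct[] {
  const byId: { [id: string]: FeedProduct } = {};
  for (const product of modeB) byId[product.id] = product;
  for (const product of modeA) byId[product.id] = { ...byId[product.id], ...product };
  return Object.keys(byId).map((id) => byId[id]);
}

// Products known to the DB but absent from both modes get a "missing" snapshot.
function snapshotsForMissing(existingIds: string[], feed: FeedProduct[]): MissingSnapshot[] {
  const seen: { [id: string]: boolean } = {};
  for (const product of feed) seen[product.id] = true;
  return existingIds
    .filter((id) => !seen[id])
    .map((id) => ({
      productId: id,
      is_present_in_feed: false as const,
      stock_status: 'missing_from_feed' as const,
    }));
}
```

Nothing is deleted in either step: absent products only gain a new snapshot row, which keeps the history append-only.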
13) **Preserve all stock statuses (including unknown)**
- Do not filter or drop stock_status values in API/UI; pass through whatever is stored.
- Expected values: in_stock, out_of_stock, missing_from_feed, unknown.
14) **Never delete or overwrite historical data**
- Do not delete products/snapshots or overwrite historical records.
- Always append snapshots for changes (price/stock/qty), and mark missing_from_feed instead of removing records.
15) **Per-location cName and platform_dispensary_id resolution**
- For each dispensary, menu_url and cName must be valid for that exact location.
- Derive cName from menu_url per store: `/embedded-menu/<cName>` or `/dispensary/<cName>`.
- Resolve platform_dispensary_id from that cName using GraphQL GetAddressBasedDispensaryData.
- If the slug is invalid/missing, mark the store not crawlable and log it.
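A small sketch of deriving `cName` from a store's `menu_url`. `extractCName` is a hypothetical helper; it only handles the two path shapes listed above and returns `null` otherwise, which the caller would treat as "not crawlable".

```typescript
// Extracts <cName> from /embedded-menu/<cName> or /dispensary/<cName>.
function extractCName(menuUrl: string): string | null {
  const match = /\/(?:embedded-menu|dispensary)\/([^/?#]+)/.exec(menuUrl);
  return match?.[1] ?? null;
}
```

For example, `extractCName("https://dutchie.com/dispensary/abc?x=1")` yields `"abc"`, while a URL without either path segment yields `null`.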
16) **API Route Semantics**
**Route Groups (as registered in `src/index.ts`):**
- `/api/stores` = Store/dispensary CRUD and listing
- `/api/products` = Product listing and details
- `/api/workers` = Job queue monitoring (replaces legacy `/api/dutchie-az/...`)
- `/api/pipeline` = Crawl pipeline triggers
- `/api/admin/orchestrator` = Orchestrator admin actions
- `/api/discovery` = Platform discovery (Dutchie, etc.)
- `/api/v1/...` = Public API for external consumers (WordPress, etc.)
**Crawl Trigger:**
Check `/api/pipeline` or `/api/admin/orchestrator` routes for crawl triggers.
The legacy `POST /api/admin/crawl/:dispensaryId` does NOT exist.
17) **Monitoring and logging**
- `/api/workers` shows active/recent jobs from job queue
- Auto-refresh every 30 seconds
- System Logs page should show real log data, not just startup messages
18) **Dashboard Architecture**
- **Frontend**: Rebuild the frontend with `VITE_API_URL` pointing to the correct backend and redeploy.
- **Backend**: `/api/dashboard/stats` MUST use the canonical DB pool. Use the correct tables: `store_products`, `dispensaries`, and views like `v_dashboard_stats`, `v_latest_snapshots`.
19) **Deployment (Gitea + Kubernetes)**
- **Registry**: Gitea at `code.cannabrands.app/creationshop/dispensary-scraper`
- **Build and push** (from backend directory):
```bash
docker login code.cannabrands.app
cd backend
docker build -t code.cannabrands.app/creationshop/dispensary-scraper:latest .
docker push code.cannabrands.app/creationshop/dispensary-scraper:latest
```
- **Deploy to Kubernetes**:
```bash
kubectl rollout restart deployment/scraper -n dispensary-scraper
kubectl rollout restart deployment/scraper-worker -n dispensary-scraper
kubectl rollout status deployment/scraper -n dispensary-scraper
```
- K8s manifests are in `/k8s/` folder (scraper.yaml, scraper-worker.yaml, etc.)
20) **Crawler Architecture**
- **Scraper pod (1 replica)**: Runs the Express API server + scheduler.
- **Scraper-worker pods (25 replicas)**: Each runs `dist/tasks/task-worker.js`, polling the job queue.
- **Worker naming**: Pods use fantasy names (Aethelgard, Xylos, Kryll, Coriolis, etc.) - see `k8s/scraper-worker.yaml` ConfigMap. Worker IDs: `{PodName}-worker-{n}`
- **Job types**: `menu_detection`, `menu_detection_single`, `dutchie_product_crawl`
- **Job schedules** (managed in `job_schedules` table):
- `dutchie_az_menu_detection`: Runs daily with 60-min jitter
- `dutchie_az_product_crawl`: Runs every 4 hours with 30-min jitter
- **Monitor jobs**: `GET /api/workers`
- **Trigger crawls**: Check `/api/pipeline` routes
21) **Frontend Architecture - AVOID OVER-ENGINEERING**
**Key Principles:**
- **ONE BACKEND** serves ALL domains (cannaiq.co, findadispo.com, findagram.co)
- Do NOT create separate backend services for each domain
**Frontend Build Differences:**
- `cannaiq/` uses **Vite** (outputs to `dist/`, uses `VITE_` env vars) → cannaiq.co
- `findadispo/` uses **Create React App** (outputs to `build/`, uses `REACT_APP_` env vars) → findadispo.com
- `findagram/` uses **Create React App** (outputs to `build/`, uses `REACT_APP_` env vars) → findagram.co
**CRA vs Vite Dockerfile Differences:**
```dockerfile
# Vite (cannaiq)
ENV VITE_API_URL=https://api.domain.com
RUN npm run build
COPY --from=builder /app/dist /usr/share/nginx/html
# CRA (findadispo, findagram)
ENV REACT_APP_API_URL=https://api.domain.com
RUN npm run build
COPY --from=builder /app/build /usr/share/nginx/html
```
**Common Mistakes to AVOID:**
- Creating a FastAPI/Express backend just for findagram or findadispo
- Creating separate Docker images per domain when one would work
- Using `npm ci` in Dockerfiles when package-lock.json doesn't exist (use `npm install`)
---
## Admin UI Integration (Dutchie Discovery System)
The admin frontend includes a dedicated Discovery page located at `cannaiq/src/pages/Discovery.tsx`.
This page is the operational interface that administrators use for
managing the Dutchie discovery pipeline. While it does not define API
features itself, it is the primary consumer of the Dutchie Discovery API.
### Responsibilities of the Discovery UI
The UI enables administrators to:
- View all discovered Dutchie locations
- Filter by status:
- discovered
- verified
- merged (linked to an existing dispensary)
- rejected
- Inspect individual location details (metadata, raw address, menu URL)
- Verify & create a new canonical dispensary
- Verify & link to an existing canonical dispensary
- Reject or unreject discovered locations
- Promote verified/merged locations into full crawlers via the orchestrator
### API Endpoints Consumed by the Discovery UI
The Discovery UI uses platform-agnostic routes with neutral slugs (see `docs/platform-slug-mapping.md`):
**Platform Slug**: `dt` = Dutchie (trademark-safe URL)
- `GET /api/discovery/platforms/dt/locations`
- `GET /api/discovery/platforms/dt/locations/:id`
- `POST /api/discovery/platforms/dt/locations/:id/verify-create`
- `POST /api/discovery/platforms/dt/locations/:id/verify-link`
- `POST /api/discovery/platforms/dt/locations/:id/reject`
- `POST /api/discovery/platforms/dt/locations/:id/unreject`
- `GET /api/discovery/platforms/dt/locations/:id/match-candidates`
- `GET /api/discovery/platforms/dt/cities`
- `GET /api/discovery/platforms/dt/summary`
- `POST /api/orchestrator/platforms/dt/promote/:id`
These endpoints are defined in:
- `backend/src/dutchie-az/discovery/routes.ts`
- `backend/src/dutchie-az/discovery/promoteDiscoveryLocation.ts`
### Frontend API Helper
The file `cannaiq/src/lib/api.ts` implements the client-side wrappers for calling these endpoints:
- `getPlatformDiscoverySummary(platformSlug)`
- `getPlatformDiscoveryLocations(platformSlug, params)`
- `getPlatformDiscoveryLocation(platformSlug, id)`
- `verifyCreatePlatformLocation(platformSlug, id, verifiedBy)`
- `verifyLinkPlatformLocation(platformSlug, id, dispensaryId, verifiedBy)`
- `rejectPlatformLocation(platformSlug, id, reason, verifiedBy)`
- `unrejectPlatformLocation(platformSlug, id)`
- `getPlatformLocationMatchCandidates(platformSlug, id)`
- `getPlatformDiscoveryCities(platformSlug, params)`
- `promotePlatformDiscoveryLocation(platformSlug, id)`
Where `platformSlug` is a neutral two-letter slug (e.g., `'dt'` for Dutchie).
These helpers must be kept synchronized with backend routes.
### UI/Backend Contract
The Discovery UI must always:
- Treat discovery data as **non-canonical** until verified.
- Not assume a discovery location is crawl-ready.
- Initiate promotion only after verification steps.
- Handle all statuses safely: discovered, verified, merged, rejected.
The backend must always:
- Preserve discovery data even if rejected.
- Never automatically merge or promote a location.
- Allow idempotent verification and linking actions.
- Expose complete metadata to help operators make verification decisions.
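The status rules in this contract could be encoded as a small guard on the UI side. This is a sketch with hypothetical names (`DiscoveryStatus`, `canPromote`, `isCanonical`); it assumes promotion is offered only for `verified` and `merged` locations, as the responsibilities list above describes.

```typescript
type DiscoveryStatus = "discovered" | "verified" | "merged" | "rejected";

// Promotion is only offered after a verification step has happened.
function canPromote(status: DiscoveryStatus): boolean {
  return status === "verified" || status === "merged";
}

// Discovery data is treated as non-canonical until it has been verified.
function isCanonical(status: DiscoveryStatus): boolean {
  switch (status) {
    case "verified":
    case "merged":
      return true;
    case "discovered":
    case "rejected":
      return false;
  }
}
```

Listing every status in the `switch` means the compiler flags this function if a new status is added to the union without being handled.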
# Coordinate Capture (Platform Discovery)
The DtLocationDiscoveryService captures geographic coordinates (latitude, longitude) whenever a platform's store payload provides them.
## Behavior:
- On INSERT:
- If the Dutchie API/GraphQL payload includes coordinates, they are saved into:
- dutchie_discovery_locations.latitude
- dutchie_discovery_locations.longitude
- On UPDATE:
- Coordinates are only filled if the existing row has NULL values.
- Coordinates are never overwritten once set (prevents pollution if later payloads omit or degrade coordinate accuracy).
- Logging:
- When coordinates are detected and captured:
"Extracted coordinates for <slug>: <lat>, <lng>"
- Summary Statistics:
- The discovery runner reports a count of:
- locations with coordinates
- locations without coordinates
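The fill-only-if-NULL rule on UPDATE can be sketched as below. `Coordinates` and `mergeCoordinates` are hypothetical names, not the actual `DtLocationDiscoveryService` code; the logic is equivalent to a SQL `COALESCE(existing, incoming)` per column.

```typescript
interface Coordinates {
  latitude: number | null;
  longitude: number | null;
}

// Existing values always win; incoming values only fill NULLs.
function mergeCoordinates(
  existing: Coordinates,
  incoming: Partial<Coordinates>,
): Coordinates {
  return {
    latitude: existing.latitude ?? incoming.latitude ?? null,
    longitude: existing.longitude ?? incoming.longitude ?? null,
  };
}
```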
## Purpose:
Coordinate capture enables:
- City/state validation (cross-checking submitted address vs lat/lng)
- Distance-based duplicate detection
- Location clustering for analytics
- Mapping/front-end visualization
- Future multi-platform reconciliation
- Improved dispensary matching during verify-link flow
Coordinate capture is part of the discovery phase only.
Canonical `dispensaries` entries may later be enriched with verified coordinates during promotion.
# CannaiQ — Analytics V2 Examples & API Structure Extension
This section contains examples from `backend/docs/ANALYTICS_V2_EXAMPLES.md` and extends the Analytics V2 API definition to include:
- response payload formats
- time window semantics
- rec/med segmentation usage
- SQL/TS pseudo-code examples
- endpoint expectations
---
# Analytics V2: Supported Endpoints
Base URL prefix: /api/analytics/v2
All endpoints accept `?window=7d|30d|90d` unless noted otherwise.
## 1. Price Analytics
### GET /api/analytics/v2/price/product/:storeProductId
Returns price history for a canonical store product.
Example response:
{
"storeProductId": 123,
"window": "30d",
"points": [
{ "date": "2025-02-01", "price": 32, "in_stock": true },
{ "date": "2025-02-02", "price": 30, "in_stock": true }
]
}
### GET /api/analytics/v2/price/rec-vs-med?categoryId=XYZ
Compares category pricing between recreational and medical-only states.
Example response:
{
"categoryId": "flower",
"rec": { "avg": 29.44, "median": 28.00, "states": ["CO", "WA", ...] },
"med": { "avg": 33.10, "median": 31.00, "states": ["FL", "PA", ...] }
}
---
## 2. Brand Analytics
### GET /api/analytics/v2/brand/:name/penetration
Returns penetration across states.
{
"brand": "Wyld",
"window": "90d",
"penetration": [
{ "state": "AZ", "stores": 28 },
{ "state": "MI", "stores": 34 }
]
}
### GET /api/analytics/v2/brand/:name/rec-vs-med
Returns penetration split by rec vs med segmentation.
---
## 3. Category Analytics
### GET /api/analytics/v2/category/:name/growth
Compares SKU counts between the current and previous snapshot windows (7d/30d/90d):
{
"category": "vape",
"window": "30d",
"growth": {
"current_sku_count": 420,
"previous_sku_count": 380,
"delta": 40
}
}
### GET /api/analytics/v2/category/rec-vs-med
Category-level comparisons.
---
## 4. Store Analytics
### GET /api/analytics/v2/store/:storeId/changes
Product-level changes:
{
"storeId": 88,
"window": "30d",
"added": [...],
"removed": [...],
"price_changes": [...],
"restocks": [...],
"oos_events": [...]
}
### GET /api/analytics/v2/store/:storeId/summary
---
## 5. State Analytics
### GET /api/analytics/v2/state/legal-breakdown
State rec/med/no-program segmentation summary.
### GET /api/analytics/v2/state/rec-vs-med-pricing
State-level pricing comparison.
### GET /api/analytics/v2/state/recreational
List rec-legal state codes.
### GET /api/analytics/v2/state/medical-only
List med-only state codes.
---
# Windowing Semantics
Definition: window is applied to canonical snapshots.
Equivalent to:
WHERE snapshot_at >= NOW() - INTERVAL '<window>'
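A sketch of validating the `window` query parameter and computing the matching cutoff timestamp. The names (`AnalyticsWindow`, `isAnalyticsWindow`, `windowStart`) are hypothetical; the three allowed values come from the endpoint list above.

```typescript
type AnalyticsWindow = "7d" | "30d" | "90d";

const WINDOW_DAYS: Record<AnalyticsWindow, number> = {
  "7d": 7,
  "30d": 30,
  "90d": 90,
};

// Type guard for untrusted query-string input.
function isAnalyticsWindow(value: string): value is AnalyticsWindow {
  return Object.prototype.hasOwnProperty.call(WINDOW_DAYS, value);
}

// Cutoff equivalent to NOW() - INTERVAL '<window>', with `now` injected.
function windowStart(win: AnalyticsWindow, now: Date): Date {
  const msPerDay = 24 * 60 * 60 * 1000;
  return new Date(now.getTime() - WINDOW_DAYS[win] * msPerDay);
}
```

Passing `now` as an argument keeps the function deterministic, which makes it straightforward to test.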
---
# Rec/Med Segmentation Rules
rec_states:
states.recreational_legal = TRUE
med_only_states:
states.medical_legal = TRUE AND states.recreational_legal = FALSE
no_program:
both flags FALSE or NULL
Analytics must use this segmentation consistently.
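The segmentation rules can be sketched as a single classifier so every service shares one definition. `classifyState` and its types are hypothetical. One case is not covered by the rules above: `medical_legal = TRUE` with `recreational_legal = NULL`; this sketch assumes it counts as med-only.

```typescript
type Segment = "rec" | "med_only" | "no_program";

interface StateFlags {
  recreational_legal: boolean | null;
  medical_legal: boolean | null;
}

function classifyState(flags: StateFlags): Segment {
  if (flags.recreational_legal === true) return "rec";
  // Assumption: a NULL recreational flag is treated like FALSE here.
  if (flags.medical_legal === true) return "med_only";
  return "no_program";
}
```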
---
# Response Structure Requirements
Every analytics v2 endpoint must:
- include the window used
- include segmentation if relevant
- include state codes when state-level grouping is used
- return safe empty arrays if no data
- NEVER throw on missing data
- be versionable (v2 must not break previous analytics APIs)
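As an illustration of the "safe empty arrays, never throw" requirement, a response builder for the price-history endpoint might look like this. The type and function names are hypothetical; the payload shape follows the example response shown earlier in this spec.

```typescript
interface PricePoint {
  date: string;
  price: number;
  in_stock: boolean;
}

interface PriceHistoryResponse {
  storeProductId: number;
  window: string;
  points: PricePoint[];
}

// Always echoes the window and falls back to [] when no rows were found.
function buildPriceHistoryResponse(
  storeProductId: number,
  win: string,
  rows: PricePoint[] | null | undefined,
): PriceHistoryResponse {
  return {
    storeProductId,
    window: win,
    points: rows ?? [],
  };
}
```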
---
# Service Responsibilities Summary
### PriceAnalyticsService
- compute time-series price trends
- compute average/median price by state
- compute rec-vs-med price comparisons
### BrandPenetrationService
- compute presence across stores and states
- rec-vs-med brand footprint
- detect expansion / contraction
### CategoryAnalyticsService
- compute SKU count changes
- category pricing
- rec-vs-med category dynamics
### StoreAnalyticsService
- detect SKU additions/drops
- price changes
- restocks & OOS events
### StateAnalyticsService
- legal breakdown
- coverage gaps
- rec-vs-med scoring
---
# END Analytics V2 spec extension
---
## WordPress Plugin Versioning
The WordPress plugin version is tracked in `wordpress-plugin/VERSION`.
**Current version:** Check `wordpress-plugin/VERSION` for the latest version.
**Versioning rules:**
- **Minor bumps (x.x.N)**: Bug fixes, small improvements - default for most changes
- **Middle bumps (x.N.0)**: New features, significant improvements
- **Major bumps (N.0.0)**: Breaking changes, major rewrites - only when user explicitly requests
**When making WP plugin changes:**
1. Read `wordpress-plugin/VERSION` to get current version
2. Bump the version number (minor by default)
3. Update every place the version appears:
- `wordpress-plugin/VERSION`
- Plugin header `Version:` in `cannaiq-menus.php` and/or `crawlsy-menus.php`
- The `define('..._VERSION', '...')` constant in each plugin file
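The bump rules can be sketched as follows, using this document's own terms (minor = last segment, middle = second segment). `bumpVersion` is a hypothetical helper and assumes `wordpress-plugin/VERSION` holds a plain `N.N.N` string.

```typescript
type BumpKind = "minor" | "middle" | "major";

// Defaults to a minor bump (x.x.N), matching the rule above.
function bumpVersion(current: string, kind: BumpKind = "minor"): string {
  const match = /^(\d+)\.(\d+)\.(\d+)$/.exec(current.trim());
  if (!match) {
    throw new Error(`Unexpected version format: ${current}`);
  }
  const major = Number(match[1]);
  const middle = Number(match[2]);
  const minor = Number(match[3]);

  switch (kind) {
    case "major":
      return `${major + 1}.0.0`;
    case "middle":
      return `${major}.${middle + 1}.0`;
    default:
      return `${major}.${middle}.${minor + 1}`;
  }
}
```

The resulting string is what would then be written to the `VERSION` file, the plugin header, and the `define(...)` constant.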
**Plugin files:**
| File | Brand | API URL |
|------|-------|---------|
| `cannaiq-menus.php` | CannaIQ | `https://cannaiq.co/api/v1` |
| `crawlsy-menus.php` | Crawlsy (legacy) | `https://cannaiq.co/api/v1` |
Both plugins use the same API endpoint. The Crawlsy version exists for backward compatibility with existing installations.