diff --git a/CLAUDE.md b/CLAUDE.md index 579de0e2..42e79383 100644 --- a/CLAUDE.md +++ b/CLAUDE.md @@ -1,1316 +1,107 @@ -## Claude Guidelines for this Project - ---- +# Claude Guidelines for CannaiQ ## PERMANENT RULES (NEVER VIOLATE) -### 1. NO DELETION OF DATA — EVER +### 1. NO DELETE +Never delete data, files, images, logs, or database rows. CannaiQ is a historical analytics system. -CannaiQ is a **historical analytics system**. Data retention is **permanent by design**. +### 2. NO KILL +Never run `pkill`, `kill`, `killall`, or similar. Say "Please run `./stop-local.sh`" instead. -**NEVER delete:** -- Product records -- Crawled snapshots -- Images -- Directories -- Logs -- Orchestrator traces -- Profiles -- Selector configs -- Crawl outcomes -- Store data -- Brand data +### 3. NO MANUAL STARTUP +Never start servers manually. Say "Please run `./setup-local.sh`" instead. -**NEVER automate cleanup:** -- No cron or scheduled job may `rm`, `unlink`, `delete`, `purge`, `prune`, `clean`, or `reset` any storage directory or DB row -- No migration may DELETE data — only add/update/alter columns -- If cleanup is required, ONLY the user may issue a manual command +### 4. DEPLOYMENT AUTH REQUIRED +Never deploy unless user explicitly says: "CLAUDE — DEPLOYMENT IS NOW AUTHORIZED." -**Code enforcement:** -- `local-storage.ts` must only: write files, create directories, read files -- No `deleteImage`, `deleteProductImages`, or similar functions - -### 2. NO PROCESS KILLING — EVER - -**Claude must NEVER run process-killing commands:** -- No `pkill` -- No `kill -9` -- No `xargs kill` -- No `lsof | kill` -- No `killall` -- No `fuser -k` - -**Claude must NOT manage host processes.** Only user scripts manage the local environment. - -**Correct behavior:** -- If backend is running on port 3010 → say: "Backend already running" -- If backend is NOT running → say: "Please run `./setup-local.sh`" - -**Process management is done ONLY by user scripts:** -```bash -./setup-local.sh # Start local environment -./stop-local.sh # Stop local environment -``` - -### 3. NO MANUAL SERVER STARTUP — EVER - -**Claude must NEVER start the backend manually:** -- No `npx tsx src/index.ts` -- No `node dist/index.js` -- No `npm run dev` with custom env vars -- No `DATABASE_URL=... npx tsx ...` - -**Claude must NEVER set DATABASE_URL in shell commands:** -- DB connection uses `CANNAIQ_DB_*` env vars or `CANNAIQ_DB_URL` from the user's environment -- Never hardcode connection strings in bash commands -- Never override env vars to bypass the user's DB setup - -**If backend is not running:** -- Say: "Please run `./setup-local.sh`" -- Do NOT attempt to start it yourself - -**If a dependency is missing:** -- Add it to `package.json` -- Say: "Please run `cd backend && npm install`" -- Do NOT try to solve it by starting a custom dev server - -**The ONLY way to start local services:** -```bash -cd backend -./setup-local.sh -``` - -### 4. DEPLOYMENT AUTHORIZATION REQUIRED - -**NEVER deploy to production unless the user explicitly says:** -> "CLAUDE — DEPLOYMENT IS NOW AUTHORIZED." - -Until then: -- All work is LOCAL ONLY -- No `kubectl apply`, `docker push`, or remote operations -- No port-forwarding to production -- No connecting to Kubernetes clusters - -### 5. DATABASE CONNECTION ARCHITECTURE - -**Migration code is CLI-only. 
Runtime code must NOT import `src/db/migrate.ts`.** - -| Module | Purpose | Import From | -|--------|---------|-------------| -| `src/db/migrate.ts` | CLI migrations only | **NEVER import at runtime** | -| `src/db/pool.ts` | Runtime database pool | `import { pool } from '../db/pool'` | -| `src/dutchie-az/db/connection.ts` | Canonical connection helper | Alternative for runtime | - -**Runtime gets DB connections ONLY via:** -```typescript -import { pool } from '../db/pool'; -// or -import { getPool } from '../dutchie-az/db/connection'; -``` - -**To run migrations:** -```bash -cd backend -npx tsx src/db/migrate.ts -``` - -**Why this matters:** -- `migrate.ts` validates env vars strictly and throws at module load time -- Importing it at runtime causes startup crashes if env vars aren't perfect -- `pool.ts` uses lazy initialization - only validates when first query is made - -### 6. ALL API ROUTES REQUIRE AUTHENTICATION — NO EXCEPTIONS - -**Every API router MUST apply `authMiddleware` at the router level.** - -```typescript -import { authMiddleware } from '../auth/middleware'; - -const router = Router(); -router.use(authMiddleware); // REQUIRED - first line after router creation -``` - -**Authentication flow (see `src/auth/middleware.ts`):** -1. Check Bearer token (JWT or API token) → grant access if valid -2. Check trusted origins (cannaiq.co, findadispo.com, localhost, etc.) → grant access -3. Check trusted IPs (127.0.0.1, ::1, internal pod IPs) → grant access -4. **Return 401 Unauthorized** if none of the above - -**NEVER create API routes without auth middleware:** -- No "public" endpoints that bypass authentication -- No "read-only" exceptions -- No "analytics-only" exceptions -- If an endpoint exists under `/api/*`, it MUST be protected - -**When creating new route files:** -1. Import `authMiddleware` from `../auth/middleware` -2. Add `router.use(authMiddleware)` immediately after creating the router -3. Document security requirements in file header comments - -**Trusted origins (defined in middleware):** -- `https://cannaiq.co` -- `https://findadispo.com` -- `https://findagram.co` -- `*.cannabrands.app` domains -- `localhost:*` for development - -### 7. LOCAL DEVELOPMENT BY DEFAULT - -**Quick Start:** -```bash -./setup-local.sh -``` - -**Services (all started by setup-local.sh):** -| Service | URL | Purpose | -|---------|-----|---------| -| PostgreSQL | localhost:54320 | cannaiq-postgres container | -| Backend API | http://localhost:3010 | Express API server | -| CannaiQ Admin | http://localhost:8080/admin | B2B admin dashboard | -| FindADispo | http://localhost:3001 | Consumer dispensary finder | -| Findagram | http://localhost:3002 | Consumer delivery marketplace | - -**In local mode:** -- Use `docker-compose.local.yml` (NO MinIO) -- Use local filesystem storage at `./storage` -- Connect to `cannaiq-postgres` at `localhost:54320` -- Backend runs at `localhost:3010` -- All three frontends run on separate ports (8080, 3001, 3002) -- NO remote connections, NO Kubernetes, NO MinIO - -**Environment:** -- All DB config is in `backend/.env` -- STORAGE_DRIVER=local -- STORAGE_BASE_PATH=./storage - -**Local Admin Bootstrap:** -```bash -cd backend -npx tsx src/scripts/bootstrap-local-admin.ts -``` - -Creates/resets a deterministic local admin user: -| Field | Value | -|-------|-------| -| Email | `admin@local.test` | -| Password | `admin123` | -| Role | `superadmin` | - -This is a LOCAL-DEV helper only. Never use these credentials in production. 
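For reference, a minimal sketch of what a bootstrap script like this might do. The `users` table and its column names are assumptions (not the actual schema), as is the use of bcryptjs for hashing; only the email, password, and role values come from the table above.

```typescript
// Hypothetical sketch of src/scripts/bootstrap-local-admin.ts.
// Table/column names ("users", "email", "password_hash", "role") are assumptions.
import bcrypt from 'bcryptjs';
import { pool } from '../db/pool';

async function bootstrapLocalAdmin(): Promise<void> {
  const email = 'admin@local.test';
  const passwordHash = await bcrypt.hash('admin123', 10);

  // Idempotent: re-running resets the password and role for the same email.
  await pool.query(
    `INSERT INTO users (email, password_hash, role)
     VALUES ($1, $2, 'superadmin')
     ON CONFLICT (email) DO UPDATE
       SET password_hash = EXCLUDED.password_hash,
           role = EXCLUDED.role`,
    [email, passwordHash]
  );
}

bootstrapLocalAdmin().then(() => pool.end());
```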
- -**Manual startup (if not using setup-local.sh):** -```bash -# Terminal 1: Start PostgreSQL -docker-compose -f docker-compose.local.yml up -d - -# Terminal 2: Start Backend -cd backend && npm run dev - -# Terminal 3: Start Frontend -cd cannaiq && npm run dev:admin -``` - -**Stop services:** -```bash -./stop-local.sh -``` +### 5. DB POOL ONLY +Never import `src/db/migrate.ts` at runtime. Use `src/db/pool.ts` for DB access. --- -## DATABASE MODEL (CRITICAL) +## Quick Reference -### Database Architecture +### Database Tables +| USE THIS | NOT THIS | +|----------|----------| +| `dispensaries` | `stores` (empty) | +| `store_products` | `products` (empty) | +| `store_product_snapshots` | `dutchie_product_snapshots` | -CannaiQ has **TWO databases** with distinct purposes: +### Key Files +| Purpose | File | +|---------|------| +| Dutchie client | `src/platforms/dutchie/client.ts` | +| DB pool | `src/db/pool.ts` | +| Payload fetch | `src/tasks/handlers/payload-fetch.ts` | +| Product refresh | `src/tasks/handlers/product-refresh.ts` | -| Database | Purpose | Access | -|----------|---------|--------| -| `dutchie_menus` | **Canonical CannaiQ database** - All schema, migrations, and application data | READ/WRITE | -| `dutchie_legacy` | **Legacy read-only archive** - Historical data from old system | READ-ONLY | +### Dutchie GraphQL +- **Endpoint**: `https://dutchie.com/api-3/graphql` +- **Hash (FilteredProducts)**: `ee29c060826dc41c527e470e9ae502c9b2c169720faa0a9f5d25e1b9a530a4a0` +- **CRITICAL**: Use `Status: 'Active'` (not `null`) -### Store vs Dispensary Terminology - -**"Store" and "Dispensary" are SYNONYMS in CannaiQ.** - -| Term | Usage | DB Table | -|------|-------|----------| -| Store | API routes (`/api/stores`) | `dispensaries` | -| Dispensary | DB table, internal code | `dispensaries` | - -- `/api/stores` and `/api/dispensaries` both query the `dispensaries` table -- There is NO `stores` table in use - it's a legacy empty table -- Use these terms interchangeably in code and documentation - -### Canonical vs Legacy Tables - -**CANONICAL TABLES (USE THESE):** - -| Table | Purpose | Row Count | -|-------|---------|-----------| -| `dispensaries` | Store/dispensary records | ~188+ rows | -| `store_products` | Product catalog | ~37,000+ rows | -| `store_product_snapshots` | Price/stock history | ~millions | - -**LEGACY TABLES (EMPTY - DO NOT USE):** - -| Table | Status | Action | -|-------|--------|--------| -| `stores` | EMPTY (0 rows) | Use `dispensaries` instead | -| `products` | EMPTY (0 rows) | Use `store_products` instead | -| `dutchie_products` | LEGACY (0 rows) | Use `store_products` instead | -| `dutchie_product_snapshots` | LEGACY (0 rows) | Use `store_product_snapshots` instead | -| `categories` | EMPTY (0 rows) | Categories stored in product records | - -**Code must NEVER:** -- Query the `stores` table (use `dispensaries`) -- Query the `products` table (use `store_products`) -- Query the `dutchie_products` table (use `store_products`) -- Query the `categories` table (categories are in product records) - -**CRITICAL RULES:** -- **Migrations ONLY run on `dutchie_menus`** - NEVER on `dutchie_legacy` -- **Application code connects ONLY to `dutchie_menus`** -- **ETL scripts READ from `dutchie_legacy`, WRITE to `dutchie_menus`** -- `dutchie_legacy` is frozen - NO writes, NO schema changes, NO migrations - -### Environment Variables - -**CannaiQ Database (dutchie_menus) - PRIMARY:** -```bash -# All application/migration DB access uses these env vars: -CANNAIQ_DB_HOST=localhost # Database 
host -CANNAIQ_DB_PORT=54320 # Database port -CANNAIQ_DB_NAME=dutchie_menus # MUST be dutchie_menus -CANNAIQ_DB_USER=dutchie # Database user -CANNAIQ_DB_PASS= # Database password - -# OR use a full connection string: -CANNAIQ_DB_URL=postgresql://user:pass@host:port/dutchie_menus -``` - -**Legacy Database (dutchie_legacy) - ETL ONLY:** -```bash -# Only used by ETL scripts for reading legacy data: -LEGACY_DB_HOST=localhost -LEGACY_DB_PORT=54320 -LEGACY_DB_NAME=dutchie_legacy # READ-ONLY - never migrated -LEGACY_DB_USER=dutchie -LEGACY_DB_PASS= - -# OR use a full connection string: -LEGACY_DB_URL=postgresql://user:pass@host:port/dutchie_legacy -``` - -**Key Rules:** -- `CANNAIQ_DB_NAME` MUST be `dutchie_menus` for application/migrations -- `LEGACY_DB_NAME` is `dutchie_legacy` - READ-ONLY for ETL only -- ALL application code MUST use `CANNAIQ_DB_*` environment variables -- No hardcoded database names anywhere in the codebase -- `backend/.env` controls all database access for local development - -**State Modeling:** -- States (AZ, MI, CA, NV, etc.) are modeled via `states` table + `state_id` on dispensaries -- NO separate databases per state -- Use `state_code` or `state_id` columns for filtering - -### Migration and ETL Procedure - -**Step 1: Run schema migration (on dutchie_menus ONLY):** -```bash -cd backend -psql "postgresql://dutchie:dutchie_local_pass@localhost:54320/dutchie_menus" \ - -f migrations/041_cannaiq_canonical_schema.sql -``` - -**Step 2: Run ETL to copy legacy data:** -```bash -cd backend -npx tsx src/scripts/etl/042_legacy_import.ts -# Reads from dutchie_legacy, writes to dutchie_menus -``` - -### Database Access Rules - -**Claude MUST NOT:** -- Connect to any database besides the canonical CannaiQ database -- Use raw connection strings in shell commands -- Run `psql` commands directly -- Construct database URLs manually -- Create or rename databases automatically -- Run `npm run migrate` without explicit user authorization -- Patch schema at runtime (no ALTER TABLE from scripts) - -**All data access MUST go through:** -- LOCAL CannaiQ backend HTTP API endpoints -- Internal CannaiQ application code (using canonical connection pool) -- Ask user to run SQL manually if absolutely needed - -**Local service management:** -- User starts services via `./setup-local.sh` (ONLY the user runs this) -- If port 3010 responds, assume backend is running -- If port 3010 does NOT respond, tell user: "Backend is not running; please run `./setup-local.sh`" -- Claude may only access the app via HTTP: `http://localhost:3010` (API), `http://localhost:8080/admin` (UI) -- Never restart, kill, or manage local processes — that is the user's responsibility - -### Migrations - -**Rules:** -- Migrations may be WRITTEN but only the USER runs them after review -- Never execute migrations automatically -- Only additive migrations (no DROP/DELETE) -- Write schema-tolerant code that handles missing optional columns - -**If schema changes are needed:** -1. Generate a proper migration file in `backend/migrations/*.sql` -2. Show the migration to the user -3. Wait for explicit authorization before running -4. 
Never run migrations automatically - only the user runs them after review - -**Schema tolerance:** -- If a column is missing at runtime, prefer making the code tolerant (treat field as optional) instead of auto-creating the column -- Queries should gracefully handle missing columns by omitting them or using NULL defaults - -### Canonical Schema Migration (041/042) - -**Migration 041** (`backend/migrations/041_cannaiq_canonical_schema.sql`): -- Creates canonical CannaiQ tables: `states`, `chains`, `brands`, `store_products`, `store_product_snapshots`, `crawl_runs` -- Adds `state_id` and `chain_id` columns to `dispensaries` -- Adds status columns to `dispensary_crawler_profiles` -- SCHEMA ONLY - no data inserts from legacy tables - -**ETL Script 042** (`backend/src/scripts/etl/042_legacy_import.ts`): -- Copies data from legacy `dutchie_legacy.dutchie_products` → `store_products` -- Copies data from legacy `dutchie_legacy.dutchie_product_snapshots` → `store_product_snapshots` -- Extracts brands from product data into `brands` table -- Links dispensaries to chains and states -- INSERT-ONLY and IDEMPOTENT (uses ON CONFLICT DO NOTHING) -- Run manually: `cd backend && npx tsx src/scripts/etl/042_legacy_import.ts` - -**Tables touched by ETL:** -| Source Table (dutchie_legacy) | Target Table (dutchie_menus) | -|-------------------------------|------------------------------| -| `dutchie_products` | `store_products` | -| `dutchie_product_snapshots` | `store_product_snapshots` | -| (brand names extracted) | `brands` | -| (state codes mapped) | `dispensaries.state_id` | -| (chain names matched) | `dispensaries.chain_id` | - -**Note:** The legacy `dutchie_products` and `dutchie_product_snapshots` tables in `dutchie_legacy` are read-only sources. All new crawl data goes directly to `store_products` and `store_product_snapshots`. - -**Migration 045** (`backend/migrations/045_add_image_columns.sql`): -- Adds `thumbnail_url` to `store_products` and `store_product_snapshots` -- `image_url` already exists from migration 041 -- ETL 042 populates `image_url` from legacy `primary_image_url` where present -- `thumbnail_url` is NULL for legacy data - future crawls can populate it - -### Deprecated Connection Module - -The custom connection module at `src/dutchie-az/db/connection` is **DEPRECATED**. 
- -**All code using `getClient` from this module must be refactored to:** -- Use the CannaiQ API endpoints instead -- Use the orchestrator through the API -- Use the canonical DB pool from the main application +### Frontends +| Folder | Domain | Build | +|--------|--------|-------| +| `cannaiq/` | cannaiq.co | Vite | +| `findadispo/` | findadispo.com | CRA | +| `findagram/` | findagram.co | CRA | +| `frontend/` | DEPRECATED | - | --- -## PERFORMANCE REQUIREMENTS +## Deprecated Code -**Database Queries:** -- NEVER write N+1 queries - always batch fetch related data before iterating -- NEVER run queries inside loops - batch them before the loop -- Avoid multiple queries when one JOIN or subquery works -- Dashboard/index pages should use MAX 5-10 queries total, not 50+ -- Mentally trace query count - if a page would run 20+ queries, refactor -- Cache expensive aggregations (in-memory or Redis, 5-min TTL) instead of recalculating every request -- Use query logging during development to verify query count +**DO NOT USE** anything in `src/_deprecated/`: +- `hydration/` - Use `src/tasks/handlers/` +- `scraper-v2/` - Use `src/platforms/dutchie/` +- `canonical-hydration/` - Merged into tasks -**Before submitting route/controller code, verify:** -1. No queries inside `forEach`/`map`/`for` loops -2. All related data fetched in batches before iteration -3. Aggregations done in SQL (`COUNT`, `SUM`, `AVG`, `GROUP BY`), not in JS -4. **Would this cause a 503 under load? If unsure, simplify.** - -**Examples of BAD patterns:** -```typescript -// BAD: N+1 query - runs a query for each store -const stores = await getStores(); -for (const store of stores) { - store.products = await getProductsByStoreId(store.id); // N queries! -} - -// BAD: Query inside map -const results = await Promise.all( - storeIds.map(id => pool.query('SELECT * FROM products WHERE store_id = $1', [id])) -); -``` - -**Examples of GOOD patterns:** -```typescript -// GOOD: Batch fetch all products, then group in JS -const stores = await getStores(); -const storeIds = stores.map(s => s.id); -const allProducts = await pool.query( - 'SELECT * FROM products WHERE store_id = ANY($1)', [storeIds] -); -const productsByStore = groupBy(allProducts.rows, 'store_id'); -stores.forEach(s => s.products = productsByStore[s.id] || []); - -// GOOD: Single query with JOIN -const result = await pool.query(` - SELECT s.*, COUNT(p.id) as product_count - FROM stores s - LEFT JOIN products p ON p.store_id = s.id - GROUP BY s.id -`); -``` +**DO NOT USE** `src/dutchie-az/db/connection.ts` - Use `src/db/pool.ts` --- -## FORBIDDEN ACTIONS +## Local Development -1. **Deleting any data** (products, snapshots, images, logs, traces) -2. **Deploying without explicit authorization** -3. **Connecting to Kubernetes** without authorization -4. **Port-forwarding to production** without authorization -5. **Starting MinIO** in local development -6. **Using S3/MinIO SDKs** when `STORAGE_DRIVER=local` -7. **Automating cleanup** of any kind -8. **Dropping database tables or columns** -9. **Overwriting historical records** (always append snapshots) -10. **Runtime schema patching** (ALTER TABLE from scripts) -11. **Using `getClient` from deprecated connection module** -12. **Creating ad-hoc database connections** outside the canonical pool -13. **Auto-adding missing columns** at runtime -14. **Killing local processes** (`pkill`, `kill`, `kill -9`, etc.) -15. **Starting backend/frontend directly** with custom env vars -16. 
**Running `lsof -ti:PORT | xargs kill`** or similar process-killing commands -17. **Using hardcoded database names** in code or comments -18. **Creating or connecting to a second database** -19. **Creating API routes without authMiddleware** (all `/api/*` routes MUST be protected) - ---- - -## STORAGE BEHAVIOR - -### Local Storage Structure - -``` -/storage/images/products/{state}/{store}/{brand}/{product}/ - image-{hash}.webp - -/storage/images/brands/{brand}/ - logo-{hash}.webp -``` - -### Image Proxy API (On-Demand Resizing) - -Images are stored at full resolution and resized on-demand via the `/img` endpoint. - -**Endpoint:** `GET /img/?` - -**Parameters:** -| Param | Description | Example | -|-------|-------------|---------| -| `w` | Width in pixels (max 4000) | `?w=200` | -| `h` | Height in pixels (max 4000) | `?h=200` | -| `q` | Quality 1-100 (default 80) | `?q=70` | -| `fit` | Resize mode: cover, contain, fill, inside, outside | `?fit=cover` | -| `blur` | Blur sigma 0.3-1000 | `?blur=5` | -| `gray` | Grayscale (1 = enabled) | `?gray=1` | -| `format` | Output: webp, jpeg, png, avif (default webp) | `?format=jpeg` | - -**Examples:** ```bash -# Thumbnail (50px) -GET /img/products/az/store/brand/product/image-abc123.webp?w=50 - -# Card image (200px, cover fit) -GET /img/products/az/store/brand/product/image-abc123.webp?w=200&h=200&fit=cover - -# JPEG at 70% quality -GET /img/products/az/store/brand/product/image-abc123.webp?w=400&format=jpeg&q=70 - -# Grayscale blur -GET /img/products/az/store/brand/product/image-abc123.webp?w=200&gray=1&blur=3 +./setup-local.sh # Start all services +./stop-local.sh # Stop all services ``` -**Frontend Usage:** -```typescript -import { getImageUrl, ImageSizes } from '../lib/images'; +| Service | URL | +|---------|-----| +| API | http://localhost:3010 | +| Admin | http://localhost:8080/admin | +| PostgreSQL | localhost:54320 | -// Returns /img/products/.../image.webp?w=50 for local images -// Returns original URL for remote images (CDN, etc.) 
-const thumbUrl = getImageUrl(product.image_url, ImageSizes.thumb); -const cardUrl = getImageUrl(product.image_url, ImageSizes.medium); -const detailUrl = getImageUrl(product.image_url, ImageSizes.detail); -``` +--- -**Size Presets:** -| Preset | Width | Use Case | -|--------|-------|----------| -| `thumb` | 50px | Table thumbnails | -| `small` | 100px | Small cards | -| `medium` | 200px | Grid cards | -| `large` | 400px | Large cards | -| `detail` | 600px | Product detail | -| `full` | - | No resize | - -### Storage Adapter - -```typescript -import { saveImage, getImageUrl } from '../utils/storage-adapter'; - -// Automatically uses local storage when STORAGE_DRIVER=local -``` - -### Files +## WordPress Plugin (ACTIVE) +### Plugin Files | File | Purpose | |------|---------| -| `backend/src/utils/image-storage.ts` | Image download and storage | -| `backend/src/routes/image-proxy.ts` | On-demand image resizing endpoint | -| `cannaiq/src/lib/images.ts` | Frontend image URL helper | -| `docker-compose.local.yml` | Local stack without MinIO | -| `start-local.sh` | Convenience startup script | +| `wordpress-plugin/cannaiq-menus.php` | Main plugin (CannaIQ brand) | +| `wordpress-plugin/crawlsy-menus.php` | Legacy plugin (Crawlsy brand) | +| `wordpress-plugin/VERSION` | Version tracking | + +### API Routes (Backend) +- `GET /api/v1/wordpress/dispensaries` - List dispensaries +- `GET /api/v1/wordpress/dispensary/:id/menu` - Get menu data +- Route file: `backend/src/routes/wordpress.ts` + +### Versioning +Bump `wordpress-plugin/VERSION` on changes: +- Minor (x.x.N): bug fixes +- Middle (x.N.0): new features +- Major (N.0.0): breaking changes (user must request) --- -## UI ANONYMIZATION RULES +## Documentation -- No vendor names in forward-facing URLs -- No "dutchie", "treez", "jane", "weedmaps", "leafly" visible in consumer UIs -- Internal admin tools may show provider names for debugging - ---- - -## DUTCHIE DISCOVERY PIPELINE (Added 2025-01) - -### Overview -Automated discovery of Dutchie-powered dispensaries across all US states. - -### Flow -``` -1. getAllCitiesByState GraphQL → Get all cities for a state -2. ConsumerDispensaries GraphQL → Get stores for each city -3. Upsert to dutchie_discovery_locations (keyed by platform_location_id) -4. AUTO-VALIDATE: Check required fields -5. AUTO-PROMOTE: Create/update dispensaries with crawl_enabled=true -6. 
Log all actions to dutchie_promotion_log -``` - -### Tables -| Table | Purpose | -|-------|---------| -| `dutchie_discovery_cities` | Cities known to have dispensaries | -| `dutchie_discovery_locations` | Raw discovered store data | -| `dispensaries` | Canonical stores (promoted from discovery) | -| `dutchie_promotion_log` | Audit trail for validation/promotion | - -### Files -| File | Purpose | -|------|---------| -| `src/discovery/discovery-crawler.ts` | Main orchestrator | -| `src/discovery/location-discovery.ts` | GraphQL fetching | -| `src/discovery/promotion.ts` | Validation & promotion logic | -| `src/scripts/run-discovery.ts` | CLI interface | -| `migrations/067_promotion_log.sql` | Audit log table | - -### GraphQL Hashes (in `src/platforms/dutchie/client.ts`) -| Query | Hash | -|-------|------| -| `GetAllCitiesByState` | `ae547a0466ace5a48f91e55bf6699eacd87e3a42841560f0c0eabed5a0a920e6` | -| `ConsumerDispensaries` | `0a5bfa6ca1d64ae47bcccb7c8077c87147cbc4e6982c17ceec97a2a4948b311b` | - -### Usage -```bash -# Discover all stores in a state -npx tsx src/scripts/run-discovery.ts discover:state AZ -npx tsx src/scripts/run-discovery.ts discover:state CA - -# Check stats -npx tsx src/scripts/run-discovery.ts stats -``` - -### Validation Rules -A discovery location must have: -- `platform_location_id` (MongoDB ObjectId, 24 hex chars) -- `name` -- `city` -- `state_code` -- `platform_menu_url` - -Invalid records are marked `status='rejected'` with errors logged. - -### Key Design Decisions -- `platform_location_id` MUST be MongoDB ObjectId (not slug) -- Old geo-based discovery stored slugs → deleted as garbage data -- Rate limit: 2 seconds between city requests to avoid API throttling -- Promotion is idempotent via `ON CONFLICT (platform_dispensary_id)` - ---- - -## FUTURE TODO / PENDING FEATURES - -- [ ] Orchestrator observability dashboard -- [ ] Crawl profile management UI -- [ ] State machine sandbox (disabled until authorized) -- [ ] Multi-state expansion beyond AZ - ---- - -### Multi-Site Architecture (CRITICAL) - -This project has **4 active locations** (plus 1 deprecated) - always clarify which one before making changes: - -| Folder | Domain | Type | Purpose | -|--------|--------|------|---------| -| `backend/` | (shared) | Express API | Single backend serving all frontends | -| `frontend/` | (DEPRECATED) | React SPA (Vite) | DEPRECATED - was dispos.crawlsy.com, now removed | -| `cannaiq/` | cannaiq.co | React SPA + PWA | Admin dashboard / B2B analytics | -| `findadispo/` | findadispo.com | React SPA + PWA | Consumer dispensary finder | -| `findagram/` | findagram.co | React SPA + PWA | Consumer delivery marketplace | - -**NOTE: `frontend/` folder is DEPRECATED:** -- `frontend/` = OLD/legacy dashboard - NO LONGER DEPLOYED (removed from k8s) -- `cannaiq/` = Primary admin dashboard, deployed to `cannaiq.co` -- Do NOT use or modify `frontend/` folder - it will be archived/removed - -**Before any frontend work, ASK: "Which site? 
cannaiq, findadispo, or findagram?"** - -All three active frontends share: -- Same backend API (port 3010) -- Same PostgreSQL database -- Same Kubernetes deployment for backend - -Each frontend has: -- Its own folder, package.json, Dockerfile -- Its own domain and branding -- Its own PWA manifest and service worker (cannaiq, findadispo, findagram) -- Separate Docker containers in production - ---- - -### Multi-Domain Hosting Architecture - -All three frontends are served from the **same IP** using **host-based routing**: - -**Kubernetes Ingress (Production):** -```yaml -# Each domain routes to its own frontend service -apiVersion: networking.k8s.io/v1 -kind: Ingress -metadata: - name: multi-site-ingress -spec: - rules: - - host: cannaiq.co - http: - paths: - - path: / - backend: - service: - name: cannaiq-frontend - port: 80 - - path: /api - backend: - service: - name: scraper # shared backend - port: 3010 - - host: findadispo.com - http: - paths: - - path: / - backend: - service: - name: findadispo-frontend - port: 80 - - path: /api - backend: - service: - name: scraper - port: 3010 - - host: findagram.co - http: - paths: - - path: / - backend: - service: - name: findagram-frontend - port: 80 - - path: /api - backend: - service: - name: scraper - port: 3010 -``` - -**Key Points:** -- DNS A records for all 3 domains point to same IP -- Ingress controller routes based on `Host` header -- Each frontend is a separate Docker container (nginx serving static files) -- All frontends share the same backend API at `/api/*` -- SSL/TLS handled at ingress level (cert-manager) - ---- - -### PWA Setup Requirements - -Each frontend is a **Progressive Web App (PWA)**. Required files in each `public/` folder: - -1. **manifest.json** - App metadata, icons, theme colors -2. **service-worker.js** - Offline caching, background sync -3. **Icons** - 192x192 and 512x512 PNG icons - -**Vite PWA Plugin Setup** (in each frontend's vite.config.ts): -```typescript -import { VitePWA } from 'vite-plugin-pwa' - -export default defineConfig({ - plugins: [ - react(), - VitePWA({ - registerType: 'autoUpdate', - manifest: { - name: 'Site Name', - short_name: 'Short', - theme_color: '#10b981', - icons: [ - { src: '/icon-192.png', sizes: '192x192', type: 'image/png' }, - { src: '/icon-512.png', sizes: '512x512', type: 'image/png' } - ] - }, - workbox: { - globPatterns: ['**/*.{js,css,html,ico,png,svg,woff2}'] - } - }) - ] -}) -``` - ---- - -### Core Rules Summary - -- **DB**: Use the single CannaiQ database via `CANNAIQ_DB_*` env vars. No hardcoded names. -- **Images**: No MinIO. Save to local /images/products//-.webp (and brands); preserve original URL; serve via backend static. -- **Dutchie GraphQL**: Endpoint https://dutchie.com/api-3/graphql. Variables must use productsFilter.dispensaryId (platform_dispensary_id). **CRITICAL: Use `Status: 'Active'`, NOT `null`** (null returns 0 products). -- **cName/slug**: Derive cName from each store's menu_url (/embedded-menu/ or /dispensary/). No hardcoded defaults. -- **Batch DB writes**: Chunk products/snapshots/missing (100–200) to avoid OOM. -- **API/Frontend**: Use `/api/stores`, `/api/products`, `/api/workers`, `/api/pipeline` endpoints. -- **Scheduling**: Crawl only menu_type='dutchie' AND platform_dispensary_id IS NOT NULL. 4-hour crawl with jitter. -- **THC/CBD values**: Clamp to ≤100 - some products report milligrams as percentages. -- **Column names**: Use `name_raw`, `brand_name_raw`, `category_raw`, `subcategory_raw` (NOT `name`, `brand_name`, etc.) 
- -- **Monitor**: `/api/workers` shows active/recent jobs from job queue. -- **No slug guessing**: Never use defaults. Always derive per store from menu_url and resolve platform IDs per location. - -**📖 Full Documentation: See `docs/DUTCHIE_CRAWL_WORKFLOW.md` for complete pipeline documentation.** - ---- - -### Detailed Rules - -1) **Dispensary = Store (SAME THING)** - - "Dispensary" and "store" are synonyms in CannaiQ. Use interchangeably. - - **API endpoint**: `/api/stores` (NOT `/api/dispensaries`) - - **DB table**: `dispensaries` - - When you need to create/query stores via API, use `/api/stores` - - Use the record's `menu_url` and `platform_dispensary_id`. - -2) **API Authentication** - - **Trusted Origins (no auth needed)**: - - IPs: `127.0.0.1`, `::1`, `::ffff:127.0.0.1` - - Origins: `https://cannaiq.co`, `https://findadispo.com`, `https://findagram.co` - - Also: `http://localhost:3010`, `http://localhost:8080`, `http://localhost:5173` - - Requests from trusted IPs/origins get automatic admin access (`role: 'internal'`) - - **Remote (non-trusted)**: Use Bearer token (JWT or API token). NO username/password auth. - - Never try to login with username/password via API - use tokens only. - - See `src/auth/middleware.ts` for `TRUSTED_ORIGINS` and `TRUSTED_IPS` lists. - -3) **Menu detection and platform IDs** - - Set `menu_type` from `menu_url` detection; resolve `platform_dispensary_id` for `menu_type='dutchie'`. - - Admin should have "refresh detection" and "resolve ID" actions; schedule/crawl only when `menu_type='dutchie'` AND `platform_dispensary_id` is set. - -4) **Queries and mapping** - - The DB returns snake_case; code expects camelCase. Always alias/map: - - `platform_dispensary_id AS "platformDispensaryId"` - - Map via `mapDbRowToDispensary` when loading dispensaries (scheduler, crawler, admin crawl). - - Avoid `SELECT *`; explicitly select and/or map fields. - -4) **Scheduling** - - `/scraper-schedule` should accept filters/search (All vs AZ-only, name). - - "Run Now"/scheduler must skip or warn if `menu_type!='dutchie'` or `platform_dispensary_id` missing. - - Use `dispensary_crawl_status` view; show reason when not crawlable. - -5) **Crawling** - - Trigger dutchie crawls by dispensary ID (e.g., `POST /api/admin/crawl/:id`). - - Update existing products (by stable product ID), append snapshots for history (every 4h cadence), download images locally (`/images/...`), store local URLs. - - Use dutchie GraphQL pipeline only for `menu_type='dutchie'`. - -6) **Frontend** - - Forward-facing URLs should not contain vendor names. - - `/scraper-schedule`: add filters/search, keep as master view for all schedules; reflect platform ID/menu_type status and controls. - -7) **No slug guessing** - - Do not guess slugs; use the DB record's `menu_url` and ID. Resolve platform ID from the URL/cName; if set, crawl directly by ID. - -8) **Image storage (no MinIO)** - - Save images to local filesystem only. Do not create or use MinIO in Docker. - - Product images: `/images/products//-.webp` (+medium/+thumb). - - Brand images: `/images/brands/-.webp`. - - Store local URLs in DB fields (keep original URLs as fallback only). - - Serve `/images` via backend static middleware. - -9) **Dutchie GraphQL fetch rules** - - **Endpoint**: `https://dutchie.com/api-3/graphql` - - **Variables**: Use `productsFilter.dispensaryId` = `platform_dispensary_id` (MongoDB ObjectId). 
- - **Mode A**: `Status: "Active"` - returns active products with pricing - - **Mode B**: `Status: null` / `activeOnly: false` - returns all products including OOS/inactive - - **Headers** (server-side axios only): Chrome UA, `Origin: https://dutchie.com`, `Referer: https://dutchie.com/embedded-menu/`. - -10) **Batch DB writes to avoid OOM** - - Do NOT build one giant upsert/insert payload for products/snapshots/missing marks. - - Chunk arrays (e.g., 100–200 items) and upsert/insert in a loop; drop references after each chunk. - -11) **Use dual-mode crawls by default** - - Always run with `useBothModes:true` to combine Mode A (pricing) + Mode B (full coverage). - - Union/dedupe by product ID so you keep full coverage and pricing in one run. - -12) **Capture OOS and missing items** - - GraphQL variables must include inactive/OOS (Status: All / activeOnly:false). - - After unioning Mode A/B, upsert products and insert snapshots with stock_status from the feed. - - If an existing product is absent from both modes, insert a snapshot with is_present_in_feed=false and stock_status='missing_from_feed'. - -13) **Preserve all stock statuses (including unknown)** - - Do not filter or drop stock_status values in API/UI; pass through whatever is stored. - - Expected values: in_stock, out_of_stock, missing_from_feed, unknown. - -14) **Never delete or overwrite historical data** - - Do not delete products/snapshots or overwrite historical records. - - Always append snapshots for changes (price/stock/qty), and mark missing_from_feed instead of removing records. - -15) **Per-location cName and platform_dispensary_id resolution** - - For each dispensary, menu_url and cName must be valid for that exact location. - - Derive cName from menu_url per store: `/embedded-menu/` or `/dispensary/`. - - Resolve platform_dispensary_id from that cName using GraphQL GetAddressBasedDispensaryData. - - If the slug is invalid/missing, mark the store not crawlable and log it. - -16) **API Route Semantics** - - **Route Groups (as registered in `src/index.ts`):** - - `/api/stores` = Store/dispensary CRUD and listing - - `/api/products` = Product listing and details - - `/api/workers` = Job queue monitoring (replaces legacy `/api/dutchie-az/...`) - - `/api/pipeline` = Crawl pipeline triggers - - `/api/admin/orchestrator` = Orchestrator admin actions - - `/api/discovery` = Platform discovery (Dutchie, etc.) - - `/api/v1/...` = Public API for external consumers (WordPress, etc.) - - **Crawl Trigger:** - Check `/api/pipeline` or `/api/admin/orchestrator` routes for crawl triggers. - The legacy `POST /api/admin/crawl/:dispensaryId` does NOT exist. - -17) **Monitoring and logging** - - `/api/workers` shows active/recent jobs from job queue - - Auto-refresh every 30 seconds - - System Logs page should show real log data, not just startup messages - -18) **Dashboard Architecture** - - **Frontend**: Rebuild the frontend with `VITE_API_URL` pointing to the correct backend and redeploy. - - **Backend**: `/api/dashboard/stats` MUST use the canonical DB pool. Use the correct tables: `store_products`, `dispensaries`, and views like `v_dashboard_stats`, `v_latest_snapshots`. - -19) **Deployment (Gitea + Kubernetes)** - - **Registry**: Gitea at `code.cannabrands.app/creationshop/dispensary-scraper` - - **Build and push** (from backend directory): - ```bash - docker login code.cannabrands.app - cd backend - docker build -t code.cannabrands.app/creationshop/dispensary-scraper:latest . 
- docker push code.cannabrands.app/creationshop/dispensary-scraper:latest - ``` - - **Deploy to Kubernetes**: - ```bash - kubectl rollout restart deployment/scraper -n dispensary-scraper - kubectl rollout restart deployment/scraper-worker -n dispensary-scraper - kubectl rollout status deployment/scraper -n dispensary-scraper - ``` - - K8s manifests are in `/k8s/` folder (scraper.yaml, scraper-worker.yaml, etc.) - -20) **Crawler Architecture** - - **Scraper pod (1 replica)**: Runs the Express API server + scheduler. - - **Scraper-worker pods (25 replicas)**: Each runs `dist/tasks/task-worker.js`, polling the job queue. - - **Worker naming**: Pods use fantasy names (Aethelgard, Xylos, Kryll, Coriolis, etc.) - see `k8s/scraper-worker.yaml` ConfigMap. Worker IDs: `{PodName}-worker-{n}` - - **Job types**: `menu_detection`, `menu_detection_single`, `dutchie_product_crawl` - - **Job schedules** (managed in `job_schedules` table): - - `dutchie_az_menu_detection`: Runs daily with 60-min jitter - - `dutchie_az_product_crawl`: Runs every 4 hours with 30-min jitter - - **Monitor jobs**: `GET /api/workers` - - **Trigger crawls**: Check `/api/pipeline` routes - -21) **Frontend Architecture - AVOID OVER-ENGINEERING** - - **Key Principles:** - - **ONE BACKEND** serves ALL domains (cannaiq.co, findadispo.com, findagram.co) - - Do NOT create separate backend services for each domain - - **Frontend Build Differences:** - - `cannaiq/` uses **Vite** (outputs to `dist/`, uses `VITE_` env vars) → cannaiq.co - - `findadispo/` uses **Create React App** (outputs to `build/`, uses `REACT_APP_` env vars) → findadispo.com - - `findagram/` uses **Create React App** (outputs to `build/`, uses `REACT_APP_` env vars) → findagram.co - - **CRA vs Vite Dockerfile Differences:** - ```dockerfile - # Vite (cannaiq) - ENV VITE_API_URL=https://api.domain.com - RUN npm run build - COPY --from=builder /app/dist /usr/share/nginx/html - - # CRA (findadispo, findagram) - ENV REACT_APP_API_URL=https://api.domain.com - RUN npm run build - COPY --from=builder /app/build /usr/share/nginx/html - ``` - - **Common Mistakes to AVOID:** - - Creating a FastAPI/Express backend just for findagram or findadispo - - Creating separate Docker images per domain when one would work - - Using `npm ci` in Dockerfiles when package-lock.json doesn't exist (use `npm install`) - ---- - -## Admin UI Integration (Dutchie Discovery System) - -The admin frontend includes a dedicated Discovery page located at: - - cannaiq/src/pages/Discovery.tsx - -This page is the operational interface that administrators use for -managing the Dutchie discovery pipeline. While it does not define API -features itself, it is the primary consumer of the Dutchie Discovery API. 
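As a rough illustration, the sketch below shows how the Discovery page might drive a verify-and-link flow using the `cannaiq/src/lib/api.ts` helpers listed later in this section. The helper names and parameters follow that list; the response shapes (for example `candidates[0].dispensaryId`) are assumptions.

```typescript
// Sketch of a verify-and-link flow for Discovery.tsx, using the api.ts helpers
// documented below. Response field names are assumptions.
import {
  getPlatformDiscoveryLocations,
  getPlatformLocationMatchCandidates,
  verifyLinkPlatformLocation,
} from '../lib/api';

async function reviewAndLink(locationId: number, verifiedBy: string) {
  // Discovered locations stay non-canonical until an operator verifies them.
  const locations = await getPlatformDiscoveryLocations('dt', { status: 'discovered' });

  // Ask the backend which canonical dispensaries look like the same store.
  const candidates = await getPlatformLocationMatchCandidates('dt', locationId);

  // Linking is an explicit operator action, never automatic.
  if (candidates.length > 0) {
    await verifyLinkPlatformLocation('dt', locationId, candidates[0].dispensaryId, verifiedBy);
  }

  return { locations, candidates };
}
```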
- -### Responsibilities of the Discovery UI - -The UI enables administrators to: - -- View all discovered Dutchie locations -- Filter by status: - - discovered - - verified - - merged (linked to an existing dispensary) - - rejected -- Inspect individual location details (metadata, raw address, menu URL) -- Verify & create a new canonical dispensary -- Verify & link to an existing canonical dispensary -- Reject or unreject discovered locations -- Promote verified/merged locations into full crawlers via the orchestrator - -### API Endpoints Consumed by the Discovery UI - -The Discovery UI uses platform-agnostic routes with neutral slugs (see `docs/platform-slug-mapping.md`): - -**Platform Slug**: `dt` = Dutchie (trademark-safe URL) - -- `GET /api/discovery/platforms/dt/locations` -- `GET /api/discovery/platforms/dt/locations/:id` -- `POST /api/discovery/platforms/dt/locations/:id/verify-create` -- `POST /api/discovery/platforms/dt/locations/:id/verify-link` -- `POST /api/discovery/platforms/dt/locations/:id/reject` -- `POST /api/discovery/platforms/dt/locations/:id/unreject` -- `GET /api/discovery/platforms/dt/locations/:id/match-candidates` -- `GET /api/discovery/platforms/dt/cities` -- `GET /api/discovery/platforms/dt/summary` -- `POST /api/orchestrator/platforms/dt/promote/:id` - -These endpoints are defined in: -- `backend/src/dutchie-az/discovery/routes.ts` -- `backend/src/dutchie-az/discovery/promoteDiscoveryLocation.ts` - -### Frontend API Helper - -The file: - - cannaiq/src/lib/api.ts - -implements the client-side wrappers for calling these endpoints: - -- `getPlatformDiscoverySummary(platformSlug)` -- `getPlatformDiscoveryLocations(platformSlug, params)` -- `getPlatformDiscoveryLocation(platformSlug, id)` -- `verifyCreatePlatformLocation(platformSlug, id, verifiedBy)` -- `verifyLinkPlatformLocation(platformSlug, id, dispensaryId, verifiedBy)` -- `rejectPlatformLocation(platformSlug, id, reason, verifiedBy)` -- `unrejectPlatformLocation(platformSlug, id)` -- `getPlatformLocationMatchCandidates(platformSlug, id)` -- `getPlatformDiscoveryCities(platformSlug, params)` -- `promotePlatformDiscoveryLocation(platformSlug, id)` - -Where `platformSlug` is a neutral two-letter slug (e.g., `'dt'` for Dutchie). -These helpers must be kept synchronized with backend routes. - -### UI/Backend Contract - -The Discovery UI must always: -- Treat discovery data as **non-canonical** until verified. -- Not assume a discovery location is crawl-ready. -- Initiate promotion only after verification steps. -- Handle all statuses safely: discovered, verified, merged, rejected. - -The backend must always: -- Preserve discovery data even if rejected. -- Never automatically merge or promote a location. -- Allow idempotent verification and linking actions. -- Expose complete metadata to help operators make verification decisions. - -# Coordinate Capture (Platform Discovery) - -The DtLocationDiscoveryService captures geographic coordinates (latitude, longitude) whenever a platform's store payload provides them. - -## Behavior: - -- On INSERT: - - If the Dutchie API/GraphQL payload includes coordinates, they are saved into: - - dutchie_discovery_locations.latitude - - dutchie_discovery_locations.longitude - -- On UPDATE: - - Coordinates are only filled if the existing row has NULL values. - - Coordinates are never overwritten once set (prevents pollution if later payloads omit or degrade coordinate accuracy). 
- -- Logging: - - When coordinates are detected and captured: - "Extracted coordinates for : , " - -- Summary Statistics: - - The discovery runner reports a count of: - - locations with coordinates - - locations without coordinates - -## Purpose: - -Coordinate capture enables: -- City/state validation (cross-checking submitted address vs lat/lng) -- Distance-based duplicate detection -- Location clustering for analytics -- Mapping/front-end visualization -- Future multi-platform reconciliation -- Improved dispensary matching during verify-link flow - -Coordinate capture is part of the discovery phase only. -Canonical `dispensaries` entries may later be enriched with verified coordinates during promotion. - -# CannaiQ — Analytics V2 Examples & API Structure Extension - -This section contains examples from `backend/docs/ANALYTICS_V2_EXAMPLES.md` and extends the Analytics V2 API definition to include: - -- response payload formats -- time window semantics -- rec/med segmentation usage -- SQL/TS pseudo-code examples -- endpoint expectations - ---- - -# Analytics V2: Supported Endpoints - -Base URL prefix: /api/analytics/v2 - -All endpoints accept `?window=7d|30d|90d` unless noted otherwise. - -## 1. Price Analytics - -### GET /api/analytics/v2/price/product/:storeProductId -Returns price history for a canonical store product. - -Example response: -{ - "storeProductId": 123, - "window": "30d", - "points": [ - { "date": "2025-02-01", "price": 32, "in_stock": true }, - { "date": "2025-02-02", "price": 30, "in_stock": true } - ] -} - -### GET /api/analytics/v2/price/rec-vs-med?categoryId=XYZ -Compares category pricing between recreational and medical-only states. - -Example response: -{ - "categoryId": "flower", - "rec": { "avg": 29.44, "median": 28.00, "states": ["CO", "WA", ...] }, - "med": { "avg": 33.10, "median": 31.00, "states": ["FL", "PA", ...] } -} - ---- - -## 2. Brand Analytics - -### GET /api/analytics/v2/brand/:name/penetration -Returns penetration across states. - -{ - "brand": "Wyld", - "window": "90d", - "penetration": [ - { "state": "AZ", "stores": 28 }, - { "state": "MI", "stores": 34 } - ] -} - -### GET /api/analytics/v2/brand/:name/rec-vs-med -Returns penetration split by rec vs med segmentation. - ---- - -## 3. Category Analytics - -### GET /api/analytics/v2/category/:name/growth -7d/30d/90d snapshot comparison: - -{ - "category": "vape", - "window": "30d", - "growth": { - "current_sku_count": 420, - "previous_sku_count": 380, - "delta": 40 - } -} - -### GET /api/analytics/v2/category/rec-vs-med -Category-level comparisons. - ---- - -## 4. Store Analytics - -### GET /api/analytics/v2/store/:storeId/changes -Product-level changes: - -{ - "storeId": 88, - "window": "30d", - "added": [...], - "removed": [...], - "price_changes": [...], - "restocks": [...], - "oos_events": [...] -} - -### GET /api/analytics/v2/store/:storeId/summary - ---- - -## 5. State Analytics - -### GET /api/analytics/v2/state/legal-breakdown -State rec/med/no-program segmentation summary. - -### GET /api/analytics/v2/state/rec-vs-med-pricing -State-level pricing comparison. - -### GET /api/analytics/v2/state/recreational -List rec-legal state codes. - -### GET /api/analytics/v2/state/medical-only -List med-only state codes. - ---- - -# Windowing Semantics - -Definition: window is applied to canonical snapshots. 
-Equivalent to: - -WHERE snapshot_at >= NOW() - INTERVAL '' - ---- - -# Rec/Med Segmentation Rules - -rec_states: - states.recreational_legal = TRUE - -med_only_states: - states.medical_legal = TRUE AND states.recreational_legal = FALSE - -no_program: - both flags FALSE or NULL - -Analytics must use this segmentation consistently. - ---- - -# Response Structure Requirements - -Every analytics v2 endpoint must: - -- include the window used -- include segmentation if relevant -- include state codes when state-level grouping is used -- return safe empty arrays if no data -- NEVER throw on missing data -- be versionable (v2 must not break previous analytics APIs) - ---- - -# Service Responsibilities Summary - -### PriceAnalyticsService -- compute time-series price trends -- compute average/median price by state -- compute rec-vs-med price comparisons - -### BrandPenetrationService -- compute presence across stores and states -- rec-vs-med brand footprint -- detect expansion / contraction - -### CategoryAnalyticsService -- compute SKU count changes -- category pricing -- rec-vs-med category dynamics - -### StoreAnalyticsService -- detect SKU additions/drops -- price changes -- restocks & OOS events - -### StateAnalyticsService -- legal breakdown -- coverage gaps -- rec-vs-med scoring - ---- - -# END Analytics V2 spec extension - ---- - -## WordPress Plugin Versioning - -The WordPress plugin version is tracked in `wordpress-plugin/VERSION`. - -**Current version:** Check `wordpress-plugin/VERSION` for the latest version. - -**Versioning rules:** -- **Minor bumps (x.x.N)**: Bug fixes, small improvements - default for most changes -- **Middle bumps (x.N.0)**: New features, significant improvements -- **Major bumps (N.0.0)**: Breaking changes, major rewrites - only when user explicitly requests - -**When making WP plugin changes:** -1. Read `wordpress-plugin/VERSION` to get current version -2. Bump the version number (minor by default) -3. Update both files: - - `wordpress-plugin/VERSION` - - Plugin header `Version:` in `cannaiq-menus.php` and/or `crawlsy-menus.php` - - The `define('..._VERSION', '...')` constant in each plugin file - -**Plugin files:** -| File | Brand | API URL | -|------|-------|---------| -| `cannaiq-menus.php` | CannaIQ | `https://cannaiq.co/api/v1` | -| `crawlsy-menus.php` | Crawlsy (legacy) | `https://cannaiq.co/api/v1` | - -Both plugins use the same API endpoint. The Crawlsy version exists for backward compatibility with existing installations. 
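A minimal sketch of that bump procedure as a one-off script, assuming the standard WordPress plugin header layout. The file paths come from the table above; the regexes and the exact `define('..._VERSION', ...)` constant names are assumptions and differ per plugin file.

```typescript
// Hypothetical helper (not part of the repo): bumps the minor (x.x.N) component
// and rewrites the Version header and version constant in both plugin files.
import * as fs from 'fs';
import * as path from 'path';

const pluginDir = path.resolve('wordpress-plugin');
const versionFile = path.join(pluginDir, 'VERSION');

const current = fs.readFileSync(versionFile, 'utf8').trim();   // e.g. "1.4.2"
const [major, middle, minor] = current.split('.').map(Number);
const next = `${major}.${middle}.${minor + 1}`;                // minor bump by default

fs.writeFileSync(versionFile, `${next}\n`);

for (const file of ['cannaiq-menus.php', 'crawlsy-menus.php']) {
  const fullPath = path.join(pluginDir, file);
  let source = fs.readFileSync(fullPath, 'utf8');
  // Plugin header line, e.g. " * Version: 1.4.2"
  source = source.replace(/^(\s*\*?\s*Version:\s*)[\d.]+/m, `$1${next}`);
  // Version constant: define('..._VERSION', '1.4.2'); constant name varies per plugin.
  source = source.replace(/(_VERSION',\s*')[\d.]+(')/, `$1${next}$2`);
  fs.writeFileSync(fullPath, source);
  console.log(`${file}: ${current} -> ${next}`);
}
```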
+| Doc | Purpose | +|-----|---------| +| `backend/docs/CODEBASE_MAP.md` | Current files/directories | +| `backend/docs/_archive/` | Historical docs (may be outdated) | diff --git a/backend/docs/CODEBASE_MAP.md b/backend/docs/CODEBASE_MAP.md new file mode 100644 index 00000000..f1cbfef6 --- /dev/null +++ b/backend/docs/CODEBASE_MAP.md @@ -0,0 +1,218 @@ +# CannaiQ Backend Codebase Map + +**Last Updated:** 2025-12-12 +**Purpose:** Help Claude and developers understand which code is current vs deprecated + +--- + +## Quick Reference: What to Use + +### For Crawling/Scraping +| Task | Use This | NOT This | +|------|----------|----------| +| Fetch products | `src/tasks/handlers/payload-fetch.ts` | `src/hydration/*` | +| Process products | `src/tasks/handlers/product-refresh.ts` | `src/scraper-v2/*` | +| GraphQL client | `src/platforms/dutchie/client.ts` | `src/dutchie-az/services/graphql-client.ts` | +| Worker system | `src/tasks/task-worker.ts` | `src/dutchie-az/services/worker.ts` | + +### For Database +| Task | Use This | NOT This | +|------|----------|----------| +| Get DB pool | `src/db/pool.ts` | `src/dutchie-az/db/connection.ts` | +| Run migrations | `src/db/migrate.ts` (CLI only) | Never import at runtime | +| Query products | `store_products` table | `products`, `dutchie_products` | +| Query stores | `dispensaries` table | `stores` table | + +### For Discovery +| Task | Use This | +|------|----------| +| Discover stores | `src/discovery/*.ts` | +| Run discovery | `npx tsx src/scripts/run-discovery.ts` | + +--- + +## Directory Status + +### ACTIVE DIRECTORIES (Use These) + +``` +src/ +├── auth/ # JWT/session auth, middleware +├── db/ # Database pool, migrations +├── discovery/ # Dutchie store discovery pipeline +├── middleware/ # Express middleware +├── multi-state/ # Multi-state query support +├── platforms/ # Platform-specific clients (Dutchie, Jane, etc) +│ └── dutchie/ # THE Dutchie client - use this one +├── routes/ # Express API routes +├── services/ # Core services (logger, scheduler, etc) +├── tasks/ # Task system (workers, handlers, scheduler) +│ └── handlers/ # Task handlers (payload_fetch, product_refresh, etc) +├── types/ # TypeScript types +└── utils/ # Utilities (storage, image processing) +``` + +### DEPRECATED DIRECTORIES (DO NOT USE) + +``` +src/ +├── hydration/ # DEPRECATED - Old pipeline approach +├── scraper-v2/ # DEPRECATED - Old scraper engine +├── canonical-hydration/# DEPRECATED - Merged into tasks/handlers +├── dutchie-az/ # PARTIAL - Some parts deprecated, some active +│ ├── db/ # DEPRECATED - Use src/db/pool.ts +│ └── services/ # PARTIAL - worker.ts still runs, graphql-client.ts deprecated +├── portals/ # FUTURE - Not yet implemented +├── seo/ # PARTIAL - Settings work, templates WIP +└── system/ # DEPRECATED - Old orchestration system +``` + +### DEPRECATED FILES (DO NOT USE) + +``` +src/dutchie-az/db/connection.ts # Use src/db/pool.ts instead +src/dutchie-az/services/graphql-client.ts # Use src/platforms/dutchie/client.ts +src/hydration/*.ts # Entire directory deprecated +src/scraper-v2/*.ts # Entire directory deprecated +``` + +--- + +## Key Files Reference + +### Entry Points +| File | Purpose | Status | +|------|---------|--------| +| `src/index.ts` | Main Express server | ACTIVE | +| `src/dutchie-az/services/worker.ts` | Worker process entry | ACTIVE | +| `src/tasks/task-worker.ts` | Task worker (new system) | ACTIVE | + +### Dutchie Integration +| File | Purpose | Status | +|------|---------|--------| +| `src/platforms/dutchie/client.ts` | GraphQL 
client, hashes, curl | **PRIMARY** | +| `src/platforms/dutchie/queries.ts` | High-level query functions | ACTIVE | +| `src/platforms/dutchie/index.ts` | Re-exports | ACTIVE | + +### Task Handlers +| File | Purpose | Status | +|------|---------|--------| +| `src/tasks/handlers/payload-fetch.ts` | Fetch products from Dutchie | **PRIMARY** | +| `src/tasks/handlers/product-refresh.ts` | Process payload into DB | **PRIMARY** | +| `src/tasks/handlers/menu-detection.ts` | Detect menu type | ACTIVE | +| `src/tasks/handlers/id-resolution.ts` | Resolve platform IDs | ACTIVE | +| `src/tasks/handlers/image-download.ts` | Download product images | ACTIVE | + +### Database +| File | Purpose | Status | +|------|---------|--------| +| `src/db/pool.ts` | Canonical DB pool | **PRIMARY** | +| `src/db/migrate.ts` | Migration runner (CLI only) | CLI ONLY | +| `src/db/auto-migrate.ts` | Auto-run migrations on startup | ACTIVE | + +### Configuration +| File | Purpose | Status | +|------|---------|--------| +| `.env` | Environment variables | ACTIVE | +| `package.json` | Dependencies | ACTIVE | +| `tsconfig.json` | TypeScript config | ACTIVE | + +--- + +## GraphQL Hashes (CRITICAL) + +The correct hashes are in `src/platforms/dutchie/client.ts`: + +```typescript +export const GRAPHQL_HASHES = { + FilteredProducts: 'ee29c060826dc41c527e470e9ae502c9b2c169720faa0a9f5d25e1b9a530a4a0', + GetAddressBasedDispensaryData: '13461f73abf7268770dfd05fe7e10c523084b2bb916a929c08efe3d87531977b', + ConsumerDispensaries: '0a5bfa6ca1d64ae47bcccb7c8077c87147cbc4e6982c17ceec97a2a4948b311b', + GetAllCitiesByState: 'ae547a0466ace5a48f91e55bf6699eacd87e3a42841560f0c0eabed5a0a920e6', +}; +``` + +**ALWAYS** use `Status: 'Active'` for FilteredProducts (not `null` or `'All'`). + +--- + +## Scripts Reference + +### Useful Scripts (in `src/scripts/`) +| Script | Purpose | +|--------|---------| +| `run-discovery.ts` | Run Dutchie discovery | +| `crawl-single-store.ts` | Test crawl a single store | +| `test-dutchie-graphql.ts` | Test GraphQL queries | + +### One-Off Scripts (probably don't need) +| Script | Purpose | +|--------|---------| +| `harmonize-az-dispensaries.ts` | One-time data cleanup | +| `bootstrap-stores-for-dispensaries.ts` | One-time migration | +| `backfill-*.ts` | Historical backfill scripts | + +--- + +## API Routes + +### Active Routes (in `src/routes/`) +| Route File | Mount Point | Purpose | +|------------|-------------|---------| +| `auth.ts` | `/api/auth` | Login/logout/session | +| `stores.ts` | `/api/stores` | Store CRUD | +| `dashboard.ts` | `/api/dashboard` | Dashboard stats | +| `workers.ts` | `/api/workers` | Worker monitoring | +| `pipeline.ts` | `/api/pipeline` | Crawl triggers | +| `discovery.ts` | `/api/discovery` | Discovery management | +| `analytics.ts` | `/api/analytics` | Analytics queries | +| `wordpress.ts` | `/api/v1/wordpress` | WordPress plugin API | + +--- + +## Documentation Files + +### Current Docs (in `backend/docs/`) +| Doc | Purpose | Currency | +|-----|---------|----------| +| `TASK_WORKFLOW_2024-12-10.md` | Task system architecture | CURRENT | +| `WORKER_TASK_ARCHITECTURE.md` | Worker/task design | CURRENT | +| `CRAWL_PIPELINE.md` | Crawl pipeline overview | CURRENT | +| `ORGANIC_SCRAPING_GUIDE.md` | Browser-based scraping | CURRENT | +| `CODEBASE_MAP.md` | This file | CURRENT | +| `ANALYTICS_V2_EXAMPLES.md` | Analytics API examples | CURRENT | +| `BRAND_INTELLIGENCE_API.md` | Brand API docs | CURRENT | + +### Root Docs +| Doc | Purpose | Currency | +|-----|---------|----------| +| `CLAUDE.md` 
| Claude instructions | **PRIMARY** | +| `README.md` | Project overview | NEEDS UPDATE | + +--- + +## Common Mistakes to Avoid + +1. **Don't use `src/hydration/`** - It's an old approach that was superseded by the task system + +2. **Don't use `src/dutchie-az/db/connection.ts`** - Use `src/db/pool.ts` instead + +3. **Don't import `src/db/migrate.ts` at runtime** - It will crash. Only use for CLI migrations. + +4. **Don't query `stores` table** - It's empty. Use `dispensaries`. + +5. **Don't query `products` table** - It's empty. Use `store_products`. + +6. **Don't use wrong GraphQL hash** - Always get hash from `GRAPHQL_HASHES` in client.ts + +7. **Don't use `Status: null`** - It returns 0 products. Use `Status: 'Active'`. + +--- + +## When in Doubt + +1. Check if the file is imported in `src/index.ts` - if not, it may be deprecated +2. Check the last modified date - older files may be stale +3. Look for `DEPRECATED` comments in the code +4. Ask: "Is there a newer version of this in `src/tasks/` or `src/platforms/`?" +5. Read the relevant doc in `docs/` before modifying code diff --git a/backend/docs/ANALYTICS_RUNBOOK.md b/backend/docs/_archive/ANALYTICS_RUNBOOK.md similarity index 100% rename from backend/docs/ANALYTICS_RUNBOOK.md rename to backend/docs/_archive/ANALYTICS_RUNBOOK.md diff --git a/backend/docs/ANALYTICS_V2_EXAMPLES.md b/backend/docs/_archive/ANALYTICS_V2_EXAMPLES.md similarity index 100% rename from backend/docs/ANALYTICS_V2_EXAMPLES.md rename to backend/docs/_archive/ANALYTICS_V2_EXAMPLES.md diff --git a/backend/docs/BRAND_INTELLIGENCE_API.md b/backend/docs/_archive/BRAND_INTELLIGENCE_API.md similarity index 100% rename from backend/docs/BRAND_INTELLIGENCE_API.md rename to backend/docs/_archive/BRAND_INTELLIGENCE_API.md diff --git a/backend/docs/CRAWL_PIPELINE.md b/backend/docs/_archive/CRAWL_PIPELINE.md similarity index 100% rename from backend/docs/CRAWL_PIPELINE.md rename to backend/docs/_archive/CRAWL_PIPELINE.md diff --git a/backend/docs/_archive/ORGANIC_SCRAPING_GUIDE.md b/backend/docs/_archive/ORGANIC_SCRAPING_GUIDE.md new file mode 100644 index 00000000..cc89140b --- /dev/null +++ b/backend/docs/_archive/ORGANIC_SCRAPING_GUIDE.md @@ -0,0 +1,297 @@ +# Organic Browser-Based Scraping Guide + +**Last Updated:** 2025-12-12 +**Status:** Production-ready proof of concept + +--- + +## Overview + +This document describes the "organic" browser-based approach to scraping Dutchie dispensary menus. Unlike direct curl/axios requests, this method uses a real browser session to make API calls, making requests appear natural and reducing detection risk. + +--- + +## Why Organic Scraping? 
+ +| Approach | Detection Risk | Speed | Complexity | +|----------|---------------|-------|------------| +| Direct curl | Higher | Fast | Low | +| curl-impersonate | Medium | Fast | Medium | +| **Browser-based (organic)** | **Lowest** | Slower | Higher | + +Direct curl requests can be fingerprinted via: +- TLS fingerprint (cipher suites, extensions) +- Header order and values +- Missing cookies/session data +- Request patterns + +Browser-based requests inherit: +- Real Chrome TLS fingerprint +- Session cookies from page visit +- Natural header order +- JavaScript execution environment + +--- + +## Implementation + +### Dependencies + +```bash +npm install puppeteer puppeteer-extra puppeteer-extra-plugin-stealth +``` + +### Core Script: `test-intercept.js` + +Located at: `backend/test-intercept.js` + +```javascript +const puppeteer = require('puppeteer-extra'); +const StealthPlugin = require('puppeteer-extra-plugin-stealth'); +const fs = require('fs'); + +puppeteer.use(StealthPlugin()); + +async function capturePayload(config) { + const { dispensaryId, platformId, cName, outputPath } = config; + + const browser = await puppeteer.launch({ + headless: 'new', + args: ['--no-sandbox', '--disable-setuid-sandbox'] + }); + + const page = await browser.newPage(); + + // STEP 1: Establish session by visiting the menu + const embedUrl = `https://dutchie.com/embedded-menu/${cName}?menuType=rec`; + await page.goto(embedUrl, { waitUntil: 'networkidle2', timeout: 60000 }); + + // STEP 2: Fetch ALL products using GraphQL from browser context + const result = await page.evaluate(async (platformId) => { + const allProducts = []; + let pageNum = 0; + const perPage = 100; + let totalCount = 0; + const sessionId = 'browser-session-' + Date.now(); + + while (pageNum < 30) { + const variables = { + includeEnterpriseSpecials: false, + productsFilter: { + dispensaryId: platformId, + pricingType: 'rec', + Status: 'Active', // CRITICAL: Must be 'Active', not null + types: [], + useCache: true, + isDefaultSort: true, + sortBy: 'popularSortIdx', + sortDirection: 1, + bypassOnlineThresholds: true, + isKioskMenu: false, + removeProductsBelowOptionThresholds: false, + }, + page: pageNum, + perPage: perPage, + }; + + const extensions = { + persistedQuery: { + version: 1, + sha256Hash: 'ee29c060826dc41c527e470e9ae502c9b2c169720faa0a9f5d25e1b9a530a4a0' + } + }; + + const qs = new URLSearchParams({ + operationName: 'FilteredProducts', + variables: JSON.stringify(variables), + extensions: JSON.stringify(extensions) + }); + + const response = await fetch(`https://dutchie.com/api-3/graphql?${qs}`, { + method: 'GET', + headers: { + 'Accept': 'application/json', + 'content-type': 'application/json', + 'x-dutchie-session': sessionId, + 'apollographql-client-name': 'Marketplace (production)', + }, + credentials: 'include' + }); + + const json = await response.json(); + const data = json?.data?.filteredProducts; + if (!data?.products) break; + + allProducts.push(...data.products); + if (pageNum === 0) totalCount = data.queryInfo?.totalCount || 0; + if (allProducts.length >= totalCount) break; + + pageNum++; + await new Promise(r => setTimeout(r, 200)); // Polite delay + } + + return { products: allProducts, totalCount }; + }, platformId); + + await browser.close(); + + // STEP 3: Save payload + const payload = { + dispensaryId, + platformId, + cName, + fetchedAt: new Date().toISOString(), + productCount: result.products.length, + products: result.products, + }; + + fs.writeFileSync(outputPath, JSON.stringify(payload, null, 2)); + 
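+  // Note: this file mirrors the payload shape used by payload-fetch.ts, so it can
+  // be saved to the same payload storage and processed by the product_refresh task.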
return payload; +} +``` + +--- + +## Critical Parameters + +### GraphQL Hash (FilteredProducts) + +``` +ee29c060826dc41c527e470e9ae502c9b2c169720faa0a9f5d25e1b9a530a4a0 +``` + +**WARNING:** Using the wrong hash returns HTTP 400. + +### Status Parameter + +| Value | Result | +|-------|--------| +| `'Active'` | Returns in-stock products (1019 in test) | +| `null` | Returns 0 products | +| `'All'` | Returns HTTP 400 | + +**ALWAYS use `Status: 'Active'`** + +### Required Headers + +```javascript +{ + 'Accept': 'application/json', + 'content-type': 'application/json', + 'x-dutchie-session': 'unique-session-id', + 'apollographql-client-name': 'Marketplace (production)', +} +``` + +### Endpoint + +``` +https://dutchie.com/api-3/graphql +``` + +--- + +## Performance Benchmarks + +Test store: AZ-Deeply-Rooted (1019 products) + +| Metric | Value | +|--------|-------| +| Total products | 1019 | +| Time | 18.5 seconds | +| Payload size | 11.8 MB | +| Pages fetched | 11 (100 per page) | +| Success rate | 100% | + +--- + +## Payload Format + +The output matches the existing `payload-fetch.ts` handler format: + +```json +{ + "dispensaryId": 123, + "platformId": "6405ef617056e8014d79101b", + "cName": "AZ-Deeply-Rooted", + "fetchedAt": "2025-12-12T05:05:19.837Z", + "productCount": 1019, + "products": [ + { + "id": "6927508db4851262f629a869", + "Name": "Product Name", + "brand": { "name": "Brand Name", ... }, + "type": "Flower", + "THC": "25%", + "Prices": [...], + "Options": [...], + ... + } + ] +} +``` + +--- + +## Integration Points + +### As a Task Handler + +The organic approach can be integrated as an alternative to curl-based fetching: + +```typescript +// In src/tasks/handlers/organic-payload-fetch.ts +export async function handleOrganicPayloadFetch(ctx: TaskContext): Promise { + // Use puppeteer-based capture + // Save to same payload storage + // Queue product_refresh task +} +``` + +### Worker Configuration + +Add to job_schedules: +```sql +INSERT INTO job_schedules (name, role, cron_expression) +VALUES ('organic_product_crawl', 'organic_payload_fetch', '0 */6 * * *'); +``` + +--- + +## Troubleshooting + +### HTTP 400 Bad Request +- Check hash is correct: `ee29c060...` +- Verify Status is `'Active'` (string, not null) + +### 0 Products Returned +- Status was likely `null` or `'All'` - use `'Active'` +- Check platformId is valid MongoDB ObjectId + +### Session Not Established +- Increase timeout on initial page.goto() +- Check cName is valid (matches embedded-menu URL) + +### Detection/Blocking +- StealthPlugin should handle most cases +- Add random delays between pages +- Use headless: 'new' (not true/false) + +--- + +## Files Reference + +| File | Purpose | +|------|---------| +| `backend/test-intercept.js` | Proof of concept script | +| `backend/src/platforms/dutchie/client.ts` | GraphQL hashes, curl implementation | +| `backend/src/tasks/handlers/payload-fetch.ts` | Current curl-based handler | +| `backend/src/utils/payload-storage.ts` | Payload save/load utilities | + +--- + +## See Also + +- `DUTCHIE_CRAWL_WORKFLOW.md` - Full crawl pipeline documentation +- `TASK_WORKFLOW_2024-12-10.md` - Task system architecture +- `CLAUDE.md` - Project rules and constraints diff --git a/backend/docs/_archive/README.md b/backend/docs/_archive/README.md new file mode 100644 index 00000000..bbe3eb16 --- /dev/null +++ b/backend/docs/_archive/README.md @@ -0,0 +1,25 @@ +# ARCHIVED DOCUMENTATION + +**WARNING: These docs may be outdated or inaccurate.** + +The code has evolved significantly. 
These docs are kept for historical reference only. + +## What to Use Instead + +**The single source of truth is:** +- `CLAUDE.md` (root) - Essential rules and quick reference +- `docs/CODEBASE_MAP.md` - Current file/directory reference + +## Why Archive? + +These docs were written during development iterations and may reference: +- Old file paths that no longer exist +- Deprecated approaches (hydration, scraper-v2) +- APIs that have changed +- Database schemas that evolved + +## If You Need Details + +1. First check CODEBASE_MAP.md for current file locations +2. Then read the actual source code +3. Only use archive docs as a last resort for historical context diff --git a/backend/docs/TASK_WORKFLOW_2024-12-10.md b/backend/docs/_archive/TASK_WORKFLOW_2024-12-10.md similarity index 100% rename from backend/docs/TASK_WORKFLOW_2024-12-10.md rename to backend/docs/_archive/TASK_WORKFLOW_2024-12-10.md diff --git a/backend/docs/WORKER_TASK_ARCHITECTURE.md b/backend/docs/_archive/WORKER_TASK_ARCHITECTURE.md similarity index 100% rename from backend/docs/WORKER_TASK_ARCHITECTURE.md rename to backend/docs/_archive/WORKER_TASK_ARCHITECTURE.md diff --git a/backend/src/_deprecated/DONT_USE.md b/backend/src/_deprecated/DONT_USE.md new file mode 100644 index 00000000..be669347 --- /dev/null +++ b/backend/src/_deprecated/DONT_USE.md @@ -0,0 +1,46 @@ +# DEPRECATED CODE - DO NOT USE + +**These directories contain OLD, ABANDONED code.** + +## What's Here + +| Directory | What It Was | Why Deprecated | +|-----------|-------------|----------------| +| `hydration/` | Old pipeline for processing crawl data | Replaced by `src/tasks/handlers/` | +| `scraper-v2/` | Old Puppeteer-based scraper engine | Replaced by curl-based `src/platforms/dutchie/client.ts` | +| `canonical-hydration/` | Intermediate step toward canonical schema | Merged into task handlers | + +## What to Use Instead + +| Old (DONT USE) | New (USE THIS) | +|----------------|----------------| +| `hydration/normalizers/dutchie.ts` | `src/tasks/handlers/product-refresh.ts` | +| `hydration/producer.ts` | `src/tasks/handlers/payload-fetch.ts` | +| `scraper-v2/engine.ts` | `src/platforms/dutchie/client.ts` | +| `scraper-v2/scheduler.ts` | `src/services/task-scheduler.ts` | + +## Why Keep This Code? + +- Historical reference only +- Some patterns may be useful for debugging +- Will be deleted once confirmed not needed + +## Claude Instructions + +**IF YOU ARE CLAUDE:** + +1. NEVER import from `src/_deprecated/` +2. NEVER reference these files as examples +3. NEVER try to "fix" or "update" code in here +4. 
If you see imports from these directories, suggest replacing them + +**Correct imports:** +```typescript +// GOOD +import { executeGraphQL } from '../platforms/dutchie/client'; +import { pool } from '../db/pool'; + +// BAD - DO NOT USE +import { something } from '../_deprecated/hydration/...'; +import { something } from '../_deprecated/scraper-v2/...'; +``` diff --git a/backend/src/canonical-hydration/RUNBOOK.md b/backend/src/_deprecated/canonical-hydration/RUNBOOK.md similarity index 100% rename from backend/src/canonical-hydration/RUNBOOK.md rename to backend/src/_deprecated/canonical-hydration/RUNBOOK.md diff --git a/backend/src/canonical-hydration/cli/backfill.ts b/backend/src/_deprecated/canonical-hydration/cli/backfill.ts similarity index 100% rename from backend/src/canonical-hydration/cli/backfill.ts rename to backend/src/_deprecated/canonical-hydration/cli/backfill.ts diff --git a/backend/src/canonical-hydration/cli/incremental.ts b/backend/src/_deprecated/canonical-hydration/cli/incremental.ts similarity index 100% rename from backend/src/canonical-hydration/cli/incremental.ts rename to backend/src/_deprecated/canonical-hydration/cli/incremental.ts diff --git a/backend/src/canonical-hydration/cli/products-only.ts b/backend/src/_deprecated/canonical-hydration/cli/products-only.ts similarity index 100% rename from backend/src/canonical-hydration/cli/products-only.ts rename to backend/src/_deprecated/canonical-hydration/cli/products-only.ts diff --git a/backend/src/canonical-hydration/crawl-run-recorder.ts b/backend/src/_deprecated/canonical-hydration/crawl-run-recorder.ts similarity index 100% rename from backend/src/canonical-hydration/crawl-run-recorder.ts rename to backend/src/_deprecated/canonical-hydration/crawl-run-recorder.ts diff --git a/backend/src/canonical-hydration/hydration-service.ts b/backend/src/_deprecated/canonical-hydration/hydration-service.ts similarity index 100% rename from backend/src/canonical-hydration/hydration-service.ts rename to backend/src/_deprecated/canonical-hydration/hydration-service.ts diff --git a/backend/src/canonical-hydration/index.ts b/backend/src/_deprecated/canonical-hydration/index.ts similarity index 100% rename from backend/src/canonical-hydration/index.ts rename to backend/src/_deprecated/canonical-hydration/index.ts diff --git a/backend/src/canonical-hydration/snapshot-writer.ts b/backend/src/_deprecated/canonical-hydration/snapshot-writer.ts similarity index 100% rename from backend/src/canonical-hydration/snapshot-writer.ts rename to backend/src/_deprecated/canonical-hydration/snapshot-writer.ts diff --git a/backend/src/canonical-hydration/store-product-normalizer.ts b/backend/src/_deprecated/canonical-hydration/store-product-normalizer.ts similarity index 100% rename from backend/src/canonical-hydration/store-product-normalizer.ts rename to backend/src/_deprecated/canonical-hydration/store-product-normalizer.ts diff --git a/backend/src/canonical-hydration/types.ts b/backend/src/_deprecated/canonical-hydration/types.ts similarity index 100% rename from backend/src/canonical-hydration/types.ts rename to backend/src/_deprecated/canonical-hydration/types.ts diff --git a/backend/src/hydration/__tests__/hydration.test.ts b/backend/src/_deprecated/hydration/__tests__/hydration.test.ts similarity index 100% rename from backend/src/hydration/__tests__/hydration.test.ts rename to backend/src/_deprecated/hydration/__tests__/hydration.test.ts diff --git a/backend/src/hydration/__tests__/normalizer.test.ts 
b/backend/src/_deprecated/hydration/__tests__/normalizer.test.ts similarity index 100% rename from backend/src/hydration/__tests__/normalizer.test.ts rename to backend/src/_deprecated/hydration/__tests__/normalizer.test.ts diff --git a/backend/src/hydration/backfill.ts b/backend/src/_deprecated/hydration/backfill.ts similarity index 100% rename from backend/src/hydration/backfill.ts rename to backend/src/_deprecated/hydration/backfill.ts diff --git a/backend/src/hydration/canonical-upsert.ts b/backend/src/_deprecated/hydration/canonical-upsert.ts similarity index 100% rename from backend/src/hydration/canonical-upsert.ts rename to backend/src/_deprecated/hydration/canonical-upsert.ts diff --git a/backend/src/hydration/incremental-sync.ts b/backend/src/_deprecated/hydration/incremental-sync.ts similarity index 100% rename from backend/src/hydration/incremental-sync.ts rename to backend/src/_deprecated/hydration/incremental-sync.ts diff --git a/backend/src/hydration/index.ts b/backend/src/_deprecated/hydration/index.ts similarity index 100% rename from backend/src/hydration/index.ts rename to backend/src/_deprecated/hydration/index.ts diff --git a/backend/src/hydration/legacy-backfill.ts b/backend/src/_deprecated/hydration/legacy-backfill.ts similarity index 100% rename from backend/src/hydration/legacy-backfill.ts rename to backend/src/_deprecated/hydration/legacy-backfill.ts diff --git a/backend/src/hydration/locking.ts b/backend/src/_deprecated/hydration/locking.ts similarity index 100% rename from backend/src/hydration/locking.ts rename to backend/src/_deprecated/hydration/locking.ts diff --git a/backend/src/hydration/normalizers/base.ts b/backend/src/_deprecated/hydration/normalizers/base.ts similarity index 100% rename from backend/src/hydration/normalizers/base.ts rename to backend/src/_deprecated/hydration/normalizers/base.ts diff --git a/backend/src/hydration/normalizers/dutchie.ts b/backend/src/_deprecated/hydration/normalizers/dutchie.ts similarity index 100% rename from backend/src/hydration/normalizers/dutchie.ts rename to backend/src/_deprecated/hydration/normalizers/dutchie.ts diff --git a/backend/src/hydration/normalizers/index.ts b/backend/src/_deprecated/hydration/normalizers/index.ts similarity index 100% rename from backend/src/hydration/normalizers/index.ts rename to backend/src/_deprecated/hydration/normalizers/index.ts diff --git a/backend/src/hydration/payload-store.ts b/backend/src/_deprecated/hydration/payload-store.ts similarity index 100% rename from backend/src/hydration/payload-store.ts rename to backend/src/_deprecated/hydration/payload-store.ts diff --git a/backend/src/hydration/producer.ts b/backend/src/_deprecated/hydration/producer.ts similarity index 100% rename from backend/src/hydration/producer.ts rename to backend/src/_deprecated/hydration/producer.ts diff --git a/backend/src/hydration/types.ts b/backend/src/_deprecated/hydration/types.ts similarity index 100% rename from backend/src/hydration/types.ts rename to backend/src/_deprecated/hydration/types.ts diff --git a/backend/src/hydration/worker.ts b/backend/src/_deprecated/hydration/worker.ts similarity index 100% rename from backend/src/hydration/worker.ts rename to backend/src/_deprecated/hydration/worker.ts diff --git a/backend/src/scraper-v2/README.md b/backend/src/_deprecated/scraper-v2/README.md similarity index 100% rename from backend/src/scraper-v2/README.md rename to backend/src/_deprecated/scraper-v2/README.md diff --git a/backend/src/scraper-v2/canonical-pipeline.ts 
b/backend/src/_deprecated/scraper-v2/canonical-pipeline.ts similarity index 100% rename from backend/src/scraper-v2/canonical-pipeline.ts rename to backend/src/_deprecated/scraper-v2/canonical-pipeline.ts diff --git a/backend/src/scraper-v2/downloader.ts b/backend/src/_deprecated/scraper-v2/downloader.ts similarity index 100% rename from backend/src/scraper-v2/downloader.ts rename to backend/src/_deprecated/scraper-v2/downloader.ts diff --git a/backend/src/scraper-v2/engine.ts b/backend/src/_deprecated/scraper-v2/engine.ts similarity index 100% rename from backend/src/scraper-v2/engine.ts rename to backend/src/_deprecated/scraper-v2/engine.ts diff --git a/backend/src/scraper-v2/index.ts b/backend/src/_deprecated/scraper-v2/index.ts similarity index 100% rename from backend/src/scraper-v2/index.ts rename to backend/src/_deprecated/scraper-v2/index.ts diff --git a/backend/src/scraper-v2/middlewares.ts b/backend/src/_deprecated/scraper-v2/middlewares.ts similarity index 100% rename from backend/src/scraper-v2/middlewares.ts rename to backend/src/_deprecated/scraper-v2/middlewares.ts diff --git a/backend/src/scraper-v2/navigation.ts b/backend/src/_deprecated/scraper-v2/navigation.ts similarity index 100% rename from backend/src/scraper-v2/navigation.ts rename to backend/src/_deprecated/scraper-v2/navigation.ts diff --git a/backend/src/scraper-v2/pipelines.ts b/backend/src/_deprecated/scraper-v2/pipelines.ts similarity index 100% rename from backend/src/scraper-v2/pipelines.ts rename to backend/src/_deprecated/scraper-v2/pipelines.ts diff --git a/backend/src/scraper-v2/scheduler.ts b/backend/src/_deprecated/scraper-v2/scheduler.ts similarity index 100% rename from backend/src/scraper-v2/scheduler.ts rename to backend/src/_deprecated/scraper-v2/scheduler.ts diff --git a/backend/src/scraper-v2/types.ts b/backend/src/_deprecated/scraper-v2/types.ts similarity index 100% rename from backend/src/scraper-v2/types.ts rename to backend/src/_deprecated/scraper-v2/types.ts diff --git a/backend/src/services/DiscoveryGeoService.ts b/backend/src/_deprecated/services/DiscoveryGeoService.ts similarity index 100% rename from backend/src/services/DiscoveryGeoService.ts rename to backend/src/_deprecated/services/DiscoveryGeoService.ts diff --git a/backend/src/services/GeoValidationService.ts b/backend/src/_deprecated/services/GeoValidationService.ts similarity index 100% rename from backend/src/services/GeoValidationService.ts rename to backend/src/_deprecated/services/GeoValidationService.ts diff --git a/backend/src/services/availability.ts b/backend/src/_deprecated/services/availability.ts similarity index 100% rename from backend/src/services/availability.ts rename to backend/src/_deprecated/services/availability.ts diff --git a/backend/src/services/crawler-logger.ts b/backend/src/_deprecated/services/crawler-logger.ts similarity index 100% rename from backend/src/services/crawler-logger.ts rename to backend/src/_deprecated/services/crawler-logger.ts diff --git a/backend/src/services/crawler-profiles.ts b/backend/src/_deprecated/services/crawler-profiles.ts similarity index 100% rename from backend/src/services/crawler-profiles.ts rename to backend/src/_deprecated/services/crawler-profiles.ts diff --git a/backend/src/services/geolocation.ts b/backend/src/_deprecated/services/geolocation.ts similarity index 100% rename from backend/src/services/geolocation.ts rename to backend/src/_deprecated/services/geolocation.ts diff --git a/backend/src/services/intelligence-detector.ts 
b/backend/src/_deprecated/services/intelligence-detector.ts similarity index 100% rename from backend/src/services/intelligence-detector.ts rename to backend/src/_deprecated/services/intelligence-detector.ts diff --git a/backend/src/services/menu-provider-detector.ts b/backend/src/_deprecated/services/menu-provider-detector.ts similarity index 100% rename from backend/src/services/menu-provider-detector.ts rename to backend/src/_deprecated/services/menu-provider-detector.ts diff --git a/backend/src/services/scraper-debug.ts b/backend/src/_deprecated/services/scraper-debug.ts similarity index 100% rename from backend/src/services/scraper-debug.ts rename to backend/src/_deprecated/services/scraper-debug.ts diff --git a/backend/src/services/scraper.ts b/backend/src/_deprecated/services/scraper.ts similarity index 100% rename from backend/src/services/scraper.ts rename to backend/src/_deprecated/services/scraper.ts diff --git a/backend/src/utils/HomepageValidator.ts b/backend/src/_deprecated/utils/HomepageValidator.ts similarity index 100% rename from backend/src/utils/HomepageValidator.ts rename to backend/src/_deprecated/utils/HomepageValidator.ts diff --git a/backend/src/utils/age-gate-playwright.ts b/backend/src/_deprecated/utils/age-gate-playwright.ts similarity index 100% rename from backend/src/utils/age-gate-playwright.ts rename to backend/src/_deprecated/utils/age-gate-playwright.ts diff --git a/backend/src/utils/stealthBrowser.ts b/backend/src/_deprecated/utils/stealthBrowser.ts similarity index 100% rename from backend/src/utils/stealthBrowser.ts rename to backend/src/_deprecated/utils/stealthBrowser.ts diff --git a/backend/test-intercept.js b/backend/test-intercept.js new file mode 100644 index 00000000..a1654551 --- /dev/null +++ b/backend/test-intercept.js @@ -0,0 +1,180 @@ +/** + * Stealth Browser Payload Capture - Direct GraphQL Injection + * + * Uses the browser session to make GraphQL requests that look organic. + * Adds proper headers matching what Dutchie's frontend sends. 
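+ *
+ * Flow: (1) visit the store's embedded-menu page to establish session cookies,
+ * then (2) page through the FilteredProducts persisted query from inside the
+ * page context, 100 products per request, until queryInfo.totalCount is reached.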
+ */ + +const puppeteer = require('puppeteer-extra'); +const StealthPlugin = require('puppeteer-extra-plugin-stealth'); +const fs = require('fs'); + +puppeteer.use(StealthPlugin()); + +async function capturePayload(config) { + const { + dispensaryId = null, + platformId, + cName, + outputPath = `/tmp/payload_${cName}_${Date.now()}.json`, + } = config; + + const browser = await puppeteer.launch({ + headless: 'new', + args: ['--no-sandbox', '--disable-setuid-sandbox'] + }); + + const page = await browser.newPage(); + + // Establish session by visiting the embedded menu + const embedUrl = `https://dutchie.com/embedded-menu/${cName}?menuType=rec`; + console.log(`[Capture] Establishing session at ${embedUrl}...`); + + await page.goto(embedUrl, { + waitUntil: 'networkidle2', + timeout: 60000 + }); + + console.log('[Capture] Session established, fetching ALL products...'); + + // Fetch all products using GET requests with proper headers + const result = await page.evaluate(async (platformId, cName) => { + const allProducts = []; + const logs = []; + let pageNum = 0; + const perPage = 100; + let totalCount = 0; + const sessionId = 'browser-session-' + Date.now(); + + try { + while (pageNum < 30) { // Max 30 pages = 3000 products + const variables = { + includeEnterpriseSpecials: false, + productsFilter: { + dispensaryId: platformId, + pricingType: 'rec', + Status: 'Active', // 'Active' for in-stock products per CLAUDE.md + types: [], + useCache: true, + isDefaultSort: true, + sortBy: 'popularSortIdx', + sortDirection: 1, + bypassOnlineThresholds: true, + isKioskMenu: false, + removeProductsBelowOptionThresholds: false, + }, + page: pageNum, + perPage: perPage, + }; + + const extensions = { + persistedQuery: { + version: 1, + sha256Hash: 'ee29c060826dc41c527e470e9ae502c9b2c169720faa0a9f5d25e1b9a530a4a0' + } + }; + + // Build GET URL like the browser does + const qs = new URLSearchParams({ + operationName: 'FilteredProducts', + variables: JSON.stringify(variables), + extensions: JSON.stringify(extensions) + }); + const url = `https://dutchie.com/api-3/graphql?${qs.toString()}`; + + const response = await fetch(url, { + method: 'GET', + headers: { + 'Accept': 'application/json', + 'content-type': 'application/json', + 'x-dutchie-session': sessionId, + 'apollographql-client-name': 'Marketplace (production)', + }, + credentials: 'include' + }); + + logs.push(`Page ${pageNum}: HTTP ${response.status}`); + + if (!response.ok) { + const text = await response.text(); + logs.push(`HTTP error: ${response.status} - ${text.slice(0, 200)}`); + break; + } + + const json = await response.json(); + + if (json.errors) { + logs.push(`GraphQL error: ${JSON.stringify(json.errors).slice(0, 200)}`); + break; + } + + const data = json?.data?.filteredProducts; + if (!data || !data.products) { + logs.push('No products in response'); + break; + } + + const products = data.products; + allProducts.push(...products); + + if (pageNum === 0) { + totalCount = data.queryInfo?.totalCount || 0; + logs.push(`Total reported: ${totalCount}`); + } + + logs.push(`Got ${products.length} products (total: ${allProducts.length}/${totalCount})`); + + if (allProducts.length >= totalCount || products.length < perPage) { + break; + } + + pageNum++; + + // Small delay between pages to be polite + await new Promise(r => setTimeout(r, 200)); + } + } catch (err) { + logs.push(`Error: ${err.message}`); + } + + return { products: allProducts, totalCount, logs }; + }, platformId, cName); + + await browser.close(); + + // Print logs from browser context 
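+  // console.log inside page.evaluate() runs in the browser, not in Node, so the
+  // page context collects its messages in `logs` and we replay them here.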
+ result.logs.forEach(log => console.log(`[Browser] ${log}`)); + + console.log(`[Capture] Got ${result.products.length} products (API reported ${result.totalCount})`); + + const payload = { + dispensaryId: dispensaryId, + platformId: platformId, + cName, + fetchedAt: new Date().toISOString(), + productCount: result.products.length, + products: result.products, + }; + + fs.writeFileSync(outputPath, JSON.stringify(payload, null, 2)); + + console.log(`\n=== Capture Complete ===`); + console.log(`Total products: ${result.products.length}`); + console.log(`Saved to: ${outputPath}`); + console.log(`File size: ${(fs.statSync(outputPath).size / 1024).toFixed(1)} KB`); + + return payload; +} + +// Run +(async () => { + const payload = await capturePayload({ + cName: 'AZ-Deeply-Rooted', + platformId: '6405ef617056e8014d79101b', + }); + + if (payload.products.length > 0) { + const sample = payload.products[0]; + console.log(`\nSample: ${sample.Name || sample.name} - ${sample.brand?.name || sample.brandName}`); + } +})().catch(console.error); diff --git a/k8s/woodpecker-agent-compose.yml b/k8s/woodpecker-agent-compose.yml new file mode 100644 index 00000000..b6fe558a --- /dev/null +++ b/k8s/woodpecker-agent-compose.yml @@ -0,0 +1,18 @@ +# Woodpecker Agent Docker Compose +# Path: /opt/woodpecker/docker-compose.yml +# Deploy: cd /opt/woodpecker && docker compose up -d +version: '3.8' + +services: + woodpecker-agent: + image: woodpeckerci/woodpecker-agent:latest + container_name: woodpecker-agent + restart: always + volumes: + - /var/run/docker.sock:/var/run/docker.sock + environment: + - WOODPECKER_SERVER=localhost:9000 + - WOODPECKER_AGENT_SECRET=${WOODPECKER_AGENT_SECRET} + - WOODPECKER_MAX_WORKFLOWS=5 + - WOODPECKER_HEALTHCHECK=true + - WOODPECKER_LOG_LEVEL=info
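+      # ${WOODPECKER_AGENT_SECRET} is substituted by docker compose from the shell
+      # environment or from an .env file in the compose directory (e.g. /opt/woodpecker/.env).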