Claude Guidelines for CannaiQ
CURRENT ENVIRONMENT: PRODUCTION
We are working in PRODUCTION only. All database queries and API calls should target the remote production environment, not localhost. Use kubectl port-forward or remote DB connections as needed.
PERMANENT RULES (NEVER VIOLATE)
1. NO DELETE
Never delete data, files, images, logs, or database rows. CannaiQ is a historical analytics system.
2. NO KILL
Never run pkill, kill, killall, or similar. Say "Please run ./stop-local.sh" instead.
3. NO MANUAL STARTUP
Never start servers manually. Say "Please run ./setup-local.sh" instead.
4. DEPLOYMENT AUTH REQUIRED
Never deploy unless user explicitly says: "CLAUDE — DEPLOYMENT IS NOW AUTHORIZED."
5. DB POOL ONLY
Never import src/db/migrate.ts at runtime. Use src/db/pool.ts for DB access.
6. CI/CD DEPLOYMENT — COMMIT AND WAIT
Never manually deploy or check deployment status. The project uses Woodpecker CI.
Workflow:
- Make code changes
- git add + git commit
- git push origin master
- STOP - CI handles the rest
- Wait for user to confirm deployment worked
DO NOT:
- Run kubectl rollout status to check deployment
- Run kubectl logs to verify new code is running
- Manually restart pods
- Check CI pipeline status
Just commit, push, and wait for user feedback.
7. K8S POD LIMITS — CRITICAL
EXACTLY 8 PODS for scraper-worker deployment. NEVER CHANGE THIS.
Replica Count is LOCKED:
- Always 8 replicas — no more, no less
- NEVER scale down (even temporarily)
- NEVER scale up beyond 8
- If pods are not 8, restore to 8 immediately
Pods vs Workers:
- Pod = Kubernetes container instance (ALWAYS 8)
- Worker = Concurrent task runner INSIDE a pod (controlled by the MAX_CONCURRENT_TASKS env var)
- Formula: 8 pods × MAX_CONCURRENT_TASKS = total concurrent workers (8 × 3 = 24 at the recommended setting)
Browser Task Memory Limits:
- Each Puppeteer/Chrome browser uses ~400 MB RAM
- Pod memory limit is 2 GB
- MAX_CONCURRENT_TASKS=3 is the safe maximum for browser tasks
- More than 3 concurrent browsers per pod = OOM crash
| Browsers | RAM Used | Status |
|---|---|---|
| 3 | ~1.3 GB | Safe (recommended) |
| 4 | ~1.7 GB | Risky |
| 5+ | >2 GB | OOM crash |
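The arithmetic behind the table can be sketched as follows. This is an illustrative estimate only: the ~100 MB of non-browser overhead per pod is an assumption inferred from the table values (3 × 400 MB ≈ 1.2 GB versus the listed ~1.3 GB), and the function names are hypothetical.

```typescript
// Figures taken from the guidelines above.
const BROWSER_MB = 400;    // approximate RAM per Puppeteer/Chrome browser
const POD_LIMIT_MB = 2048; // 2 GB pod memory limit
const POD_COUNT = 8;       // locked replica count

// ASSUMPTION: roughly 100 MB of non-browser overhead per pod, inferred from the table.
const BASE_OVERHEAD_MB = 100;

type MemoryStatus = "safe" | "risky" | "oom";

// Estimated memory for a pod running the given number of concurrent browsers.
function estimatePodMemoryMb(browsers: number): number {
  return BASE_OVERHEAD_MB + browsers * BROWSER_MB;
}

// Classify a MAX_CONCURRENT_TASKS value the way the table does.
function classifyBrowserCount(browsers: number): MemoryStatus {
  if (estimatePodMemoryMb(browsers) > POD_LIMIT_MB) return "oom";
  // Anything above the documented safe maximum of 3 is treated as risky.
  return browsers > 3 ? "risky" : "safe";
}

// Total concurrent workers across the deployment: pods × tasks per pod.
function totalWorkers(maxConcurrentTasks: number): number {
  return POD_COUNT * maxConcurrentTasks;
}
```

With MAX_CONCURRENT_TASKS=3, this gives an estimate of 1300 MB per pod and 24 workers in total, matching the figures above.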
To increase throughput: Add more pods (up to 8), NOT more concurrent tasks per pod.
# CORRECT - scale pods (up to 8)
kubectl scale deployment/scraper-worker -n dispensary-scraper --replicas=8
# WRONG - will cause OOM crashes
kubectl set env deployment/scraper-worker -n dispensary-scraper MAX_CONCURRENT_TASKS=10
If K8s API returns ServiceUnavailable: STOP IMMEDIATELY. Do not retry. The cluster is overloaded.
8. K8S REQUIRES EXPLICIT PERMISSION
NEVER run kubectl commands without explicit user permission.
Before running ANY kubectl command (scale, rollout, set env, delete, apply, etc.):
- Tell the user what you want to do
- Wait for explicit approval
- Only then execute the command
This applies to ALL kubectl operations - even read-only ones like kubectl get pods.
Quick Reference
Database Tables
| USE THIS | NOT THIS |
|---|---|
| dispensaries | stores (empty) |
| store_products | products (empty) |
| store_product_snapshots | dutchie_product_snapshots |
Key Files
| Purpose | File |
|---|---|
| Dutchie client | src/platforms/dutchie/client.ts |
| DB pool | src/db/pool.ts |
| Payload fetch | src/tasks/handlers/payload-fetch.ts |
| Product refresh | src/tasks/handlers/product-refresh.ts |
Dutchie GraphQL
- Endpoint: https://dutchie.com/api-3/graphql
- Hash (FilteredProducts): ee29c060826dc41c527e470e9ae502c9b2c169720faa0a9f5d25e1b9a530a4a0
- CRITICAL: Use Status: 'Active' (not null)
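As a rough illustration of how the endpoint, hash, and Status filter might fit together, the sketch below builds a request URL using the Apollo-style persisted-query convention. That convention, the variable shape (productsFilter, dispensaryId), and the function name are all assumptions for illustration; only the endpoint, the hash, the operation name, and Status: 'Active' come from this document. Check src/platforms/dutchie/client.ts for the actual request format.

```typescript
const DUTCHIE_GRAPHQL_ENDPOINT = "https://dutchie.com/api-3/graphql";
const FILTERED_PRODUCTS_HASH =
  "ee29c060826dc41c527e470e9ae502c9b2c169720faa0a9f5d25e1b9a530a4a0";

// HYPOTHETICAL variable shape; only Status: 'Active' is taken from the guidelines.
interface FilteredProductsVariables {
  productsFilter: { dispensaryId: string; Status: "Active" };
}

function buildFilteredProductsUrl(dispensaryId: string): string {
  const variables: FilteredProductsVariables = {
    // Status must be 'Active', never null.
    productsFilter: { dispensaryId, Status: "Active" },
  };
  // ASSUMPTION: the endpoint accepts Apollo-style persisted-query extensions.
  const extensions = {
    persistedQuery: { version: 1, sha256Hash: FILTERED_PRODUCTS_HASH },
  };
  const params = new URLSearchParams({
    operationName: "FilteredProducts",
    variables: JSON.stringify(variables),
    extensions: JSON.stringify(extensions),
  });
  return `${DUTCHIE_GRAPHQL_ENDPOINT}?${params.toString()}`;
}
```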
Frontends
| Folder | Domain | Build |
|---|---|---|
| cannaiq/ | cannaiq.co | Vite |
| findadispo/ | findadispo.com | CRA |
| findagram/ | findagram.co | CRA |
| frontend/ | DEPRECATED | - |
Deprecated Code
DO NOT USE anything in src/_deprecated/:
- hydration/ - Use src/tasks/handlers/
- scraper-v2/ - Use src/platforms/dutchie/
- canonical-hydration/ - Merged into tasks
DO NOT USE src/dutchie-az/db/connection.ts - Use src/db/pool.ts
Local Development
./setup-local.sh # Start all services
./stop-local.sh # Stop all services
| Service | URL |
|---|---|
| API | http://localhost:3010 |
| Admin | http://localhost:8080/admin |
| PostgreSQL | localhost:54320 |
WordPress Plugin (ACTIVE)
Plugin Files
| File | Purpose |
|---|---|
| wordpress-plugin/cannaiq-menus.php | Main plugin (CannaIQ brand) |
| wordpress-plugin/crawlsy-menus.php | Legacy plugin (Crawlsy brand) |
| wordpress-plugin/VERSION | Version tracking |
API Routes (Backend)
- GET /api/v1/wordpress/dispensaries - List dispensaries
- GET /api/v1/wordpress/dispensary/:id/menu - Get menu data
- Route file: backend/src/routes/wordpress.ts
Versioning
Bump wordpress-plugin/VERSION on changes:
- Minor (x.x.N): bug fixes
- Middle (x.N.0): new features
- Major (N.0.0): breaking changes (user must request)
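The bump rules can be expressed as a small helper. This is a minimal sketch, assuming the VERSION file holds a plain three-part x.y.z string; bumpVersion and BumpKind are hypothetical names, not part of the plugin.

```typescript
type BumpKind = "minor" | "middle" | "major";

// Apply the versioning rules above to a three-part version string.
function bumpVersion(version: string, kind: BumpKind): string {
  const parts = version.trim().split(".").map(Number);
  if (parts.length !== 3 || parts.some((n) => !Number.isInteger(n) || n < 0)) {
    throw new Error(`Expected a version like 1.2.3, got "${version}"`);
  }
  const [major, middle, minor] = parts;
  switch (kind) {
    case "minor": // bug fixes: x.x.N
      return `${major}.${middle}.${minor + 1}`;
    case "middle": // new features: x.N.0
      return `${major}.${middle + 1}.0`;
    case "major": // breaking changes: N.0.0 (only when the user requests it)
      return `${major + 1}.0.0`;
  }
}
```

For example, a bug fix on 1.2.3 yields 1.2.4, while a new feature yields 1.3.0.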
Puppeteer Scraping (Browser-Based)
Age Gate Bypass
Most dispensary sites require age verification. The browser scraper handles this automatically:
Utility File: src/utils/age-gate.ts
Key Functions:
- setAgeGateCookies(page, url, state) - Set cookies BEFORE navigation to prevent gate
- hasAgeGate(page) - Detect if page shows age verification
- bypassAgeGate(page, state) - Click through age gate if displayed
- detectStateFromUrl(url) - Extract state from URL (e.g., -az- → Arizona)
Cookie Names Set:
- age_gate_passed: 'true'
- selected_state: '<state>'
- age_verified: 'true'
Bypass Methods (tried in order):
- Custom dropdown (shadcn/radix style) - Curaleaf pattern
- Standard <select> dropdown
- State button/card click
- Direct "Yes"/"Enter" button
Usage Pattern:
import { setAgeGateCookies, bypassAgeGate } from '../utils/age-gate';
// Set cookies BEFORE navigation
await setAgeGateCookies(page, menuUrl, 'Arizona');
await page.goto(menuUrl);
// If gate still appears, bypass it
await bypassAgeGate(page, 'Arizona');
Note: Deeply-Rooted (AZ) does NOT use age gate - good for preflight testing.
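To make the cookie and state-detection behavior concrete, here is a minimal sketch of what such helpers could look like. It is not the code in src/utils/age-gate.ts: the function names, the two-entry state table, and the hyphen-token matching rule are assumptions for illustration. The cookie objects use the name/value/url shape that Puppeteer's page.setCookie() accepts.

```typescript
// Cookie shape compatible with Puppeteer's page.setCookie().
interface CookieSketch {
  name: string;
  value: string;
  url: string;
}

// HYPOTHETICAL lookup table; the real utility may cover more states.
const STATE_BY_CODE = new Map<string, string>([
  ["az", "Arizona"],
  ["ca", "California"],
]);

// Find the first hyphen-delimited token (e.g. "-az-") that is a known state code.
function detectStateFromUrlSketch(url: string): string | null {
  const tokens = url.toLowerCase().split("-").slice(1, -1);
  for (const token of tokens) {
    const state = STATE_BY_CODE.get(token);
    if (state !== undefined) return state;
  }
  return null;
}

// Build the three cookies listed above for the given menu URL and state.
function buildAgeGateCookies(url: string, state: string): CookieSketch[] {
  return [
    { name: "age_gate_passed", value: "true", url },
    { name: "selected_state", value: state, url },
    { name: "age_verified", value: "true", url },
  ];
}
```

With a Puppeteer page in hand, the result could be applied before navigation via page.setCookie(...buildAgeGateCookies(menuUrl, 'Arizona')).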
Dual-Transport Preflight
Workers run BOTH preflight checks on startup:
| Transport | Test Method | Use Case |
|---|---|---|
| curl | axios + proxy → httpbin.org | Fast API requests |
| http | Puppeteer + proxy + StealthPlugin | Anti-detect, browser fingerprint |
HTTP Preflight Steps:
- Get proxy from pool (CrawlRotator)
- Visit fingerprint.com demo (or amiunique.org fallback) to verify IP and anti-detect
- Visit Dutchie embedded menu to establish session
- Make GraphQL request from browser context
Files:
- src/services/curl-preflight.ts
- src/services/puppeteer-preflight.ts
- migrations/084_dual_transport_preflight.sql
Task Method Column: Tasks have a method column ('curl' | 'http' | null):
- null = any worker can claim
- 'curl' = only workers with passed curl preflight
- 'http' = only workers with passed http preflight
Currently ALL crawl tasks require method = 'http'.
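The claiming rule for the method column can be sketched as a single predicate. The type and function names here are hypothetical; the real check lives in the worker's claim logic and may be implemented in SQL.

```typescript
type Transport = "curl" | "http";

// Which preflight checks this worker has passed.
interface PreflightResults {
  curl: boolean;
  http: boolean;
}

// A task with method = null is open to any worker; otherwise the worker
// must have passed the preflight for the transport the task requires.
function canClaim(taskMethod: Transport | null, passed: PreflightResults): boolean {
  if (taskMethod === null) return true;
  return passed[taskMethod];
}
```

A worker that passed only the curl preflight would therefore be unable to claim any current crawl task, since those all use method = 'http'.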
Anti-Detect Fingerprint Distribution
Browser fingerprints are randomized using realistic market share distributions:
Files:
- src/services/crawl-rotator.ts - Device/browser selection
- src/services/http-fingerprint.ts - HTTP header fingerprinting
Device Weights (matches real traffic patterns):
| Device | Weight | Percentage |
|---|---|---|
| Mobile | 62 | 62% |
| Desktop | 36 | 36% |
| Tablet | 2 | 2% |
Allowed Browsers (only realistic ones):
- Chrome (67% market share)
- Safari (20% market share)
- Edge (6% market share)
- Firefox (3% market share)
All other browsers are filtered out. Uses intoli/user-agents library for realistic UA generation.
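Weighted selection over these distributions can be sketched as below. This is a generic weighted-random pick, not the code in src/services/crawl-rotator.ts; pickWeighted and the constant names are hypothetical. The random source is injectable so the behavior can be checked deterministically.

```typescript
interface Weighted<T> {
  value: T;
  weight: number;
}

// Device weights and browser market shares from the lists above.
const DEVICE_WEIGHTS: Weighted<string>[] = [
  { value: "mobile", weight: 62 },
  { value: "desktop", weight: 36 },
  { value: "tablet", weight: 2 },
];
const BROWSER_WEIGHTS: Weighted<string>[] = [
  { value: "chrome", weight: 67 },
  { value: "safari", weight: 20 },
  { value: "edge", weight: 6 },
  { value: "firefox", weight: 3 },
];

// Pick one value with probability proportional to its weight.
// Weights do not need to sum to 100 (the browser shares sum to 96).
function pickWeighted<T>(items: Weighted<T>[], random: () => number = Math.random): T {
  const total = items.reduce((sum, item) => sum + item.weight, 0);
  if (items.length === 0 || total <= 0) {
    throw new Error("pickWeighted needs at least one item with positive weight");
  }
  let threshold = random() * total;
  for (const item of items) {
    threshold -= item.weight;
    if (threshold < 0) return item.value;
  }
  // Guards against floating-point drift when random() is very close to 1.
  return items[items.length - 1].value;
}
```

Calling pickWeighted(DEVICE_WEIGHTS) returns "mobile" about 62% of the time over many draws.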
HTTP Header Fingerprinting:
- DNT (Do Not Track): 30% probability of sending
- Accept headers: Browser-specific variations
- Header ordering: Matches real browser behavior (Chrome, Firefox, Safari, Edge each have unique order)
curl-impersonate Binaries (for curl transport):
| Browser | Binary |
|---|---|
| Chrome | curl_chrome131 |
| Edge | curl_chrome131 |
| Firefox | curl_ff133 |
| Safari | curl_safari17 |
These binaries mimic real browser TLS fingerprints to avoid detection.
Evomi Residential Proxy API
Workers use Evomi's residential proxy API for geo-targeted proxies on-demand.
Priority Order:
- Evomi API (if EVOMI_USER/EVOMI_PASS configured)
- DB proxies (fallback if Evomi not configured)
Environment Variables:
| Variable | Description | Default |
|---|---|---|
| EVOMI_USER | API username | - |
| EVOMI_PASS | API key | - |
| EVOMI_HOST | Proxy host | rpc.evomi.com |
| EVOMI_PORT | Proxy port | 1000 |
K8s Secret: Credentials stored in scraper-secrets:
kubectl get secret scraper-secrets -n dispensary-scraper -o jsonpath='{.data.EVOMI_PASS}' | base64 -d
Proxy URL Format: http://{user}_{session}_{geo}:{pass}@{host}:{port}
- session: Worker ID for sticky sessions
- geo: State code (e.g., arizona, california)
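The URL format and environment defaults can be sketched as follows. This is not the implementation of getEvomiConfig() or buildEvomiProxyUrl() in crawl-rotator.ts; the names below are hypothetical, and URL-encoding the credentials is an added precaution rather than documented behavior.

```typescript
interface EvomiConfigSketch {
  user: string;
  pass: string;
  host: string;
  port: number;
}

// Read config from an env-like object. Returns null when credentials are
// missing, in which case the worker would fall back to DB proxies.
function evomiConfigFromEnv(env: Record<string, string | undefined>): EvomiConfigSketch | null {
  const user = env.EVOMI_USER;
  const pass = env.EVOMI_PASS;
  if (!user || !pass) return null;
  return {
    user,
    pass,
    host: env.EVOMI_HOST || "rpc.evomi.com", // documented default
    port: Number(env.EVOMI_PORT || "1000"),  // documented default
  };
}

// Build http://{user}_{session}_{geo}:{pass}@{host}:{port}
function buildProxyUrlSketch(config: EvomiConfigSketch, session: string, geo: string): string {
  const username = `${config.user}_${session}_${geo}`;
  // ASSUMPTION: encode credentials so special characters cannot break the URL.
  return `http://${encodeURIComponent(username)}:${encodeURIComponent(config.pass)}@${config.host}:${config.port}`;
}
```

In a Node worker, the config would typically be read with evomiConfigFromEnv(process.env).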
Files:
- src/services/crawl-rotator.ts - getEvomiConfig(), buildEvomiProxyUrl()
- src/tasks/task-worker.ts - Proxy initialization order
Bulk Task Workflow (Updated 2025-12-13)
Overview
Tasks are created with scheduled_for = NOW() by default. Worker-level controls handle pacing - no task-level staggering needed.
How It Works
1. Task created with scheduled_for = NOW()
2. Worker claims task only when scheduled_for <= NOW()
3. Worker runs preflight on EVERY task claim (proxy health check)
4. If preflight passes, worker executes task
5. If preflight fails, task released back to pending for another worker
6. Worker finishes task, polls for next available task
7. Repeat - preflight runs on each new task claim
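The claim cycle in the steps above can be sketched as a small state transition. This is a simplified, synchronous illustration with hypothetical names; the real worker in src/tasks/task-worker.ts is asynchronous and claims tasks from the database.

```typescript
type TaskStatus = "pending" | "claimed" | "completed";

interface TaskSketch {
  id: number;
  scheduledFor: Date;
  status: TaskStatus;
}

// Step 2: a task is claimable only when pending and scheduled_for <= NOW().
function isClaimable(task: TaskSketch, now: Date): boolean {
  return task.status === "pending" && task.scheduledFor.getTime() <= now.getTime();
}

// Steps 2-5 for a single task: claim, run preflight, then execute or release.
function processOnce(
  task: TaskSketch,
  now: Date,
  preflight: () => boolean,
  execute: (task: TaskSketch) => void,
): TaskStatus {
  if (!isClaimable(task, now)) return task.status;
  task.status = "claimed";
  if (!preflight()) {
    // Preflight failed: release the task back to pending for another worker.
    task.status = "pending";
    return task.status;
  }
  execute(task);
  task.status = "completed";
  return task.status;
}
```

A worker loop would call processOnce for each available task, so the preflight runs again on every claim.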
Worker-Level Throttling
These controls pace task execution - no staggering at task creation time:
| Control | Purpose |
|---|---|
| MAX_CONCURRENT_TASKS | Limits concurrent tasks per pod (default: 3) |
| Working hours | Restricts when tasks run (configurable per schedule) |
| Preflight checks | Ensures proxy health before each task |
| Per-store locking | Only one active task per dispensary |
Key Points
- Preflight is per-task, not per-startup: Each task claim triggers a new preflight check
- Worker controls pacing: Tasks are scheduled for NOW() but claimed based on worker capacity
- Optional staggering: Pass stagger_seconds > 0 if you need explicit delays
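For reference, one plausible reading of stagger_seconds is that the i-th task in a batch is scheduled i × stagger_seconds after NOW(). That interpretation is an assumption, as is the function name below; check createStaggeredTasks() in src/tasks/task-service.ts for the actual behavior.

```typescript
// Compute scheduled_for values for a batch of tasks.
// ASSUMPTION: task i is delayed by i * staggerSeconds from `now`.
function scheduleTimes(count: number, staggerSeconds: number, now: Date): Date[] {
  return Array.from(
    { length: count },
    (_, i) => new Date(now.getTime() + i * staggerSeconds * 1000),
  );
}
```

With the default stagger_seconds = 0, every task gets the same timestamp and pacing is left entirely to the workers.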
API Endpoints
# Create bulk tasks for specific dispensary IDs
POST /api/tasks/batch/staggered
{
"dispensary_ids": [1, 2, 3, 4],
"role": "product_refresh", # or "product_discovery"
"stagger_seconds": 0, # default: 0 (all NOW)
"platform": "dutchie", # default: "dutchie"
"method": null # "curl" | "http" | null
}
# Create bulk tasks for all stores in a state
POST /api/tasks/crawl-state/:stateCode
{
"stagger_seconds": 0, # default: 0 (all NOW)
"method": "http" # default: "http"
}
Example: Tasks for AZ Stores
curl -X POST http://localhost:3010/api/tasks/crawl-state/AZ \
-H "Content-Type: application/json"
Related Files
| File | Purpose |
|---|---|
| src/tasks/task-service.ts | createStaggeredTasks() method |
| src/routes/tasks.ts | API endpoints for batch task creation |
| src/tasks/task-worker.ts | Worker task claiming and preflight logic |
Documentation
| Doc | Purpose |
|---|---|
| backend/docs/CODEBASE_MAP.md | Current files/directories |
| backend/docs/_archive/ | Historical docs (may be outdated) |