Claude Guidelines for CannaiQ
PERMANENT RULES (NEVER VIOLATE)
1. NO DELETE
Never delete data, files, images, logs, or database rows. CannaiQ is a historical analytics system.
2. NO KILL
Never run pkill, kill, killall, or similar. Say "Please run ./stop-local.sh" instead.
3. NO MANUAL STARTUP
Never start servers manually. Say "Please run ./setup-local.sh" instead.
4. DEPLOYMENT AUTH REQUIRED
Never deploy unless user explicitly says: "CLAUDE — DEPLOYMENT IS NOW AUTHORIZED."
5. DB POOL ONLY
Never import src/db/migrate.ts at runtime. Use src/db/pool.ts for DB access.
Quick Reference
Database Tables
| USE THIS | NOT THIS |
|---|---|
| dispensaries | stores (empty) |
| store_products | products (empty) |
| store_product_snapshots | dutchie_product_snapshots |
Key Files
| Purpose | File |
|---|---|
| Dutchie client | src/platforms/dutchie/client.ts |
| DB pool | src/db/pool.ts |
| Payload fetch | src/tasks/handlers/payload-fetch.ts |
| Product refresh | src/tasks/handlers/product-refresh.ts |
Dutchie GraphQL
- Endpoint: https://dutchie.com/api-3/graphql
- Hash (FilteredProducts): ee29c060826dc41c527e470e9ae502c9b2c169720faa0a9f5d25e1b9a530a4a0
- CRITICAL: Use Status: 'Active' (not null)
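As a rough illustration of how the endpoint, hash, and Status rule fit together, the sketch below builds a request URL. It assumes the endpoint accepts Apollo-style persisted-query GET parameters (operationName, variables, extensions), and the variable names productsFilter and dispensaryId are hypothetical. The real request shape lives in src/platforms/dutchie/client.ts and may differ.

```typescript
// Sketch only: assumes Apollo-style persisted-query GET parameters.
const DUTCHIE_GRAPHQL_ENDPOINT = "https://dutchie.com/api-3/graphql";
const FILTERED_PRODUCTS_HASH =
  "ee29c060826dc41c527e470e9ae502c9b2c169720faa0a9f5d25e1b9a530a4a0";

interface FilteredProductsVariables {
  productsFilter: {
    dispensaryId: string; // hypothetical field name
    Status: "Active"; // the type forbids null, matching the CRITICAL rule
  };
}

function buildFilteredProductsUrl(dispensaryId: string): string {
  const variables: FilteredProductsVariables = {
    productsFilter: { dispensaryId, Status: "Active" },
  };
  const extensions = {
    persistedQuery: { version: 1, sha256Hash: FILTERED_PRODUCTS_HASH },
  };
  const params = new URLSearchParams({
    operationName: "FilteredProducts",
    variables: JSON.stringify(variables),
    extensions: JSON.stringify(extensions),
  });
  return `${DUTCHIE_GRAPHQL_ENDPOINT}?${params.toString()}`;
}
```

Typing Status as the literal "Active" makes it a compile-time error to pass null by accident.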
Frontends
| Folder | Domain | Build |
|---|---|---|
| cannaiq/ | cannaiq.co | Vite |
| findadispo/ | findadispo.com | CRA |
| findagram/ | findagram.co | CRA |
| frontend/ | DEPRECATED | - |
Deprecated Code
DO NOT USE anything in src/_deprecated/:
- hydration/ - Use src/tasks/handlers/
- scraper-v2/ - Use src/platforms/dutchie/
- canonical-hydration/ - Merged into tasks
DO NOT USE src/dutchie-az/db/connection.ts - Use src/db/pool.ts
Local Development
./setup-local.sh # Start all services
./stop-local.sh # Stop all services
| Service | URL |
|---|---|
| API | http://localhost:3010 |
| Admin | http://localhost:8080/admin |
| PostgreSQL | localhost:54320 |
WordPress Plugin (ACTIVE)
Plugin Files
| File | Purpose |
|---|---|
| wordpress-plugin/cannaiq-menus.php | Main plugin (CannaIQ brand) |
| wordpress-plugin/crawlsy-menus.php | Legacy plugin (Crawlsy brand) |
| wordpress-plugin/VERSION | Version tracking |
API Routes (Backend)
- GET /api/v1/wordpress/dispensaries - List dispensaries
- GET /api/v1/wordpress/dispensary/:id/menu - Get menu data
- Route file: backend/src/routes/wordpress.ts
Versioning
Bump wordpress-plugin/VERSION on changes:
- Minor (x.x.N): bug fixes
- Middle (x.N.0): new features
- Major (N.0.0): breaking changes (user must request)
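The bump rules can be sketched as a small helper. This assumes wordpress-plugin/VERSION holds a plain three-part version such as 1.2.3; the function name bumpVersion is hypothetical and not part of the repo.

```typescript
// Sketch of the bump rules above, assuming a plain MAJOR.MIDDLE.MINOR string.
type BumpKind = "minor" | "middle" | "major";

function bumpVersion(current: string, kind: BumpKind): string {
  const match = /^(\d+)\.(\d+)\.(\d+)$/.exec(current.trim());
  if (!match) {
    throw new Error(`Expected a version like 1.2.3, got "${current}"`);
  }
  const major = Number(match[1]);
  const middle = Number(match[2]);
  const minor = Number(match[3]);
  switch (kind) {
    case "minor": // bug fixes: x.x.N
      return `${major}.${middle}.${minor + 1}`;
    case "middle": // new features: x.N.0
      return `${major}.${middle + 1}.0`;
    case "major": // breaking changes: N.0.0 (only when the user requests it)
      return `${major + 1}.0.0`;
  }
}
```

For example, bumpVersion("1.2.3", "middle") resets the last part and returns "1.3.0".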
Puppeteer Scraping (Browser-Based)
Age Gate Bypass
Most dispensary sites require age verification. The browser scraper handles this automatically:
Utility File: src/utils/age-gate.ts
Key Functions:
- setAgeGateCookies(page, url, state) - Set cookies BEFORE navigation to prevent gate
- hasAgeGate(page) - Detect if page shows age verification
- bypassAgeGate(page, state) - Click through age gate if displayed
- detectStateFromUrl(url) - Extract state from URL (e.g., -az- → Arizona)
Cookie Names Set:
- age_gate_passed: 'true'
- selected_state: '<state>'
- age_verified: 'true'
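To make the cookie set concrete, here is a minimal sketch of the objects such a helper might build before navigation. It is not the code in src/utils/age-gate.ts. The helper name buildAgeGateCookies is hypothetical, and scoping the cookies to the URL's hostname with path "/" is an assumption. The object shape follows the name/value/domain/path fields that Puppeteer's page.setCookie accepts.

```typescript
// Sketch only: the cookie names come from the list above; domain/path are assumed.
interface AgeGateCookie {
  name: string;
  value: string;
  domain: string;
  path: string;
}

function buildAgeGateCookies(url: string, state: string): AgeGateCookie[] {
  const domain = new URL(url).hostname; // assumption: scope to the menu host
  const pairs: Array<[string, string]> = [
    ["age_gate_passed", "true"],
    ["selected_state", state],
    ["age_verified", "true"],
  ];
  return pairs.map(([name, value]) => ({ name, value, domain, path: "/" }));
}
```

The resulting array could then be spread into a call such as page.setCookie(...cookies) before page.goto.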
Bypass Methods (tried in order):
- Custom dropdown (shadcn/radix style) - Curaleaf pattern
- Standard <select> dropdown
- State button/card click
- Direct "Yes"/"Enter" button
Usage Pattern:
import { setAgeGateCookies, bypassAgeGate } from '../utils/age-gate';
// Set cookies BEFORE navigation
await setAgeGateCookies(page, menuUrl, 'Arizona');
await page.goto(menuUrl);
// If gate still appears, bypass it
await bypassAgeGate(page, 'Arizona');
Note: Deeply-Rooted (AZ) does NOT use an age gate, so it is a good target for preflight testing.
Dual-Transport Preflight
Workers run BOTH preflight checks on startup:
| Transport | Test Method | Use Case |
|---|---|---|
| curl | axios + proxy → httpbin.org | Fast API requests |
| http | Puppeteer + proxy + StealthPlugin | Anti-detect, browser fingerprint |
HTTP Preflight Steps:
- Get proxy from pool (CrawlRotator)
- Visit fingerprint.com demo (or amiunique.org fallback) to verify IP and anti-detect
- Visit Dutchie embedded menu to establish session
- Make GraphQL request from browser context
Files:
- src/services/curl-preflight.ts
- src/services/puppeteer-preflight.ts
- migrations/084_dual_transport_preflight.sql
Task Method Column: Tasks have a method column ('curl' | 'http' | null):
- null = any worker can claim
- 'curl' = only workers with passed curl preflight
- 'http' = only workers with passed http preflight
Currently ALL crawl tasks require method = 'http'.
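The claim rule can be written as a single predicate. The WorkerPreflight shape and its field names are hypothetical, used only to illustrate the rule. The real check is presumably done where tasks are claimed and may be expressed in SQL instead.

```typescript
// Sketch of the claim rule above; field names are illustrative assumptions.
type TaskMethod = "curl" | "http" | null;

interface WorkerPreflight {
  curlPassed: boolean; // worker passed the curl preflight
  httpPassed: boolean; // worker passed the http (Puppeteer) preflight
}

function canClaim(method: TaskMethod, worker: WorkerPreflight): boolean {
  if (method === null) return true; // any worker can claim
  return method === "curl" ? worker.curlPassed : worker.httpPassed;
}
```

Since every crawl task currently has method = 'http', a worker that passed only the curl preflight would claim nothing under this rule.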
Anti-Detect Fingerprint Distribution
Browser fingerprints are randomized using realistic market share distributions:
Files:
- src/services/crawl-rotator.ts - Device/browser selection
- src/services/http-fingerprint.ts - HTTP header fingerprinting
Device Weights (matches real traffic patterns):
| Device | Weight | Percentage |
|---|---|---|
| Mobile | 62 | 62% |
| Desktop | 36 | 36% |
| Tablet | 2 | 2% |
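A weighted pick over these values might look like the sketch below. It is a generic weighted-random selection, not the code in src/services/crawl-rotator.ts. The names DEVICE_WEIGHTS and pickDevice are hypothetical, and the random source is injectable so the behavior can be checked deterministically.

```typescript
// Sketch: generic weighted random choice using the weights from the table.
type Device = "mobile" | "desktop" | "tablet";

const DEVICE_WEIGHTS: ReadonlyArray<[Device, number]> = [
  ["mobile", 62],
  ["desktop", 36],
  ["tablet", 2],
];

function pickDevice(random: () => number = Math.random): Device {
  const total = DEVICE_WEIGHTS.reduce((sum, entry) => sum + entry[1], 0);
  let roll = random() * total; // uniform in [0, total)
  for (const [device, weight] of DEVICE_WEIGHTS) {
    if (roll < weight) return device;
    roll -= weight; // move on to the next bucket
  }
  // Only reachable through floating-point edge cases; fall back to the last entry.
  return DEVICE_WEIGHTS[DEVICE_WEIGHTS.length - 1][0];
}
```

Because the weights sum to 100, each weight equals its percentage. The loop works for any positive weights.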
Allowed Browsers (only realistic ones):
- Chrome (67% market share)
- Safari (20% market share)
- Edge (6% market share)
- Firefox (3% market share)
All other browsers are filtered out. Uses intoli/user-agents library for realistic UA generation.
HTTP Header Fingerprinting:
- DNT (Do Not Track): 30% probability of sending
- Accept headers: Browser-specific variations
- Header ordering: Matches real browser behavior (Chrome, Firefox, Safari, Edge each have unique order)
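The DNT rule is a simple probabilistic toggle. The sketch below shows only that part. The helper name withOptionalDnt is hypothetical, and the browser-specific Accept values and header orders are left out because they are not listed here.

```typescript
// Sketch: add a DNT header with 30% probability; not the http-fingerprint.ts code.
function withOptionalDnt(
  headers: Record<string, string>,
  random: () => number = Math.random,
): Record<string, string> {
  // Copy rather than mutate so the caller's header template stays reusable.
  return random() < 0.3 ? { ...headers, DNT: "1" } : { ...headers };
}
```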
curl-impersonate Binaries (for curl transport):
| Browser | Binary |
|---|---|
| Chrome | curl_chrome131 |
| Edge | curl_chrome131 |
| Firefox | curl_ff133 |
| Safari | curl_safari17 |
These binaries mimic real browser TLS fingerprints to avoid detection.
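The table maps directly onto a lookup. The sketch below assumes the selected browser arrives as a name string; binaryFor is a hypothetical helper and not necessarily how the curl transport resolves its binary.

```typescript
// Sketch: browser name to curl-impersonate binary, taken from the table above.
type Browser = "chrome" | "edge" | "firefox" | "safari";

const CURL_IMPERSONATE_BINARY: Record<Browser, string> = {
  chrome: "curl_chrome131",
  edge: "curl_chrome131", // Edge shares the Chrome binary
  firefox: "curl_ff133",
  safari: "curl_safari17",
};

function binaryFor(browserName: string): string {
  const key = browserName.toLowerCase();
  // hasOwnProperty avoids matching inherited keys such as "toString".
  if (Object.prototype.hasOwnProperty.call(CURL_IMPERSONATE_BINARY, key)) {
    return CURL_IMPERSONATE_BINARY[key as Browser];
  }
  throw new Error(`No curl-impersonate binary for browser "${browserName}"`);
}
```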
Worker Architecture (Kubernetes)
Persistent Workers (StatefulSet)
Workers run as a StatefulSet with 8 persistent pods. They maintain identity across restarts.
Pod Names: scraper-worker-0 through scraper-worker-7
Key Properties:
- updateStrategy: OnDelete - Pods only update when manually deleted (no automatic restarts)
- podManagementPolicy: Parallel - All pods start simultaneously
- Workers register with their pod name as identity
K8s Manifest: backend/k8s/scraper-worker-statefulset.yaml
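StatefulSet pods are named with the set name followed by a fixed ordinal, which is what makes the identity stable. The sketch below shows how a worker could validate and parse its own pod name. How the worker actually obtains that name (for example from the HOSTNAME environment variable, which typically matches the pod name) is an assumption, and both helper names are hypothetical.

```typescript
// Sketch: derive the expected pod names and parse a worker's ordinal.
const STATEFULSET_NAME = "scraper-worker";
const REPLICAS = 8;

function expectedPodNames(): string[] {
  const names: string[] = [];
  for (let i = 0; i < REPLICAS; i++) {
    names.push(`${STATEFULSET_NAME}-${i}`);
  }
  return names;
}

function podOrdinal(podName: string): number {
  const match = /^scraper-worker-(\d+)$/.exec(podName);
  if (!match) {
    throw new Error(`Not a scraper worker pod name: "${podName}"`);
  }
  const ordinal = Number(match[1]);
  if (ordinal >= REPLICAS) {
    throw new Error(`Ordinal ${ordinal} is outside the ${REPLICAS}-pod set`);
  }
  return ordinal;
}
```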
Worker Lifecycle
- Startup: Worker registers in worker_registry table with pod name
- Preflight: Runs dual-transport preflights (curl + http), reports IPs and fingerprint
- Task Loop: Polls for tasks, executes them, reports status
- Shutdown: Graceful 60-second termination period
NEVER Restart Workers Unnecessarily
Claude must NOT:
- Restart workers unless explicitly requested
- Use kubectl rollout restart on workers
- Use kubectl set image on workers (this triggers restart)
To update worker code (only when user authorizes):
- Build and push new image with version tag
- Update StatefulSet image reference
- Manually delete pods one at a time when ready:
kubectl delete pod scraper-worker-0 -n dispensary-scraper
Worker Registry API
Endpoint: GET /api/worker-registry/workers
Response Fields:
| Field | Description |
|---|---|
| pod_name | Kubernetes pod name |
| worker_id | Internal worker UUID |
| status | active, idle, offline |
| curl_ip | IP from curl preflight |
| http_ip | IP from Puppeteer preflight |
| preflight_status | pending, passed, failed |
| preflight_at | Timestamp of last preflight |
| fingerprint_data | Browser fingerprint JSON |
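For client code, the response fields could be typed roughly as follows. Only the field names and the enumerated status values come from the table. The nullability, the string timestamp, and the countByStatus helper are assumptions for illustration.

```typescript
// Sketch of a client-side type for one worker entry; nullability is assumed.
type WorkerStatus = "active" | "idle" | "offline";
type PreflightStatus = "pending" | "passed" | "failed";

interface WorkerRegistryEntry {
  pod_name: string;
  worker_id: string;
  status: WorkerStatus;
  curl_ip: string | null; // assumed null until the curl preflight reports
  http_ip: string | null; // assumed null until the Puppeteer preflight reports
  preflight_status: PreflightStatus;
  preflight_at: string | null; // assumed to be a serialized timestamp
  fingerprint_data: unknown; // JSON blob; shape not documented here
}

function countByStatus(
  workers: WorkerRegistryEntry[],
): Record<WorkerStatus, number> {
  const counts: Record<WorkerStatus, number> = { active: 0, idle: 0, offline: 0 };
  for (const worker of workers) {
    counts[worker.status] += 1;
  }
  return counts;
}
```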
Documentation
| Doc | Purpose |
|---|---|
| backend/docs/CODEBASE_MAP.md | Current files/directories |
| backend/docs/_archive/ | Historical docs (may be outdated) |