cannaiq/CLAUDE.md
Kelly e17b3b225a feat(k8s): Add StatefulSet for persistent workers
- Add scraper-worker-statefulset.yaml with 8 persistent pods
- updateStrategy: OnDelete prevents automatic restarts
- Workers maintain stable identity across restarts
- Document worker architecture in CLAUDE.md
- Add worker registry API endpoint documentation

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-12 00:30:08 -07:00

Claude Guidelines for CannaiQ

PERMANENT RULES (NEVER VIOLATE)

1. NO DELETE

Never delete data, files, images, logs, or database rows. CannaiQ is a historical analytics system.

2. NO KILL

Never run pkill, kill, killall, or similar. Say "Please run ./stop-local.sh" instead.

3. NO MANUAL STARTUP

Never start servers manually. Say "Please run ./setup-local.sh" instead.

4. DEPLOYMENT AUTH REQUIRED

Never deploy unless user explicitly says: "CLAUDE — DEPLOYMENT IS NOW AUTHORIZED."

5. DB POOL ONLY

Never import src/db/migrate.ts at runtime. Use src/db/pool.ts for DB access.


Quick Reference

Database Tables

| USE THIS | NOT THIS |
|----------|----------|
| dispensaries | stores (empty) |
| store_products | products (empty) |
| store_product_snapshots | dutchie_product_snapshots |

Key Files

| Purpose | File |
|---------|------|
| Dutchie client | src/platforms/dutchie/client.ts |
| DB pool | src/db/pool.ts |
| Payload fetch | src/tasks/handlers/payload-fetch.ts |
| Product refresh | src/tasks/handlers/product-refresh.ts |

Dutchie GraphQL

  • Endpoint: https://dutchie.com/api-3/graphql
  • Hash (FilteredProducts): ee29c060826dc41c527e470e9ae502c9b2c169720faa0a9f5d25e1b9a530a4a0
  • CRITICAL: Use Status: 'Active' (not null)
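
A minimal sketch of how a request URL for FilteredProducts might be assembled from the values above, assuming the endpoint accepts an Apollo-style persisted-query envelope. The variable names (`productsFilter`, `dispensaryId`) and the helper `buildFilteredProductsUrl` are hypothetical and not taken from src/platforms/dutchie/client.ts; only the endpoint, the hash, and the `Status: 'Active'` rule come from this document.

```typescript
// Values copied from the Quick Reference above.
const DUTCHIE_ENDPOINT = "https://dutchie.com/api-3/graphql";
const FILTERED_PRODUCTS_HASH =
  "ee29c060826dc41c527e470e9ae502c9b2c169720faa0a9f5d25e1b9a530a4a0";

// ASSUMPTION: this variable shape is illustrative only.
interface FilteredProductsVariables {
  productsFilter: {
    dispensaryId: string;
    Status: "Active"; // CRITICAL: always 'Active', never null
  };
}

// Hypothetical helper: builds a GET URL carrying the persisted-query hash.
function buildFilteredProductsUrl(dispensaryId: string): string {
  const variables: FilteredProductsVariables = {
    productsFilter: { dispensaryId, Status: "Active" },
  };
  const extensions = {
    persistedQuery: { version: 1, sha256Hash: FILTERED_PRODUCTS_HASH },
  };
  return (
    DUTCHIE_ENDPOINT +
    "?operationName=FilteredProducts" +
    "&variables=" + encodeURIComponent(JSON.stringify(variables)) +
    "&extensions=" + encodeURIComponent(JSON.stringify(extensions))
  );
}
```

Typing `Status` as the literal `"Active"` lets the compiler reject a `null` status before a request is ever sent.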

Frontends

| Folder | Domain | Build |
|--------|--------|-------|
| cannaiq/ | cannaiq.co | Vite |
| findadispo/ | findadispo.com | CRA |
| findagram/ | findagram.co | CRA |
| frontend/ | DEPRECATED | - |

Deprecated Code

DO NOT USE anything in src/_deprecated/:

  • hydration/ - Use src/tasks/handlers/
  • scraper-v2/ - Use src/platforms/dutchie/
  • canonical-hydration/ - Merged into tasks

DO NOT USE src/dutchie-az/db/connection.ts - Use src/db/pool.ts


Local Development

./setup-local.sh   # Start all services
./stop-local.sh    # Stop all services

| Service | URL |
|---------|-----|
| API | http://localhost:3010 |
| Admin | http://localhost:8080/admin |
| PostgreSQL | localhost:54320 |

WordPress Plugin (ACTIVE)

Plugin Files

| File | Purpose |
|------|---------|
| wordpress-plugin/cannaiq-menus.php | Main plugin (CannaIQ brand) |
| wordpress-plugin/crawlsy-menus.php | Legacy plugin (Crawlsy brand) |
| wordpress-plugin/VERSION | Version tracking |

API Routes (Backend)

  • GET /api/v1/wordpress/dispensaries - List dispensaries
  • GET /api/v1/wordpress/dispensary/:id/menu - Get menu data
  • Route file: backend/src/routes/wordpress.ts

Versioning

Bump wordpress-plugin/VERSION on changes:

  • Minor (x.x.N): bug fixes
  • Middle (x.N.0): new features
  • Major (N.0.0): breaking changes (user must request)
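
The bump rules above can be sketched as a small pure function. `bumpVersion` and `BumpKind` are hypothetical names, and the sketch assumes the VERSION file holds a plain `x.y.z` string; it uses this document's terms (minor, middle, major) rather than the usual semver names.

```typescript
type BumpKind = "minor" | "middle" | "major";

// Hypothetical helper: returns the next version string for a given bump kind.
function bumpVersion(version: string, kind: BumpKind): string {
  // Accept exactly three dot-separated non-negative integers.
  const match = /^(\d+)\.(\d+)\.(\d+)$/.exec(version.trim());
  if (!match) {
    throw new Error(`Expected x.y.z, got "${version}"`);
  }
  const major = Number(match[1]);
  const middle = Number(match[2]);
  const minor = Number(match[3]);

  switch (kind) {
    case "minor": // x.x.N: bug fixes
      return `${major}.${middle}.${minor + 1}`;
    case "middle": // x.N.0: new features, last digit resets
      return `${major}.${middle + 1}.0`;
    case "major": // N.0.0: breaking changes (user must request)
      return `${major + 1}.0.0`;
  }
}
```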

Puppeteer Scraping (Browser-Based)

Age Gate Bypass

Most dispensary sites require age verification. The browser scraper handles this automatically:

Utility File: src/utils/age-gate.ts

Key Functions:

  • setAgeGateCookies(page, url, state) - Set cookies BEFORE navigation to prevent gate
  • hasAgeGate(page) - Detect if page shows age verification
  • bypassAgeGate(page, state) - Click through age gate if displayed
  • detectStateFromUrl(url) - Extract state from URL (e.g., -az- → Arizona)
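
To illustrate the URL-based state detection, here is a minimal sketch that looks for a hyphen-delimited two-letter token such as `-az-`. It is not the implementation in src/utils/age-gate.ts: the name `detectStateFromUrlSketch`, the lookup table contents beyond `az`, and the `null` return for unknown URLs are assumptions.

```typescript
// ASSUMPTION: a small illustrative lookup; the real utility may cover more states.
const STATE_BY_CODE: { [code: string]: string } = {
  az: "Arizona",
  ca: "California",
  co: "Colorado",
  il: "Illinois",
  mi: "Michigan",
  nv: "Nevada",
};

// Returns the first state whose code appears as "-xx-" in the URL, or null.
function detectStateFromUrlSketch(url: string): string | null {
  // The lookahead leaves the trailing hyphen unconsumed so that
  // adjacent tokens like "-in-az-" are each examined.
  const pattern = /-([a-z]{2})(?=-)/g;
  const lower = url.toLowerCase();
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(lower)) !== null) {
    const state = STATE_BY_CODE[match[1]];
    if (state) {
      return state;
    }
  }
  return null;
}
```

A token match like this can misfire on slugs that happen to contain a state code, so a caller would likely want an explicit state argument to take precedence.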

Cookie Names Set:

  • age_gate_passed: 'true'
  • selected_state: '<state>'
  • age_verified: 'true'

Bypass Methods (tried in order):

  1. Custom dropdown (shadcn/radix style) - Curaleaf pattern
  2. Standard <select> dropdown
  3. State button/card click
  4. Direct "Yes"/"Enter" button

Usage Pattern:

import { setAgeGateCookies, bypassAgeGate } from '../utils/age-gate';

// Set cookies BEFORE navigation
await setAgeGateCookies(page, menuUrl, 'Arizona');
await page.goto(menuUrl);

// If gate still appears, bypass it
await bypassAgeGate(page, 'Arizona');

Note: Deeply-Rooted (AZ) does NOT use an age gate, so it is a good target for preflight testing.

Dual-Transport Preflight

Workers run BOTH preflight checks on startup:

| Transport | Test Method | Use Case |
|-----------|-------------|----------|
| curl | axios + proxy → httpbin.org | Fast API requests |
| http | Puppeteer + proxy + StealthPlugin | Anti-detect, browser fingerprint |

HTTP Preflight Steps:

  1. Get proxy from pool (CrawlRotator)
  2. Visit fingerprint.com demo (or amiunique.org fallback) to verify IP and anti-detect
  3. Visit Dutchie embedded menu to establish session
  4. Make GraphQL request from browser context

Files:

  • src/services/curl-preflight.ts
  • src/services/puppeteer-preflight.ts
  • migrations/084_dual_transport_preflight.sql

Task Method Column: Tasks have a method column ('curl' | 'http' | null):

  • null = any worker can claim
  • 'curl' = only workers with passed curl preflight
  • 'http' = only workers with passed http preflight

Currently ALL crawl tasks require method = 'http'.
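
The claim rule for the method column can be expressed as a small predicate. This is a sketch under assumed names (`canClaim`, `WorkerPreflight`); the actual claim logic may live in SQL or elsewhere.

```typescript
type TaskMethod = "curl" | "http" | null;

// ASSUMPTION: a worker tracks one pass/fail flag per transport.
interface WorkerPreflight {
  curlPassed: boolean;
  httpPassed: boolean;
}

// True when the worker is allowed to claim a task with the given method.
function canClaim(method: TaskMethod, worker: WorkerPreflight): boolean {
  if (method === null) {
    return true; // null = any worker can claim
  }
  return method === "curl" ? worker.curlPassed : worker.httpPassed;
}
```

Since all crawl tasks currently require `'http'`, a worker that passed only the curl preflight would claim nothing under this rule.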

Anti-Detect Fingerprint Distribution

Browser fingerprints are randomized using realistic market share distributions:

Files:

  • src/services/crawl-rotator.ts - Device/browser selection
  • src/services/http-fingerprint.ts - HTTP header fingerprinting

Device Weights (matches real traffic patterns):

| Device | Weight | Percentage |
|--------|--------|------------|
| Mobile | 62 | 62% |
| Desktop | 36 | 36% |
| Tablet | 2 | 2% |

Allowed Browsers (only realistic ones):

  • Chrome (67% market share)
  • Safari (20% market share)
  • Edge (6% market share)
  • Firefox (3% market share)

All other browsers are filtered out. Uses intoli/user-agents library for realistic UA generation.
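
A weighted random pick is one straightforward way to realize these distributions. The sketch below is not the code in src/services/crawl-rotator.ts; `weightedPick` and the injectable `rng` parameter are assumptions added so the behavior can be tested deterministically. The browser weights sum to 96, so the pick normalizes by the total.

```typescript
interface Weighted<T> {
  value: T;
  weight: number;
}

// Weights taken from the table and list above.
const DEVICE_WEIGHTS: Weighted<string>[] = [
  { value: "mobile", weight: 62 },
  { value: "desktop", weight: 36 },
  { value: "tablet", weight: 2 },
];

const BROWSER_WEIGHTS: Weighted<string>[] = [
  { value: "chrome", weight: 67 },
  { value: "safari", weight: 20 },
  { value: "edge", weight: 6 },
  { value: "firefox", weight: 3 },
];

// Picks one value with probability proportional to its weight.
// rng must return a number in [0, 1); it defaults to Math.random.
function weightedPick<T>(items: Weighted<T>[], rng: () => number = Math.random): T {
  if (items.length === 0) {
    throw new Error("weightedPick requires at least one item");
  }
  const total = items.reduce((sum, item) => sum + item.weight, 0);
  let threshold = rng() * total;
  for (const item of items) {
    threshold -= item.weight;
    if (threshold < 0) {
      return item.value;
    }
  }
  // Only reachable through floating-point rounding at the upper edge.
  return items[items.length - 1].value;
}
```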

HTTP Header Fingerprinting:

  • DNT (Do Not Track): 30% probability of sending
  • Accept headers: Browser-specific variations
  • Header ordering: Matches real browser behavior (Chrome, Firefox, Safari, Edge each have unique order)
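
The DNT probability and header ordering can be combined in one small builder. This is a sketch, not the contents of src/services/http-fingerprint.ts: `buildOrderedHeaders` is a hypothetical name, and `EXAMPLE_ORDER` is a placeholder order, not one measured from a real browser.

```typescript
const DNT_PROBABILITY = 0.3; // 30% chance of sending DNT

type HeaderList = Array<[string, string]>;

// Emits headers in the given order; DNT is included only when the roll succeeds.
function buildOrderedHeaders(
  order: string[],
  values: { [name: string]: string },
  rng: () => number = Math.random
): HeaderList {
  const sendDnt = rng() < DNT_PROBABILITY;
  const headers: HeaderList = [];
  for (const name of order) {
    if (name === "DNT") {
      if (sendDnt) {
        headers.push(["DNT", "1"]);
      }
      continue;
    }
    const value = values[name];
    if (value !== undefined) {
      headers.push([name, value]);
    }
  }
  return headers;
}

// ASSUMPTION: placeholder ordering for illustration only.
const EXAMPLE_ORDER = ["Accept", "DNT", "Accept-Language", "User-Agent"];
```

Returning an ordered list of pairs rather than a plain object keeps the ordering explicit for whichever HTTP client consumes it.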

curl-impersonate Binaries (for curl transport):

| Browser | Binary |
|---------|--------|
| Chrome | curl_chrome131 |
| Edge | curl_chrome131 |
| Firefox | curl_ff133 |
| Safari | curl_safari17 |

These binaries mimic real browser TLS fingerprints to avoid detection.
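
The browser-to-binary mapping could be kept as a typed lookup so that an unsupported browser fails at compile time. The names `Browser`, `CURL_IMPERSONATE_BINARY`, and `binaryFor` are hypothetical; only the binary names come from the table above.

```typescript
type Browser = "chrome" | "edge" | "firefox" | "safari";

// Edge shares the Chrome binary, as listed in the table above.
const CURL_IMPERSONATE_BINARY: Record<Browser, string> = {
  chrome: "curl_chrome131",
  edge: "curl_chrome131",
  firefox: "curl_ff133",
  safari: "curl_safari17",
};

// Returns the curl-impersonate binary name for a selected browser.
function binaryFor(browser: Browser): string {
  return CURL_IMPERSONATE_BINARY[browser];
}
```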


Worker Architecture (Kubernetes)

Persistent Workers (StatefulSet)

Workers run as a StatefulSet with 8 persistent pods. They maintain identity across restarts.

Pod Names: scraper-worker-0 through scraper-worker-7

Key Properties:

  • updateStrategy: OnDelete - Pods only pick up spec changes when manually deleted (no automatic rolling restarts on update)
  • podManagementPolicy: Parallel - All pods start simultaneously
  • Workers register with their pod name as identity

K8s Manifest: backend/k8s/scraper-worker-statefulset.yaml
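
For orientation, the key properties above are written out below as a typed object, together with the pod names they imply. This is a partial sketch, not a copy of the manifest: field names follow the Kubernetes apps/v1 StatefulSet schema, the StatefulSet name is inferred from the pod names, the namespace is taken from the kubectl command later in this document, and required fields such as `serviceName`, `selector`, and `template` are omitted.

```typescript
// Partial, illustrative view of the StatefulSet; not the full manifest.
const statefulSetSketch = {
  apiVersion: "apps/v1",
  kind: "StatefulSet",
  metadata: { name: "scraper-worker", namespace: "dispensary-scraper" },
  spec: {
    replicas: 8,
    podManagementPolicy: "Parallel", // all pods start simultaneously
    updateStrategy: { type: "OnDelete" }, // pods update only when deleted
  },
};

// StatefulSet pods are named <statefulset-name>-<ordinal>, starting at 0.
function podNames(name: string, replicas: number): string[] {
  const names: string[] = [];
  for (let i = 0; i < replicas; i++) {
    names.push(`${name}-${i}`);
  }
  return names;
}
```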

Worker Lifecycle

  1. Startup: Worker registers in worker_registry table with pod name
  2. Preflight: Runs dual-transport preflights (curl + http), reports IPs and fingerprint
  3. Task Loop: Polls for tasks, executes them, reports status
  4. Shutdown: Graceful 60-second termination period
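
The four lifecycle steps can be sketched as a loop over injected dependencies. Everything here is hypothetical (`WorkerDeps`, `runWorker`, `POLL_INTERVAL_MS`, and the choice to abort when no transport passes); the sketch assumes `shouldStop` flips to true when the pod receives its termination signal, so the current task finishes before the loop exits.

```typescript
// ASSUMPTION: all side effects are injected so the loop can be tested in isolation.
interface WorkerDeps {
  register(podName: string): Promise<void>;
  runPreflights(): Promise<{ curl: boolean; http: boolean }>;
  claimTask(): Promise<string | null>; // task id, or null when the queue is empty
  executeTask(taskId: string): Promise<void>;
  shouldStop(): boolean;
  sleep(ms: number): Promise<void>;
}

const POLL_INTERVAL_MS = 5000; // hypothetical idle delay between polls

async function runWorker(podName: string, deps: WorkerDeps): Promise<void> {
  // 1. Startup: register under the pod name.
  await deps.register(podName);

  // 2. Preflight: run both transports.
  const preflight = await deps.runPreflights();
  if (!preflight.curl && !preflight.http) {
    throw new Error("No transport passed preflight");
  }

  // 3. Task loop: poll, execute, repeat until asked to stop.
  while (!deps.shouldStop()) {
    const taskId = await deps.claimTask();
    if (taskId === null) {
      await deps.sleep(POLL_INTERVAL_MS);
      continue;
    }
    await deps.executeTask(taskId);
  }
  // 4. Shutdown: the loop exits between tasks, inside the grace period.
}
```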

NEVER Restart Workers Unnecessarily

Claude must NOT:

  • Restart workers unless explicitly requested
  • Use kubectl rollout restart on workers
  • Use kubectl set image on workers (this triggers restart)

To update worker code (only when user authorizes):

  1. Build and push new image with version tag
  2. Update StatefulSet image reference
  3. Manually delete pods one at a time when ready: kubectl delete pod scraper-worker-0 -n dispensary-scraper

Worker Registry API

Endpoint: GET /api/worker-registry/workers

Response Fields:

| Field | Description |
|-------|-------------|
| pod_name | Kubernetes pod name |
| worker_id | Internal worker UUID |
| status | active, idle, offline |
| curl_ip | IP from curl preflight |
| http_ip | IP from Puppeteer preflight |
| preflight_status | pending, passed, failed |
| preflight_at | Timestamp of last preflight |
| fingerprint_data | Browser fingerprint JSON |
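
A client consuming this endpoint might type the response fields as follows. The field names and enumerated values come from the table above; the nullability, the timestamp being a string, and the helper `readyWorkers` are assumptions, and the shape of the response envelope is not specified here.

```typescript
interface WorkerRegistryEntry {
  pod_name: string;
  worker_id: string;
  status: "active" | "idle" | "offline";
  curl_ip: string | null; // ASSUMPTION: null until the curl preflight runs
  http_ip: string | null; // ASSUMPTION: null until the http preflight runs
  preflight_status: "pending" | "passed" | "failed";
  preflight_at: string | null; // ASSUMPTION: serialized as a timestamp string
  fingerprint_data: unknown; // browser fingerprint JSON, shape not documented
}

// Hypothetical helper: pod names of workers that are online and passed preflight.
function readyWorkers(workers: WorkerRegistryEntry[]): string[] {
  return workers
    .filter((w) => w.status !== "offline" && w.preflight_status === "passed")
    .map((w) => w.pod_name);
}
```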

Documentation

| Doc | Purpose |
|-----|---------|
| backend/docs/CODEBASE_MAP.md | Current files/directories |
| backend/docs/_archive/ | Historical docs (may be outdated) |