cannaiq/CLAUDE.md
Kelly eedc027ff6 fix(workers): Report geo to worker_registry when identity claimed
Workers were showing "No geo assigned" on dashboard because geo info
was set internally but never reported to worker_registry after
identity pool claim.

Now updates current_state and current_city columns when identity
is claimed, so dashboard shows correct geo assignment.

Also documents CI/CD batching rule to minimize build time.

🤖 Generated with [Claude Code](https://claude.ai/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-12-14 01:14:31 -07:00


Claude Guidelines for CannaiQ

CURRENT ENVIRONMENT: PRODUCTION

We are working in PRODUCTION only. All database queries and API calls should target the remote production environment, not localhost. Use kubectl port-forward or remote DB connections as needed.

PERMANENT RULES (NEVER VIOLATE)

1. NO DELETE

Never delete data, files, images, logs, or database rows. CannaiQ is a historical analytics system.

2. NO KILL

Never run pkill, kill, killall, or similar. Say "Please run ./stop-local.sh" instead.

3. NO MANUAL STARTUP

Never start servers manually. Say "Please run ./setup-local.sh" instead.

4. DEPLOYMENT AUTH REQUIRED

Never deploy unless user explicitly says: "CLAUDE — DEPLOYMENT IS NOW AUTHORIZED."

5. DB POOL ONLY

Never import src/db/migrate.ts at runtime. Use src/db/pool.ts for DB access.

6. CI/CD DEPLOYMENT — BATCH CHANGES, PUSH ONCE

Never manually deploy or check deployment status. The project uses Woodpecker CI.

CRITICAL: Each CI build takes 30 minutes. NEVER push incrementally.

Workflow:

  1. Make ALL related code changes first
  2. Test locally if possible (./setup-local.sh)
  3. ONE commit with all changes
  4. ONE push to master
  5. STOP - CI handles the rest
  6. Wait for user to confirm deployment worked

DO NOT:

  • Push multiple small commits (each triggers 30-min build)
  • Run kubectl rollout status to check deployment
  • Run kubectl logs to verify new code is running
  • Manually restart pods
  • Check CI pipeline status

Batch everything, push once, wait for user feedback.

7. K8S POD LIMITS — CRITICAL

EXACTLY 8 PODS for scraper-worker deployment. NEVER CHANGE THIS.

Replica Count is LOCKED:

  • Always 8 replicas — no more, no less
  • NEVER scale down (even temporarily)
  • NEVER scale up beyond 8
  • If pods are not 8, restore to 8 immediately

Pods vs Workers:

  • Pod = Kubernetes container instance (ALWAYS 8)
  • Worker = Concurrent task runner INSIDE a pod (controlled by MAX_CONCURRENT_TASKS env var)
  • Formula: 8 pods × MAX_CONCURRENT_TASKS (3) = 24 total concurrent workers

Browser Task Memory Limits:

  • Each Puppeteer/Chrome browser uses ~400 MB RAM
  • Pod memory limit is 2 GB
  • MAX_CONCURRENT_TASKS=3 is the safe maximum for browser tasks
  • More than 3 concurrent browsers per pod risks an OOM crash (4 is risky, 5+ exceeds the 2 GB limit)

| Browsers | RAM Used | Status |
|----------|----------|--------|
| 3 | ~1.3 GB | Safe (recommended) |
| 4 | ~1.7 GB | Risky |
| 5+ | >2 GB | OOM crash |
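
The figures above are consistent with roughly 400 MB per browser plus about 100 MB of base overhead per pod. The sketch below reproduces that arithmetic. The 100 MB overhead and the 75% "safe" threshold are assumptions inferred from the table, not values taken from the codebase, and the function names are hypothetical.

```typescript
const BROWSER_MB = 400;       // per-browser estimate stated in the guidelines
const POD_LIMIT_MB = 2048;    // 2 GB pod memory limit
const BASE_OVERHEAD_MB = 100; // assumption: chosen so that 3 browsers ≈ 1.3 GB
const SAFE_FRACTION = 0.75;   // assumption: hypothetical headroom threshold

type MemoryStatus = "safe" | "risky" | "oom";

// Estimated pod memory for a given number of concurrent browsers.
function estimatePodMemoryMb(browsers: number): number {
  return BASE_OVERHEAD_MB + browsers * BROWSER_MB;
}

// Classify a browser count against the pod memory limit.
function classifyBrowserCount(browsers: number): MemoryStatus {
  const used = estimatePodMemoryMb(browsers);
  if (used > POD_LIMIT_MB) return "oom";
  return used <= POD_LIMIT_MB * SAFE_FRACTION ? "safe" : "risky";
}

// Total concurrent workers across the deployment.
function totalWorkers(pods: number, maxConcurrentTasks: number): number {
  return pods * maxConcurrentTasks;
}
```

With 8 pods and MAX_CONCURRENT_TASKS=3, `totalWorkers(8, 3)` gives the 24 workers quoted above.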

To increase throughput: Add more pods (up to 8), NOT more concurrent tasks per pod.

# CORRECT - scale pods (up to 8)
kubectl scale deployment/scraper-worker -n dispensary-scraper --replicas=8

# WRONG - will cause OOM crashes
kubectl set env deployment/scraper-worker -n dispensary-scraper MAX_CONCURRENT_TASKS=10

If K8s API returns ServiceUnavailable: STOP IMMEDIATELY. Do not retry. The cluster is overloaded.

8. K8S REQUIRES EXPLICIT PERMISSION

NEVER run kubectl commands without explicit user permission.

Before running ANY kubectl command (scale, rollout, set env, delete, apply, etc.):

  1. Tell the user what you want to do
  2. Wait for explicit approval
  3. Only then execute the command

This applies to ALL kubectl operations - even read-only ones like kubectl get pods.


Quick Reference

Database Tables

| USE THIS | NOT THIS |
|----------|----------|
| dispensaries | stores (empty) |
| store_products | products (empty) |
| store_product_snapshots | dutchie_product_snapshots |

Key Files

| Purpose | File |
|---------|------|
| Dutchie client | src/platforms/dutchie/client.ts |
| DB pool | src/db/pool.ts |
| Payload fetch | src/tasks/handlers/payload-fetch.ts |
| Product refresh | src/tasks/handlers/product-refresh.ts |

Dutchie GraphQL

  • Endpoint: https://dutchie.com/api-3/graphql
  • Hash (FilteredProducts): ee29c060826dc41c527e470e9ae502c9b2c169720faa0a9f5d25e1b9a530a4a0
  • CRITICAL: Use Status: 'Active' (not null)
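
As a rough illustration of how the endpoint, hash, and `Status: 'Active'` rule fit together, the sketch below builds a request URL in the Apollo persisted-query style. That request style, the `productsFilter` and `dispensaryId` names, and the function name are all assumptions made for the example; the real request shape lives in src/platforms/dutchie/client.ts.

```typescript
const DUTCHIE_ENDPOINT = "https://dutchie.com/api-3/graphql";
const FILTERED_PRODUCTS_HASH =
  "ee29c060826dc41c527e470e9ae502c9b2c169720faa0a9f5d25e1b9a530a4a0";

// Hypothetical helper: builds a GET URL for the FilteredProducts operation.
function buildFilteredProductsUrl(dispensaryId: string): string {
  // Assumption: variable names are illustrative. Status must be 'Active', not null.
  const variables = { productsFilter: { dispensaryId, Status: "Active" } };
  // Assumption: the hash is sent as an Apollo-style persisted query extension.
  const extensions = {
    persistedQuery: { version: 1, sha256Hash: FILTERED_PRODUCTS_HASH },
  };

  const url = new URL(DUTCHIE_ENDPOINT);
  url.searchParams.set("operationName", "FilteredProducts");
  url.searchParams.set("variables", JSON.stringify(variables));
  url.searchParams.set("extensions", JSON.stringify(extensions));
  return url.toString();
}
```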

Frontends

| Folder | Domain | Build |
|--------|--------|-------|
| cannaiq/ | cannaiq.co | Vite |
| findadispo/ | findadispo.com | CRA |
| findagram/ | findagram.co | CRA |
| frontend/ | DEPRECATED | - |

Deprecated Code

DO NOT USE anything in src/_deprecated/:

  • hydration/ - Use src/tasks/handlers/
  • scraper-v2/ - Use src/platforms/dutchie/
  • canonical-hydration/ - Merged into tasks

DO NOT USE src/dutchie-az/db/connection.ts - Use src/db/pool.ts


Local Development

./setup-local.sh   # Start all services
./stop-local.sh    # Stop all services

| Service | URL |
|---------|-----|
| API | http://localhost:3010 |
| Admin | http://localhost:8080/admin |
| PostgreSQL | localhost:54320 |

WordPress Plugin (ACTIVE)

Plugin Files

| File | Purpose |
|------|---------|
| wordpress-plugin/cannaiq-menus.php | Main plugin (CannaIQ brand) |
| wordpress-plugin/crawlsy-menus.php | Legacy plugin (Crawlsy brand) |
| wordpress-plugin/VERSION | Version tracking |

API Routes (Backend)

  • GET /api/v1/wordpress/dispensaries - List dispensaries
  • GET /api/v1/wordpress/dispensary/:id/menu - Get menu data
  • Route file: backend/src/routes/wordpress.ts

Versioning

Bump wordpress-plugin/VERSION on changes:

  • Minor (x.x.N): bug fixes
  • Middle (x.N.0): new features
  • Major (N.0.0): breaking changes (user must request)
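
The bump rules can be sketched as a small helper. This assumes the VERSION file holds a plain `x.y.z` string; the function name and the `BumpLevel` type are hypothetical and use the same Minor/Middle/Major naming as the list above.

```typescript
type BumpLevel = "minor" | "middle" | "major";

// Returns the next version string for the given bump level.
function bumpVersion(version: string, level: BumpLevel): string {
  const parts = version.trim().split(".").map(Number);
  if (parts.length !== 3 || parts.some((n) => !Number.isInteger(n) || n < 0)) {
    throw new Error(`Expected a version like 1.2.3, got "${version}"`);
  }
  const [major, middle, minor] = parts;
  switch (level) {
    case "minor": // x.x.N: bug fixes
      return `${major}.${middle}.${minor + 1}`;
    case "middle": // x.N.0: new features, last digit resets
      return `${major}.${middle + 1}.0`;
    case "major": // N.0.0: breaking changes, only on user request
      return `${major + 1}.0.0`;
  }
}
```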

Puppeteer Scraping (Browser-Based)

Age Gate Bypass

Most dispensary sites require age verification. The browser scraper handles this automatically:

Utility File: src/utils/age-gate.ts

Key Functions:

  • setAgeGateCookies(page, url, state) - Set cookies BEFORE navigation to prevent gate
  • hasAgeGate(page) - Detect if page shows age verification
  • bypassAgeGate(page, state) - Click through age gate if displayed
  • detectStateFromUrl(url) - Extract state from URL (e.g., -az- → Arizona)
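
To make the `-az-` → Arizona convention concrete, here is a minimal stand-alone sketch of URL-based state detection. It is not the implementation in src/utils/age-gate.ts; the lookup table is a small illustrative subset and the function name is hypothetical.

```typescript
// Assumption: illustrative subset of state codes, not the real table.
const STATE_BY_CODE: Record<string, string> = {
  az: "Arizona",
  ca: "California",
  co: "Colorado",
  mi: "Michigan",
};

// Scans hyphen-delimited two-letter tokens (e.g. "-az-") and returns the
// first one that maps to a known state, or null if none does.
function detectStateFromUrlSketch(url: string): string | null {
  const pattern = /-([a-z]{2})(?=-)/g; // lookahead keeps the trailing hyphen available
  const lower = url.toLowerCase();
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(lower)) !== null) {
    const state = STATE_BY_CODE[match[1]];
    if (state) return state;
  }
  return null;
}
```

Skipping unknown two-letter tokens avoids false positives on slugs such as `house-of-az-mesa`, where `-of-` appears before the state code.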

Cookie Names Set:

  • age_gate_passed: 'true'
  • selected_state: '<state>'
  • age_verified: 'true'
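
The cookie set above could be assembled along these lines. This is a sketch only: the `AgeGateCookie` shape and the function name are hypothetical, and scoping the cookies to the menu URL's hostname with path `/` is an assumption about what the real utility does.

```typescript
interface AgeGateCookie {
  name: string;
  value: string;
  domain: string;
  path: string;
}

// Builds the three age-gate cookies for the host of the given menu URL.
function buildAgeGateCookies(menuUrl: string, state: string): AgeGateCookie[] {
  const domain = new URL(menuUrl).hostname;
  const values: Array<[string, string]> = [
    ["age_gate_passed", "true"],
    ["selected_state", state],
    ["age_verified", "true"],
  ];
  return values.map(([name, value]) => ({ name, value, domain, path: "/" }));
}
```

Objects of this shape carry the fields Puppeteer's `page.setCookie(...)` accepts, so they would need to be set before `page.goto()` for the gate to be skipped.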

Bypass Methods (tried in order):

  1. Custom dropdown (shadcn/radix style) - Curaleaf pattern
  2. Standard <select> dropdown
  3. State button/card click
  4. Direct "Yes"/"Enter" button

Usage Pattern:

import { setAgeGateCookies, bypassAgeGate } from '../utils/age-gate';

// Set cookies BEFORE navigation
await setAgeGateCookies(page, menuUrl, 'Arizona');
await page.goto(menuUrl);

// If gate still appears, bypass it
await bypassAgeGate(page, 'Arizona');

Note: Deeply-Rooted (AZ) does NOT use an age gate, which makes it a good target for preflight testing.

Dual-Transport Preflight

Workers run BOTH preflight checks on startup:

| Transport | Test Method | Use Case |
|-----------|-------------|----------|
| curl | axios + proxy → httpbin.org | Fast API requests |
| http | Puppeteer + proxy + StealthPlugin | Anti-detect, browser fingerprint |

HTTP Preflight Steps:

  1. Get proxy from pool (CrawlRotator)
  2. Visit fingerprint.com demo (or amiunique.org fallback) to verify IP and anti-detect
  3. Visit Dutchie embedded menu to establish session
  4. Make GraphQL request from browser context

Files:

  • src/services/curl-preflight.ts
  • src/services/puppeteer-preflight.ts
  • migrations/084_dual_transport_preflight.sql

Task Method Column: Tasks have a method column ('curl' | 'http' | null):

  • null = any worker can claim
  • 'curl' = only workers with passed curl preflight
  • 'http' = only workers with passed http preflight

Currently ALL crawl tasks require method = 'http'.
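
The claim rule for the method column reduces to a small predicate. The sketch below is illustrative; the `PreflightStatus` shape and the function name are hypothetical, and the real check lives in the worker's claim logic.

```typescript
type TaskMethod = "curl" | "http" | null;

// Which preflight checks this worker has passed.
interface PreflightStatus {
  curl: boolean;
  http: boolean;
}

// A worker may claim a task if the task has no method requirement,
// or if the worker passed the preflight for the required transport.
function canClaim(method: TaskMethod, preflight: PreflightStatus): boolean {
  if (method === null) return true;
  return preflight[method];
}
```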

Anti-Detect Fingerprint Distribution

Browser fingerprints are randomized using realistic market share distributions:

Files:

  • src/services/crawl-rotator.ts - Device/browser selection
  • src/services/http-fingerprint.ts - HTTP header fingerprinting

Device Weights (matches real traffic patterns):

| Device | Weight | Percentage |
|--------|--------|------------|
| Mobile | 62 | 62% |
| Desktop | 36 | 36% |
| Tablet | 2 | 2% |
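
Weighted selection over these values can be sketched as follows. This is a generic cumulative-weight pick, not the code in src/services/crawl-rotator.ts; the names are hypothetical, and the random source is injectable so the behavior can be checked deterministically.

```typescript
type Device = "mobile" | "desktop" | "tablet";

const DEVICE_WEIGHTS: ReadonlyArray<[Device, number]> = [
  ["mobile", 62],
  ["desktop", 36],
  ["tablet", 2],
];

// Picks a device with probability proportional to its weight.
// rng must return a number in [0, 1), like Math.random.
function pickDevice(rng: () => number = Math.random): Device {
  const total = DEVICE_WEIGHTS.reduce((sum, [, weight]) => sum + weight, 0);
  let roll = rng() * total;
  for (const [device, weight] of DEVICE_WEIGHTS) {
    if (roll < weight) return device;
    roll -= weight;
  }
  // Only reachable through floating-point edge cases; fall back to the last entry.
  return DEVICE_WEIGHTS[DEVICE_WEIGHTS.length - 1][0];
}
```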

Allowed Browsers (only realistic ones):

  • Chrome (67% market share)
  • Safari (20% market share)
  • Edge (6% market share)
  • Firefox (3% market share)

All other browsers are filtered out. Uses intoli/user-agents library for realistic UA generation.

HTTP Header Fingerprinting:

  • DNT (Do Not Track): 30% probability of sending
  • Accept headers: Browser-specific variations
  • Header ordering: Matches real browser behavior (Chrome, Firefox, Safari, Edge each have unique order)

curl-impersonate Binaries (for curl transport):

| Browser | Binary |
|---------|--------|
| Chrome | curl_chrome131 |
| Edge | curl_chrome131 |
| Firefox | curl_ff133 |
| Safari | curl_safari17 |

These binaries mimic real browser TLS fingerprints to avoid detection.

Evomi Residential Proxy API

Workers use Evomi's residential proxy API for geo-targeted proxies on-demand.

Priority Order:

  1. Evomi API (if EVOMI_USER/EVOMI_PASS configured)
  2. DB proxies (fallback if Evomi not configured)

Environment Variables:

| Variable | Description | Default |
|----------|-------------|---------|
| EVOMI_USER | API username | - |
| EVOMI_PASS | API key | - |
| EVOMI_HOST | Proxy host | rpc.evomi.com |
| EVOMI_PORT | Proxy port | 1000 |

K8s Secret: Credentials stored in scraper-secrets:

kubectl get secret scraper-secrets -n dispensary-scraper -o jsonpath='{.data.EVOMI_PASS}' | base64 -d

Proxy URL Format: http://{user}_{session}_{geo}:{pass}@{host}:{port}

  • session: Worker ID for sticky sessions
  • geo: State code (e.g., arizona, california)
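
Putting the priority order, environment defaults, and URL format together gives roughly the following. This is a stand-alone sketch, not the `getEvomiConfig()` / `buildEvomiProxyUrl()` code in crawl-rotator.ts; the function names are hypothetical, and URL-encoding the credentials is an added precaution rather than something the guidelines specify.

```typescript
interface EvomiConfig {
  user: string;
  pass: string;
  host: string;
  port: number;
}

type Env = Record<string, string | undefined>;

// Returns the Evomi config, or null when credentials are missing
// (in which case the worker would fall back to DB proxies).
function readEvomiConfigSketch(env: Env): EvomiConfig | null {
  if (!env.EVOMI_USER || !env.EVOMI_PASS) return null;
  return {
    user: env.EVOMI_USER,
    pass: env.EVOMI_PASS,
    host: env.EVOMI_HOST || "rpc.evomi.com",
    port: Number(env.EVOMI_PORT || 1000),
  };
}

// Formats http://{user}_{session}_{geo}:{pass}@{host}:{port}.
function buildProxyUrlSketch(cfg: EvomiConfig, session: string, geo: string): string {
  const username = encodeURIComponent(`${cfg.user}_${session}_${geo}`);
  const password = encodeURIComponent(cfg.pass);
  return `http://${username}:${password}@${cfg.host}:${cfg.port}`;
}
```

Passing the worker ID as `session` is what keeps a worker pinned to the same exit IP across requests.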

Files:

  • src/services/crawl-rotator.ts - getEvomiConfig(), buildEvomiProxyUrl()
  • src/tasks/task-worker.ts - Proxy initialization order

Bulk Task Workflow (Updated 2025-12-13)

Overview

Tasks are created with scheduled_for = NOW() by default. Worker-level controls handle pacing - no task-level staggering needed.

How It Works

1. Task created with scheduled_for = NOW()
2. Worker claims task only when scheduled_for <= NOW()
3. Worker runs preflight on EVERY task claim (proxy health check)
4. If preflight passes, worker executes task
5. If preflight fails, task released back to pending for another worker
6. Worker finishes task, polls for next available task
7. Repeat - preflight runs on each new task claim
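
The claim-and-preflight cycle can be modeled as a single function. The sketch below is a simplified, synchronous model for illustration; the `Task` shape, status names, and function name are hypothetical, and the real worker in src/tasks/task-worker.ts works against the database and runs asynchronously.

```typescript
interface Task {
  id: number;
  scheduledFor: Date;
  status: "pending" | "running" | "completed";
}

type ClaimOutcome = "not_due" | "released" | "completed";

// Models one pass of the workflow for a single task.
function processClaim(
  task: Task,
  now: Date,
  preflightPasses: () => boolean,
  execute: (task: Task) => void
): ClaimOutcome {
  // Step 2: only claim when scheduled_for <= NOW().
  if (task.scheduledFor.getTime() > now.getTime()) return "not_due";
  task.status = "running";

  // Steps 3 and 5: preflight runs on every claim; on failure the task
  // goes back to pending so another worker can pick it up.
  if (!preflightPasses()) {
    task.status = "pending";
    return "released";
  }

  // Step 4: preflight passed, so run the task.
  execute(task);
  task.status = "completed";
  return "completed";
}
```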

Worker-Level Throttling

These controls pace task execution - no staggering at task creation time:

| Control | Purpose |
|---------|---------|
| MAX_CONCURRENT_TASKS | Limits concurrent tasks per pod (default: 3) |
| Working hours | Restricts when tasks run (configurable per schedule) |
| Preflight checks | Ensures proxy health before each task |
| Per-store locking | Only one active task per dispensary |

Key Points

  • Preflight is per-task, not per-startup: Each task claim triggers a new preflight check
  • Worker controls pacing: Tasks scheduled for NOW() but claimed based on worker capacity
  • Optional staggering: Pass stagger_seconds > 0 if you need explicit delays
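
For the optional staggering case, one plausible reading is that the i-th task in a batch is delayed by i × stagger_seconds. That spacing rule is an assumption, not confirmed by the guidelines; check `createStaggeredTasks()` in src/tasks/task-service.ts for the actual behavior. The function name below is hypothetical.

```typescript
// Computes scheduled_for timestamps for a batch of tasks.
// Assumption: task i is scheduled i * staggerSeconds after `now`.
function scheduleTimes(count: number, staggerSeconds: number, now: Date): Date[] {
  return Array.from(
    { length: count },
    (_, i) => new Date(now.getTime() + i * staggerSeconds * 1000)
  );
}
```

With `staggerSeconds = 0` (the default), every task gets `now`, which matches the "all NOW" behavior described above.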

API Endpoints

# Create bulk tasks for specific dispensary IDs
POST /api/tasks/batch/staggered
{
  "dispensary_ids": [1, 2, 3, 4],
  "role": "product_refresh",      # or "product_discovery"
  "stagger_seconds": 0,           # default: 0 (all NOW)
  "platform": "dutchie",          # default: "dutchie"
  "method": null                  # "curl" | "http" | null
}

# Create bulk tasks for all stores in a state
POST /api/tasks/crawl-state/:stateCode
{
  "stagger_seconds": 0,           # default: 0 (all NOW)
  "method": "http"                # default: "http"
}
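
A typed request builder makes the defaults for the batch endpoint explicit. This is a sketch: the interface and function names are hypothetical, the field names come from the example body above, and `method: null` mirrors that example rather than a documented default.

```typescript
type TaskRole = "product_refresh" | "product_discovery";

interface BatchStaggeredRequest {
  dispensary_ids: number[];
  role: TaskRole;
  stagger_seconds: number;
  platform: string;
  method: "curl" | "http" | null;
}

// Builds a body for POST /api/tasks/batch/staggered with the defaults applied.
function buildBatchRequest(
  dispensaryIds: number[],
  role: TaskRole,
  overrides: Partial<Omit<BatchStaggeredRequest, "dispensary_ids" | "role">> = {}
): BatchStaggeredRequest {
  return {
    dispensary_ids: dispensaryIds,
    role,
    stagger_seconds: 0,  // default: all tasks scheduled for NOW
    platform: "dutchie", // default platform
    method: null,        // assumption: null, as in the example body
    ...overrides,
  };
}
```

The result can be passed through `JSON.stringify` and sent as the POST body. Unlike the annotated example above, the serialized body contains no `#` comments, so it is valid JSON.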

Example: Tasks for AZ Stores

curl -X POST http://localhost:3010/api/tasks/crawl-state/AZ \
  -H "Content-Type: application/json"

| File | Purpose |
|------|---------|
| src/tasks/task-service.ts | createStaggeredTasks() method |
| src/routes/tasks.ts | API endpoints for batch task creation |
| src/tasks/task-worker.ts | Worker task claiming and preflight logic |

Documentation

| Doc | Purpose |
|-----|---------|
| backend/docs/CODEBASE_MAP.md | Current files/directories |
| backend/docs/_archive/ | Historical docs (may be outdated) |