
Kelly 92f88fdcd6 fix(workers): Increase max concurrent tasks to 15 and add K8s permission rule

- Change MAX_CONCURRENT_TASKS default from 3 to 15
- Add CLAUDE.md rule requiring explicit permission before kubectl commands

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

2025-12-12 10:54:33 -07:00

9.6 KiB


Claude Guidelines for CannaiQ

PERMANENT RULES (NEVER VIOLATE)

1. NO DELETE

Never delete data, files, images, logs, or database rows. CannaiQ is a historical analytics system.

2. NO KILL

Never run pkill, kill, killall, or similar. Say "Please run ./stop-local.sh" instead.

3. NO MANUAL STARTUP

Never start servers manually. Say "Please run ./setup-local.sh" instead.

4. DEPLOYMENT AUTH REQUIRED

Never deploy unless user explicitly says: "CLAUDE — DEPLOYMENT IS NOW AUTHORIZED."

5. DB POOL ONLY

Never import src/db/migrate.ts at runtime. Use src/db/pool.ts for DB access.

6. K8S POD LIMITS — CRITICAL

MAX 8 PODS for scraper-worker deployment. NEVER EXCEED THIS.

Pods vs Workers:

Pod = Kubernetes container instance (MAX 8)
Worker = Concurrent task runner INSIDE a pod (controlled by MAX_CONCURRENT_TASKS env var)
Formula: 8 pods × MAX_CONCURRENT_TASKS = total concurrent workers

To increase workers: Change MAX_CONCURRENT_TASKS env var, NOT replicas.

# CORRECT - increase workers per pod
kubectl set env deployment/scraper-worker -n dispensary-scraper MAX_CONCURRENT_TASKS=5

# WRONG - never scale above 8 replicas
kubectl scale deployment/scraper-worker --replicas=20  # NEVER DO THIS

If K8s API returns ServiceUnavailable: STOP IMMEDIATELY. Do not retry. The cluster is overloaded.
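
As a quick check on the pod/worker formula above, here is a minimal TypeScript sketch. The names `MAX_PODS` and `totalConcurrentWorkers` are illustrative and do not come from the codebase.

```typescript
// Illustrative only: these names are not from the CannaiQ codebase.
const MAX_PODS = 8; // hard cap on scraper-worker replicas (rule 6)

// Total concurrent workers = pods × MAX_CONCURRENT_TASKS.
function totalConcurrentWorkers(pods: number, maxConcurrentTasks: number): number {
  if (Math.floor(pods) !== pods || pods < 0 || pods > MAX_PODS) {
    throw new RangeError("pods must be an integer between 0 and " + MAX_PODS);
  }
  if (Math.floor(maxConcurrentTasks) !== maxConcurrentTasks || maxConcurrentTasks < 1) {
    throw new RangeError("maxConcurrentTasks must be a positive integer");
  }
  return pods * maxConcurrentTasks;
}

// 8 pods with MAX_CONCURRENT_TASKS=5 gives 40 concurrent workers.
const workers = totalConcurrentWorkers(8, 5);
```

Capacity grows through the second argument only; asking for more than 8 pods throws instead of returning a number.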

7. K8S REQUIRES EXPLICIT PERMISSION

NEVER run kubectl commands without explicit user permission.

Before running ANY kubectl command (scale, rollout, set env, delete, apply, etc.):

1. Tell the user what you want to do
2. Wait for explicit approval
3. Only then execute the command

This applies to ALL kubectl operations - even read-only ones like kubectl get pods.

Quick Reference

Database Tables

USE THIS	NOT THIS
`dispensaries`	`stores` (empty)
`store_products`	`products` (empty)
`store_product_snapshots`	`dutchie_product_snapshots`

Key Files

Purpose	File
Dutchie client	`src/platforms/dutchie/client.ts`
DB pool	`src/db/pool.ts`
Payload fetch	`src/tasks/handlers/payload-fetch.ts`
Product refresh	`src/tasks/handlers/product-refresh.ts`

Dutchie GraphQL

Endpoint: https://dutchie.com/api-3/graphql
Hash (FilteredProducts): ee29c060826dc41c527e470e9ae502c9b2c169720faa0a9f5d25e1b9a530a4a0
CRITICAL: Use Status: 'Active' (not null)
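
The hash above suggests a persisted-query style request. The sketch below assumes the common Apollo convention of an `extensions.persistedQuery` object with `version` and `sha256Hash`, sent as GET query parameters. The variable layout (`productsFilter`, `dispensaryId`) and the function name are hypothetical; check `src/platforms/dutchie/client.ts` for the real request shape.

```typescript
const DUTCHIE_GRAPHQL_ENDPOINT = "https://dutchie.com/api-3/graphql";
const FILTERED_PRODUCTS_HASH =
  "ee29c060826dc41c527e470e9ae502c9b2c169720faa0a9f5d25e1b9a530a4a0";

// ASSUMPTION: the `productsFilter` / `dispensaryId` variable layout is
// hypothetical. Only the endpoint, the hash, and Status: 'Active' come
// from this document.
function buildFilteredProductsUrl(dispensaryId: string): string {
  const variables = {
    productsFilter: { dispensaryId: dispensaryId, Status: "Active" }, // never null
  };
  const extensions = {
    persistedQuery: { version: 1, sha256Hash: FILTERED_PRODUCTS_HASH },
  };
  return (
    DUTCHIE_GRAPHQL_ENDPOINT +
    "?operationName=FilteredProducts" +
    "&variables=" + encodeURIComponent(JSON.stringify(variables)) +
    "&extensions=" + encodeURIComponent(JSON.stringify(extensions))
  );
}
```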

Frontends

Folder	Domain	Build
`cannaiq/`	cannaiq.co	Vite
`findadispo/`	findadispo.com	CRA
`findagram/`	findagram.co	CRA
`frontend/`	DEPRECATED	-

Deprecated Code

DO NOT USE anything in src/_deprecated/:

hydration/ - Use src/tasks/handlers/
scraper-v2/ - Use src/platforms/dutchie/
canonical-hydration/ - Merged into tasks

DO NOT USE src/dutchie-az/db/connection.ts - Use src/db/pool.ts

Local Development

./setup-local.sh   # Start all services
./stop-local.sh    # Stop all services

Service	URL
API	http://localhost:3010
Admin	http://localhost:8080/admin
PostgreSQL	localhost:54320

WordPress Plugin (ACTIVE)

Plugin Files

File	Purpose
`wordpress-plugin/cannaiq-menus.php`	Main plugin (CannaIQ brand)
`wordpress-plugin/crawlsy-menus.php`	Legacy plugin (Crawlsy brand)
`wordpress-plugin/VERSION`	Version tracking

API Routes (Backend)

GET /api/v1/wordpress/dispensaries - List dispensaries
GET /api/v1/wordpress/dispensary/:id/menu - Get menu data
Route file: backend/src/routes/wordpress.ts

Versioning

Bump wordpress-plugin/VERSION on changes:

Minor (x.x.N): bug fixes
Middle (x.N.0): new features
Major (N.0.0): breaking changes (user must request)
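
The bump rules above can be sketched as a small helper. `bumpVersion` and `BumpKind` are hypothetical names, and the sketch assumes `wordpress-plugin/VERSION` holds a plain `N.N.N` string.

```typescript
type BumpKind = "minor" | "middle" | "major";

// Hypothetical helper: applies the x.x.N / x.N.0 / N.0.0 rules to a version string.
function bumpVersion(version: string, kind: BumpKind): string {
  const match = /^(\d+)\.(\d+)\.(\d+)$/.exec(version.trim());
  if (match === null) {
    throw new Error("expected a version like 1.2.3, got: " + version);
  }
  const major = parseInt(match[1], 10);
  const middle = parseInt(match[2], 10);
  const minor = parseInt(match[3], 10);

  if (kind === "major") return (major + 1) + ".0.0";           // breaking change
  if (kind === "middle") return major + "." + (middle + 1) + ".0"; // new feature
  return major + "." + middle + "." + (minor + 1);             // bug fix
}
```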

Puppeteer Scraping (Browser-Based)

Age Gate Bypass

Most dispensary sites require age verification. The browser scraper handles this automatically:

Utility File: src/utils/age-gate.ts

Key Functions:

setAgeGateCookies(page, url, state) - Set cookies BEFORE navigation to prevent gate
hasAgeGate(page) - Detect if page shows age verification
bypassAgeGate(page, state) - Click through age gate if displayed
detectStateFromUrl(url) - Extract state from URL (e.g., -az- → Arizona)
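
To illustrate the `-az-` → Arizona example, here is a rough sketch of how URL-based state detection might work. It is not the implementation in `src/utils/age-gate.ts`; the function name, the lookup table, and the two-letter matching rule are assumptions.

```typescript
// ASSUMPTION: this table is illustrative. Only "az" → Arizona is taken
// from this document; the other entries are ordinary state abbreviations.
const STATE_BY_CODE: Record<string, string> = {
  az: "Arizona",
  ca: "California",
  co: "Colorado",
};

// Hypothetical sketch: find the first hyphen-delimited two-letter segment
// (such as "-az-") that maps to a known state.
function detectStateFromUrlSketch(url: string): string | null {
  const pattern = /-([a-z]{2})(?=-)/g; // lookahead keeps the trailing hyphen available
  const lower = url.toLowerCase();
  let match = pattern.exec(lower);
  while (match !== null) {
    const state = STATE_BY_CODE[match[1]];
    if (state !== undefined) return state;
    match = pattern.exec(lower);
  }
  return null;
}
```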

Cookie Names Set:

age_gate_passed: 'true'
selected_state: '<state>'
age_verified: 'true'
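
For reference, the three cookies above could be assembled as follows. The `name`/`value`/`url` object shape matches what Puppeteer's `page.setCookie` accepts, but whether `setAgeGateCookies` builds them this way is an assumption, as is the helper name `buildAgeGateCookies`.

```typescript
interface AgeGateCookie {
  name: string;
  value: string;
  url: string; // scopes the cookie to the target site
}

// Hypothetical helper: returns the three cookies listed above for a given URL and state.
function buildAgeGateCookies(url: string, state: string): AgeGateCookie[] {
  return [
    { name: "age_gate_passed", value: "true", url: url },
    { name: "selected_state", value: state, url: url },
    { name: "age_verified", value: "true", url: url },
  ];
}
```

With a Puppeteer version that exposes `page.setCookie`, the result could be applied with `await page.setCookie(...buildAgeGateCookies(menuUrl, 'Arizona'))` before `page.goto(menuUrl)`.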

Bypass Methods (tried in order):

1. Custom dropdown (shadcn/radix style) - Curaleaf pattern
2. Standard <select> dropdown
3. State button/card click
4. Direct "Yes"/"Enter" button

Usage Pattern:

import { setAgeGateCookies, bypassAgeGate } from '../utils/age-gate';

// Set cookies BEFORE navigation
await setAgeGateCookies(page, menuUrl, 'Arizona');
await page.goto(menuUrl);

// If gate still appears, bypass it
await bypassAgeGate(page, 'Arizona');

Note: Deeply-Rooted (AZ) does NOT use an age gate, which makes it a good target for preflight testing.

Dual-Transport Preflight

Workers run BOTH preflight checks on startup:

Transport	Test Method	Use Case
`curl`	axios + proxy → httpbin.org	Fast API requests
`http`	Puppeteer + proxy + StealthPlugin	Anti-detect, browser fingerprint

HTTP Preflight Steps:

1. Get proxy from pool (CrawlRotator)
2. Visit fingerprint.com demo (or amiunique.org fallback) to verify IP and anti-detect
3. Visit Dutchie embedded menu to establish session
4. Make GraphQL request from browser context

Files:

src/services/curl-preflight.ts
src/services/puppeteer-preflight.ts
migrations/084_dual_transport_preflight.sql

Task Method Column: Tasks have a method column ('curl' | 'http' | null):

null = any worker can claim
'curl' = only workers with passed curl preflight
'http' = only workers with passed http preflight

Currently ALL crawl tasks require method = 'http'.
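
The claim rules for the method column can be expressed as a small predicate. This is a sketch only; `canClaim` and `WorkerPreflight` are hypothetical names, and the real check lives in the worker and task service code.

```typescript
type TaskMethod = "curl" | "http" | null;

// Hypothetical shape: which preflights this worker has passed.
interface WorkerPreflight {
  curlPassed: boolean;
  httpPassed: boolean;
}

// A worker may claim a task if the task has no method requirement,
// or if the worker passed the matching preflight.
function canClaim(method: TaskMethod, worker: WorkerPreflight): boolean {
  if (method === null) return true;
  if (method === "curl") return worker.curlPassed;
  return worker.httpPassed; // method === "http"
}
```

Because all crawl tasks currently use `method = 'http'`, a worker whose http preflight failed cannot claim any of them under this rule.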

Anti-Detect Fingerprint Distribution

Browser fingerprints are randomized using realistic market share distributions:

Files:

src/services/crawl-rotator.ts - Device/browser selection
src/services/http-fingerprint.ts - HTTP header fingerprinting

Device Weights (matches real traffic patterns):

Device	Weight	Percentage
Mobile	62	62%
Desktop	36	36%
Tablet	2	2%

Allowed Browsers (only realistic ones):

Chrome (67% market share)
Safari (20% market share)
Edge (6% market share)
Firefox (3% market share)

All other browsers are filtered out. Uses intoli/user-agents library for realistic UA generation.
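
Weighted selection of this kind is usually a cumulative-weight draw. The sketch below shows the idea with the weights listed above; it is not the code in `src/services/crawl-rotator.ts`, and `pickWeighted` and `WeightedOption` are hypothetical names. The random source is injectable so the behaviour can be checked deterministically.

```typescript
interface WeightedOption {
  name: string;
  weight: number;
}

const DEVICE_WEIGHTS: WeightedOption[] = [
  { name: "mobile", weight: 62 },
  { name: "desktop", weight: 36 },
  { name: "tablet", weight: 2 },
];

// The listed browser shares sum to 96, so the draw normalizes by the total.
const BROWSER_WEIGHTS: WeightedOption[] = [
  { name: "Chrome", weight: 67 },
  { name: "Safari", weight: 20 },
  { name: "Edge", weight: 6 },
  { name: "Firefox", weight: 3 },
];

// Hypothetical helper: pick one option with probability weight / total.
function pickWeighted(options: WeightedOption[], random: () => number = Math.random): string {
  let total = 0;
  for (let i = 0; i < options.length; i++) total += options[i].weight;
  if (options.length === 0 || total <= 0) {
    throw new Error("options must be non-empty with a positive total weight");
  }
  let roll = random() * total; // in [0, total)
  for (let i = 0; i < options.length; i++) {
    roll -= options[i].weight;
    if (roll < 0) return options[i].name;
  }
  return options[options.length - 1].name; // guards against float rounding
}
```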

HTTP Header Fingerprinting:

DNT (Do Not Track): 30% probability of sending
Accept headers: Browser-specific variations
Header ordering: Matches real browser behavior (Chrome, Firefox, Safari, Edge each have unique order)

curl-impersonate Binaries (for curl transport):

Browser	Binary
Chrome	`curl_chrome131`
Edge	`curl_chrome131`
Firefox	`curl_ff133`
Safari	`curl_safari17`

These binaries mimic real browser TLS fingerprints to avoid detection.
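
The table above amounts to a lookup from browser to binary name. A minimal sketch follows; the constant and function names are hypothetical.

```typescript
type Browser = "Chrome" | "Edge" | "Firefox" | "Safari";

// Mirrors the table above. Edge reuses the Chrome binary.
const CURL_IMPERSONATE_BINARY: Record<Browser, string> = {
  Chrome: "curl_chrome131",
  Edge: "curl_chrome131",
  Firefox: "curl_ff133",
  Safari: "curl_safari17",
};

// Hypothetical helper: resolve the curl-impersonate binary for a browser.
function binaryFor(browser: Browser): string {
  return CURL_IMPERSONATE_BINARY[browser];
}
```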

Staggered Task Workflow (Added 2025-12-12)

Overview

When creating many tasks at once (e.g., product refresh for all AZ stores), staggered scheduling prevents resource contention, proxy assignment lag, and API rate limiting.

How It Works

1. Task created with scheduled_for = NOW() + (index * stagger_seconds)
2. Worker claims task only when scheduled_for <= NOW()
3. Worker runs preflight on EVERY task claim (proxy health check)
4. If preflight passes, worker executes task
5. If preflight fails, task released back to pending for another worker
6. Worker finishes task, polls for next available task
7. Repeat - preflight runs on each new task claim
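
Steps 1 and 2 can be sketched in a few lines. The names below (`buildStaggerSchedule`, `isClaimable`, `StaggeredTask`) are hypothetical; the real logic is in the task service and worker.

```typescript
interface StaggeredTask {
  index: number;
  scheduledFor: Date;
}

// Step 1: scheduled_for = NOW() + (index * stagger_seconds)
function buildStaggerSchedule(now: Date, count: number, staggerSeconds: number = 15): StaggeredTask[] {
  const tasks: StaggeredTask[] = [];
  for (let index = 0; index < count; index++) {
    tasks.push({
      index: index,
      scheduledFor: new Date(now.getTime() + index * staggerSeconds * 1000),
    });
  }
  return tasks;
}

// Step 2: a worker may claim a task only when scheduled_for <= NOW()
function isClaimable(task: StaggeredTask, now: Date): boolean {
  return task.scheduledFor.getTime() <= now.getTime();
}
```

With 24 tasks and a 15-second stagger, the last task is scheduled (24 - 1) × 15 = 345 seconds after the first.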

Key Points

Preflight is per-task, not per-startup: Each task claim triggers a new preflight check
Stagger prevents a thundering herd: the default is 15 seconds between tasks
Task assignment is the trigger: Worker picks up task → runs preflight → executes if passed

API Endpoints

# Create staggered tasks for specific dispensary IDs
POST /api/tasks/batch/staggered
{
  "dispensary_ids": [1, 2, 3, 4],
  "role": "product_refresh",      # or "product_discovery"
  "stagger_seconds": 15,          # default: 15
  "platform": "dutchie",          # default: "dutchie"
  "method": null                  # "curl" | "http" | null
}

# Create staggered tasks for AZ stores (convenience endpoint)
POST /api/tasks/batch/az-stores
{
  "total_tasks": 24,              # default: 24
  "stagger_seconds": 15,          # default: 15
  "split_roles": true             # default: true (12 refresh, 12 discovery)
}
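
For callers written in TypeScript, the request body for `POST /api/tasks/batch/staggered` could be typed roughly as below. The interface and helper names are hypothetical. Treating `role` as required and `method` as defaulting to `null` are assumptions, since the listing above states defaults only for `stagger_seconds` and `platform`.

```typescript
// Hypothetical request type for POST /api/tasks/batch/staggered.
interface StaggeredBatchRequest {
  dispensary_ids: number[];
  role: "product_refresh" | "product_discovery";
  stagger_seconds?: number;        // default: 15
  platform?: string;               // default: "dutchie"
  method?: "curl" | "http" | null; // ASSUMPTION: defaults to null
}

// Fill in the documented defaults so the payload is fully explicit.
function withStaggerDefaults(request: StaggeredBatchRequest): Required<StaggeredBatchRequest> {
  return {
    dispensary_ids: request.dispensary_ids,
    role: request.role,
    stagger_seconds: request.stagger_seconds === undefined ? 15 : request.stagger_seconds,
    platform: request.platform === undefined ? "dutchie" : request.platform,
    method: request.method === undefined ? null : request.method,
  };
}
```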

Example: 24 Tasks for AZ Stores

curl -X POST http://localhost:3010/api/tasks/batch/az-stores \
  -H "Content-Type: application/json" \
  -d '{"total_tasks": 24, "stagger_seconds": 15, "split_roles": true}'

Response:

{
  "success": true,
  "total": 24,
  "product_refresh": 12,
  "product_discovery": 12,
  "stagger_seconds": 15,
  "total_duration_seconds": 345,
  "estimated_completion": "2025-12-12T08:40:00.000Z",
  "message": "Created 24 staggered tasks for AZ stores (12 refresh, 12 discovery)"
}

File	Purpose
`src/tasks/task-service.ts`	`createStaggeredTasks()` and `createAZStoreTasks()` methods
`src/routes/tasks.ts`	API endpoints for batch task creation
`src/tasks/task-worker.ts`	Worker task claiming and preflight logic

Documentation

Doc	Purpose
`backend/docs/CODEBASE_MAP.md`	Current files/directories
`backend/docs/_archive/`	Historical docs (may be outdated)
