Claude Guidelines for CannaiQ

CURRENT ENVIRONMENT: PRODUCTION

We are working in PRODUCTION only. All database queries and API calls should target the remote production environment, not localhost. Use kubectl port-forward or remote DB connections as needed.

PERMANENT RULES (NEVER VIOLATE)

1. NO DELETE

Never delete data, files, images, logs, or database rows. CannaiQ is a historical analytics system.

2. NO KILL

Never run pkill, kill, killall, or similar. Say "Please run ./stop-local.sh" instead.

3. NO MANUAL STARTUP

Never start servers manually. Say "Please run ./setup-local.sh" instead.

4. DEPLOYMENT AUTH REQUIRED

Never deploy unless user explicitly says: "CLAUDE — DEPLOYMENT IS NOW AUTHORIZED."

5. DB POOL ONLY

Never import src/db/migrate.ts at runtime. Use src/db/pool.ts for DB access.

6. CI/CD DEPLOYMENT — BATCH CHANGES, PUSH ONCE

Never manually deploy or check deployment status. The project uses Woodpecker CI.

CRITICAL: Each CI build takes 30 minutes. NEVER push incrementally.

Workflow:

  1. Make ALL related code changes first
  2. Test locally if possible (./setup-local.sh)
  3. ONE commit with all changes
  4. ONE push to master
  5. STOP - CI handles the rest
  6. Wait for user to confirm deployment worked

DO NOT:

  • Push multiple small commits (each triggers 30-min build)
  • Run kubectl rollout status to check deployment
  • Run kubectl logs to verify new code is running
  • Manually restart pods
  • Check CI pipeline status

Batch everything, push once, wait for user feedback.

7. K8S POD LIMITS — CRITICAL

EXACTLY 8 PODS for scraper-worker deployment. NEVER CHANGE THIS.

Replica Count is LOCKED:

  • Always 8 replicas — no more, no less
  • NEVER scale down (even temporarily)
  • NEVER scale up beyond 8
  • If pods are not 8, restore to 8 immediately

Pods vs Workers:

  • Pod = Kubernetes container instance (ALWAYS 8)
  • Worker = Concurrent task runner INSIDE a pod (controlled by MAX_CONCURRENT_TASKS env var)
  • Formula: 8 pods × MAX_CONCURRENT_TASKS (default 3) = 24 total concurrent workers

Browser Task Memory Limits:

  • Each Puppeteer/Chrome browser uses ~400 MB RAM
  • Pod memory limit is 2 GB
  • MAX_CONCURRENT_TASKS=3 is the safe maximum for browser tasks
  • More than 3 concurrent browsers per pod = OOM crash

| Browsers | RAM Used | Status |
|----------|----------|--------|
| 3 | ~1.3 GB | Safe (recommended) |
| 4 | ~1.7 GB | Risky |
| 5+ | >2 GB | OOM crash |

To increase throughput: Add more pods (up to 8), NOT more concurrent tasks per pod.

```bash
# CORRECT - scale pods (up to 8)
kubectl scale deployment/scraper-worker -n cannaiq --replicas=8

# WRONG - will cause OOM crashes
kubectl set env deployment/scraper-worker -n cannaiq MAX_CONCURRENT_TASKS=10
```

If K8s API returns ServiceUnavailable: STOP IMMEDIATELY. Do not retry. The cluster is overloaded.

8. K8S REQUIRES EXPLICIT PERMISSION

NEVER run kubectl commands without explicit user permission.

Before running ANY kubectl command (scale, rollout, set env, delete, apply, etc.):

  1. Tell the user what you want to do
  2. Wait for explicit approval
  3. Only then execute the command

This applies to ALL kubectl operations - even read-only ones like kubectl get pods.


Quick Reference

Database Tables

| USE THIS | NOT THIS |
|----------|----------|
| dispensaries | stores (empty) |
| store_products | products (empty) |
| store_product_snapshots | dutchie_product_snapshots |

Key Files

| Purpose | File |
|---------|------|
| Dutchie client | src/platforms/dutchie/client.ts |
| DB pool | src/db/pool.ts |
| Payload fetch | src/tasks/handlers/payload-fetch.ts |
| Product refresh | src/tasks/handlers/product-refresh.ts |

Dutchie GraphQL

  • Endpoint: https://dutchie.com/api-3/graphql
  • Hash (FilteredProducts): ee29c060826dc41c527e470e9ae502c9b2c169720faa0a9f5d25e1b9a530a4a0
  • CRITICAL: Use Status: 'Active' (not null)

Frontends

| Folder | Domain | Build |
|--------|--------|-------|
| cannaiq/ | cannaiq.co | Vite |
| findadispo/ | findadispo.com | CRA |
| findagram/ | findagram.co | CRA |
| frontend/ | DEPRECATED | - |

Deprecated Code

DO NOT USE anything in src/_deprecated/:

  • hydration/ - Use src/tasks/handlers/
  • scraper-v2/ - Use src/platforms/dutchie/
  • canonical-hydration/ - Merged into tasks

DO NOT USE src/dutchie-az/db/connection.ts - Use src/db/pool.ts


Local Development

```bash
./setup-local.sh   # Start all services
./stop-local.sh    # Stop all services
```

| Service | URL |
|---------|-----|
| API | http://localhost:3010 |
| Admin | http://localhost:8080/admin |
| PostgreSQL | localhost:54320 |

WordPress Plugin (ACTIVE)

Plugin Files

| File | Purpose |
|------|---------|
| wordpress-plugin/cannaiq-menus.php | Main plugin (CannaIQ brand) |
| wordpress-plugin/crawlsy-menus.php | Legacy plugin (Crawlsy brand) |
| wordpress-plugin/VERSION | Version tracking |

API Routes (Backend)

  • GET /api/v1/wordpress/dispensaries - List dispensaries
  • GET /api/v1/wordpress/dispensary/:id/menu - Get menu data
  • Route file: backend/src/routes/wordpress.ts

Versioning

Bump wordpress-plugin/VERSION on changes:

  • Minor (x.x.N): bug fixes
  • Middle (x.N.0): new features
  • Major (N.0.0): breaking changes (user must request)
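
The bump rules above can be sketched as a small helper. This is purely illustrative; no such utility exists in the repo, and the function and type names are assumptions:

```typescript
// Hypothetical helper mirroring the bump rules above: third number for
// bug fixes, middle number for new features, first number for breaking
// changes (which the user must explicitly request).
type BumpKind = "fix" | "feature" | "breaking";

function bumpVersion(version: string, kind: BumpKind): string {
  const [major, minor, patch] = version.split(".").map(Number);
  if (kind === "fix") return `${major}.${minor}.${patch + 1}`;
  if (kind === "feature") return `${major}.${minor + 1}.0`;
  return `${major + 1}.0.0`;
}
```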

Puppeteer Scraping (Browser-Based)

Age Gate Bypass

Most dispensary sites require age verification. The browser scraper handles this automatically:

Utility File: src/utils/age-gate.ts

Key Functions:

  • setAgeGateCookies(page, url, state) - Set cookies BEFORE navigation to prevent gate
  • hasAgeGate(page) - Detect if page shows age verification
  • bypassAgeGate(page, state) - Click through age gate if displayed
  • detectStateFromUrl(url) - Extract state from URL (e.g., -az- → Arizona)
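
As a rough illustration of the state-detection rule, here is a minimal sketch. The state map and regex are assumptions for the example, not the actual contents of src/utils/age-gate.ts:

```typescript
// Minimal sketch of "-az-" style state detection: look for a
// hyphen-delimited two-letter code in the URL slug. The state map and
// regex here are illustrative assumptions, not the real age-gate.ts code.
const STATE_CODES: Record<string, string> = {
  az: "Arizona",
  ca: "California",
  nv: "Nevada",
};

function detectStateFromUrl(url: string): string | null {
  // Matches segments like "-az-", "-az/" or a trailing "-az".
  const match = url.toLowerCase().match(/-([a-z]{2})(?:-|\/|$)/);
  return match ? STATE_CODES[match[1]] ?? null : null;
}
```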

Cookie Names Set:

  • age_gate_passed: 'true'
  • selected_state: '<state>'
  • age_verified: 'true'

Bypass Methods (tried in order):

  1. Custom dropdown (shadcn/radix style) - Curaleaf pattern
  2. Standard <select> dropdown
  3. State button/card click
  4. Direct "Yes"/"Enter" button

Usage Pattern:

```typescript
import { setAgeGateCookies, bypassAgeGate } from '../utils/age-gate';

// Set cookies BEFORE navigation
await setAgeGateCookies(page, menuUrl, 'Arizona');
await page.goto(menuUrl);

// If gate still appears, bypass it
await bypassAgeGate(page, 'Arizona');
```

Note: Deeply-Rooted (AZ) does NOT use age gate - good for preflight testing.

Dual-Transport Preflight

Workers run BOTH preflight checks on startup:

| Transport | Test Method | Use Case |
|-----------|-------------|----------|
| curl | axios + proxy → httpbin.org | Fast API requests |
| http | Puppeteer + proxy + StealthPlugin | Anti-detect, browser fingerprint |

HTTP Preflight Steps:

  1. Get proxy from pool (CrawlRotator)
  2. Visit fingerprint.com demo (or amiunique.org fallback) to verify IP and anti-detect
  3. Visit Dutchie embedded menu to establish session
  4. Make GraphQL request from browser context

Files:

  • src/services/curl-preflight.ts
  • src/services/puppeteer-preflight.ts
  • migrations/084_dual_transport_preflight.sql

Task Method Column: Tasks have method column ('curl' | 'http' | null):

  • null = any worker can claim
  • 'curl' = only workers with passed curl preflight
  • 'http' = only workers with passed http preflight

Currently ALL crawl tasks require method = 'http'.
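
The claiming rule can be expressed as a predicate. This is a sketch; the actual check lives in the worker/SQL code in src/tasks/task-worker.ts, and the names here are assumptions:

```typescript
// Sketch of the method-matching rule above: tasks with method = null are
// open to any worker; otherwise the worker must have passed the matching
// preflight ("curl" or "http").
type TransportMethod = "curl" | "http";

function canClaim(
  passedPreflights: TransportMethod[],
  taskMethod: TransportMethod | null,
): boolean {
  return taskMethod === null || passedPreflights.includes(taskMethod);
}
```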

Anti-Detect Fingerprint Distribution

Browser fingerprints are randomized using realistic market share distributions:

Files:

  • src/services/crawl-rotator.ts - Device/browser selection
  • src/services/http-fingerprint.ts - HTTP header fingerprinting

Device Weights (matches real traffic patterns):

| Device | Weight | Percentage |
|--------|--------|------------|
| Mobile | 62 | 62% |
| Desktop | 36 | 36% |
| Tablet | 2 | 2% |

Allowed Browsers (only realistic ones):

  • Chrome (67% market share)
  • Safari (20% market share)
  • Edge (6% market share)
  • Firefox (3% market share)

All other browsers are filtered out. Uses intoli/user-agents library for realistic UA generation.
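
Weighted selection over a distribution like the device table above can be sketched generically. The real selection logic in src/services/crawl-rotator.ts may differ; the injectable random source here is only for testability:

```typescript
// Generic weighted pick: roll a number in [0, totalWeight) and walk the
// entries until the roll is exhausted. Illustrative sketch only.
function pickWeighted<T>(
  entries: Array<[T, number]>,
  rand: () => number = Math.random,
): T {
  const total = entries.reduce((sum, [, w]) => sum + w, 0);
  let roll = rand() * total;
  for (const [value, weight] of entries) {
    roll -= weight;
    if (roll < 0) return value;
  }
  // Floating-point edge case: fall back to the last entry.
  return entries[entries.length - 1][0];
}

// Device distribution from the table above.
const DEVICE_WEIGHTS: Array<[string, number]> = [
  ["mobile", 62],
  ["desktop", 36],
  ["tablet", 2],
];
```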

HTTP Header Fingerprinting:

  • DNT (Do Not Track): 30% probability of sending
  • Accept headers: Browser-specific variations
  • Header ordering: Matches real browser behavior (Chrome, Firefox, Safari, Edge each have unique order)

curl-impersonate Binaries (for curl transport):

| Browser | Binary |
|---------|--------|
| Chrome | curl_chrome131 |
| Edge | curl_chrome131 |
| Firefox | curl_ff133 |
| Safari | curl_safari17 |

These binaries mimic real browser TLS fingerprints to avoid detection.

Evomi Residential Proxy API

Workers use Evomi's residential proxy API for geo-targeted proxies on-demand.

Priority Order:

  1. Evomi API (if EVOMI_USER/EVOMI_PASS configured)
  2. DB proxies (fallback if Evomi not configured)

Environment Variables:

| Variable | Description | Default |
|----------|-------------|---------|
| EVOMI_USER | API username | - |
| EVOMI_PASS | API key | - |
| EVOMI_HOST | Proxy host | rpc.evomi.com |
| EVOMI_PORT | Proxy port | 1000 |

K8s Secret: Credentials stored in scraper-secrets:

```bash
kubectl get secret scraper-secrets -n cannaiq -o jsonpath='{.data.EVOMI_PASS}' | base64 -d
```

Proxy URL Format: http://{user}_{session}_{geo}:{pass}@{host}:{port}

  • session: Worker ID for sticky sessions
  • geo: State code (e.g., arizona, california)
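
A sketch of the URL format above (the field and function names are assumptions; the real builder is buildEvomiProxyUrl() in src/services/crawl-rotator.ts):

```typescript
// Builds the documented proxy URL:
// http://{user}_{session}_{geo}:{pass}@{host}:{port}
interface EvomiConfig {
  user: string;
  pass: string;
  host: string; // default rpc.evomi.com
  port: number; // default 1000
}

function buildProxyUrl(
  cfg: EvomiConfig,
  sessionId: string, // worker ID, for sticky sessions
  geo: string,       // state code, e.g. "arizona"
): string {
  return `http://${cfg.user}_${sessionId}_${geo}:${cfg.pass}@${cfg.host}:${cfg.port}`;
}
```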

Files:

  • src/services/crawl-rotator.ts - getEvomiConfig(), buildEvomiProxyUrl()
  • src/tasks/task-worker.ts - Proxy initialization order

Bulk Task Workflow (Updated 2025-12-13)

Overview

Tasks are created with scheduled_for = NOW() by default. Worker-level controls handle pacing - no task-level staggering needed.

How It Works

1. Task created with scheduled_for = NOW()
2. Worker claims task only when scheduled_for <= NOW()
3. Worker runs preflight on EVERY task claim (proxy health check)
4. If preflight passes, worker executes task
5. If preflight fails, task released back to pending for another worker
6. Worker finishes task, polls for next available task
7. Repeat - preflight runs on each new task claim
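
The claim/preflight/execute cycle above can be sketched with injected dependencies. This is a simplification of src/tasks/task-worker.ts with assumed names, not the actual implementation:

```typescript
// One iteration of the worker loop: claim, preflight, then execute or
// release. Dependencies are injected so the flow is testable in isolation.
interface Task { id: number }

async function workOnce(deps: {
  claimTask: () => Promise<Task | null>;    // only tasks with scheduled_for <= NOW()
  runPreflight: () => Promise<boolean>;     // proxy health check, on EVERY claim
  execute: (task: Task) => Promise<void>;
  release: (task: Task) => Promise<void>;   // back to pending for another worker
}): Promise<"idle" | "done" | "released"> {
  const task = await deps.claimTask();
  if (!task) return "idle";
  if (!(await deps.runPreflight())) {
    await deps.release(task);
    return "released";
  }
  await deps.execute(task);
  return "done";
}
```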

Worker-Level Throttling

These controls pace task execution - no staggering at task creation time:

| Control | Purpose |
|---------|---------|
| MAX_CONCURRENT_TASKS | Limits concurrent tasks per pod (default: 3) |
| Working hours | Restricts when tasks run (configurable per schedule) |
| Preflight checks | Ensures proxy health before each task |
| Per-store locking | Only one active task per dispensary |

Key Points

  • Preflight is per-task, not per-startup: Each task claim triggers a new preflight check
  • Worker controls pacing: Tasks scheduled for NOW() but claimed based on worker capacity
  • Optional staggering: Pass stagger_seconds > 0 if you need explicit delays

API Endpoints

```
# Create bulk tasks for specific dispensary IDs
POST /api/tasks/batch/staggered
{
  "dispensary_ids": [1, 2, 3, 4],
  "role": "product_refresh",      # or "product_discovery"
  "stagger_seconds": 0,           # default: 0 (all NOW)
  "platform": "dutchie",          # default: "dutchie"
  "method": null                  # "curl" | "http" | null
}

# Create bulk tasks for all stores in a state
POST /api/tasks/crawl-state/:stateCode
{
  "stagger_seconds": 0,           # default: 0 (all NOW)
  "method": "http"                # default: "http"
}
```

Example: Tasks for AZ Stores

```bash
curl -X POST http://localhost:3010/api/tasks/crawl-state/AZ \
  -H "Content-Type: application/json"
```

Files:

| File | Purpose |
|------|---------|
| src/tasks/task-service.ts | createStaggeredTasks() method |
| src/routes/tasks.ts | API endpoints for batch task creation |
| src/tasks/task-worker.ts | Worker task claiming and preflight logic |

Wasabi S3 Storage (Payload Archive)

Raw crawl payloads are archived to Wasabi S3 for long-term storage and potential reprocessing.

Configuration

| Variable | Description | Default |
|----------|-------------|---------|
| WASABI_ACCESS_KEY | Wasabi access key ID | - |
| WASABI_SECRET_KEY | Wasabi secret access key | - |
| WASABI_BUCKET | Bucket name | cannaiq |
| WASABI_REGION | Wasabi region | us-west-2 |
| WASABI_ENDPOINT | S3 endpoint URL | https://s3.us-west-2.wasabisys.com |

Storage Path Format

payloads/{state}/{YYYY-MM-DD}/{dispensary_id}/{platform}_{timestamp}.json.gz

Example: payloads/AZ/2025-12-16/123/dutchie_2025-12-16T10-30-00-000Z.json.gz
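
The path scheme can be sketched as a pure helper. The function name is hypothetical; the real path construction lives in src/services/wasabi-storage.ts:

```typescript
// Builds the documented storage path:
// payloads/{state}/{YYYY-MM-DD}/{dispensary_id}/{platform}_{timestamp}.json.gz
function payloadPath(
  state: string,
  dispensaryId: number,
  platform: string,
  when: Date,
): string {
  const day = when.toISOString().slice(0, 10);            // YYYY-MM-DD
  const stamp = when.toISOString().replace(/[:.]/g, "-"); // key-safe timestamp
  return `payloads/${state}/${day}/${dispensaryId}/${platform}_${stamp}.json.gz`;
}
```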

Features

  • Gzip compression: ~70% size reduction on JSON payloads
  • Automatic archival: Every crawl is archived (not just daily baselines)
  • Metadata: taskId, productCount, platform stored with each object
  • Graceful fallback: If Wasabi not configured, archival is skipped (no task failure)

Files

| File | Purpose |
|------|---------|
| src/services/wasabi-storage.ts | S3 client and storage functions |
| src/tasks/handlers/product-discovery-dutchie.ts | Archives Dutchie payloads |
| src/tasks/handlers/product-discovery-jane.ts | Archives Jane payloads |
| src/tasks/handlers/product-discovery-treez.ts | Archives Treez payloads |

K8s Secret Setup

```bash
kubectl patch secret scraper-secrets -n cannaiq -p '{"stringData":{
  "WASABI_ACCESS_KEY": "<access-key>",
  "WASABI_SECRET_KEY": "<secret-key>"
}}'
```

Usage in Code

```typescript
import { storePayload, getPayload, listPayloads } from '../services/wasabi-storage';

// Store a payload
const result = await storePayload(dispensaryId, 'AZ', 'dutchie', rawPayload);
console.log(result.path);            // payloads/AZ/2025-12-16/123/dutchie_...
console.log(result.compressedBytes); // Size after gzip

// Retrieve a payload
const payload = await getPayload(result.path);

// List payloads for a store on a date
const paths = await listPayloads(123, 'AZ', '2025-12-16');
```

Estimated Storage

  • ~100KB per crawl (compressed)
  • ~200 stores × 12 crawls/day = 240MB/day
  • ~7.2GB/month
  • 5TB capacity ≈ 57 years at this rate (5TB ÷ 7.2GB/month)

Real-Time Inventory Tracking

High-frequency crawling for sales velocity and inventory analytics.

Crawl Intervals

| State | Interval | Jitter | Effective Range |
|-------|----------|--------|-----------------|
| AZ | 5 min | ±3 min | 2-8 min |
| Others | 60 min | ±3 min | 57-63 min |
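
Interval-plus-jitter scheduling can be sketched as follows. This is an assumed shape; the real logic is in src/services/task-scheduler.ts, and the injectable random source is only for testability:

```typescript
// Next crawl delay: base interval plus uniform jitter in
// [-jitterMinutes, +jitterMinutes], matching the table above
// (e.g. 5 min ± 3 min => 2-8 min effective range for AZ).
function nextCrawlDelayMinutes(
  baseMinutes: number,   // 5 for AZ, 60 elsewhere
  jitterMinutes: number, // 3
  rand: () => number = Math.random,
): number {
  const jitter = (rand() * 2 - 1) * jitterMinutes;
  return baseMinutes + jitter;
}
```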

Delta-Only Snapshots

Only store inventory changes, not full state. Reduces storage by ~95%.

Change Types:

  • sale: quantity decreased (qty_delta < 0)
  • restock: quantity increased (qty_delta > 0)
  • price_change: price changed, quantity same
  • oos: went out of stock (qty → 0)
  • back_in_stock: returned to stock (0 → qty)
  • new_product: first time seeing product
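
The classification above can be sketched as a pure function. The signature and the precedence of oos/back_in_stock over sale/restock are assumptions; the real logic is in src/services/inventory-snapshots.ts:

```typescript
// Sketch of delta classification: null prevQty means first sighting;
// transitions to/from zero take priority over plain sale/restock;
// a pure price move with unchanged quantity is a price_change;
// no change at all stores nothing (delta-only snapshots).
function classifyChange(
  prevQty: number | null,
  newQty: number,
  priceChanged: boolean,
): string | null {
  if (prevQty === null) return "new_product";
  if (prevQty > 0 && newQty === 0) return "oos";
  if (prevQty === 0 && newQty > 0) return "back_in_stock";
  if (newQty < prevQty) return "sale";
  if (newQty > prevQty) return "restock";
  if (priceChanged) return "price_change";
  return null;
}
```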

Revenue Calculation

```
revenue = ABS(qty_delta) × effective_price
effective_price = sale_price if on_special else regular_price
```
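
The formula translates directly to code. Field names here are assumptions for the example:

```typescript
// Revenue estimate for one delta row: |qty_delta| times the effective
// price, where the sale price applies only when the item is on special.
interface Delta {
  qtyDelta: number;        // negative for sales
  regularPrice: number;
  salePrice: number | null;
  onSpecial: boolean;
}

function estimatedRevenue(d: Delta): number {
  const effectivePrice =
    d.onSpecial && d.salePrice !== null ? d.salePrice : d.regularPrice;
  return Math.abs(d.qtyDelta) * effectivePrice;
}
```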

Key Views

| View | Purpose |
|------|---------|
| v_hourly_sales | Sales aggregated by hour |
| v_daily_store_sales | Daily revenue by store |
| v_daily_brand_sales | Daily brand performance |
| v_product_velocity | Hot/steady/slow/stale rankings |
| v_stock_out_prediction | Days until OOS based on velocity |
| v_brand_variants | SKU counts per brand |

Files

| File | Purpose |
|------|---------|
| src/services/inventory-snapshots.ts | Delta calculation and storage |
| src/services/task-scheduler.ts | High-frequency scheduling with jitter |
| migrations/125_delta_only_snapshots.sql | Delta columns and views |
| migrations/126_az_high_frequency.sql | AZ 5-min intervals |

Documentation

| Doc | Purpose |
|-----|---------|
| backend/docs/CODEBASE_MAP.md | Current files/directories |
| backend/docs/_archive/ | Historical docs (may be outdated) |