# Claude Guidelines for CannaiQ

## CURRENT ENVIRONMENT: PRODUCTION
**We are working in PRODUCTION only.** All database queries and API calls should target the remote production environment, not localhost. Use kubectl port-forward or remote DB connections as needed.

## PERMANENT RULES (NEVER VIOLATE)

### 1. NO DELETE
Never delete data, files, images, logs, or database rows. CannaiQ is a historical analytics system.

### 2. NO KILL
Never run `pkill`, `kill`, `killall`, or similar. Say "Please run `./stop-local.sh`" instead.

### 3. NO MANUAL STARTUP
Never start servers manually. Say "Please run `./setup-local.sh`" instead.

### 4. DEPLOYMENT AUTH REQUIRED
Never deploy unless user explicitly says: "CLAUDE — DEPLOYMENT IS NOW AUTHORIZED."

### 5. DB POOL ONLY
Never import `src/db/migrate.ts` at runtime. Use `src/db/pool.ts` for DB access.
### 6. K8S POD LIMITS — CRITICAL
**MAX 8 PODS** for `scraper-worker` deployment. NEVER EXCEED THIS.

**Pods vs Workers:**
- **Pod** = Kubernetes container instance (MAX 8)
- **Worker** = Concurrent task runner INSIDE a pod (controlled by `MAX_CONCURRENT_TASKS` env var)
- Formula: `8 pods × MAX_CONCURRENT_TASKS = total concurrent workers`

**Browser Task Memory Limits:**
- Each Puppeteer/Chrome browser uses ~400 MB RAM
- Pod memory limit is 2 GB
- **MAX_CONCURRENT_TASKS=3** is the safe maximum for browser tasks
- More than 3 concurrent browsers per pod = OOM crash

| Browsers | RAM Used | Status |
|----------|----------|--------|
| 3 | ~1.3 GB | Safe (recommended) |
| 4 | ~1.7 GB | Risky |
| 5+ | >2 GB | OOM crash |

**To increase throughput:** Add more pods (up to 8), NOT more concurrent tasks per pod.

```bash
# CORRECT - scale pods (up to 8)
kubectl scale deployment/scraper-worker -n dispensary-scraper --replicas=8

# WRONG - will cause OOM crashes
kubectl set env deployment/scraper-worker -n dispensary-scraper MAX_CONCURRENT_TASKS=10
```
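
The limits above can be turned into a quick sanity check before proposing any scaling change. This is a minimal sketch, not code from the repository: `planCapacity` and its constants are hypothetical names, and the ~100 MB baseline is an assumption inferred from the table (3 browsers at ~400 MB each would be 1.2 GB, while the table lists ~1.3 GB).

```typescript
// Figures from the rules above. BASELINE_MB is an ASSUMPTION inferred from the table.
const MAX_PODS = 8;
const BROWSER_MB = 400;
const BASELINE_MB = 100; // assumed non-browser overhead per pod
const POD_LIMIT_MB = 2048; // 2 GB pod memory limit

interface CapacityPlan {
  totalWorkers: number; // pods × MAX_CONCURRENT_TASKS
  estimatedPodMb: number; // rough memory estimate for one pod
  withinLimit: boolean; // false means the estimate exceeds the pod limit
}

function planCapacity(pods: number, maxConcurrentTasks: number): CapacityPlan {
  if (!Number.isInteger(pods) || pods < 1 || pods > MAX_PODS) {
    throw new RangeError(`pods must be an integer between 1 and ${MAX_PODS}`);
  }
  const estimatedPodMb = BASELINE_MB + maxConcurrentTasks * BROWSER_MB;
  return {
    totalWorkers: pods * maxConcurrentTasks,
    estimatedPodMb,
    withinLimit: estimatedPodMb <= POD_LIMIT_MB,
  };
}
```

With these assumptions, `planCapacity(8, 3)` yields 24 total workers at an estimated 1300 MB per pod. Four browsers still fit under the limit on paper, but the table rates that as risky, so 3 remains the ceiling.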

**If K8s API returns ServiceUnavailable:** STOP IMMEDIATELY. Do not retry. The cluster is overloaded.

### 7. K8S REQUIRES EXPLICIT PERMISSION
**NEVER run kubectl commands without explicit user permission.**

Before running ANY `kubectl` command (scale, rollout, set env, delete, apply, etc.):
1. Tell the user what you want to do
2. Wait for explicit approval
3. Only then execute the command

This applies to ALL kubectl operations - even read-only ones like `kubectl get pods`.

---

## Quick Reference

### Database Tables
| USE THIS | NOT THIS |
|----------|----------|
| `dispensaries` | `stores` (empty) |
| `store_products` | `products` (empty) |
| `store_product_snapshots` | `dutchie_product_snapshots` |

### Key Files
| Purpose | File |
|---------|------|
| Dutchie client | `src/platforms/dutchie/client.ts` |
| DB pool | `src/db/pool.ts` |
| Payload fetch | `src/tasks/handlers/payload-fetch.ts` |
| Product refresh | `src/tasks/handlers/product-refresh.ts` |

### Dutchie GraphQL
- **Endpoint**: `https://dutchie.com/api-3/graphql`
- **Hash (FilteredProducts)**: `ee29c060826dc41c527e470e9ae502c9b2c169720faa0a9f5d25e1b9a530a4a0`
- **CRITICAL**: Use `Status: 'Active'` (not `null`)
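
A hash paired with an operation name suggests an Apollo-style persisted query, so a request URL could be assembled roughly as below. This is only a sketch under that assumption: `buildFilteredProductsUrl` is a hypothetical helper, and the `productsFilter` / `dispensaryId` variable names are illustrative guesses. Only the endpoint, the hash, and `Status: 'Active'` come from this document; check `src/platforms/dutchie/client.ts` for the real request shape.

```typescript
const DUTCHIE_ENDPOINT = "https://dutchie.com/api-3/graphql";
const FILTERED_PRODUCTS_HASH =
  "ee29c060826dc41c527e470e9ae502c9b2c169720faa0a9f5d25e1b9a530a4a0";

// Hypothetical helper: builds a GET URL in the Apollo persisted-query style.
function buildFilteredProductsUrl(dispensaryId: string): string {
  const variables = {
    // ASSUMED variable shape; only Status: 'Active' is documented above.
    productsFilter: { dispensaryId, Status: "Active" },
  };
  const extensions = {
    persistedQuery: { version: 1, sha256Hash: FILTERED_PRODUCTS_HASH },
  };
  const params = new URLSearchParams({
    operationName: "FilteredProducts",
    variables: JSON.stringify(variables),
    extensions: JSON.stringify(extensions),
  });
  return `${DUTCHIE_ENDPOINT}?${params.toString()}`;
}
```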

### Frontends
| Folder | Domain | Build |
|--------|--------|-------|
| `cannaiq/` | cannaiq.co | Vite |
| `findadispo/` | findadispo.com | CRA |
| `findagram/` | findagram.co | CRA |
| `frontend/` | DEPRECATED | - |

---

## Deprecated Code

**DO NOT USE** anything in `src/_deprecated/`:
- `hydration/` - Use `src/tasks/handlers/`
- `scraper-v2/` - Use `src/platforms/dutchie/`
- `canonical-hydration/` - Merged into tasks

**DO NOT USE** `src/dutchie-az/db/connection.ts` - Use `src/db/pool.ts`

---

## Local Development

```bash
./setup-local.sh # Start all services
./stop-local.sh # Stop all services
```

| Service | URL |
|---------|-----|
| API | http://localhost:3010 |
| Admin | http://localhost:8080/admin |
| PostgreSQL | localhost:54320 |

---

## WordPress Plugin (ACTIVE)

### Plugin Files
| File | Purpose |
|------|---------|
| `wordpress-plugin/cannaiq-menus.php` | Main plugin (CannaIQ brand) |
| `wordpress-plugin/crawlsy-menus.php` | Legacy plugin (Crawlsy brand) |
| `wordpress-plugin/VERSION` | Version tracking |

### API Routes (Backend)
- `GET /api/v1/wordpress/dispensaries` - List dispensaries
- `GET /api/v1/wordpress/dispensary/:id/menu` - Get menu data
- Route file: `backend/src/routes/wordpress.ts`

### Versioning
Bump `wordpress-plugin/VERSION` on changes:
- Minor (x.x.N): bug fixes
- Middle (x.N.0): new features
- Major (N.0.0): breaking changes (user must request)
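
The bump rules above can be sketched as a small helper. `bumpVersion` is a hypothetical function, and it assumes the `VERSION` file holds a plain three-part `x.y.z` string; the kind names follow this document's own terms (minor, middle, major). For example, `bumpVersion("1.2.3", "middle")` returns `"1.3.0"`.

```typescript
type BumpKind = "minor" | "middle" | "major";

// Hypothetical helper mirroring the rules above; assumes an "x.y.z" version string.
function bumpVersion(version: string, kind: BumpKind): string {
  const match = /^(\d+)\.(\d+)\.(\d+)$/.exec(version.trim());
  if (!match) {
    throw new Error(`Expected a version like 1.2.3, got "${version}"`);
  }
  const [major, middle, minor] = match.slice(1).map(Number);
  switch (kind) {
    case "minor": // x.x.N: bug fixes
      return `${major}.${middle}.${minor + 1}`;
    case "middle": // x.N.0: new features, last part resets
      return `${major}.${middle + 1}.0`;
    case "major": // N.0.0: breaking changes, only when the user requests it
      return `${major + 1}.0.0`;
  }
}
```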

---

## Puppeteer Scraping (Browser-Based)

### Age Gate Bypass

Most dispensary sites require age verification. The browser scraper handles this automatically:

**Utility File**: `src/utils/age-gate.ts`

**Key Functions**:
- `setAgeGateCookies(page, url, state)` - Set cookies BEFORE navigation to prevent gate
- `hasAgeGate(page)` - Detect if page shows age verification
- `bypassAgeGate(page, state)` - Click through age gate if displayed
- `detectStateFromUrl(url)` - Extract state from URL (e.g., `-az-` → Arizona)
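
To illustrate the `-az-` → Arizona idea behind `detectStateFromUrl`, here is a minimal sketch. It is not the implementation in `src/utils/age-gate.ts`: the function name `detectStateFromUrlSketch`, the lookup table, and the matching rule (a two-letter code between hyphens) are all assumptions made for illustration.

```typescript
// Illustrative subset only; the real utility may cover more states or use other rules.
const STATE_NAMES: Record<string, string> = {
  az: "Arizona",
  ca: "California",
  co: "Colorado",
  mi: "Michigan",
};

// Hypothetical sketch: find the first "-xx-" segment that maps to a known state.
function detectStateFromUrlSketch(url: string): string | null {
  const pattern = /-([a-z]{2})(?=-)/g; // lookahead keeps the trailing hyphen available
  const lower = url.toLowerCase();
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(lower)) !== null) {
    const name = STATE_NAMES[match[1]];
    if (name) return name;
  }
  return null;
}
```

The return value can then be passed as the `state` argument of `setAgeGateCookies` and `bypassAgeGate`.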

**Cookie Names Set**:
- `age_gate_passed: 'true'`
- `selected_state: '<state>'`
- `age_verified: 'true'`
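
The three cookies above could be built as plain objects before being handed to the browser. This is a sketch only: `buildAgeGateCookies` is a hypothetical helper, and scoping the cookies to the URL's hostname with path `/` is an assumption rather than something this document states.

```typescript
interface AgeGateCookie {
  name: string;
  value: string;
  domain: string;
  path: string;
}

// Hypothetical helper: one cookie object per name listed above.
function buildAgeGateCookies(url: string, state: string): AgeGateCookie[] {
  const { hostname } = new URL(url); // ASSUMPTION: cookies are scoped to the menu's host
  const pairs: Array<[string, string]> = [
    ["age_gate_passed", "true"],
    ["selected_state", state],
    ["age_verified", "true"],
  ];
  return pairs.map(([name, value]) => ({ name, value, domain: hostname, path: "/" }));
}
```

Objects of this shape match what Puppeteer's `page.setCookie(...cookies)` accepts, which is presumably how `setAgeGateCookies` applies them before navigation.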

**Bypass Methods** (tried in order):
1. Custom dropdown (shadcn/radix style) - Curaleaf pattern
2. Standard `<select>` dropdown
3. State button/card click
4. Direct "Yes"/"Enter" button
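
"Tried in order" amounts to a fallback chain: run each method until one reports success. The sketch below shows that control flow in isolation. `tryBypassInOrder` and `BypassStrategy` are hypothetical names, and treating a thrown error as a failed attempt is an assumption; the real `bypassAgeGate` may behave differently.

```typescript
// A strategy reports true when it got past the gate, false when it did not apply.
interface BypassStrategy<P> {
  name: string;
  attempt: (page: P) => Promise<boolean>;
}

// Hypothetical sketch: returns the name of the first strategy that succeeds, or null.
async function tryBypassInOrder<P>(
  page: P,
  strategies: BypassStrategy<P>[],
): Promise<string | null> {
  for (const strategy of strategies) {
    try {
      if (await strategy.attempt(page)) return strategy.name;
    } catch {
      // ASSUMPTION: an error (e.g. a missing selector) counts as "did not work"; try the next one.
    }
  }
  return null;
}
```

The caller would pass the four methods in the order listed above, so the cheaper direct-button click only runs when the dropdown and state-card patterns did not match.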

**Usage Pattern**:
```typescript
import { setAgeGateCookies, bypassAgeGate } from '../utils/age-gate';

// Set cookies BEFORE navigation
await setAgeGateCookies(page, menuUrl, 'Arizona');
await page.goto(menuUrl);

// If gate still appears, bypass it
await bypassAgeGate(page, 'Arizona');
```

**Note**: Deeply-Rooted (AZ) does NOT use an age gate, which makes it a good target for preflight testing.

### Dual-Transport Preflight

Workers run BOTH preflight checks on startup:

| Transport | Test Method | Use Case |
|-----------|-------------|----------|
| `curl` | axios + proxy → httpbin.org | Fast API requests |
| `http` | Puppeteer + proxy + StealthPlugin | Anti-detect, browser fingerprint |

**HTTP Preflight Steps**:
1. Get proxy from pool (CrawlRotator)
2. Visit fingerprint.com demo (or amiunique.org fallback) to verify IP and anti-detect
3. Visit Dutchie embedded menu to establish session
4. Make GraphQL request from browser context

**Files**:
- `src/services/curl-preflight.ts`
- `src/services/puppeteer-preflight.ts`
- `migrations/084_dual_transport_preflight.sql`

**Task Method Column**: Tasks have a `method` column ('curl' | 'http' | null):
- `null` = any worker can claim
- `'curl'` = only workers with passed curl preflight
- `'http'` = only workers with passed http preflight

Currently ALL crawl tasks require `method = 'http'`.
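
The claiming rule for the `method` column reduces to a small predicate. This is a sketch of the rule as described above, not the query in `src/tasks/task-worker.ts`; `canClaim` and the `WorkerPreflight` shape are hypothetical names.

```typescript
type TaskMethod = "curl" | "http" | null;

// Hypothetical record of which preflight checks a worker has passed.
interface WorkerPreflight {
  curlPassed: boolean;
  httpPassed: boolean;
}

// null = any worker; otherwise the worker must have passed the matching preflight.
function canClaim(method: TaskMethod, worker: WorkerPreflight): boolean {
  if (method === null) return true;
  return method === "curl" ? worker.curlPassed : worker.httpPassed;
}
```

Because all crawl tasks currently use `method = 'http'`, a worker whose Puppeteer preflight failed would claim nothing even if its curl preflight passed.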

### Anti-Detect Fingerprint Distribution

Browser fingerprints are randomized using realistic market share distributions:

**Files**:
- `src/services/crawl-rotator.ts` - Device/browser selection
- `src/services/http-fingerprint.ts` - HTTP header fingerprinting

**Device Weights** (matches real traffic patterns):
| Device | Weight | Percentage |
|--------|--------|------------|
| Mobile | 62 | 62% |
| Desktop | 36 | 36% |
| Tablet | 2 | 2% |
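
Weighted selection of this kind is commonly done by drawing a number in `[0, total)` and walking the cumulative weights. The sketch below uses the device weights from the table; `pickWeighted` is a hypothetical helper and may not match how `crawl-rotator.ts` does it. The random source is injectable so the behavior can be checked deterministically.

```typescript
type Device = "mobile" | "desktop" | "tablet";

// Weights copied from the table above.
const DEVICE_WEIGHTS: Array<[Device, number]> = [
  ["mobile", 62],
  ["desktop", 36],
  ["tablet", 2],
];

// Hypothetical helper: picks an entry with probability weight / total.
function pickWeighted<T>(
  entries: Array<[T, number]>,
  rand: () => number = Math.random,
): T {
  const total = entries.reduce((sum, [, weight]) => sum + weight, 0);
  let threshold = rand() * total;
  for (const [value, weight] of entries) {
    if (threshold < weight) return value;
    threshold -= weight;
  }
  // Only reachable through floating-point edge cases; fall back to the last entry.
  return entries[entries.length - 1][0];
}
```

The same helper would work for the browser list, given its market-share figures as weights.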

**Allowed Browsers** (only realistic ones):
- Chrome (67% market share)
- Safari (20% market share)
- Edge (6% market share)
- Firefox (3% market share)

All other browsers are filtered out. Uses `intoli/user-agents` library for realistic UA generation.

**HTTP Header Fingerprinting**:
- DNT (Do Not Track): 30% probability of sending
- Accept headers: Browser-specific variations
- Header ordering: Matches real browser behavior (Chrome, Firefox, Safari, Edge each have unique order)
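
Two of the header rules above are easy to sketch: sending `DNT` with 30% probability, and emitting headers in a caller-supplied order. Both helpers are hypothetical and are not taken from `src/services/http-fingerprint.ts`. The per-browser orderings themselves are not listed in this document, so `orderHeaders` takes the order as an argument instead of hard-coding one.

```typescript
// Adds "DNT: 1" with 30% probability; the random source is injectable for testing.
function maybeAddDnt(
  headers: Record<string, string>,
  rand: () => number = Math.random,
): Record<string, string> {
  return rand() < 0.3 ? { ...headers, DNT: "1" } : { ...headers };
}

// Returns [name, value] pairs: names listed in `order` first, any remaining names after.
function orderHeaders(
  headers: Record<string, string>,
  order: string[],
): Array<[string, string]> {
  const has = (name: string) => Object.prototype.hasOwnProperty.call(headers, name);
  const listed = order
    .filter(has)
    .map((name): [string, string] => [name, headers[name]]);
  const rest = Object.keys(headers)
    .filter((name) => order.indexOf(name) === -1)
    .map((name): [string, string] => [name, headers[name]]);
  return listed.concat(rest);
}
```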

**curl-impersonate Binaries** (for curl transport):
| Browser | Binary |
|---------|--------|
| Chrome | `curl_chrome131` |
| Edge | `curl_chrome131` |
| Firefox | `curl_ff133` |
| Safari | `curl_safari17` |
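
In code, the table above is a plain lookup. A minimal sketch, with `curlBinaryFor` as a hypothetical helper name; note that Edge maps to the Chrome binary, as in the table.

```typescript
type Browser = "chrome" | "edge" | "firefox" | "safari";

// Binary names copied from the table above.
const CURL_BINARIES: Record<Browser, string> = {
  chrome: "curl_chrome131",
  edge: "curl_chrome131",
  firefox: "curl_ff133",
  safari: "curl_safari17",
};

// Hypothetical helper: case-insensitive lookup that rejects browsers outside the allowed set.
function curlBinaryFor(browser: string): string {
  const key = browser.toLowerCase();
  if (!Object.prototype.hasOwnProperty.call(CURL_BINARIES, key)) {
    throw new Error(`No curl-impersonate binary for browser "${browser}"`);
  }
  return CURL_BINARIES[key as Browser];
}
```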

These binaries mimic real browser TLS fingerprints to avoid detection.

---

## Bulk Task Workflow (Updated 2025-12-13)

### Overview
Tasks are created with `scheduled_for = NOW()` by default. Worker-level controls handle pacing - no task-level staggering needed.

### How It Works
```
1. Task created with scheduled_for = NOW()
2. Worker claims task only when scheduled_for <= NOW()
3. Worker runs preflight on EVERY task claim (proxy health check)
4. If preflight passes, worker executes task
5. If preflight fails, task released back to pending for another worker
6. Worker finishes task, polls for next available task
7. Repeat - preflight runs on each new task claim
```
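
The steps above can be sketched as a single worker iteration over an in-memory task list. This is a simplified illustration, not the code in `src/tasks/task-worker.ts`: the `Task` shape, `claimNext`, and `runOnce` are hypothetical, and the real worker claims tasks from the database.

```typescript
type Status = "pending" | "claimed" | "completed";

interface Task {
  id: number;
  scheduledFor: Date;
  status: Status;
}

// Step 2: only tasks with scheduled_for <= now are claimable.
function claimNext(tasks: Task[], now: Date): Task | null {
  const task = tasks.find(
    (t) => t.status === "pending" && t.scheduledFor.getTime() <= now.getTime(),
  );
  if (!task) return null;
  task.status = "claimed";
  return task;
}

// Steps 3-6: preflight on every claim, release on failure, execute on success.
async function runOnce(
  tasks: Task[],
  now: Date,
  preflight: () => Promise<boolean>,
  execute: (task: Task) => Promise<void>,
): Promise<"idle" | "released" | "completed"> {
  const task = claimNext(tasks, now);
  if (!task) return "idle";
  if (!(await preflight())) {
    task.status = "pending"; // released back for another worker
    return "released";
  }
  await execute(task);
  task.status = "completed";
  return "completed";
}
```

A worker would call `runOnce` in a loop (step 7), so a proxy that goes bad mid-run is caught at the next claim instead of only at startup.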

### Worker-Level Throttling
These controls pace task execution - no staggering at task creation time:

| Control | Purpose |
|---------|---------|
| `MAX_CONCURRENT_TASKS` | Limits concurrent tasks per pod (default: 3) |
| Working hours | Restricts when tasks run (configurable per schedule) |
| Preflight checks | Ensures proxy health before each task |
| Per-store locking | Only one active task per dispensary |

### Key Points
- **Preflight is per-task, not per-startup**: Each task claim triggers a new preflight check
- **Worker controls pacing**: Tasks are scheduled for NOW() but claimed based on worker capacity
- **Optional staggering**: Pass `stagger_seconds > 0` if you need explicit delays
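
To make the effect of `stagger_seconds` concrete, the sketch below computes one `scheduled_for` value per task. It assumes that task `i` is delayed by `i × stagger_seconds`; how `createStaggeredTasks()` actually spaces tasks is not shown in this document, and `staggeredSchedule` is a hypothetical name.

```typescript
// ASSUMPTION: the i-th task is scheduled i * staggerSeconds after `now`.
function staggeredSchedule(
  count: number,
  staggerSeconds: number,
  now: Date = new Date(),
): Date[] {
  return Array.from(
    { length: count },
    (_, i) => new Date(now.getTime() + i * staggerSeconds * 1000),
  );
}
```

With the default of 0, every task gets the same timestamp and pacing is left entirely to the worker-level controls above.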

### API Endpoints
```bash
# Create bulk tasks for specific dispensary IDs
#   role:            "product_refresh" or "product_discovery"
#   stagger_seconds: default 0 (all NOW)
#   platform:        default "dutchie"
#   method:          "curl" | "http" | null
curl -X POST http://localhost:3010/api/tasks/batch/staggered \
  -H "Content-Type: application/json" \
  -d '{
    "dispensary_ids": [1, 2, 3, 4],
    "role": "product_refresh",
    "stagger_seconds": 0,
    "platform": "dutchie",
    "method": null
  }'

# Create bulk tasks for all stores in a state (replace :stateCode, e.g. AZ)
#   stagger_seconds: default 0 (all NOW)
#   method:          default "http"
curl -X POST http://localhost:3010/api/tasks/crawl-state/:stateCode \
  -H "Content-Type: application/json" \
  -d '{
    "stagger_seconds": 0,
    "method": "http"
  }'
```

### Example: Tasks for AZ Stores
```bash
curl -X POST http://localhost:3010/api/tasks/crawl-state/AZ \
  -H "Content-Type: application/json"
```

### Related Files
| File | Purpose |
|------|---------|
| `src/tasks/task-service.ts` | `createStaggeredTasks()` method |
| `src/routes/tasks.ts` | API endpoints for batch task creation |
| `src/tasks/task-worker.ts` | Worker task claiming and preflight logic |

---

## Documentation

| Doc | Purpose |
|-----|---------|
| `backend/docs/CODEBASE_MAP.md` | Current files/directories |
| `backend/docs/_archive/` | Historical docs (may be outdated) |