Files
cannaiq/CLAUDE.md
Kelly f82eed4dc3 feat(workers): Add proxy reload, staggered tasks, and bulk proxy import
- Periodic proxy reload: Workers now reload proxies every 60s to pick up changes
- Staggered task scheduling: New API endpoints for creating tasks with delays
- Bulk proxy import: Script supports multiple URL formats including host:port:user:pass
- Proxy URL column: Migration 086 adds proxy_url for non-standard formats

Key changes:
- crawl-rotator.ts: Added reloadIfStale(), isStale(), setReloadInterval()
- task-worker.ts: Calls reloadIfStale() in main loop
- task-service.ts: Added createStaggeredTasks() and createAZStoreTasks()
- tasks.ts: Added POST /batch/staggered and /batch/az-stores endpoints
- import-proxies.ts: New script for bulk proxy import
- CLAUDE.md: Documented staggered task workflow

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-12 01:53:15 -07:00

286 lines
8.5 KiB
Markdown

# Claude Guidelines for CannaiQ
## PERMANENT RULES (NEVER VIOLATE)
### 1. NO DELETE
Never delete data, files, images, logs, or database rows. CannaiQ is a historical analytics system.
### 2. NO KILL
Never run `pkill`, `kill`, `killall`, or similar. Say "Please run `./stop-local.sh`" instead.
### 3. NO MANUAL STARTUP
Never start servers manually. Say "Please run `./setup-local.sh`" instead.
### 4. DEPLOYMENT AUTH REQUIRED
Never deploy unless user explicitly says: "CLAUDE — DEPLOYMENT IS NOW AUTHORIZED."
### 5. DB POOL ONLY
Never import `src/db/migrate.ts` at runtime. Use `src/db/pool.ts` for DB access.
---
## Quick Reference
### Database Tables
| USE THIS | NOT THIS |
|----------|----------|
| `dispensaries` | `stores` (empty) |
| `store_products` | `products` (empty) |
| `store_product_snapshots` | `dutchie_product_snapshots` |
### Key Files
| Purpose | File |
|---------|------|
| Dutchie client | `src/platforms/dutchie/client.ts` |
| DB pool | `src/db/pool.ts` |
| Payload fetch | `src/tasks/handlers/payload-fetch.ts` |
| Product refresh | `src/tasks/handlers/product-refresh.ts` |
### Dutchie GraphQL
- **Endpoint**: `https://dutchie.com/api-3/graphql`
- **Hash (FilteredProducts)**: `ee29c060826dc41c527e470e9ae502c9b2c169720faa0a9f5d25e1b9a530a4a0`
- **CRITICAL**: Use `Status: 'Active'` (not `null`)
### Frontends
| Folder | Domain | Build |
|--------|--------|-------|
| `cannaiq/` | cannaiq.co | Vite |
| `findadispo/` | findadispo.com | CRA |
| `findagram/` | findagram.co | CRA |
| `frontend/` | DEPRECATED | - |
---
## Deprecated Code
**DO NOT USE** anything in `src/_deprecated/`:
- `hydration/` - Use `src/tasks/handlers/`
- `scraper-v2/` - Use `src/platforms/dutchie/`
- `canonical-hydration/` - Merged into tasks
**DO NOT USE** `src/dutchie-az/db/connection.ts` - Use `src/db/pool.ts`
---
## Local Development
```bash
./setup-local.sh # Start all services
./stop-local.sh # Stop all services
```
| Service | URL |
|---------|-----|
| API | http://localhost:3010 |
| Admin | http://localhost:8080/admin |
| PostgreSQL | localhost:54320 |
---
## WordPress Plugin (ACTIVE)
### Plugin Files
| File | Purpose |
|------|---------|
| `wordpress-plugin/cannaiq-menus.php` | Main plugin (CannaIQ brand) |
| `wordpress-plugin/crawlsy-menus.php` | Legacy plugin (Crawlsy brand) |
| `wordpress-plugin/VERSION` | Version tracking |
### API Routes (Backend)
- `GET /api/v1/wordpress/dispensaries` - List dispensaries
- `GET /api/v1/wordpress/dispensary/:id/menu` - Get menu data
- Route file: `backend/src/routes/wordpress.ts`
### Versioning
Bump `wordpress-plugin/VERSION` on changes:
- Minor (x.x.N): bug fixes
- Middle (x.N.0): new features
- Major (N.0.0): breaking changes (user must request)
---
## Puppeteer Scraping (Browser-Based)
### Age Gate Bypass
Most dispensary sites require age verification. The browser scraper handles this automatically:
**Utility File**: `src/utils/age-gate.ts`
**Key Functions**:
- `setAgeGateCookies(page, url, state)` - Set cookies BEFORE navigation to prevent gate
- `hasAgeGate(page)` - Detect if page shows age verification
- `bypassAgeGate(page, state)` - Click through age gate if displayed
- `detectStateFromUrl(url)` - Extract state from URL (e.g., `-az-` → Arizona)
**Cookie Names Set**:
- `age_gate_passed: 'true'`
- `selected_state: '<state>'`
- `age_verified: 'true'`
**Bypass Methods** (tried in order):
1. Custom dropdown (shadcn/radix style) - Curaleaf pattern
2. Standard `<select>` dropdown
3. State button/card click
4. Direct "Yes"/"Enter" button
**Usage Pattern**:
```typescript
import { setAgeGateCookies, bypassAgeGate } from '../utils/age-gate';
// Set cookies BEFORE navigation
await setAgeGateCookies(page, menuUrl, 'Arizona');
await page.goto(menuUrl);
// If gate still appears, bypass it
await bypassAgeGate(page, 'Arizona');
```
**Note**: Deeply-Rooted (AZ) does NOT use age gate - good for preflight testing.
### Dual-Transport Preflight
Workers run BOTH preflight checks on startup:
| Transport | Test Method | Use Case |
|-----------|-------------|----------|
| `curl` | axios + proxy → httpbin.org | Fast API requests |
| `http` | Puppeteer + proxy + StealthPlugin | Anti-detect, browser fingerprint |
**HTTP Preflight Steps**:
1. Get proxy from pool (CrawlRotator)
2. Visit fingerprint.com demo (or amiunique.org fallback) to verify IP and anti-detect
3. Visit Dutchie embedded menu to establish session
4. Make GraphQL request from browser context
**Files**:
- `src/services/curl-preflight.ts`
- `src/services/puppeteer-preflight.ts`
- `migrations/084_dual_transport_preflight.sql`
**Task Method Column**: Tasks have `method` column ('curl' | 'http' | null):
- `null` = any worker can claim
- `'curl'` = only workers with passed curl preflight
- `'http'` = only workers with passed http preflight
Currently ALL crawl tasks require `method = 'http'`.
### Anti-Detect Fingerprint Distribution
Browser fingerprints are randomized using realistic market share distributions:
**Files**:
- `src/services/crawl-rotator.ts` - Device/browser selection
- `src/services/http-fingerprint.ts` - HTTP header fingerprinting
**Device Weights** (matches real traffic patterns):
| Device | Weight | Percentage |
|--------|--------|------------|
| Mobile | 62 | 62% |
| Desktop | 36 | 36% |
| Tablet | 2 | 2% |
**Allowed Browsers** (only realistic ones):
- Chrome (67% market share)
- Safari (20% market share)
- Edge (6% market share)
- Firefox (3% market share)
All other browsers are filtered out. Uses `intoli/user-agents` library for realistic UA generation.
**HTTP Header Fingerprinting**:
- DNT (Do Not Track): 30% probability of sending
- Accept headers: Browser-specific variations
- Header ordering: Matches real browser behavior (Chrome, Firefox, Safari, Edge each have unique order)
**curl-impersonate Binaries** (for curl transport):
| Browser | Binary |
|---------|--------|
| Chrome | `curl_chrome131` |
| Edge | `curl_chrome131` |
| Firefox | `curl_ff133` |
| Safari | `curl_safari17` |
These binaries mimic real browser TLS fingerprints to avoid detection.
---
## Staggered Task Workflow (Added 2025-12-12)
### Overview
When creating many tasks at once (e.g., product refresh for all AZ stores), staggered scheduling prevents resource contention, proxy assignment lag, and API rate limiting.
### How It Works
```
1. Task created with scheduled_for = NOW() + (index * stagger_seconds)
2. Worker claims task only when scheduled_for <= NOW()
3. Worker runs preflight on EVERY task claim (proxy health check)
4. If preflight passes, worker executes task
5. If preflight fails, task released back to pending for another worker
6. Worker finishes task, polls for next available task
7. Repeat - preflight runs on each new task claim
```
### Key Points
- **Preflight is per-task, not per-startup**: Each task claim triggers a new preflight check
- **Stagger prevents thundering herd**: 15 seconds between tasks is default
- **Task assignment is the trigger**: Worker picks up task → runs preflight → executes if passed
### API Endpoints
```bash
# Create staggered tasks for specific dispensary IDs
POST /api/tasks/batch/staggered
{
"dispensary_ids": [1, 2, 3, 4],
"role": "product_refresh", # or "product_discovery"
"stagger_seconds": 15, # default: 15
"platform": "dutchie", # default: "dutchie"
"method": null # "curl" | "http" | null
}
# Create staggered tasks for AZ stores (convenience endpoint)
POST /api/tasks/batch/az-stores
{
"total_tasks": 24, # default: 24
"stagger_seconds": 15, # default: 15
"split_roles": true # default: true (12 refresh, 12 discovery)
}
```
### Example: 24 Tasks for AZ Stores
```bash
curl -X POST http://localhost:3010/api/tasks/batch/az-stores \
-H "Content-Type: application/json" \
-d '{"total_tasks": 24, "stagger_seconds": 15, "split_roles": true}'
```
Response:
```json
{
"success": true,
"total": 24,
"product_refresh": 12,
"product_discovery": 12,
"stagger_seconds": 15,
"total_duration_seconds": 345,
"estimated_completion": "2025-12-12T08:40:00.000Z",
"message": "Created 24 staggered tasks for AZ stores (12 refresh, 12 discovery)"
}
```
### Related Files
| File | Purpose |
|------|---------|
| `src/tasks/task-service.ts` | `createStaggeredTasks()` and `createAZStoreTasks()` methods |
| `src/routes/tasks.ts` | API endpoints for batch task creation |
| `src/tasks/task-worker.ts` | Worker task claiming and preflight logic |
---
## Documentation
| Doc | Purpose |
|-----|---------|
| `backend/docs/CODEBASE_MAP.md` | Current files/directories |
| `backend/docs/_archive/` | Historical docs (may be outdated) |