cannaiq/CLAUDE.md
Kelly cdab71a1ee feat(workers): Add dual-transport preflight system
Workers now run both curl and http (Puppeteer) preflights on startup:
- curl-preflight.ts: Tests axios + proxy via httpbin.org
- puppeteer-preflight.ts: Tests browser + StealthPlugin via fingerprint.com
  (with amiunique.org fallback)
- Migration 084: Adds preflight columns to worker_registry and method
  column to worker_tasks
- Workers report preflight status, IP, fingerprint, and response time
- Tasks can require specific transport method (curl/http)
- Dashboard shows Transport column with preflight status badges

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-11 22:47:52 -07:00


# Claude Guidelines for CannaiQ
## PERMANENT RULES (NEVER VIOLATE)
### 1. NO DELETE
Never delete data, files, images, logs, or database rows. CannaiQ is a historical analytics system.
### 2. NO KILL
Never run `pkill`, `kill`, `killall`, or similar. Say "Please run `./stop-local.sh`" instead.
### 3. NO MANUAL STARTUP
Never start servers manually. Say "Please run `./setup-local.sh`" instead.
### 4. DEPLOYMENT AUTH REQUIRED
Never deploy unless user explicitly says: "CLAUDE — DEPLOYMENT IS NOW AUTHORIZED."
### 5. DB POOL ONLY
Never import `src/db/migrate.ts` at runtime. Use `src/db/pool.ts` for DB access.
---
## Quick Reference
### Database Tables
| USE THIS | NOT THIS |
|----------|----------|
| `dispensaries` | `stores` (empty) |
| `store_products` | `products` (empty) |
| `store_product_snapshots` | `dutchie_product_snapshots` |
### Key Files
| Purpose | File |
|---------|------|
| Dutchie client | `src/platforms/dutchie/client.ts` |
| DB pool | `src/db/pool.ts` |
| Payload fetch | `src/tasks/handlers/payload-fetch.ts` |
| Product refresh | `src/tasks/handlers/product-refresh.ts` |
### Dutchie GraphQL
- **Endpoint**: `https://dutchie.com/api-3/graphql`
- **Hash (FilteredProducts)**: `ee29c060826dc41c527e470e9ae502c9b2c169720faa0a9f5d25e1b9a530a4a0`
- **CRITICAL**: Use `Status: 'Active'` (not `null`)
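A minimal sketch of how the endpoint, hash, and `Status` rule above could combine into a persisted-query GET URL, assuming the Apollo-style `extensions.persistedQuery` convention. The variable shape (`productsFilter`, `dispensaryId`) and the function name are illustrative assumptions; the real request lives in `src/platforms/dutchie/client.ts` and may differ.

```typescript
const DUTCHIE_ENDPOINT = 'https://dutchie.com/api-3/graphql';
const FILTERED_PRODUCTS_HASH =
  'ee29c060826dc41c527e470e9ae502c9b2c169720faa0a9f5d25e1b9a530a4a0';

// Hypothetical variable shape; only `Status: 'Active'` comes from this guide.
interface FilteredProductsVariables {
  productsFilter: { dispensaryId: string; Status: 'Active' };
}

// Builds the GET URL for the FilteredProducts persisted query.
function buildFilteredProductsUrl(dispensaryId: string): string {
  const variables: FilteredProductsVariables = {
    // Status must be 'Active', never null.
    productsFilter: { dispensaryId, Status: 'Active' },
  };
  const extensions = {
    persistedQuery: { version: 1, sha256Hash: FILTERED_PRODUCTS_HASH },
  };
  const params = new URLSearchParams({
    operationName: 'FilteredProducts',
    variables: JSON.stringify(variables),
    extensions: JSON.stringify(extensions),
  });
  return `${DUTCHIE_ENDPOINT}?${params.toString()}`;
}
```

Typing `Status` as the literal `'Active'` means a `null` value fails at compile time instead of silently changing the result set.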
### Frontends
| Folder | Domain | Build |
|--------|--------|-------|
| `cannaiq/` | cannaiq.co | Vite |
| `findadispo/` | findadispo.com | CRA |
| `findagram/` | findagram.co | CRA |
| `frontend/` | DEPRECATED | - |
---
## Deprecated Code
**DO NOT USE** anything in `src/_deprecated/`:
- `hydration/` - Use `src/tasks/handlers/`
- `scraper-v2/` - Use `src/platforms/dutchie/`
- `canonical-hydration/` - Merged into tasks
**DO NOT USE** `src/dutchie-az/db/connection.ts` - Use `src/db/pool.ts`
---
## Local Development
```bash
./setup-local.sh # Start all services
./stop-local.sh # Stop all services
```
| Service | URL |
|---------|-----|
| API | http://localhost:3010 |
| Admin | http://localhost:8080/admin |
| PostgreSQL | localhost:54320 |
---
## WordPress Plugin (ACTIVE)
### Plugin Files
| File | Purpose |
|------|---------|
| `wordpress-plugin/cannaiq-menus.php` | Main plugin (CannaiQ brand) |
| `wordpress-plugin/crawlsy-menus.php` | Legacy plugin (Crawlsy brand) |
| `wordpress-plugin/VERSION` | Version tracking |
### API Routes (Backend)
- `GET /api/v1/wordpress/dispensaries` - List dispensaries
- `GET /api/v1/wordpress/dispensary/:id/menu` - Get menu data
- Route file: `backend/src/routes/wordpress.ts`
### Versioning
Bump `wordpress-plugin/VERSION` on changes:
- Minor (x.x.N): bug fixes
- Middle (x.N.0): new features
- Major (N.0.0): breaking changes (user must request)
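The bump rules above can be sketched as a small helper. This assumes `wordpress-plugin/VERSION` holds three dot-separated integers; the `bumpVersion` name and the `BumpKind` labels are hypothetical and simply mirror the wording of this section.

```typescript
// Labels follow this guide: 'minor' is the last position, 'middle' the second.
type BumpKind = 'minor' | 'middle' | 'major';

function bumpVersion(current: string, kind: BumpKind): string {
  const match = /^(\d+)\.(\d+)\.(\d+)$/.exec(current.trim());
  if (!match) {
    throw new Error(`Expected a version like "1.2.3", got "${current}"`);
  }
  const major = Number(match[1]);
  const middle = Number(match[2]);
  const minor = Number(match[3]);
  switch (kind) {
    case 'minor': // x.x.N: bug fixes
      return `${major}.${middle}.${minor + 1}`;
    case 'middle': // x.N.0: new features, resets the last position
      return `${major}.${middle + 1}.0`;
    case 'major': // N.0.0: breaking changes, only when the user requests it
      return `${major + 1}.0.0`;
  }
}
```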
---
## Puppeteer Scraping (Browser-Based)
### Age Gate Bypass
Most dispensary sites require age verification. The browser scraper handles this automatically:
**Utility File**: `src/utils/age-gate.ts`
**Key Functions**:
- `setAgeGateCookies(page, url, state)` - Set cookies BEFORE navigation to prevent the gate from appearing
- `hasAgeGate(page)` - Detect if page shows age verification
- `bypassAgeGate(page, state)` - Click through age gate if displayed
- `detectStateFromUrl(url)` - Extract state from URL (e.g., `-az-` → Arizona)
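To illustrate the `-az-` → Arizona idea behind `detectStateFromUrl`, here is a rough sketch. It is not the implementation in `src/utils/age-gate.ts`: the function name, the regex, and the short state table are assumptions made for this example.

```typescript
// Hypothetical subset of state codes; the real utility may cover more.
const STATE_BY_CODE: Record<string, string> = {
  az: 'Arizona',
  ca: 'California',
  co: 'Colorado',
  mi: 'Michigan',
  nv: 'Nevada',
};

// Returns the first known state whose two-letter code appears as "-xx-" in the URL.
function detectStateFromUrlSketch(url: string): string | null {
  // The lookahead keeps the trailing hyphen available for the next match,
  // so "-in-az-" still finds "az" after skipping the unknown code "in".
  const pattern = /-([a-z]{2})(?=-)/g;
  const lower = url.toLowerCase();
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(lower)) !== null) {
    const state = STATE_BY_CODE[match[1]];
    if (state) return state;
  }
  return null;
}
```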
**Cookie Names Set**:
- `age_gate_passed: 'true'`
- `selected_state: '<state>'`
- `age_verified: 'true'`
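As a sketch, the three cookies above could be built as plain objects before being handed to the browser. The `buildAgeGateCookies` helper and the choice to scope each cookie to the URL's hostname with path `/` are assumptions for illustration; `setAgeGateCookies` may scope them differently.

```typescript
interface AgeGateCookie {
  name: string;
  value: string;
  domain: string;
  path: string;
}

// Builds the three age-gate cookies for the site that serves `url`.
function buildAgeGateCookies(url: string, state: string): AgeGateCookie[] {
  const { hostname } = new URL(url);
  const base = { domain: hostname, path: '/' };
  return [
    { name: 'age_gate_passed', value: 'true', ...base },
    { name: 'selected_state', value: state, ...base },
    { name: 'age_verified', value: 'true', ...base },
  ];
}
```

Objects of this shape can be passed to Puppeteer's `page.setCookie(...cookies)` before calling `page.goto`.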
**Bypass Methods** (tried in order):
1. Custom dropdown (shadcn/radix style) - Curaleaf pattern
2. Standard `<select>` dropdown
3. State button/card click
4. Direct "Yes"/"Enter" button
**Usage Pattern**:
```typescript
import { setAgeGateCookies, bypassAgeGate } from '../utils/age-gate';
// Set cookies BEFORE navigation
await setAgeGateCookies(page, menuUrl, 'Arizona');
await page.goto(menuUrl);
// If gate still appears, bypass it
await bypassAgeGate(page, 'Arizona');
```
**Note**: Deeply-Rooted (AZ) does NOT use an age gate, which makes it a good target for preflight testing.
### Dual-Transport Preflight
Workers run BOTH preflight checks on startup:
| Transport | Test Method | Use Case |
|-----------|-------------|----------|
| `curl` | axios + proxy → httpbin.org | Fast API requests |
| `http` | Puppeteer + proxy + StealthPlugin | Anti-detect, browser fingerprint |
**HTTP Preflight Steps**:
1. Get proxy from pool (CrawlRotator)
2. Visit fingerprint.com demo (or amiunique.org fallback) to verify IP and anti-detect
3. Visit Dutchie embedded menu to establish session
4. Make GraphQL request from browser context
**Files**:
- `src/services/curl-preflight.ts`
- `src/services/puppeteer-preflight.ts`
- `migrations/084_dual_transport_preflight.sql`
**Task Method Column**: Tasks have a `method` column (`'curl' | 'http' | null`):
- `null` = any worker can claim
- `'curl'` = only workers with passed curl preflight
- `'http'` = only workers with passed http preflight
Currently ALL crawl tasks require `method = 'http'`.
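The claim rules above can be expressed as a small predicate. This is a sketch only: the `Task` and `WorkerPreflight` shapes and the `canClaim` name are hypothetical and do not reflect the actual `worker_tasks` or `worker_registry` column names.

```typescript
type Transport = 'curl' | 'http';

// Hypothetical in-memory view of a worker's preflight results.
interface WorkerPreflight {
  curlPassed: boolean;
  httpPassed: boolean;
}

interface Task {
  id: number;
  method: Transport | null;
}

// True when the worker is allowed to claim the task under the method rules.
function canClaim(task: Task, worker: WorkerPreflight): boolean {
  if (task.method === null) return true; // any worker can claim
  return task.method === 'curl' ? worker.curlPassed : worker.httpPassed;
}
```

Because all crawl tasks currently use `method = 'http'`, a worker that passed only the curl preflight would claim nothing under this rule.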
### Anti-Detect Fingerprint Distribution
Browser fingerprints are randomized using realistic market share distributions:
**Files**:
- `src/services/crawl-rotator.ts` - Device/browser selection
- `src/services/http-fingerprint.ts` - HTTP header fingerprinting
**Device Weights** (matches real traffic patterns):
| Device | Weight | Percentage |
|--------|--------|------------|
| Mobile | 62 | 62% |
| Desktop | 36 | 36% |
| Tablet | 2 | 2% |
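One way to sample a device class from these weights is a cumulative-weight roll, sketched below. The `pickDevice` name and the injectable `random` parameter are assumptions for illustration; the selection logic in `src/services/crawl-rotator.ts` may be implemented differently.

```typescript
type Device = 'mobile' | 'desktop' | 'tablet';

// Weights from the table above; they sum to 100.
const DEVICE_WEIGHTS: ReadonlyArray<[Device, number]> = [
  ['mobile', 62],
  ['desktop', 36],
  ['tablet', 2],
];

// Picks a device with probability proportional to its weight.
// `random` must return a value in [0, 1); it is injectable for testing.
function pickDevice(random: () => number = Math.random): Device {
  const total = DEVICE_WEIGHTS.reduce((sum, [, weight]) => sum + weight, 0);
  let roll = random() * total;
  for (const [device, weight] of DEVICE_WEIGHTS) {
    if (roll < weight) return device;
    roll -= weight;
  }
  // Floating-point fallback: return the last entry.
  return 'tablet';
}
```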
**Allowed Browsers** (only realistic ones):
- Chrome (67% market share)
- Safari (20% market share)
- Edge (6% market share)
- Firefox (3% market share)
All other browsers are filtered out. User-agent strings are generated with the `intoli/user-agents` library.
**HTTP Header Fingerprinting**:
- DNT (Do Not Track): 30% probability of sending
- Accept headers: Browser-specific variations
- Header ordering: Matches real browser behavior (Chrome, Firefox, Safari, Edge each have unique order)
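The 30% DNT rule can be sketched as follows. The `withOptionalDnt` helper is hypothetical, and appending `DNT` at the end of the header object is a simplification: the real per-browser header order lives in `src/services/http-fingerprint.ts` and is not reproduced here.

```typescript
const DNT_PROBABILITY = 0.3;

// Returns a copy of `headers`, adding `DNT: 1` roughly 30% of the time.
// `random` must return a value in [0, 1); it is injectable for testing.
function withOptionalDnt(
  headers: Record<string, string>,
  random: () => number = Math.random,
): Record<string, string> {
  if (random() < DNT_PROBABILITY) {
    return { ...headers, DNT: '1' };
  }
  return { ...headers };
}
```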
**curl-impersonate Binaries** (for curl transport):
| Browser | Binary |
|---------|--------|
| Chrome | `curl_chrome131` |
| Edge | `curl_chrome131` |
| Firefox | `curl_ff133` |
| Safari | `curl_safari17` |
These binaries mimic real browser TLS fingerprints to avoid detection.
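The table above maps directly onto a lookup, sketched here. The `binaryForBrowser` name and the case-insensitive matching are assumptions for illustration; returning `null` for anything outside the four allowed browsers mirrors the filtering rule stated earlier.

```typescript
// Binary names taken from the table above; Edge reuses the Chrome build.
const CURL_IMPERSONATE_BINARY: Record<string, string> = {
  chrome: 'curl_chrome131',
  edge: 'curl_chrome131',
  firefox: 'curl_ff133',
  safari: 'curl_safari17',
};

// Returns the curl-impersonate binary for a browser name, or null if unsupported.
function binaryForBrowser(browser: string): string | null {
  return CURL_IMPERSONATE_BINARY[browser.trim().toLowerCase()] ?? null;
}
```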
---
## Documentation
| Doc | Purpose |
|-----|---------|
| `backend/docs/CODEBASE_MAP.md` | Current files/directories |
| `backend/docs/_archive/` | Historical docs (may be outdated) |