feat: Stealth worker system with mandatory proxy rotation

## Worker System
- Role-agnostic workers that can handle any task type
- Pod-based architecture with StatefulSet (5-15 pods, 5 workers each)
- Custom pod names (Aethelgard, Xylos, Kryll, etc.)
- Worker registry with friendly names and resource monitoring
- Hub-and-spoke visualization on JobQueue page

## Stealth & Anti-Detection (REQUIRED)
- Proxies are MANDATORY - workers fail to start without active proxies
- CrawlRotator initializes on worker startup
- Loads proxies from `proxies` table
- Auto-rotates proxy + fingerprint on 403 errors
- 12 browser fingerprints (Chrome, Firefox, Safari, Edge)
- Locale/timezone matching for geographic consistency

## Task System
- Renamed product_resync → product_refresh
- Task chaining: store_discovery → entry_point → product_discovery
- Priority-based claiming with FOR UPDATE SKIP LOCKED
- Heartbeat and stale task recovery

## UI Updates
- JobQueue: Pod visualization, resource monitoring on hover
- WorkersDashboard: Simplified worker list
- Removed unused filters from task list

## Other
- IP2Location service for visitor analytics
- Findagram consumer features scaffolding
- Documentation updates

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-10 00:44:59 -07:00
parent 0295637ed6
commit 56cc171287
61 changed files with 8591 additions and 2076 deletions
--- a/backend/docs/CRAWL_PIPELINE.md
+++ b/backend/docs/CRAWL_PIPELINE.md
@@ -275,6 +275,22 @@ Store metadata:

 ---

+## Worker Roles
+
+Workers pull tasks from the `worker_tasks` queue based on their assigned role.
+
+| Role | Name | Description | Handler |
+|------|------|-------------|---------|
+| `product_resync` | Product Resync | Re-crawl dispensary products for price/stock changes | `handleProductResync` |
+| `product_discovery` | Product Discovery | Initial product discovery for new dispensaries | `handleProductDiscovery` |
+| `store_discovery` | Store Discovery | Discover new dispensary locations | `handleStoreDiscovery` |
+| `entry_point_discovery` | Entry Point Discovery | Resolve platform IDs from menu URLs | `handleEntryPointDiscovery` |
+| `analytics_refresh` | Analytics Refresh | Refresh materialized views and analytics | `handleAnalyticsRefresh` |
+
+**API Endpoint:** `GET /api/worker-registry/roles`
+
+---
+
 ## Scheduling

 Crawls are scheduled via `worker_tasks` table:
@@ -282,8 +298,219 @@ Crawls are scheduled via `worker_tasks` table:
 | Role | Frequency | Description |
 |------|-----------|-------------|
 | `product_resync` | Every 4 hours | Regular product refresh |
+| `product_discovery` | On-demand | First crawl for new stores |
 | `entry_point_discovery` | On-demand | New store setup |
 | `store_discovery` | Daily | Find new stores |
+| `analytics_refresh` | Daily | Refresh analytics materialized views |
+
+---
+
+## Priority & On-Demand Tasks
+
+Tasks are claimed by workers in order of **priority DESC, created_at ASC**.
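
The ordering rule above can be sketched as a comparator. This is only an illustration: the `QueuedTask` shape and the `claimOrder` helper are hypothetical, and the real ordering is applied in SQL by `claim_task()`.

```typescript
// Hypothetical task shape; only the two ordering fields matter here.
interface QueuedTask {
  id: number;
  priority: number;
  createdAt: Date;
}

// priority DESC first, then created_at ASC (oldest first within a priority).
function compareForClaim(a: QueuedTask, b: QueuedTask): number {
  if (a.priority !== b.priority) {
    return b.priority - a.priority;
  }
  return a.createdAt.getTime() - b.createdAt.getTime();
}

// Returns task ids in the order workers would claim them.
function claimOrder(tasks: QueuedTask[]): number[] {
  return [...tasks].sort(compareForClaim).map((task) => task.id);
}
```

A priority-20 task created last is still claimed before older priority-10 and priority-0 tasks; ties within a priority go to the oldest task.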
+
+### Priority Levels
+
+| Priority | Use Case | Example |
+|----------|----------|---------|
+| 0 | Scheduled/batch tasks | Daily product_resync generation |
+| 10 | On-demand/chained tasks | entry_point_discovery → product_discovery |
+| Higher | Urgent/manual triggers | Admin-triggered immediate crawl |
+
+### Task Chaining
+
+When a task completes, the system automatically creates follow-up tasks:
+
+```
+store_discovery (completed)
+    └─► entry_point_discovery (priority: 10) for each new store
+
+entry_point_discovery (completed, success)
+    └─► product_discovery (priority: 10) for that store
+
+product_discovery (completed)
+    └─► [no chain] Store enters regular resync schedule
+```
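
The chain above can be read as a pure function from a completed task to its follow-ups. The sketch below assumes hypothetical `CompletedTask` and `FollowUp` shapes (including a `newDispensaryIds` field for stores found by `store_discovery`); it is not the task service's actual code.

```typescript
type Role =
  | 'store_discovery'
  | 'entry_point_discovery'
  | 'product_discovery'
  | 'product_resync'
  | 'analytics_refresh';

// Hypothetical shapes, for illustration only.
interface CompletedTask {
  role: Role;
  success: boolean;
  dispensaryId?: number;
  newDispensaryIds?: number[]; // stores found by a store_discovery run
}

interface FollowUp {
  role: Role;
  dispensaryId: number;
  priority: number;
}

const CHAINED_PRIORITY = 10;

function followUpTasks(task: CompletedTask): FollowUp[] {
  switch (task.role) {
    case 'store_discovery':
      // One entry_point_discovery task per newly discovered store.
      return (task.newDispensaryIds ?? []).map((id): FollowUp => ({
        role: 'entry_point_discovery',
        dispensaryId: id,
        priority: CHAINED_PRIORITY,
      }));
    case 'entry_point_discovery':
      // Only a successful resolution chains into product discovery.
      if (!task.success || task.dispensaryId === undefined) {
        return [];
      }
      return [{
        role: 'product_discovery',
        dispensaryId: task.dispensaryId,
        priority: CHAINED_PRIORITY,
      }];
    default:
      // product_discovery and the scheduled roles do not chain.
      return [];
  }
}
```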
+
+### On-Demand Task Creation
+
+Use the task service to create high-priority tasks:
+
+```typescript
+// Create immediate product resync for a store
+await taskService.createTask({
+  role: 'product_resync',
+  dispensary_id: 123,
+  platform: 'dutchie',
+  priority: 20, // Higher than batch tasks
+});
+
+// Convenience methods with default high priority (10)
+await taskService.createEntryPointTask(dispensaryId, 'dutchie');
+await taskService.createProductDiscoveryTask(dispensaryId, 'dutchie');
+await taskService.createStoreDiscoveryTask('dutchie', 'AZ');
+```
+
+### Claim Function
+
+The `claim_task()` SQL function atomically claims tasks:
+- Respects priority ordering (higher = first)
+- Uses `FOR UPDATE SKIP LOCKED` for concurrency
+- Prevents multiple active tasks per store
+
+---
+
+## Image Storage
+
+Images are downloaded from Dutchie's AWS S3 and stored locally with on-demand resizing.
+
+### Storage Path
+```
+/storage/images/products/<state>/<store>/<brand>/<product_id>/image-<hash>.webp
+/storage/images/brands/<brand>/logo-<hash>.webp
+```
+
+**Example:**
+```
+/storage/images/products/az/az-deeply-rooted/bud-bros/6913e3cd444eac3935e928b9/image-ae38b1f9.webp
+```
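
As an illustration of the layout above, the sketch below builds a product image path and the matching `/img` URL. The `slugify` rule and both helper names are assumptions, and the hash is taken as an input because the source does not say how it is derived.

```typescript
interface ProductImageLocation {
  state: string;
  store: string;
  brand: string;
  productId: string;
  hash: string; // derivation not specified in the source; passed in as-is
}

// Assumed slug rule: lowercase, non-alphanumeric runs become single hyphens.
function slugify(value: string): string {
  return value
    .toLowerCase()
    .trim()
    .replace(/[^a-z0-9]+/g, '-')
    .replace(/^-+|-+$/g, '');
}

function productImagePath(loc: ProductImageLocation): string {
  return [
    '/storage/images/products',
    slugify(loc.state),
    slugify(loc.store),
    slugify(loc.brand),
    loc.productId,
    `image-${loc.hash}.webp`,
  ].join('/');
}

// Maps a storage path to its /img proxy URL with an optional width.
function proxyUrl(storagePath: string, width?: number): string {
  const relative = storagePath.replace(/^\/storage\/images/, '/img');
  return width === undefined ? relative : `${relative}?w=${width}`;
}
```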
+
+### Image Proxy API
+Served via `/img/*` with on-demand resizing using **sharp**:
+
+```
+GET /img/products/az/az-deeply-rooted/bud-bros/6913e3cd444eac3935e928b9/image-ae38b1f9.webp?w=200
+```
+
+| Param | Description |
+|-------|-------------|
+| `w` | Width in pixels (max 4000) |
+| `h` | Height in pixels (max 4000) |
+| `q` | Quality 1-100 (default 80) |
+| `fit` | cover, contain, fill, inside, outside |
+| `blur` | Blur sigma (0.3-1000) |
+| `gray` | Grayscale (1 = enabled) |
+| `format` | webp, jpeg, png, avif (default webp) |
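
The parameter table can be read as a validation step in front of sharp. The following is a sketch only: `parseTransform` and its types are hypothetical, the lower bound of 1 for `w`/`h` is an assumption, and ignoring out-of-range values (rather than rejecting the request) is a design choice the source does not confirm.

```typescript
type Fit = 'cover' | 'contain' | 'fill' | 'inside' | 'outside';
type Format = 'webp' | 'jpeg' | 'png' | 'avif';

interface TransformOptions {
  width: number | undefined;
  height: number | undefined;
  quality: number;
  fit: Fit | undefined;
  blur: number | undefined;
  grayscale: boolean;
  format: Format;
}

const FITS: readonly string[] = ['cover', 'contain', 'fill', 'inside', 'outside'];
const FORMATS: readonly string[] = ['webp', 'jpeg', 'png', 'avif'];

// Returns the parsed number, or undefined when missing, malformed, or out of range.
function numberInRange(raw: string | undefined, min: number, max: number, integer: boolean): number | undefined {
  if (raw === undefined || raw.trim() === '') return undefined;
  const n = Number(raw);
  if (!Number.isFinite(n) || n < min || n > max) return undefined;
  if (integer && !Number.isInteger(n)) return undefined;
  return n;
}

function parseTransform(query: Record<string, string | undefined>): TransformOptions {
  const fit = query.fit;
  const format = query.format;
  return {
    width: numberInRange(query.w, 1, 4000, true),
    height: numberInRange(query.h, 1, 4000, true),
    quality: numberInRange(query.q, 1, 100, true) ?? 80, // documented default
    fit: fit !== undefined && FITS.indexOf(fit) !== -1 ? (fit as Fit) : undefined,
    blur: numberInRange(query.blur, 0.3, 1000, false),
    grayscale: query.gray === '1',
    format: format !== undefined && FORMATS.indexOf(format) !== -1 ? (format as Format) : 'webp',
  };
}
```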
+
+### Key Files
+| File | Purpose |
+|------|---------|
+| `src/utils/image-storage.ts` | Download & save images to local filesystem |
+| `src/routes/image-proxy.ts` | On-demand resize/transform at `/img/*` |
+
+### Download Rules
+
+| Scenario | Image Action |
+|----------|--------------|
+| **New product (first crawl)** | Download if `primaryImageUrl` exists |
+| **Existing product (refresh)** | Download only if `local_image_path` is NULL (backfill) |
+| **Product already has local image** | Skip download entirely |
+
+**Logic:**
+- Images are downloaded **once** and never re-downloaded on subsequent crawls
+- `skipIfExists: true` - filesystem check prevents re-download even if queued
+- First crawl: all products get images
+- Refresh crawl: only new products or products missing local images
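
The download rules reduce to a small predicate. This sketch uses a hypothetical `ProductImageState` shape and passes the filesystem check in as a boolean so the decision logic stays testable; it mirrors the table above rather than the actual `image-storage.ts` code.

```typescript
// Hypothetical view of the fields that drive the decision.
interface ProductImageState {
  primaryImageUrl: string | null; // source URL from the crawl payload
  localImagePath: string | null;  // value of the local_image_path column
}

function shouldDownloadImage(product: ProductImageState, fileExistsOnDisk: boolean): boolean {
  // Nothing to download without a source URL.
  if (!product.primaryImageUrl) return false;
  // Product already has a local image: skip entirely.
  if (product.localImagePath !== null) return false;
  // skipIfExists: a file already on disk blocks re-download even if queued.
  if (fileExistsOnDisk) return false;
  // New product, or existing product being backfilled.
  return true;
}
```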
+
+### Storage Rules
+- **NO MinIO** - local filesystem only (`STORAGE_DRIVER=local`)
+- Store full resolution, resize on-demand via `/img` proxy
+- Convert to webp for consistency using **sharp**
+- Preserve original Dutchie URL as fallback in `image_url` column
+- Local path stored in `local_image_path` column
+
+---
+
+## Stealth & Anti-Detection
+
+**PROXIES ARE REQUIRED** - Workers will fail to start if no active proxies are available in the database. All HTTP requests to Dutchie go through a proxy.
+
+Workers automatically initialize anti-detection systems on startup.
+
+### Components
+
+| Component | Purpose | Source |
+|-----------|---------|--------|
+| **CrawlRotator** | Coordinates proxy + UA rotation | `src/services/crawl-rotator.ts` |
+| **ProxyRotator** | Round-robin proxy selection, health tracking | `src/services/crawl-rotator.ts` |
+| **UserAgentRotator** | Cycles through realistic browser fingerprints | `src/services/crawl-rotator.ts` |
+| **Dutchie Client** | Curl-based HTTP with auto-retry on 403 | `src/platforms/dutchie/client.ts` |
+
+### Initialization Flow
+
+```
+Worker Start
+    │
+    ├─► initializeStealth()
+    │       │
+    │       ├─► CrawlRotator.initialize()
+    │       │       └─► Load proxies from `proxies` table
+    │       │
+    │       └─► setCrawlRotator(rotator)
+    │               └─► Wire to Dutchie client
+    │
+    └─► Process tasks...
+```
+
+### Stealth Session (per task)
+
+Each crawl task starts a stealth session:
+
+```typescript
+// In product-refresh.ts, entry-point-discovery.ts
+const session = startSession(dispensary.state || 'AZ', 'America/Phoenix');
+```
+
+This creates a new identity with:
+- **Random fingerprint:** Chrome/Firefox/Safari/Edge on Win/Mac/Linux
+- **Accept-Language:** Matches timezone (e.g., `America/Phoenix` → `en-US,en;q=0.9`)
+- **sec-ch-ua headers:** Proper Client Hints for the browser profile
+
+### On 403 Block
+
+When Dutchie returns 403, the client automatically:
+
+1. Records failure on current proxy (increments `failure_count`)
+2. If proxy has 5+ failures, deactivates it
+3. Rotates to next healthy proxy
+4. Rotates fingerprint
+5. Retries the request
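
The five steps above can be sketched with in-memory state. Everything here is illustrative: `ProxyState`, `RoundRobinProxies`, and `requestWithRotation` are hypothetical names, the `send` callback stands in for the real curl-based request, and the code is kept synchronous so it can be exercised without a network (a real client would be async). The retry limit of 3 comes from the error-handling notes in this document.

```typescript
interface ProxyState {
  id: number;
  isActive: boolean;
  failureCount: number;
}

const MAX_FAILURES = 5; // deactivate at 5+ failures
const MAX_RETRIES = 3;  // retries after the initial attempt

class RoundRobinProxies {
  private cursor = 0;
  private readonly proxies: ProxyState[];

  constructor(proxies: ProxyState[]) {
    this.proxies = proxies;
  }

  // Returns the next active proxy in round-robin order, or undefined if none remain.
  next(): ProxyState | undefined {
    for (let i = 0; i < this.proxies.length; i++) {
      const candidate = this.proxies[(this.cursor + i) % this.proxies.length];
      if (candidate && candidate.isActive) {
        this.cursor = (this.cursor + i + 1) % this.proxies.length;
        return candidate;
      }
    }
    return undefined;
  }

  // Steps 1-2: count the failure and deactivate the proxy at the threshold.
  recordFailure(proxy: ProxyState): void {
    proxy.failureCount += 1;
    if (proxy.failureCount >= MAX_FAILURES) {
      proxy.isActive = false;
    }
  }
}

function requestWithRotation(
  rotator: RoundRobinProxies,
  fingerprintCount: number,
  send: (proxy: ProxyState, fingerprint: number) => number, // returns an HTTP status
): { status: number; attempts: number } {
  let fingerprint = 0;
  let proxy = rotator.next();
  if (!proxy) throw new Error('No active proxies available');

  for (let attempt = 1; ; attempt++) {
    const status = send(proxy, fingerprint);
    if (status !== 403) return { status, attempts: attempt };

    rotator.recordFailure(proxy);
    if (attempt > MAX_RETRIES) return { status, attempts: attempt };

    proxy = rotator.next();                             // step 3: next healthy proxy
    if (!proxy) throw new Error('No active proxies available');
    fingerprint = (fingerprint + 1) % fingerprintCount; // step 4: rotate fingerprint
  }                                                     // step 5: loop retries
}
```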
+
+### Proxy Table Schema
+
+```sql
+CREATE TABLE proxies (
+  id SERIAL PRIMARY KEY,
+  host VARCHAR(255) NOT NULL,
+  port INTEGER NOT NULL,
+  username VARCHAR(100),
+  password VARCHAR(100),
+  protocol VARCHAR(10) DEFAULT 'http',  -- http, https, socks5
+  is_active BOOLEAN DEFAULT true,
+  last_used_at TIMESTAMPTZ,
+  failure_count INTEGER DEFAULT 0,
+  success_count INTEGER DEFAULT 0,
+  avg_response_time_ms INTEGER,
+  last_failure_at TIMESTAMPTZ,
+  last_error TEXT
+);
+```
+
+### Configuration
+
+Proxies are mandatory. There is no environment variable to disable them. Workers will refuse to start without active proxies in the database.
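
A startup guard along these lines would enforce the rule. This is a sketch under assumptions: `requireActiveProxies` is a hypothetical helper, and `ProxyRow` includes only a few columns from the schema above.

```typescript
// Subset of the `proxies` table columns.
interface ProxyRow {
  id: number;
  host: string;
  port: number;
  is_active: boolean;
}

// Returns the active proxies, or throws so the worker never starts without one.
function requireActiveProxies(rows: ProxyRow[]): ProxyRow[] {
  const active = rows.filter((row) => row.is_active);
  if (active.length === 0) {
    throw new Error('No active proxies in `proxies` table; refusing to start worker');
  }
  return active;
}
```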
+
+### Fingerprints Available
+
+The client includes 6 browser fingerprints:
+- Chrome 131 on Windows
+- Chrome 131 on macOS
+- Chrome 120 on Windows
+- Firefox 133 on Windows
+- Safari 17.2 on macOS
+- Edge 131 on Windows
+
+Each includes proper `sec-ch-ua`, `sec-ch-ua-platform`, and `sec-ch-ua-mobile` headers.

 ---

@@ -293,6 +520,7 @@ Crawls are scheduled via `worker_tasks` table:
 - **Normalization errors:** Logged as warnings, continue with valid products
 - **Image download errors:** Non-fatal, logged, continue
 - **Database errors:** Task fails, will be retried
+- **403 blocks:** Auto-rotate proxy + fingerprint, retry (up to 3 retries)

 ---

@@ -305,4 +533,6 @@ Crawls are scheduled via `worker_tasks` table:
 | `src/platforms/dutchie/index.ts` | GraphQL client, session management |
 | `src/hydration/normalizers/dutchie.ts` | Payload normalization |
 | `src/hydration/canonical-upsert.ts` | Database upsert logic |
+| `src/utils/image-storage.ts` | Image download and local storage |
+| `src/routes/image-proxy.ts` | On-demand image resizing |
 | `migrations/075_consecutive_misses.sql` | OOS tracking column |