feat: Stealth worker system with mandatory proxy rotation

## Worker System
- Role-agnostic workers that can handle any task type
- Pod-based architecture with StatefulSet (5-15 pods, 5 workers each)
- Custom pod names (Aethelgard, Xylos, Kryll, etc.)
- Worker registry with friendly names and resource monitoring
- Hub-and-spoke visualization on JobQueue page

## Stealth & Anti-Detection (REQUIRED)
- Proxies are MANDATORY - workers fail to start without active proxies
- CrawlRotator initializes on worker startup
- Loads proxies from `proxies` table
- Auto-rotates proxy + fingerprint on 403 errors
- 12 browser fingerprints (Chrome, Firefox, Safari, Edge)
- Locale/timezone matching for geographic consistency

## Task System
- Renamed product_resync → product_refresh
- Task chaining: store_discovery → entry_point → product_discovery
- Priority-based claiming with FOR UPDATE SKIP LOCKED
- Heartbeat and stale task recovery

## UI Updates
- JobQueue: Pod visualization, resource monitoring on hover
- WorkersDashboard: Simplified worker list
- Removed unused filters from task list

## Other
- IP2Location service for visitor analytics
- Findagram consumer features scaffolding
- Documentation updates

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
This commit is contained in:
Kelly
2025-12-10 00:44:59 -07:00
parent 0295637ed6
commit 56cc171287
61 changed files with 8591 additions and 2076 deletions

View File

@@ -275,6 +275,22 @@ Store metadata:
---
## Worker Roles
Workers pull tasks from the `worker_tasks` queue based on their assigned role.
| Role | Name | Description | Handler |
|------|------|-------------|---------|
| `product_resync` | Product Resync | Re-crawl dispensary products for price/stock changes | `handleProductResync` |
| `product_discovery` | Product Discovery | Initial product discovery for new dispensaries | `handleProductDiscovery` |
| `store_discovery` | Store Discovery | Discover new dispensary locations | `handleStoreDiscovery` |
| `entry_point_discovery` | Entry Point Discovery | Resolve platform IDs from menu URLs | `handleEntryPointDiscovery` |
| `analytics_refresh` | Analytics Refresh | Refresh materialized views and analytics | `handleAnalyticsRefresh` |
**API Endpoint:** `GET /api/worker-registry/roles`
---
## Scheduling
Crawls are scheduled via `worker_tasks` table:
@@ -282,8 +298,219 @@ Crawls are scheduled via `worker_tasks` table:
| Role | Frequency | Description |
|------|-----------|-------------|
| `product_resync` | Every 4 hours | Regular product refresh |
| `product_discovery` | On-demand | First crawl for new stores |
| `entry_point_discovery` | On-demand | New store setup |
| `store_discovery` | Daily | Find new stores |
| `analytics_refresh` | Daily | Refresh analytics materialized views |
---
## Priority & On-Demand Tasks
Tasks are claimed by workers in order of **priority DESC, created_at ASC**.
### Priority Levels
| Priority | Use Case | Example |
|----------|----------|---------|
| 0 | Scheduled/batch tasks | Daily product_resync generation |
| 10 | On-demand/chained tasks | entry_point → product_discovery |
| Higher | Urgent/manual triggers | Admin-triggered immediate crawl |
### Task Chaining
When a task completes, the system automatically creates follow-up tasks:
```
store_discovery (completed)
└─► entry_point_discovery (priority: 10) for each new store
entry_point_discovery (completed, success)
└─► product_discovery (priority: 10) for that store
product_discovery (completed)
└─► [no chain] Store enters regular resync schedule
```
### On-Demand Task Creation
Use the task service to create high-priority tasks:
```typescript
// Create immediate product resync for a store
await taskService.createTask({
role: 'product_resync',
dispensary_id: 123,
platform: 'dutchie',
priority: 20, // Higher than batch tasks
});
// Convenience methods with default high priority (10)
await taskService.createEntryPointTask(dispensaryId, 'dutchie');
await taskService.createProductDiscoveryTask(dispensaryId, 'dutchie');
await taskService.createStoreDiscoveryTask('dutchie', 'AZ');
```
### Claim Function
The `claim_task()` SQL function atomically claims tasks:
- Respects priority ordering (higher = first)
- Uses `FOR UPDATE SKIP LOCKED` for concurrency
- Prevents multiple active tasks per store
---
## Image Storage
Images are downloaded from Dutchie's AWS S3 and stored locally with on-demand resizing.
### Storage Path
```
/storage/images/products/<state>/<store>/<brand>/<product_id>/image-<hash>.webp
/storage/images/brands/<brand>/logo-<hash>.webp
```
**Example:**
```
/storage/images/products/az/az-deeply-rooted/bud-bros/6913e3cd444eac3935e928b9/image-ae38b1f9.webp
```
### Image Proxy API
Served via `/img/*` with on-demand resizing using **sharp**:
```
GET /img/products/az/az-deeply-rooted/bud-bros/6913e3cd444eac3935e928b9/image-ae38b1f9.webp?w=200
```
| Param | Description |
|-------|-------------|
| `w` | Width in pixels (max 4000) |
| `h` | Height in pixels (max 4000) |
| `q` | Quality 1-100 (default 80) |
| `fit` | cover, contain, fill, inside, outside |
| `blur` | Blur sigma (0.3-1000) |
| `gray` | Grayscale (1 = enabled) |
| `format` | webp, jpeg, png, avif (default webp) |
### Key Files
| File | Purpose |
|------|---------|
| `src/utils/image-storage.ts` | Download & save images to local filesystem |
| `src/routes/image-proxy.ts` | On-demand resize/transform at `/img/*` |
### Download Rules
| Scenario | Image Action |
|----------|--------------|
| **New product (first crawl)** | Download if `primaryImageUrl` exists |
| **Existing product (refresh)** | Download only if `local_image_path` is NULL (backfill) |
| **Product already has local image** | Skip download entirely |
**Logic:**
- Images are downloaded **once** and never re-downloaded on subsequent crawls
- `skipIfExists: true` - filesystem check prevents re-download even if queued
- First crawl: all products get images
- Refresh crawl: only new products or products missing local images
### Storage Rules
- **NO MinIO** - local filesystem only (`STORAGE_DRIVER=local`)
- Store full resolution, resize on-demand via `/img` proxy
- Convert to webp for consistency using **sharp**
- Preserve original Dutchie URL as fallback in `image_url` column
- Local path stored in `local_image_path` column
---
## Stealth & Anti-Detection
**PROXIES ARE REQUIRED** - Workers will fail to start if no active proxies are available in the database. All HTTP requests to Dutchie go through a proxy.
Workers automatically initialize anti-detection systems on startup.
### Components
| Component | Purpose | Source |
|-----------|---------|--------|
| **CrawlRotator** | Coordinates proxy + UA rotation | `src/services/crawl-rotator.ts` |
| **ProxyRotator** | Round-robin proxy selection, health tracking | `src/services/crawl-rotator.ts` |
| **UserAgentRotator** | Cycles through realistic browser fingerprints | `src/services/crawl-rotator.ts` |
| **Dutchie Client** | Curl-based HTTP with auto-retry on 403 | `src/platforms/dutchie/client.ts` |
### Initialization Flow
```
Worker Start
├─► initializeStealth()
│ │
│ ├─► CrawlRotator.initialize()
│ │ └─► Load proxies from `proxies` table
│ │
│ └─► setCrawlRotator(rotator)
│ └─► Wire to Dutchie client
└─► Process tasks...
```
### Stealth Session (per task)
Each crawl task starts a stealth session:
```typescript
// In product-refresh.ts, entry-point-discovery.ts
const session = startSession(dispensary.state || 'AZ', 'America/Phoenix');
```
This creates a new identity with:
- **Random fingerprint:** Chrome/Firefox/Safari/Edge on Win/Mac/Linux
- **Accept-Language:** Matches timezone (e.g., `America/Phoenix``en-US,en;q=0.9`)
- **sec-ch-ua headers:** Proper Client Hints for the browser profile
### On 403 Block
When Dutchie returns 403, the client automatically:
1. Records failure on current proxy (increments `failure_count`)
2. If proxy has 5+ failures, deactivates it
3. Rotates to next healthy proxy
4. Rotates fingerprint
5. Retries the request
### Proxy Table Schema
```sql
CREATE TABLE proxies (
id SERIAL PRIMARY KEY,
host VARCHAR(255) NOT NULL,
port INTEGER NOT NULL,
username VARCHAR(100),
password VARCHAR(100),
protocol VARCHAR(10) DEFAULT 'http', -- http, https, socks5
is_active BOOLEAN DEFAULT true,
last_used_at TIMESTAMPTZ,
failure_count INTEGER DEFAULT 0,
success_count INTEGER DEFAULT 0,
avg_response_time_ms INTEGER,
last_failure_at TIMESTAMPTZ,
last_error TEXT
);
```
### Configuration
Proxies are mandatory. There is no environment variable to disable them. Workers will refuse to start without active proxies in the database.
### Fingerprints Available
The client includes 6 browser fingerprints:
- Chrome 131 on Windows
- Chrome 131 on macOS
- Chrome 120 on Windows
- Firefox 133 on Windows
- Safari 17.2 on macOS
- Edge 131 on Windows
Each includes proper `sec-ch-ua`, `sec-ch-ua-platform`, and `sec-ch-ua-mobile` headers.
---
@@ -293,6 +520,7 @@ Crawls are scheduled via `worker_tasks` table:
- **Normalization errors:** Logged as warnings, continue with valid products
- **Image download errors:** Non-fatal, logged, continue
- **Database errors:** Task fails, will be retried
- **403 blocks:** Auto-rotate proxy + fingerprint, retry (up to 3 retries)
---
@@ -305,4 +533,6 @@ Crawls are scheduled via `worker_tasks` table:
| `src/platforms/dutchie/index.ts` | GraphQL client, session management |
| `src/hydration/normalizers/dutchie.ts` | Payload normalization |
| `src/hydration/canonical-upsert.ts` | Database upsert logic |
| `src/utils/image-storage.ts` | Image download and local storage |
| `src/routes/image-proxy.ts` | On-demand image resizing |
| `migrations/075_consecutive_misses.sql` | OOS tracking column |