feat: Stealth worker system with mandatory proxy rotation

## Worker System
- Role-agnostic workers that can handle any task type
- Pod-based architecture with StatefulSet (5-15 pods, 5 workers each)
- Custom pod names (Aethelgard, Xylos, Kryll, etc.)
- Worker registry with friendly names and resource monitoring
- Hub-and-spoke visualization on JobQueue page

## Stealth & Anti-Detection (REQUIRED)
- Proxies are MANDATORY - workers fail to start without active proxies
- CrawlRotator initializes on worker startup
- Loads proxies from `proxies` table
- Auto-rotates proxy + fingerprint on 403 errors
- 12 browser fingerprints (Chrome, Firefox, Safari, Edge)
- Locale/timezone matching for geographic consistency

## Task System
- Renamed product_resync → product_refresh
- Task chaining: store_discovery → entry_point → product_discovery
- Priority-based claiming with FOR UPDATE SKIP LOCKED
- Heartbeat and stale task recovery

## UI Updates
- JobQueue: Pod visualization, resource monitoring on hover
- WorkersDashboard: Simplified worker list
- Removed unused filters from task list

## Other
- IP2Location service for visitor analytics
- Findagram consumer features scaffolding
- Documentation updates

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

@@ -275,6 +275,22 @@ Store metadata:

---

## Worker Roles

Workers pull tasks from the `worker_tasks` queue based on their assigned role.

| Role | Name | Description | Handler |
|------|------|-------------|---------|
| `product_resync` | Product Resync | Re-crawl dispensary products for price/stock changes | `handleProductResync` |
| `product_discovery` | Product Discovery | Initial product discovery for new dispensaries | `handleProductDiscovery` |
| `store_discovery` | Store Discovery | Discover new dispensary locations | `handleStoreDiscovery` |
| `entry_point_discovery` | Entry Point Discovery | Resolve platform IDs from menu URLs | `handleEntryPointDiscovery` |
| `analytics_refresh` | Analytics Refresh | Refresh materialized views and analytics | `handleAnalyticsRefresh` |

**API Endpoint:** `GET /api/worker-registry/roles`

---

## Scheduling

Crawls are scheduled via `worker_tasks` table:
@@ -282,8 +298,219 @@ Crawls are scheduled via `worker_tasks` table:
| Role | Frequency | Description |
|------|-----------|-------------|
| `product_resync` | Every 4 hours | Regular product refresh |
| `product_discovery` | On-demand | First crawl for new stores |
| `entry_point_discovery` | On-demand | New store setup |
| `store_discovery` | Daily | Find new stores |
| `analytics_refresh` | Daily | Refresh analytics materialized views |

---

## Priority & On-Demand Tasks

Tasks are claimed by workers in order of **priority DESC, created_at ASC**.

### Priority Levels

| Priority | Use Case | Example |
|----------|----------|---------|
| 0 | Scheduled/batch tasks | Daily product_resync generation |
| 10 | On-demand/chained tasks | entry_point → product_discovery |
| Higher | Urgent/manual triggers | Admin-triggered immediate crawl |

### Task Chaining

When a task completes, the system automatically creates follow-up tasks:

```
store_discovery (completed)
  └─► entry_point_discovery (priority: 10) for each new store

entry_point_discovery (completed, success)
  └─► product_discovery (priority: 10) for that store

product_discovery (completed)
  └─► [no chain] Store enters regular resync schedule
```
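
The chain can be read as a lookup from a completed role to its follow-up role. The following is a minimal sketch, not the task service's actual code: `followUpTasks`, `CompletedTask`, and `FollowUp` are hypothetical names introduced for illustration, and only the role names and the priority of 10 come from the diagram.

```typescript
// Hypothetical types for illustration; the real task service may differ.
type Role = 'store_discovery' | 'entry_point_discovery' | 'product_discovery';

interface CompletedTask {
  role: Role;
  success: boolean;
  dispensaryId?: number;       // set for store-scoped tasks
  newDispensaryIds?: number[]; // stores found by a store_discovery run
}

interface FollowUp {
  role: Role;
  dispensaryId: number;
  priority: number;
}

const CHAIN_PRIORITY = 10; // priority shown in the diagram

function followUpTasks(task: CompletedTask): FollowUp[] {
  switch (task.role) {
    case 'store_discovery':
      // One entry_point_discovery task per newly discovered store.
      return (task.newDispensaryIds ?? []).map((id) => ({
        role: 'entry_point_discovery' as const,
        dispensaryId: id,
        priority: CHAIN_PRIORITY,
      }));
    case 'entry_point_discovery':
      // Chain only on success, and only when the store is known.
      if (!task.success || task.dispensaryId === undefined) return [];
      return [
        { role: 'product_discovery', dispensaryId: task.dispensaryId, priority: CHAIN_PRIORITY },
      ];
    case 'product_discovery':
    default:
      // No chain: the store enters the regular resync schedule.
      return [];
  }
}
```

Keeping the mapping in one pure function makes the chain easy to test without a database; the caller would pass each result to the task service.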

### On-Demand Task Creation

Use the task service to create high-priority tasks:

```typescript
// Create immediate product resync for a store
await taskService.createTask({
  role: 'product_resync',
  dispensary_id: 123,
  platform: 'dutchie',
  priority: 20, // Higher than batch tasks
});

// Convenience methods with default high priority (10)
await taskService.createEntryPointTask(dispensaryId, 'dutchie');
await taskService.createProductDiscoveryTask(dispensaryId, 'dutchie');
await taskService.createStoreDiscoveryTask('dutchie', 'AZ');
```

### Claim Function

The `claim_task()` SQL function atomically claims tasks:
- Respects priority ordering (higher = first)
- Uses `FOR UPDATE SKIP LOCKED` for concurrency
- Prevents multiple active tasks per store
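
The selection rule can be modeled in memory to make the ordering concrete. This sketch is not the SQL function: `claimNext` and `QueuedTask` are hypothetical names, and the row locking that `FOR UPDATE SKIP LOCKED` provides exists only in the database and is not modeled here.

```typescript
// Illustrative in-memory model of the claim ordering; not the real claim_task().
interface QueuedTask {
  id: number;
  dispensaryId: number | null; // null for tasks not tied to a store
  priority: number;
  createdAt: number;           // epoch milliseconds
  status: 'pending' | 'claimed';
}

function claimNext(tasks: QueuedTask[]): QueuedTask | undefined {
  // Stores that already have an active (claimed) task.
  const busyStores = new Set<number>();
  for (const t of tasks) {
    if (t.status === 'claimed' && t.dispensaryId !== null) busyStores.add(t.dispensaryId);
  }

  const candidates = tasks
    .filter((t) => t.status === 'pending')
    .filter((t) => t.dispensaryId === null || !busyStores.has(t.dispensaryId))
    // priority DESC, then created_at ASC
    .sort((a, b) => b.priority - a.priority || a.createdAt - b.createdAt);

  const next = candidates[0];
  if (next) next.status = 'claimed';
  return next;
}
```

In the database, two workers running the same query concurrently would each skip rows the other has locked; the model above only captures which row a single worker would pick.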

---

## Image Storage

Images are downloaded from Dutchie's AWS S3 and stored locally with on-demand resizing.

### Storage Path
```
/storage/images/products/<state>/<store>/<brand>/<product_id>/image-<hash>.webp
/storage/images/brands/<brand>/logo-<hash>.webp
```

**Example:**
```
/storage/images/products/az/az-deeply-rooted/bud-bros/6913e3cd444eac3935e928b9/image-ae38b1f9.webp
```
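
The path templates can be expressed as small builder functions. This is a sketch under assumptions: `productImagePath`, `brandLogoPath`, and `ProductImageRef` are hypothetical names, the slugs are assumed to be normalized already, the state is assumed to be lowercased (as in the example path), and the hash is passed in because its derivation is not described here.

```typescript
// Hypothetical helpers mirroring the storage path templates.
interface ProductImageRef {
  state: string;     // e.g. 'AZ' or 'az'
  store: string;     // store slug, assumed already normalized
  brand: string;     // brand slug, assumed already normalized
  productId: string;
  hash: string;      // short hash; how it is computed is not specified
}

const IMAGE_ROOT = '/storage/images';

function productImagePath(ref: ProductImageRef): string {
  return [
    IMAGE_ROOT,
    'products',
    ref.state.toLowerCase(), // assumption: state segment is lowercase
    ref.store,
    ref.brand,
    ref.productId,
    `image-${ref.hash}.webp`,
  ].join('/');
}

function brandLogoPath(brand: string, hash: string): string {
  return `${IMAGE_ROOT}/brands/${brand}/logo-${hash}.webp`;
}
```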

### Image Proxy API
Served via `/img/*` with on-demand resizing using **sharp**:

```
GET /img/products/az/az-deeply-rooted/bud-bros/6913e3cd444eac3935e928b9/image-ae38b1f9.webp?w=200
```

| Param | Description |
|-------|-------------|
| `w` | Width in pixels (max 4000) |
| `h` | Height in pixels (max 4000) |
| `q` | Quality 1-100 (default 80) |
| `fit` | cover, contain, fill, inside, outside |
| `blur` | Blur sigma (0.3-1000) |
| `gray` | Grayscale (1 = enabled) |
| `format` | webp, jpeg, png, avif (default webp) |
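
One way to validate these parameters before handing them to an image library is sketched below. It is not the code in `src/routes/image-proxy.ts`: `parseTransformOptions` and `TransformOptions` are hypothetical names, and the choice to ignore out-of-range or unknown values (rather than clamp or reject them) is an assumption. Only the limits and defaults come from the table.

```typescript
type Fit = 'cover' | 'contain' | 'fill' | 'inside' | 'outside';
type Format = 'webp' | 'jpeg' | 'png' | 'avif';

interface TransformOptions {
  width: number | undefined;
  height: number | undefined;
  quality: number;
  fit: Fit | undefined;
  blur: number | undefined;
  grayscale: boolean;
  format: Format;
}

const FITS: readonly Fit[] = ['cover', 'contain', 'fill', 'inside', 'outside'];
const FORMATS: readonly Format[] = ['webp', 'jpeg', 'png', 'avif'];

// Accept a numeric value only when it lies inside [min, max].
function inRange(raw: string | undefined, min: number, max: number): number | undefined {
  if (raw === undefined || raw.trim() === '') return undefined;
  const n = Number(raw);
  return Number.isFinite(n) && n >= min && n <= max ? n : undefined;
}

// Assumption: invalid values are silently ignored and defaults apply.
function parseTransformOptions(query: Record<string, string | undefined>): TransformOptions {
  return {
    width: inRange(query['w'], 1, 4000),
    height: inRange(query['h'], 1, 4000),
    quality: inRange(query['q'], 1, 100) ?? 80,
    fit: FITS.find((f) => f === query['fit']),
    blur: inRange(query['blur'], 0.3, 1000),
    grayscale: query['gray'] === '1',
    format: FORMATS.find((f) => f === query['format']) ?? 'webp',
  };
}
```

For the request shown earlier, `parseTransformOptions({ w: '200' })` yields a width of 200 with quality 80 and format `webp`.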

### Key Files
| File | Purpose |
|------|---------|
| `src/utils/image-storage.ts` | Download & save images to local filesystem |
| `src/routes/image-proxy.ts` | On-demand resize/transform at `/img/*` |

### Download Rules

| Scenario | Image Action |
|----------|--------------|
| **New product (first crawl)** | Download if `primaryImageUrl` exists |
| **Existing product (refresh)** | Download only if `local_image_path` is NULL (backfill) |
| **Product already has local image** | Skip download entirely |

**Logic:**
- Images are downloaded **once** and never re-downloaded on subsequent crawls
- `skipIfExists: true` - filesystem check prevents re-download even if queued
- First crawl: all products get images
- Refresh crawl: only new products or products missing local images
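
The rules reduce to a single decision per product. A minimal sketch, assuming a hypothetical `imageAction` helper and `ProductImageState` shape; the `fileExistsOnDisk` flag stands in for the `skipIfExists` filesystem check:

```typescript
// Hypothetical decision helper; field names mirror the columns mentioned above.
interface ProductImageState {
  primaryImageUrl: string | null; // remote URL from the crawl payload
  localImagePath: string | null;  // value of local_image_path, if any
}

type ImageAction = 'download' | 'skip';

function imageAction(product: ProductImageState, fileExistsOnDisk: boolean): ImageAction {
  if (!product.primaryImageUrl) return 'skip'; // nothing to download
  if (product.localImagePath) return 'skip';   // already stored locally
  if (fileExistsOnDisk) return 'skip';         // skipIfExists: file is already there
  return 'download';                           // first crawl or backfill
}
```

A new product and an existing product with a NULL `local_image_path` take the same branch, which is why the first crawl and the backfill case need no separate handling.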

### Storage Rules
- **NO MinIO** - local filesystem only (`STORAGE_DRIVER=local`)
- Store full resolution, resize on-demand via `/img` proxy
- Convert to webp for consistency using **sharp**
- Preserve original Dutchie URL as fallback in `image_url` column
- Local path stored in `local_image_path` column

---

## Stealth & Anti-Detection

**PROXIES ARE REQUIRED** - Workers will fail to start if no active proxies are available in the database. All HTTP requests to Dutchie go through a proxy.

Workers automatically initialize anti-detection systems on startup.

### Components

| Component | Purpose | Source |
|-----------|---------|--------|
| **CrawlRotator** | Coordinates proxy + UA rotation | `src/services/crawl-rotator.ts` |
| **ProxyRotator** | Round-robin proxy selection, health tracking | `src/services/crawl-rotator.ts` |
| **UserAgentRotator** | Cycles through realistic browser fingerprints | `src/services/crawl-rotator.ts` |
| **Dutchie Client** | Curl-based HTTP with auto-retry on 403 | `src/platforms/dutchie/client.ts` |

### Initialization Flow

```
Worker Start
  │
  ├─► initializeStealth()
  │     │
  │     ├─► CrawlRotator.initialize()
  │     │     └─► Load proxies from `proxies` table
  │     │
  │     └─► setCrawlRotator(rotator)
  │           └─► Wire to Dutchie client
  │
  └─► Process tasks...
```
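
The proxy-loading step is where the mandatory-proxy rule takes effect. The sketch below shows only that fail-fast check on rows that have already been fetched; `requireActiveProxies` and `ProxyRow` are hypothetical names, and the real `CrawlRotator.initialize()` may be structured differently.

```typescript
// Subset of the `proxies` columns needed for the check.
interface ProxyRow {
  id: number;
  host: string;
  port: number;
  is_active: boolean;
}

// Hypothetical helper: returns the usable proxies or aborts worker startup.
function requireActiveProxies(rows: ProxyRow[]): ProxyRow[] {
  const active = rows.filter((row) => row.is_active);
  if (active.length === 0) {
    throw new Error('No active proxies available; worker cannot start');
  }
  return active;
}
```

Throwing during initialization, before any task is claimed, keeps a misconfigured worker from sending requests without a proxy.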

### Stealth Session (per task)

Each crawl task starts a stealth session:

```typescript
// In product-refresh.ts, entry-point-discovery.ts
const session = startSession(dispensary.state || 'AZ', 'America/Phoenix');
```

This creates a new identity with:
- **Random fingerprint:** Chrome/Firefox/Safari/Edge on Win/Mac/Linux
- **Accept-Language:** Matches timezone (e.g., `America/Phoenix` → `en-US,en;q=0.9`)
- **sec-ch-ua headers:** Proper Client Hints for the browser profile

### On 403 Block

When Dutchie returns 403, the client automatically:

1. Records failure on current proxy (increments `failure_count`)
2. If proxy has 5+ failures, deactivates it
3. Rotates to next healthy proxy
4. Rotates fingerprint
5. Retries the request
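
The five steps can be sketched as a retry loop around a round-robin pool. This is an illustrative model, not the code in `src/services/crawl-rotator.ts` or the Dutchie client: `ProxyPool`, `requestWithRotation`, and the injected `send` callback are hypothetical, the call is synchronous for brevity, fingerprints are rotated sequentially as an assumption, and the retry limit of 3 is taken from the error-handling notes for 403 blocks.

```typescript
interface PoolProxy {
  id: number;
  failureCount: number; // mirrors failure_count
  isActive: boolean;    // mirrors is_active
}

const MAX_FAILURES = 5; // deactivate a proxy at 5+ failures
const MAX_RETRIES = 3;  // assumption: "up to 3 retries" for 403 blocks

class ProxyPool {
  private index = 0;
  private readonly proxies: PoolProxy[];

  constructor(proxies: PoolProxy[]) {
    this.proxies = proxies;
  }

  // Round-robin over active proxies; undefined when none remain.
  next(): PoolProxy | undefined {
    for (let i = 0; i < this.proxies.length; i++) {
      const position = (this.index + i) % this.proxies.length;
      const candidate = this.proxies[position];
      if (candidate && candidate.isActive) {
        this.index = (position + 1) % this.proxies.length;
        return candidate;
      }
    }
    return undefined;
  }

  recordFailure(proxy: PoolProxy): void {
    proxy.failureCount += 1;                                        // step 1
    if (proxy.failureCount >= MAX_FAILURES) proxy.isActive = false; // step 2
  }
}

// `send` stands in for the HTTP call and returns a status code.
function requestWithRotation(
  pool: ProxyPool,
  fingerprintCount: number,
  send: (proxy: PoolProxy, fingerprint: number) => number,
): number {
  let fingerprint = 0;
  let proxy = pool.next();
  for (let attempt = 0; attempt <= MAX_RETRIES; attempt++) {
    if (!proxy) throw new Error('No healthy proxies left');
    const status = send(proxy, fingerprint);
    if (status !== 403) return status;
    pool.recordFailure(proxy);
    proxy = pool.next();                                // step 3
    fingerprint = (fingerprint + 1) % fingerprintCount; // step 4
  }                                                     // step 5: loop retries
  return 403; // still blocked after the initial attempt plus MAX_RETRIES retries
}
```

Injecting `send` keeps the rotation logic testable without network access; a real client would make the HTTP request there and persist the updated counters to the `proxies` table.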

### Proxy Table Schema

```sql
CREATE TABLE proxies (
  id SERIAL PRIMARY KEY,
  host VARCHAR(255) NOT NULL,
  port INTEGER NOT NULL,
  username VARCHAR(100),
  password VARCHAR(100),
  protocol VARCHAR(10) DEFAULT 'http', -- http, https, socks5
  is_active BOOLEAN DEFAULT true,
  last_used_at TIMESTAMPTZ,
  failure_count INTEGER DEFAULT 0,
  success_count INTEGER DEFAULT 0,
  avg_response_time_ms INTEGER,
  last_failure_at TIMESTAMPTZ,
  last_error TEXT
);
```

### Configuration

Proxies are mandatory. There is no environment variable to disable them. Workers will refuse to start without active proxies in the database.

### Fingerprints Available

The client includes 6 browser fingerprints:
- Chrome 131 on Windows
- Chrome 131 on macOS
- Chrome 120 on Windows
- Firefox 133 on Windows
- Safari 17.2 on macOS
- Edge 131 on Windows

Each includes proper `sec-ch-ua`, `sec-ch-ua-platform`, and `sec-ch-ua-mobile` headers.

---

@@ -293,6 +520,7 @@ Crawls are scheduled via `worker_tasks` table:
- **Normalization errors:** Logged as warnings, continue with valid products
- **Image download errors:** Non-fatal, logged, continue
- **Database errors:** Task fails, will be retried
- **403 blocks:** Auto-rotate proxy + fingerprint, retry (up to 3 retries)

---

@@ -305,4 +533,6 @@ Crawls are scheduled via `worker_tasks` table:
| `src/platforms/dutchie/index.ts` | GraphQL client, session management |
| `src/hydration/normalizers/dutchie.ts` | Payload normalization |
| `src/hydration/canonical-upsert.ts` | Database upsert logic |
| `src/utils/image-storage.ts` | Image download and local storage |
| `src/routes/image-proxy.ts` | On-demand image resizing |
| `migrations/075_consecutive_misses.sql` | OOS tracking column |