feat: Stealth worker system with mandatory proxy rotation
## Worker System
- Role-agnostic workers that can handle any task type
- Pod-based architecture with StatefulSet (5-15 pods, 5 workers each)
- Custom pod names (Aethelgard, Xylos, Kryll, etc.)
- Worker registry with friendly names and resource monitoring
- Hub-and-spoke visualization on JobQueue page

## Stealth & Anti-Detection (REQUIRED)
- Proxies are MANDATORY - workers fail to start without active proxies
- CrawlRotator initializes on worker startup
- Loads proxies from `proxies` table
- Auto-rotates proxy + fingerprint on 403 errors
- 12 browser fingerprints (Chrome, Firefox, Safari, Edge)
- Locale/timezone matching for geographic consistency

## Task System
- Renamed product_resync → product_refresh
- Task chaining: store_discovery → entry_point → product_discovery
- Priority-based claiming with FOR UPDATE SKIP LOCKED
- Heartbeat and stale task recovery

## UI Updates
- JobQueue: Pod visualization, resource monitoring on hover
- WorkersDashboard: Simplified worker list
- Removed unused filters from task list

## Other
- IP2Location service for visitor analytics
- Findagram consumer features scaffolding
- Documentation updates

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
353
docs/CRAWL_SYSTEM_V2.md
Normal file
@@ -0,0 +1,353 @@
|
||||
# CannaiQ Crawl System V2
|
||||
|
||||
## Overview
|
||||
|
||||
The CannaiQ Crawl System is a GraphQL-based data pipeline that discovers and monitors cannabis dispensaries whose menus run on the Dutchie platform. It operates in two phases:
|
||||
|
||||
1. **Phase 1: Store Discovery** - Weekly discovery of Dutchie-powered dispensaries
|
||||
2. **Phase 2: Product Crawling** - Regular product/price/stock updates (documented separately)
|
||||
|
||||
---
|
||||
|
||||
## Phase 1: Store Discovery
|
||||
|
||||
### Purpose
|
||||
|
||||
Automatically discover and maintain a database of dispensaries that use Dutchie menus across all US states.
|
||||
|
||||
### Schedule
|
||||
|
||||
- **Frequency**: Weekly (typically Sunday night)
|
||||
- **Duration**: ~2-4 hours for full US coverage
|
||||
|
||||
### Flow Diagram
|
||||
|
||||
```
|
||||
┌─────────────────────────────────────────────────────────────────────┐
│                     PHASE 1: STORE DISCOVERY                        │
└─────────────────────────────────────────────────────────────────────┘

1. IDENTITY SETUP
   ┌──────────────────┐
   │ getRandomProxy() │ ──► Random IP from proxy pool
   └──────────────────┘
            │
            ▼
   ┌──────────────────┐
   │ startSession()   │ ──► Random UA + fingerprint + locale matching proxy location
   └──────────────────┘

2. CITY DISCOVERY (per state)
   ┌──────────────────────────────┐
   │ GraphQL: getAllCitiesByState │ ──► Returns cities with active dispensaries
   └──────────────────────────────┘
            │
            ▼
   ┌──────────────────────────────┐
   │ Upsert dutchie_discovery_    │
   │ cities table                 │
   └──────────────────────────────┘

3. STORE DISCOVERY (per city)
   ┌───────────────────────────────┐
   │ GraphQL: ConsumerDispensaries │ ──► Returns store data for city
   └───────────────────────────────┘
            │
            ▼
   ┌───────────────────────────────┐
   │ Upsert dutchie_discovery_     │
   │ locations table               │
   └───────────────────────────────┘

4. VALIDATION & PROMOTION
   ┌──────────────────────────┐
   │ validateForPromotion()   │ ──► Check required fields
   └──────────────────────────┘
            │
            ▼
   ┌──────────────────────────┐
   │ promoteLocation()        │ ──► Upsert to dispensaries table
   └──────────────────────────┘
            │
            ▼
   ┌──────────────────────────┐
   │ ensureCrawlerProfile()   │ ──► Create profile with status='sandbox'
   └──────────────────────────┘

5. DROPPED STORE DETECTION
   ┌──────────────────────────┐
   │ detectDroppedStores()    │ ──► Find stores missing from discovery
   └──────────────────────────┘
            │
            ▼
   ┌──────────────────────────┐
   │ Mark status='dropped'    │ ──► Dashboard alert for review
   └──────────────────────────┘
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Key Files
|
||||
|
||||
| File | Purpose |
|------|---------|
| `backend/src/platforms/dutchie/client.ts` | HTTP client with proxy/fingerprint rotation |
| `backend/src/discovery/discovery-crawler.ts` | Main discovery orchestrator |
| `backend/src/discovery/location-discovery.ts` | City/store GraphQL fetching |
| `backend/src/discovery/promotion.ts` | Validation and promotion logic |
| `backend/src/scripts/run-discovery.ts` | CLI entry point |
|
||||
|
||||
---
|
||||
|
||||
## Identity Masking
|
||||
|
||||
Before any GraphQL queries, the system establishes a masked identity:
|
||||
|
||||
### 1. Proxy Selection
|
||||
|
||||
```typescript
|
||||
// backend/src/platforms/dutchie/client.ts
|
||||
|
||||
// Get random proxy from active pool (NOT state-specific)
|
||||
const proxy = await getRandomProxy();
|
||||
setProxy(proxy.url);
|
||||
```
|
||||
|
||||
The proxy is selected randomly from the active proxy pool. It is NOT geo-targeted to the state being crawled.
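
The selection step can be sketched as follows. This is a minimal illustration, not the code in `client.ts`: the `ProxyEntry` shape and the `pickRandomProxy` name are hypothetical, and failing on an empty pool is an assumption based on the commit note that proxies are mandatory.

```typescript
// Hypothetical shape; the real `proxies` table columns are not shown in this document.
interface ProxyEntry {
  url: string;
  active: boolean;
}

// Picks one active proxy uniformly at random. `rng` is injectable so the
// choice can be made deterministic in tests.
function pickRandomProxy(pool: ProxyEntry[], rng: () => number = Math.random): ProxyEntry {
  const active = pool.filter((p) => p.active);
  if (active.length === 0) {
    // Assumption: an empty pool is a hard failure because proxies are mandatory.
    throw new Error('No active proxies available');
  }
  const index = Math.floor(rng() * active.length);
  return active[index];
}

// Example pool with placeholder URLs.
const samplePool: ProxyEntry[] = [
  { url: 'http://proxy-a.example:8080', active: true },
  { url: 'http://proxy-b.example:8080', active: false },
  { url: 'http://proxy-c.example:8080', active: true },
];
const chosen = pickRandomProxy(samplePool);
console.log(chosen.url);
```

Because the state being crawled never enters the function, the choice stays independent of geography, as described above.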
|
||||
|
||||
### 2. Fingerprint + Locale Harmonization
|
||||
|
||||
```typescript
|
||||
// backend/src/platforms/dutchie/client.ts
|
||||
|
||||
function startSession(stateCode: string, timezone: string) {
|
||||
// 1. Random browser fingerprint (Chrome/Firefox/Safari/Edge variants)
|
||||
const fingerprint = getRandomFingerprint();
|
||||
|
||||
// 2. Match Accept-Language to proxy's timezone/location
|
||||
const locale = getLocaleForTimezone(timezone);
|
||||
|
||||
// 3. Set headers for this session
|
||||
currentSession = {
|
||||
userAgent: fingerprint.ua,
|
||||
acceptLanguage: locale,
|
||||
secChUa: fingerprint.secChUa,
|
||||
// ... other fingerprint headers
|
||||
};
|
||||
}
|
||||
```
|
||||
|
||||
### Fingerprint Pool
|
||||
|
||||
6 browser fingerprints rotate on each session and on 403 errors:
|
||||
|
||||
| Browser | Version | Platform |
|---------|---------|----------|
| Chrome | 120 | Windows |
| Chrome | 120 | macOS |
| Firefox | 121 | Windows |
| Firefox | 121 | macOS |
| Safari | 17.2 | macOS |
| Edge | 120 | Windows |
|
||||
|
||||
### Timezone → Locale Mapping
|
||||
|
||||
```typescript
|
||||
const TIMEZONE_TO_LOCALE: Record<string, string> = {
|
||||
'America/New_York': 'en-US,en;q=0.9',
|
||||
'America/Chicago': 'en-US,en;q=0.9',
|
||||
'America/Denver': 'en-US,en;q=0.9',
|
||||
'America/Los_Angeles': 'en-US,en;q=0.9',
|
||||
'America/Phoenix': 'en-US,en;q=0.9',
|
||||
// ...
|
||||
};
|
||||
```
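
A lookup over a table like this might look like the sketch below. The body of `getLocaleForTimezone` is not shown in this document, so `lookupLocale`, its parameters, and the fallback value are all assumptions made for illustration.

```typescript
// Assumed fallback; the document does not say what happens for unmapped timezones.
const FALLBACK_LOCALE = 'en-US,en;q=0.9';

// Looks up the Accept-Language value for a timezone in the given table.
function lookupLocale(
  timezone: string,
  table: Record<string, string>,
  fallback: string = FALLBACK_LOCALE,
): string {
  // hasOwnProperty guards against inherited keys such as "constructor".
  if (Object.prototype.hasOwnProperty.call(table, timezone)) {
    return table[timezone];
  }
  return fallback;
}

// One entry copied from the mapping above; the rest is omitted here.
const sampleTable: Record<string, string> = {
  'America/Phoenix': 'en-US,en;q=0.9',
};
console.log(lookupLocale('America/Phoenix', sampleTable));
console.log(lookupLocale('Unknown/Zone', sampleTable));
```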
|
||||
|
||||
---
|
||||
|
||||
## GraphQL Queries
|
||||
|
||||
### 1. getAllCitiesByState
|
||||
|
||||
Fetches cities with active dispensaries for a state.
|
||||
|
||||
```typescript
|
||||
// backend/src/discovery/location-discovery.ts
|
||||
|
||||
const response = await executeGraphQL({
|
||||
operationName: 'getAllCitiesByState',
|
||||
variables: {
|
||||
state: 'AZ',
|
||||
countryCode: 'US'
|
||||
}
|
||||
});
|
||||
// Returns: { cities: [{ name: 'Phoenix', slug: 'phoenix' }, ...] }
|
||||
```
|
||||
|
||||
**Hash**: `ae547a0466ace5a48f91e55bf6699eacd87e3a42841560f0c0eabed5a0a920e6`
|
||||
|
||||
### 2. ConsumerDispensaries
|
||||
|
||||
Fetches store data for a city/state.
|
||||
|
||||
```typescript
|
||||
// backend/src/discovery/location-discovery.ts
|
||||
|
||||
const response = await executeGraphQL({
|
||||
operationName: 'ConsumerDispensaries',
|
||||
variables: {
|
||||
dispensaryFilter: {
|
||||
city: 'Phoenix',
|
||||
state: 'AZ',
|
||||
activeOnly: true
|
||||
}
|
||||
}
|
||||
});
|
||||
// Returns: [{ id, name, address, coords, menuUrl, ... }, ...]
|
||||
```
|
||||
|
||||
**Hash**: `0a5bfa6ca1d64ae47bcccb7c8077c87147cbc4e6982c17ceec97a2a4948b311b`
|
||||
|
||||
---
|
||||
|
||||
## Database Tables
|
||||
|
||||
### Discovery Tables (Staging)
|
||||
|
||||
| Table | Purpose |
|-------|---------|
| `dutchie_discovery_cities` | Cities known to have dispensaries |
| `dutchie_discovery_locations` | Raw discovered store data |
|
||||
|
||||
### Canonical Tables
|
||||
|
||||
| Table | Purpose |
|-------|---------|
| `dispensaries` | Promoted stores ready for crawling |
| `dispensary_crawler_profiles` | Crawler configuration per store |
| `dutchie_promotion_log` | Audit trail for all discovery actions |
|
||||
|
||||
---
|
||||
|
||||
## Validation Rules
|
||||
|
||||
A discovery location must have these fields to be promoted:
|
||||
|
||||
| Field | Requirement |
|-------|-------------|
| `platform_location_id` | MongoDB ObjectId (24 hex chars) |
| `name` | Non-empty string |
| `city` | Non-empty string |
| `state_code` | Non-empty string |
| `platform_menu_url` | Valid URL |
|
||||
|
||||
Invalid records are marked `status='rejected'` with errors logged.
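
The rules in the table can be expressed as a small checker. This is a sketch under stated assumptions, not the code in `promotion.ts`: `validateLocation`, the `DiscoveryLocation` shape, and the error strings are hypothetical, and "Valid URL" is read here as an absolute http(s) URL.

```typescript
// Hypothetical subset of a discovery row; only the validated fields are modelled.
interface DiscoveryLocation {
  platform_location_id?: string | null;
  name?: string | null;
  city?: string | null;
  state_code?: string | null;
  platform_menu_url?: string | null;
}

interface ValidationResult {
  valid: boolean;
  errors: string[];
}

// A MongoDB ObjectId is written as 24 hexadecimal characters.
const OBJECT_ID_PATTERN = /^[0-9a-f]{24}$/i;

function isNonEmpty(value: string | null | undefined): boolean {
  return typeof value === 'string' && value.trim().length > 0;
}

// Assumption: only absolute http(s) URLs count as valid menu URLs.
function isValidUrl(value: string | null | undefined): boolean {
  if (!value) return false;
  try {
    const url = new URL(value);
    return url.protocol === 'http:' || url.protocol === 'https:';
  } catch {
    return false;
  }
}

function validateLocation(loc: DiscoveryLocation): ValidationResult {
  const errors: string[] = [];
  if (!OBJECT_ID_PATTERN.test(loc.platform_location_id ?? '')) {
    errors.push('platform_location_id must be 24 hex chars');
  }
  if (!isNonEmpty(loc.name)) errors.push('name is required');
  if (!isNonEmpty(loc.city)) errors.push('city is required');
  if (!isNonEmpty(loc.state_code)) errors.push('state_code is required');
  if (!isValidUrl(loc.platform_menu_url)) errors.push('platform_menu_url must be a valid URL');
  return { valid: errors.length === 0, errors };
}

// Example with placeholder values.
const sampleLocation: DiscoveryLocation = {
  platform_location_id: '5f1e2d3c4b5a69788796a5b4',
  name: 'Example Dispensary',
  city: 'Phoenix',
  state_code: 'AZ',
  platform_menu_url: 'https://dutchie.com/dispensary/example-store',
};
console.log(validateLocation(sampleLocation));
```

Collecting every failure, rather than stopping at the first, gives the rejection log a complete list of reasons for each record.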
|
||||
|
||||
---
|
||||
|
||||
## Dropped Store Detection
|
||||
|
||||
After discovery, the system identifies stores that may have left the Dutchie platform:
|
||||
|
||||
### Detection Criteria
|
||||
|
||||
A store is marked as "dropped" if:
|
||||
|
||||
1. It has a `platform_dispensary_id` (was previously verified)
|
||||
2. It's currently `status='open'` and `crawl_enabled=true`
|
||||
3. It was NOT seen in the latest discovery (not in `dutchie_discovery_locations` with `last_seen_at` in last 24 hours)
|
||||
|
||||
### Implementation
|
||||
|
||||
```typescript
|
||||
// backend/src/discovery/discovery-crawler.ts
|
||||
|
||||
export async function detectDroppedStores(pool: Pool, stateCode?: string) {
|
||||
// 1. Find dispensaries not in recent discovery
|
||||
// 2. Mark status='dropped'
|
||||
// 3. Log to dutchie_promotion_log
|
||||
// 4. Return list for dashboard alert
|
||||
}
|
||||
```
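
The three detection criteria can be modelled in memory as below. This is only a sketch: the real function queries Postgres, and `findDroppedStores`, the `DispensaryRow` shape, and matching stores to discovery rows by platform ID are assumptions made for illustration.

```typescript
// Hypothetical subset of a `dispensaries` row.
interface DispensaryRow {
  id: number;
  platform_dispensary_id: string | null;
  status: string;
  crawl_enabled: boolean;
}

const DAY_MS = 24 * 60 * 60 * 1000;

// `lastSeenAt` maps a platform ID to the discovery row's last_seen_at.
// Assumption: stores and discovery rows are matched by platform ID.
function findDroppedStores(
  dispensaries: DispensaryRow[],
  lastSeenAt: Map<string, Date>,
  now: Date,
): DispensaryRow[] {
  return dispensaries.filter((d) => {
    // 1. Must have been verified previously.
    if (d.platform_dispensary_id === null) return false;
    // 2. Must currently be open and enabled for crawling.
    if (d.status !== 'open' || !d.crawl_enabled) return false;
    // 3. Must be missing from discovery within the last 24 hours.
    const seen = lastSeenAt.get(d.platform_dispensary_id);
    return seen === undefined || now.getTime() - seen.getTime() > DAY_MS;
  });
}

// Example data with placeholder IDs.
const sampleNow = new Date('2025-01-08T00:00:00Z');
const sampleStores: DispensaryRow[] = [
  { id: 1, platform_dispensary_id: 'a', status: 'open', crawl_enabled: true },
  { id: 2, platform_dispensary_id: 'b', status: 'open', crawl_enabled: true },
  { id: 3, platform_dispensary_id: 'c', status: 'open', crawl_enabled: true },
  { id: 4, platform_dispensary_id: null, status: 'open', crawl_enabled: true },
  { id: 5, platform_dispensary_id: 'e', status: 'closed', crawl_enabled: true },
];
const sampleSeen = new Map<string, Date>([
  ['a', new Date('2025-01-07T23:00:00Z')], // seen one hour ago
  ['b', new Date('2025-01-05T00:00:00Z')], // seen three days ago
]);
const dropped = findDroppedStores(sampleStores, sampleSeen, sampleNow);
console.log(dropped.map((d) => d.id));
```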
|
||||
|
||||
### Admin UI
|
||||
|
||||
- **Dashboard**: Red alert banner when dropped stores exist
|
||||
- **Dispensaries page**: Filter by `status=dropped` to review
|
||||
|
||||
---
|
||||
|
||||
## CLI Usage
|
||||
|
||||
```bash
|
||||
# Discover all stores in a state
|
||||
npx tsx src/scripts/run-discovery.ts discover:state AZ
|
||||
|
||||
# Discover all US states
|
||||
npx tsx src/scripts/run-discovery.ts discover:all
|
||||
|
||||
# Dry run (no DB writes)
|
||||
npx tsx src/scripts/run-discovery.ts discover:state CA --dry-run
|
||||
|
||||
# Check stats
|
||||
npx tsx src/scripts/run-discovery.ts stats
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Rate Limiting
|
||||
|
||||
- **2 seconds** between city requests
|
||||
- **Exponential backoff** on 429/403 responses
|
||||
- **Fingerprint rotation** on 403 errors
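
Exponential backoff doubles the wait after each failed attempt. The sketch below assumes a 2-second base (borrowed from the city delay above) and a 60-second cap; neither value is stated for backoff in this document, and `backoffDelayMs` is a hypothetical name.

```typescript
// Delay before retry number `attempt` (0-based): base * 2^attempt, capped at maxMs.
// baseMs and maxMs are illustrative defaults, not values from the crawler.
function backoffDelayMs(attempt: number, baseMs: number = 2000, maxMs: number = 60000): number {
  if (attempt < 0) {
    throw new Error('attempt must be >= 0');
  }
  return Math.min(maxMs, baseMs * Math.pow(2, attempt));
}

// Print the delay for the first six attempts.
for (let attempt = 0; attempt < 6; attempt++) {
  console.log(`attempt ${attempt}: wait ${backoffDelayMs(attempt)} ms`);
}
```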
|
||||
|
||||
---
|
||||
|
||||
## Error Handling
|
||||
|
||||
| Error | Action |
|-------|--------|
| 403 Forbidden | Rotate fingerprint, retry |
| 429 Rate Limited | Wait 30s, retry |
| Network timeout | Retry up to 3 times |
| GraphQL error | Log and continue to next city |
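
The table maps naturally onto a small decision function. The sketch below is illustrative only: the `CrawlError` and `RecoveryAction` types and the `decideAction` name are hypothetical, and skipping the city for unlisted HTTP statuses or exhausted timeouts is an assumption.

```typescript
type CrawlError =
  | { kind: 'http'; status: number }
  | { kind: 'timeout' }
  | { kind: 'graphql'; message: string };

type RecoveryAction =
  | 'rotate_fingerprint_and_retry'
  | 'wait_30s_and_retry'
  | 'retry'
  | 'skip_city';

const MAX_TIMEOUT_RETRIES = 3;

// `retriesSoFar` counts retries already made for the current request.
function decideAction(error: CrawlError, retriesSoFar: number): RecoveryAction {
  if (error.kind === 'http') {
    if (error.status === 403) return 'rotate_fingerprint_and_retry';
    if (error.status === 429) return 'wait_30s_and_retry';
    return 'skip_city'; // Assumption: other statuses are not covered by the table.
  }
  if (error.kind === 'timeout') {
    // Assumption: once the retries are used up, move on to the next city.
    return retriesSoFar < MAX_TIMEOUT_RETRIES ? 'retry' : 'skip_city';
  }
  // GraphQL error: log and continue to the next city.
  return 'skip_city';
}

console.log(decideAction({ kind: 'http', status: 403 }, 0));
console.log(decideAction({ kind: 'timeout' }, 3));
```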
|
||||
|
||||
---
|
||||
|
||||
## Monitoring
|
||||
|
||||
### Logs
|
||||
|
||||
Discovery progress is logged to stdout:
|
||||
|
||||
```
|
||||
[Discovery] Starting discovery for state: AZ
|
||||
[Discovery] Step 1: Initializing proxy...
|
||||
[Discovery] Step 2: Fetching cities...
|
||||
[Discovery] Found 45 cities for AZ
|
||||
[Discovery] Step 3: Discovering locations...
|
||||
[Discovery] City 1/45: Phoenix - found 28 stores
|
||||
...
|
||||
[Discovery] Step 4: Auto-promoting discovered locations...
|
||||
[Discovery] Created: 5 new dispensaries
|
||||
[Discovery] Updated: 40 existing dispensaries
|
||||
[Discovery] Step 5: Detecting dropped stores...
|
||||
[Discovery] Found 2 dropped stores
|
||||
```
|
||||
|
||||
### Audit Log
|
||||
|
||||
All actions logged to `dutchie_promotion_log`:
|
||||
|
||||
| Action | Description |
|--------|-------------|
| `promoted_create` | New dispensary created |
| `promoted_update` | Existing dispensary updated |
| `rejected` | Validation failed |
| `dropped` | Store not found in discovery |
|
||||
|
||||
---
|
||||
|
||||
## Next: Phase 2
|
||||
|
||||
See `docs/PRODUCT_CRAWL_V2.md` for the product crawling phase (coming next).
|
||||
408
docs/WORKER_SYSTEM.md
Normal file
@@ -0,0 +1,408 @@
|
||||
# CannaiQ Worker System
|
||||
|
||||
## Overview
|
||||
|
||||
The Worker System is a role-based task queue that processes background jobs. All tasks go into a single pool, and workers claim tasks based on their assigned role.
|
||||
|
||||
---
|
||||
|
||||
## Design Pattern: Single Pool, Role-Based Claiming
|
||||
|
||||
```
|
||||
                 ┌─────────────────────────────────────────┐
                 │        TASK POOL (worker_tasks)         │
                 │                                         │
                 │  ┌─────────────────────────────────┐    │
                 │  │ role=store_discovery   pending  │    │
                 │  │ role=product_resync    pending  │    │
                 │  │ role=product_resync    pending  │    │
                 │  │ role=product_resync    pending  │    │
                 │  │ role=analytics_refresh pending  │    │
                 │  │ role=entry_point_disc  pending  │    │
                 │  └─────────────────────────────────┘    │
                 └─────────────────────────────────────────┘
                                      │
         ┌────────────────────────────┼────────────────────────────┐
         │                            │                            │
         ▼                            ▼                            ▼
┌──────────────────┐         ┌──────────────────┐         ┌──────────────────┐
│      WORKER      │         │      WORKER      │         │      WORKER      │
│ role=product_    │         │ role=product_    │         │ role=store_      │
│   resync         │         │   resync         │         │   discovery      │
│                  │         │                  │         │                  │
│ Claims ONLY      │         │ Claims ONLY      │         │ Claims ONLY      │
│ product_resync   │         │ product_resync   │         │ store_discovery  │
│ tasks            │         │ tasks            │         │ tasks            │
└──────────────────┘         └──────────────────┘         └──────────────────┘
|
||||
```
|
||||
|
||||
**Key Points:**
|
||||
- All tasks go into ONE table (`worker_tasks`)
|
||||
- Each worker is assigned ONE role at startup
|
||||
- Workers only claim tasks matching their role
|
||||
- Multiple workers can share the same role (horizontal scaling)
|
||||
|
||||
---
|
||||
|
||||
## Worker Roles
|
||||
|
||||
| Role | Purpose | Per-Store? | Schedule |
|------|---------|------------|----------|
| `store_discovery` | Find new dispensaries via GraphQL | No | Weekly |
| `entry_point_discovery` | Resolve platform IDs from menu URLs | Yes | On-demand |
| `product_discovery` | Initial product fetch for new stores | Yes | On-demand |
| `product_resync` | Regular price/stock updates | Yes | Every 4 hours |
| `analytics_refresh` | Refresh materialized views | No | Daily |
|
||||
|
||||
---
|
||||
|
||||
## Task Lifecycle
|
||||
|
||||
```
|
||||
pending → claimed → running → completed
                       ↓
                    failed
                       ↓
           (retry if < max_retries)
|
||||
```
|
||||
|
||||
| Status | Meaning |
|--------|---------|
| `pending` | Waiting to be claimed |
| `claimed` | Worker has claimed, not yet started |
| `running` | Worker is actively processing |
| `completed` | Successfully finished |
| `failed` | Error occurred |
| `stale` | Worker died (heartbeat timeout) |
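
One reading of the lifecycle is a small transition table. The sketch below is an interpretation, not the implementation: the `TRANSITIONS` map and `canTransition` are hypothetical, and routing `stale` and `failed` back to `pending` while retries remain is an assumption drawn from the diagram.

```typescript
type TaskStatus = 'pending' | 'claimed' | 'running' | 'completed' | 'failed' | 'stale';

// Allowed next states for each status (assumed from the diagram and table).
const TRANSITIONS: Record<TaskStatus, TaskStatus[]> = {
  pending: ['claimed'],
  claimed: ['running', 'stale'],
  running: ['completed', 'failed', 'stale'],
  failed: ['pending'], // retry
  stale: ['pending'], // recovery after heartbeat timeout
  completed: [],
};

function canTransition(
  from: TaskStatus,
  to: TaskStatus,
  retryCount: number = 0,
  maxRetries: number = 3,
): boolean {
  if (TRANSITIONS[from].indexOf(to) === -1) return false;
  // Going back to pending is only allowed while retries remain.
  if (to === 'pending') return retryCount < maxRetries;
  return true;
}

console.log(canTransition('running', 'completed'));
console.log(canTransition('failed', 'pending', 3, 3));
```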
|
||||
|
||||
---
|
||||
|
||||
## Task Chaining
|
||||
|
||||
Tasks automatically create follow-up tasks:
|
||||
|
||||
```
|
||||
store_discovery (finds new stores)
    │
    ├─ Returns newStoreIds[] in result
    ▼
entry_point_discovery (for each new store)
    │
    ├─ Resolves platform_dispensary_id
    ▼
product_discovery (initial crawl)
    │
    ▼
(store enters regular schedule)
    │
    ▼
product_resync (every 4 hours)
|
||||
```
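
The chain can be sketched as a function that turns a finished task into its follow-up tasks. This is an illustration under assumptions: `followUpTasks`, `FollowUp`, and the result shape are hypothetical, and only the `newStoreIds` field is taken from the diagram.

```typescript
type Role =
  | 'store_discovery'
  | 'entry_point_discovery'
  | 'product_discovery'
  | 'product_resync'
  | 'analytics_refresh';

interface FollowUp {
  role: Role;
  dispensaryId: number;
}

// Returns the tasks to enqueue after a task of `role` completes.
function followUpTasks(
  role: Role,
  dispensaryId: number | null,
  result: { newStoreIds?: number[] },
): FollowUp[] {
  switch (role) {
    case 'store_discovery':
      // One entry_point_discovery task per newly found store.
      return (result.newStoreIds ?? []).map(
        (id): FollowUp => ({ role: 'entry_point_discovery', dispensaryId: id }),
      );
    case 'entry_point_discovery':
      // Once the platform ID is resolved, run the initial product crawl.
      return dispensaryId === null ? [] : [{ role: 'product_discovery', dispensaryId }];
    default:
      // product_discovery hands the store over to the regular schedule.
      return [];
  }
}

console.log(followUpTasks('store_discovery', null, { newStoreIds: [101, 102] }));
```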
|
||||
|
||||
---
|
||||
|
||||
## How Claiming Works
|
||||
|
||||
### 1. Worker starts with a role
|
||||
|
||||
```bash
|
||||
WORKER_ROLE=product_resync npx tsx src/tasks/task-worker.ts
|
||||
```
|
||||
|
||||
### 2. Worker loop polls for tasks
|
||||
|
||||
```typescript
|
||||
// Simplified worker loop
|
||||
while (running) {
|
||||
const task = await claimTask(this.role, this.workerId);
|
||||
|
||||
if (!task) {
|
||||
await sleep(5000); // No tasks, wait 5 seconds
|
||||
continue;
|
||||
}
|
||||
|
||||
await processTask(task);
|
||||
}
|
||||
```
|
||||
|
||||
### 3. SQL function claims atomically
|
||||
|
||||
```sql
|
||||
-- claim_task(role, worker_id)
UPDATE worker_tasks
SET status = 'claimed', worker_id = $2, claimed_at = NOW()
WHERE id = (
  SELECT t.id FROM worker_tasks t
  WHERE t.role = $1                           -- Filter by worker's role
    AND t.status = 'pending'
    AND (t.scheduled_for IS NULL OR t.scheduled_for <= NOW())
    -- Per-store locking. NOT EXISTS is NULL-safe: with NOT IN, a task whose
    -- dispensary_id is NULL (e.g. store_discovery) would never match.
    AND NOT EXISTS (
      SELECT 1 FROM worker_tasks active
      WHERE active.dispensary_id = t.dispensary_id
        AND active.status IN ('claimed', 'running')
    )
  ORDER BY t.priority DESC, t.created_at ASC  -- Priority ordering
  LIMIT 1
  FOR UPDATE SKIP LOCKED                      -- Atomic, no race conditions
)
RETURNING *;
|
||||
```
|
||||
|
||||
**Key Features:**
|
||||
- `FOR UPDATE SKIP LOCKED` - Prevents race conditions between workers
|
||||
- Role filtering - Worker only sees tasks for its role
|
||||
- Per-store locking - Only one active task per dispensary
|
||||
- Priority ordering - Higher priority tasks first
|
||||
- Scheduled tasks - Respects `scheduled_for` timestamp
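
The selection rules can be modelled in memory to make the ordering concrete. The sketch below covers only which task is chosen; it does not model row locking, which `FOR UPDATE SKIP LOCKED` handles in the database. `pickClaimable` and the `QueuedTask` shape are hypothetical, and treating tasks without a dispensary as always eligible is an assumption.

```typescript
interface QueuedTask {
  id: number;
  role: string;
  status: 'pending' | 'claimed' | 'running' | 'completed' | 'failed' | 'stale';
  dispensaryId: number | null;
  priority: number;
  createdAt: number; // epoch milliseconds
  scheduledFor: number | null; // epoch milliseconds, or null for "run now"
}

function pickClaimable(tasks: QueuedTask[], role: string, now: number): QueuedTask | undefined {
  // Stores that already have an active task are locked.
  const busyStores = new Set<number>();
  for (const t of tasks) {
    if ((t.status === 'claimed' || t.status === 'running') && t.dispensaryId !== null) {
      busyStores.add(t.dispensaryId);
    }
  }
  const candidates = tasks.filter(
    (t) =>
      t.role === role &&
      t.status === 'pending' &&
      (t.scheduledFor === null || t.scheduledFor <= now) &&
      (t.dispensaryId === null || !busyStores.has(t.dispensaryId)),
  );
  // Higher priority first, then oldest first.
  candidates.sort((a, b) => b.priority - a.priority || a.createdAt - b.createdAt);
  return candidates[0];
}

const sampleTasks: QueuedTask[] = [
  { id: 1, role: 'product_resync', status: 'running', dispensaryId: 10, priority: 0, createdAt: 1, scheduledFor: null },
  { id: 2, role: 'product_resync', status: 'pending', dispensaryId: 10, priority: 9, createdAt: 2, scheduledFor: null },
  { id: 3, role: 'product_resync', status: 'pending', dispensaryId: 11, priority: 1, createdAt: 3, scheduledFor: null },
  { id: 4, role: 'product_resync', status: 'pending', dispensaryId: 12, priority: 5, createdAt: 4, scheduledFor: 5000 },
  { id: 5, role: 'product_resync', status: 'pending', dispensaryId: 13, priority: 1, createdAt: 2, scheduledFor: null },
  { id: 6, role: 'store_discovery', status: 'pending', dispensaryId: null, priority: 0, createdAt: 1, scheduledFor: null },
];
console.log(pickClaimable(sampleTasks, 'product_resync', 1000)?.id);
```

In this sample, task 2 has the highest priority but is skipped because store 10 already has a running task, and task 4 is skipped until its `scheduledFor` time has passed.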
|
||||
|
||||
---
|
||||
|
||||
## Heartbeat & Stale Recovery
|
||||
|
||||
Workers send heartbeats every 30 seconds while processing:
|
||||
|
||||
```typescript
|
||||
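// During task processing
const heartbeat = setInterval(async () => {
  // The callback must be async to use await
  try {
    await pool.query(
      'UPDATE worker_tasks SET last_heartbeat_at = NOW() WHERE id = $1',
      [taskId]
    );
  } catch (err) {
    // A failed heartbeat should not crash the worker
    console.error('Heartbeat update failed', err);
  }
}, 30000);
// Call clearInterval(heartbeat) once the task finishes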
|
||||
```
|
||||
|
||||
If a worker dies, its heartbeat stops. Tasks whose last heartbeat is older than the threshold are returned to `pending`:
|
||||
|
||||
```sql
|
||||
-- recover_stale_tasks(threshold_minutes)
|
||||
UPDATE worker_tasks
|
||||
SET status = 'pending', worker_id = NULL, retry_count = retry_count + 1
|
||||
WHERE status IN ('claimed', 'running')
|
||||
AND last_heartbeat_at < NOW() - INTERVAL '10 minutes'
|
||||
AND retry_count < max_retries;
|
||||
```
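
The recovery condition can be restated as a predicate, which makes one edge case visible: a task with no heartbeat at all never matches. This is a sketch mirroring the SQL above; `isRecoverable` and the `TrackedTask` shape are hypothetical names.

```typescript
interface TrackedTask {
  status: string;
  lastHeartbeatAt: Date | null;
  retryCount: number;
  maxRetries: number;
}

// Mirrors the WHERE clause of the recovery query.
function isRecoverable(task: TrackedTask, now: Date, thresholdMinutes: number = 10): boolean {
  if (task.status !== 'claimed' && task.status !== 'running') return false;
  // In SQL, NULL < timestamp is NULL, so rows without a heartbeat are not matched.
  if (task.lastHeartbeatAt === null) return false;
  if (task.retryCount >= task.maxRetries) return false;
  const cutoff = now.getTime() - thresholdMinutes * 60 * 1000;
  return task.lastHeartbeatAt.getTime() < cutoff;
}

const checkTime = new Date('2025-01-01T00:30:00Z');
const silentTask: TrackedTask = {
  status: 'running',
  lastHeartbeatAt: new Date('2025-01-01T00:00:00Z'), // 30 minutes without a heartbeat
  retryCount: 0,
  maxRetries: 3,
};
console.log(isRecoverable(silentTask, checkTime));
```

If claimed tasks can die before their first heartbeat, the query may need to fall back to `claimed_at`; whether the real `recover_stale_tasks` function does this is not shown here.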
|
||||
|
||||
---
|
||||
|
||||
## Scheduling
|
||||
|
||||
### Daily Resync Generation
|
||||
|
||||
```sql
|
||||
SELECT generate_resync_tasks(6, CURRENT_DATE); -- 6 batches = every 4 hours
|
||||
```
|
||||
|
||||
Creates staggered tasks:
|
||||
| Batch | Time | Stores |
|-------|------|--------|
| 1 | 00:00 | 1-50 |
| 2 | 04:00 | 51-100 |
| 3 | 08:00 | 101-150 |
| 4 | 12:00 | 151-200 |
| 5 | 16:00 | 201-250 |
| 6 | 20:00 | 251-300 |
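
The arithmetic behind the table is a split of the store list into equal slices, spaced `24 / batches` hours apart. The sketch below reproduces the table for 300 stores; `planResyncBatches` is a hypothetical stand-in, and the real `generate_resync_tasks` SQL function may distribute stores differently.

```typescript
interface ResyncBatch {
  batch: number; // 1-based batch number
  startHour: number; // hour of day, 0-23
  storeIds: number[];
}

function planResyncBatches(storeIds: number[], batchCount: number): ResyncBatch[] {
  if (!Number.isInteger(batchCount) || batchCount <= 0) {
    throw new Error('batchCount must be a positive integer');
  }
  const size = Math.ceil(storeIds.length / batchCount);
  const hoursApart = 24 / batchCount;
  const batches: ResyncBatch[] = [];
  for (let i = 0; i < batchCount; i++) {
    batches.push({
      batch: i + 1,
      startHour: i * hoursApart,
      storeIds: storeIds.slice(i * size, (i + 1) * size),
    });
  }
  return batches;
}

// 300 stores numbered 1..300, split into 6 batches.
const allStores = Array.from({ length: 300 }, (_, i) => i + 1);
const plan = planResyncBatches(allStores, 6);
for (const b of plan) {
  const first = b.storeIds[0];
  const last = b.storeIds[b.storeIds.length - 1];
  console.log(`batch ${b.batch} at ${b.startHour}:00 covers stores ${first}-${last}`);
}
```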
|
||||
|
||||
---
|
||||
|
||||
## Files
|
||||
|
||||
### Core
|
||||
|
||||
| File | Purpose |
|------|---------|
| `src/tasks/task-service.ts` | Task CRUD, claiming, capacity metrics |
| `src/tasks/task-worker.ts` | Worker loop, heartbeat, handler dispatch |
| `src/routes/tasks.ts` | REST API endpoints |
| `migrations/074_worker_task_queue.sql` | Database schema + SQL functions |
|
||||
|
||||
### Handlers
|
||||
|
||||
| File | Role |
|------|------|
| `src/tasks/handlers/store-discovery.ts` | `store_discovery` |
| `src/tasks/handlers/entry-point-discovery.ts` | `entry_point_discovery` |
| `src/tasks/handlers/product-discovery.ts` | `product_discovery` |
| `src/tasks/handlers/product-resync.ts` | `product_resync` |
| `src/tasks/handlers/analytics-refresh.ts` | `analytics_refresh` |
|
||||
|
||||
---
|
||||
|
||||
## Running Workers
|
||||
|
||||
### Local Development
|
||||
|
||||
```bash
|
||||
# Start a single worker
|
||||
WORKER_ROLE=product_resync npx tsx src/tasks/task-worker.ts
|
||||
|
||||
# Start multiple workers (different terminals)
|
||||
WORKER_ROLE=product_resync WORKER_ID=resync-1 npx tsx src/tasks/task-worker.ts
|
||||
WORKER_ROLE=product_resync WORKER_ID=resync-2 npx tsx src/tasks/task-worker.ts
|
||||
WORKER_ROLE=store_discovery npx tsx src/tasks/task-worker.ts
|
||||
```
|
||||
|
||||
### Environment Variables
|
||||
|
||||
| Variable | Default | Description |
|----------|---------|-------------|
| `WORKER_ROLE` | (required) | Which task role to process |
| `WORKER_ID` | auto-generated | Custom worker identifier |
| `POLL_INTERVAL_MS` | 5000 | How often to check for tasks |
| `HEARTBEAT_INTERVAL_MS` | 30000 | How often to update heartbeat |
|
||||
|
||||
### Kubernetes
|
||||
|
||||
```yaml
|
||||
apiVersion: apps/v1
kind: Deployment
metadata:
  name: task-worker-resync
spec:
  replicas: 5  # Scale horizontally
  selector:    # Required by apps/v1; the label used here is illustrative
    matchLabels:
      app: task-worker-resync
  template:
    metadata:
      labels:
        app: task-worker-resync
    spec:
      containers:
        - name: worker
          image: code.cannabrands.app/creationshop/dispensary-scraper:latest
          command: ["npx", "tsx", "src/tasks/task-worker.ts"]
          env:
            - name: WORKER_ROLE
              value: "product_resync"
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## API Endpoints
|
||||
|
||||
### Task Management
|
||||
|
||||
| Method | Endpoint | Description |
|--------|----------|-------------|
| GET | `/api/tasks` | List tasks (with filters) |
| POST | `/api/tasks` | Create a task |
| GET | `/api/tasks/:id` | Get task by ID |
| GET | `/api/tasks/counts` | Counts by status |
| GET | `/api/tasks/capacity` | Capacity metrics |
| POST | `/api/tasks/recover-stale` | Recover dead worker tasks |
|
||||
|
||||
### Task Generation
|
||||
|
||||
| Method | Endpoint | Description |
|--------|----------|-------------|
| POST | `/api/tasks/generate/resync` | Generate daily resync batch |
| POST | `/api/tasks/generate/discovery` | Create store discovery task |
|
||||
|
||||
---
|
||||
|
||||
## Capacity Planning
|
||||
|
||||
The `v_worker_capacity` view provides metrics:
|
||||
|
||||
```sql
|
||||
SELECT * FROM v_worker_capacity;
|
||||
```
|
||||
|
||||
| Metric | Description |
|--------|-------------|
| `pending_tasks` | Tasks waiting |
| `ready_tasks` | Tasks ready now (scheduled_for passed) |
| `running_tasks` | Tasks being processed |
| `active_workers` | Workers with recent heartbeat |
| `tasks_per_worker_hour` | Throughput estimate |
| `estimated_hours_to_drain` | Time to clear queue |
|
||||
|
||||
### Scaling API
|
||||
|
||||
```bash
|
||||
GET /api/tasks/capacity/product_resync
|
||||
```
|
||||
|
||||
```json
|
||||
{
|
||||
"pending_tasks": 500,
|
||||
"active_workers": 3,
|
||||
"workers_needed": {
|
||||
"for_1_hour": 10,
|
||||
"for_4_hours": 3,
|
||||
"for_8_hours": 2
|
||||
}
|
||||
}
|
||||
```
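
The `workers_needed` figures are consistent with dividing the backlog by per-worker throughput and rounding up. The sketch below reproduces the sample response under the assumption of 50 tasks per worker-hour; that rate is inferred from the numbers above rather than stated, and `workersNeeded` is a hypothetical name.

```typescript
// Workers required to drain `pendingTasks` within `hours`,
// given a throughput of `tasksPerWorkerHour` per worker.
function workersNeeded(pendingTasks: number, tasksPerWorkerHour: number, hours: number): number {
  if (tasksPerWorkerHour <= 0 || hours <= 0) {
    throw new Error('tasksPerWorkerHour and hours must be positive');
  }
  return Math.ceil(pendingTasks / (tasksPerWorkerHour * hours));
}

const pendingSample = 500;
const assumedRate = 50; // inferred: 500 tasks / 10 workers / 1 hour
console.log({
  for_1_hour: workersNeeded(pendingSample, assumedRate, 1),
  for_4_hours: workersNeeded(pendingSample, assumedRate, 4),
  for_8_hours: workersNeeded(pendingSample, assumedRate, 8),
});
```

Rounding up matters here: 500 tasks over 4 hours works out to 2.5 workers, so 3 are needed to finish within the window.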
|
||||
|
||||
---
|
||||
|
||||
## Database Schema
|
||||
|
||||
### worker_tasks
|
||||
|
||||
```sql
|
||||
CREATE TABLE worker_tasks (
|
||||
id SERIAL PRIMARY KEY,
|
||||
|
||||
-- Task identification
|
||||
role VARCHAR(50) NOT NULL,
|
||||
dispensary_id INTEGER REFERENCES dispensaries(id),
|
||||
platform VARCHAR(20),
|
||||
|
||||
-- State
|
||||
status VARCHAR(20) DEFAULT 'pending',
|
||||
priority INTEGER DEFAULT 0,
|
||||
scheduled_for TIMESTAMPTZ,
|
||||
|
||||
-- Ownership
|
||||
worker_id VARCHAR(100),
|
||||
claimed_at TIMESTAMPTZ,
|
||||
started_at TIMESTAMPTZ,
|
||||
completed_at TIMESTAMPTZ,
|
||||
last_heartbeat_at TIMESTAMPTZ,
|
||||
|
||||
-- Results
|
||||
result JSONB,
|
||||
error_message TEXT,
|
||||
retry_count INTEGER DEFAULT 0,
|
||||
max_retries INTEGER DEFAULT 3,
|
||||
|
||||
created_at TIMESTAMPTZ DEFAULT NOW(),
|
||||
updated_at TIMESTAMPTZ DEFAULT NOW()
|
||||
);
|
||||
```
|
||||
|
||||
### Key Indexes
|
||||
|
||||
```sql
|
||||
-- Fast claiming by role
|
||||
CREATE INDEX idx_worker_tasks_pending
|
||||
ON worker_tasks(role, priority DESC, created_at ASC)
|
||||
WHERE status = 'pending';
|
||||
|
||||
-- Prevent duplicate active tasks per store
|
||||
CREATE UNIQUE INDEX idx_worker_tasks_unique_active_store
|
||||
ON worker_tasks(dispensary_id)
|
||||
WHERE status IN ('claimed', 'running') AND dispensary_id IS NOT NULL;
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Monitoring
|
||||
|
||||
### Logs
|
||||
|
||||
```
|
||||
[TaskWorker] Starting worker worker-product_resync-a1b2c3d4 for role: product_resync
|
||||
[TaskWorker] Claimed task 123 (product_resync) for dispensary 456
|
||||
[TaskWorker] Task 123 completed successfully
|
||||
```
|
||||
|
||||
### Health Check
|
||||
|
||||
```sql
|
||||
-- Active workers
|
||||
SELECT worker_id, role, COUNT(*), MAX(last_heartbeat_at)
|
||||
FROM worker_tasks
|
||||
WHERE last_heartbeat_at > NOW() - INTERVAL '5 minutes'
|
||||
GROUP BY worker_id, role;
|
||||
|
||||
-- Task counts by role/status
|
||||
SELECT role, status, COUNT(*)
|
||||
FROM worker_tasks
|
||||
GROUP BY role, status;
|
||||
```
|
||||
@@ -33,8 +33,8 @@ or overwrites of existing data.
|
||||
 | Table | Purpose | Key Columns |
 |-------|---------|-------------|
 | `dispensaries` | Store locations | id, name, slug, city, state, platform_dispensary_id |
-| `dutchie_products` | Canonical products | id, dispensary_id, external_product_id, name, brand_name, stock_status |
-| `dutchie_product_snapshots` | Historical snapshots | dutchie_product_id, crawled_at, rec_min_price_cents |
+| `store_products` | Canonical products | id, dispensary_id, external_product_id, name, brand_name, stock_status |
+| `store_product_snapshots` | Historical snapshots | store_product_id, crawled_at, rec_min_price_cents |
 | `brands` (view: v_brands) | Derived from products | brand_name, brand_id, product_count |
 | `categories` (view: v_categories) | Derived from products | type, subcategory, product_count |
|
||||
|
||||
@@ -147,12 +147,10 @@ CREATE TABLE IF NOT EXISTS products_from_legacy (
 
 ---
 
-### 3. Dutchie Products
+### 3. Products (Legacy dutchie_products)
 
 **Source:** `dutchie_legacy.dutchie_products`
-**Target:** `cannaiq.dutchie_products`
-
-These tables have nearly identical schemas. The mapping is direct:
+**Target:** `cannaiq.store_products`
 
 | Legacy Column | Canonical Column | Notes |
 |---------------|------------------|-------|
|
||||
@@ -180,15 +178,15 @@ ON CONFLICT (dispensary_id, external_product_id) DO NOTHING
 
 ---
 
-### 4. Dutchie Product Snapshots
+### 4. Product Snapshots (Legacy dutchie_product_snapshots)
 
 **Source:** `dutchie_legacy.dutchie_product_snapshots`
-**Target:** `cannaiq.dutchie_product_snapshots`
+**Target:** `cannaiq.store_product_snapshots`
 
 | Legacy Column | Canonical Column | Notes |
 |---------------|------------------|-------|
 | id | - | Generate new |
-| dutchie_product_id | dutchie_product_id | Map via product lookup |
+| dutchie_product_id | store_product_id | Map via product lookup |
 | dispensary_id | dispensary_id | Map via dispensary lookup |
 | crawled_at | crawled_at | Direct |
 | rec_min_price_cents | rec_min_price_cents | Direct |
|
||||
@@ -201,7 +199,7 @@ ON CONFLICT (dispensary_id, external_product_id) DO NOTHING
 ```sql
 -- No unique constraint on snapshots - all are historical records
 -- Just INSERT, no conflict handling needed
-INSERT INTO dutchie_product_snapshots (...) VALUES (...)
+INSERT INTO store_product_snapshots (...) VALUES (...)
 ```
 
 ---
|
||||
|
||||