Major changes:
- Split crawl into payload_fetch (API → disk) and product_refresh (disk → DB)
- Add task chaining: store_discovery → product_discovery → payload_fetch → product_refresh
- Add payload storage utilities for gzipped JSON on filesystem
- Add /api/payloads endpoints for payload access and diffing
- Add DB-driven TaskScheduler with schedule persistence
- Track newDispensaryIds through discovery promotion for chaining
- Add stealth improvements: HTTP fingerprinting, proxy rotation enhancements
- Add Workers dashboard K8s scaling controls

New files:
- src/tasks/handlers/payload-fetch.ts - Fetches from API, saves to disk
- src/services/task-scheduler.ts - DB-driven schedule management
- src/utils/payload-storage.ts - Payload save/load utilities
- src/routes/payloads.ts - Payload API endpoints
- src/services/http-fingerprint.ts - Browser fingerprint generation
- docs/TASK_WORKFLOW_2024-12-10.md - Complete workflow documentation

Migrations:
- 078: Proxy consecutive 403 tracking
- 079: task_schedules table
- 080: raw_crawl_payloads table
- 081: payload column and last_fetch_at

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
# Task Workflow Documentation
Date: 2024-12-10
This document describes the complete task/job processing architecture after the 2024-12-10 rewrite.
## Complete Architecture

```
┌─────────────────────────────────────────────────────────────────────────────────┐
│ KUBERNETES CLUSTER │
├─────────────────────────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────────────────────────────────────────────────────────────────┐ │
│ │ API SERVER POD (scraper) │ │
│ │ │ │
│ │ ┌──────────────────┐ ┌────────────────────────────────────────┐ │ │
│ │ │ Express API │ │ TaskScheduler │ │ │
│ │ │ │ │ (src/services/task-scheduler.ts) │ │ │
│ │ │ /api/job-queue │ │ │ │ │
│ │ │ /api/tasks │ │ • Polls every 60s │ │ │
│ │ │ /api/schedules │ │ • Checks task_schedules table │ │ │
│ │ └────────┬─────────┘ │ • SELECT FOR UPDATE SKIP LOCKED │ │ │
│ │ │ │ • Generates tasks when due │ │ │
│ │ │ └──────────────────┬─────────────────────┘ │ │
│ │ │ │ │ │
│ └────────────┼──────────────────────────────────┼──────────────────────────┘ │
│ │ │ │
│ │ ┌────────────────────────┘ │
│ │ │ │
│ ▼ ▼ │
│ ┌─────────────────────────────────────────────────────────────────────────┐ │
│ │ POSTGRESQL DATABASE │ │
│ │ │ │
│ │ ┌─────────────────────┐ ┌─────────────────────┐ │ │
│ │ │ task_schedules │ │ worker_tasks │ │ │
│ │ │ │ │ │ │ │
│ │ │ • product_refresh │───────►│ • pending tasks │ │ │
│ │ │ • store_discovery │ create │ • claimed tasks │ │ │
│ │ │ • analytics_refresh │ tasks │ • running tasks │ │ │
│ │ │ │ │ • completed tasks │ │ │
│ │ │ next_run_at │ │ │ │ │
│ │ │ last_run_at │ │ role, dispensary_id │ │ │
│ │ │ interval_hours │ │ priority, status │ │ │
│ │ └─────────────────────┘ └──────────┬──────────┘ │ │
│ │ │ │ │
│ └─────────────────────────────────────────────┼────────────────────────────┘ │
│ │ │
│ ┌──────────────────────┘ │
│ │ Workers poll for tasks │
│ │ (SELECT FOR UPDATE SKIP LOCKED) │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────────────────┐ │
│ │ WORKER PODS (StatefulSet: scraper-worker) │ │
│ │ │ │
│ │ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │ │
│ │ │ Worker 0 │ │ Worker 1 │ │ Worker 2 │ │ Worker N │ │ │
│ │ │ │ │ │ │ │ │ │ │ │
│ │ │ task-worker │ │ task-worker │ │ task-worker │ │ task-worker │ │ │
│ │ │ .ts │ │ .ts │ │ .ts │ │ .ts │ │ │
│ │ └─────────────┘ └─────────────┘ └─────────────┘ └─────────────┘ │ │
│ │ │ │
│ └──────────────────────────────────────────────────────────────────────────┘ │
│ │
└──────────────────────────────────────────────────────────────────────────────────┘
```

## Startup Sequence

```
┌─────────────────────────────────────────────────────────────────────────────┐
│ API SERVER STARTUP │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ 1. Express app initializes │
│ │ │
│ ▼ │
│ 2. runAutoMigrations() │
│ • Runs pending migrations (including 079_task_schedules.sql) │
│ │ │
│ ▼ │
│ 3. initializeMinio() / initializeImageStorage() │
│ │ │
│ ▼ │
│ 4. cleanupOrphanedJobs() │
│ │ │
│ ▼ │
│ 5. taskScheduler.start() ◄─── NEW (per TASK_WORKFLOW_2024-12-10.md) │
│ │ │
│ ├── Recover stale tasks (workers that died) │
│ ├── Ensure default schedules exist in task_schedules │
│ ├── Check and run any due schedules immediately │
│ └── Start 60-second poll interval │
│ │ │
│ ▼ │
│ 6. app.listen(PORT) │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────────────────┐
│ WORKER POD STARTUP │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ 1. K8s starts pod from StatefulSet │
│ │ │
│ ▼ │
│ 2. TaskWorker.constructor() │
│ • Create DB pool │
│ • Create CrawlRotator │
│ │ │
│ ▼ │
│ 3. initializeStealth() │
│ • Load proxies from DB (REQUIRED - fails if none) │
│ • Wire rotator to Dutchie client │
│ │ │
│ ▼ │
│ 4. register() with API │
│ • Optional - continues if fails │
│ │ │
│ ▼ │
│ 5. startRegistryHeartbeat() every 30s │
│ │ │
│ ▼ │
│ 6. processNextTask() loop │
│ │ │
│ ├── Poll for pending task (FOR UPDATE SKIP LOCKED) │
│ ├── Claim task atomically │
│ ├── Execute handler (product_refresh, store_discovery, etc.) │
│ ├── Mark complete/failed │
│ ├── Chain next task if applicable │
│ └── Loop │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
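```

Step 6's claim (`FOR UPDATE SKIP LOCKED`) is a single atomic query. A minimal sketch of what it could look like, assuming the `worker_tasks` columns documented below — this is illustrative, not the actual task-worker.ts code:

```ts
import { Pool } from 'pg';

const pool = new Pool(); // connection config comes from PG* env vars

// Claim the highest-priority pending task. SKIP LOCKED means rows another
// worker is claiming right now are skipped instead of blocked on.
async function claimNextTask(workerId: string) {
  const { rows } = await pool.query(
    `UPDATE worker_tasks
        SET status = 'claimed', worker_id = $1, claimed_at = NOW()
      WHERE id = (
        SELECT id FROM worker_tasks
         WHERE status = 'pending'
           AND (scheduled_for IS NULL OR scheduled_for <= NOW())
         ORDER BY priority DESC, id
         FOR UPDATE SKIP LOCKED
         LIMIT 1
      )
      RETURNING *`,
    [workerId]
  );
  return rows[0]; // undefined when the queue is empty
}
```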
## Schedule Flow

```
┌─────────────────────────────────────────────────────────────────────────────┐
│ SCHEDULER POLL (every 60 seconds) │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ BEGIN TRANSACTION │
│ │ │
│ ▼ │
│ SELECT * FROM task_schedules │
│ WHERE enabled = true AND next_run_at <= NOW() │
│ FOR UPDATE SKIP LOCKED ◄─── Prevents duplicate execution across replicas │
│ │ │
│ ▼ │
│ For each due schedule: │
│ │ │
│   ├── payload_fetch_all                                                    │
│ │ └─► Query dispensaries needing crawl │
│   │    └─► Create payload_fetch tasks in worker_tasks                      │
│ │ │
│ ├── store_discovery_dutchie │
│ │ └─► Create single store_discovery task │
│ │ │
│ └── analytics_refresh │
│ └─► Create single analytics_refresh task │
│ │ │
│ ▼ │
│ UPDATE task_schedules SET │
│ last_run_at = NOW(), │
│ next_run_at = NOW() + interval_hours │
│ │ │
│ ▼ │
│ COMMIT │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
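```

One tick of this poll could look like the following TypeScript sketch (simplified relative to the actual src/services/task-scheduler.ts; `generateTasksFor` is a hypothetical stand-in for the per-schedule task generation):

```ts
import { Pool, PoolClient } from 'pg';

declare function generateTasksFor(client: PoolClient, schedule: any): Promise<number>;

const pool = new Pool();

// Run every due schedule inside one transaction. SKIP LOCKED means a
// schedule already being processed by another API replica is skipped.
async function runDueSchedules(): Promise<void> {
  const client = await pool.connect();
  try {
    await client.query('BEGIN');
    const { rows: due } = await client.query(
      `SELECT * FROM task_schedules
        WHERE enabled = true AND next_run_at <= NOW()
        FOR UPDATE SKIP LOCKED`
    );
    for (const schedule of due) {
      const taskCount = await generateTasksFor(client, schedule);
      await client.query(
        `UPDATE task_schedules
            SET last_run_at = NOW(),
                next_run_at = NOW() + make_interval(hours => interval_hours),
                last_task_count = $2
          WHERE id = $1`,
        [schedule.id, taskCount]
      );
    }
    await client.query('COMMIT');
  } catch (err) {
    await client.query('ROLLBACK');
    throw err;
  } finally {
    client.release();
  }
}
```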
## Task Lifecycle

```
┌──────────┐
│ SCHEDULE │
│ DUE │
└────┬─────┘
│
▼
┌──────────────┐ claim ┌──────────────┐ start ┌──────────────┐
│ PENDING │────────────►│ CLAIMED │────────────►│ RUNNING │
└──────────────┘ └──────────────┘ └──────┬───────┘
▲ │
│ ┌──────────────┼──────────────┐
│ retry │ │ │
│ (if retries < max) ▼ ▼ ▼
│ ┌──────────┐ ┌──────────┐ ┌──────────┐
└──────────────────────────────────│ FAILED │ │ COMPLETED│ │ STALE │
└──────────┘ └──────────┘ └────┬─────┘
│
recover_stale_tasks()
│
▼
┌──────────┐
│ PENDING │
└──────────┘
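```

The STALE → PENDING transition keys off `last_heartbeat_at`. A plausible sketch of `recover_stale_tasks()` (the 10-minute cutoff is an assumption, not the configured value):

```ts
import { Pool } from 'pg';

// Requeue tasks whose worker stopped heartbeating (e.g. the pod died),
// so long as they still have retries left.
async function recoverStaleTasks(pool: Pool, cutoffMinutes = 10): Promise<number> {
  const { rowCount } = await pool.query(
    `UPDATE worker_tasks
        SET status = 'pending', worker_id = NULL, claimed_at = NULL
      WHERE status IN ('claimed', 'running')
        AND last_heartbeat_at < NOW() - make_interval(mins => $1::int)
        AND retry_count < max_retries`,
    [cutoffMinutes]
  );
  return rowCount ?? 0;
}
```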
## Database Tables
### task_schedules (NEW - migration 079)
Stores schedule definitions. Survives restarts.
```sql
CREATE TABLE task_schedules (
  id               SERIAL PRIMARY KEY,
  name             VARCHAR(100) NOT NULL UNIQUE,
  role             VARCHAR(50) NOT NULL,  -- product_refresh, store_discovery, etc.
  enabled          BOOLEAN DEFAULT TRUE,
  interval_hours   INTEGER NOT NULL,      -- How often to run
  priority         INTEGER DEFAULT 0,     -- Task priority when created
  state_code       VARCHAR(2),            -- Optional filter
  last_run_at      TIMESTAMPTZ,           -- When it last ran
  next_run_at      TIMESTAMPTZ,           -- When it's due next
  last_task_count  INTEGER,               -- Tasks created last run
  last_error       TEXT                   -- Error message if failed
);
```
### worker_tasks (migration 074)
The task queue. Workers pull from here.
```sql
CREATE TABLE worker_tasks (
  id                 SERIAL PRIMARY KEY,
  role               task_role NOT NULL,       -- What type of work
  dispensary_id      INTEGER,                  -- Which store (if applicable)
  platform           VARCHAR(50),              -- Which platform
  status             task_status DEFAULT 'pending',
  priority           INTEGER DEFAULT 0,        -- Higher = process first
  scheduled_for      TIMESTAMP,                -- Don't process before this time
  worker_id          VARCHAR(100),             -- Which worker claimed it
  claimed_at         TIMESTAMP,
  started_at         TIMESTAMP,
  completed_at       TIMESTAMP,
  last_heartbeat_at  TIMESTAMP,                -- For stale detection
  result             JSONB,
  error_message      TEXT,
  retry_count        INTEGER DEFAULT 0,
  max_retries        INTEGER DEFAULT 3
);
```
## Default Schedules

| Name | Role | Interval | Priority | Description |
|---|---|---|---|---|
| `payload_fetch_all` | `payload_fetch` | 4 hours | 0 | Fetch payloads from the Dutchie API (chains to `product_refresh`) |
| `store_discovery_dutchie` | `store_discovery` | 24 hours | 5 | Find new Dutchie stores |
| `analytics_refresh` | `analytics_refresh` | 6 hours | 0 | Refresh materialized views |
## Task Roles

| Role | Description | Creates Tasks For |
|---|---|---|
| `payload_fetch` | NEW - Fetch from the Dutchie API, save to disk | Each dispensary needing a crawl |
| `product_refresh` | CHANGED - Read local payload, normalize, upsert to DB | Chained from `payload_fetch` |
| `store_discovery` | Find new dispensaries; returns `newStoreIds[]` | Single task per platform |
| `entry_point_discovery` | DEPRECATED - Resolve platform IDs | No longer used |
| `product_discovery` | Initial product fetch for new stores | Chained from `store_discovery` |
| `analytics_refresh` | Refresh materialized views | Single global task |
## Payload/Refresh Separation (2024-12-10)
The crawl workflow is now split into two phases:
```
payload_fetch (scheduled every 4h)
  └─► Hit Dutchie GraphQL API
  └─► Save raw JSON to /storage/payloads/{year}/{month}/{day}/store_{id}_{ts}.json.gz
  └─► Record metadata in raw_crawl_payloads table
  └─► Queue product_refresh task with payload_id

product_refresh (chained from payload_fetch)
  └─► Load payload from filesystem (NOT from API)
  └─► Normalize via DutchieNormalizer
  └─► Upsert to store_products
  └─► Create snapshots
  └─► Track missing products
  └─► Download images
```
Benefits:
- Retry-friendly: If normalize fails, re-run product_refresh without re-crawling
- Replay-able: Run product_refresh against any historical payload
- Faster refreshes: Local file read vs network call
- Historical diffs: Compare payloads to see what changed between crawls
- Less API pressure: Only payload_fetch hits Dutchie
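
The payload storage layer boils down to gzipped JSON on disk under the dated path layout shown above. A minimal sketch (`savePayload`/`loadPayload` are illustrative names, not necessarily the actual exports of src/utils/payload-storage.ts):

```ts
import { promises as fs } from 'fs';
import path from 'path';
import { gzipSync, gunzipSync } from 'zlib';

const PAYLOAD_ROOT = '/storage/payloads';

// Write a raw API payload as gzipped JSON under /{year}/{month}/{day}/.
async function savePayload(dispensaryId: number, payload: unknown): Promise<string> {
  const now = new Date();
  const dir = path.join(
    PAYLOAD_ROOT,
    String(now.getUTCFullYear()),
    String(now.getUTCMonth() + 1).padStart(2, '0'),
    String(now.getUTCDate()).padStart(2, '0')
  );
  await fs.mkdir(dir, { recursive: true });
  const file = path.join(dir, `store_${dispensaryId}_${now.getTime()}.json.gz`);
  await fs.writeFile(file, gzipSync(JSON.stringify(payload)));
  return file; // path gets recorded in raw_crawl_payloads for product_refresh
}

// Read a payload back for product_refresh or a historical replay.
async function loadPayload(file: string): Promise<unknown> {
  return JSON.parse(gunzipSync(await fs.readFile(file)).toString('utf8'));
}
```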
## Task Chaining
Tasks automatically queue follow-up tasks upon successful completion. This creates two main flows:
### Discovery Flow (New Stores)
When store_discovery finds new dispensaries, they automatically get their initial product data:
```
store_discovery
  └─► Discovers new locations via Dutchie GraphQL
  └─► Auto-promotes valid locations to dispensaries table
  └─► Collects newDispensaryIds[] from promotions
  └─► Returns { newStoreIds: [...] } in result

chainNextTask() detects newStoreIds
  └─► Creates product_discovery task for each new store

product_discovery
  └─► Calls handlePayloadFetch() internally
        └─► payload_fetch hits Dutchie API
        └─► Saves raw JSON to /storage/payloads/
        └─► Queues product_refresh task with payload_id

product_refresh
  └─► Loads payload from filesystem
  └─► Normalizes and upserts to store_products
  └─► Creates snapshots, downloads images
```
Complete Discovery Chain:
```
store_discovery → product_discovery → payload_fetch → product_refresh
                   (internal call)     (queues next)
```
### Scheduled Flow (Existing Stores)
For existing stores, payload_fetch_all schedule runs every 4 hours:
```
TaskScheduler (every 60s)
  └─► Checks task_schedules for due schedules
  └─► payload_fetch_all is due
  └─► Generates payload_fetch task for each dispensary

payload_fetch
  └─► Hits Dutchie GraphQL API
  └─► Saves raw JSON to /storage/payloads/
  └─► Queues product_refresh task with payload_id

product_refresh
  └─► Loads payload from filesystem (NOT API)
  └─► Normalizes via DutchieNormalizer
  └─► Upserts to store_products
  └─► Creates snapshots
```
Complete Scheduled Chain:
```
payload_fetch → product_refresh
   (queues)       (reads local)
```
### Chaining Implementation

Task chaining is handled in three places:

1. **Internal chaining** (handler calls handler): `product_discovery` calls `handlePayloadFetch()` directly.
2. **External chaining** (`chainNextTask()` in task-service.ts): called after task completion; `store_discovery` queues `product_discovery` for each `newStoreId` (see the sketch below).
3. **Queue-based chaining** (`taskService.createTask`): `payload_fetch` queues `product_refresh` with `payload: { payload_id }`.
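
A sketch of the external hop in `chainNextTask()` — the task shapes and `createTask` signature here are assumptions for illustration, not the actual task-service.ts API:

```ts
// Hypothetical shapes for a completed task and its result.
interface TaskResult { newStoreIds?: number[]; }
interface CompletedTask { role: string; platform?: string; result: TaskResult; }

declare function createTask(opts: {
  role: string;
  dispensaryId?: number;
  platform?: string;
  priority?: number;
}): Promise<void>;

// After store_discovery completes, queue an initial product_discovery
// task for every store it promoted into the dispensaries table.
async function chainNextTask(task: CompletedTask): Promise<void> {
  if (task.role === 'store_discovery' && task.result.newStoreIds?.length) {
    for (const dispensaryId of task.result.newStoreIds) {
      await createTask({
        role: 'product_discovery',
        dispensaryId,
        platform: task.platform,
        priority: 10, // assumed: new stores jump ahead of routine refreshes
      });
    }
  }
}
```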
## Payload API Endpoints
Raw crawl payloads can be accessed via the Payloads API:
| Endpoint | Method | Description |
|---|---|---|
| `/api/payloads` | GET | List payload metadata (paginated) |
| `/api/payloads/:id` | GET | Get payload metadata by ID |
| `/api/payloads/:id/data` | GET | Get full payload JSON (decompressed) |
| `/api/payloads/store/:dispensaryId` | GET | List payloads for a store |
| `/api/payloads/store/:dispensaryId/latest` | GET | Get latest payload for a store |
| `/api/payloads/store/:dispensaryId/diff` | GET | Diff two payloads for changes |
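
For example, pulling the latest payload for a store from a script (the base URL, and any auth the deployment requires, are assumptions; Node 18+ for global `fetch`):

```ts
// Fetch the newest raw payload for dispensary 42 and inspect it.
const res = await fetch('http://localhost:3000/api/payloads/store/42/latest');
if (!res.ok) throw new Error(`payload fetch failed: ${res.status}`);
console.log(await res.json());
```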
### Payload Diff Response
The diff endpoint returns:
```json
{
  "success": true,
  "from": { "id": 123, "fetchedAt": "...", "productCount": 100 },
  "to": { "id": 456, "fetchedAt": "...", "productCount": 105 },
  "diff": {
    "added": 10,
    "removed": 5,
    "priceChanges": 8,
    "stockChanges": 12
  },
  "details": {
    "added": [...],
    "removed": [...],
    "priceChanges": [...],
    "stockChanges": [...]
  }
}
```
## API Endpoints

### Schedules (NEW)

| Endpoint | Method | Description |
|---|---|---|
| `/api/schedules` | GET | List all schedules |
| `/api/schedules/:id` | PUT | Update a schedule |
| `/api/schedules/:id/trigger` | POST | Run a schedule immediately |
### Task Creation (rewired 2024-12-10)

| Endpoint | Method | Description |
|---|---|---|
| `/api/job-queue/enqueue` | POST | Create a single task |
| `/api/job-queue/enqueue-batch` | POST | Create tasks in batch |
| `/api/job-queue/enqueue-state` | POST | Create tasks for a state |
| `/api/tasks` | POST | Direct task creation |
### Task Management

| Endpoint | Method | Description |
|---|---|---|
| `/api/tasks` | GET | List tasks |
| `/api/tasks/:id` | GET | Get a single task |
| `/api/tasks/counts` | GET | Task counts by status |
| `/api/tasks/recover-stale` | POST | Recover stale tasks |
## Key Files

| File | Purpose |
|---|---|
| `src/services/task-scheduler.ts` | NEW - DB-driven scheduler |
| `src/tasks/task-worker.ts` | Worker that processes tasks |
| `src/tasks/task-service.ts` | Task CRUD operations |
| `src/tasks/handlers/payload-fetch.ts` | NEW - Fetches from API, saves to disk |
| `src/tasks/handlers/product-refresh.ts` | CHANGED - Reads from disk, processes to DB |
| `src/utils/payload-storage.ts` | NEW - Payload save/load utilities |
| `src/routes/tasks.ts` | Task API endpoints |
| `src/routes/job-queue.ts` | Job Queue UI endpoints (rewired) |
| `migrations/079_task_schedules.sql` | Schedule table |
| `migrations/080_raw_crawl_payloads.sql` | Payload metadata table |
| `migrations/081_payload_fetch_columns.sql` | `payload`, `last_fetch_at` columns |
| `migrations/074_worker_task_queue.sql` | Task queue table |
## Legacy Code (DEPRECATED)

| File | Status | Replacement |
|---|---|---|
| `src/services/scheduler.ts` | DEPRECATED | `task-scheduler.ts` |
| `dispensary_crawl_jobs` table | ORPHANED | `worker_tasks` |
| `job_schedules` table | LEGACY | `task_schedules` |
## Dashboard Integration
Both pages remain wired to the dashboard:
| Page | Data Source | Actions |
|---|---|---|
| Job Queue | `worker_tasks`, `task_schedules` | Create tasks, view schedules |
| Task Queue | `worker_tasks` | View tasks, recover stale |
## Multi-Replica Safety

The scheduler uses `SELECT ... FOR UPDATE SKIP LOCKED` to ensure:
- Only one replica executes a schedule at a time
- No duplicate tasks created
- Survives pod restarts - state in DB, not memory
- Self-healing - recovers stale tasks on startup
```sql
-- This query is atomic across all API server replicas
SELECT * FROM task_schedules
WHERE enabled = true AND next_run_at <= NOW()
FOR UPDATE SKIP LOCKED
```
## Worker Scaling (K8s)
Workers run as a StatefulSet in Kubernetes. You can scale from the admin UI or CLI.
### From Admin UI
The Workers page (/admin/workers) provides:
- Current replica count display
- Scale up/down buttons
- Target replica input
### API Endpoints
| Endpoint | Method | Description |
|---|---|---|
| `/api/workers/k8s/replicas` | GET | Get current/desired replica counts |
| `/api/workers/k8s/scale` | POST | Scale to N replicas (body: `{ "replicas": N }`) |
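
For example, scaling to 5 workers from a script (base URL is an assumption; Node 18+ for global `fetch`):

```ts
// Ask the API server to scale the worker StatefulSet to 5 replicas.
const res = await fetch('http://localhost:3000/api/workers/k8s/scale', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({ replicas: 5 }),
});
if (!res.ok) throw new Error(`scale failed: ${res.status}`);
```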
### From CLI
```bash
# View current replicas
kubectl get statefulset scraper-worker -n dispensary-scraper

# Scale to 10 workers
kubectl scale statefulset scraper-worker -n dispensary-scraper --replicas=10

# Scale down to 3 workers
kubectl scale statefulset scraper-worker -n dispensary-scraper --replicas=3
```
### Configuration
Environment variables for the API server:
| Variable | Default | Description |
|---|---|---|
| `K8S_NAMESPACE` | `dispensary-scraper` | Kubernetes namespace |
| `K8S_WORKER_STATEFULSET` | `scraper-worker` | StatefulSet name |
### RBAC Requirements
The API server pod needs these K8s permissions:
```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: worker-scaler
  namespace: dispensary-scraper
rules:
  - apiGroups: ["apps"]
    resources: ["statefulsets"]
    verbs: ["get", "patch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: scraper-worker-scaler
  namespace: dispensary-scraper
subjects:
  - kind: ServiceAccount
    name: default
    namespace: dispensary-scraper
roleRef:
  kind: Role
  name: worker-scaler
  apiGroup: rbac.authorization.k8s.io
```