chore: Clean up deprecated code and docs

- Move deprecated directories to src/_deprecated/: - hydration/ (old pipeline approach) - scraper-v2/ (old Puppeteer scraper) - canonical-hydration/ (merged into tasks) - Unused services: availability, crawler-logger, geolocation, etc - Unused utils: age-gate-playwright, HomepageValidator, stealthBrowser - Archive outdated docs to docs/_archive/: - ANALYTICS_RUNBOOK.md - ANALYTICS_V2_EXAMPLES.md - BRAND_INTELLIGENCE_API.md - CRAWL_PIPELINE.md - TASK_WORKFLOW_2024-12-10.md - WORKER_TASK_ARCHITECTURE.md - ORGANIC_SCRAPING_GUIDE.md - Add docs/CODEBASE_MAP.md as single source of truth - Add warning files to deprecated/archived directories - Slim down CLAUDE.md to essential rules only 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-11 22:17:40 -07:00
parent f2864bd2ad
commit a35976b9e9
61 changed files with 856 additions and 1281 deletions
--- a/backend/docs/_archive/WORKER_TASK_ARCHITECTURE.md
+++ b/backend/docs/_archive/WORKER_TASK_ARCHITECTURE.md
@@ -0,0 +1,542 @@
+# Worker Task Architecture
+
+This document describes the unified task-based worker system that replaces the legacy fragmented job systems.
+
+## Overview
+
+The task worker architecture provides a single, unified system for managing all background work in CannaiQ:
+
+- **Store discovery** - Find new dispensaries on platforms
+- **Entry point discovery** - Resolve platform IDs from menu URLs
+- **Product discovery** - Initial product fetch for new stores
+- **Product resync** - Regular price/stock updates for existing stores
+- **Analytics refresh** - Refresh materialized views and analytics
+
+## Architecture
+
+### Database Tables
+
+**`worker_tasks`** - Central task queue
+```sql
+CREATE TABLE worker_tasks (
+  id SERIAL PRIMARY KEY,
+  role task_role NOT NULL,           -- What type of work
+  dispensary_id INTEGER,              -- Which store (if applicable)
+  platform VARCHAR(50),               -- Which platform (dutchie, etc.)
+  status task_status DEFAULT 'pending',
+  priority INTEGER DEFAULT 0,         -- Higher = process first
+  scheduled_for TIMESTAMP,            -- Don't process before this time
+  worker_id VARCHAR(100),             -- Which worker claimed it
+  claimed_at TIMESTAMP,
+  started_at TIMESTAMP,
+  completed_at TIMESTAMP,
+  last_heartbeat_at TIMESTAMP,        -- For stale detection
+  result JSONB,                       -- Output from handler
+  error_message TEXT,
+  retry_count INTEGER DEFAULT 0,
+  max_retries INTEGER DEFAULT 3,
+  created_at TIMESTAMP DEFAULT NOW(),
+  updated_at TIMESTAMP DEFAULT NOW()
+);
+```
+
+**Key indexes:**
+- `idx_worker_tasks_pending_priority` - For efficient task claiming
+- `idx_worker_tasks_active_dispensary` - Prevents concurrent tasks per store (partial unique index)
+
+### Task Roles
+
+| Role | Purpose | Per-Store | Scheduled |
+|------|---------|-----------|-----------|
+| `store_discovery` | Find new stores on a platform | No | Daily |
+| `entry_point_discovery` | Resolve platform IDs | Yes | On-demand |
+| `product_discovery` | Initial product fetch | Yes | After entry_point |
+| `product_resync` | Price/stock updates | Yes | Every 4 hours |
+| `analytics_refresh` | Refresh MVs | No | Daily |
+
+### Task Lifecycle
+
+```
+pending → claimed → running → completed
+                  ↓
+                failed
+```
+
+1. **pending** - Task is waiting to be picked up
+2. **claimed** - Worker has claimed it (atomic via SELECT FOR UPDATE SKIP LOCKED)
+3. **running** - Worker is actively processing
+4. **completed** - Task finished successfully
+5. **failed** - Task encountered an error
+6. **stale** - Task lost its worker (recovered automatically)
+
+## Files
+
+### Core Files
+
+| File | Purpose |
+|------|---------|
+| `src/tasks/task-service.ts` | TaskService - CRUD, claiming, capacity metrics |
+| `src/tasks/task-worker.ts` | TaskWorker - Main worker loop |
+| `src/tasks/index.ts` | Module exports |
+| `src/routes/tasks.ts` | API endpoints |
+| `migrations/074_worker_task_queue.sql` | Database schema |
+
+### Task Handlers
+
+| File | Role |
+|------|------|
+| `src/tasks/handlers/store-discovery.ts` | `store_discovery` |
+| `src/tasks/handlers/entry-point-discovery.ts` | `entry_point_discovery` |
+| `src/tasks/handlers/product-discovery.ts` | `product_discovery` |
+| `src/tasks/handlers/product-resync.ts` | `product_resync` |
+| `src/tasks/handlers/analytics-refresh.ts` | `analytics_refresh` |
+
+## Running Workers
+
+### Environment Variables
+
+| Variable | Default | Description |
+|----------|---------|-------------|
+| `WORKER_ROLE` | (required) | Which task role to process |
+| `WORKER_ID` | auto-generated | Custom worker identifier |
+| `POLL_INTERVAL_MS` | 5000 | How often to check for tasks |
+| `HEARTBEAT_INTERVAL_MS` | 30000 | How often to update heartbeat |
+
+### Starting a Worker
+
+```bash
+# Start a product resync worker
+WORKER_ROLE=product_resync npx tsx src/tasks/task-worker.ts
+
+# Start with custom ID
+WORKER_ROLE=product_resync WORKER_ID=resync-1 npx tsx src/tasks/task-worker.ts
+
+# Start multiple workers for different roles
+WORKER_ROLE=store_discovery npx tsx src/tasks/task-worker.ts &
+WORKER_ROLE=product_resync npx tsx src/tasks/task-worker.ts &
+```
+
+### Kubernetes Deployment
+
+```yaml
+apiVersion: apps/v1
+kind: Deployment
+metadata:
+  name: task-worker-resync
+spec:
+  replicas: 3
+  template:
+    spec:
+      containers:
+      - name: worker
+        image: code.cannabrands.app/creationshop/dispensary-scraper:latest
+        command: ["npx", "tsx", "src/tasks/task-worker.ts"]
+        env:
+        - name: WORKER_ROLE
+          value: "product_resync"
+```
+
+## API Endpoints
+
+### Task Management
+
+| Endpoint | Method | Description |
+|----------|--------|-------------|
+| `/api/tasks` | GET | List tasks with filters |
+| `/api/tasks` | POST | Create a new task |
+| `/api/tasks/:id` | GET | Get task by ID |
+| `/api/tasks/counts` | GET | Get counts by status |
+| `/api/tasks/capacity` | GET | Get capacity metrics |
+| `/api/tasks/capacity/:role` | GET | Get role-specific capacity |
+| `/api/tasks/recover-stale` | POST | Recover tasks from dead workers |
+
+### Task Generation
+
+| Endpoint | Method | Description |
+|----------|--------|-------------|
+| `/api/tasks/generate/resync` | POST | Generate daily resync tasks |
+| `/api/tasks/generate/discovery` | POST | Create store discovery task |
+
+### Migration (from legacy systems)
+
+| Endpoint | Method | Description |
+|----------|--------|-------------|
+| `/api/tasks/migration/status` | GET | Compare old vs new systems |
+| `/api/tasks/migration/disable-old-schedules` | POST | Disable job_schedules |
+| `/api/tasks/migration/cancel-pending-crawl-jobs` | POST | Cancel old crawl jobs |
+| `/api/tasks/migration/create-resync-tasks` | POST | Create tasks for all stores |
+| `/api/tasks/migration/full-migrate` | POST | One-click migration |
+
+### Role-Specific Endpoints
+
+| Endpoint | Method | Description |
+|----------|--------|-------------|
+| `/api/tasks/role/:role/last-completion` | GET | Last completion time |
+| `/api/tasks/role/:role/recent` | GET | Recent completions |
+| `/api/tasks/store/:id/active` | GET | Check if store has active task |
+
+## Capacity Planning
+
+The `v_worker_capacity` view provides real-time metrics:
+
+```sql
+SELECT * FROM v_worker_capacity;
+```
+
+Returns:
+- `pending_tasks` - Tasks waiting to be claimed
+- `ready_tasks` - Tasks ready now (scheduled_for is null or past)
+- `claimed_tasks` - Tasks claimed but not started
+- `running_tasks` - Tasks actively processing
+- `completed_last_hour` - Recent completions
+- `failed_last_hour` - Recent failures
+- `active_workers` - Workers with recent heartbeats
+- `avg_duration_sec` - Average task duration
+- `tasks_per_worker_hour` - Throughput estimate
+- `estimated_hours_to_drain` - Time to clear queue
+
+### Scaling Recommendations
+
+```javascript
+// API: GET /api/tasks/capacity/:role
+{
+  "role": "product_resync",
+  "pending_tasks": 500,
+  "active_workers": 3,
+  "workers_needed": {
+    "for_1_hour": 10,
+    "for_4_hours": 3,
+    "for_8_hours": 2
+  }
+}
+```
+
+## Task Chaining
+
+Tasks can automatically create follow-up tasks:
+
+```
+store_discovery → entry_point_discovery → product_discovery
+                              ↓
+                     (store has platform_dispensary_id)
+                              ↓
+                     Daily resync tasks
+```
+
+The `chainNextTask()` method handles this automatically.
+
+## Stale Task Recovery
+
+Tasks are considered stale if `last_heartbeat_at` is older than the threshold (default 10 minutes).
+
+```sql
+SELECT recover_stale_tasks(10); -- 10 minute threshold
+```
+
+Or via API:
+```bash
+curl -X POST /api/tasks/recover-stale \
+  -H 'Content-Type: application/json' \
+  -d '{"threshold_minutes": 10}'
+```
+
+## Migration from Legacy Systems
+
+### Legacy Systems Replaced
+
+1. **job_schedules + job_run_logs** - Scheduled job definitions
+2. **dispensary_crawl_jobs** - Per-dispensary crawl queue
+3. **SyncOrchestrator + HydrationWorker** - Raw payload processing
+
+### Migration Steps
+
+**Option 1: One-Click Migration**
+```bash
+curl -X POST /api/tasks/migration/full-migrate
+```
+
+This will:
+1. Disable all job_schedules
+2. Cancel pending dispensary_crawl_jobs
+3. Generate resync tasks for all stores
+4. Create discovery and analytics tasks
+
+**Option 2: Manual Migration**
+```bash
+# 1. Check current status
+curl /api/tasks/migration/status
+
+# 2. Disable old schedules
+curl -X POST /api/tasks/migration/disable-old-schedules
+
+# 3. Cancel pending crawl jobs
+curl -X POST /api/tasks/migration/cancel-pending-crawl-jobs
+
+# 4. Create resync tasks
+curl -X POST /api/tasks/migration/create-resync-tasks \
+  -H 'Content-Type: application/json' \
+  -d '{"state_code": "AZ"}'
+
+# 5. Generate daily resync schedule
+curl -X POST /api/tasks/generate/resync \
+  -H 'Content-Type: application/json' \
+  -d '{"batches_per_day": 6}'
+```
+
+## Per-Store Locking
+
+The system prevents concurrent tasks for the same store using a partial unique index:
+
+```sql
+CREATE UNIQUE INDEX idx_worker_tasks_active_dispensary
+ON worker_tasks (dispensary_id)
+WHERE dispensary_id IS NOT NULL
+AND status IN ('claimed', 'running');
+```
+
+This ensures only one task can be active per store at any time.
+
+## Task Priority
+
+Tasks are claimed in priority order (higher first), then by creation time:
+
+```sql
+ORDER BY priority DESC, created_at ASC
+```
+
+Default priorities:
+- `store_discovery`: 0
+- `entry_point_discovery`: 10 (high - new stores)
+- `product_discovery`: 10 (high - new stores)
+- `product_resync`: 0
+- `analytics_refresh`: 0
+
+## Scheduled Tasks
+
+Tasks can be scheduled for future execution:
+
+```javascript
+await taskService.createTask({
+  role: 'product_resync',
+  dispensary_id: 123,
+  scheduled_for: new Date('2025-01-10T06:00:00Z'),
+});
+```
+
+The `generate_resync_tasks()` function creates staggered tasks throughout the day:
+
+```sql
+SELECT generate_resync_tasks(6, '2025-01-10'); -- 6 batches = every 4 hours
+```
+
+## Dashboard Integration
+
+The admin dashboard shows task queue status in the main overview:
+
+```
+Task Queue Summary
+------------------
+Pending:   45
+Running:   3
+Completed: 1,234
+Failed:    12
+```
+
+Full task management is available at `/admin/tasks`.
+
+## Error Handling
+
+Failed tasks include the error message in `error_message` and can be retried:
+
+```sql
+-- View failed tasks
+SELECT id, role, dispensary_id, error_message, retry_count
+FROM worker_tasks
+WHERE status = 'failed'
+ORDER BY completed_at DESC
+LIMIT 20;
+
+-- Retry failed tasks
+UPDATE worker_tasks
+SET status = 'pending', retry_count = retry_count + 1
+WHERE status = 'failed' AND retry_count < max_retries;
+```
+
+## Concurrent Task Processing (Added 2024-12)
+
+Workers can now process multiple tasks concurrently within a single worker instance. This improves throughput by utilizing async I/O efficiently.
+
+### Architecture
+
+```
+┌─────────────────────────────────────────────────────────────┐
+│                         Pod (K8s)                           │
+│                                                             │
+│  ┌─────────────────────────────────────────────────────┐   │
+│  │                    TaskWorker                        │   │
+│  │                                                      │   │
+│  │  ┌─────────┐  ┌─────────┐  ┌─────────┐             │   │
+│  │  │ Task 1  │  │ Task 2  │  │ Task 3  │  (concurrent)│   │
+│  │  └─────────┘  └─────────┘  └─────────┘             │   │
+│  │                                                      │   │
+│  │  Resource Monitor                                    │   │
+│  │  ├── Memory: 65% (threshold: 85%)                   │   │
+│  │  ├── CPU: 45% (threshold: 90%)                      │   │
+│  │  └── Status: Normal                                  │   │
+│  └─────────────────────────────────────────────────────┘   │
+└─────────────────────────────────────────────────────────────┘
+```
+
+### Environment Variables
+
+| Variable | Default | Description |
+|----------|---------|-------------|
+| `MAX_CONCURRENT_TASKS` | 3 | Maximum tasks a worker will run concurrently |
+| `MEMORY_BACKOFF_THRESHOLD` | 0.85 | Back off when heap memory exceeds 85% |
+| `CPU_BACKOFF_THRESHOLD` | 0.90 | Back off when CPU exceeds 90% |
+| `BACKOFF_DURATION_MS` | 10000 | How long to wait when backing off (10s) |
+
+### How It Works
+
+1. **Main Loop**: Worker continuously tries to fill up to `MAX_CONCURRENT_TASKS`
+2. **Resource Monitoring**: Before claiming a new task, worker checks memory and CPU
+3. **Backoff**: If resources exceed thresholds, worker pauses and stops claiming new tasks
+4. **Concurrent Execution**: Tasks run in parallel using `Promise` - they don't block each other
+5. **Graceful Shutdown**: On SIGTERM/decommission, worker stops claiming but waits for active tasks
+
+### Resource Monitoring
+
+```typescript
+// ResourceStats interface
+interface ResourceStats {
+  memoryPercent: number;    // Current heap usage as decimal (0.0-1.0)
+  memoryMb: number;         // Current heap used in MB
+  memoryTotalMb: number;    // Total heap available in MB
+  cpuPercent: number;       // CPU usage as percentage (0-100)
+  isBackingOff: boolean;    // True if worker is in backoff state
+  backoffReason: string;    // Why the worker is backing off
+}
+```
+
+### Heartbeat Data
+
+Workers report the following in their heartbeat:
+
+```json
+{
+  "worker_id": "worker-abc123",
+  "current_task_id": 456,
+  "current_task_ids": [456, 457, 458],
+  "active_task_count": 3,
+  "max_concurrent_tasks": 3,
+  "status": "active",
+  "resources": {
+    "memory_mb": 256,
+    "memory_total_mb": 512,
+    "memory_rss_mb": 320,
+    "memory_percent": 50,
+    "cpu_user_ms": 12500,
+    "cpu_system_ms": 3200,
+    "cpu_percent": 45,
+    "is_backing_off": false,
+    "backoff_reason": null
+  }
+}
+```
+
+### Backoff Behavior
+
+When resources exceed thresholds:
+
+1. Worker logs the backoff reason:
+   ```
+   [TaskWorker] MyWorker backing off: Memory at 87.3% (threshold: 85%)
+   ```
+
+2. Worker stops claiming new tasks but continues existing tasks
+
+3. After `BACKOFF_DURATION_MS`, worker rechecks resources
+
+4. When resources return to normal:
+   ```
+   [TaskWorker] MyWorker resuming normal operation
+   ```
+
+### UI Display
+
+The Workers Dashboard shows:
+
+- **Tasks Column**: `2/3 tasks` (active/max concurrent)
+- **Resources Column**: Memory % and CPU % with color coding
+  - Green: < 50%
+  - Yellow: 50-74%
+  - Amber: 75-89%
+  - Red: 90%+
+- **Backing Off**: Orange warning badge when worker is in backoff state
+
+### Task Count Badge Details
+
+```
+┌─────────────────────────────────────────────┐
+│ Worker: "MyWorker"                          │
+│ Tasks: 2/3 tasks  #456, #457                │
+│ Resources: 🧠 65%  💻 45%                    │
+│ Status: ● Active                            │
+└─────────────────────────────────────────────┘
+```
+
+### Best Practices
+
+1. **Start Conservative**: Use `MAX_CONCURRENT_TASKS=3` initially
+2. **Monitor Resources**: Watch for frequent backoffs in logs
+3. **Tune Per Workload**: I/O-bound tasks benefit from higher concurrency
+4. **Scale Horizontally**: Add more pods rather than cranking concurrency too high
+
+### Code References
+
+| File | Purpose |
+|------|---------|
+| `src/tasks/task-worker.ts:68-71` | Concurrency environment variables |
+| `src/tasks/task-worker.ts:104-111` | ResourceStats interface |
+| `src/tasks/task-worker.ts:149-179` | getResourceStats() method |
+| `src/tasks/task-worker.ts:184-196` | shouldBackOff() method |
+| `src/tasks/task-worker.ts:462-516` | mainLoop() with concurrent claiming |
+| `src/routes/worker-registry.ts:148-195` | Heartbeat endpoint handling |
+| `cannaiq/src/pages/WorkersDashboard.tsx:233-305` | UI components for resources |
+
+## Monitoring
+
+### Logs
+
+Workers log to stdout:
+```
+[TaskWorker] Starting worker worker-product_resync-a1b2c3d4 for role: product_resync
+[TaskWorker] Claimed task 123 (product_resync) for dispensary 456
+[TaskWorker] Task 123 completed successfully
+```
+
+### Health Check
+
+Check if workers are active:
+```sql
+SELECT worker_id, role, COUNT(*), MAX(last_heartbeat_at)
+FROM worker_tasks
+WHERE last_heartbeat_at > NOW() - INTERVAL '5 minutes'
+GROUP BY worker_id, role;
+```
+
+### Metrics
+
+```sql
+-- Tasks by status
+SELECT status, COUNT(*) FROM worker_tasks GROUP BY status;
+
+-- Tasks by role
+SELECT role, status, COUNT(*) FROM worker_tasks GROUP BY role, status;
+
+-- Average duration by role
+SELECT role, AVG(EXTRACT(EPOCH FROM (completed_at - started_at))) as avg_seconds
+FROM worker_tasks
+WHERE status = 'completed' AND completed_at > NOW() - INTERVAL '24 hours'
+GROUP BY role;
+```