Major changes:
- Split crawl into payload_fetch (API → disk) and product_refresh (disk → DB)
- Add task chaining: store_discovery → product_discovery → payload_fetch → product_refresh
- Add payload storage utilities for gzipped JSON on filesystem
- Add /api/payloads endpoints for payload access and diffing
- Add DB-driven TaskScheduler with schedule persistence
- Track newDispensaryIds through discovery promotion for chaining
- Add stealth improvements: HTTP fingerprinting, proxy rotation enhancements
- Add Workers dashboard K8s scaling controls

New files:
- src/tasks/handlers/payload-fetch.ts - Fetches from API, saves to disk
- src/services/task-scheduler.ts - DB-driven schedule management
- src/utils/payload-storage.ts - Payload save/load utilities
- src/routes/payloads.ts - Payload API endpoints
- src/services/http-fingerprint.ts - Browser fingerprint generation
- docs/TASK_WORKFLOW_2024-12-10.md - Complete workflow documentation

Migrations:
- 078: Proxy consecutive 403 tracking
- 079: task_schedules table
- 080: raw_crawl_payloads table
- 081: payload column and last_fetch_at

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
38 lines
1.8 KiB
SQL
-- Migration 081: Payload Fetch Columns
-- Per TASK_WORKFLOW_2024-12-10.md: Separates API fetch from data processing
--
-- New architecture:
-- - payload_fetch: Hits Dutchie API, saves raw payload to disk
-- - product_refresh: Reads local payload, normalizes, upserts to DB
--
-- This migration adds:
-- 1. payload column to worker_tasks (for task chaining data)
-- 2. processed_at column to raw_crawl_payloads (track when payload was processed)
-- 3. last_fetch_at column to dispensaries (track when last payload was fetched)

-- Add payload column to worker_tasks for task chaining
-- Used by payload_fetch to pass payload_id to product_refresh
ALTER TABLE worker_tasks
  ADD COLUMN IF NOT EXISTS payload JSONB DEFAULT NULL;

COMMENT ON COLUMN worker_tasks.payload IS 'Per TASK_WORKFLOW_2024-12-10.md: Task chaining data (e.g., payload_id from payload_fetch to product_refresh)';
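-- Example (illustrative sketch, not part of the original migration; run it
-- separately): how a downstream handler might read chaining data from the
-- new column. The 'payload_id' key follows the column comment above; the
-- exact JSON shape written by payload_fetch is an assumption here.
SELECT payload ->> 'payload_id' AS payload_id
FROM worker_tasks
WHERE (payload ->> 'payload_id') IS NOT NULL  -- skips tasks with no chaining data
LIMIT 10;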

-- Add processed_at to raw_crawl_payloads
-- Tracks when the payload was processed by product_refresh
ALTER TABLE raw_crawl_payloads
  ADD COLUMN IF NOT EXISTS processed_at TIMESTAMPTZ DEFAULT NULL;

COMMENT ON COLUMN raw_crawl_payloads.processed_at IS 'When this payload was processed by product_refresh handler';

-- Index for finding unprocessed payloads
CREATE INDEX IF NOT EXISTS idx_raw_crawl_payloads_unprocessed
  ON raw_crawl_payloads(dispensary_id, fetched_at DESC)
  WHERE processed_at IS NULL;
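-- Example (illustrative sketch, not part of the original migration; run it
-- separately): the newest unprocessed payload per dispensary. The WHERE
-- clause repeats the index predicate and the ORDER BY follows the index
-- column order, so the planner can serve this from the partial index above.
-- Whether it actually does depends on table statistics.
SELECT DISTINCT ON (dispensary_id)
       dispensary_id,
       fetched_at
FROM raw_crawl_payloads
WHERE processed_at IS NULL
ORDER BY dispensary_id, fetched_at DESC;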

-- Add last_fetch_at to dispensaries
-- Tracks when the last payload was fetched (separate from last_crawl_at which is when processing completed)
ALTER TABLE dispensaries
  ADD COLUMN IF NOT EXISTS last_fetch_at TIMESTAMPTZ DEFAULT NULL;

COMMENT ON COLUMN dispensaries.last_fetch_at IS 'Per TASK_WORKFLOW_2024-12-10.md: When last payload was fetched from API (separate from last_crawl_at which is when processing completed)';
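-- Example (illustrative sketch, not part of the original migration; run it
-- separately): dispensaries whose latest fetch has not yet been followed by
-- a completed processing run. Assumes last_crawl_at is an existing timestamp
-- column on dispensaries, as the comment above implies.
SELECT last_fetch_at,
       last_crawl_at
FROM dispensaries
WHERE last_fetch_at IS NOT NULL
  AND (last_crawl_at IS NULL OR last_fetch_at > last_crawl_at);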