Compare commits

...

148 Commits

Author SHA1 Message Date
Kelly
6bcadd9e71 fix(preflight): Correct parameter order and add IP/fingerprint reporting
- Fix update_worker_preflight call to use correct parameter order:
  (worker_id, transport, status, ip, response_ms, error, fingerprint)
- Add proxyIp to both curl and http preflight reports
- Add fingerprint JSONB with timezone, location, and bot detection data
- Log HTTP IP and timezone after preflight completes

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-12 00:32:45 -07:00
kelly
a77bf8611a Merge pull request 'feat(workers): Preflight phase 1 - Schema, StatefulSet, and timezone matching' (#55) from feat/preflight-phase1-schema into master 2025-12-12 07:30:53 +00:00
Kelly
33feca3138 fix(antidetect): Match browser timezone to proxy IP location
- Add IP geolocation lookup via ip-api.com to get timezone from proxy IP
- Use ipify.org API for reliable proxy IP detection (replaces unreliable fingerprint.com scraping)
- Set browser timezone via CDP Emulation.setTimezoneOverride to match proxy location
- Add detectedTimezone and detectedLocation to preflight result
- Add /api/worker-registry/preflight-test endpoint for smoke testing

Fixes timezone mismatch where browser showed America/Phoenix while proxy was in America/New_York

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-12 00:25:39 -07:00
kelly
7d85a97b63 Merge pull request 'feat: Preflight schema and StatefulSet' (#54) from feat/preflight-phase1-schema into master
Reviewed-on: https://code.cannabrands.app/Creationshop/dispensary-scraper/pulls/54
2025-12-12 07:14:40 +00:00
Kelly
ce081effd4 feat(workers): Add preflight schema and StatefulSet
- Migration 085: Add curl_ip, http_ip, fingerprint_data, preflight_status,
  preflight_at columns to worker_registry
- StatefulSet manifest for 8 persistent workers with OnDelete update strategy

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-11 23:45:04 -07:00
kelly
2ed088b4d8 Merge pull request 'feat(api): Add preflight columns to worker registry API response' (#50) from feat/preflight-api-fields into master 2025-12-12 06:22:06 +00:00
Kelly
d3c49fa246 feat(api): Add preflight columns to worker registry API response
Exposes curl_ip, http_ip, preflight_status, preflight_at, and fingerprint_data
in the /api/worker-registry/workers response.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-11 23:12:55 -07:00
kelly
52cb5014fd Merge pull request 'feat: Dual-transport preflight system for worker fingerprinting' (#49) from fix/ci-and-preflight-enforcement into master 2025-12-12 06:08:13 +00:00
Kelly
50654be910 fix: Restore hydration and product_refresh for store updates
- Moved hydration module back from _deprecated (needed for product_refresh)
- Restored product_refresh handler for processing stored payloads
- Restored geolocation service for findadispo/findagram
- Stubbed system routes that depend on deprecated SyncOrchestrator
- Removed crawler-sandbox route (deprecated)
- Fixed all TypeScript compilation errors

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-11 23:03:39 -07:00
Kelly
cdab71a1ee feat(workers): Add dual-transport preflight system
Workers now run both curl and http (Puppeteer) preflights on startup:
- curl-preflight.ts: Tests axios + proxy via httpbin.org
- puppeteer-preflight.ts: Tests browser + StealthPlugin via fingerprint.com
  (with amiunique.org fallback)
- Migration 084: Adds preflight columns to worker_registry and method
  column to worker_tasks
- Workers report preflight status, IP, fingerprint, and response time
- Tasks can require specific transport method (curl/http)
- Dashboard shows Transport column with preflight status badges

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-11 22:47:52 -07:00
Kelly
a35976b9e9 chore: Clean up deprecated code and docs
- Move deprecated directories to src/_deprecated/:
  - hydration/ (old pipeline approach)
  - scraper-v2/ (old Puppeteer scraper)
  - canonical-hydration/ (merged into tasks)
  - Unused services: availability, crawler-logger, geolocation, etc
  - Unused utils: age-gate-playwright, HomepageValidator, stealthBrowser

- Archive outdated docs to docs/_archive/:
  - ANALYTICS_RUNBOOK.md
  - ANALYTICS_V2_EXAMPLES.md
  - BRAND_INTELLIGENCE_API.md
  - CRAWL_PIPELINE.md
  - TASK_WORKFLOW_2024-12-10.md
  - WORKER_TASK_ARCHITECTURE.md
  - ORGANIC_SCRAPING_GUIDE.md

- Add docs/CODEBASE_MAP.md as single source of truth
- Add warning files to deprecated/archived directories
- Slim down CLAUDE.md to essential rules only

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-11 22:17:40 -07:00
kelly
c68210c485 Merge pull request 'fix(ci): Remove buildx cache and add preflight enforcement' (#48) from fix/ci-and-preflight-enforcement into master
Reviewed-on: https://code.cannabrands.app/Creationshop/dispensary-scraper/pulls/48
2025-12-12 04:53:20 +00:00
Kelly
f2864bd2ad fix(ci): Remove buildx cache and add preflight enforcement
- Remove cache_from/cache_to from CI (plugin bug splitting commas)
- Add preflight() method to CrawlRotator - tests proxy + anti-detect
- Add pre-task preflight check - workers MUST pass before executing
- Add releaseTask() to release tasks back to pending on preflight fail
- Rename proxy_test task to whoami for clarity

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-11 21:37:22 -07:00
kelly
eca9e85242 Merge pull request 'fix(ci): Fix buildx cache syntax and add proxy_test task' (#47) from feat/ui-polish-and-ci-caching into master
Reviewed-on: https://code.cannabrands.app/Creationshop/dispensary-scraper/pulls/47
2025-12-12 04:30:21 +00:00
Kelly
3f958fbff3 fix(ci): Fix buildx cache_from syntax for array format
Plugin was splitting comma-separated values incorrectly.
Use array format with quoted strings instead.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-11 21:10:26 -07:00
Kelly
c84ef0396b feat(tasks): Add proxy_test task handler and discovery run tracking
- Add proxy_test task handler that fetches IP via proxy to verify connectivity
- Add discovery_runs migration (083) for tracking store discovery progress
- Register proxy_test in task service and worker

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-11 21:07:58 -07:00
kelly
e1c67dcee5 Merge pull request 'feat: UI polish and CI caching improvements' (#46) from feat/ui-polish-and-ci-caching into master
Reviewed-on: https://code.cannabrands.app/Creationshop/dispensary-scraper/pulls/46
2025-12-12 03:47:46 +00:00
kelly
34c8a8cc67 Merge pull request 'feat(cannaiq): Add clickable logo, favicon, and remove state selector' (#45) from feat/cannaiq-ui-polish into master
Reviewed-on: https://code.cannabrands.app/Creationshop/dispensary-scraper/pulls/45
2025-12-12 03:36:04 +00:00
Kelly
6cd1f55119 fix(workers): Preserve fantasy names on pod restart
- Re-registration no longer overwrites pod_name with K8s name
- New workers get fantasy name (Aethelgard, Xylos, etc.) as pod_name
- Document worker naming convention in CLAUDE.md

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-11 20:35:25 -07:00
Kelly
e918234928 feat(ci): Add npm cache volume for faster typechecks
- Create PVC for shared npm cache across CI jobs
- Configure Woodpecker agent to allow npm-cache volume mount
- Update typecheck steps to use shared cache directory
- First run populates cache, subsequent runs are ~3-4x faster

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-11 19:27:32 -07:00
Kelly
888a608485 feat(cannaiq): Add clickable logo, favicon, and remove state selector
- Make CannaIQ logo clickable to return to dashboard (sidebar + mobile header)
- Add custom favicon matching the logo design
- Remove state selector dropdown from sidebar navigation

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-11 19:24:21 -07:00
kelly
b5c3b05246 Merge pull request 'fix(workers): Fix false memory backoff and add backing-off color coding' (#44) from fix/worker-memory-backoff into master 2025-12-12 02:13:51 +00:00
Kelly
fdce5e0302 fix(workers): Fix false memory backoff and add backing-off color coding
- Fix memory calculation to use max-old-space-size (1500MB) instead of
  V8's dynamic heapTotal. This prevents false 95%+ readings when idle.
- Add yellow color for backing-off workers in pod visualization
- Update legend and tooltips with backing-off status
- Remove pool toggle from TasksDashboard (moved to Workers page)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-11 19:11:42 -07:00
Kelly
4679b245de perf(ci): Enable Docker layer caching for faster builds
Add cache_from and cache_to settings to all docker-buildx steps.
Uses registry-based caching to avoid rebuilding npm install layer
when package.json hasn't changed.

Expected improvement: 14min backend build → ~3-4min on cache hit.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-11 18:43:43 -07:00
kelly
a837070f54 Merge pull request 'refactor(admin): Consolidate JobQueue into TasksDashboard + CI worker resilience' (#43) from fix/ci-worker-resilience into master 2025-12-12 01:21:28 +00:00
Kelly
5a929e9803 refactor(admin): Consolidate JobQueue into TasksDashboard
- Move Create Task modal from JobQueue to TasksDashboard
- Add pagination to TasksDashboard (25 tasks per page)
- Add delete action for failed/completed/pending tasks
- Remove JobQueue page and route
- Rename nav item from "Task Queue" to "Tasks"

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-11 17:41:21 -07:00
kelly
52b0fad410 Merge pull request 'ci: Add worker resilience check to deploy step' (#42) from fix/ci-worker-resilience into master
Reviewed-on: https://code.cannabrands.app/Creationshop/dispensary-scraper/pulls/42
2025-12-12 00:09:48 +00:00
Kelly
9944031eea ci: Add worker resilience check to deploy step
If workers are scaled to 0, CI will now automatically scale them to 5
before updating the image. This prevents workers from being stuck at 0
if manually scaled down for maintenance.

The check only scales up if replicas=0, so it won't interfere with
normal deployments or HPA scaling.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-11 16:40:53 -07:00
kelly
2babaa7136 Merge pull request 'ci: Remove explicit migration step from deploy' (#41) from fix/ci-remove-migration-step into master
Reviewed-on: https://code.cannabrands.app/Creationshop/dispensary-scraper/pulls/41
2025-12-11 23:11:48 +00:00
Kelly
90567511dd ci: Remove explicit migration step from deploy
Auto-migrate runs at server startup and handles migration errors gracefully.
The explicit kubectl exec migration step was failing due to trigger
already existing (schema_migrations table out of sync).

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-11 15:56:47 -07:00
kelly
beb16ad0cb Merge pull request 'ci: Run migrations via kubectl exec instead of separate step' (#40) from fix/ci-migrate-via-kubectl into master
Reviewed-on: https://code.cannabrands.app/Creationshop/dispensary-scraper/pulls/40
2025-12-11 22:38:21 +00:00
Kelly
fc7fc5ea85 ci: Run migrations via kubectl exec instead of separate step
Removes the migrate step that required db_* secrets (which CI can't
access since postgres is cluster-internal). Instead, run migrations
via kubectl exec on the deployed scraper pod, which already has DB
access via its env vars.

Deploy order:
1. Deploy scraper image
2. Wait for rollout
3. Run migrations via kubectl exec
4. Deploy remaining services

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-11 15:16:12 -07:00
kelly
ab8956b14b Merge pull request 'fix: Revert CI event array syntax to single value' (#39) from fix/ci-event-syntax into master
Reviewed-on: https://code.cannabrands.app/Creationshop/dispensary-scraper/pulls/39
2025-12-11 22:10:03 +00:00
Kelly
1d9c90641f fix: Revert event array syntax to single value
The [push, manual] array syntax broke CI config parsing.
Reverting to event: push which is known to work.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-11 14:48:34 -07:00
kelly
6126b907f2 Merge pull request 'ci: Support manual pipeline events for deploy' (#38) from ci/support-manual-pipelines into master
Reviewed-on: https://code.cannabrands.app/Creationshop/dispensary-scraper/pulls/38
2025-12-11 21:12:26 +00:00
Kelly
cc93d2d483 ci: Support manual pipeline events for deploy
Allow deploy steps to run on both push and manual events.
This enables triggering deploys via `woodpecker-cli pipeline create`.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-11 13:54:04 -07:00
kelly
7642c17ec0 Merge pull request 'ci: Fix pipeline config path' (#37) from fix/ci-config-path into master
Reviewed-on: https://code.cannabrands.app/Creationshop/dispensary-scraper/pulls/37
2025-12-11 20:01:30 +00:00
Kelly
cb60dcf352 ci: Add root-level woodpecker config for CI detection 2025-12-11 12:37:35 -07:00
kelly
5ffe05d519 Merge pull request 'feat: Concurrent task processing with resource-based backoff' (#36) from feat/worker-scaling into master 2025-12-11 18:49:01 +00:00
Kelly
8e2f07c941 feat(workers): Concurrent task processing with resource-based backoff
Workers can now process multiple tasks concurrently (default: 3 max).
Self-regulate based on resource usage - back off at 85% memory or 90% CPU.

Backend changes:
- TaskWorker handles concurrent tasks using async Maps
- Resource monitoring (memory %, CPU %) with backoff logic
- Heartbeat reports active_task_count, max_concurrent_tasks, resource stats
- Decommission support via worker_commands table

Frontend changes:
- Workers Dashboard shows tasks per worker (N/M format)
- Resource badges with color-coded thresholds
- Pod visualization with clickable selection
- Decommission controls per worker

New env vars:
- MAX_CONCURRENT_TASKS (default: 3)
- MEMORY_BACKOFF_THRESHOLD (default: 0.85)
- CPU_BACKOFF_THRESHOLD (default: 0.90)
- BACKOFF_DURATION_MS (default: 10000)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-11 11:47:24 -07:00
kelly
0b6e615075 Merge pull request 'fix: Use React Router Link to prevent scroll reset' (#35) from feat/worker-scaling into master
Reviewed-on: https://code.cannabrands.app/Creationshop/dispensary-scraper/pulls/35
2025-12-11 17:14:25 +00:00
Kelly
be251c6fb3 fix: Use React Router Link for nav to prevent scroll reset
Changed sidebar NavLink from <a> to <Link> for client-side navigation.
This prevents full page reload and scroll position reset.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-11 09:59:34 -07:00
kelly
efb1e89e33 Merge pull request 'fix: Show only git SHA in header' (#34) from feat/worker-scaling into master
Reviewed-on: https://code.cannabrands.app/Creationshop/dispensary-scraper/pulls/34
2025-12-11 16:56:04 +00:00
Kelly
529c447413 refactor: Move worker scaling to Workers page with password confirmation
- Worker scaling controls now on /workers page only (removed from /tasks)
- Password confirmation required before scaling
- Show only git SHA in header (removed version number)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-11 09:41:58 -07:00
Kelly
1eaf95c06b fix: Show only git SHA in header, remove version number
🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-11 09:35:41 -07:00
kelly
138ed17d8b Merge pull request 'feat: Worker scaling from admin UI with password confirmation' (#33) from feat/worker-scaling into master
Reviewed-on: https://code.cannabrands.app/Creationshop/dispensary-scraper/pulls/33
2025-12-11 16:29:04 +00:00
Kelly
a880c41d89 feat: Add password confirmation for worker scaling + RBAC
- Add /api/auth/verify-password endpoint for re-authentication
- Add PasswordConfirmModal component for sensitive actions
- Worker scaling (+/-) now requires password confirmation
- Add RBAC (ServiceAccount, Role, RoleBinding) for scraper pod
- Scraper pod can now read/scale worker deployment via k8s API

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-11 09:16:27 -07:00
Kelly
2a9ae61dce fix(ui): Make Tasks Dashboard header sticky
🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-11 09:04:21 -07:00
kelly
1f21911fa1 Merge pull request 'feat(admin): Worker scaling controls via k8s API' (#32) from feat/worker-scaling into master
Reviewed-on: https://code.cannabrands.app/Creationshop/dispensary-scraper/pulls/32
2025-12-11 16:01:25 +00:00
Kelly
6f0a58f5d2 fix(k8s): Correct API call signatures for k8s client v1.4
🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-11 08:47:27 -07:00
Kelly
8206dce821 feat(admin): Worker scaling controls via k8s API
- Add /api/k8s/workers endpoint to get deployment status
- Add /api/k8s/workers/scale endpoint to scale replicas (0-50)
- Add worker scaling UI to Tasks Dashboard (+/- 5 workers)
- Shows ready/desired replica count
- Uses in-cluster config in k8s, kubeconfig locally

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-11 08:24:32 -07:00
kelly
ced1afaa8a Merge pull request 'fix(ci): CI config fix + 25 workers + pool starts paused' (#31) from fix/ci-filename into master
Reviewed-on: https://code.cannabrands.app/Creationshop/dispensary-scraper/pulls/31
2025-12-11 08:47:26 +00:00
Kelly
d6c602c567 fix(ui): Remove Cleanup Stale button from Workers page
🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-11 01:29:25 -07:00
Kelly
a252a7fefd feat(tasks): 25 workers, pool starts paused by default
- Increase worker replicas from 5 to 25
- Task pool now starts PAUSED on deploy, admin must click Start Pool
- Prevents workers from grabbing tasks before system is ready

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-11 01:19:02 -07:00
Kelly
83b06c21cc fix(tasks): Stop spinner in status cards when pool is paused
🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-11 01:14:25 -07:00
kelly
f5214da54c Merge pull request 'fix(ci): Fix Woodpecker config - remove invalid top-level when' (#30) from fix/ci-filename into master
Reviewed-on: https://code.cannabrands.app/Creationshop/dispensary-scraper/pulls/30
2025-12-11 08:07:16 +00:00
Kelly
e3d4dd0127 fix(ci): Fix Woodpecker config - remove invalid top-level when
🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-11 00:45:08 -07:00
kelly
d0ee0d72f5 Merge pull request 'feat(tasks): Add task pool start/stop toggle' (#29) from feat/task-pool-toggle into master
Reviewed-on: https://code.cannabrands.app/Creationshop/dispensary-scraper/pulls/29
2025-12-11 07:21:02 +00:00
kelly
521f0550cd Merge pull request 'feat: Admin UI cleanup and dispensary schedule page' (#28) from fix/worker-proxy-wait into master
Reviewed-on: https://code.cannabrands.app/Creationshop/dispensary-scraper/pulls/28
2025-12-11 07:08:03 +00:00
Kelly
8a09691e91 feat(tasks): Add task pool start/stop toggle
- Add task-pool-state.ts for shared pause/resume state
- Add /api/tasks/pool/status, pause, resume endpoints
- Add Start/Stop Pool toggle button to TasksDashboard
- Spinner stops when pool is closed
- Fix is_active column name in store-discovery.ts
- Fix missing active column in task-service.ts claimTask
- Auto-refresh every 15 seconds

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-11 00:07:14 -07:00
Kelly
459ad7d9c9 fix(tasks): Fix missing column errors in task queries
- Change 'active' to 'is_active' in states table query (store-discovery.ts)
- Remove non-existent 'active' column check from worker_tasks query (task-service.ts)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-10 23:54:28 -07:00
Kelly
d102d27731 feat(admin): Dispensary schedule page and UI cleanup
- Add DispensarySchedule page showing crawl history and upcoming schedule
- Add /dispensaries/:state/:city/:slug/schedule route
- Add API endpoint for store crawl history
- Update View Schedule link to use dispensary-specific route
- Remove colored badges from DispensaryDetail product table (plain text)
- Make Details button ghost style in product table
- Add "Sort by States" option to IntelligenceBrands
- Remove status filter dropdown from Dispensaries page

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-10 23:50:47 -07:00
kelly
01810c40a1 Merge pull request 'fix(worker): Wait for proxies instead of crashing' (#27) from fix/worker-proxy-wait into master
Reviewed-on: https://code.cannabrands.app/Creationshop/dispensary-scraper/pulls/27
2025-12-11 06:43:29 +00:00
Kelly
b7d33e1cbf fix(admin): Clean up store detail and intelligence pages
- Remove Update dropdown from DispensaryDetail page
- Remove Crawl Now button from StoreDetailPage
- Change "Last Crawl" to "Last Updated" on both detail pages
- Tone down emerald colors on StoreDetailPage (use gray borders/tabs)
- Simplify THC/CBD/Stock badges to plain text
- Remove duplicate state dropdown from IntelligenceStores filters
- Make store rows clickable in IntelligenceStores

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-10 23:42:50 -07:00
Kelly
5b34b5a78c fix(admin): Consistent navigation across Intelligence pages
- Add state selector dropdown to all three Intelligence pages (Brands, Stores, Pricing)
- Use consistent emerald-styled page navigation badges with current page highlighted
- Remove Refresh buttons from all Intelligence pages
- Update chart styling to use emerald gradient bars (matching Pricing page)
- Load all available states from orchestrator API instead of extracting from local data
- Fix z-index and styling on state dropdown for better visibility

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-10 23:32:57 -07:00
Kelly
c091d2316b fix(dashboard): Remove refresh button and HealthPanel
- Removed refresh button and refreshing state from Dashboard
- Removed HealthPanel component (deploy status auto-refresh)
- Simplified header layout

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-10 23:28:18 -07:00
Kelly
e8862b8a8b fix(national): Remove Refresh Metrics button
Removed unused refresh button and related state/handlers from
National Dashboard.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-10 23:27:01 -07:00
Kelly
1b46ab699d fix(national): Show all states count, not filtered "active" states
The "Active States" metric was arbitrary and confusing. Changed to
show total states count - all states in the system regardless of
whether they have data or not.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-10 23:25:50 -07:00
Kelly
ac1995f63f fix(pricing): Simplify category chart to prevent overflow
- Replace complex price range bars with simple horizontal bars
- Use overflow-hidden to prevent bars extending beyond container
- Calculate bar width as percentage of max avg price
- Limit to top 12 categories for cleaner display
- Fixed-width labels prevent layout shift

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-10 23:24:31 -07:00
Kelly
de93669652 fix(national): Count active states by product data, not crawl status
Active states should count states with actual product data, not just
states where crawling is enabled. A state can have historical data
even if crawling is currently disabled.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-10 23:23:20 -07:00
Kelly
dffc124920 fix(national): Fix active states count and remove StateBadge
- Change active_states to count states with crawl_enabled=true dispensaries
- Filter all national summary queries by crawl_enabled=true
- Remove unused StateBadge from National Dashboard header
- StateBadge was showing "Arizona" with no way to change it

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-10 23:22:19 -07:00
Kelly
932ceb0287 feat(intelligence): Add state filter to all Intelligence pages
- Add state filter to Intelligence Brands API and frontend
- Add state filter to Intelligence Pricing API and frontend
- Add state filter to Intelligence Stores API and frontend
- Fix null safety issues with toLocaleString() calls
- Update backend /stores endpoint to return skuCount, snapshotCount, chainName
- Add overall stats to pricing endpoint

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-10 23:19:54 -07:00
Kelly
824d48fd85 fix: Add curl to Docker, add active flag to worker_tasks
- Install curl in Docker container for Dutchie HTTP requests
- Add 'active' column to worker_tasks (default false) to prevent
  accidental task execution on startup
- Update task-service to only claim tasks where active=true

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-10 23:12:09 -07:00
Kelly
47fdab0382 fix: Filter orchestrator states by crawl_enabled
The states dropdown was showing count of ALL dispensaries instead of
just crawl-enabled ones. Now correctly filters to match the actual
stores that would be displayed.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-10 23:09:04 -07:00
Kelly
ed7ddc6375 ci: Add database migration step to deploy pipeline
Migrations now run automatically before deployments:
1. Build new Docker image
2. Run migrations using the new image
3. Deploy to Kubernetes

Requires new secrets: db_host, db_port, db_name, db_user, db_pass

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-10 23:07:25 -07:00
Kelly
cf06f4a8c0 feat(worker): Listen for proxy_added notifications
- Workers now use PostgreSQL LISTEN/NOTIFY to wake up immediately when proxies are added
- Added trigger on proxies table to NOTIFY 'proxy_added' when active proxy inserted/updated
- Falls back to 30s polling if LISTEN fails

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-10 22:58:00 -07:00
Kelly
a2fa21f65c fix(worker): Wait for proxies instead of crashing on startup
- Task worker now waits up to 60 minutes for active proxies
- Retries every 30 seconds with clear logging
- Updated K8s scraper-worker.yaml with Deployment definition
- Deployment uses task-worker.js entrypoint with correct liveness probe

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-10 22:55:04 -07:00
kelly
61e915968f Merge pull request 'feat(tasks): Refactor task workflow with payload/refresh separation' (#26) from feat/task-workflow-refactor into master 2025-12-11 05:24:11 +00:00
Kelly
4949b22457 feat(tasks): Refactor task workflow with payload/refresh separation
Major changes:
- Split crawl into payload_fetch (API → disk) and product_refresh (disk → DB)
- Add task chaining: store_discovery → product_discovery → payload_fetch → product_refresh
- Add payload storage utilities for gzipped JSON on filesystem
- Add /api/payloads endpoints for payload access and diffing
- Add DB-driven TaskScheduler with schedule persistence
- Track newDispensaryIds through discovery promotion for chaining
- Add stealth improvements: HTTP fingerprinting, proxy rotation enhancements
- Add Workers dashboard K8s scaling controls

New files:
- src/tasks/handlers/payload-fetch.ts - Fetches from API, saves to disk
- src/services/task-scheduler.ts - DB-driven schedule management
- src/utils/payload-storage.ts - Payload save/load utilities
- src/routes/payloads.ts - Payload API endpoints
- src/services/http-fingerprint.ts - Browser fingerprint generation
- docs/TASK_WORKFLOW_2024-12-10.md - Complete workflow documentation

Migrations:
- 078: Proxy consecutive 403 tracking
- 079: task_schedules table
- 080: raw_crawl_payloads table
- 081: payload column and last_fetch_at

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-10 22:15:35 -07:00
Kelly
1fb0eb94c2 security: Add authMiddleware to analytics-v2 routes
- Add authMiddleware to analytics-v2.ts to require authentication
- Add permanent rule #6 to CLAUDE.md: "ALL API ROUTES REQUIRE AUTHENTICATION"
- Add forbidden action #19: "Creating API routes without authMiddleware"
- Document authentication flow and trusted origins

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-10 19:01:44 -07:00
Kelly
9aefb554bc fix: Correct Analytics V2 SQL queries for schema alignment
- Fix JOIN path: store_products -> dispensaries -> states (was incorrectly joining sp.state_id which doesn't exist)
- Fix column names to use *_raw suffixes (category_raw, brand_name_raw, name_raw)
- Fix row mappings to read correct column names from query results
- Add ::timestamp casts for interval arithmetic in StoreAnalyticsService

All Analytics V2 endpoints now work correctly:
- /state/legal-breakdown
- /state/recreational
- /category/all
- /category/rec-vs-med
- /state/:code/summary
- /store/:id/summary

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-10 18:52:57 -07:00
kelly
a4338669a9 Merge pull request 'fix(auth): Prioritize JWT token over trusted origin bypass' (#24) from fix/auth-token-priority into master
Reviewed-on: https://code.cannabrands.app/Creationshop/dispensary-scraper/pulls/24
2025-12-11 01:34:10 +00:00
Kelly
1fa9ea496c fix(auth): Prioritize JWT token over trusted origin bypass
When a user logs in and has a Bearer token, use their actual identity
instead of falling back to internal@system. This ensures logged-in
users see their real email in the admin UI.

Order of auth:
1. If Bearer token provided → use JWT/API token (real user identity)
2. If no token → check trusted origins (for API access like WordPress)
3. Otherwise → 401 unauthorized

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-10 18:21:50 -07:00
kelly
31756a2233 Merge pull request 'chore: Add WordPress plugin v1.6.0 download files' (#23) from chore/wordpress-plugin-downloads into master
Reviewed-on: https://code.cannabrands.app/Creationshop/dispensary-scraper/pulls/23
2025-12-11 00:40:53 +00:00
Kelly
166583621b chore: Add WordPress plugin v1.6.0 download files
🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-10 17:23:25 -07:00
kelly
ca952c4674 Merge pull request 'fix(ci): Use YAML map format for docker-buildx build_args' (#21) from fix/ci-build-args-format into master
Reviewed-on: https://code.cannabrands.app/Creationshop/dispensary-scraper/pulls/21
2025-12-10 23:54:33 +00:00
kelly
4054778b6c Merge pull request 'feat: Add wildcard support for trusted domains' (#20) from fix/trusted-origins-wildcards into master
Reviewed-on: https://code.cannabrands.app/Creationshop/dispensary-scraper/pulls/20
2025-12-10 23:54:11 +00:00
Kelly
56a5f00015 fix(ci): Use YAML map format for docker-buildx build_args
The woodpeckerci/plugin-docker-buildx plugin expects build_args as a
YAML map (key: value), not a list. This was causing build args to not
be passed to the Docker build, resulting in unknown git SHA and build
info in the deployed application.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-10 16:42:05 -07:00
Kelly
a96d50c481 docs(wordpress): Add deprecation comments for legacy shortcode/migration code
Clarifies that crawlsy_* and dutchie_* shortcodes are deprecated aliases
for backward compatibility only. New implementations should use cannaiq_*.

Also documents the token migration logic that preserves old API tokens.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-10 16:24:56 -07:00
kelly
4806212f46 Merge pull request 'fix(ci): Use YAML list format for docker-buildx build_args' (#18) from fix/ci-build-args into master
Reviewed-on: https://code.cannabrands.app/Creationshop/dispensary-scraper/pulls/18
2025-12-10 22:29:41 +00:00
kelly
2486f3c6b2 Merge pull request 'feat(analytics): Add Brand Intelligence API endpoint' (#19) from feat/brand-intelligence-api into master
Reviewed-on: https://code.cannabrands.app/Creationshop/dispensary-scraper/pulls/19
2025-12-10 22:29:26 +00:00
Kelly
f25bebf6ee feat: Add wildcard support for trusted domains
Add *.cannaiq.co and *.cannabrands.app to trusted domains list.
Updated isTrustedDomain() to recognize *.domain.com as wildcard
that matches the base domain and any subdomain.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-10 15:29:23 -07:00
Kelly
22dad6d0fc feat: Add wildcard trusted origins for cannaiq.co and cannabrands.app
Add *.cannaiq.co and *.cannabrands.app patterns to both:
- auth/middleware.ts (admin routes)
- public-api.ts (consumer /api/v1/* routes)

This allows any subdomain of these domains to access the API without
requiring an API key.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-10 15:25:04 -07:00
Kelly
03eab66d35 chore: Bump backend version to 1.6.0
Harmonize backend version with WordPress plugin version so admin UI displays correct version.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-10 15:06:42 -07:00
Kelly
97b1ab23d8 fix(ci): Use YAML list format for docker-buildx build_args
The woodpecker docker-buildx plugin expects build_args as a YAML list,
not a comma-separated string. The previous format resulted in all args
being passed as a single malformed arg with "*=" prefix.

This fix ensures APP_GIT_SHA, APP_BUILD_TIME, etc. are properly passed
to the Dockerfile so the /api/version endpoint returns correct values.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-10 14:56:18 -07:00
Kelly
9fff0ba430 feat(analytics): Add Brand Intelligence API endpoint
New endpoint: GET /api/analytics/v2/brand/:name/intelligence

Returns comprehensive brand analytics payload including:
- Performance snapshot (active SKUs, revenue, stores, market share)
- Alerts (lost stores, delisted SKUs, competitor takeovers)
- SKU performance (velocity, status, stock levels)
- Retail footprint (penetration by region, whitespace opportunities)
- Competitive landscape (price positioning, head-to-head comparisons)
- Inventory health (days of stock, risk levels, overstock alerts)
- Promotion effectiveness (baseline vs promo velocity, lift, ROI)

Supports time windows (7d/30d/90d), state filtering, and category filtering.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-10 14:53:35 -07:00
kelly
7d3e91b2e6 Merge pull request 'feat(wordpress): Add new Elementor widgets and dynamic selectors v1.6.0' (#17) from feat/wordpress-widgets into master 2025-12-10 20:41:44 +00:00
Kelly
74957a9ec5 feat(wordpress): Add new Elementor widgets and dynamic selectors v1.6.0
New Widgets:
- Brand Grid: Display brands in a grid with product counts
- Category List: Show categories in grid/list/pills layouts
- Specials Grid: Display products on sale with discount badges

Enhanced Product Grid Widget:
- Dynamic category dropdown (fetches from API)
- Dynamic brand dropdown (fetches from API)
- "On Special Only" toggle filter

New Plugin Methods:
- fetch_categories() - Get categories from API
- fetch_brands() - Get brands from API
- fetch_specials() - Get products on sale
- get_category_options() - Cached options for Elementor
- get_brand_options() - Cached options for Elementor

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-10 13:41:17 -07:00
kelly
2d035c46cf Merge pull request 'fix: Findagram brands page crash and PWA icon errors' (#16) from fix/findagram-brands-crash into master 2025-12-10 20:11:40 +00:00
Kelly
53445fe72a fix: Findagram brands page crash and PWA icon errors
- Fix mapBrandForUI to use correct 'brand' field from API response
- Add null check in Brands.jsx filter to prevent crash on undefined names
- Fix BrandPenetrationService sps.brand_name -> sps.brand_name_raw
- Remove missing logo192.png and logo512.png from PWA manifest

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-10 13:06:23 -07:00
kelly
37cc8956c5 Merge pull request 'fix: Join states through dispensaries in BrandPenetrationService' (#15) from feat/ci-auto-merge into master 2025-12-10 19:36:06 +00:00
Kelly
197c82f921 fix: Join states through dispensaries in BrandPenetrationService
The store_products table doesn't have a state_id column - must join
through dispensaries to get state info. Also fixed column references
to use brand_name_raw and category_raw.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-10 12:18:10 -07:00
kelly
2c52493a9c Merge pull request 'fix(docker): Use npm install instead of npm ci for reliability' (#14) from feat/ci-auto-merge into master 2025-12-10 18:44:21 +00:00
Kelly
2ee2ba6b8c fix(docker): Use npm install instead of npm ci for reliability
npm ci can fail when package-lock.json has minor mismatches with
package.json. npm install is more forgiving and appropriate for
Docker builds where determinism is less critical than reliability.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-12-10 11:28:29 -07:00
kelly
bafcf1694a Merge pull request 'feat(analytics): Brand promotional history + specials fix + API key editing' (#13) from feat/ci-auto-merge into master
Reviewed-on: https://code.cannabrands.app/Creationshop/dispensary-scraper/pulls/13
2025-12-10 18:12:59 +00:00
Kelly
95792aab15 feat(analytics): Brand promotional history + specials fix + API key editing
- Add brand promotional history endpoint (GET /api/analytics/v2/brand/:name/promotions)
  - Tracks when products go on special, duration, discounts, quantity sold estimates
  - Aggregates by category with frequency metrics (weekly/monthly)
- Add quantity changes endpoint (GET /api/analytics/v2/store/:id/quantity-changes)
  - Filter by direction (increase/decrease/all) for sales vs restock estimation
- Fix canonical-upsert to populate stock_quantity and total_quantity_available
- Add API key edit functionality in admin UI
  - Edit allowed domains and IPs
  - Display domains in list view

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-12-10 10:59:03 -07:00
kelly
38ae2c3a3e Merge pull request 'feat/ci-auto-merge' (#12) from feat/ci-auto-merge into master
Reviewed-on: https://code.cannabrands.app/Creationshop/dispensary-scraper/pulls/12
2025-12-10 17:26:21 +00:00
Kelly
249d3c1b7f fix: Build args format for version info + schema-tolerant routes
CI/CD:
- Fix build_args format in woodpecker CI (comma-separated, not YAML list)
- This fixes "unknown" SHA/version showing on remote deployments

Backend schema-tolerant fixes (graceful fallbacks when tables missing):
- users.ts: Check which columns exist before querying
- worker-registry.ts: Return empty result if table doesn't exist
- task-service.ts: Add tableExists() helper, handle missing tables/views
- proxies.ts: Return totalProxies in test-all response

Frontend fixes:
- Proxies: Use total from response for accurate progress display
- SEO PagesTab: Dim Generate button when no AI provider active

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-10 09:53:21 -07:00
Kelly
9647f94f89 fix: Copy migrations folder to Docker image + fix SQL FILTER syntax
- Dockerfile: Add COPY migrations ./migrations so auto-migrate works on remote
- intelligence.ts: Fix FILTER clause placement in aggregate functions
  - FILTER must be inside AVG(), not wrapping ROUND()
  - Remove redundant FILTER on MIN (already filtered by WHERE)
  - Remove unsupported FILTER on PERCENTILE_CONT

These fixes resolve:
- "Failed to get task counts" (worker_tasks table missing)
- "FILTER specified but round is not an aggregate function" errors
- /national page "column m.state does not exist" (mv_state_metrics missing)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-10 09:38:05 -07:00
Kelly
afc288d2cf feat(ci): Auto-merge PRs after all type checks pass
Uses Gitea API to merge PR automatically when all typecheck jobs succeed.
Requires gitea_token secret in Woodpecker.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-10 09:27:26 -07:00
kelly
df01ce6aad Merge pull request 'feat: Auto-migrations on startup, worker exit location, proxy improvements' (#11) from feat/auto-migrations into master
Reviewed-on: https://code.cannabrands.app/Creationshop/dispensary-scraper/pulls/11
2025-12-10 16:07:17 +00:00
Kelly
aea93bc96b fix(ci): Revert volume caching - may have broken CI trigger 2025-12-10 08:53:10 -07:00
Kelly
4e84f30f8b feat: Auto-retry tasks, 403 proxy rotation, task deletion
- Fix 403 handler to rotate BOTH proxy and fingerprint (was only fingerprint)
- Add auto-retry logic to task service (retry up to max_retries before failing)
- Add error tooltip on task status badge showing retry count and error message
- Add DELETE /api/tasks/:id endpoint (only for non-running tasks)
- Add delete button to JobQueue task table

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-10 08:41:14 -07:00
Kelly
b20a0a4fa5 fix: Add generic delete method to ApiClient + CI speedups
- Add delete<T>() method to ApiClient for WorkersDashboard cleanup
- Add npm cache volume for faster npm ci
- Add TypeScript incremental builds with tsBuildInfoFile cache
- Should significantly speed up repeated CI runs

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-10 08:27:11 -07:00
Kelly
6eb1babc86 feat: Auto-migrations on startup, worker exit location, proxy improvements
- Add auto-migration system that runs SQL files from migrations/ on server startup
- Track applied migrations in schema_migrations table
- Show proxy exit location in Workers dashboard
- Add "Cleanup Stale" button to remove old workers
- Add remove button for individual workers
- Include proxy location (city, state, country) in worker heartbeats
- Update Proxy interface with location fields
- Re-enable bulk proxy import without ON CONFLICT

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-10 08:05:24 -07:00
kelly
9a9c2f76a2 Merge pull request 'feat: Stealth worker system with mandatory proxy rotation' (#10) from feat/stealth-worker-system into master
Reviewed-on: https://code.cannabrands.app/Creationshop/dispensary-scraper/pulls/10
2025-12-10 08:13:42 +00:00
Kelly
56cc171287 feat: Stealth worker system with mandatory proxy rotation
## Worker System
- Role-agnostic workers that can handle any task type
- Pod-based architecture with StatefulSet (5-15 pods, 5 workers each)
- Custom pod names (Aethelgard, Xylos, Kryll, etc.)
- Worker registry with friendly names and resource monitoring
- Hub-and-spoke visualization on JobQueue page

## Stealth & Anti-Detection (REQUIRED)
- Proxies are MANDATORY - workers fail to start without active proxies
- CrawlRotator initializes on worker startup
- Loads proxies from `proxies` table
- Auto-rotates proxy + fingerprint on 403 errors
- 12 browser fingerprints (Chrome, Firefox, Safari, Edge)
- Locale/timezone matching for geographic consistency

## Task System
- Renamed product_resync → product_refresh
- Task chaining: store_discovery → entry_point → product_discovery
- Priority-based claiming with FOR UPDATE SKIP LOCKED
- Heartbeat and stale task recovery

## UI Updates
- JobQueue: Pod visualization, resource monitoring on hover
- WorkersDashboard: Simplified worker list
- Removed unused filters from task list

## Other
- IP2Location service for visitor analytics
- Findagram consumer features scaffolding
- Documentation updates

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-10 00:44:59 -07:00
Kelly
0295637ed6 fix: Public API column mappings and OOS detection
- Fix store_products column references (name_raw, brand_name_raw, category_raw)
- Fix v_product_snapshots column references (crawled_at, *_cents pricing)
- Fix dispensaries column references (zipcode, logo_image, remove hours/amenities)
- Add services and license_type to dispensary API response
- Add consecutive_misses OOS tracking to product-resync handler
- Add migration 075 for consecutive_misses column
- Add CRAWL_PIPELINE.md documentation

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-09 20:44:53 -07:00
Kelly
9c6dd37316 fix(ci): Use YAML list format for docker-buildx build_args
The woodpecker docker-buildx plugin requires build_args as a YAML list,
not a comma-separated string. This fixes the build version/hash not being
passed to the Docker image.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-09 18:03:50 -07:00
kelly
524d13209a Merge pull request 'fix: Remove legacy imports from task handlers' (#9) from fix/task-handler-typescript-errors into master
Reviewed-on: https://code.cannabrands.app/Creationshop/dispensary-scraper/pulls/9
2025-12-10 00:42:39 +00:00
Kelly
9199db3927 fix: Remove legacy imports from task handlers
- Remove non-existent DutchieClient import from product-resync and entry-point-discovery
- Remove non-existent DiscoveryCrawler import from store-discovery
- Use scrapeStore from scraper-v2 for product resync
- Use discoverState from discovery module for store discovery
- Fix Pool type by using getPool() instead of pool wrapper
- Update FullDiscoveryResult property access to use correct field names

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-09 17:25:19 -07:00
Kelly
a0652c7c73 fix(types): Fix TypeScript errors in TasksDashboard, Layout, and Users
- Fix TaskCounts type in api.ts to match TasksDashboard interface
- Make VersionInfo.version optional in Layout.tsx
- Fix boolean type in Users.tsx disabled prop

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-09 17:02:40 -07:00
Kelly
89c262ee20 feat(tasks): Add unified task-based worker architecture
Replace fragmented job systems (job_schedules, dispensary_crawl_jobs, SyncOrchestrator)
with a single unified task queue:

- Add worker_tasks table with atomic task claiming via SELECT FOR UPDATE SKIP LOCKED
- Add TaskService for CRUD, claiming, and capacity metrics
- Add TaskWorker with role-based handlers (resync, discovery, analytics)
- Add /api/tasks endpoints for management and migration from legacy systems
- Add TasksDashboard UI and integrate task counts into main dashboard
- Add comprehensive documentation

Task roles: store_discovery, entry_point_discovery, product_discovery, product_resync, analytics_refresh

Run workers with: WORKER_ROLE=product_resync npx tsx src/tasks/task-worker.ts

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-09 16:27:03 -07:00
Kelly
7f9cf559cf fix(k8s): Update worker deployment to use v2 hydration worker
The old dutchie-az/services/worker.js no longer exists. Workers now use
the hydration pipeline at dist/scripts/run-hydration.js with --loop mode.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-09 15:01:18 -07:00
Kelly
bbe039c868 feat(api): Add job queue management endpoints and fix SQL type errors
- Add GET /api/job-queue/available - list dispensaries available for crawling
- Add GET /api/job-queue/history - get recent job history with results
- Add POST /api/job-queue/enqueue-batch - queue multiple dispensaries at once
- Add POST /api/job-queue/enqueue-state - queue all crawl-enabled dispensaries for a state
- Add POST /api/job-queue/clear-pending - clear pending jobs with optional filters
- Fix SQL parameter type errors by adding explicit casts ($2::text, $3::integer)
- Fix route ordering to prevent /:id from matching /available and /history

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-09 14:10:55 -07:00
Kelly
4e5c09a2a5 chore(dashboard): Remove DeployStatus block
Version info already shown in Layout sidebar header.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-09 14:10:22 -07:00
Kelly
7f65598332 feat(admin): Show version info at top of sidebar
- Add package.json version to /api/version endpoint
- Move version display from footer to top (next to logo)
- Show format: v1.5.1 (abc1234) - 12/9/2024

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-09 13:58:36 -07:00
Kelly
75315ed91e fix(ci): Use comma-separated build_args for docker-buildx plugin
The docker-buildx plugin expects build_args as a comma-separated string,
not a YAML list. This should fix the build_sha/build_time being null.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-09 13:56:37 -07:00
Kelly
7fe7d17b43 fix(consumer): Use relative API URLs for findadispo/findagram
The consumer frontends were hardcoded to use cannaiq.co as the API
URL, but each domain has its own /api path in the ingress that routes
to the shared backend. Using relative URLs allows each site to make
API calls to its own domain.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-09 13:38:10 -07:00
Kelly
7e517b5801 ci: Use self-hosted base images to avoid Docker Hub rate limits
Cached node:20, node:20-slim, and nginx:alpine to code.cannabrands.app.
No more Docker Hub dependency for builds.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-09 13:07:21 -07:00
Kelly
38ba9021d1 ci: Retry build (Docker Hub rate limit) 2025-12-09 12:58:36 -07:00
Kelly
ddebad48d3 ci: Remove auto-migrations from deploy step
Database was restored from backup - no migrations needed.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-09 12:52:04 -07:00
Kelly
1cebf2e296 fix(health): Add build_sha and build_time to health endpoint
Reads APP_GIT_SHA and APP_BUILD_TIME env vars set during Docker build.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-09 12:22:52 -07:00
Kelly
1d6e67d837 feat(api): Add store metrics endpoints with localhost bypass
New public API v1 endpoints for third-party integrations:
- GET /api/v1/stores/:id/metrics - Store performance metrics
- GET /api/v1/stores/:id/product-metrics - Product-level price changes
- GET /api/v1/stores/:id/competitor-snapshot - Competitive intelligence

Also adds localhost IP bypass for local development testing.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-09 12:14:13 -07:00
Kelly
cfb4b6e4ce fix(cannaiq): Fix TypeScript error in DeployStatus component
Properly destructure api.get response which returns { data: T }

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-09 12:08:29 -07:00
Kelly
f418c403d6 feat(auth): Add *.cannabrands.app to trusted origins whitelist
Adds pattern-based origin matching to support wildcard subdomains.
All *.cannabrands.app origins now bypass API key authentication.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-09 12:06:14 -07:00
Kelly
be4221af46 ci: Retrigger build 2025-12-09 11:54:16 -07:00
Kelly
ca07606b05 feat(k8s): Add Redis deployment for production
- Add k8s/redis.yaml with Redis 7 Alpine deployment
- Add REDIS_HOST and REDIS_PORT to configmap
- Redis configured with 200MB max memory and LRU eviction
- 1GB persistent volume for data persistence

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-09 11:40:11 -07:00
Kelly
baf1bf2eb7 fix(health): Require Redis in production, optional in local
Redis health check now returns error status when not configured in
production/staging environments, but remains optional in local dev.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-09 11:38:49 -07:00
Kelly
4ef3a8d72b fix(build): Fix TypeScript errors breaking CI build
- Add missing 'original' property to LocalImageSizes in brand logo download
- Remove test scripts with type errors (test-image-download.ts, test-stealth-with-db.ts)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-09 11:36:28 -07:00
Kelly
09dd756eff feat(admin): Add deploy status panel to dashboard
Shows running version vs latest git commit, pipeline status with steps,
and how many commits behind if not on latest. Uses Woodpecker and Gitea
APIs to fetch CI/CD information. Auto-refreshes every 30 seconds.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-09 11:26:41 -07:00
Kelly
ec8ef6210c ci: Run migrations inside K8s cluster after deploy
DB is internal to the cluster, so migrations must run via kubectl exec
into the scraper pod after deployment completes.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-09 11:16:21 -07:00
Kelly
a9b7a4d7a9 ci: Add proper SQL migration runner with tracking
- Creates run-migrations.ts that reads migrations/*.sql files
- Tracks applied migrations in schema_migrations table by filename
- Handles existing version-based schema by adding filename column
- CI now runs migrations before deploy

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-09 11:12:50 -07:00
Kelly
5119d5ccf9 ci: Add migration step before deploy
Migrations now run automatically after Docker builds but before K8s deploy.
Requires DATABASE_URL secret to be configured in Woodpecker.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-09 11:09:49 -07:00
Kelly
91efd1d03d feat(images): Add local image storage with on-demand resizing
- Store product images locally with hierarchy: /images/products/<state>/<store>/<brand>/<product>/
- Add /img/* proxy endpoint for on-demand resizing via Sharp
- Implement per-product image checking to skip existing downloads
- Fix pathToUrl() to correctly generate /images/... URLs
- Add frontend getImageUrl() helper with preset sizes (thumb, medium, large)
- Update all product pages to use optimized image URLs
- Add stealth session support for Dutchie GraphQL crawls
- Include test scripts for crawl and image verification

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-09 11:04:50 -07:00
Kelly
aa776226b0 fix(consumer): Wire findagram/findadispo to public API
- Update Dockerfiles to use cannaiq.co as API base URL
- Change findagram API client from /api/az to /api/v1 endpoints
- Add trusted origin bypass in public-api middleware for consumer sites
- Consumer sites (findagram.co, findadispo.com) can now access /api/v1
  endpoints without API key authentication

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-09 11:04:50 -07:00
kelly
e9435150e9 Merge pull request 'feature/wp-plugin-versioning-and-fixes' (#7) from feature/wp-plugin-versioning-and-fixes into master 2025-12-09 17:15:33 +00:00
Kelly
d399b966e6 ci: trigger build 2025-12-09 10:03:29 -07:00
227 changed files with 29961 additions and 5329 deletions

View File

@@ -1,12 +1,9 @@
when:
- event: [push, pull_request]
steps:
# ===========================================
# PR VALIDATION: Parallel type checks (PRs only)
# ===========================================
typecheck-backend:
image: node:20
image: code.cannabrands.app/creationshop/node:20
commands:
- cd backend
- npm ci --prefer-offline
@@ -16,7 +13,7 @@ steps:
event: pull_request
typecheck-cannaiq:
image: node:20
image: code.cannabrands.app/creationshop/node:20
commands:
- cd cannaiq
- npm ci --prefer-offline
@@ -26,7 +23,7 @@ steps:
event: pull_request
typecheck-findadispo:
image: node:20
image: code.cannabrands.app/creationshop/node:20
commands:
- cd findadispo/frontend
- npm ci --prefer-offline
@@ -36,7 +33,7 @@ steps:
event: pull_request
typecheck-findagram:
image: node:20
image: code.cannabrands.app/creationshop/node:20
commands:
- cd findagram/frontend
- npm ci --prefer-offline
@@ -45,8 +42,34 @@ steps:
when:
event: pull_request
# ===========================================
# AUTO-MERGE: Merge PR after all checks pass
# ===========================================
auto-merge:
image: alpine:latest
environment:
GITEA_TOKEN:
from_secret: gitea_token
commands:
- apk add --no-cache curl
- |
echo "Merging PR #${CI_COMMIT_PULL_REQUEST}..."
curl -s -X POST \
-H "Authorization: token $GITEA_TOKEN" \
-H "Content-Type: application/json" \
-d '{"Do":"merge"}' \
"https://code.cannabrands.app/api/v1/repos/Creationshop/dispensary-scraper/pulls/${CI_COMMIT_PULL_REQUEST}/merge"
depends_on:
- typecheck-backend
- typecheck-cannaiq
- typecheck-findadispo
- typecheck-findagram
when:
event: pull_request
# ===========================================
# MASTER DEPLOY: Parallel Docker builds
# NOTE: cache_from/cache_to removed due to plugin bug splitting on commas
# ===========================================
docker-backend:
image: woodpeckerci/plugin-docker-buildx
@@ -65,10 +88,10 @@ steps:
platforms: linux/amd64
provenance: false
build_args:
- APP_BUILD_VERSION=${CI_COMMIT_SHA:0:8}
- APP_GIT_SHA=${CI_COMMIT_SHA}
- APP_BUILD_TIME=${CI_PIPELINE_CREATED}
- CONTAINER_IMAGE_TAG=${CI_COMMIT_SHA:0:8}
APP_BUILD_VERSION: ${CI_COMMIT_SHA:0:8}
APP_GIT_SHA: ${CI_COMMIT_SHA}
APP_BUILD_TIME: ${CI_PIPELINE_CREATED}
CONTAINER_IMAGE_TAG: ${CI_COMMIT_SHA:0:8}
depends_on: []
when:
branch: master
@@ -138,7 +161,7 @@ steps:
event: push
# ===========================================
# STAGE 3: Deploy (after all Docker builds)
# STAGE 3: Deploy and Run Migrations
# ===========================================
deploy:
image: bitnami/kubectl:latest
@@ -149,12 +172,17 @@ steps:
- mkdir -p ~/.kube
- echo "$KUBECONFIG_CONTENT" | tr -d '[:space:]' | base64 -d > ~/.kube/config
- chmod 600 ~/.kube/config
# Deploy backend first
- kubectl set image deployment/scraper scraper=code.cannabrands.app/creationshop/dispensary-scraper:${CI_COMMIT_SHA:0:8} -n dispensary-scraper
- kubectl rollout status deployment/scraper -n dispensary-scraper --timeout=300s
# Note: Migrations run automatically at startup via auto-migrate
# Deploy remaining services
# Resilience: ensure workers are scaled up if at 0
- REPLICAS=$(kubectl get deployment scraper-worker -n dispensary-scraper -o jsonpath='{.spec.replicas}'); if [ "$REPLICAS" = "0" ]; then echo "Scaling workers from 0 to 5"; kubectl scale deployment/scraper-worker --replicas=5 -n dispensary-scraper; fi
- kubectl set image deployment/scraper-worker worker=code.cannabrands.app/creationshop/dispensary-scraper:${CI_COMMIT_SHA:0:8} -n dispensary-scraper
- kubectl set image deployment/cannaiq-frontend cannaiq-frontend=code.cannabrands.app/creationshop/cannaiq-frontend:${CI_COMMIT_SHA:0:8} -n dispensary-scraper
- kubectl set image deployment/findadispo-frontend findadispo-frontend=code.cannabrands.app/creationshop/findadispo-frontend:${CI_COMMIT_SHA:0:8} -n dispensary-scraper
- kubectl set image deployment/findagram-frontend findagram-frontend=code.cannabrands.app/creationshop/findagram-frontend:${CI_COMMIT_SHA:0:8} -n dispensary-scraper
- kubectl rollout status deployment/scraper -n dispensary-scraper --timeout=300s
- kubectl rollout status deployment/cannaiq-frontend -n dispensary-scraper --timeout=120s
depends_on:
- docker-backend

191
.woodpecker/ci.yml Normal file
View File

@@ -0,0 +1,191 @@
steps:
# ===========================================
# PR VALIDATION: Only typecheck changed projects
# ===========================================
typecheck-backend:
image: code.cannabrands.app/creationshop/node:20
commands:
- npm config set cache /npm-cache/backend --global
- cd backend
- npm ci --prefer-offline
- npx tsc --noEmit
volumes:
- npm-cache:/npm-cache
depends_on: []
when:
event: pull_request
path:
include: ['backend/**']
typecheck-cannaiq:
image: code.cannabrands.app/creationshop/node:20
commands:
- npm config set cache /npm-cache/cannaiq --global
- cd cannaiq
- npm ci --prefer-offline
- npx tsc --noEmit
volumes:
- npm-cache:/npm-cache
depends_on: []
when:
event: pull_request
path:
include: ['cannaiq/**']
# findadispo/findagram typechecks skipped - they have || true anyway
# ===========================================
# AUTO-MERGE: Merge PR after all checks pass
# ===========================================
auto-merge:
image: alpine:latest
environment:
GITEA_TOKEN:
from_secret: gitea_token
commands:
- apk add --no-cache curl
- |
echo "Merging PR #${CI_COMMIT_PULL_REQUEST}..."
curl -s -X POST \
-H "Authorization: token $GITEA_TOKEN" \
-H "Content-Type: application/json" \
-d '{"Do":"merge"}' \
"https://code.cannabrands.app/api/v1/repos/Creationshop/dispensary-scraper/pulls/${CI_COMMIT_PULL_REQUEST}/merge"
depends_on:
- typecheck-backend
- typecheck-cannaiq
when:
event: pull_request
# ===========================================
# MASTER DEPLOY: Parallel Docker builds
# ===========================================
docker-backend:
image: woodpeckerci/plugin-docker-buildx
settings:
registry: code.cannabrands.app
repo: code.cannabrands.app/creationshop/dispensary-scraper
tags:
- latest
- ${CI_COMMIT_SHA:0:8}
dockerfile: backend/Dockerfile
context: backend
username:
from_secret: registry_username
password:
from_secret: registry_password
platforms: linux/amd64
provenance: false
cache_from: type=registry,ref=code.cannabrands.app/creationshop/dispensary-scraper:cache
cache_to: type=registry,ref=code.cannabrands.app/creationshop/dispensary-scraper:cache,mode=max
build_args:
APP_BUILD_VERSION: ${CI_COMMIT_SHA:0:8}
APP_GIT_SHA: ${CI_COMMIT_SHA}
APP_BUILD_TIME: ${CI_PIPELINE_CREATED}
CONTAINER_IMAGE_TAG: ${CI_COMMIT_SHA:0:8}
depends_on: []
when:
branch: master
event: push
docker-cannaiq:
image: woodpeckerci/plugin-docker-buildx
settings:
registry: code.cannabrands.app
repo: code.cannabrands.app/creationshop/cannaiq-frontend
tags:
- latest
- ${CI_COMMIT_SHA:0:8}
dockerfile: cannaiq/Dockerfile
context: cannaiq
username:
from_secret: registry_username
password:
from_secret: registry_password
platforms: linux/amd64
provenance: false
cache_from: type=registry,ref=code.cannabrands.app/creationshop/cannaiq-frontend:cache
cache_to: type=registry,ref=code.cannabrands.app/creationshop/cannaiq-frontend:cache,mode=max
depends_on: []
when:
branch: master
event: push
docker-findadispo:
image: woodpeckerci/plugin-docker-buildx
settings:
registry: code.cannabrands.app
repo: code.cannabrands.app/creationshop/findadispo-frontend
tags:
- latest
- ${CI_COMMIT_SHA:0:8}
dockerfile: findadispo/frontend/Dockerfile
context: findadispo/frontend
username:
from_secret: registry_username
password:
from_secret: registry_password
platforms: linux/amd64
provenance: false
cache_from: type=registry,ref=code.cannabrands.app/creationshop/findadispo-frontend:cache
cache_to: type=registry,ref=code.cannabrands.app/creationshop/findadispo-frontend:cache,mode=max
depends_on: []
when:
branch: master
event: push
docker-findagram:
image: woodpeckerci/plugin-docker-buildx
settings:
registry: code.cannabrands.app
repo: code.cannabrands.app/creationshop/findagram-frontend
tags:
- latest
- ${CI_COMMIT_SHA:0:8}
dockerfile: findagram/frontend/Dockerfile
context: findagram/frontend
username:
from_secret: registry_username
password:
from_secret: registry_password
platforms: linux/amd64
provenance: false
cache_from: type=registry,ref=code.cannabrands.app/creationshop/findagram-frontend:cache
cache_to: type=registry,ref=code.cannabrands.app/creationshop/findagram-frontend:cache,mode=max
depends_on: []
when:
branch: master
event: push
# ===========================================
# STAGE 3: Deploy and Run Migrations
# ===========================================
deploy:
image: bitnami/kubectl:latest
environment:
KUBECONFIG_CONTENT:
from_secret: kubeconfig_data
commands:
- mkdir -p ~/.kube
- echo "$KUBECONFIG_CONTENT" | tr -d '[:space:]' | base64 -d > ~/.kube/config
- chmod 600 ~/.kube/config
# Deploy backend first
- kubectl set image deployment/scraper scraper=code.cannabrands.app/creationshop/dispensary-scraper:${CI_COMMIT_SHA:0:8} -n dispensary-scraper
- kubectl rollout status deployment/scraper -n dispensary-scraper --timeout=300s
# Note: Migrations run automatically at startup via auto-migrate
# Deploy remaining services
# Resilience: ensure workers are scaled up if at 0
- REPLICAS=$(kubectl get deployment scraper-worker -n dispensary-scraper -o jsonpath='{.spec.replicas}'); if [ "$REPLICAS" = "0" ]; then echo "Scaling workers from 0 to 5"; kubectl scale deployment/scraper-worker --replicas=5 -n dispensary-scraper; fi
- kubectl set image deployment/scraper-worker worker=code.cannabrands.app/creationshop/dispensary-scraper:${CI_COMMIT_SHA:0:8} -n dispensary-scraper
- kubectl set image deployment/cannaiq-frontend cannaiq-frontend=code.cannabrands.app/creationshop/cannaiq-frontend:${CI_COMMIT_SHA:0:8} -n dispensary-scraper
- kubectl set image deployment/findadispo-frontend findadispo-frontend=code.cannabrands.app/creationshop/findadispo-frontend:${CI_COMMIT_SHA:0:8} -n dispensary-scraper
- kubectl set image deployment/findagram-frontend findagram-frontend=code.cannabrands.app/creationshop/findagram-frontend:${CI_COMMIT_SHA:0:8} -n dispensary-scraper
- kubectl rollout status deployment/cannaiq-frontend -n dispensary-scraper --timeout=120s
depends_on:
- docker-backend
- docker-cannaiq
- docker-findadispo
- docker-findagram
when:
branch: master
event: push

1337
CLAUDE.md

File diff suppressed because it is too large Load Diff

3
backend/.gitignore vendored Normal file
View File

@@ -0,0 +1,3 @@
# IP2Location database (downloaded separately)
data/ip2location/

View File

@@ -1,17 +1,17 @@
# Build stage
# Image: code.cannabrands.app/creationshop/dispensary-scraper
FROM node:20-slim AS builder
FROM code.cannabrands.app/creationshop/node:20-slim AS builder
WORKDIR /app
COPY package*.json ./
RUN npm ci
RUN npm install
COPY . .
RUN npm run build
# Production stage
FROM node:20-slim
FROM code.cannabrands.app/creationshop/node:20-slim
# Build arguments for version info
ARG APP_BUILD_VERSION=dev
@@ -25,8 +25,9 @@ ENV APP_GIT_SHA=${APP_GIT_SHA}
ENV APP_BUILD_TIME=${APP_BUILD_TIME}
ENV CONTAINER_IMAGE_TAG=${CONTAINER_IMAGE_TAG}
# Install Chromium dependencies
# Install Chromium dependencies and curl for HTTP requests
RUN apt-get update && apt-get install -y \
curl \
chromium \
fonts-liberation \
libnss3 \
@@ -43,10 +44,13 @@ ENV PUPPETEER_EXECUTABLE_PATH=/usr/bin/chromium
WORKDIR /app
COPY package*.json ./
RUN npm ci --omit=dev
RUN npm install --omit=dev
COPY --from=builder /app/dist ./dist
# Copy migrations for auto-migrate on startup
COPY migrations ./migrations
# Create local images directory for when MinIO is not configured
RUN mkdir -p /app/public/images/products

View File

@@ -0,0 +1,218 @@
# CannaiQ Backend Codebase Map
**Last Updated:** 2025-12-12
**Purpose:** Help Claude and developers understand which code is current vs deprecated
---
## Quick Reference: What to Use
### For Crawling/Scraping
| Task | Use This | NOT This |
|------|----------|----------|
| Fetch products | `src/tasks/handlers/payload-fetch.ts` | `src/hydration/*` |
| Process products | `src/tasks/handlers/product-refresh.ts` | `src/scraper-v2/*` |
| GraphQL client | `src/platforms/dutchie/client.ts` | `src/dutchie-az/services/graphql-client.ts` |
| Worker system | `src/tasks/task-worker.ts` | `src/dutchie-az/services/worker.ts` |
### For Database
| Task | Use This | NOT This |
|------|----------|----------|
| Get DB pool | `src/db/pool.ts` | `src/dutchie-az/db/connection.ts` |
| Run migrations | `src/db/migrate.ts` (CLI only) | Never import at runtime |
| Query products | `store_products` table | `products`, `dutchie_products` |
| Query stores | `dispensaries` table | `stores` table |
### For Discovery
| Task | Use This |
|------|----------|
| Discover stores | `src/discovery/*.ts` |
| Run discovery | `npx tsx src/scripts/run-discovery.ts` |
---
## Directory Status
### ACTIVE DIRECTORIES (Use These)
```
src/
├── auth/ # JWT/session auth, middleware
├── db/ # Database pool, migrations
├── discovery/ # Dutchie store discovery pipeline
├── middleware/ # Express middleware
├── multi-state/ # Multi-state query support
├── platforms/ # Platform-specific clients (Dutchie, Jane, etc)
│ └── dutchie/ # THE Dutchie client - use this one
├── routes/ # Express API routes
├── services/ # Core services (logger, scheduler, etc)
├── tasks/ # Task system (workers, handlers, scheduler)
│ └── handlers/ # Task handlers (payload_fetch, product_refresh, etc)
├── types/ # TypeScript types
└── utils/ # Utilities (storage, image processing)
```
### DEPRECATED DIRECTORIES (DO NOT USE)
```
src/
├── hydration/ # DEPRECATED - Old pipeline approach
├── scraper-v2/ # DEPRECATED - Old scraper engine
├── canonical-hydration/# DEPRECATED - Merged into tasks/handlers
├── dutchie-az/ # PARTIAL - Some parts deprecated, some active
│ ├── db/ # DEPRECATED - Use src/db/pool.ts
│ └── services/ # PARTIAL - worker.ts still runs, graphql-client.ts deprecated
├── portals/ # FUTURE - Not yet implemented
├── seo/ # PARTIAL - Settings work, templates WIP
└── system/ # DEPRECATED - Old orchestration system
```
### DEPRECATED FILES (DO NOT USE)
```
src/dutchie-az/db/connection.ts # Use src/db/pool.ts instead
src/dutchie-az/services/graphql-client.ts # Use src/platforms/dutchie/client.ts
src/hydration/*.ts # Entire directory deprecated
src/scraper-v2/*.ts # Entire directory deprecated
```
---
## Key Files Reference
### Entry Points
| File | Purpose | Status |
|------|---------|--------|
| `src/index.ts` | Main Express server | ACTIVE |
| `src/dutchie-az/services/worker.ts` | Worker process entry | ACTIVE |
| `src/tasks/task-worker.ts` | Task worker (new system) | ACTIVE |
### Dutchie Integration
| File | Purpose | Status |
|------|---------|--------|
| `src/platforms/dutchie/client.ts` | GraphQL client, hashes, curl | **PRIMARY** |
| `src/platforms/dutchie/queries.ts` | High-level query functions | ACTIVE |
| `src/platforms/dutchie/index.ts` | Re-exports | ACTIVE |
### Task Handlers
| File | Purpose | Status |
|------|---------|--------|
| `src/tasks/handlers/payload-fetch.ts` | Fetch products from Dutchie | **PRIMARY** |
| `src/tasks/handlers/product-refresh.ts` | Process payload into DB | **PRIMARY** |
| `src/tasks/handlers/menu-detection.ts` | Detect menu type | ACTIVE |
| `src/tasks/handlers/id-resolution.ts` | Resolve platform IDs | ACTIVE |
| `src/tasks/handlers/image-download.ts` | Download product images | ACTIVE |
### Database
| File | Purpose | Status |
|------|---------|--------|
| `src/db/pool.ts` | Canonical DB pool | **PRIMARY** |
| `src/db/migrate.ts` | Migration runner (CLI only) | CLI ONLY |
| `src/db/auto-migrate.ts` | Auto-run migrations on startup | ACTIVE |
### Configuration
| File | Purpose | Status |
|------|---------|--------|
| `.env` | Environment variables | ACTIVE |
| `package.json` | Dependencies | ACTIVE |
| `tsconfig.json` | TypeScript config | ACTIVE |
---
## GraphQL Hashes (CRITICAL)
The correct hashes are in `src/platforms/dutchie/client.ts`:
```typescript
export const GRAPHQL_HASHES = {
FilteredProducts: 'ee29c060826dc41c527e470e9ae502c9b2c169720faa0a9f5d25e1b9a530a4a0',
GetAddressBasedDispensaryData: '13461f73abf7268770dfd05fe7e10c523084b2bb916a929c08efe3d87531977b',
ConsumerDispensaries: '0a5bfa6ca1d64ae47bcccb7c8077c87147cbc4e6982c17ceec97a2a4948b311b',
GetAllCitiesByState: 'ae547a0466ace5a48f91e55bf6699eacd87e3a42841560f0c0eabed5a0a920e6',
};
```
**ALWAYS** use `Status: 'Active'` for FilteredProducts (not `null` or `'All'`).
---
## Scripts Reference
### Useful Scripts (in `src/scripts/`)
| Script | Purpose |
|--------|---------|
| `run-discovery.ts` | Run Dutchie discovery |
| `crawl-single-store.ts` | Test crawl a single store |
| `test-dutchie-graphql.ts` | Test GraphQL queries |
### One-Off Scripts (probably don't need)
| Script | Purpose |
|--------|---------|
| `harmonize-az-dispensaries.ts` | One-time data cleanup |
| `bootstrap-stores-for-dispensaries.ts` | One-time migration |
| `backfill-*.ts` | Historical backfill scripts |
---
## API Routes
### Active Routes (in `src/routes/`)
| Route File | Mount Point | Purpose |
|------------|-------------|---------|
| `auth.ts` | `/api/auth` | Login/logout/session |
| `stores.ts` | `/api/stores` | Store CRUD |
| `dashboard.ts` | `/api/dashboard` | Dashboard stats |
| `workers.ts` | `/api/workers` | Worker monitoring |
| `pipeline.ts` | `/api/pipeline` | Crawl triggers |
| `discovery.ts` | `/api/discovery` | Discovery management |
| `analytics.ts` | `/api/analytics` | Analytics queries |
| `wordpress.ts` | `/api/v1/wordpress` | WordPress plugin API |
---
## Documentation Files
### Current Docs (in `backend/docs/`)
| Doc | Purpose | Currency |
|-----|---------|----------|
| `TASK_WORKFLOW_2024-12-10.md` | Task system architecture | CURRENT |
| `WORKER_TASK_ARCHITECTURE.md` | Worker/task design | CURRENT |
| `CRAWL_PIPELINE.md` | Crawl pipeline overview | CURRENT |
| `ORGANIC_SCRAPING_GUIDE.md` | Browser-based scraping | CURRENT |
| `CODEBASE_MAP.md` | This file | CURRENT |
| `ANALYTICS_V2_EXAMPLES.md` | Analytics API examples | CURRENT |
| `BRAND_INTELLIGENCE_API.md` | Brand API docs | CURRENT |
### Root Docs
| Doc | Purpose | Currency |
|-----|---------|----------|
| `CLAUDE.md` | Claude instructions | **PRIMARY** |
| `README.md` | Project overview | NEEDS UPDATE |
---
## Common Mistakes to Avoid
1. **Don't use `src/hydration/`** - It's an old approach that was superseded by the task system
2. **Don't use `src/dutchie-az/db/connection.ts`** - Use `src/db/pool.ts` instead
3. **Don't import `src/db/migrate.ts` at runtime** - It will crash. Only use for CLI migrations.
4. **Don't query `stores` table** - It's empty. Use `dispensaries`.
5. **Don't query `products` table** - It's empty. Use `store_products`.
6. **Don't use wrong GraphQL hash** - Always get hash from `GRAPHQL_HASHES` in client.ts
7. **Don't use `Status: null`** - It returns 0 products. Use `Status: 'Active'`.
---
## When in Doubt
1. Check if the file is imported in `src/index.ts` - if not, it may be deprecated
2. Check the last modified date - older files may be stale
3. Look for `DEPRECATED` comments in the code
4. Ask: "Is there a newer version of this in `src/tasks/` or `src/platforms/`?"
5. Read the relevant doc in `docs/` before modifying code

View File

@@ -0,0 +1,394 @@
# Brand Intelligence API
## Endpoint
```
GET /api/analytics/v2/brand/:name/intelligence
```
## Query Parameters
| Param | Type | Default | Description |
|-------|------|---------|-------------|
| `window` | `7d\|30d\|90d` | `30d` | Time window for trend calculations |
| `state` | string | - | Filter by state code (e.g., `AZ`) |
| `category` | string | - | Filter by category (e.g., `Flower`) |
## Response Payload Schema
```typescript
interface BrandIntelligenceResult {
brand_name: string;
window: '7d' | '30d' | '90d';
generated_at: string; // ISO timestamp when data was computed
performance_snapshot: PerformanceSnapshot;
alerts: Alerts;
sku_performance: SkuPerformance[];
retail_footprint: RetailFootprint;
competitive_landscape: CompetitiveLandscape;
inventory_health: InventoryHealth;
promo_performance: PromoPerformance;
}
```
---
## Section 1: Performance Snapshot
Summary cards with key brand metrics.
```typescript
interface PerformanceSnapshot {
active_skus: number; // Total products in catalog
total_revenue_30d: number | null; // Estimated from qty × price
total_stores: number; // Active retail partners
new_stores_30d: number; // New distribution in window
market_share: number | null; // % of category SKUs
avg_wholesale_price: number | null;
price_position: 'premium' | 'value' | 'competitive';
}
```
**UI Label Mapping:**
| Field | User-Facing Label | Helper Text |
|-------|-------------------|-------------|
| `active_skus` | Active Products | X total in catalog |
| `total_revenue_30d` | Monthly Revenue | Estimated from sales |
| `total_stores` | Retail Distribution | Active retail partners |
| `new_stores_30d` | New Opportunities | X new in last 30 days |
| `market_share` | Category Position | % of category |
| `avg_wholesale_price` | Avg Wholesale | Per unit |
| `price_position` | Pricing Tier | Premium/Value/Market Rate |
---
## Section 2: Alerts
Issues requiring attention.
```typescript
interface Alerts {
lost_stores_30d_count: number;
lost_skus_30d_count: number;
competitor_takeover_count: number;
avg_oos_duration_days: number | null;
avg_reorder_lag_days: number | null;
items: AlertItem[];
}
interface AlertItem {
type: 'lost_store' | 'delisted_sku' | 'shelf_loss' | 'extended_oos';
severity: 'critical' | 'warning';
store_name?: string;
product_name?: string;
competitor_brand?: string;
days_since?: number;
state_code?: string;
}
```
**UI Label Mapping:**
| Field | User-Facing Label |
|-------|-------------------|
| `lost_stores_30d_count` | Accounts at Risk |
| `lost_skus_30d_count` | Delisted SKUs |
| `competitor_takeover_count` | Shelf Losses |
| `avg_oos_duration_days` | Avg Stockout Length |
| `avg_reorder_lag_days` | Avg Restock Time |
| `severity: critical` | Urgent |
| `severity: warning` | Watch |
---
## Section 3: SKU Performance (Product Velocity)
How fast each SKU sells.
```typescript
interface SkuPerformance {
store_product_id: number;
product_name: string;
category: string | null;
daily_velocity: number; // Units/day estimate
velocity_status: 'hot' | 'steady' | 'slow' | 'stale';
retail_price: number | null;
on_sale: boolean;
stores_carrying: number;
stock_status: 'in_stock' | 'low_stock' | 'out_of_stock';
}
```
**UI Label Mapping:**
| Field | User-Facing Label |
|-------|-------------------|
| `daily_velocity` | Daily Rate |
| `velocity_status` | Momentum |
| `velocity_status: hot` | Hot |
| `velocity_status: steady` | Steady |
| `velocity_status: slow` | Slow |
| `velocity_status: stale` | Stale |
| `retail_price` | Retail Price |
| `on_sale` | Promo (badge) |
**Velocity Thresholds:**
- `hot`: >= 5 units/day
- `steady`: >= 1 unit/day
- `slow`: >= 0.1 units/day
- `stale`: < 0.1 units/day
---
## Section 4: Retail Footprint
Store placement and coverage.
```typescript
interface RetailFootprint {
total_stores: number;
in_stock_count: number;
out_of_stock_count: number;
penetration_by_region: RegionPenetration[];
whitespace_stores: WhitespaceStore[];
}
interface RegionPenetration {
state_code: string;
store_count: number;
percent_reached: number; // % of state's dispensaries
in_stock: number;
out_of_stock: number;
}
interface WhitespaceStore {
store_id: number;
store_name: string;
state_code: string;
city: string | null;
category_fit: number; // How many competing brands they carry
competitor_brands: string[];
}
```
**UI Label Mapping:**
| Field | User-Facing Label |
|-------|-------------------|
| `penetration_by_region` | Market Coverage by Region |
| `percent_reached` | X% reached |
| `in_stock` | X stocked |
| `out_of_stock` | X out |
| `whitespace_stores` | Expansion Opportunities |
| `category_fit` | X fit |
---
## Section 5: Competitive Landscape
Market positioning vs competitors.
```typescript
interface CompetitiveLandscape {
brand_price_position: 'premium' | 'value' | 'competitive';
market_share_trend: MarketSharePoint[];
competitors: Competitor[];
head_to_head_skus: HeadToHead[];
}
interface MarketSharePoint {
date: string;
share_percent: number;
}
interface Competitor {
brand_name: string;
store_overlap_percent: number;
price_position: 'premium' | 'value' | 'competitive';
avg_price: number | null;
sku_count: number;
}
interface HeadToHead {
product_name: string;
brand_price: number;
competitor_brand: string;
competitor_price: number;
price_diff_percent: number;
}
```
**UI Label Mapping:**
| Field | User-Facing Label |
|-------|-------------------|
| `price_position: premium` | Premium Tier |
| `price_position: value` | Value Leader |
| `price_position: competitive` | Market Rate |
| `market_share_trend` | Share of Shelf Trend |
| `head_to_head_skus` | Price Comparison |
| `store_overlap_percent` | X% store overlap |
---
## Section 6: Inventory Health
Stock projections and risk levels.
```typescript
interface InventoryHealth {
critical_count: number; // <7 days stock
warning_count: number; // 7-14 days stock
healthy_count: number; // 14-90 days stock
overstocked_count: number; // >90 days stock
skus: InventorySku[];
overstock_alert: OverstockItem[];
}
interface InventorySku {
store_product_id: number;
product_name: string;
store_name: string;
days_of_stock: number | null;
risk_level: 'critical' | 'elevated' | 'moderate' | 'healthy';
current_quantity: number | null;
daily_sell_rate: number | null;
}
interface OverstockItem {
product_name: string;
store_name: string;
excess_units: number;
days_of_stock: number;
}
```
**UI Label Mapping:**
| Field | User-Facing Label |
|-------|-------------------|
| `risk_level: critical` | Reorder Now |
| `risk_level: elevated` | Low Stock |
| `risk_level: moderate` | Monitor |
| `risk_level: healthy` | Healthy |
| `critical_count` | Urgent (<7 days) |
| `warning_count` | Low (7-14 days) |
| `overstocked_count` | Excess (>90 days) |
| `days_of_stock` | X days remaining |
| `overstock_alert` | Overstock Alert |
| `excess_units` | X excess units |
---
## Section 7: Promotion Effectiveness
How promotions impact sales.
```typescript
interface PromoPerformance {
avg_baseline_velocity: number | null;
avg_promo_velocity: number | null;
avg_velocity_lift: number | null; // % increase during promo
avg_efficiency_score: number | null; // ROI proxy
promotions: Promotion[];
}
interface Promotion {
product_name: string;
store_name: string;
status: 'active' | 'scheduled' | 'ended';
start_date: string;
end_date: string | null;
regular_price: number;
promo_price: number;
discount_percent: number;
baseline_velocity: number | null;
promo_velocity: number | null;
velocity_lift: number | null;
efficiency_score: number | null;
}
```
**UI Label Mapping:**
| Field | User-Facing Label |
|-------|-------------------|
| `avg_baseline_velocity` | Normal Rate |
| `avg_promo_velocity` | During Promos |
| `avg_velocity_lift` | Avg Sales Lift |
| `avg_efficiency_score` | ROI Score |
| `velocity_lift` | Sales Lift |
| `efficiency_score` | ROI Score |
| `status: active` | Live |
| `status: scheduled` | Scheduled |
| `status: ended` | Ended |
---
## Example Queries
### Get full payload
```javascript
const response = await fetch('/api/analytics/v2/brand/Wyld/intelligence?window=30d');
const data = await response.json();
```
### Extract summary cards (flattened)
```javascript
const { performance_snapshot: ps, alerts } = data;
const summaryCards = {
activeProducts: ps.active_skus,
monthlyRevenue: ps.total_revenue_30d,
retailDistribution: ps.total_stores,
newOpportunities: ps.new_stores_30d,
categoryPosition: ps.market_share,
avgWholesale: ps.avg_wholesale_price,
pricingTier: ps.price_position,
accountsAtRisk: alerts.lost_stores_30d_count,
delistedSkus: alerts.lost_skus_30d_count,
shelfLosses: alerts.competitor_takeover_count,
};
```
### Get top 10 fastest selling SKUs
```javascript
const topSkus = data.sku_performance
.filter(sku => sku.velocity_status === 'hot' || sku.velocity_status === 'steady')
.sort((a, b) => b.daily_velocity - a.daily_velocity)
.slice(0, 10);
```
### Get critical inventory alerts only
```javascript
const criticalInventory = data.inventory_health.skus
.filter(sku => sku.risk_level === 'critical');
```
### Get states with <50% penetration
```javascript
const underPenetrated = data.retail_footprint.penetration_by_region
.filter(region => region.percent_reached < 50)
.sort((a, b) => a.percent_reached - b.percent_reached);
```
### Get active promotions with positive lift
```javascript
const effectivePromos = data.promo_performance.promotions
.filter(p => p.status === 'active' && p.velocity_lift > 0)
.sort((a, b) => b.velocity_lift - a.velocity_lift);
```
### Build chart data for market share trend
```javascript
const chartData = data.competitive_landscape.market_share_trend.map(point => ({
x: new Date(point.date),
y: point.share_percent,
}));
```
---
## Notes for Frontend Implementation
1. **All fields are snake_case** - transform to camelCase if needed
2. **Null values are possible** - handle gracefully in UI
3. **Arrays may be empty** - show appropriate empty states
4. **Timestamps are ISO format** - parse with `new Date()`
5. **Percentages are already computed** - no need to multiply by 100
6. **The `window` parameter affects trend calculations** - 7d/30d/90d

View File

@@ -0,0 +1,539 @@
# Crawl Pipeline Documentation
## Overview
The crawl pipeline fetches product data from Dutchie dispensary menus and stores it in the canonical database. This document covers the complete flow from task scheduling to data storage.
---
## Pipeline Stages
```
┌─────────────────────┐
│ store_discovery │ Find new dispensaries
└─────────┬───────────┘
┌─────────────────────┐
│ entry_point_discovery│ Resolve slug → platform_dispensary_id
└─────────┬───────────┘
┌─────────────────────┐
│ product_discovery │ Initial product crawl
└─────────┬───────────┘
┌─────────────────────┐
│ product_resync │ Recurring crawl (every 4 hours)
└─────────────────────┘
```
---
## Stage Details
### 1. Store Discovery
**Purpose:** Find new dispensaries to crawl
**Handler:** `src/tasks/handlers/store-discovery.ts`
**Flow:**
1. Query Dutchie `ConsumerDispensaries` GraphQL for cities/states
2. Extract dispensary info (name, address, menu_url)
3. Insert into `dutchie_discovery_locations`
4. Queue `entry_point_discovery` for each new location
---
### 2. Entry Point Discovery
**Purpose:** Resolve menu URL slug to platform_dispensary_id (MongoDB ObjectId)
**Handler:** `src/tasks/handlers/entry-point-discovery.ts`
**Flow:**
1. Load dispensary from database
2. Extract slug from `menu_url`:
- `/embedded-menu/<slug>` or `/dispensary/<slug>`
3. Start stealth session (fingerprint + proxy)
4. Query `resolveDispensaryIdWithDetails(slug)` via GraphQL
5. Update dispensary with `platform_dispensary_id`
6. Queue `product_discovery` task
**Example:**
```
menu_url: https://dutchie.com/embedded-menu/deeply-rooted
slug: deeply-rooted
platform_dispensary_id: 6405ef617056e8014d79101b
```
---
### 3. Product Discovery
**Purpose:** Initial crawl of a new dispensary
**Handler:** `src/tasks/handlers/product-discovery.ts`
Same as product_resync but for first-time crawls.
---
### 4. Product Resync
**Purpose:** Recurring crawl to capture price/stock changes
**Handler:** `src/tasks/handlers/product-resync.ts`
**Flow:**
#### Step 1: Load Dispensary Info
```sql
SELECT id, name, platform_dispensary_id, menu_url, state
FROM dispensaries
WHERE id = $1 AND crawl_enabled = true
```
#### Step 2: Start Stealth Session
- Generate random browser fingerprint
- Set locale/timezone matching state
- Optional proxy rotation
#### Step 3: Fetch Products via GraphQL
**Endpoint:** `https://dutchie.com/api-3/graphql`
**Variables:**
```javascript
{
includeEnterpriseSpecials: false,
productsFilter: {
dispensaryId: "<platform_dispensary_id>",
pricingType: "rec",
Status: "All",
types: [],
useCache: false,
isDefaultSort: true,
sortBy: "popularSortIdx",
sortDirection: 1,
bypassOnlineThresholds: true,
isKioskMenu: false,
removeProductsBelowOptionThresholds: false
},
page: 0,
perPage: 100
}
```
**Key Notes:**
- `Status: "All"` returns all products (Active returns same count)
- `Status: null` returns 0 products (broken)
- `pricingType: "rec"` returns BOTH rec and med prices
- Paginate until `products.length < perPage` or `allProducts.length >= totalCount`
#### Step 4: Normalize Data
Transform raw Dutchie payload to canonical format via `DutchieNormalizer`.
#### Step 5: Upsert Products
Insert/update `store_products` table with normalized data.
#### Step 6: Create Snapshots
Insert point-in-time record to `store_product_snapshots`.
#### Step 7: Track Missing Products (OOS Detection)
```sql
-- Reset consecutive_misses for products IN the feed
UPDATE store_products
SET consecutive_misses = 0, last_seen_at = NOW()
WHERE dispensary_id = $1
AND provider = 'dutchie'
AND provider_product_id = ANY($2)
-- Increment for products NOT in feed
UPDATE store_products
SET consecutive_misses = consecutive_misses + 1
WHERE dispensary_id = $1
AND provider = 'dutchie'
AND provider_product_id NOT IN (...)
AND consecutive_misses < 3
-- Mark OOS at 3 consecutive misses
UPDATE store_products
SET stock_status = 'oos', is_in_stock = false
WHERE dispensary_id = $1
AND consecutive_misses >= 3
AND stock_status != 'oos'
```
#### Step 8: Download Images
For new products, download and store images locally.
#### Step 9: Update Dispensary
```sql
UPDATE dispensaries SET last_crawl_at = NOW() WHERE id = $1
```
---
## GraphQL Payload Structure
### Product Fields (from filteredProducts.products[])
| Field | Type | Description |
|-------|------|-------------|
| `_id` / `id` | string | MongoDB ObjectId (24 hex chars) |
| `Name` | string | Product display name |
| `brandName` | string | Brand name |
| `brand.name` | string | Brand name (nested) |
| `brand.description` | string | Brand description |
| `type` | string | Category (Flower, Edible, Concentrate, etc.) |
| `subcategory` | string | Subcategory |
| `strainType` | string | Hybrid, Indica, Sativa, N/A |
| `Status` | string | Always "Active" in feed |
| `Image` | string | Primary image URL |
| `images[]` | array | All product images |
### Pricing Fields
| Field | Type | Description |
|-------|------|-------------|
| `Prices[]` | number[] | Rec prices per option |
| `recPrices[]` | number[] | Rec prices |
| `medicalPrices[]` | number[] | Medical prices |
| `recSpecialPrices[]` | number[] | Rec sale prices |
| `medicalSpecialPrices[]` | number[] | Medical sale prices |
| `Options[]` | string[] | Size options ("1/8oz", "1g", etc.) |
| `rawOptions[]` | string[] | Raw weight options ("3.5g") |
### Inventory Fields (POSMetaData.children[])
| Field | Type | Description |
|-------|------|-------------|
| `quantity` | number | Total inventory count |
| `quantityAvailable` | number | Available for online orders |
| `kioskQuantityAvailable` | number | Available for kiosk orders |
| `option` | string | Which size option this is for |
### Potency Fields
| Field | Type | Description |
|-------|------|-------------|
| `THCContent.range[]` | number[] | THC percentage |
| `CBDContent.range[]` | number[] | CBD percentage |
| `cannabinoidsV2[]` | array | Detailed cannabinoid breakdown |
### Specials (specialData.bogoSpecials[])
| Field | Type | Description |
|-------|------|-------------|
| `specialName` | string | Deal name |
| `specialType` | string | "bogo", "sale", etc. |
| `itemsForAPrice.value` | string | Bundle price |
| `bogoRewards[].totalQuantity.quantity` | number | Required quantity |
---
## OOS Detection Logic
Products disappear from the Dutchie feed when they go out of stock. We track this via `consecutive_misses`:
| Scenario | Action |
|----------|--------|
| Product in feed | `consecutive_misses = 0` |
| Product missing 1st time | `consecutive_misses = 1` |
| Product missing 2nd time | `consecutive_misses = 2` |
| Product missing 3rd time | `consecutive_misses = 3`, mark `stock_status = 'oos'` |
| Product returns to feed | `consecutive_misses = 0`, update stock_status |
**Why 3 misses?**
- Protects against false positives from crawl failures
- Single bad crawl doesn't trigger mass OOS alerts
- Balances detection speed vs accuracy
---
## Database Tables
### store_products
Current state of each product:
- `provider_product_id` - Dutchie's MongoDB ObjectId
- `name_raw`, `brand_name_raw` - Raw values from feed
- `price_rec`, `price_med` - Current prices
- `is_in_stock`, `stock_status` - Availability
- `consecutive_misses` - OOS detection counter
- `last_seen_at` - Last time product was in feed
### store_product_snapshots
Point-in-time records for historical analysis:
- One row per product per crawl
- Captures price, stock, potency at that moment
- Used for price history, analytics
### dispensaries
Store metadata:
- `platform_dispensary_id` - MongoDB ObjectId for GraphQL
- `menu_url` - Source URL
- `last_crawl_at` - Last successful crawl
- `crawl_enabled` - Whether to crawl
---
## Worker Roles
Workers pull tasks from the `worker_tasks` queue based on their assigned role.
| Role | Name | Description | Handler |
|------|------|-------------|---------|
| `product_resync` | Product Resync | Re-crawl dispensary products for price/stock changes | `handleProductResync` |
| `product_discovery` | Product Discovery | Initial product discovery for new dispensaries | `handleProductDiscovery` |
| `store_discovery` | Store Discovery | Discover new dispensary locations | `handleStoreDiscovery` |
| `entry_point_discovery` | Entry Point Discovery | Resolve platform IDs from menu URLs | `handleEntryPointDiscovery` |
| `analytics_refresh` | Analytics Refresh | Refresh materialized views and analytics | `handleAnalyticsRefresh` |
**API Endpoint:** `GET /api/worker-registry/roles`
---
## Scheduling
Crawls are scheduled via `worker_tasks` table:
| Role | Frequency | Description |
|------|-----------|-------------|
| `product_resync` | Every 4 hours | Regular product refresh |
| `product_discovery` | On-demand | First crawl for new stores |
| `entry_point_discovery` | On-demand | New store setup |
| `store_discovery` | Daily | Find new stores |
| `analytics_refresh` | Daily | Refresh analytics materialized views |
---
## Priority & On-Demand Tasks
Tasks are claimed by workers in order of **priority DESC, created_at ASC**.
### Priority Levels
| Priority | Use Case | Example |
|----------|----------|---------|
| 0 | Scheduled/batch tasks | Daily product_resync generation |
| 10 | On-demand/chained tasks | entry_point → product_discovery |
| Higher | Urgent/manual triggers | Admin-triggered immediate crawl |
### Task Chaining
When a task completes, the system automatically creates follow-up tasks:
```
store_discovery (completed)
└─► entry_point_discovery (priority: 10) for each new store
entry_point_discovery (completed, success)
└─► product_discovery (priority: 10) for that store
product_discovery (completed)
└─► [no chain] Store enters regular resync schedule
```
### On-Demand Task Creation
Use the task service to create high-priority tasks:
```typescript
// Create immediate product resync for a store
await taskService.createTask({
role: 'product_resync',
dispensary_id: 123,
platform: 'dutchie',
priority: 20, // Higher than batch tasks
});
// Convenience methods with default high priority (10)
await taskService.createEntryPointTask(dispensaryId, 'dutchie');
await taskService.createProductDiscoveryTask(dispensaryId, 'dutchie');
await taskService.createStoreDiscoveryTask('dutchie', 'AZ');
```
### Claim Function
The `claim_task()` SQL function atomically claims tasks:
- Respects priority ordering (higher = first)
- Uses `FOR UPDATE SKIP LOCKED` for concurrency
- Prevents multiple active tasks per store
---
## Image Storage
Images are downloaded from Dutchie's AWS S3 and stored locally with on-demand resizing.
### Storage Path
```
/storage/images/products/<state>/<store>/<brand>/<product_id>/image-<hash>.webp
/storage/images/brands/<brand>/logo-<hash>.webp
```
**Example:**
```
/storage/images/products/az/az-deeply-rooted/bud-bros/6913e3cd444eac3935e928b9/image-ae38b1f9.webp
```
### Image Proxy API
Served via `/img/*` with on-demand resizing using **sharp**:
```
GET /img/products/az/az-deeply-rooted/bud-bros/6913e3cd444eac3935e928b9/image-ae38b1f9.webp?w=200
```
| Param | Description |
|-------|-------------|
| `w` | Width in pixels (max 4000) |
| `h` | Height in pixels (max 4000) |
| `q` | Quality 1-100 (default 80) |
| `fit` | cover, contain, fill, inside, outside |
| `blur` | Blur sigma (0.3-1000) |
| `gray` | Grayscale (1 = enabled) |
| `format` | webp, jpeg, png, avif (default webp) |
### Key Files
| File | Purpose |
|------|---------|
| `src/utils/image-storage.ts` | Download & save images to local filesystem |
| `src/routes/image-proxy.ts` | On-demand resize/transform at `/img/*` |
### Download Rules
| Scenario | Image Action |
|----------|--------------|
| **New product (first crawl)** | Download if `primaryImageUrl` exists |
| **Existing product (refresh)** | Download only if `local_image_path` is NULL (backfill) |
| **Product already has local image** | Skip download entirely |
**Logic:**
- Images are downloaded **once** and never re-downloaded on subsequent crawls
- `skipIfExists: true` - filesystem check prevents re-download even if queued
- First crawl: all products get images
- Refresh crawl: only new products or products missing local images
### Storage Rules
- **NO MinIO** - local filesystem only (`STORAGE_DRIVER=local`)
- Store full resolution, resize on-demand via `/img` proxy
- Convert to webp for consistency using **sharp**
- Preserve original Dutchie URL as fallback in `image_url` column
- Local path stored in `local_image_path` column
---
## Stealth & Anti-Detection
**PROXIES ARE REQUIRED** - Workers will fail to start if no active proxies are available in the database. All HTTP requests to Dutchie go through a proxy.
Workers automatically initialize anti-detection systems on startup.
### Components
| Component | Purpose | Source |
|-----------|---------|--------|
| **CrawlRotator** | Coordinates proxy + UA rotation | `src/services/crawl-rotator.ts` |
| **ProxyRotator** | Round-robin proxy selection, health tracking | `src/services/crawl-rotator.ts` |
| **UserAgentRotator** | Cycles through realistic browser fingerprints | `src/services/crawl-rotator.ts` |
| **Dutchie Client** | Curl-based HTTP with auto-retry on 403 | `src/platforms/dutchie/client.ts` |
### Initialization Flow
```
Worker Start
├─► initializeStealth()
│ │
│ ├─► CrawlRotator.initialize()
│ │ └─► Load proxies from `proxies` table
│ │
│ └─► setCrawlRotator(rotator)
│ └─► Wire to Dutchie client
└─► Process tasks...
```
### Stealth Session (per task)
Each crawl task starts a stealth session:
```typescript
// In product-refresh.ts, entry-point-discovery.ts
const session = startSession(dispensary.state || 'AZ', 'America/Phoenix');
```
This creates a new identity with:
- **Random fingerprint:** Chrome/Firefox/Safari/Edge on Win/Mac/Linux
- **Accept-Language:** Matches timezone (e.g., `America/Phoenix``en-US,en;q=0.9`)
- **sec-ch-ua headers:** Proper Client Hints for the browser profile
### On 403 Block
When Dutchie returns 403, the client automatically:
1. Records failure on current proxy (increments `failure_count`)
2. If proxy has 5+ failures, deactivates it
3. Rotates to next healthy proxy
4. Rotates fingerprint
5. Retries the request
### Proxy Table Schema
```sql
CREATE TABLE proxies (
id SERIAL PRIMARY KEY,
host VARCHAR(255) NOT NULL,
port INTEGER NOT NULL,
username VARCHAR(100),
password VARCHAR(100),
protocol VARCHAR(10) DEFAULT 'http', -- http, https, socks5
is_active BOOLEAN DEFAULT true,
last_used_at TIMESTAMPTZ,
failure_count INTEGER DEFAULT 0,
success_count INTEGER DEFAULT 0,
avg_response_time_ms INTEGER,
last_failure_at TIMESTAMPTZ,
last_error TEXT
);
```
### Configuration
Proxies are mandatory. There is no environment variable to disable them. Workers will refuse to start without active proxies in the database.
### User-Agent Generation
See `workflow-12102025.md` for full specification.
**Summary:**
- Uses `intoli/user-agents` library (daily-updated market share data)
- Device distribution: Mobile 62%, Desktop 36%, Tablet 2%
- Browser whitelist: Chrome, Safari, Edge, Firefox only
- UA sticks until IP rotates (403 or manual rotation)
- Failure = alert admin + stop crawl (no fallback)
Each fingerprint includes proper `sec-ch-ua`, `sec-ch-ua-platform`, and `sec-ch-ua-mobile` headers.
---
## Error Handling
- **GraphQL errors:** Logged, task marked failed, retried later
- **Normalization errors:** Logged as warnings, continue with valid products
- **Image download errors:** Non-fatal, logged, continue
- **Database errors:** Task fails, will be retried
- **403 blocks:** Auto-rotate proxy + fingerprint, retry (up to 3 retries)
---
## Files
| File | Purpose |
|------|---------|
| `src/tasks/handlers/product-resync.ts` | Main crawl handler |
| `src/tasks/handlers/entry-point-discovery.ts` | Slug → ID resolution |
| `src/platforms/dutchie/index.ts` | GraphQL client, session management |
| `src/hydration/normalizers/dutchie.ts` | Payload normalization |
| `src/hydration/canonical-upsert.ts` | Database upsert logic |
| `src/utils/image-storage.ts` | Image download and local storage |
| `src/routes/image-proxy.ts` | On-demand image resizing |
| `migrations/075_consecutive_misses.sql` | OOS tracking column |

View File

@@ -0,0 +1,297 @@
# Organic Browser-Based Scraping Guide
**Last Updated:** 2025-12-12
**Status:** Production-ready proof of concept
---
## Overview
This document describes the "organic" browser-based approach to scraping Dutchie dispensary menus. Unlike direct curl/axios requests, this method uses a real browser session to make API calls, making requests appear natural and reducing detection risk.
---
## Why Organic Scraping?
| Approach | Detection Risk | Speed | Complexity |
|----------|---------------|-------|------------|
| Direct curl | Higher | Fast | Low |
| curl-impersonate | Medium | Fast | Medium |
| **Browser-based (organic)** | **Lowest** | Slower | Higher |
Direct curl requests can be fingerprinted via:
- TLS fingerprint (cipher suites, extensions)
- Header order and values
- Missing cookies/session data
- Request patterns
Browser-based requests inherit:
- Real Chrome TLS fingerprint
- Session cookies from page visit
- Natural header order
- JavaScript execution environment
---
## Implementation
### Dependencies
```bash
npm install puppeteer puppeteer-extra puppeteer-extra-plugin-stealth
```
### Core Script: `test-intercept.js`
Located at: `backend/test-intercept.js`
```javascript
const puppeteer = require('puppeteer-extra');
const StealthPlugin = require('puppeteer-extra-plugin-stealth');
const fs = require('fs');
puppeteer.use(StealthPlugin());
async function capturePayload(config) {
const { dispensaryId, platformId, cName, outputPath } = config;
const browser = await puppeteer.launch({
headless: 'new',
args: ['--no-sandbox', '--disable-setuid-sandbox']
});
const page = await browser.newPage();
// STEP 1: Establish session by visiting the menu
const embedUrl = `https://dutchie.com/embedded-menu/${cName}?menuType=rec`;
await page.goto(embedUrl, { waitUntil: 'networkidle2', timeout: 60000 });
// STEP 2: Fetch ALL products using GraphQL from browser context
const result = await page.evaluate(async (platformId) => {
const allProducts = [];
let pageNum = 0;
const perPage = 100;
let totalCount = 0;
const sessionId = 'browser-session-' + Date.now();
while (pageNum < 30) {
const variables = {
includeEnterpriseSpecials: false,
productsFilter: {
dispensaryId: platformId,
pricingType: 'rec',
Status: 'Active', // CRITICAL: Must be 'Active', not null
types: [],
useCache: true,
isDefaultSort: true,
sortBy: 'popularSortIdx',
sortDirection: 1,
bypassOnlineThresholds: true,
isKioskMenu: false,
removeProductsBelowOptionThresholds: false,
},
page: pageNum,
perPage: perPage,
};
const extensions = {
persistedQuery: {
version: 1,
sha256Hash: 'ee29c060826dc41c527e470e9ae502c9b2c169720faa0a9f5d25e1b9a530a4a0'
}
};
const qs = new URLSearchParams({
operationName: 'FilteredProducts',
variables: JSON.stringify(variables),
extensions: JSON.stringify(extensions)
});
const response = await fetch(`https://dutchie.com/api-3/graphql?${qs}`, {
method: 'GET',
headers: {
'Accept': 'application/json',
'content-type': 'application/json',
'x-dutchie-session': sessionId,
'apollographql-client-name': 'Marketplace (production)',
},
credentials: 'include'
});
const json = await response.json();
const data = json?.data?.filteredProducts;
if (!data?.products) break;
allProducts.push(...data.products);
if (pageNum === 0) totalCount = data.queryInfo?.totalCount || 0;
if (allProducts.length >= totalCount) break;
pageNum++;
await new Promise(r => setTimeout(r, 200)); // Polite delay
}
return { products: allProducts, totalCount };
}, platformId);
await browser.close();
// STEP 3: Save payload
const payload = {
dispensaryId,
platformId,
cName,
fetchedAt: new Date().toISOString(),
productCount: result.products.length,
products: result.products,
};
fs.writeFileSync(outputPath, JSON.stringify(payload, null, 2));
return payload;
}
```
---
## Critical Parameters
### GraphQL Hash (FilteredProducts)
```
ee29c060826dc41c527e470e9ae502c9b2c169720faa0a9f5d25e1b9a530a4a0
```
**WARNING:** Using the wrong hash returns HTTP 400.
### Status Parameter
| Value | Result |
|-------|--------|
| `'Active'` | Returns in-stock products (1019 in test) |
| `null` | Returns 0 products |
| `'All'` | Returns HTTP 400 |
**ALWAYS use `Status: 'Active'`**
### Required Headers
```javascript
{
'Accept': 'application/json',
'content-type': 'application/json',
'x-dutchie-session': 'unique-session-id',
'apollographql-client-name': 'Marketplace (production)',
}
```
### Endpoint
```
https://dutchie.com/api-3/graphql
```
---
## Performance Benchmarks
Test store: AZ-Deeply-Rooted (1019 products)
| Metric | Value |
|--------|-------|
| Total products | 1019 |
| Time | 18.5 seconds |
| Payload size | 11.8 MB |
| Pages fetched | 11 (100 per page) |
| Success rate | 100% |
---
## Payload Format
The output matches the existing `payload-fetch.ts` handler format:
```json
{
"dispensaryId": 123,
"platformId": "6405ef617056e8014d79101b",
"cName": "AZ-Deeply-Rooted",
"fetchedAt": "2025-12-12T05:05:19.837Z",
"productCount": 1019,
"products": [
{
"id": "6927508db4851262f629a869",
"Name": "Product Name",
"brand": { "name": "Brand Name", ... },
"type": "Flower",
"THC": "25%",
"Prices": [...],
"Options": [...],
...
}
]
}
```
---
## Integration Points
### As a Task Handler
The organic approach can be integrated as an alternative to curl-based fetching:
```typescript
// In src/tasks/handlers/organic-payload-fetch.ts
export async function handleOrganicPayloadFetch(ctx: TaskContext): Promise<TaskResult> {
// Use puppeteer-based capture
// Save to same payload storage
// Queue product_refresh task
}
```
### Worker Configuration
Add to job_schedules:
```sql
INSERT INTO job_schedules (name, role, cron_expression)
VALUES ('organic_product_crawl', 'organic_payload_fetch', '0 */6 * * *');
```
---
## Troubleshooting
### HTTP 400 Bad Request
- Check hash is correct: `ee29c060...`
- Verify Status is `'Active'` (string, not null)
### 0 Products Returned
- Status was likely `null` or `'All'` - use `'Active'`
- Check platformId is valid MongoDB ObjectId
### Session Not Established
- Increase timeout on initial page.goto()
- Check cName is valid (matches embedded-menu URL)
### Detection/Blocking
- StealthPlugin should handle most cases
- Add random delays between pages
- Use headless: 'new' (not true/false)
---
## Files Reference
| File | Purpose |
|------|---------|
| `backend/test-intercept.js` | Proof of concept script |
| `backend/src/platforms/dutchie/client.ts` | GraphQL hashes, curl implementation |
| `backend/src/tasks/handlers/payload-fetch.ts` | Current curl-based handler |
| `backend/src/utils/payload-storage.ts` | Payload save/load utilities |
---
## See Also
- `DUTCHIE_CRAWL_WORKFLOW.md` - Full crawl pipeline documentation
- `TASK_WORKFLOW_2024-12-10.md` - Task system architecture
- `CLAUDE.md` - Project rules and constraints

View File

@@ -0,0 +1,25 @@
# ARCHIVED DOCUMENTATION
**WARNING: These docs may be outdated or inaccurate.**
The code has evolved significantly. These docs are kept for historical reference only.
## What to Use Instead
**The single source of truth is:**
- `CLAUDE.md` (root) - Essential rules and quick reference
- `docs/CODEBASE_MAP.md` - Current file/directory reference
## Why Archive?
These docs were written during development iterations and may reference:
- Old file paths that no longer exist
- Deprecated approaches (hydration, scraper-v2)
- APIs that have changed
- Database schemas that evolved
## If You Need Details
1. First check CODEBASE_MAP.md for current file locations
2. Then read the actual source code
3. Only use archive docs as a last resort for historical context

View File

@@ -0,0 +1,584 @@
# Task Workflow Documentation
**Date: 2024-12-10**
This document describes the complete task/job processing architecture after the 2024-12-10 rewrite.
---
## Complete Architecture
```
┌─────────────────────────────────────────────────────────────────────────────────┐
│ KUBERNETES CLUSTER │
├─────────────────────────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────────────────────────────────────────────────────────────────┐ │
│ │ API SERVER POD (scraper) │ │
│ │ │ │
│ │ ┌──────────────────┐ ┌────────────────────────────────────────┐ │ │
│ │ │ Express API │ │ TaskScheduler │ │ │
│ │ │ │ │ (src/services/task-scheduler.ts) │ │ │
│ │ │ /api/job-queue │ │ │ │ │
│ │ │ /api/tasks │ │ • Polls every 60s │ │ │
│ │ │ /api/schedules │ │ • Checks task_schedules table │ │ │
│ │ └────────┬─────────┘ │ • SELECT FOR UPDATE SKIP LOCKED │ │ │
│ │ │ │ • Generates tasks when due │ │ │
│ │ │ └──────────────────┬─────────────────────┘ │ │
│ │ │ │ │ │
│ └────────────┼──────────────────────────────────┼──────────────────────────┘ │
│ │ │ │
│ │ ┌────────────────────────┘ │
│ │ │ │
│ ▼ ▼ │
│ ┌─────────────────────────────────────────────────────────────────────────┐ │
│ │ POSTGRESQL DATABASE │ │
│ │ │ │
│ │ ┌─────────────────────┐ ┌─────────────────────┐ │ │
│ │ │ task_schedules │ │ worker_tasks │ │ │
│ │ │ │ │ │ │ │
│ │ │ • product_refresh │───────►│ • pending tasks │ │ │
│ │ │ • store_discovery │ create │ • claimed tasks │ │ │
│ │ │ • analytics_refresh │ tasks │ • running tasks │ │ │
│ │ │ │ │ • completed tasks │ │ │
│ │ │ next_run_at │ │ │ │ │
│ │ │ last_run_at │ │ role, dispensary_id │ │ │
│ │ │ interval_hours │ │ priority, status │ │ │
│ │ └─────────────────────┘ └──────────┬──────────┘ │ │
│ │ │ │ │
│ └─────────────────────────────────────────────┼────────────────────────────┘ │
│ │ │
│ ┌──────────────────────┘ │
│ │ Workers poll for tasks │
│ │ (SELECT FOR UPDATE SKIP LOCKED) │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────────────────┐ │
│ │ WORKER PODS (StatefulSet: scraper-worker) │ │
│ │ │ │
│ │ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │ │
│ │ │ Worker 0 │ │ Worker 1 │ │ Worker 2 │ │ Worker N │ │ │
│ │ │ │ │ │ │ │ │ │ │ │
│ │ │ task-worker │ │ task-worker │ │ task-worker │ │ task-worker │ │ │
│ │ │ .ts │ │ .ts │ │ .ts │ │ .ts │ │ │
│ │ └─────────────┘ └─────────────┘ └─────────────┘ └─────────────┘ │ │
│ │ │ │
│ └──────────────────────────────────────────────────────────────────────────┘ │
│ │
└──────────────────────────────────────────────────────────────────────────────────┘
```
---
## Startup Sequence
```
┌─────────────────────────────────────────────────────────────────────────────┐
│ API SERVER STARTUP │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ 1. Express app initializes │
│ │ │
│ ▼ │
│ 2. runAutoMigrations() │
│ • Runs pending migrations (including 079_task_schedules.sql) │
│ │ │
│ ▼ │
│ 3. initializeMinio() / initializeImageStorage() │
│ │ │
│ ▼ │
│ 4. cleanupOrphanedJobs() │
│ │ │
│ ▼ │
│ 5. taskScheduler.start() ◄─── NEW (per TASK_WORKFLOW_2024-12-10.md) │
│ │ │
│ ├── Recover stale tasks (workers that died) │
│ ├── Ensure default schedules exist in task_schedules │
│ ├── Check and run any due schedules immediately │
│ └── Start 60-second poll interval │
│ │ │
│ ▼ │
│ 6. app.listen(PORT) │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────────────────┐
│ WORKER POD STARTUP │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ 1. K8s starts pod from StatefulSet │
│ │ │
│ ▼ │
│ 2. TaskWorker.constructor() │
│ • Create DB pool │
│ • Create CrawlRotator │
│ │ │
│ ▼ │
│ 3. initializeStealth() │
│ • Load proxies from DB (REQUIRED - fails if none) │
│ • Wire rotator to Dutchie client │
│ │ │
│ ▼ │
│ 4. register() with API │
│ • Optional - continues if fails │
│ │ │
│ ▼ │
│ 5. startRegistryHeartbeat() every 30s │
│ │ │
│ ▼ │
│ 6. processNextTask() loop │
│ │ │
│ ├── Poll for pending task (FOR UPDATE SKIP LOCKED) │
│ ├── Claim task atomically │
│ ├── Execute handler (product_refresh, store_discovery, etc.) │
│ ├── Mark complete/failed │
│ ├── Chain next task if applicable │
│ └── Loop │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
```
---
## Schedule Flow
```
┌─────────────────────────────────────────────────────────────────────────────┐
│ SCHEDULER POLL (every 60 seconds) │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ BEGIN TRANSACTION │
│ │ │
│ ▼ │
│ SELECT * FROM task_schedules │
│ WHERE enabled = true AND next_run_at <= NOW() │
│ FOR UPDATE SKIP LOCKED ◄─── Prevents duplicate execution across replicas │
│ │ │
│ ▼ │
│ For each due schedule: │
│ │ │
│ ├── product_refresh_all │
│ │ └─► Query dispensaries needing crawl │
│ │ └─► Create product_refresh tasks in worker_tasks │
│ │ │
│ ├── store_discovery_dutchie │
│ │ └─► Create single store_discovery task │
│ │ │
│ └── analytics_refresh │
│ └─► Create single analytics_refresh task │
│ │ │
│ ▼ │
│ UPDATE task_schedules SET │
│ last_run_at = NOW(), │
│ next_run_at = NOW() + interval_hours │
│ │ │
│ ▼ │
│ COMMIT │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
```
---
## Task Lifecycle
```
┌──────────┐
│ SCHEDULE │
│ DUE │
└────┬─────┘
┌──────────────┐ claim ┌──────────────┐ start ┌──────────────┐
│ PENDING │────────────►│ CLAIMED │────────────►│ RUNNING │
└──────────────┘ └──────────────┘ └──────┬───────┘
▲ │
│ ┌──────────────┼──────────────┐
│ retry │ │ │
│ (if retries < max) ▼ ▼ ▼
│ ┌──────────┐ ┌──────────┐ ┌──────────┐
└──────────────────────────────────│ FAILED │ │ COMPLETED│ │ STALE │
└──────────┘ └──────────┘ └────┬─────┘
recover_stale_tasks()
┌──────────┐
│ PENDING │
└──────────┘
```
---
## Database Tables
### task_schedules (NEW - migration 079)
Stores schedule definitions. Survives restarts.
```sql
CREATE TABLE task_schedules (
id SERIAL PRIMARY KEY,
name VARCHAR(100) NOT NULL UNIQUE,
role VARCHAR(50) NOT NULL, -- product_refresh, store_discovery, etc.
enabled BOOLEAN DEFAULT TRUE,
interval_hours INTEGER NOT NULL, -- How often to run
priority INTEGER DEFAULT 0, -- Task priority when created
state_code VARCHAR(2), -- Optional filter
last_run_at TIMESTAMPTZ, -- When it last ran
next_run_at TIMESTAMPTZ, -- When it's due next
last_task_count INTEGER, -- Tasks created last run
last_error TEXT -- Error message if failed
);
```
### worker_tasks (migration 074)
The task queue. Workers pull from here.
```sql
CREATE TABLE worker_tasks (
id SERIAL PRIMARY KEY,
role task_role NOT NULL, -- What type of work
dispensary_id INTEGER, -- Which store (if applicable)
platform VARCHAR(50), -- Which platform
status task_status DEFAULT 'pending',
priority INTEGER DEFAULT 0, -- Higher = process first
scheduled_for TIMESTAMP, -- Don't process before this time
worker_id VARCHAR(100), -- Which worker claimed it
claimed_at TIMESTAMP,
started_at TIMESTAMP,
completed_at TIMESTAMP,
last_heartbeat_at TIMESTAMP, -- For stale detection
result JSONB,
error_message TEXT,
retry_count INTEGER DEFAULT 0,
max_retries INTEGER DEFAULT 3
);
```
---
## Default Schedules
| Name | Role | Interval | Priority | Description |
|------|------|----------|----------|-------------|
| `payload_fetch_all` | payload_fetch | 4 hours | 0 | Fetch payloads from Dutchie API (chains to product_refresh) |
| `store_discovery_dutchie` | store_discovery | 24 hours | 5 | Find new Dutchie stores |
| `analytics_refresh` | analytics_refresh | 6 hours | 0 | Refresh MVs |
---
## Task Roles
| Role | Description | Creates Tasks For |
|------|-------------|-------------------|
| `payload_fetch` | **NEW** - Fetch from Dutchie API, save to disk | Each dispensary needing crawl |
| `product_refresh` | **CHANGED** - Read local payload, normalize, upsert to DB | Chained from payload_fetch |
| `store_discovery` | Find new dispensaries, returns newStoreIds[] | Single task per platform |
| `entry_point_discovery` | **DEPRECATED** - Resolve platform IDs | No longer used |
| `product_discovery` | Initial product fetch for new stores | Chained from store_discovery |
| `analytics_refresh` | Refresh MVs | Single global task |
### Payload/Refresh Separation (2024-12-10)
The crawl workflow is now split into two phases:
```
payload_fetch (scheduled every 4h)
└─► Hit Dutchie GraphQL API
└─► Save raw JSON to /storage/payloads/{year}/{month}/{day}/store_{id}_{ts}.json.gz
└─► Record metadata in raw_crawl_payloads table
└─► Queue product_refresh task with payload_id
product_refresh (chained from payload_fetch)
└─► Load payload from filesystem (NOT from API)
└─► Normalize via DutchieNormalizer
└─► Upsert to store_products
└─► Create snapshots
└─► Track missing products
└─► Download images
```
**Benefits:**
- **Retry-friendly**: If normalize fails, re-run product_refresh without re-crawling
- **Replay-able**: Run product_refresh against any historical payload
- **Faster refreshes**: Local file read vs network call
- **Historical diffs**: Compare payloads to see what changed between crawls
- **Less API pressure**: Only payload_fetch hits Dutchie
---
## Task Chaining
Tasks automatically queue follow-up tasks upon successful completion. This creates two main flows:
### Discovery Flow (New Stores)
When `store_discovery` finds new dispensaries, they automatically get their initial product data:
```
store_discovery
└─► Discovers new locations via Dutchie GraphQL
└─► Auto-promotes valid locations to dispensaries table
└─► Collects newDispensaryIds[] from promotions
└─► Returns { newStoreIds: [...] } in result
chainNextTask() detects newStoreIds
└─► Creates product_discovery task for each new store
product_discovery
└─► Calls handlePayloadFetch() internally
└─► payload_fetch hits Dutchie API
└─► Saves raw JSON to /storage/payloads/
└─► Queues product_refresh task with payload_id
product_refresh
└─► Loads payload from filesystem
└─► Normalizes and upserts to store_products
└─► Creates snapshots, downloads images
```
**Complete Discovery Chain:**
```
store_discovery → product_discovery → payload_fetch → product_refresh
(internal call) (queues next)
```
### Scheduled Flow (Existing Stores)
For existing stores, `payload_fetch_all` schedule runs every 4 hours:
```
TaskScheduler (every 60s)
└─► Checks task_schedules for due schedules
└─► payload_fetch_all is due
└─► Generates payload_fetch task for each dispensary
payload_fetch
└─► Hits Dutchie GraphQL API
└─► Saves raw JSON to /storage/payloads/
└─► Queues product_refresh task with payload_id
product_refresh
└─► Loads payload from filesystem (NOT API)
└─► Normalizes via DutchieNormalizer
└─► Upserts to store_products
└─► Creates snapshots
```
**Complete Scheduled Chain:**
```
payload_fetch → product_refresh
(queues) (reads local)
```
### Chaining Implementation
Task chaining is handled in two places:
1. **Internal chaining (handler calls handler):**
- `product_discovery` calls `handlePayloadFetch()` directly
2. **External chaining (chainNextTask() in task-service.ts):**
- Called after task completion
- `store_discovery` → queues `product_discovery` for each newStoreId
3. **Queue-based chaining (taskService.createTask):**
- `payload_fetch` queues `product_refresh` with `payload: { payload_id }`
---
## Payload API Endpoints
Raw crawl payloads can be accessed via the Payloads API:
| Endpoint | Method | Description |
|----------|--------|-------------|
| `GET /api/payloads` | GET | List payload metadata (paginated) |
| `GET /api/payloads/:id` | GET | Get payload metadata by ID |
| `GET /api/payloads/:id/data` | GET | Get full payload JSON (decompressed) |
| `GET /api/payloads/store/:dispensaryId` | GET | List payloads for a store |
| `GET /api/payloads/store/:dispensaryId/latest` | GET | Get latest payload for a store |
| `GET /api/payloads/store/:dispensaryId/diff` | GET | Diff two payloads for changes |
### Payload Diff Response
The diff endpoint returns:
```json
{
"success": true,
"from": { "id": 123, "fetchedAt": "...", "productCount": 100 },
"to": { "id": 456, "fetchedAt": "...", "productCount": 105 },
"diff": {
"added": 10,
"removed": 5,
"priceChanges": 8,
"stockChanges": 12
},
"details": {
"added": [...],
"removed": [...],
"priceChanges": [...],
"stockChanges": [...]
}
}
```
---
## API Endpoints
### Schedules (NEW)
| Endpoint | Method | Description |
|----------|--------|-------------|
| `GET /api/schedules` | GET | List all schedules |
| `PUT /api/schedules/:id` | PUT | Update schedule |
| `POST /api/schedules/:id/trigger` | POST | Run schedule immediately |
### Task Creation (rewired 2024-12-10)
| Endpoint | Method | Description |
|----------|--------|-------------|
| `POST /api/job-queue/enqueue` | POST | Create single task |
| `POST /api/job-queue/enqueue-batch` | POST | Create batch tasks |
| `POST /api/job-queue/enqueue-state` | POST | Create tasks for state |
| `POST /api/tasks` | POST | Direct task creation |
### Task Management
| Endpoint | Method | Description |
|----------|--------|-------------|
| `GET /api/tasks` | GET | List tasks |
| `GET /api/tasks/:id` | GET | Get single task |
| `GET /api/tasks/counts` | GET | Task counts by status |
| `POST /api/tasks/recover-stale` | POST | Recover stale tasks |
---
## Key Files
| File | Purpose |
|------|---------|
| `src/services/task-scheduler.ts` | **NEW** - DB-driven scheduler |
| `src/tasks/task-worker.ts` | Worker that processes tasks |
| `src/tasks/task-service.ts` | Task CRUD operations |
| `src/tasks/handlers/payload-fetch.ts` | **NEW** - Fetches from API, saves to disk |
| `src/tasks/handlers/product-refresh.ts` | **CHANGED** - Reads from disk, processes to DB |
| `src/utils/payload-storage.ts` | **NEW** - Payload save/load utilities |
| `src/routes/tasks.ts` | Task API endpoints |
| `src/routes/job-queue.ts` | Job Queue UI endpoints (rewired) |
| `migrations/079_task_schedules.sql` | Schedule table |
| `migrations/080_raw_crawl_payloads.sql` | Payload metadata table |
| `migrations/081_payload_fetch_columns.sql` | payload, last_fetch_at columns |
| `migrations/074_worker_task_queue.sql` | Task queue table |
---
## Legacy Code (DEPRECATED)
| File | Status | Replacement |
|------|--------|-------------|
| `src/services/scheduler.ts` | DEPRECATED | `task-scheduler.ts` |
| `dispensary_crawl_jobs` table | ORPHANED | `worker_tasks` |
| `job_schedules` table | LEGACY | `task_schedules` |
---
## Dashboard Integration
Both pages remain wired to the dashboard:
| Page | Data Source | Actions |
|------|-------------|---------|
| **Job Queue** | `worker_tasks`, `task_schedules` | Create tasks, view schedules |
| **Task Queue** | `worker_tasks` | View tasks, recover stale |
---
## Multi-Replica Safety
The scheduler uses `SELECT FOR UPDATE SKIP LOCKED` to ensure:
1. **Only one replica** executes a schedule at a time
2. **No duplicate tasks** created
3. **Survives pod restarts** - state in DB, not memory
4. **Self-healing** - recovers stale tasks on startup
```sql
-- This query is atomic across all API server replicas
SELECT * FROM task_schedules
WHERE enabled = true AND next_run_at <= NOW()
FOR UPDATE SKIP LOCKED
```
---
## Worker Scaling (K8s)
Workers run as a StatefulSet in Kubernetes. You can scale from the admin UI or CLI.
### From Admin UI
The Workers page (`/admin/workers`) provides:
- Current replica count display
- Scale up/down buttons
- Target replica input
### API Endpoints
| Endpoint | Method | Description |
|----------|--------|-------------|
| `GET /api/workers/k8s/replicas` | GET | Get current/desired replica counts |
| `POST /api/workers/k8s/scale` | POST | Scale to N replicas (body: `{ replicas: N }`) |
### From CLI
```bash
# View current replicas
kubectl get statefulset scraper-worker -n dispensary-scraper
# Scale to 10 workers
kubectl scale statefulset scraper-worker -n dispensary-scraper --replicas=10
# Scale down to 3 workers
kubectl scale statefulset scraper-worker -n dispensary-scraper --replicas=3
```
### Configuration
Environment variables for the API server:
| Variable | Default | Description |
|----------|---------|-------------|
| `K8S_NAMESPACE` | `dispensary-scraper` | Kubernetes namespace |
| `K8S_WORKER_STATEFULSET` | `scraper-worker` | StatefulSet name |
### RBAC Requirements
The API server pod needs these K8s permissions:
```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
name: worker-scaler
namespace: dispensary-scraper
rules:
- apiGroups: ["apps"]
resources: ["statefulsets"]
verbs: ["get", "patch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
name: scraper-worker-scaler
namespace: dispensary-scraper
subjects:
- kind: ServiceAccount
name: default
namespace: dispensary-scraper
roleRef:
kind: Role
name: worker-scaler
apiGroup: rbac.authorization.k8s.io
```

View File

@@ -0,0 +1,542 @@
# Worker Task Architecture
This document describes the unified task-based worker system that replaces the legacy fragmented job systems.
## Overview
The task worker architecture provides a single, unified system for managing all background work in CannaiQ:
- **Store discovery** - Find new dispensaries on platforms
- **Entry point discovery** - Resolve platform IDs from menu URLs
- **Product discovery** - Initial product fetch for new stores
- **Product resync** - Regular price/stock updates for existing stores
- **Analytics refresh** - Refresh materialized views and analytics
## Architecture
### Database Tables
**`worker_tasks`** - Central task queue
```sql
CREATE TABLE worker_tasks (
id SERIAL PRIMARY KEY,
role task_role NOT NULL, -- What type of work
dispensary_id INTEGER, -- Which store (if applicable)
platform VARCHAR(50), -- Which platform (dutchie, etc.)
status task_status DEFAULT 'pending',
priority INTEGER DEFAULT 0, -- Higher = process first
scheduled_for TIMESTAMP, -- Don't process before this time
worker_id VARCHAR(100), -- Which worker claimed it
claimed_at TIMESTAMP,
started_at TIMESTAMP,
completed_at TIMESTAMP,
last_heartbeat_at TIMESTAMP, -- For stale detection
result JSONB, -- Output from handler
error_message TEXT,
retry_count INTEGER DEFAULT 0,
max_retries INTEGER DEFAULT 3,
created_at TIMESTAMP DEFAULT NOW(),
updated_at TIMESTAMP DEFAULT NOW()
);
```
**Key indexes:**
- `idx_worker_tasks_pending_priority` - For efficient task claiming
- `idx_worker_tasks_active_dispensary` - Prevents concurrent tasks per store (partial unique index)
### Task Roles
| Role | Purpose | Per-Store | Scheduled |
|------|---------|-----------|-----------|
| `store_discovery` | Find new stores on a platform | No | Daily |
| `entry_point_discovery` | Resolve platform IDs | Yes | On-demand |
| `product_discovery` | Initial product fetch | Yes | After entry_point |
| `product_resync` | Price/stock updates | Yes | Every 4 hours |
| `analytics_refresh` | Refresh MVs | No | Daily |
### Task Lifecycle
```
pending → claimed → running → completed
failed
```
1. **pending** - Task is waiting to be picked up
2. **claimed** - Worker has claimed it (atomic via SELECT FOR UPDATE SKIP LOCKED)
3. **running** - Worker is actively processing
4. **completed** - Task finished successfully
5. **failed** - Task encountered an error
6. **stale** - Task lost its worker (recovered automatically)
## Files
### Core Files
| File | Purpose |
|------|---------|
| `src/tasks/task-service.ts` | TaskService - CRUD, claiming, capacity metrics |
| `src/tasks/task-worker.ts` | TaskWorker - Main worker loop |
| `src/tasks/index.ts` | Module exports |
| `src/routes/tasks.ts` | API endpoints |
| `migrations/074_worker_task_queue.sql` | Database schema |
### Task Handlers
| File | Role |
|------|------|
| `src/tasks/handlers/store-discovery.ts` | `store_discovery` |
| `src/tasks/handlers/entry-point-discovery.ts` | `entry_point_discovery` |
| `src/tasks/handlers/product-discovery.ts` | `product_discovery` |
| `src/tasks/handlers/product-resync.ts` | `product_resync` |
| `src/tasks/handlers/analytics-refresh.ts` | `analytics_refresh` |
## Running Workers
### Environment Variables
| Variable | Default | Description |
|----------|---------|-------------|
| `WORKER_ROLE` | (required) | Which task role to process |
| `WORKER_ID` | auto-generated | Custom worker identifier |
| `POLL_INTERVAL_MS` | 5000 | How often to check for tasks |
| `HEARTBEAT_INTERVAL_MS` | 30000 | How often to update heartbeat |
### Starting a Worker
```bash
# Start a product resync worker
WORKER_ROLE=product_resync npx tsx src/tasks/task-worker.ts
# Start with custom ID
WORKER_ROLE=product_resync WORKER_ID=resync-1 npx tsx src/tasks/task-worker.ts
# Start multiple workers for different roles
WORKER_ROLE=store_discovery npx tsx src/tasks/task-worker.ts &
WORKER_ROLE=product_resync npx tsx src/tasks/task-worker.ts &
```
### Kubernetes Deployment
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: task-worker-resync
spec:
replicas: 3
template:
spec:
containers:
- name: worker
image: code.cannabrands.app/creationshop/dispensary-scraper:latest
command: ["npx", "tsx", "src/tasks/task-worker.ts"]
env:
- name: WORKER_ROLE
value: "product_resync"
```
## API Endpoints
### Task Management
| Endpoint | Method | Description |
|----------|--------|-------------|
| `/api/tasks` | GET | List tasks with filters |
| `/api/tasks` | POST | Create a new task |
| `/api/tasks/:id` | GET | Get task by ID |
| `/api/tasks/counts` | GET | Get counts by status |
| `/api/tasks/capacity` | GET | Get capacity metrics |
| `/api/tasks/capacity/:role` | GET | Get role-specific capacity |
| `/api/tasks/recover-stale` | POST | Recover tasks from dead workers |
### Task Generation
| Endpoint | Method | Description |
|----------|--------|-------------|
| `/api/tasks/generate/resync` | POST | Generate daily resync tasks |
| `/api/tasks/generate/discovery` | POST | Create store discovery task |
### Migration (from legacy systems)
| Endpoint | Method | Description |
|----------|--------|-------------|
| `/api/tasks/migration/status` | GET | Compare old vs new systems |
| `/api/tasks/migration/disable-old-schedules` | POST | Disable job_schedules |
| `/api/tasks/migration/cancel-pending-crawl-jobs` | POST | Cancel old crawl jobs |
| `/api/tasks/migration/create-resync-tasks` | POST | Create tasks for all stores |
| `/api/tasks/migration/full-migrate` | POST | One-click migration |
### Role-Specific Endpoints
| Endpoint | Method | Description |
|----------|--------|-------------|
| `/api/tasks/role/:role/last-completion` | GET | Last completion time |
| `/api/tasks/role/:role/recent` | GET | Recent completions |
| `/api/tasks/store/:id/active` | GET | Check if store has active task |
## Capacity Planning
The `v_worker_capacity` view provides real-time metrics:
```sql
SELECT * FROM v_worker_capacity;
```
Returns:
- `pending_tasks` - Tasks waiting to be claimed
- `ready_tasks` - Tasks ready now (scheduled_for is null or past)
- `claimed_tasks` - Tasks claimed but not started
- `running_tasks` - Tasks actively processing
- `completed_last_hour` - Recent completions
- `failed_last_hour` - Recent failures
- `active_workers` - Workers with recent heartbeats
- `avg_duration_sec` - Average task duration
- `tasks_per_worker_hour` - Throughput estimate
- `estimated_hours_to_drain` - Time to clear queue
### Scaling Recommendations
```javascript
// API: GET /api/tasks/capacity/:role
{
"role": "product_resync",
"pending_tasks": 500,
"active_workers": 3,
"workers_needed": {
"for_1_hour": 10,
"for_4_hours": 3,
"for_8_hours": 2
}
}
```
## Task Chaining
Tasks can automatically create follow-up tasks:
```
store_discovery → entry_point_discovery → product_discovery
(store has platform_dispensary_id)
Daily resync tasks
```
The `chainNextTask()` method handles this automatically.
## Stale Task Recovery
Tasks are considered stale if `last_heartbeat_at` is older than the threshold (default 10 minutes).
```sql
SELECT recover_stale_tasks(10); -- 10 minute threshold
```
Or via API:
```bash
curl -X POST /api/tasks/recover-stale \
-H 'Content-Type: application/json' \
-d '{"threshold_minutes": 10}'
```
## Migration from Legacy Systems
### Legacy Systems Replaced
1. **job_schedules + job_run_logs** - Scheduled job definitions
2. **dispensary_crawl_jobs** - Per-dispensary crawl queue
3. **SyncOrchestrator + HydrationWorker** - Raw payload processing
### Migration Steps
**Option 1: One-Click Migration**
```bash
curl -X POST /api/tasks/migration/full-migrate
```
This will:
1. Disable all job_schedules
2. Cancel pending dispensary_crawl_jobs
3. Generate resync tasks for all stores
4. Create discovery and analytics tasks
**Option 2: Manual Migration**
```bash
# 1. Check current status
curl /api/tasks/migration/status
# 2. Disable old schedules
curl -X POST /api/tasks/migration/disable-old-schedules
# 3. Cancel pending crawl jobs
curl -X POST /api/tasks/migration/cancel-pending-crawl-jobs
# 4. Create resync tasks
curl -X POST /api/tasks/migration/create-resync-tasks \
-H 'Content-Type: application/json' \
-d '{"state_code": "AZ"}'
# 5. Generate daily resync schedule
curl -X POST /api/tasks/generate/resync \
-H 'Content-Type: application/json' \
-d '{"batches_per_day": 6}'
```
## Per-Store Locking
The system prevents concurrent tasks for the same store using a partial unique index:
```sql
CREATE UNIQUE INDEX idx_worker_tasks_active_dispensary
ON worker_tasks (dispensary_id)
WHERE dispensary_id IS NOT NULL
AND status IN ('claimed', 'running');
```
This ensures only one task can be active per store at any time.
## Task Priority
Tasks are claimed in priority order (higher first), then by creation time:
```sql
ORDER BY priority DESC, created_at ASC
```
Default priorities:
- `store_discovery`: 0
- `entry_point_discovery`: 10 (high - new stores)
- `product_discovery`: 10 (high - new stores)
- `product_resync`: 0
- `analytics_refresh`: 0
## Scheduled Tasks
Tasks can be scheduled for future execution:
```javascript
await taskService.createTask({
role: 'product_resync',
dispensary_id: 123,
scheduled_for: new Date('2025-01-10T06:00:00Z'),
});
```
The `generate_resync_tasks()` function creates staggered tasks throughout the day:
```sql
SELECT generate_resync_tasks(6, '2025-01-10'); -- 6 batches = every 4 hours
```
## Dashboard Integration
The admin dashboard shows task queue status in the main overview:
```
Task Queue Summary
------------------
Pending: 45
Running: 3
Completed: 1,234
Failed: 12
```
Full task management is available at `/admin/tasks`.
## Error Handling
Failed tasks include the error message in `error_message` and can be retried:
```sql
-- View failed tasks
SELECT id, role, dispensary_id, error_message, retry_count
FROM worker_tasks
WHERE status = 'failed'
ORDER BY completed_at DESC
LIMIT 20;
-- Retry failed tasks
UPDATE worker_tasks
SET status = 'pending', retry_count = retry_count + 1
WHERE status = 'failed' AND retry_count < max_retries;
```
## Concurrent Task Processing (Added 2024-12)
Workers can now process multiple tasks concurrently within a single worker instance. This improves throughput by utilizing async I/O efficiently.
### Architecture
```
┌─────────────────────────────────────────────────────────────┐
│ Pod (K8s) │
│ │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ TaskWorker │ │
│ │ │ │
│ │ ┌─────────┐ ┌─────────┐ ┌─────────┐ │ │
│ │ │ Task 1 │ │ Task 2 │ │ Task 3 │ (concurrent)│ │
│ │ └─────────┘ └─────────┘ └─────────┘ │ │
│ │ │ │
│ │ Resource Monitor │ │
│ │ ├── Memory: 65% (threshold: 85%) │ │
│ │ ├── CPU: 45% (threshold: 90%) │ │
│ │ └── Status: Normal │ │
│ └─────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────┘
```
### Environment Variables
| Variable | Default | Description |
|----------|---------|-------------|
| `MAX_CONCURRENT_TASKS` | 3 | Maximum tasks a worker will run concurrently |
| `MEMORY_BACKOFF_THRESHOLD` | 0.85 | Back off when heap memory exceeds 85% |
| `CPU_BACKOFF_THRESHOLD` | 0.90 | Back off when CPU exceeds 90% |
| `BACKOFF_DURATION_MS` | 10000 | How long to wait when backing off (10s) |
### How It Works
1. **Main Loop**: Worker continuously tries to fill up to `MAX_CONCURRENT_TASKS`
2. **Resource Monitoring**: Before claiming a new task, worker checks memory and CPU
3. **Backoff**: If resources exceed thresholds, worker pauses and stops claiming new tasks
4. **Concurrent Execution**: Tasks run in parallel using `Promise` - they don't block each other
5. **Graceful Shutdown**: On SIGTERM/decommission, worker stops claiming but waits for active tasks
### Resource Monitoring
```typescript
// ResourceStats interface
interface ResourceStats {
memoryPercent: number; // Current heap usage as decimal (0.0-1.0)
memoryMb: number; // Current heap used in MB
memoryTotalMb: number; // Total heap available in MB
cpuPercent: number; // CPU usage as percentage (0-100)
isBackingOff: boolean; // True if worker is in backoff state
backoffReason: string; // Why the worker is backing off
}
```
### Heartbeat Data
Workers report the following in their heartbeat:
```json
{
"worker_id": "worker-abc123",
"current_task_id": 456,
"current_task_ids": [456, 457, 458],
"active_task_count": 3,
"max_concurrent_tasks": 3,
"status": "active",
"resources": {
"memory_mb": 256,
"memory_total_mb": 512,
"memory_rss_mb": 320,
"memory_percent": 50,
"cpu_user_ms": 12500,
"cpu_system_ms": 3200,
"cpu_percent": 45,
"is_backing_off": false,
"backoff_reason": null
}
}
```
### Backoff Behavior
When resources exceed thresholds:
1. Worker logs the backoff reason:
```
[TaskWorker] MyWorker backing off: Memory at 87.3% (threshold: 85%)
```
2. Worker stops claiming new tasks but continues existing tasks
3. After `BACKOFF_DURATION_MS`, worker rechecks resources
4. When resources return to normal:
```
[TaskWorker] MyWorker resuming normal operation
```
### UI Display
The Workers Dashboard shows:
- **Tasks Column**: `2/3 tasks` (active/max concurrent)
- **Resources Column**: Memory % and CPU % with color coding
- Green: < 50%
- Yellow: 50-74%
- Amber: 75-89%
- Red: 90%+
- **Backing Off**: Orange warning badge when worker is in backoff state
### Task Count Badge Details
```
┌─────────────────────────────────────────────┐
│ Worker: "MyWorker" │
│ Tasks: 2/3 tasks #456, #457 │
│ Resources: 🧠 65% 💻 45% │
│ Status: ● Active │
└─────────────────────────────────────────────┘
```
### Best Practices
1. **Start Conservative**: Use `MAX_CONCURRENT_TASKS=3` initially
2. **Monitor Resources**: Watch for frequent backoffs in logs
3. **Tune Per Workload**: I/O-bound tasks benefit from higher concurrency
4. **Scale Horizontally**: Add more pods rather than cranking concurrency too high
### Code References
| File | Purpose |
|------|---------|
| `src/tasks/task-worker.ts:68-71` | Concurrency environment variables |
| `src/tasks/task-worker.ts:104-111` | ResourceStats interface |
| `src/tasks/task-worker.ts:149-179` | getResourceStats() method |
| `src/tasks/task-worker.ts:184-196` | shouldBackOff() method |
| `src/tasks/task-worker.ts:462-516` | mainLoop() with concurrent claiming |
| `src/routes/worker-registry.ts:148-195` | Heartbeat endpoint handling |
| `cannaiq/src/pages/WorkersDashboard.tsx:233-305` | UI components for resources |
## Monitoring
### Logs
Workers log to stdout:
```
[TaskWorker] Starting worker worker-product_resync-a1b2c3d4 for role: product_resync
[TaskWorker] Claimed task 123 (product_resync) for dispensary 456
[TaskWorker] Task 123 completed successfully
```
### Health Check
Check if workers are active:
```sql
SELECT worker_id, role, COUNT(*), MAX(last_heartbeat_at)
FROM worker_tasks
WHERE last_heartbeat_at > NOW() - INTERVAL '5 minutes'
GROUP BY worker_id, role;
```
### Metrics
```sql
-- Tasks by status
SELECT status, COUNT(*) FROM worker_tasks GROUP BY status;
-- Tasks by role
SELECT role, status, COUNT(*) FROM worker_tasks GROUP BY role, status;
-- Average duration by role
SELECT role, AVG(EXTRACT(EPOCH FROM (completed_at - started_at))) as avg_seconds
FROM worker_tasks
WHERE status = 'completed' AND completed_at > NOW() - INTERVAL '24 hours'
GROUP BY role;
```

View File

@@ -0,0 +1,69 @@
apiVersion: batch/v1
kind: CronJob
metadata:
name: ip2location-update
namespace: default
spec:
# Run on the 1st of every month at 3am UTC
schedule: "0 3 1 * *"
concurrencyPolicy: Forbid
successfulJobsHistoryLimit: 3
failedJobsHistoryLimit: 3
jobTemplate:
spec:
template:
spec:
containers:
- name: ip2location-updater
image: curlimages/curl:latest
command:
- /bin/sh
- -c
- |
set -e
echo "Downloading IP2Location LITE DB5..."
# Download to temp
cd /tmp
curl -L -o ip2location.zip "https://www.ip2location.com/download/?token=${IP2LOCATION_TOKEN}&file=DB5LITEBIN"
# Extract
unzip -o ip2location.zip
# Find and copy the BIN file
BIN_FILE=$(ls *.BIN 2>/dev/null | head -1)
if [ -z "$BIN_FILE" ]; then
echo "ERROR: No BIN file found"
exit 1
fi
# Copy to shared volume
cp "$BIN_FILE" /data/IP2LOCATION-LITE-DB5.BIN
echo "Done! Database updated: /data/IP2LOCATION-LITE-DB5.BIN"
env:
- name: IP2LOCATION_TOKEN
valueFrom:
secretKeyRef:
name: dutchie-backend-secret
key: IP2LOCATION_TOKEN
volumeMounts:
- name: ip2location-data
mountPath: /data
restartPolicy: OnFailure
volumes:
- name: ip2location-data
persistentVolumeClaim:
claimName: ip2location-pvc
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: ip2location-pvc
namespace: default
spec:
accessModes:
- ReadWriteOnce
resources:
requests:
storage: 100Mi

View File

@@ -26,6 +26,12 @@ spec:
name: dutchie-backend-config
- secretRef:
name: dutchie-backend-secret
env:
- name: IP2LOCATION_DB_PATH
value: /data/ip2location/IP2LOCATION-LITE-DB5.BIN
volumeMounts:
- name: ip2location-data
mountPath: /data/ip2location
resources:
requests:
memory: "256Mi"
@@ -45,3 +51,7 @@ spec:
port: 3010
initialDelaySeconds: 5
periodSeconds: 5
volumes:
- name: ip2location-data
persistentVolumeClaim:
claimName: ip2location-pvc

View File

@@ -0,0 +1,77 @@
apiVersion: v1
kind: Service
metadata:
name: scraper-worker
namespace: dispensary-scraper
labels:
app: scraper-worker
spec:
clusterIP: None # Headless service required for StatefulSet
selector:
app: scraper-worker
ports:
- port: 3010
name: http
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
name: scraper-worker
namespace: dispensary-scraper
spec:
serviceName: scraper-worker
replicas: 8
podManagementPolicy: Parallel # Start all pods at once
updateStrategy:
type: OnDelete # Pods only update when manually deleted - no automatic restarts
selector:
matchLabels:
app: scraper-worker
template:
metadata:
labels:
app: scraper-worker
spec:
terminationGracePeriodSeconds: 60
imagePullSecrets:
- name: regcred
containers:
- name: worker
image: code.cannabrands.app/creationshop/dispensary-scraper:latest
imagePullPolicy: Always
command: ["node"]
args: ["dist/tasks/task-worker.js"]
env:
- name: WORKER_MODE
value: "true"
- name: POD_NAME
valueFrom:
fieldRef:
fieldPath: metadata.name
- name: MAX_CONCURRENT_TASKS
value: "50"
- name: API_BASE_URL
value: http://scraper
- name: NODE_OPTIONS
value: --max-old-space-size=1500
envFrom:
- configMapRef:
name: scraper-config
- secretRef:
name: scraper-secrets
resources:
requests:
cpu: 100m
memory: 1Gi
limits:
cpu: 500m
memory: 2Gi
livenessProbe:
exec:
command:
- /bin/sh
- -c
- pgrep -f 'task-worker' > /dev/null
initialDelaySeconds: 10
periodSeconds: 30
failureThreshold: 3

View File

@@ -0,0 +1,12 @@
-- Add timezone column to proxies table for geo-consistent fingerprinting
-- This allows matching Accept-Language and other headers to proxy location
ALTER TABLE proxies
ADD COLUMN IF NOT EXISTS timezone VARCHAR(50);
-- Add timezone to failed_proxies as well
ALTER TABLE failed_proxies
ADD COLUMN IF NOT EXISTS timezone VARCHAR(50);
-- Comment explaining usage
COMMENT ON COLUMN proxies.timezone IS 'IANA timezone (e.g., America/Phoenix) for geo-consistent fingerprinting';

View File

@@ -0,0 +1,27 @@
-- Migration: Worker Commands Table
-- Purpose: Store commands for workers (decommission, etc.)
-- Workers poll this table after each task to check for commands
CREATE TABLE IF NOT EXISTS worker_commands (
id SERIAL PRIMARY KEY,
worker_id TEXT NOT NULL,
command TEXT NOT NULL, -- 'decommission', 'pause', 'resume'
reason TEXT,
issued_by TEXT,
issued_at TIMESTAMPTZ DEFAULT NOW(),
acknowledged_at TIMESTAMPTZ,
executed_at TIMESTAMPTZ,
status TEXT DEFAULT 'pending' -- 'pending', 'acknowledged', 'executed', 'cancelled'
);
-- Index for worker lookups
CREATE INDEX IF NOT EXISTS idx_worker_commands_worker_id ON worker_commands(worker_id);
CREATE INDEX IF NOT EXISTS idx_worker_commands_pending ON worker_commands(worker_id, status) WHERE status = 'pending';
-- Add decommission_requested column to worker_registry for quick checks
ALTER TABLE worker_registry ADD COLUMN IF NOT EXISTS decommission_requested BOOLEAN DEFAULT FALSE;
ALTER TABLE worker_registry ADD COLUMN IF NOT EXISTS decommission_reason TEXT;
ALTER TABLE worker_registry ADD COLUMN IF NOT EXISTS decommission_requested_at TIMESTAMPTZ;
-- Comment
COMMENT ON TABLE worker_commands IS 'Commands issued to workers (decommission after task, pause, etc.)';

View File

@@ -0,0 +1,322 @@
-- Migration 074: Worker Task Queue System
-- Implements role-based task queue with per-store locking and capacity tracking
-- Task queue table
CREATE TABLE IF NOT EXISTS worker_tasks (
id SERIAL PRIMARY KEY,
-- Task identification
role VARCHAR(50) NOT NULL, -- store_discovery, entry_point_discovery, product_discovery, product_resync, analytics_refresh
dispensary_id INTEGER REFERENCES dispensaries(id) ON DELETE CASCADE,
platform VARCHAR(20), -- dutchie, jane, treez, etc.
-- Task state
status VARCHAR(20) NOT NULL DEFAULT 'pending',
priority INTEGER DEFAULT 0, -- Higher = more urgent
-- Scheduling
scheduled_for TIMESTAMPTZ, -- For batch scheduling (e.g., every 4 hours)
-- Ownership
worker_id VARCHAR(100), -- Pod name or worker ID
claimed_at TIMESTAMPTZ,
started_at TIMESTAMPTZ,
completed_at TIMESTAMPTZ,
last_heartbeat_at TIMESTAMPTZ,
-- Results
result JSONB, -- Task output data
error_message TEXT,
retry_count INTEGER DEFAULT 0,
max_retries INTEGER DEFAULT 3,
-- Metadata
created_at TIMESTAMPTZ DEFAULT NOW(),
updated_at TIMESTAMPTZ DEFAULT NOW(),
-- Constraints
CONSTRAINT valid_status CHECK (status IN ('pending', 'claimed', 'running', 'completed', 'failed', 'stale'))
);
-- Indexes for efficient task claiming
CREATE INDEX IF NOT EXISTS idx_worker_tasks_pending
ON worker_tasks(role, priority DESC, created_at ASC)
WHERE status = 'pending';
CREATE INDEX IF NOT EXISTS idx_worker_tasks_claimed
ON worker_tasks(worker_id, claimed_at)
WHERE status = 'claimed';
CREATE INDEX IF NOT EXISTS idx_worker_tasks_running
ON worker_tasks(worker_id, last_heartbeat_at)
WHERE status = 'running';
CREATE INDEX IF NOT EXISTS idx_worker_tasks_dispensary
ON worker_tasks(dispensary_id)
WHERE dispensary_id IS NOT NULL;
CREATE INDEX IF NOT EXISTS idx_worker_tasks_scheduled
ON worker_tasks(scheduled_for)
WHERE status = 'pending' AND scheduled_for IS NOT NULL;
CREATE INDEX IF NOT EXISTS idx_worker_tasks_history
ON worker_tasks(role, completed_at DESC)
WHERE status IN ('completed', 'failed');
-- Partial unique index to prevent duplicate active tasks per store
-- Only one task can be claimed/running for a given dispensary at a time
CREATE UNIQUE INDEX IF NOT EXISTS idx_worker_tasks_unique_active_store
ON worker_tasks(dispensary_id)
WHERE status IN ('claimed', 'running') AND dispensary_id IS NOT NULL;
-- Worker registration table (tracks active workers)
CREATE TABLE IF NOT EXISTS worker_registry (
id SERIAL PRIMARY KEY,
worker_id VARCHAR(100) UNIQUE NOT NULL,
role VARCHAR(50) NOT NULL,
pod_name VARCHAR(100),
hostname VARCHAR(100),
started_at TIMESTAMPTZ DEFAULT NOW(),
last_heartbeat_at TIMESTAMPTZ DEFAULT NOW(),
tasks_completed INTEGER DEFAULT 0,
tasks_failed INTEGER DEFAULT 0,
status VARCHAR(20) DEFAULT 'active',
CONSTRAINT valid_worker_status CHECK (status IN ('active', 'idle', 'offline'))
);
CREATE INDEX IF NOT EXISTS idx_worker_registry_role
ON worker_registry(role, status);
CREATE INDEX IF NOT EXISTS idx_worker_registry_heartbeat
ON worker_registry(last_heartbeat_at)
WHERE status = 'active';
-- Task completion tracking (summarized history)
CREATE TABLE IF NOT EXISTS task_completion_log (
id SERIAL PRIMARY KEY,
role VARCHAR(50) NOT NULL,
date DATE NOT NULL DEFAULT CURRENT_DATE,
hour INTEGER NOT NULL DEFAULT EXTRACT(HOUR FROM NOW()),
tasks_created INTEGER DEFAULT 0,
tasks_completed INTEGER DEFAULT 0,
tasks_failed INTEGER DEFAULT 0,
avg_duration_sec NUMERIC(10,2),
min_duration_sec NUMERIC(10,2),
max_duration_sec NUMERIC(10,2),
updated_at TIMESTAMPTZ DEFAULT NOW(),
UNIQUE(role, date, hour)
);
-- Capacity planning view
CREATE OR REPLACE VIEW v_worker_capacity AS
SELECT
role,
COUNT(*) FILTER (WHERE status = 'pending') as pending_tasks,
COUNT(*) FILTER (WHERE status = 'pending' AND (scheduled_for IS NULL OR scheduled_for <= NOW())) as ready_tasks,
COUNT(*) FILTER (WHERE status = 'claimed') as claimed_tasks,
COUNT(*) FILTER (WHERE status = 'running') as running_tasks,
COUNT(*) FILTER (WHERE status = 'completed' AND completed_at > NOW() - INTERVAL '1 hour') as completed_last_hour,
COUNT(*) FILTER (WHERE status = 'failed' AND completed_at > NOW() - INTERVAL '1 hour') as failed_last_hour,
COUNT(DISTINCT worker_id) FILTER (WHERE status IN ('claimed', 'running')) as active_workers,
AVG(EXTRACT(EPOCH FROM (completed_at - started_at)))
FILTER (WHERE status = 'completed' AND completed_at > NOW() - INTERVAL '1 hour') as avg_duration_sec,
-- Capacity planning metrics
CASE
WHEN COUNT(*) FILTER (WHERE status = 'completed' AND completed_at > NOW() - INTERVAL '1 hour') > 0
THEN 3600.0 / NULLIF(AVG(EXTRACT(EPOCH FROM (completed_at - started_at)))
FILTER (WHERE status = 'completed' AND completed_at > NOW() - INTERVAL '1 hour'), 0)
ELSE NULL
END as tasks_per_worker_hour,
-- Estimated time to drain queue
CASE
WHEN COUNT(DISTINCT worker_id) FILTER (WHERE status IN ('claimed', 'running')) > 0
AND COUNT(*) FILTER (WHERE status = 'completed' AND completed_at > NOW() - INTERVAL '1 hour') > 0
THEN COUNT(*) FILTER (WHERE status = 'pending') / NULLIF(
COUNT(DISTINCT worker_id) FILTER (WHERE status IN ('claimed', 'running')) *
(3600.0 / NULLIF(AVG(EXTRACT(EPOCH FROM (completed_at - started_at)))
FILTER (WHERE status = 'completed' AND completed_at > NOW() - INTERVAL '1 hour'), 0)),
0
)
ELSE NULL
END as estimated_hours_to_drain
FROM worker_tasks
GROUP BY role;
-- Task history view (for UI)
CREATE OR REPLACE VIEW v_task_history AS
SELECT
t.id,
t.role,
t.dispensary_id,
d.name as dispensary_name,
t.platform,
t.status,
t.priority,
t.worker_id,
t.scheduled_for,
t.claimed_at,
t.started_at,
t.completed_at,
t.error_message,
t.retry_count,
t.created_at,
EXTRACT(EPOCH FROM (t.completed_at - t.started_at)) as duration_sec
FROM worker_tasks t
LEFT JOIN dispensaries d ON d.id = t.dispensary_id
ORDER BY t.created_at DESC;
-- Function to claim a task atomically
CREATE OR REPLACE FUNCTION claim_task(
p_role VARCHAR(50),
p_worker_id VARCHAR(100)
) RETURNS worker_tasks AS $$
DECLARE
claimed_task worker_tasks;
BEGIN
UPDATE worker_tasks
SET
status = 'claimed',
worker_id = p_worker_id,
claimed_at = NOW(),
updated_at = NOW()
WHERE id = (
SELECT id FROM worker_tasks
WHERE role = p_role
AND status = 'pending'
AND (scheduled_for IS NULL OR scheduled_for <= NOW())
-- Exclude stores that already have an active task
AND (dispensary_id IS NULL OR dispensary_id NOT IN (
SELECT dispensary_id FROM worker_tasks
WHERE status IN ('claimed', 'running')
AND dispensary_id IS NOT NULL
))
ORDER BY priority DESC, created_at ASC
LIMIT 1
FOR UPDATE SKIP LOCKED
)
RETURNING * INTO claimed_task;
RETURN claimed_task;
END;
$$ LANGUAGE plpgsql;
-- Function to mark stale tasks (workers that died)
CREATE OR REPLACE FUNCTION recover_stale_tasks(
stale_threshold_minutes INTEGER DEFAULT 10
) RETURNS INTEGER AS $$
DECLARE
recovered_count INTEGER;
BEGIN
WITH stale AS (
UPDATE worker_tasks
SET
status = 'pending',
worker_id = NULL,
claimed_at = NULL,
started_at = NULL,
retry_count = retry_count + 1,
updated_at = NOW()
WHERE status IN ('claimed', 'running')
AND last_heartbeat_at < NOW() - (stale_threshold_minutes || ' minutes')::INTERVAL
AND retry_count < max_retries
RETURNING id
)
SELECT COUNT(*) INTO recovered_count FROM stale;
-- Mark tasks that exceeded retries as failed
UPDATE worker_tasks
SET
status = 'failed',
error_message = 'Exceeded max retries after worker failures',
completed_at = NOW(),
updated_at = NOW()
WHERE status IN ('claimed', 'running')
AND last_heartbeat_at < NOW() - (stale_threshold_minutes || ' minutes')::INTERVAL
AND retry_count >= max_retries;
RETURN recovered_count;
END;
$$ LANGUAGE plpgsql;
-- Function to generate daily resync tasks
CREATE OR REPLACE FUNCTION generate_resync_tasks(
p_batches_per_day INTEGER DEFAULT 6, -- Every 4 hours
p_date DATE DEFAULT CURRENT_DATE
) RETURNS INTEGER AS $$
DECLARE
store_count INTEGER;
stores_per_batch INTEGER;
batch_num INTEGER;
scheduled_time TIMESTAMPTZ;
created_count INTEGER := 0;
BEGIN
-- Count active stores that need resync
SELECT COUNT(*) INTO store_count
FROM dispensaries
WHERE crawl_enabled = true
AND menu_type = 'dutchie'
AND platform_dispensary_id IS NOT NULL;
IF store_count = 0 THEN
RETURN 0;
END IF;
stores_per_batch := CEIL(store_count::NUMERIC / p_batches_per_day);
FOR batch_num IN 0..(p_batches_per_day - 1) LOOP
scheduled_time := p_date + (batch_num * 4 || ' hours')::INTERVAL;
INSERT INTO worker_tasks (role, dispensary_id, platform, scheduled_for, priority)
SELECT
'product_resync',
d.id,
'dutchie',
scheduled_time,
0
FROM (
SELECT id, ROW_NUMBER() OVER (ORDER BY id) as rn
FROM dispensaries
WHERE crawl_enabled = true
AND menu_type = 'dutchie'
AND platform_dispensary_id IS NOT NULL
) d
WHERE d.rn > (batch_num * stores_per_batch)
AND d.rn <= ((batch_num + 1) * stores_per_batch)
ON CONFLICT DO NOTHING;
GET DIAGNOSTICS created_count = created_count + ROW_COUNT;
END LOOP;
RETURN created_count;
END;
$$ LANGUAGE plpgsql;
-- Trigger to update timestamp
CREATE OR REPLACE FUNCTION update_worker_tasks_timestamp()
RETURNS TRIGGER AS $$
BEGIN
NEW.updated_at = NOW();
RETURN NEW;
END;
$$ LANGUAGE plpgsql;
DROP TRIGGER IF EXISTS worker_tasks_updated_at ON worker_tasks;
CREATE TRIGGER worker_tasks_updated_at
BEFORE UPDATE ON worker_tasks
FOR EACH ROW
EXECUTE FUNCTION update_worker_tasks_timestamp();
-- Comments
COMMENT ON TABLE worker_tasks IS 'Central task queue for all worker roles';
COMMENT ON TABLE worker_registry IS 'Registry of active workers and their stats';
COMMENT ON TABLE task_completion_log IS 'Hourly aggregated task completion metrics';
COMMENT ON VIEW v_worker_capacity IS 'Real-time capacity planning metrics per role';
COMMENT ON VIEW v_task_history IS 'Task history with dispensary details for UI';
COMMENT ON FUNCTION claim_task IS 'Atomically claim a task for a worker, respecting per-store locking';
COMMENT ON FUNCTION recover_stale_tasks IS 'Release tasks from dead workers back to pending';
COMMENT ON FUNCTION generate_resync_tasks IS 'Generate daily product resync tasks in batches';

View File

@@ -0,0 +1,13 @@
-- Migration 075: Add consecutive_misses column to store_products
-- Used to track how many consecutive crawls a product has been missing from the feed
-- After 3 consecutive misses, product is marked as OOS
ALTER TABLE store_products
ADD COLUMN IF NOT EXISTS consecutive_misses INTEGER NOT NULL DEFAULT 0;
-- Index for finding products that need OOS check
CREATE INDEX IF NOT EXISTS idx_store_products_consecutive_misses
ON store_products (dispensary_id, consecutive_misses)
WHERE consecutive_misses > 0;
COMMENT ON COLUMN store_products.consecutive_misses IS 'Number of consecutive crawls where product was not in feed. Reset to 0 when seen. At 3, mark OOS.';

View File

@@ -0,0 +1,71 @@
-- Visitor location analytics for Findagram
-- Tracks visitor locations to understand popular areas
CREATE TABLE IF NOT EXISTS visitor_locations (
id SERIAL PRIMARY KEY,
-- Location data (from IP lookup)
ip_hash VARCHAR(64), -- Hashed IP for privacy (SHA256)
city VARCHAR(100),
state VARCHAR(100),
state_code VARCHAR(10),
country VARCHAR(100),
country_code VARCHAR(10),
latitude DECIMAL(10, 7),
longitude DECIMAL(10, 7),
-- Visit metadata
domain VARCHAR(50) NOT NULL, -- 'findagram.co', 'findadispo.com', etc.
page_path VARCHAR(255), -- '/products', '/dispensaries/123', etc.
referrer VARCHAR(500),
user_agent VARCHAR(500),
-- Session tracking
session_id VARCHAR(64), -- For grouping page views in a session
-- Timestamps
created_at TIMESTAMPTZ DEFAULT NOW()
);
-- Indexes for analytics queries
CREATE INDEX IF NOT EXISTS idx_visitor_locations_domain ON visitor_locations(domain);
CREATE INDEX IF NOT EXISTS idx_visitor_locations_city_state ON visitor_locations(city, state_code);
CREATE INDEX IF NOT EXISTS idx_visitor_locations_created_at ON visitor_locations(created_at);
CREATE INDEX IF NOT EXISTS idx_visitor_locations_session ON visitor_locations(session_id);
-- Aggregated daily stats (materialized for performance)
CREATE TABLE IF NOT EXISTS visitor_location_stats (
id SERIAL PRIMARY KEY,
date DATE NOT NULL,
domain VARCHAR(50) NOT NULL,
city VARCHAR(100),
state VARCHAR(100),
state_code VARCHAR(10),
country_code VARCHAR(10),
-- Metrics
visit_count INTEGER DEFAULT 0,
unique_sessions INTEGER DEFAULT 0,
UNIQUE(date, domain, city, state_code, country_code)
);
CREATE INDEX IF NOT EXISTS idx_visitor_stats_date ON visitor_location_stats(date);
CREATE INDEX IF NOT EXISTS idx_visitor_stats_domain ON visitor_location_stats(domain);
CREATE INDEX IF NOT EXISTS idx_visitor_stats_state ON visitor_location_stats(state_code);
-- View for easy querying of top locations
CREATE OR REPLACE VIEW v_top_visitor_locations AS
SELECT
domain,
city,
state,
state_code,
country_code,
COUNT(*) as total_visits,
COUNT(DISTINCT session_id) as unique_sessions,
MAX(created_at) as last_visit
FROM visitor_locations
WHERE created_at > NOW() - INTERVAL '30 days'
GROUP BY domain, city, state, state_code, country_code
ORDER BY total_visits DESC;

View File

@@ -0,0 +1,141 @@
-- Migration 076: Worker Registry for Dynamic Workers
-- Workers register on startup, receive a friendly name, and report heartbeats
-- Name pool for workers (expandable, no hardcoding)
CREATE TABLE IF NOT EXISTS worker_name_pool (
id SERIAL PRIMARY KEY,
name VARCHAR(50) UNIQUE NOT NULL,
in_use BOOLEAN DEFAULT FALSE,
assigned_to VARCHAR(100), -- worker_id
assigned_at TIMESTAMPTZ,
created_at TIMESTAMPTZ DEFAULT NOW()
);
-- Seed with initial names (can add more via API)
INSERT INTO worker_name_pool (name) VALUES
('Alice'), ('Bella'), ('Clara'), ('Diana'), ('Elena'),
('Fiona'), ('Grace'), ('Hazel'), ('Iris'), ('Julia'),
('Katie'), ('Luna'), ('Mia'), ('Nora'), ('Olive'),
('Pearl'), ('Quinn'), ('Rosa'), ('Sara'), ('Tara'),
('Uma'), ('Vera'), ('Wendy'), ('Xena'), ('Yuki'), ('Zara'),
('Amber'), ('Blake'), ('Coral'), ('Dawn'), ('Echo'),
('Fleur'), ('Gem'), ('Haven'), ('Ivy'), ('Jade'),
('Kira'), ('Lotus'), ('Maple'), ('Nova'), ('Onyx'),
('Pixel'), ('Quest'), ('Raven'), ('Sage'), ('Terra'),
('Unity'), ('Violet'), ('Willow'), ('Xylo'), ('Yara'), ('Zen')
ON CONFLICT (name) DO NOTHING;
-- Worker registry - tracks active workers
CREATE TABLE IF NOT EXISTS worker_registry (
id SERIAL PRIMARY KEY,
worker_id VARCHAR(100) UNIQUE NOT NULL, -- e.g., "pod-abc123" or uuid
friendly_name VARCHAR(50), -- assigned from pool
role VARCHAR(50) NOT NULL, -- task role
pod_name VARCHAR(100), -- k8s pod name
hostname VARCHAR(100), -- machine hostname
ip_address VARCHAR(50), -- worker IP
status VARCHAR(20) DEFAULT 'starting', -- starting, active, idle, offline, terminated
started_at TIMESTAMPTZ DEFAULT NOW(),
last_heartbeat_at TIMESTAMPTZ DEFAULT NOW(),
last_task_at TIMESTAMPTZ,
tasks_completed INTEGER DEFAULT 0,
tasks_failed INTEGER DEFAULT 0,
current_task_id INTEGER,
metadata JSONB DEFAULT '{}',
created_at TIMESTAMPTZ DEFAULT NOW(),
updated_at TIMESTAMPTZ DEFAULT NOW()
);
-- Indexes for worker registry
CREATE INDEX IF NOT EXISTS idx_worker_registry_status ON worker_registry(status);
CREATE INDEX IF NOT EXISTS idx_worker_registry_role ON worker_registry(role);
CREATE INDEX IF NOT EXISTS idx_worker_registry_heartbeat ON worker_registry(last_heartbeat_at);
-- Function to assign a name to a new worker
CREATE OR REPLACE FUNCTION assign_worker_name(p_worker_id VARCHAR(100))
RETURNS VARCHAR(50) AS $$
DECLARE
v_name VARCHAR(50);
BEGIN
-- Try to get an unused name
UPDATE worker_name_pool
SET in_use = TRUE, assigned_to = p_worker_id, assigned_at = NOW()
WHERE id = (
SELECT id FROM worker_name_pool
WHERE in_use = FALSE
ORDER BY RANDOM()
LIMIT 1
FOR UPDATE SKIP LOCKED
)
RETURNING name INTO v_name;
-- If no names available, generate one
IF v_name IS NULL THEN
v_name := 'Worker-' || SUBSTRING(p_worker_id FROM 1 FOR 8);
END IF;
RETURN v_name;
END;
$$ LANGUAGE plpgsql;
-- Function to release a worker's name back to the pool
CREATE OR REPLACE FUNCTION release_worker_name(p_worker_id VARCHAR(100))
RETURNS VOID AS $$
BEGIN
UPDATE worker_name_pool
SET in_use = FALSE, assigned_to = NULL, assigned_at = NULL
WHERE assigned_to = p_worker_id;
END;
$$ LANGUAGE plpgsql;
-- Function to mark stale workers as offline
CREATE OR REPLACE FUNCTION mark_stale_workers(stale_threshold_minutes INTEGER DEFAULT 5)
RETURNS INTEGER AS $$
DECLARE
v_count INTEGER;
BEGIN
UPDATE worker_registry
SET status = 'offline', updated_at = NOW()
WHERE status IN ('active', 'idle', 'starting')
AND last_heartbeat_at < NOW() - (stale_threshold_minutes || ' minutes')::INTERVAL
RETURNING COUNT(*) INTO v_count;
-- Release names from offline workers
PERFORM release_worker_name(worker_id)
FROM worker_registry
WHERE status = 'offline'
AND last_heartbeat_at < NOW() - INTERVAL '30 minutes';
RETURN COALESCE(v_count, 0);
END;
$$ LANGUAGE plpgsql;
-- View for dashboard
CREATE OR REPLACE VIEW v_active_workers AS
SELECT
wr.id,
wr.worker_id,
wr.friendly_name,
wr.role,
wr.status,
wr.pod_name,
wr.hostname,
wr.started_at,
wr.last_heartbeat_at,
wr.last_task_at,
wr.tasks_completed,
wr.tasks_failed,
wr.current_task_id,
EXTRACT(EPOCH FROM (NOW() - wr.last_heartbeat_at)) as seconds_since_heartbeat,
CASE
WHEN wr.status = 'offline' THEN 'offline'
WHEN wr.last_heartbeat_at < NOW() - INTERVAL '2 minutes' THEN 'stale'
WHEN wr.current_task_id IS NOT NULL THEN 'busy'
ELSE 'ready'
END as health_status
FROM worker_registry wr
WHERE wr.status != 'terminated'
ORDER BY wr.status = 'active' DESC, wr.last_heartbeat_at DESC;
COMMENT ON TABLE worker_registry IS 'Tracks all workers that have registered with the system';
COMMENT ON TABLE worker_name_pool IS 'Pool of friendly names for workers - expandable via API';

View File

@@ -0,0 +1,35 @@
-- Migration: Add visitor location and dispensary name to click events
-- Captures where visitors are clicking from and which dispensary
-- Add visitor location columns
ALTER TABLE product_click_events
ADD COLUMN IF NOT EXISTS visitor_city VARCHAR(100);
ALTER TABLE product_click_events
ADD COLUMN IF NOT EXISTS visitor_state VARCHAR(10);
ALTER TABLE product_click_events
ADD COLUMN IF NOT EXISTS visitor_lat DECIMAL(10, 7);
ALTER TABLE product_click_events
ADD COLUMN IF NOT EXISTS visitor_lng DECIMAL(10, 7);
-- Add dispensary name for easier reporting
ALTER TABLE product_click_events
ADD COLUMN IF NOT EXISTS dispensary_name VARCHAR(255);
-- Create index for location-based analytics
CREATE INDEX IF NOT EXISTS idx_product_click_events_visitor_state
ON product_click_events(visitor_state)
WHERE visitor_state IS NOT NULL;
CREATE INDEX IF NOT EXISTS idx_product_click_events_visitor_city
ON product_click_events(visitor_city)
WHERE visitor_city IS NOT NULL;
-- Add comments
COMMENT ON COLUMN product_click_events.visitor_city IS 'City where the visitor is located (from IP geolocation)';
COMMENT ON COLUMN product_click_events.visitor_state IS 'State where the visitor is located (from IP geolocation)';
COMMENT ON COLUMN product_click_events.visitor_lat IS 'Visitor latitude (from IP geolocation)';
COMMENT ON COLUMN product_click_events.visitor_lng IS 'Visitor longitude (from IP geolocation)';
COMMENT ON COLUMN product_click_events.dispensary_name IS 'Name of the dispensary (denormalized for easier reporting)';

View File

@@ -0,0 +1,8 @@
-- Migration 078: Add consecutive_403_count to proxies table
-- Per workflow-12102025.md: Track consecutive 403s per proxy
-- After 3 consecutive 403s with different fingerprints → disable proxy
ALTER TABLE proxies ADD COLUMN IF NOT EXISTS consecutive_403_count INTEGER DEFAULT 0;
-- Add comment explaining the column
COMMENT ON COLUMN proxies.consecutive_403_count IS 'Tracks consecutive 403 blocks. Reset to 0 on success. Proxy disabled at 3.';

View File

@@ -0,0 +1,49 @@
-- Migration 079: Task Schedules for Database-Driven Scheduler
-- Per TASK_WORKFLOW_2024-12-10.md: Replaces node-cron with DB-driven scheduling
--
-- 2024-12-10: Created for reliable, multi-replica-safe task scheduling
-- task_schedules: Stores schedule definitions and state
CREATE TABLE IF NOT EXISTS task_schedules (
id SERIAL PRIMARY KEY,
name VARCHAR(100) NOT NULL UNIQUE,
role VARCHAR(50) NOT NULL, -- TaskRole: product_refresh, store_discovery, etc.
description TEXT,
-- Schedule configuration
enabled BOOLEAN DEFAULT TRUE,
interval_hours INTEGER NOT NULL DEFAULT 4,
priority INTEGER DEFAULT 0,
-- Optional scope filters
state_code VARCHAR(2), -- NULL = all states
platform VARCHAR(50), -- NULL = all platforms
-- Execution state (updated by scheduler)
last_run_at TIMESTAMPTZ,
next_run_at TIMESTAMPTZ,
last_task_count INTEGER DEFAULT 0,
last_error TEXT,
created_at TIMESTAMPTZ DEFAULT NOW(),
updated_at TIMESTAMPTZ DEFAULT NOW()
);
-- Indexes for scheduler queries
CREATE INDEX IF NOT EXISTS idx_task_schedules_enabled ON task_schedules(enabled) WHERE enabled = TRUE;
CREATE INDEX IF NOT EXISTS idx_task_schedules_next_run ON task_schedules(next_run_at) WHERE enabled = TRUE;
-- Insert default schedules
INSERT INTO task_schedules (name, role, interval_hours, priority, description, next_run_at)
VALUES
('product_refresh_all', 'product_refresh', 4, 0, 'Generate product refresh tasks for all crawl-enabled stores every 4 hours', NOW()),
('store_discovery_dutchie', 'store_discovery', 24, 5, 'Discover new Dutchie stores daily', NOW()),
('analytics_refresh', 'analytics_refresh', 6, 0, 'Refresh analytics materialized views every 6 hours', NOW())
ON CONFLICT (name) DO NOTHING;
-- Comment for documentation
COMMENT ON TABLE task_schedules IS 'Database-driven task scheduler configuration. Per TASK_WORKFLOW_2024-12-10.md:
- Schedules persist in DB (survive restarts)
- Uses SELECT FOR UPDATE SKIP LOCKED for multi-replica safety
- Scheduler polls every 60s and executes due schedules
- Creates tasks in worker_tasks for task-worker.ts to process';

View File

@@ -0,0 +1,58 @@
-- Migration 080: Raw Crawl Payloads Metadata Table
-- Per TASK_WORKFLOW_2024-12-10.md: Store full GraphQL payloads for historical analysis
--
-- Design Pattern: Metadata/Payload Separation
-- - Metadata (this table): Small, indexed, queryable
-- - Payload (filesystem): Gzipped JSON at storage_path
--
-- Benefits:
-- - Compare any two crawls to see what changed
-- - Replay/re-normalize historical data if logic changes
-- - Debug issues by seeing exactly what the API returned
-- - DB stays small, backups stay fast
--
-- Storage location: /storage/payloads/{year}/{month}/{day}/store_{id}_{timestamp}.json.gz
-- Compression: ~90% reduction (1.5MB -> 150KB per crawl)
CREATE TABLE IF NOT EXISTS raw_crawl_payloads (
id SERIAL PRIMARY KEY,
-- Links to crawl tracking
crawl_run_id INTEGER REFERENCES crawl_runs(id) ON DELETE SET NULL,
dispensary_id INTEGER NOT NULL REFERENCES dispensaries(id) ON DELETE CASCADE,
-- File location (gzipped JSON)
storage_path TEXT NOT NULL,
-- Metadata for quick queries without loading file
product_count INTEGER NOT NULL DEFAULT 0,
size_bytes INTEGER, -- Compressed size
size_bytes_raw INTEGER, -- Uncompressed size
-- Timestamps
fetched_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
-- Optional: checksum for integrity verification
checksum_sha256 VARCHAR(64)
);
-- Indexes for common queries
CREATE INDEX IF NOT EXISTS idx_raw_crawl_payloads_dispensary
ON raw_crawl_payloads(dispensary_id);
CREATE INDEX IF NOT EXISTS idx_raw_crawl_payloads_dispensary_fetched
ON raw_crawl_payloads(dispensary_id, fetched_at DESC);
CREATE INDEX IF NOT EXISTS idx_raw_crawl_payloads_fetched
ON raw_crawl_payloads(fetched_at DESC);
CREATE INDEX IF NOT EXISTS idx_raw_crawl_payloads_crawl_run
ON raw_crawl_payloads(crawl_run_id)
WHERE crawl_run_id IS NOT NULL;
-- Comments
COMMENT ON TABLE raw_crawl_payloads IS 'Metadata for raw GraphQL payloads stored on filesystem. Per TASK_WORKFLOW_2024-12-10.md: Full payloads enable historical diffs and replay.';
COMMENT ON COLUMN raw_crawl_payloads.storage_path IS 'Path to gzipped JSON file, e.g. /storage/payloads/2024/12/10/store_123_1702234567.json.gz';
COMMENT ON COLUMN raw_crawl_payloads.size_bytes IS 'Compressed file size in bytes';
COMMENT ON COLUMN raw_crawl_payloads.size_bytes_raw IS 'Uncompressed payload size in bytes';

View File

@@ -0,0 +1,37 @@
-- Migration 081: Payload Fetch Columns
-- Per TASK_WORKFLOW_2024-12-10.md: Separates API fetch from data processing
--
-- New architecture:
-- - payload_fetch: Hits Dutchie API, saves raw payload to disk
-- - product_refresh: Reads local payload, normalizes, upserts to DB
--
-- This migration adds:
-- 1. payload column to worker_tasks (for task chaining data)
-- 2. processed_at column to raw_crawl_payloads (track when payload was processed)
-- 3. last_fetch_at column to dispensaries (track when last payload was fetched)
-- Add payload column to worker_tasks for task chaining
-- Used by payload_fetch to pass payload_id to product_refresh
ALTER TABLE worker_tasks
ADD COLUMN IF NOT EXISTS payload JSONB DEFAULT NULL;
COMMENT ON COLUMN worker_tasks.payload IS 'Per TASK_WORKFLOW_2024-12-10.md: Task chaining data (e.g., payload_id from payload_fetch to product_refresh)';
-- Add processed_at to raw_crawl_payloads
-- Tracks when the payload was processed by product_refresh
ALTER TABLE raw_crawl_payloads
ADD COLUMN IF NOT EXISTS processed_at TIMESTAMPTZ DEFAULT NULL;
COMMENT ON COLUMN raw_crawl_payloads.processed_at IS 'When this payload was processed by product_refresh handler';
-- Index for finding unprocessed payloads
CREATE INDEX IF NOT EXISTS idx_raw_crawl_payloads_unprocessed
ON raw_crawl_payloads(dispensary_id, fetched_at DESC)
WHERE processed_at IS NULL;
-- Add last_fetch_at to dispensaries
-- Tracks when the last payload was fetched (separate from last_crawl_at which is when processing completed)
ALTER TABLE dispensaries
ADD COLUMN IF NOT EXISTS last_fetch_at TIMESTAMPTZ DEFAULT NULL;
COMMENT ON COLUMN dispensaries.last_fetch_at IS 'Per TASK_WORKFLOW_2024-12-10.md: When last payload was fetched from API (separate from last_crawl_at which is when processing completed)';

View File

@@ -0,0 +1,27 @@
-- Migration: 082_proxy_notification_trigger
-- Date: 2024-12-11
-- Description: Add PostgreSQL NOTIFY trigger to alert workers when proxies are added
-- Create function to notify workers when active proxy is added/activated
CREATE OR REPLACE FUNCTION notify_proxy_added()
RETURNS TRIGGER AS $$
BEGIN
-- Only notify if proxy is active
IF NEW.active = true THEN
PERFORM pg_notify('proxy_added', NEW.id::text);
END IF;
RETURN NEW;
END;
$$ LANGUAGE plpgsql;
-- Drop existing trigger if any
DROP TRIGGER IF EXISTS proxy_added_trigger ON proxies;
-- Create trigger on insert and update of active column
CREATE TRIGGER proxy_added_trigger
AFTER INSERT OR UPDATE OF active ON proxies
FOR EACH ROW
EXECUTE FUNCTION notify_proxy_added();
COMMENT ON FUNCTION notify_proxy_added() IS
'Sends PostgreSQL NOTIFY to proxy_added channel when an active proxy is added or activated. Workers LISTEN on this channel to wake up immediately.';

View File

@@ -0,0 +1,88 @@
-- Migration 083: Discovery Run Tracking
-- Tracks progress of store discovery runs step-by-step
-- Main discovery runs table
CREATE TABLE IF NOT EXISTS discovery_runs (
id SERIAL PRIMARY KEY,
platform VARCHAR(50) NOT NULL DEFAULT 'dutchie',
status VARCHAR(20) NOT NULL DEFAULT 'running', -- running, completed, failed
started_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
finished_at TIMESTAMPTZ,
task_id INTEGER REFERENCES worker_task_queue(id),
-- Totals
states_total INTEGER DEFAULT 0,
states_completed INTEGER DEFAULT 0,
locations_discovered INTEGER DEFAULT 0,
locations_promoted INTEGER DEFAULT 0,
new_store_ids INTEGER[] DEFAULT '{}',
-- Error info
error_message TEXT,
created_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
-- Per-state progress within a run
CREATE TABLE IF NOT EXISTS discovery_run_states (
id SERIAL PRIMARY KEY,
run_id INTEGER NOT NULL REFERENCES discovery_runs(id) ON DELETE CASCADE,
state_code VARCHAR(2) NOT NULL,
status VARCHAR(20) NOT NULL DEFAULT 'pending', -- pending, running, completed, failed
started_at TIMESTAMPTZ,
finished_at TIMESTAMPTZ,
-- Results
cities_found INTEGER DEFAULT 0,
locations_found INTEGER DEFAULT 0,
locations_upserted INTEGER DEFAULT 0,
new_dispensary_ids INTEGER[] DEFAULT '{}',
-- Error info
error_message TEXT,
created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
UNIQUE(run_id, state_code)
);
-- Step-by-step log for detailed progress tracking
CREATE TABLE IF NOT EXISTS discovery_run_steps (
id SERIAL PRIMARY KEY,
run_id INTEGER NOT NULL REFERENCES discovery_runs(id) ON DELETE CASCADE,
state_code VARCHAR(2),
step_name VARCHAR(100) NOT NULL,
status VARCHAR(20) NOT NULL DEFAULT 'started', -- started, completed, failed
started_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
finished_at TIMESTAMPTZ,
-- Details (JSON for flexibility)
details JSONB DEFAULT '{}',
created_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
-- Indexes for querying
CREATE INDEX IF NOT EXISTS idx_discovery_runs_status ON discovery_runs(status);
CREATE INDEX IF NOT EXISTS idx_discovery_runs_platform ON discovery_runs(platform);
CREATE INDEX IF NOT EXISTS idx_discovery_runs_started_at ON discovery_runs(started_at DESC);
CREATE INDEX IF NOT EXISTS idx_discovery_run_states_run_id ON discovery_run_states(run_id);
CREATE INDEX IF NOT EXISTS idx_discovery_run_steps_run_id ON discovery_run_steps(run_id);
-- View for latest run status per platform
CREATE OR REPLACE VIEW v_latest_discovery_runs AS
SELECT DISTINCT ON (platform)
id,
platform,
status,
started_at,
finished_at,
states_total,
states_completed,
locations_discovered,
locations_promoted,
array_length(new_store_ids, 1) as new_stores_count,
error_message,
EXTRACT(EPOCH FROM (COALESCE(finished_at, NOW()) - started_at)) as duration_seconds
FROM discovery_runs
ORDER BY platform, started_at DESC;

View File

@@ -0,0 +1,253 @@
-- Migration 084: Dual Transport Preflight System
-- Workers run both curl and http (Puppeteer) preflights on startup
-- Tasks can require a specific transport method
-- ===================================================================
-- PART 1: Add preflight columns to worker_registry
-- ===================================================================
-- Preflight status for curl/axios transport (proxy-based)
ALTER TABLE worker_registry
ADD COLUMN IF NOT EXISTS preflight_curl_status VARCHAR(20) DEFAULT 'pending';
-- Preflight status for http/Puppeteer transport (browser-based)
ALTER TABLE worker_registry
ADD COLUMN IF NOT EXISTS preflight_http_status VARCHAR(20) DEFAULT 'pending';
-- Timestamps for when each preflight completed
ALTER TABLE worker_registry
ADD COLUMN IF NOT EXISTS preflight_curl_at TIMESTAMPTZ;
ALTER TABLE worker_registry
ADD COLUMN IF NOT EXISTS preflight_http_at TIMESTAMPTZ;
-- Error messages for failed preflights
ALTER TABLE worker_registry
ADD COLUMN IF NOT EXISTS preflight_curl_error TEXT;
ALTER TABLE worker_registry
ADD COLUMN IF NOT EXISTS preflight_http_error TEXT;
-- Response time for successful preflights (ms)
ALTER TABLE worker_registry
ADD COLUMN IF NOT EXISTS preflight_curl_ms INTEGER;
ALTER TABLE worker_registry
ADD COLUMN IF NOT EXISTS preflight_http_ms INTEGER;
-- Constraints for preflight status values
ALTER TABLE worker_registry
DROP CONSTRAINT IF EXISTS valid_preflight_curl_status;
ALTER TABLE worker_registry
ADD CONSTRAINT valid_preflight_curl_status
CHECK (preflight_curl_status IN ('pending', 'passed', 'failed', 'skipped'));
ALTER TABLE worker_registry
DROP CONSTRAINT IF EXISTS valid_preflight_http_status;
ALTER TABLE worker_registry
ADD CONSTRAINT valid_preflight_http_status
CHECK (preflight_http_status IN ('pending', 'passed', 'failed', 'skipped'));
-- ===================================================================
-- PART 2: Add method column to worker_tasks
-- ===================================================================
-- Transport method requirement for the task
-- NULL = no preference (any worker can claim)
-- 'curl' = requires curl/axios transport (proxy-based, fast)
-- 'http' = requires http/Puppeteer transport (browser-based, anti-detect)
ALTER TABLE worker_tasks
ADD COLUMN IF NOT EXISTS method VARCHAR(10);
-- Constraint for valid method values
ALTER TABLE worker_tasks
DROP CONSTRAINT IF EXISTS valid_task_method;
ALTER TABLE worker_tasks
ADD CONSTRAINT valid_task_method
CHECK (method IS NULL OR method IN ('curl', 'http'));
-- Index for method-based task claiming
CREATE INDEX IF NOT EXISTS idx_worker_tasks_method
ON worker_tasks(method)
WHERE status = 'pending';
-- Set default method for all existing pending tasks to 'http'
-- ALL current tasks require Puppeteer/browser-based transport
UPDATE worker_tasks
SET method = 'http'
WHERE method IS NULL;
-- ===================================================================
-- PART 3: Update claim_task function for method compatibility
-- ===================================================================
CREATE OR REPLACE FUNCTION claim_task(
p_role VARCHAR(50),
p_worker_id VARCHAR(100),
p_curl_passed BOOLEAN DEFAULT TRUE,
p_http_passed BOOLEAN DEFAULT FALSE
) RETURNS worker_tasks AS $$
DECLARE
claimed_task worker_tasks;
BEGIN
UPDATE worker_tasks
SET
status = 'claimed',
worker_id = p_worker_id,
claimed_at = NOW(),
updated_at = NOW()
WHERE id = (
SELECT id FROM worker_tasks
WHERE role = p_role
AND status = 'pending'
AND (scheduled_for IS NULL OR scheduled_for <= NOW())
-- Method compatibility: worker must have passed the required preflight
AND (
method IS NULL -- No preference, any worker can claim
OR (method = 'curl' AND p_curl_passed = TRUE)
OR (method = 'http' AND p_http_passed = TRUE)
)
-- Exclude stores that already have an active task
AND (dispensary_id IS NULL OR dispensary_id NOT IN (
SELECT dispensary_id FROM worker_tasks
WHERE status IN ('claimed', 'running')
AND dispensary_id IS NOT NULL
))
ORDER BY priority DESC, created_at ASC
LIMIT 1
FOR UPDATE SKIP LOCKED
)
RETURNING * INTO claimed_task;
RETURN claimed_task;
END;
$$ LANGUAGE plpgsql;
-- ===================================================================
-- PART 4: Update v_active_workers view
-- ===================================================================
DROP VIEW IF EXISTS v_active_workers;
CREATE VIEW v_active_workers AS
SELECT
wr.id,
wr.worker_id,
wr.friendly_name,
wr.role,
wr.status,
wr.pod_name,
wr.hostname,
wr.started_at,
wr.last_heartbeat_at,
wr.last_task_at,
wr.tasks_completed,
wr.tasks_failed,
wr.current_task_id,
-- Preflight status
wr.preflight_curl_status,
wr.preflight_http_status,
wr.preflight_curl_at,
wr.preflight_http_at,
wr.preflight_curl_error,
wr.preflight_http_error,
wr.preflight_curl_ms,
wr.preflight_http_ms,
-- Computed fields
EXTRACT(EPOCH FROM (NOW() - wr.last_heartbeat_at)) as seconds_since_heartbeat,
CASE
WHEN wr.status = 'offline' THEN 'offline'
WHEN wr.last_heartbeat_at < NOW() - INTERVAL '2 minutes' THEN 'stale'
WHEN wr.current_task_id IS NOT NULL THEN 'busy'
ELSE 'ready'
END as health_status,
-- Capability flags (can this worker handle curl/http tasks?)
(wr.preflight_curl_status = 'passed') as can_curl,
(wr.preflight_http_status = 'passed') as can_http
FROM worker_registry wr
WHERE wr.status != 'terminated'
ORDER BY wr.status = 'active' DESC, wr.last_heartbeat_at DESC;
-- ===================================================================
-- PART 5: View for task queue with method info
-- ===================================================================
DROP VIEW IF EXISTS v_task_history;
CREATE VIEW v_task_history AS
SELECT
t.id,
t.role,
t.dispensary_id,
d.name as dispensary_name,
t.platform,
t.status,
t.priority,
t.method,
t.worker_id,
t.scheduled_for,
t.claimed_at,
t.started_at,
t.completed_at,
t.error_message,
t.retry_count,
t.created_at,
EXTRACT(EPOCH FROM (t.completed_at - t.started_at)) as duration_sec
FROM worker_tasks t
LEFT JOIN dispensaries d ON d.id = t.dispensary_id
ORDER BY t.created_at DESC;
-- ===================================================================
-- PART 6: Helper function to update worker preflight status
-- ===================================================================
CREATE OR REPLACE FUNCTION update_worker_preflight(
p_worker_id VARCHAR(100),
p_transport VARCHAR(10), -- 'curl' or 'http'
p_status VARCHAR(20), -- 'passed', 'failed', 'skipped'
p_response_ms INTEGER DEFAULT NULL,
p_error TEXT DEFAULT NULL
) RETURNS VOID AS $$
BEGIN
IF p_transport = 'curl' THEN
UPDATE worker_registry
SET
preflight_curl_status = p_status,
preflight_curl_at = NOW(),
preflight_curl_ms = p_response_ms,
preflight_curl_error = p_error,
updated_at = NOW()
WHERE worker_id = p_worker_id;
ELSIF p_transport = 'http' THEN
UPDATE worker_registry
SET
preflight_http_status = p_status,
preflight_http_at = NOW(),
preflight_http_ms = p_response_ms,
preflight_http_error = p_error,
updated_at = NOW()
WHERE worker_id = p_worker_id;
END IF;
END;
$$ LANGUAGE plpgsql;
-- ===================================================================
-- Comments
-- ===================================================================
COMMENT ON COLUMN worker_registry.preflight_curl_status IS 'Status of curl/axios preflight: pending, passed, failed, skipped';
COMMENT ON COLUMN worker_registry.preflight_http_status IS 'Status of http/Puppeteer preflight: pending, passed, failed, skipped';
COMMENT ON COLUMN worker_registry.preflight_curl_at IS 'When curl preflight completed';
COMMENT ON COLUMN worker_registry.preflight_http_at IS 'When http preflight completed';
COMMENT ON COLUMN worker_registry.preflight_curl_error IS 'Error message if curl preflight failed';
COMMENT ON COLUMN worker_registry.preflight_http_error IS 'Error message if http preflight failed';
COMMENT ON COLUMN worker_registry.preflight_curl_ms IS 'Response time of successful curl preflight (ms)';
COMMENT ON COLUMN worker_registry.preflight_http_ms IS 'Response time of successful http preflight (ms)';
COMMENT ON COLUMN worker_tasks.method IS 'Transport method required: NULL=any, curl=proxy-based, http=browser-based';
COMMENT ON FUNCTION claim_task IS 'Atomically claim a task, respecting method requirements and per-store locking';
COMMENT ON FUNCTION update_worker_preflight IS 'Update a workers preflight status for a given transport';

View File

@@ -0,0 +1,168 @@
-- Migration 085: Add IP and fingerprint columns for preflight reporting
-- These columns were missing from migration 084
-- ===================================================================
-- PART 1: Add IP address columns to worker_registry
-- ===================================================================
-- IP address detected during curl/axios preflight
ALTER TABLE worker_registry
ADD COLUMN IF NOT EXISTS curl_ip VARCHAR(45);
-- IP address detected during http/Puppeteer preflight
ALTER TABLE worker_registry
ADD COLUMN IF NOT EXISTS http_ip VARCHAR(45);
-- ===================================================================
-- PART 2: Add fingerprint data column
-- ===================================================================
-- Browser fingerprint data captured during Puppeteer preflight
ALTER TABLE worker_registry
ADD COLUMN IF NOT EXISTS fingerprint_data JSONB;
-- ===================================================================
-- PART 3: Add combined preflight status/timestamp for convenience
-- ===================================================================
-- Overall preflight status (computed from both transports)
-- Values: 'pending', 'passed', 'partial', 'failed'
-- - 'pending': neither transport tested
-- - 'passed': both transports passed (or http passed for browser-only)
-- - 'partial': at least one passed
-- - 'failed': no transport passed
ALTER TABLE worker_registry
ADD COLUMN IF NOT EXISTS preflight_status VARCHAR(20) DEFAULT 'pending';
-- Most recent preflight completion timestamp
ALTER TABLE worker_registry
ADD COLUMN IF NOT EXISTS preflight_at TIMESTAMPTZ;
-- ===================================================================
-- PART 4: Update function to set preflight status
-- ===================================================================
CREATE OR REPLACE FUNCTION update_worker_preflight(
p_worker_id VARCHAR(100),
p_transport VARCHAR(10), -- 'curl' or 'http'
p_status VARCHAR(20), -- 'passed', 'failed', 'skipped'
p_ip VARCHAR(45) DEFAULT NULL,
p_response_ms INTEGER DEFAULT NULL,
p_error TEXT DEFAULT NULL,
p_fingerprint JSONB DEFAULT NULL
) RETURNS VOID AS $$
DECLARE
v_curl_status VARCHAR(20);
v_http_status VARCHAR(20);
v_overall_status VARCHAR(20);
BEGIN
IF p_transport = 'curl' THEN
UPDATE worker_registry
SET
preflight_curl_status = p_status,
preflight_curl_at = NOW(),
preflight_curl_ms = p_response_ms,
preflight_curl_error = p_error,
curl_ip = p_ip,
updated_at = NOW()
WHERE worker_id = p_worker_id;
ELSIF p_transport = 'http' THEN
UPDATE worker_registry
SET
preflight_http_status = p_status,
preflight_http_at = NOW(),
preflight_http_ms = p_response_ms,
preflight_http_error = p_error,
http_ip = p_ip,
fingerprint_data = COALESCE(p_fingerprint, fingerprint_data),
updated_at = NOW()
WHERE worker_id = p_worker_id;
END IF;
-- Update overall preflight status
SELECT preflight_curl_status, preflight_http_status
INTO v_curl_status, v_http_status
FROM worker_registry
WHERE worker_id = p_worker_id;
-- Compute overall status
IF v_curl_status = 'passed' AND v_http_status = 'passed' THEN
v_overall_status := 'passed';
ELSIF v_curl_status = 'passed' OR v_http_status = 'passed' THEN
v_overall_status := 'partial';
ELSIF v_curl_status = 'failed' OR v_http_status = 'failed' THEN
v_overall_status := 'failed';
ELSE
v_overall_status := 'pending';
END IF;
UPDATE worker_registry
SET
preflight_status = v_overall_status,
preflight_at = NOW()
WHERE worker_id = p_worker_id;
END;
$$ LANGUAGE plpgsql;
-- ===================================================================
-- PART 5: Update v_active_workers view
-- ===================================================================
DROP VIEW IF EXISTS v_active_workers;
CREATE VIEW v_active_workers AS
SELECT
wr.id,
wr.worker_id,
wr.friendly_name,
wr.role,
wr.status,
wr.pod_name,
wr.hostname,
wr.started_at,
wr.last_heartbeat_at,
wr.last_task_at,
wr.tasks_completed,
wr.tasks_failed,
wr.current_task_id,
-- IP addresses from preflights
wr.curl_ip,
wr.http_ip,
-- Combined preflight status
wr.preflight_status,
wr.preflight_at,
-- Detailed preflight status per transport
wr.preflight_curl_status,
wr.preflight_http_status,
wr.preflight_curl_at,
wr.preflight_http_at,
wr.preflight_curl_error,
wr.preflight_http_error,
wr.preflight_curl_ms,
wr.preflight_http_ms,
-- Fingerprint data
wr.fingerprint_data,
-- Computed fields
EXTRACT(EPOCH FROM (NOW() - wr.last_heartbeat_at)) as seconds_since_heartbeat,
CASE
WHEN wr.status = 'offline' THEN 'offline'
WHEN wr.last_heartbeat_at < NOW() - INTERVAL '2 minutes' THEN 'stale'
WHEN wr.current_task_id IS NOT NULL THEN 'busy'
ELSE 'ready'
END as health_status,
-- Capability flags (can this worker handle curl/http tasks?)
(wr.preflight_curl_status = 'passed') as can_curl,
(wr.preflight_http_status = 'passed') as can_http
FROM worker_registry wr
WHERE wr.status != 'terminated'
ORDER BY wr.status = 'active' DESC, wr.last_heartbeat_at DESC;
-- ===================================================================
-- Comments
-- ===================================================================
COMMENT ON COLUMN worker_registry.curl_ip IS 'IP address detected during curl/axios preflight';
COMMENT ON COLUMN worker_registry.http_ip IS 'IP address detected during Puppeteer preflight';
COMMENT ON COLUMN worker_registry.fingerprint_data IS 'Browser fingerprint captured during Puppeteer preflight';
COMMENT ON COLUMN worker_registry.preflight_status IS 'Overall preflight status: pending, passed, partial, failed';
COMMENT ON COLUMN worker_registry.preflight_at IS 'Most recent preflight completion timestamp';

View File

@@ -1,6 +1,6 @@
{
"name": "dutchie-menus-backend",
"version": "1.5.1",
"version": "1.6.0",
"lockfileVersion": 3,
"requires": true,
"packages": {
@@ -46,6 +46,97 @@
"resolved": "https://registry.npmjs.org/@ioredis/commands/-/commands-1.4.0.tgz",
"integrity": "sha512-aFT2yemJJo+TZCmieA7qnYGQooOS7QfNmYrzGtsYd3g9j5iDP8AimYYAesf79ohjbLG12XxC4nG5DyEnC88AsQ=="
},
"node_modules/@jsep-plugin/assignment": {
"version": "1.3.0",
"resolved": "https://registry.npmjs.org/@jsep-plugin/assignment/-/assignment-1.3.0.tgz",
"integrity": "sha512-VVgV+CXrhbMI3aSusQyclHkenWSAm95WaiKrMxRFam3JSUiIaQjoMIw2sEs/OX4XifnqeQUN4DYbJjlA8EfktQ==",
"engines": {
"node": ">= 10.16.0"
},
"peerDependencies": {
"jsep": "^0.4.0||^1.0.0"
}
},
"node_modules/@jsep-plugin/regex": {
"version": "1.0.4",
"resolved": "https://registry.npmjs.org/@jsep-plugin/regex/-/regex-1.0.4.tgz",
"integrity": "sha512-q7qL4Mgjs1vByCaTnDFcBnV9HS7GVPJX5vyVoCgZHNSC9rjwIlmbXG5sUuorR5ndfHAIlJ8pVStxvjXHbNvtUg==",
"engines": {
"node": ">= 10.16.0"
},
"peerDependencies": {
"jsep": "^0.4.0||^1.0.0"
}
},
"node_modules/@kubernetes/client-node": {
"version": "1.4.0",
"resolved": "https://registry.npmjs.org/@kubernetes/client-node/-/client-node-1.4.0.tgz",
"integrity": "sha512-Zge3YvF7DJi264dU1b3wb/GmzR99JhUpqTvp+VGHfwZT+g7EOOYNScDJNZwXy9cszyIGPIs0VHr+kk8e95qqrA==",
"dependencies": {
"@types/js-yaml": "^4.0.1",
"@types/node": "^24.0.0",
"@types/node-fetch": "^2.6.13",
"@types/stream-buffers": "^3.0.3",
"form-data": "^4.0.0",
"hpagent": "^1.2.0",
"isomorphic-ws": "^5.0.0",
"js-yaml": "^4.1.0",
"jsonpath-plus": "^10.3.0",
"node-fetch": "^2.7.0",
"openid-client": "^6.1.3",
"rfc4648": "^1.3.0",
"socks-proxy-agent": "^8.0.4",
"stream-buffers": "^3.0.2",
"tar-fs": "^3.0.9",
"ws": "^8.18.2"
}
},
"node_modules/@kubernetes/client-node/node_modules/@types/node": {
"version": "24.10.3",
"resolved": "https://registry.npmjs.org/@types/node/-/node-24.10.3.tgz",
"integrity": "sha512-gqkrWUsS8hcm0r44yn7/xZeV1ERva/nLgrLxFRUGb7aoNMIJfZJ3AC261zDQuOAKC7MiXai1WCpYc48jAHoShQ==",
"dependencies": {
"undici-types": "~7.16.0"
}
},
"node_modules/@kubernetes/client-node/node_modules/tar-fs": {
"version": "3.1.1",
"resolved": "https://registry.npmjs.org/tar-fs/-/tar-fs-3.1.1.tgz",
"integrity": "sha512-LZA0oaPOc2fVo82Txf3gw+AkEd38szODlptMYejQUhndHMLQ9M059uXR+AfS7DNo0NpINvSqDsvyaCrBVkptWg==",
"dependencies": {
"pump": "^3.0.0",
"tar-stream": "^3.1.5"
},
"optionalDependencies": {
"bare-fs": "^4.0.1",
"bare-path": "^3.0.0"
}
},
"node_modules/@kubernetes/client-node/node_modules/undici-types": {
"version": "7.16.0",
"resolved": "https://registry.npmjs.org/undici-types/-/undici-types-7.16.0.tgz",
"integrity": "sha512-Zz+aZWSj8LE6zoxD+xrjh4VfkIG8Ya6LvYkZqtUQGJPZjYl53ypCaUwWqo7eI0x66KBGeRo+mlBEkMSeSZ38Nw=="
},
"node_modules/@kubernetes/client-node/node_modules/ws": {
"version": "8.18.3",
"resolved": "https://registry.npmjs.org/ws/-/ws-8.18.3.tgz",
"integrity": "sha512-PEIGCY5tSlUt50cqyMXfCzX+oOPqN0vuGqWzbcJ2xvnkzkq46oOpz7dQaTDBdfICb4N14+GARUDw2XV2N4tvzg==",
"engines": {
"node": ">=10.0.0"
},
"peerDependencies": {
"bufferutil": "^4.0.1",
"utf-8-validate": ">=5.0.2"
},
"peerDependenciesMeta": {
"bufferutil": {
"optional": true
},
"utf-8-validate": {
"optional": true
}
}
},
"node_modules/@mapbox/node-pre-gyp": {
"version": "1.0.11",
"resolved": "https://registry.npmjs.org/@mapbox/node-pre-gyp/-/node-pre-gyp-1.0.11.tgz",
@@ -251,6 +342,11 @@
"integrity": "sha512-r8Tayk8HJnX0FztbZN7oVqGccWgw98T/0neJphO91KkmOzug1KkofZURD4UaD5uH8AqcFLfdPErnBod0u71/qg==",
"dev": true
},
"node_modules/@types/js-yaml": {
"version": "4.0.9",
"resolved": "https://registry.npmjs.org/@types/js-yaml/-/js-yaml-4.0.9.tgz",
"integrity": "sha512-k4MGaQl5TGo/iipqb2UDG2UwjXziSWkh0uysQelTlJpX1qGlpUZYm8PnO4DxG1qBomtJUdYJ6qR6xdIah10JLg=="
},
"node_modules/@types/jsonwebtoken": {
"version": "9.0.10",
"resolved": "https://registry.npmjs.org/@types/jsonwebtoken/-/jsonwebtoken-9.0.10.tgz",
@@ -276,7 +372,6 @@
"version": "20.19.25",
"resolved": "https://registry.npmjs.org/@types/node/-/node-20.19.25.tgz",
"integrity": "sha512-ZsJzA5thDQMSQO788d7IocwwQbI8B5OPzmqNvpf3NY/+MHDAS759Wo0gd2WQeXYt5AAAQjzcrTVC6SKCuYgoCQ==",
"devOptional": true,
"dependencies": {
"undici-types": "~6.21.0"
}
@@ -287,6 +382,15 @@
"integrity": "sha512-0ikrnug3/IyneSHqCBeslAhlK2aBfYek1fGo4bP4QnZPmiqSGRK+Oy7ZMisLWkesffJvQ1cqAcBnJC+8+nxIAg==",
"dev": true
},
"node_modules/@types/node-fetch": {
"version": "2.6.13",
"resolved": "https://registry.npmjs.org/@types/node-fetch/-/node-fetch-2.6.13.tgz",
"integrity": "sha512-QGpRVpzSaUs30JBSGPjOg4Uveu384erbHBoT1zeONvyCfwQxIkUshLAOqN/k9EjGviPRmWTTe6aH2qySWKTVSw==",
"dependencies": {
"@types/node": "*",
"form-data": "^4.0.4"
}
},
"node_modules/@types/pg": {
"version": "8.15.6",
"resolved": "https://registry.npmjs.org/@types/pg/-/pg-8.15.6.tgz",
@@ -340,6 +444,14 @@
"@types/node": "*"
}
},
"node_modules/@types/stream-buffers": {
"version": "3.0.8",
"resolved": "https://registry.npmjs.org/@types/stream-buffers/-/stream-buffers-3.0.8.tgz",
"integrity": "sha512-J+7VaHKNvlNPJPEJXX/fKa9DZtR/xPMwuIbe+yNOwp1YB+ApUOBv2aUpEoBJEi8nJgbgs1x8e73ttg0r1rSUdw==",
"dependencies": {
"@types/node": "*"
}
},
"node_modules/@types/uuid": {
"version": "9.0.8",
"resolved": "https://registry.npmjs.org/@types/uuid/-/uuid-9.0.8.tgz",
@@ -520,6 +632,78 @@
}
}
},
"node_modules/bare-fs": {
"version": "4.5.2",
"resolved": "https://registry.npmjs.org/bare-fs/-/bare-fs-4.5.2.tgz",
"integrity": "sha512-veTnRzkb6aPHOvSKIOy60KzURfBdUflr5VReI+NSaPL6xf+XLdONQgZgpYvUuZLVQ8dCqxpBAudaOM1+KpAUxw==",
"optional": true,
"dependencies": {
"bare-events": "^2.5.4",
"bare-path": "^3.0.0",
"bare-stream": "^2.6.4",
"bare-url": "^2.2.2",
"fast-fifo": "^1.3.2"
},
"engines": {
"bare": ">=1.16.0"
},
"peerDependencies": {
"bare-buffer": "*"
},
"peerDependenciesMeta": {
"bare-buffer": {
"optional": true
}
}
},
"node_modules/bare-os": {
"version": "3.6.2",
"resolved": "https://registry.npmjs.org/bare-os/-/bare-os-3.6.2.tgz",
"integrity": "sha512-T+V1+1srU2qYNBmJCXZkUY5vQ0B4FSlL3QDROnKQYOqeiQR8UbjNHlPa+TIbM4cuidiN9GaTaOZgSEgsvPbh5A==",
"optional": true,
"engines": {
"bare": ">=1.14.0"
}
},
"node_modules/bare-path": {
"version": "3.0.0",
"resolved": "https://registry.npmjs.org/bare-path/-/bare-path-3.0.0.tgz",
"integrity": "sha512-tyfW2cQcB5NN8Saijrhqn0Zh7AnFNsnczRcuWODH0eYAXBsJ5gVxAUuNr7tsHSC6IZ77cA0SitzT+s47kot8Mw==",
"optional": true,
"dependencies": {
"bare-os": "^3.0.1"
}
},
"node_modules/bare-stream": {
"version": "2.7.0",
"resolved": "https://registry.npmjs.org/bare-stream/-/bare-stream-2.7.0.tgz",
"integrity": "sha512-oyXQNicV1y8nc2aKffH+BUHFRXmx6VrPzlnaEvMhram0nPBrKcEdcyBg5r08D0i8VxngHFAiVyn1QKXpSG0B8A==",
"optional": true,
"dependencies": {
"streamx": "^2.21.0"
},
"peerDependencies": {
"bare-buffer": "*",
"bare-events": "*"
},
"peerDependenciesMeta": {
"bare-buffer": {
"optional": true
},
"bare-events": {
"optional": true
}
}
},
"node_modules/bare-url": {
"version": "2.3.2",
"resolved": "https://registry.npmjs.org/bare-url/-/bare-url-2.3.2.tgz",
"integrity": "sha512-ZMq4gd9ngV5aTMa5p9+UfY0b3skwhHELaDkhEHetMdX0LRkW9kzaym4oo/Eh+Ghm0CCDuMTsRIGM/ytUc1ZYmw==",
"optional": true,
"dependencies": {
"bare-path": "^3.0.0"
}
},
"node_modules/base64-js": {
"version": "1.5.1",
"resolved": "https://registry.npmjs.org/base64-js/-/base64-js-1.5.1.tgz",
@@ -1026,6 +1210,17 @@
"url": "https://github.com/sponsors/fb55"
}
},
"node_modules/csv-parser": {
"version": "3.2.0",
"resolved": "https://registry.npmjs.org/csv-parser/-/csv-parser-3.2.0.tgz",
"integrity": "sha512-fgKbp+AJbn1h2dcAHKIdKNSSjfp43BZZykXsCjzALjKy80VXQNHPFJ6T9Afwdzoj24aMkq8GwDS7KGcDPpejrA==",
"bin": {
"csv-parser": "bin/csv-parser"
},
"engines": {
"node": ">= 10"
}
},
"node_modules/data-uri-to-buffer": {
"version": "6.0.2",
"resolved": "https://registry.npmjs.org/data-uri-to-buffer/-/data-uri-to-buffer-6.0.2.tgz",
@@ -2008,6 +2203,14 @@
"node": ">=16.0.0"
}
},
"node_modules/hpagent": {
"version": "1.2.0",
"resolved": "https://registry.npmjs.org/hpagent/-/hpagent-1.2.0.tgz",
"integrity": "sha512-A91dYTeIB6NoXG+PxTQpCCDDnfHsW9kc06Lvpu1TEe9gnd6ZFeiBoRO9JvzEv6xK7EX97/dUE8g/vBMTqTS3CA==",
"engines": {
"node": ">=14"
}
},
"node_modules/htmlparser2": {
"version": "10.0.0",
"resolved": "https://registry.npmjs.org/htmlparser2/-/htmlparser2-10.0.0.tgz",
@@ -2235,6 +2438,14 @@
"node": ">= 12"
}
},
"node_modules/ip2location-nodejs": {
"version": "9.7.0",
"resolved": "https://registry.npmjs.org/ip2location-nodejs/-/ip2location-nodejs-9.7.0.tgz",
"integrity": "sha512-eQ4T5TXm1cx0+pQcRycPiuaiRuoDEMd9O89Be7Ugk555qi9UY9enXSznkkqr3kQRyUaXx7zj5dORC5LGTPOttA==",
"dependencies": {
"csv-parser": "^3.0.0"
}
},
"node_modules/ipaddr.js": {
"version": "2.2.0",
"resolved": "https://registry.npmjs.org/ipaddr.js/-/ipaddr.js-2.2.0.tgz",
@@ -2363,6 +2574,22 @@
"node": ">=0.10.0"
}
},
"node_modules/isomorphic-ws": {
"version": "5.0.0",
"resolved": "https://registry.npmjs.org/isomorphic-ws/-/isomorphic-ws-5.0.0.tgz",
"integrity": "sha512-muId7Zzn9ywDsyXgTIafTry2sV3nySZeUDe6YedVd1Hvuuep5AsIlqK+XefWpYTyJG5e503F2xIuT2lcU6rCSw==",
"peerDependencies": {
"ws": "*"
}
},
"node_modules/jose": {
"version": "6.1.3",
"resolved": "https://registry.npmjs.org/jose/-/jose-6.1.3.tgz",
"integrity": "sha512-0TpaTfihd4QMNwrz/ob2Bp7X04yuxJkjRGi4aKmOqwhov54i6u79oCv7T+C7lo70MKH6BesI3vscD1yb/yzKXQ==",
"funding": {
"url": "https://github.com/sponsors/panva"
}
},
"node_modules/js-tokens": {
"version": "4.0.0",
"resolved": "https://registry.npmjs.org/js-tokens/-/js-tokens-4.0.0.tgz",
@@ -2379,6 +2606,14 @@
"js-yaml": "bin/js-yaml.js"
}
},
"node_modules/jsep": {
"version": "1.4.0",
"resolved": "https://registry.npmjs.org/jsep/-/jsep-1.4.0.tgz",
"integrity": "sha512-B7qPcEVE3NVkmSJbaYxvv4cHkVW7DQsZz13pUMrfS8z8Q/BuShN+gcTXrUlPiGqM2/t/EEaI030bpxMqY8gMlw==",
"engines": {
"node": ">= 10.16.0"
}
},
"node_modules/json-parse-even-better-errors": {
"version": "2.3.1",
"resolved": "https://registry.npmjs.org/json-parse-even-better-errors/-/json-parse-even-better-errors-2.3.1.tgz",
@@ -2400,6 +2635,23 @@
"graceful-fs": "^4.1.6"
}
},
"node_modules/jsonpath-plus": {
"version": "10.3.0",
"resolved": "https://registry.npmjs.org/jsonpath-plus/-/jsonpath-plus-10.3.0.tgz",
"integrity": "sha512-8TNmfeTCk2Le33A3vRRwtuworG/L5RrgMvdjhKZxvyShO+mBu2fP50OWUjRLNtvw344DdDarFh9buFAZs5ujeA==",
"dependencies": {
"@jsep-plugin/assignment": "^1.3.0",
"@jsep-plugin/regex": "^1.0.4",
"jsep": "^1.4.0"
},
"bin": {
"jsonpath": "bin/jsonpath-cli.js",
"jsonpath-plus": "bin/jsonpath-cli.js"
},
"engines": {
"node": ">=18.0.0"
}
},
"node_modules/jsonwebtoken": {
"version": "9.0.2",
"resolved": "https://registry.npmjs.org/jsonwebtoken/-/jsonwebtoken-9.0.2.tgz",
@@ -2474,6 +2726,11 @@
"resolved": "https://registry.npmjs.org/lodash/-/lodash-4.17.21.tgz",
"integrity": "sha512-v2kDEe57lecTulaDIuNTPy3Ry4gLGJ6Z1O3vE1krgXZNrsQ+LFTGHVxVjcXPs17LhbZVGedAJv8XZ1tvj5FvSg=="
},
"node_modules/lodash.clonedeep": {
"version": "4.5.0",
"resolved": "https://registry.npmjs.org/lodash.clonedeep/-/lodash.clonedeep-4.5.0.tgz",
"integrity": "sha512-H5ZhCF25riFd9uB5UCkVKo61m3S/xZk1x4wA6yp/L3RFP6Z/eHH1ymQcGLo7J3GMPfm0V/7m1tryHuGVxpqEBQ=="
},
"node_modules/lodash.defaults": {
"version": "4.2.0",
"resolved": "https://registry.npmjs.org/lodash.defaults/-/lodash.defaults-4.2.0.tgz",
@@ -2923,6 +3180,14 @@
"url": "https://github.com/fb55/nth-check?sponsor=1"
}
},
"node_modules/oauth4webapi": {
"version": "3.8.3",
"resolved": "https://registry.npmjs.org/oauth4webapi/-/oauth4webapi-3.8.3.tgz",
"integrity": "sha512-pQ5BsX3QRTgnt5HxgHwgunIRaDXBdkT23tf8dfzmtTIL2LTpdmxgbpbBm0VgFWAIDlezQvQCTgnVIUmHupXHxw==",
"funding": {
"url": "https://github.com/sponsors/panva"
}
},
"node_modules/object-assign": {
"version": "4.1.1",
"resolved": "https://registry.npmjs.org/object-assign/-/object-assign-4.1.1.tgz",
@@ -2961,6 +3226,18 @@
"wrappy": "1"
}
},
"node_modules/openid-client": {
"version": "6.8.1",
"resolved": "https://registry.npmjs.org/openid-client/-/openid-client-6.8.1.tgz",
"integrity": "sha512-VoYT6enBo6Vj2j3Q5Ec0AezS+9YGzQo1f5Xc42lreMGlfP4ljiXPKVDvCADh+XHCV/bqPu/wWSiCVXbJKvrODw==",
"dependencies": {
"jose": "^6.1.0",
"oauth4webapi": "^3.8.2"
},
"funding": {
"url": "https://github.com/sponsors/panva"
}
},
"node_modules/pac-proxy-agent": {
"version": "7.2.0",
"resolved": "https://registry.npmjs.org/pac-proxy-agent/-/pac-proxy-agent-7.2.0.tgz",
@@ -3864,6 +4141,11 @@
"url": "https://github.com/privatenumber/resolve-pkg-maps?sponsor=1"
}
},
"node_modules/rfc4648": {
"version": "1.5.4",
"resolved": "https://registry.npmjs.org/rfc4648/-/rfc4648-1.5.4.tgz",
"integrity": "sha512-rRg/6Lb+IGfJqO05HZkN50UtY7K/JhxJag1kP23+zyMfrvoB0B7RWv06MbOzoc79RgCdNTiUaNsTT1AJZ7Z+cg=="
},
"node_modules/rimraf": {
"version": "3.0.2",
"resolved": "https://registry.npmjs.org/rimraf/-/rimraf-3.0.2.tgz",
@@ -4294,6 +4576,14 @@
"node": ">= 0.8"
}
},
"node_modules/stream-buffers": {
"version": "3.0.3",
"resolved": "https://registry.npmjs.org/stream-buffers/-/stream-buffers-3.0.3.tgz",
"integrity": "sha512-pqMqwQCso0PBJt2PQmDO0cFj0lyqmiwOMiMSkVtRokl7e+ZTRYgDHKnuZNbqjiJXgsg4nuqtD/zxuo9KqTp0Yw==",
"engines": {
"node": ">= 0.10.0"
}
},
"node_modules/streamx": {
"version": "2.23.0",
"resolved": "https://registry.npmjs.org/streamx/-/streamx-2.23.0.tgz",
@@ -4513,8 +4803,7 @@
"node_modules/undici-types": {
"version": "6.21.0",
"resolved": "https://registry.npmjs.org/undici-types/-/undici-types-6.21.0.tgz",
"integrity": "sha512-iwDZqg0QAGrg9Rav5H4n0M64c3mkR59cJ6wQp+7C4nI0gsmExaedaYLNO44eT4AtBBwjbTiGPMlt2Md0T9H9JQ==",
"devOptional": true
"integrity": "sha512-iwDZqg0QAGrg9Rav5H4n0M64c3mkR59cJ6wQp+7C4nI0gsmExaedaYLNO44eT4AtBBwjbTiGPMlt2Md0T9H9JQ=="
},
"node_modules/universalify": {
"version": "2.0.1",
@@ -4537,6 +4826,14 @@
"resolved": "https://registry.npmjs.org/urlpattern-polyfill/-/urlpattern-polyfill-10.0.0.tgz",
"integrity": "sha512-H/A06tKD7sS1O1X2SshBVeA5FLycRpjqiBeqGKmBwBDBy28EnRjORxTNe269KSSr5un5qyWi1iL61wLxpd+ZOg=="
},
"node_modules/user-agents": {
"version": "1.1.669",
"resolved": "https://registry.npmjs.org/user-agents/-/user-agents-1.1.669.tgz",
"integrity": "sha512-pbIzG+AOqCaIpySKJ4IAm1l0VyE4jMnK4y1thV8lm8PYxI+7X5uWcppOK7zY79TCKKTAnJH3/4gaVIZHsjrmJA==",
"dependencies": {
"lodash.clonedeep": "^4.5.0"
}
},
"node_modules/util": {
"version": "0.12.5",
"resolved": "https://registry.npmjs.org/util/-/util-0.12.5.tgz",

View File

@@ -1,13 +1,14 @@
{
"name": "dutchie-menus-backend",
"version": "1.5.1",
"version": "1.6.0",
"lockfileVersion": 3,
"requires": true,
"packages": {
"": {
"name": "dutchie-menus-backend",
"version": "1.5.1",
"version": "1.6.0",
"dependencies": {
"@kubernetes/client-node": "^1.4.0",
"@types/bcryptjs": "^3.0.0",
"axios": "^1.6.2",
"bcrypt": "^5.1.1",
@@ -21,6 +22,7 @@
"helmet": "^7.1.0",
"https-proxy-agent": "^7.0.2",
"ioredis": "^5.8.2",
"ip2location-nodejs": "^9.7.0",
"ipaddr.js": "^2.2.0",
"jsonwebtoken": "^9.0.2",
"minio": "^7.1.3",
@@ -33,6 +35,7 @@
"puppeteer-extra-plugin-stealth": "^2.11.2",
"sharp": "^0.32.0",
"socks-proxy-agent": "^8.0.2",
"user-agents": "^1.1.669",
"uuid": "^9.0.1",
"zod": "^3.22.4"
},
@@ -491,6 +494,97 @@
"resolved": "https://registry.npmjs.org/@ioredis/commands/-/commands-1.4.0.tgz",
"integrity": "sha512-aFT2yemJJo+TZCmieA7qnYGQooOS7QfNmYrzGtsYd3g9j5iDP8AimYYAesf79ohjbLG12XxC4nG5DyEnC88AsQ=="
},
"node_modules/@jsep-plugin/assignment": {
"version": "1.3.0",
"resolved": "https://registry.npmjs.org/@jsep-plugin/assignment/-/assignment-1.3.0.tgz",
"integrity": "sha512-VVgV+CXrhbMI3aSusQyclHkenWSAm95WaiKrMxRFam3JSUiIaQjoMIw2sEs/OX4XifnqeQUN4DYbJjlA8EfktQ==",
"engines": {
"node": ">= 10.16.0"
},
"peerDependencies": {
"jsep": "^0.4.0||^1.0.0"
}
},
"node_modules/@jsep-plugin/regex": {
"version": "1.0.4",
"resolved": "https://registry.npmjs.org/@jsep-plugin/regex/-/regex-1.0.4.tgz",
"integrity": "sha512-q7qL4Mgjs1vByCaTnDFcBnV9HS7GVPJX5vyVoCgZHNSC9rjwIlmbXG5sUuorR5ndfHAIlJ8pVStxvjXHbNvtUg==",
"engines": {
"node": ">= 10.16.0"
},
"peerDependencies": {
"jsep": "^0.4.0||^1.0.0"
}
},
"node_modules/@kubernetes/client-node": {
"version": "1.4.0",
"resolved": "https://registry.npmjs.org/@kubernetes/client-node/-/client-node-1.4.0.tgz",
"integrity": "sha512-Zge3YvF7DJi264dU1b3wb/GmzR99JhUpqTvp+VGHfwZT+g7EOOYNScDJNZwXy9cszyIGPIs0VHr+kk8e95qqrA==",
"dependencies": {
"@types/js-yaml": "^4.0.1",
"@types/node": "^24.0.0",
"@types/node-fetch": "^2.6.13",
"@types/stream-buffers": "^3.0.3",
"form-data": "^4.0.0",
"hpagent": "^1.2.0",
"isomorphic-ws": "^5.0.0",
"js-yaml": "^4.1.0",
"jsonpath-plus": "^10.3.0",
"node-fetch": "^2.7.0",
"openid-client": "^6.1.3",
"rfc4648": "^1.3.0",
"socks-proxy-agent": "^8.0.4",
"stream-buffers": "^3.0.2",
"tar-fs": "^3.0.9",
"ws": "^8.18.2"
}
},
"node_modules/@kubernetes/client-node/node_modules/@types/node": {
"version": "24.10.3",
"resolved": "https://registry.npmjs.org/@types/node/-/node-24.10.3.tgz",
"integrity": "sha512-gqkrWUsS8hcm0r44yn7/xZeV1ERva/nLgrLxFRUGb7aoNMIJfZJ3AC261zDQuOAKC7MiXai1WCpYc48jAHoShQ==",
"dependencies": {
"undici-types": "~7.16.0"
}
},
"node_modules/@kubernetes/client-node/node_modules/tar-fs": {
"version": "3.1.1",
"resolved": "https://registry.npmjs.org/tar-fs/-/tar-fs-3.1.1.tgz",
"integrity": "sha512-LZA0oaPOc2fVo82Txf3gw+AkEd38szODlptMYejQUhndHMLQ9M059uXR+AfS7DNo0NpINvSqDsvyaCrBVkptWg==",
"dependencies": {
"pump": "^3.0.0",
"tar-stream": "^3.1.5"
},
"optionalDependencies": {
"bare-fs": "^4.0.1",
"bare-path": "^3.0.0"
}
},
"node_modules/@kubernetes/client-node/node_modules/undici-types": {
"version": "7.16.0",
"resolved": "https://registry.npmjs.org/undici-types/-/undici-types-7.16.0.tgz",
"integrity": "sha512-Zz+aZWSj8LE6zoxD+xrjh4VfkIG8Ya6LvYkZqtUQGJPZjYl53ypCaUwWqo7eI0x66KBGeRo+mlBEkMSeSZ38Nw=="
},
"node_modules/@kubernetes/client-node/node_modules/ws": {
"version": "8.18.3",
"resolved": "https://registry.npmjs.org/ws/-/ws-8.18.3.tgz",
"integrity": "sha512-PEIGCY5tSlUt50cqyMXfCzX+oOPqN0vuGqWzbcJ2xvnkzkq46oOpz7dQaTDBdfICb4N14+GARUDw2XV2N4tvzg==",
"engines": {
"node": ">=10.0.0"
},
"peerDependencies": {
"bufferutil": "^4.0.1",
"utf-8-validate": ">=5.0.2"
},
"peerDependenciesMeta": {
"bufferutil": {
"optional": true
},
"utf-8-validate": {
"optional": true
}
}
},
"node_modules/@mapbox/node-pre-gyp": {
"version": "1.0.11",
"resolved": "https://registry.npmjs.org/@mapbox/node-pre-gyp/-/node-pre-gyp-1.0.11.tgz",
@@ -756,6 +850,11 @@
"integrity": "sha512-r8Tayk8HJnX0FztbZN7oVqGccWgw98T/0neJphO91KkmOzug1KkofZURD4UaD5uH8AqcFLfdPErnBod0u71/qg==",
"dev": true
},
"node_modules/@types/js-yaml": {
"version": "4.0.9",
"resolved": "https://registry.npmjs.org/@types/js-yaml/-/js-yaml-4.0.9.tgz",
"integrity": "sha512-k4MGaQl5TGo/iipqb2UDG2UwjXziSWkh0uysQelTlJpX1qGlpUZYm8PnO4DxG1qBomtJUdYJ6qR6xdIah10JLg=="
},
"node_modules/@types/jsonwebtoken": {
"version": "9.0.10",
"resolved": "https://registry.npmjs.org/@types/jsonwebtoken/-/jsonwebtoken-9.0.10.tgz",
@@ -781,7 +880,6 @@
"version": "20.19.25",
"resolved": "https://registry.npmjs.org/@types/node/-/node-20.19.25.tgz",
"integrity": "sha512-ZsJzA5thDQMSQO788d7IocwwQbI8B5OPzmqNvpf3NY/+MHDAS759Wo0gd2WQeXYt5AAAQjzcrTVC6SKCuYgoCQ==",
"devOptional": true,
"dependencies": {
"undici-types": "~6.21.0"
}
@@ -792,6 +890,15 @@
"integrity": "sha512-0ikrnug3/IyneSHqCBeslAhlK2aBfYek1fGo4bP4QnZPmiqSGRK+Oy7ZMisLWkesffJvQ1cqAcBnJC+8+nxIAg==",
"dev": true
},
"node_modules/@types/node-fetch": {
"version": "2.6.13",
"resolved": "https://registry.npmjs.org/@types/node-fetch/-/node-fetch-2.6.13.tgz",
"integrity": "sha512-QGpRVpzSaUs30JBSGPjOg4Uveu384erbHBoT1zeONvyCfwQxIkUshLAOqN/k9EjGviPRmWTTe6aH2qySWKTVSw==",
"dependencies": {
"@types/node": "*",
"form-data": "^4.0.4"
}
},
"node_modules/@types/pg": {
"version": "8.15.6",
"resolved": "https://registry.npmjs.org/@types/pg/-/pg-8.15.6.tgz",
@@ -845,6 +952,14 @@
"@types/node": "*"
}
},
"node_modules/@types/stream-buffers": {
"version": "3.0.8",
"resolved": "https://registry.npmjs.org/@types/stream-buffers/-/stream-buffers-3.0.8.tgz",
"integrity": "sha512-J+7VaHKNvlNPJPEJXX/fKa9DZtR/xPMwuIbe+yNOwp1YB+ApUOBv2aUpEoBJEi8nJgbgs1x8e73ttg0r1rSUdw==",
"dependencies": {
"@types/node": "*"
}
},
"node_modules/@types/uuid": {
"version": "9.0.8",
"resolved": "https://registry.npmjs.org/@types/uuid/-/uuid-9.0.8.tgz",
@@ -1025,6 +1140,78 @@
}
}
},
"node_modules/bare-fs": {
"version": "4.5.2",
"resolved": "https://registry.npmjs.org/bare-fs/-/bare-fs-4.5.2.tgz",
"integrity": "sha512-veTnRzkb6aPHOvSKIOy60KzURfBdUflr5VReI+NSaPL6xf+XLdONQgZgpYvUuZLVQ8dCqxpBAudaOM1+KpAUxw==",
"optional": true,
"dependencies": {
"bare-events": "^2.5.4",
"bare-path": "^3.0.0",
"bare-stream": "^2.6.4",
"bare-url": "^2.2.2",
"fast-fifo": "^1.3.2"
},
"engines": {
"bare": ">=1.16.0"
},
"peerDependencies": {
"bare-buffer": "*"
},
"peerDependenciesMeta": {
"bare-buffer": {
"optional": true
}
}
},
"node_modules/bare-os": {
"version": "3.6.2",
"resolved": "https://registry.npmjs.org/bare-os/-/bare-os-3.6.2.tgz",
"integrity": "sha512-T+V1+1srU2qYNBmJCXZkUY5vQ0B4FSlL3QDROnKQYOqeiQR8UbjNHlPa+TIbM4cuidiN9GaTaOZgSEgsvPbh5A==",
"optional": true,
"engines": {
"bare": ">=1.14.0"
}
},
"node_modules/bare-path": {
"version": "3.0.0",
"resolved": "https://registry.npmjs.org/bare-path/-/bare-path-3.0.0.tgz",
"integrity": "sha512-tyfW2cQcB5NN8Saijrhqn0Zh7AnFNsnczRcuWODH0eYAXBsJ5gVxAUuNr7tsHSC6IZ77cA0SitzT+s47kot8Mw==",
"optional": true,
"dependencies": {
"bare-os": "^3.0.1"
}
},
"node_modules/bare-stream": {
"version": "2.7.0",
"resolved": "https://registry.npmjs.org/bare-stream/-/bare-stream-2.7.0.tgz",
"integrity": "sha512-oyXQNicV1y8nc2aKffH+BUHFRXmx6VrPzlnaEvMhram0nPBrKcEdcyBg5r08D0i8VxngHFAiVyn1QKXpSG0B8A==",
"optional": true,
"dependencies": {
"streamx": "^2.21.0"
},
"peerDependencies": {
"bare-buffer": "*",
"bare-events": "*"
},
"peerDependenciesMeta": {
"bare-buffer": {
"optional": true
},
"bare-events": {
"optional": true
}
}
},
"node_modules/bare-url": {
"version": "2.3.2",
"resolved": "https://registry.npmjs.org/bare-url/-/bare-url-2.3.2.tgz",
"integrity": "sha512-ZMq4gd9ngV5aTMa5p9+UfY0b3skwhHELaDkhEHetMdX0LRkW9kzaym4oo/Eh+Ghm0CCDuMTsRIGM/ytUc1ZYmw==",
"optional": true,
"dependencies": {
"bare-path": "^3.0.0"
}
},
"node_modules/base64-js": {
"version": "1.5.1",
"resolved": "https://registry.npmjs.org/base64-js/-/base64-js-1.5.1.tgz",
@@ -1531,6 +1718,17 @@
"url": "https://github.com/sponsors/fb55"
}
},
"node_modules/csv-parser": {
"version": "3.2.0",
"resolved": "https://registry.npmjs.org/csv-parser/-/csv-parser-3.2.0.tgz",
"integrity": "sha512-fgKbp+AJbn1h2dcAHKIdKNSSjfp43BZZykXsCjzALjKy80VXQNHPFJ6T9Afwdzoj24aMkq8GwDS7KGcDPpejrA==",
"bin": {
"csv-parser": "bin/csv-parser"
},
"engines": {
"node": ">= 10"
}
},
"node_modules/data-uri-to-buffer": {
"version": "6.0.2",
"resolved": "https://registry.npmjs.org/data-uri-to-buffer/-/data-uri-to-buffer-6.0.2.tgz",
@@ -2527,6 +2725,14 @@
"node": ">=16.0.0"
}
},
"node_modules/hpagent": {
"version": "1.2.0",
"resolved": "https://registry.npmjs.org/hpagent/-/hpagent-1.2.0.tgz",
"integrity": "sha512-A91dYTeIB6NoXG+PxTQpCCDDnfHsW9kc06Lvpu1TEe9gnd6ZFeiBoRO9JvzEv6xK7EX97/dUE8g/vBMTqTS3CA==",
"engines": {
"node": ">=14"
}
},
"node_modules/htmlparser2": {
"version": "10.0.0",
"resolved": "https://registry.npmjs.org/htmlparser2/-/htmlparser2-10.0.0.tgz",
@@ -2754,6 +2960,14 @@
"node": ">= 12"
}
},
"node_modules/ip2location-nodejs": {
"version": "9.7.0",
"resolved": "https://registry.npmjs.org/ip2location-nodejs/-/ip2location-nodejs-9.7.0.tgz",
"integrity": "sha512-eQ4T5TXm1cx0+pQcRycPiuaiRuoDEMd9O89Be7Ugk555qi9UY9enXSznkkqr3kQRyUaXx7zj5dORC5LGTPOttA==",
"dependencies": {
"csv-parser": "^3.0.0"
}
},
"node_modules/ipaddr.js": {
"version": "2.2.0",
"resolved": "https://registry.npmjs.org/ipaddr.js/-/ipaddr.js-2.2.0.tgz",
@@ -2882,6 +3096,22 @@
"node": ">=0.10.0"
}
},
"node_modules/isomorphic-ws": {
"version": "5.0.0",
"resolved": "https://registry.npmjs.org/isomorphic-ws/-/isomorphic-ws-5.0.0.tgz",
"integrity": "sha512-muId7Zzn9ywDsyXgTIafTry2sV3nySZeUDe6YedVd1Hvuuep5AsIlqK+XefWpYTyJG5e503F2xIuT2lcU6rCSw==",
"peerDependencies": {
"ws": "*"
}
},
"node_modules/jose": {
"version": "6.1.3",
"resolved": "https://registry.npmjs.org/jose/-/jose-6.1.3.tgz",
"integrity": "sha512-0TpaTfihd4QMNwrz/ob2Bp7X04yuxJkjRGi4aKmOqwhov54i6u79oCv7T+C7lo70MKH6BesI3vscD1yb/yzKXQ==",
"funding": {
"url": "https://github.com/sponsors/panva"
}
},
"node_modules/js-tokens": {
"version": "4.0.0",
"resolved": "https://registry.npmjs.org/js-tokens/-/js-tokens-4.0.0.tgz",
@@ -2898,6 +3128,14 @@
"js-yaml": "bin/js-yaml.js"
}
},
"node_modules/jsep": {
"version": "1.4.0",
"resolved": "https://registry.npmjs.org/jsep/-/jsep-1.4.0.tgz",
"integrity": "sha512-B7qPcEVE3NVkmSJbaYxvv4cHkVW7DQsZz13pUMrfS8z8Q/BuShN+gcTXrUlPiGqM2/t/EEaI030bpxMqY8gMlw==",
"engines": {
"node": ">= 10.16.0"
}
},
"node_modules/json-parse-even-better-errors": {
"version": "2.3.1",
"resolved": "https://registry.npmjs.org/json-parse-even-better-errors/-/json-parse-even-better-errors-2.3.1.tgz",
@@ -2919,6 +3157,23 @@
"graceful-fs": "^4.1.6"
}
},
"node_modules/jsonpath-plus": {
"version": "10.3.0",
"resolved": "https://registry.npmjs.org/jsonpath-plus/-/jsonpath-plus-10.3.0.tgz",
"integrity": "sha512-8TNmfeTCk2Le33A3vRRwtuworG/L5RrgMvdjhKZxvyShO+mBu2fP50OWUjRLNtvw344DdDarFh9buFAZs5ujeA==",
"dependencies": {
"@jsep-plugin/assignment": "^1.3.0",
"@jsep-plugin/regex": "^1.0.4",
"jsep": "^1.4.0"
},
"bin": {
"jsonpath": "bin/jsonpath-cli.js",
"jsonpath-plus": "bin/jsonpath-cli.js"
},
"engines": {
"node": ">=18.0.0"
}
},
"node_modules/jsonwebtoken": {
"version": "9.0.2",
"resolved": "https://registry.npmjs.org/jsonwebtoken/-/jsonwebtoken-9.0.2.tgz",
@@ -2993,6 +3248,11 @@
"resolved": "https://registry.npmjs.org/lodash/-/lodash-4.17.21.tgz",
"integrity": "sha512-v2kDEe57lecTulaDIuNTPy3Ry4gLGJ6Z1O3vE1krgXZNrsQ+LFTGHVxVjcXPs17LhbZVGedAJv8XZ1tvj5FvSg=="
},
"node_modules/lodash.clonedeep": {
"version": "4.5.0",
"resolved": "https://registry.npmjs.org/lodash.clonedeep/-/lodash.clonedeep-4.5.0.tgz",
"integrity": "sha512-H5ZhCF25riFd9uB5UCkVKo61m3S/xZk1x4wA6yp/L3RFP6Z/eHH1ymQcGLo7J3GMPfm0V/7m1tryHuGVxpqEBQ=="
},
"node_modules/lodash.defaults": {
"version": "4.2.0",
"resolved": "https://registry.npmjs.org/lodash.defaults/-/lodash.defaults-4.2.0.tgz",
@@ -3442,6 +3702,14 @@
"url": "https://github.com/fb55/nth-check?sponsor=1"
}
},
"node_modules/oauth4webapi": {
"version": "3.8.3",
"resolved": "https://registry.npmjs.org/oauth4webapi/-/oauth4webapi-3.8.3.tgz",
"integrity": "sha512-pQ5BsX3QRTgnt5HxgHwgunIRaDXBdkT23tf8dfzmtTIL2LTpdmxgbpbBm0VgFWAIDlezQvQCTgnVIUmHupXHxw==",
"funding": {
"url": "https://github.com/sponsors/panva"
}
},
"node_modules/object-assign": {
"version": "4.1.1",
"resolved": "https://registry.npmjs.org/object-assign/-/object-assign-4.1.1.tgz",
@@ -3480,6 +3748,18 @@
"wrappy": "1"
}
},
"node_modules/openid-client": {
"version": "6.8.1",
"resolved": "https://registry.npmjs.org/openid-client/-/openid-client-6.8.1.tgz",
"integrity": "sha512-VoYT6enBo6Vj2j3Q5Ec0AezS+9YGzQo1f5Xc42lreMGlfP4ljiXPKVDvCADh+XHCV/bqPu/wWSiCVXbJKvrODw==",
"dependencies": {
"jose": "^6.1.0",
"oauth4webapi": "^3.8.2"
},
"funding": {
"url": "https://github.com/sponsors/panva"
}
},
"node_modules/pac-proxy-agent": {
"version": "7.2.0",
"resolved": "https://registry.npmjs.org/pac-proxy-agent/-/pac-proxy-agent-7.2.0.tgz",
@@ -4396,6 +4676,11 @@
"url": "https://github.com/privatenumber/resolve-pkg-maps?sponsor=1"
}
},
"node_modules/rfc4648": {
"version": "1.5.4",
"resolved": "https://registry.npmjs.org/rfc4648/-/rfc4648-1.5.4.tgz",
"integrity": "sha512-rRg/6Lb+IGfJqO05HZkN50UtY7K/JhxJag1kP23+zyMfrvoB0B7RWv06MbOzoc79RgCdNTiUaNsTT1AJZ7Z+cg=="
},
"node_modules/rimraf": {
"version": "3.0.2",
"resolved": "https://registry.npmjs.org/rimraf/-/rimraf-3.0.2.tgz",
@@ -4826,6 +5111,14 @@
"node": ">= 0.8"
}
},
"node_modules/stream-buffers": {
"version": "3.0.3",
"resolved": "https://registry.npmjs.org/stream-buffers/-/stream-buffers-3.0.3.tgz",
"integrity": "sha512-pqMqwQCso0PBJt2PQmDO0cFj0lyqmiwOMiMSkVtRokl7e+ZTRYgDHKnuZNbqjiJXgsg4nuqtD/zxuo9KqTp0Yw==",
"engines": {
"node": ">= 0.10.0"
}
},
"node_modules/streamx": {
"version": "2.23.0",
"resolved": "https://registry.npmjs.org/streamx/-/streamx-2.23.0.tgz",
@@ -5045,8 +5338,7 @@
"node_modules/undici-types": {
"version": "6.21.0",
"resolved": "https://registry.npmjs.org/undici-types/-/undici-types-6.21.0.tgz",
"integrity": "sha512-iwDZqg0QAGrg9Rav5H4n0M64c3mkR59cJ6wQp+7C4nI0gsmExaedaYLNO44eT4AtBBwjbTiGPMlt2Md0T9H9JQ==",
"devOptional": true
"integrity": "sha512-iwDZqg0QAGrg9Rav5H4n0M64c3mkR59cJ6wQp+7C4nI0gsmExaedaYLNO44eT4AtBBwjbTiGPMlt2Md0T9H9JQ=="
},
"node_modules/universalify": {
"version": "2.0.1",
@@ -5069,6 +5361,14 @@
"resolved": "https://registry.npmjs.org/urlpattern-polyfill/-/urlpattern-polyfill-10.0.0.tgz",
"integrity": "sha512-H/A06tKD7sS1O1X2SshBVeA5FLycRpjqiBeqGKmBwBDBy28EnRjORxTNe269KSSr5un5qyWi1iL61wLxpd+ZOg=="
},
"node_modules/user-agents": {
"version": "1.1.669",
"resolved": "https://registry.npmjs.org/user-agents/-/user-agents-1.1.669.tgz",
"integrity": "sha512-pbIzG+AOqCaIpySKJ4IAm1l0VyE4jMnK4y1thV8lm8PYxI+7X5uWcppOK7zY79TCKKTAnJH3/4gaVIZHsjrmJA==",
"dependencies": {
"lodash.clonedeep": "^4.5.0"
}
},
"node_modules/util": {
"version": "0.12.5",
"resolved": "https://registry.npmjs.org/util/-/util-0.12.5.tgz",

View File

@@ -1,6 +1,6 @@
{
"name": "dutchie-menus-backend",
"version": "1.5.1",
"version": "1.6.0",
"description": "Backend API for Dutchie Menus scraper and management",
"main": "dist/index.js",
"scripts": {
@@ -22,6 +22,7 @@
"seed:dt:cities:bulk": "tsx src/scripts/seed-dt-cities-bulk.ts"
},
"dependencies": {
"@kubernetes/client-node": "^1.4.0",
"@types/bcryptjs": "^3.0.0",
"axios": "^1.6.2",
"bcrypt": "^5.1.1",
@@ -35,6 +36,7 @@
"helmet": "^7.1.0",
"https-proxy-agent": "^7.0.2",
"ioredis": "^5.8.2",
"ip2location-nodejs": "^9.7.0",
"ipaddr.js": "^2.2.0",
"jsonwebtoken": "^9.0.2",
"minio": "^7.1.3",
@@ -47,6 +49,7 @@
"puppeteer-extra-plugin-stealth": "^2.11.2",
"sharp": "^0.32.0",
"socks-proxy-agent": "^8.0.2",
"user-agents": "^1.1.669",
"uuid": "^9.0.1",
"zod": "^3.22.4"
},

Binary file not shown.

View File

@@ -0,0 +1 @@
cannaiq-menus-1.6.0.zip

View File

@@ -0,0 +1,65 @@
#!/bin/bash
# Download IP2Location LITE DB3 (City-level) database
# Free for commercial use with attribution
# https://lite.ip2location.com/database/db3-ip-country-region-city
set -e
DATA_DIR="${1:-./data/ip2location}"
DB_FILE="IP2LOCATION-LITE-DB3.BIN"
mkdir -p "$DATA_DIR"
cd "$DATA_DIR"
echo "Downloading IP2Location LITE DB3 database..."
# IP2Location LITE DB3 - includes city, region, country, lat/lng
# You need to register at https://lite.ip2location.com/ to get a download token
# Then set IP2LOCATION_TOKEN environment variable
if [ -z "$IP2LOCATION_TOKEN" ]; then
echo ""
echo "ERROR: IP2LOCATION_TOKEN not set"
echo ""
echo "To download the database:"
echo "1. Register free at https://lite.ip2location.com/"
echo "2. Get your download token from the dashboard"
echo "3. Run: IP2LOCATION_TOKEN=your_token ./scripts/download-ip2location.sh"
echo ""
exit 1
fi
# Download DB3.LITE (IPv4 + City)
DOWNLOAD_URL="https://www.ip2location.com/download/?token=${IP2LOCATION_TOKEN}&file=DB3LITEBIN"
echo "Downloading from IP2Location..."
curl -L -o ip2location.zip "$DOWNLOAD_URL"
echo "Extracting..."
unzip -o ip2location.zip
# Rename to standard name
if [ -f "IP2LOCATION-LITE-DB3.BIN" ]; then
echo "Database ready: $DATA_DIR/IP2LOCATION-LITE-DB3.BIN"
elif [ -f "IP-COUNTRY-REGION-CITY.BIN" ]; then
mv "IP-COUNTRY-REGION-CITY.BIN" "$DB_FILE"
echo "Database ready: $DATA_DIR/$DB_FILE"
else
# Find whatever BIN file was extracted
BIN_FILE=$(ls *.BIN 2>/dev/null | head -1)
if [ -n "$BIN_FILE" ]; then
mv "$BIN_FILE" "$DB_FILE"
echo "Database ready: $DATA_DIR/$DB_FILE"
else
echo "ERROR: No BIN file found in archive"
ls -la
exit 1
fi
fi
# Cleanup
rm -f ip2location.zip *.txt LICENSE* README*
echo ""
echo "Done! Database saved to: $DATA_DIR/$DB_FILE"
echo "Update monthly by re-running this script."

View File

@@ -0,0 +1,46 @@
# DEPRECATED CODE - DO NOT USE
**These directories contain OLD, ABANDONED code.**
## What's Here
| Directory | What It Was | Why Deprecated |
|-----------|-------------|----------------|
| `hydration/` | Old pipeline for processing crawl data | Replaced by `src/tasks/handlers/` |
| `scraper-v2/` | Old Puppeteer-based scraper engine | Replaced by curl-based `src/platforms/dutchie/client.ts` |
| `canonical-hydration/` | Intermediate step toward canonical schema | Merged into task handlers |
## What to Use Instead
| Old (DONT USE) | New (USE THIS) |
|----------------|----------------|
| `hydration/normalizers/dutchie.ts` | `src/tasks/handlers/product-refresh.ts` |
| `hydration/producer.ts` | `src/tasks/handlers/payload-fetch.ts` |
| `scraper-v2/engine.ts` | `src/platforms/dutchie/client.ts` |
| `scraper-v2/scheduler.ts` | `src/services/task-scheduler.ts` |
## Why Keep This Code?
- Historical reference only
- Some patterns may be useful for debugging
- Will be deleted once confirmed not needed
## Claude Instructions
**IF YOU ARE CLAUDE:**
1. NEVER import from `src/_deprecated/`
2. NEVER reference these files as examples
3. NEVER try to "fix" or "update" code in here
4. If you see imports from these directories, suggest replacing them
**Correct imports:**
```typescript
// GOOD
import { executeGraphQL } from '../platforms/dutchie/client';
import { pool } from '../db/pool';
// BAD - DO NOT USE
import { something } from '../_deprecated/hydration/...';
import { something } from '../_deprecated/scraper-v2/...';
```

View File

@@ -23,6 +23,7 @@ import {
DutchieNormalizer,
hydrateToCanonical,
} from '../hydration';
import { initializeImageStorage } from '../utils/image-storage';
dotenv.config();
@@ -137,6 +138,11 @@ async function main() {
console.log(`Test Crawl to Canonical - Dispensary ${dispensaryId}`);
console.log('============================================================\n');
// Initialize image storage
console.log('[Init] Initializing image storage...');
await initializeImageStorage();
console.log(' Image storage ready\n');
try {
// Step 1: Get dispensary info
console.log('[Step 1] Getting dispensary info...');

View File

@@ -3,7 +3,7 @@ import StealthPlugin from 'puppeteer-extra-plugin-stealth';
import { Browser, Page } from 'puppeteer';
import { SocksProxyAgent } from 'socks-proxy-agent';
import { pool } from '../db/pool';
import { uploadImageFromUrl, getImageUrl } from '../utils/minio';
import { downloadProductImageLegacy } from '../utils/image-storage';
import { logger } from './logger';
import { registerScraper, updateScraperStats, completeScraper } from '../routes/scraper-monitor';
import { incrementProxyFailure, getActiveProxy, isBotDetectionError, putProxyInTimeout } from './proxy';
@@ -767,7 +767,8 @@ export async function saveProducts(storeId: number, categoryId: number, products
if (product.imageUrl && !localImagePath) {
try {
localImagePath = await uploadImageFromUrl(product.imageUrl, productId);
const result = await downloadProductImageLegacy(product.imageUrl, 0, productId);
localImagePath = result.urls?.original || null;
await client.query(`
UPDATE products
SET local_image_path = $1

View File

@@ -0,0 +1,584 @@
/**
* System API Routes
*
* Provides REST API endpoints for system monitoring and control:
* - /api/system/sync/* - Sync orchestrator
* - /api/system/dlq/* - Dead-letter queue
* - /api/system/integrity/* - Integrity checks
* - /api/system/fix/* - Auto-fix routines
* - /api/system/alerts/* - System alerts
* - /metrics - Prometheus metrics
*
* Phase 5: Full Production Sync + Monitoring
*/
import { Router, Request, Response } from 'express';
import { Pool } from 'pg';
import {
SyncOrchestrator,
MetricsService,
DLQService,
AlertService,
IntegrityService,
AutoFixService,
} from '../services';
export function createSystemRouter(pool: Pool): Router {
const router = Router();
// Initialize services
const metrics = new MetricsService(pool);
const dlq = new DLQService(pool);
const alerts = new AlertService(pool);
const integrity = new IntegrityService(pool, alerts);
const autoFix = new AutoFixService(pool, alerts);
const orchestrator = new SyncOrchestrator(pool, metrics, dlq, alerts);
// ============================================================
// SYNC ORCHESTRATOR ENDPOINTS
// ============================================================
/**
* GET /api/system/sync/status
* Get current sync status
*/
router.get('/sync/status', async (_req: Request, res: Response) => {
try {
const status = await orchestrator.getStatus();
res.json(status);
} catch (error) {
console.error('[System] Sync status error:', error);
res.status(500).json({ error: 'Failed to get sync status' });
}
});
/**
* POST /api/system/sync/run
* Trigger a sync run
*/
router.post('/sync/run', async (req: Request, res: Response) => {
try {
const triggeredBy = req.body.triggeredBy || 'api';
const result = await orchestrator.runSync();
res.json({
success: true,
triggeredBy,
metrics: result,
});
} catch (error) {
console.error('[System] Sync run error:', error);
res.status(500).json({
success: false,
error: error instanceof Error ? error.message : 'Sync run failed',
});
}
});
/**
* GET /api/system/sync/queue-depth
* Get queue depth information
*/
router.get('/sync/queue-depth', async (_req: Request, res: Response) => {
try {
const depth = await orchestrator.getQueueDepth();
res.json(depth);
} catch (error) {
console.error('[System] Queue depth error:', error);
res.status(500).json({ error: 'Failed to get queue depth' });
}
});
/**
* GET /api/system/sync/health
* Get sync health status
*/
router.get('/sync/health', async (_req: Request, res: Response) => {
try {
const health = await orchestrator.getHealth();
res.status(health.healthy ? 200 : 503).json(health);
} catch (error) {
console.error('[System] Health check error:', error);
res.status(500).json({ healthy: false, error: 'Health check failed' });
}
});
/**
* POST /api/system/sync/pause
* Pause the orchestrator
*/
router.post('/sync/pause', async (req: Request, res: Response) => {
try {
const reason = req.body.reason || 'Manual pause';
await orchestrator.pause(reason);
res.json({ success: true, message: 'Orchestrator paused' });
} catch (error) {
console.error('[System] Pause error:', error);
res.status(500).json({ error: 'Failed to pause orchestrator' });
}
});
/**
* POST /api/system/sync/resume
* Resume the orchestrator
*/
router.post('/sync/resume', async (_req: Request, res: Response) => {
try {
await orchestrator.resume();
res.json({ success: true, message: 'Orchestrator resumed' });
} catch (error) {
console.error('[System] Resume error:', error);
res.status(500).json({ error: 'Failed to resume orchestrator' });
}
});
// ============================================================
// DLQ ENDPOINTS
// ============================================================
/**
* GET /api/system/dlq
* List DLQ payloads
*/
router.get('/dlq', async (req: Request, res: Response) => {
try {
const options = {
status: req.query.status as string,
errorType: req.query.errorType as string,
dispensaryId: req.query.dispensaryId ? parseInt(req.query.dispensaryId as string) : undefined,
limit: req.query.limit ? parseInt(req.query.limit as string) : 50,
offset: req.query.offset ? parseInt(req.query.offset as string) : 0,
};
const result = await dlq.listPayloads(options);
res.json(result);
} catch (error) {
console.error('[System] DLQ list error:', error);
res.status(500).json({ error: 'Failed to list DLQ payloads' });
}
});
/**
* GET /api/system/dlq/stats
* Get DLQ statistics
*/
router.get('/dlq/stats', async (_req: Request, res: Response) => {
try {
const stats = await dlq.getStats();
res.json(stats);
} catch (error) {
console.error('[System] DLQ stats error:', error);
res.status(500).json({ error: 'Failed to get DLQ stats' });
}
});
/**
* GET /api/system/dlq/summary
* Get DLQ summary by error type
*/
router.get('/dlq/summary', async (_req: Request, res: Response) => {
try {
const summary = await dlq.getSummary();
res.json(summary);
} catch (error) {
console.error('[System] DLQ summary error:', error);
res.status(500).json({ error: 'Failed to get DLQ summary' });
}
});
/**
* GET /api/system/dlq/:id
* Get a specific DLQ payload
*/
router.get('/dlq/:id', async (req: Request, res: Response) => {
try {
const payload = await dlq.getPayload(req.params.id);
if (!payload) {
return res.status(404).json({ error: 'Payload not found' });
}
res.json(payload);
} catch (error) {
console.error('[System] DLQ get error:', error);
res.status(500).json({ error: 'Failed to get DLQ payload' });
}
});
/**
* POST /api/system/dlq/:id/retry
* Retry a DLQ payload
*/
router.post('/dlq/:id/retry', async (req: Request, res: Response) => {
try {
const result = await dlq.retryPayload(req.params.id);
if (result.success) {
res.json(result);
} else {
res.status(400).json(result);
}
} catch (error) {
console.error('[System] DLQ retry error:', error);
res.status(500).json({ error: 'Failed to retry payload' });
}
});
/**
* POST /api/system/dlq/:id/abandon
* Abandon a DLQ payload
*/
router.post('/dlq/:id/abandon', async (req: Request, res: Response) => {
try {
const reason = req.body.reason || 'Manually abandoned';
const abandonedBy = req.body.abandonedBy || 'api';
const success = await dlq.abandonPayload(req.params.id, reason, abandonedBy);
res.json({ success });
} catch (error) {
console.error('[System] DLQ abandon error:', error);
res.status(500).json({ error: 'Failed to abandon payload' });
}
});
/**
* POST /api/system/dlq/bulk-retry
* Bulk retry payloads by error type
*/
router.post('/dlq/bulk-retry', async (req: Request, res: Response) => {
try {
const { errorType } = req.body;
if (!errorType) {
return res.status(400).json({ error: 'errorType is required' });
}
const result = await dlq.bulkRetryByErrorType(errorType);
res.json(result);
} catch (error) {
console.error('[System] DLQ bulk retry error:', error);
res.status(500).json({ error: 'Failed to bulk retry' });
}
});
// ============================================================
// INTEGRITY CHECK ENDPOINTS
// ============================================================
/**
* POST /api/system/integrity/run
* Run all integrity checks
*/
router.post('/integrity/run', async (req: Request, res: Response) => {
try {
const triggeredBy = req.body.triggeredBy || 'api';
const result = await integrity.runAllChecks(triggeredBy);
res.json(result);
} catch (error) {
console.error('[System] Integrity run error:', error);
res.status(500).json({ error: 'Failed to run integrity checks' });
}
});
/**
* GET /api/system/integrity/runs
* Get recent integrity check runs
*/
router.get('/integrity/runs', async (req: Request, res: Response) => {
try {
const limit = req.query.limit ? parseInt(req.query.limit as string) : 10;
const runs = await integrity.getRecentRuns(limit);
res.json(runs);
} catch (error) {
console.error('[System] Integrity runs error:', error);
res.status(500).json({ error: 'Failed to get integrity runs' });
}
});
/**
* GET /api/system/integrity/runs/:runId
* Get results for a specific integrity run
*/
router.get('/integrity/runs/:runId', async (req: Request, res: Response) => {
try {
const results = await integrity.getRunResults(req.params.runId);
res.json(results);
} catch (error) {
console.error('[System] Integrity run results error:', error);
res.status(500).json({ error: 'Failed to get run results' });
}
});
// ============================================================
// AUTO-FIX ENDPOINTS
// ============================================================
/**
* GET /api/system/fix/routines
* Get available fix routines
*/
router.get('/fix/routines', (_req: Request, res: Response) => {
try {
const routines = autoFix.getAvailableRoutines();
res.json(routines);
} catch (error) {
console.error('[System] Get routines error:', error);
res.status(500).json({ error: 'Failed to get routines' });
}
});
/**
* POST /api/system/fix/:routine
* Run a fix routine
*/
router.post('/fix/:routine', async (req: Request, res: Response) => {
try {
const routineName = req.params.routine;
const dryRun = req.body.dryRun === true;
const triggeredBy = req.body.triggeredBy || 'api';
const result = await autoFix.runRoutine(routineName as any, triggeredBy, { dryRun });
res.json(result);
} catch (error) {
console.error('[System] Fix routine error:', error);
res.status(500).json({ error: 'Failed to run fix routine' });
}
});
/**
* GET /api/system/fix/runs
* Get recent fix runs
*/
router.get('/fix/runs', async (req: Request, res: Response) => {
try {
const limit = req.query.limit ? parseInt(req.query.limit as string) : 20;
const runs = await autoFix.getRecentRuns(limit);
res.json(runs);
} catch (error) {
console.error('[System] Fix runs error:', error);
res.status(500).json({ error: 'Failed to get fix runs' });
}
});
// ============================================================
// ALERTS ENDPOINTS
// ============================================================
/**
* GET /api/system/alerts
* List alerts
*/
router.get('/alerts', async (req: Request, res: Response) => {
try {
const options = {
status: req.query.status as any,
severity: req.query.severity as any,
type: req.query.type as string,
limit: req.query.limit ? parseInt(req.query.limit as string) : 50,
offset: req.query.offset ? parseInt(req.query.offset as string) : 0,
};
const result = await alerts.listAlerts(options);
res.json(result);
} catch (error) {
console.error('[System] Alerts list error:', error);
res.status(500).json({ error: 'Failed to list alerts' });
}
});
/**
* GET /api/system/alerts/active
* Get active alerts
*/
router.get('/alerts/active', async (_req: Request, res: Response) => {
try {
const activeAlerts = await alerts.getActiveAlerts();
res.json(activeAlerts);
} catch (error) {
console.error('[System] Active alerts error:', error);
res.status(500).json({ error: 'Failed to get active alerts' });
}
});
/**
* GET /api/system/alerts/summary
* Get alert summary
*/
router.get('/alerts/summary', async (_req: Request, res: Response) => {
try {
const summary = await alerts.getSummary();
res.json(summary);
} catch (error) {
console.error('[System] Alerts summary error:', error);
res.status(500).json({ error: 'Failed to get alerts summary' });
}
});
/**
* POST /api/system/alerts/:id/acknowledge
* Acknowledge an alert
*/
router.post('/alerts/:id/acknowledge', async (req: Request, res: Response) => {
try {
const alertId = parseInt(req.params.id);
const acknowledgedBy = req.body.acknowledgedBy || 'api';
const success = await alerts.acknowledgeAlert(alertId, acknowledgedBy);
res.json({ success });
} catch (error) {
console.error('[System] Acknowledge alert error:', error);
res.status(500).json({ error: 'Failed to acknowledge alert' });
}
});
/**
* POST /api/system/alerts/:id/resolve
* Resolve an alert
*/
router.post('/alerts/:id/resolve', async (req: Request, res: Response) => {
try {
const alertId = parseInt(req.params.id);
const resolvedBy = req.body.resolvedBy || 'api';
const success = await alerts.resolveAlert(alertId, resolvedBy);
res.json({ success });
} catch (error) {
console.error('[System] Resolve alert error:', error);
res.status(500).json({ error: 'Failed to resolve alert' });
}
});
/**
* POST /api/system/alerts/bulk-acknowledge
* Bulk acknowledge alerts
*/
router.post('/alerts/bulk-acknowledge', async (req: Request, res: Response) => {
try {
const { ids, acknowledgedBy } = req.body;
if (!ids || !Array.isArray(ids)) {
return res.status(400).json({ error: 'ids array is required' });
}
const count = await alerts.bulkAcknowledge(ids, acknowledgedBy || 'api');
res.json({ acknowledged: count });
} catch (error) {
console.error('[System] Bulk acknowledge error:', error);
res.status(500).json({ error: 'Failed to bulk acknowledge' });
}
});
// ============================================================
// METRICS ENDPOINTS
// ============================================================
/**
* GET /api/system/metrics
* Get all current metrics
*/
router.get('/metrics', async (_req: Request, res: Response) => {
try {
const allMetrics = await metrics.getAllMetrics();
res.json(allMetrics);
} catch (error) {
console.error('[System] Metrics error:', error);
res.status(500).json({ error: 'Failed to get metrics' });
}
});
/**
* GET /api/system/metrics/:name
* Get a specific metric
*/
router.get('/metrics/:name', async (req: Request, res: Response) => {
try {
const metric = await metrics.getMetric(req.params.name);
if (!metric) {
return res.status(404).json({ error: 'Metric not found' });
}
res.json(metric);
} catch (error) {
console.error('[System] Metric error:', error);
res.status(500).json({ error: 'Failed to get metric' });
}
});
/**
* GET /api/system/metrics/:name/history
* Get metric time series
*/
router.get('/metrics/:name/history', async (req: Request, res: Response) => {
try {
const hours = req.query.hours ? parseInt(req.query.hours as string) : 24;
const history = await metrics.getMetricHistory(req.params.name, hours);
res.json(history);
} catch (error) {
console.error('[System] Metric history error:', error);
res.status(500).json({ error: 'Failed to get metric history' });
}
});
/**
* GET /api/system/errors
* Get error summary
*/
router.get('/errors', async (_req: Request, res: Response) => {
try {
const summary = await metrics.getErrorSummary();
res.json(summary);
} catch (error) {
console.error('[System] Error summary error:', error);
res.status(500).json({ error: 'Failed to get error summary' });
}
});
/**
* GET /api/system/errors/recent
* Get recent errors
*/
router.get('/errors/recent', async (req: Request, res: Response) => {
try {
const limit = req.query.limit ? parseInt(req.query.limit as string) : 50;
const errorType = req.query.type as string;
const errors = await metrics.getRecentErrors(limit, errorType);
res.json(errors);
} catch (error) {
console.error('[System] Recent errors error:', error);
res.status(500).json({ error: 'Failed to get recent errors' });
}
});
/**
* POST /api/system/errors/acknowledge
* Acknowledge errors
*/
router.post('/errors/acknowledge', async (req: Request, res: Response) => {
try {
const { ids, acknowledgedBy } = req.body;
if (!ids || !Array.isArray(ids)) {
return res.status(400).json({ error: 'ids array is required' });
}
const count = await metrics.acknowledgeErrors(ids, acknowledgedBy || 'api');
res.json({ acknowledged: count });
} catch (error) {
console.error('[System] Acknowledge errors error:', error);
res.status(500).json({ error: 'Failed to acknowledge errors' });
}
});
return router;
}
/**
* Create Prometheus metrics endpoint (standalone)
*/
export function createPrometheusRouter(pool: Pool): Router {
const router = Router();
const metrics = new MetricsService(pool);
/**
* GET /metrics
* Prometheus-compatible metrics endpoint
*/
router.get('/', async (_req: Request, res: Response) => {
try {
const prometheusOutput = await metrics.getPrometheusMetrics();
res.set('Content-Type', 'text/plain; version=0.0.4');
res.send(prometheusOutput);
} catch (error) {
console.error('[Prometheus] Metrics error:', error);
res.status(500).send('# Error generating metrics');
}
});
return router;
}

View File

@@ -29,6 +29,12 @@ const TRUSTED_ORIGINS = [
'http://localhost:5173',
];
// Pattern-based trusted origins (wildcards)
const TRUSTED_ORIGIN_PATTERNS = [
/^https:\/\/.*\.cannabrands\.app$/, // *.cannabrands.app
/^https:\/\/.*\.cannaiq\.co$/, // *.cannaiq.co
];
// Trusted IPs for internal pod-to-pod communication
const TRUSTED_IPS = [
'127.0.0.1',
@@ -42,8 +48,16 @@ const TRUSTED_IPS = [
function isTrustedRequest(req: Request): boolean {
// Check origin header
const origin = req.headers.origin;
if (origin && TRUSTED_ORIGINS.includes(origin)) {
return true;
if (origin) {
if (TRUSTED_ORIGINS.includes(origin)) {
return true;
}
// Check pattern-based origins (wildcards like *.cannabrands.app)
for (const pattern of TRUSTED_ORIGIN_PATTERNS) {
if (pattern.test(origin)) {
return true;
}
}
}
// Check referer header (for same-origin requests without CORS)
@@ -54,6 +68,18 @@ function isTrustedRequest(req: Request): boolean {
return true;
}
}
// Check pattern-based referers
try {
const refererUrl = new URL(referer);
const refererOrigin = refererUrl.origin;
for (const pattern of TRUSTED_ORIGIN_PATTERNS) {
if (pattern.test(refererOrigin)) {
return true;
}
}
} catch {
// Invalid referer URL, skip
}
}
// Check IP for internal requests (pod-to-pod, localhost)
@@ -127,7 +153,53 @@ export async function authenticateUser(email: string, password: string): Promise
}
export async function authMiddleware(req: AuthRequest, res: Response, next: NextFunction) {
// Allow trusted origins/IPs to bypass auth (internal services, same-origin)
const authHeader = req.headers.authorization;
// If a Bearer token is provided, always try to use it first (logged-in user)
if (authHeader && authHeader.startsWith('Bearer ')) {
const token = authHeader.substring(7);
// Try JWT first
const jwtUser = verifyToken(token);
if (jwtUser) {
req.user = jwtUser;
return next();
}
// If JWT fails, try API token
try {
const result = await pool.query(`
SELECT id, name, rate_limit, active, expires_at, allowed_endpoints
FROM api_tokens
WHERE token = $1
`, [token]);
if (result.rows.length > 0) {
const apiToken = result.rows[0];
if (!apiToken.active) {
return res.status(401).json({ error: 'API token is inactive' });
}
if (apiToken.expires_at && new Date(apiToken.expires_at) < new Date()) {
return res.status(401).json({ error: 'API token has expired' });
}
req.user = {
id: 0,
email: `api:${apiToken.name}`,
role: 'api_token'
};
req.apiToken = apiToken;
return next();
}
} catch (err) {
console.error('API token lookup error:', err);
}
// Token provided but invalid
return res.status(401).json({ error: 'Invalid token' });
}
// No token provided - check trusted origins for API access (WordPress, etc.)
if (isTrustedRequest(req)) {
req.user = {
id: 0,
@@ -137,80 +209,10 @@ export async function authMiddleware(req: AuthRequest, res: Response, next: Next
return next();
}
const authHeader = req.headers.authorization;
if (!authHeader || !authHeader.startsWith('Bearer ')) {
return res.status(401).json({ error: 'No token provided' });
}
const token = authHeader.substring(7);
// Try JWT first
const jwtUser = verifyToken(token);
if (jwtUser) {
req.user = jwtUser;
return next();
}
// If JWT fails, try API token
try {
const result = await pool.query(`
SELECT id, name, rate_limit, active, expires_at, allowed_endpoints
FROM api_tokens
WHERE token = $1
`, [token]);
if (result.rows.length === 0) {
return res.status(401).json({ error: 'Invalid token' });
}
const apiToken = result.rows[0];
// Check if token is active
if (!apiToken.active) {
return res.status(401).json({ error: 'Token is disabled' });
}
// Check if token is expired
if (apiToken.expires_at && new Date(apiToken.expires_at) < new Date()) {
return res.status(401).json({ error: 'Token has expired' });
}
// Check allowed endpoints
if (apiToken.allowed_endpoints && apiToken.allowed_endpoints.length > 0) {
const isAllowed = apiToken.allowed_endpoints.some((pattern: string) => {
// Simple wildcard matching
const regex = new RegExp('^' + pattern.replace('*', '.*') + '$');
return regex.test(req.path);
});
if (!isAllowed) {
return res.status(403).json({ error: 'Endpoint not allowed for this token' });
}
}
// Set API token on request for tracking
req.apiToken = {
id: apiToken.id,
name: apiToken.name,
rate_limit: apiToken.rate_limit
};
// Set a generic user for compatibility with existing code
req.user = {
id: apiToken.id,
email: `api-token-${apiToken.id}@system`,
role: 'api'
};
next();
} catch (error) {
console.error('Error verifying API token:', error);
return res.status(500).json({ error: 'Authentication failed' });
}
return res.status(401).json({ error: 'No token provided' });
}
/**
* Require specific role(s) to access endpoint.
*

View File

@@ -0,0 +1,141 @@
/**
* Auto-Migration System
*
* Runs SQL migration files from the migrations/ folder automatically on server startup.
* Uses a schema_migrations table to track which migrations have been applied.
*
* Safe to run multiple times - only applies new migrations.
*/
import { Pool } from 'pg';
import fs from 'fs';
import path from 'path';
const MIGRATIONS_DIR = path.join(__dirname, '../../migrations');
/**
* Ensure schema_migrations table exists
*/
async function ensureMigrationsTable(pool: Pool): Promise<void> {
await pool.query(`
CREATE TABLE IF NOT EXISTS schema_migrations (
id SERIAL PRIMARY KEY,
name VARCHAR(255) UNIQUE NOT NULL,
applied_at TIMESTAMP WITH TIME ZONE DEFAULT NOW()
)
`);
}
/**
* Get list of already-applied migrations
*/
async function getAppliedMigrations(pool: Pool): Promise<Set<string>> {
const result = await pool.query('SELECT name FROM schema_migrations');
return new Set(result.rows.map(row => row.name));
}
/**
* Get list of migration files from disk
*/
function getMigrationFiles(): string[] {
if (!fs.existsSync(MIGRATIONS_DIR)) {
console.log('[AutoMigrate] No migrations directory found');
return [];
}
return fs.readdirSync(MIGRATIONS_DIR)
.filter(f => f.endsWith('.sql'))
.sort(); // Sort alphabetically (001_, 002_, etc.)
}
/**
* Run a single migration file
*/
async function runMigration(pool: Pool, filename: string): Promise<void> {
const filepath = path.join(MIGRATIONS_DIR, filename);
const sql = fs.readFileSync(filepath, 'utf8');
const client = await pool.connect();
try {
await client.query('BEGIN');
// Run the migration SQL
await client.query(sql);
// Record that this migration was applied
await client.query(
'INSERT INTO schema_migrations (name) VALUES ($1) ON CONFLICT (name) DO NOTHING',
[filename]
);
await client.query('COMMIT');
console.log(`[AutoMigrate] ✓ Applied: ${filename}`);
} catch (error: any) {
await client.query('ROLLBACK');
console.error(`[AutoMigrate] ✗ Failed: ${filename}`);
throw error;
} finally {
client.release();
}
}
/**
* Run all pending migrations
*
* @param pool - Database connection pool
* @returns Number of migrations applied
*/
export async function runAutoMigrations(pool: Pool): Promise<number> {
console.log('[AutoMigrate] Checking for pending migrations...');
try {
// Ensure migrations table exists
await ensureMigrationsTable(pool);
// Get applied and available migrations
const applied = await getAppliedMigrations(pool);
const available = getMigrationFiles();
// Find pending migrations
const pending = available.filter(f => !applied.has(f));
if (pending.length === 0) {
console.log('[AutoMigrate] No pending migrations');
return 0;
}
console.log(`[AutoMigrate] Found ${pending.length} pending migrations`);
// Run each pending migration in order
for (const filename of pending) {
await runMigration(pool, filename);
}
console.log(`[AutoMigrate] Successfully applied ${pending.length} migrations`);
return pending.length;
} catch (error: any) {
console.error('[AutoMigrate] Migration failed:', error.message);
// Don't crash the server - log and continue
// The specific failing migration will have been rolled back
return -1;
}
}
/**
* Check migration status without running anything
*/
export async function checkMigrationStatus(pool: Pool): Promise<{
applied: string[];
pending: string[];
}> {
await ensureMigrationsTable(pool);
const applied = await getAppliedMigrations(pool);
const available = getMigrationFiles();
return {
applied: available.filter(f => applied.has(f)),
pending: available.filter(f => !applied.has(f)),
};
}

View File

@@ -0,0 +1,200 @@
#!/usr/bin/env npx tsx
/**
* Database Migration Runner
*
* Runs SQL migrations from backend/migrations/*.sql in order.
* Tracks applied migrations in schema_migrations table.
*
* Usage:
* npx tsx src/db/run-migrations.ts
*
* Environment:
* DATABASE_URL or CANNAIQ_DB_* variables
*/
import { Pool } from 'pg';
import * as fs from 'fs/promises';
import * as path from 'path';
import dotenv from 'dotenv';
dotenv.config();
function getConnectionString(): string {
if (process.env.DATABASE_URL) {
return process.env.DATABASE_URL;
}
if (process.env.CANNAIQ_DB_URL) {
return process.env.CANNAIQ_DB_URL;
}
const host = process.env.CANNAIQ_DB_HOST || 'localhost';
const port = process.env.CANNAIQ_DB_PORT || '54320';
const name = process.env.CANNAIQ_DB_NAME || 'dutchie_menus';
const user = process.env.CANNAIQ_DB_USER || 'dutchie';
const pass = process.env.CANNAIQ_DB_PASS || 'dutchie_local_pass';
return `postgresql://${user}:${pass}@${host}:${port}/${name}`;
}
interface MigrationFile {
filename: string;
number: number;
path: string;
}
async function getMigrationFiles(migrationsDir: string): Promise<MigrationFile[]> {
const files = await fs.readdir(migrationsDir);
const migrations: MigrationFile[] = files
.filter(f => f.endsWith('.sql'))
.map(filename => {
// Extract number from filename like "005_api_tokens.sql" or "073_proxy_timezone.sql"
const match = filename.match(/^(\d+)_/);
if (!match) return null;
return {
filename,
number: parseInt(match[1], 10),
path: path.join(migrationsDir, filename),
};
})
.filter((m): m is MigrationFile => m !== null)
.sort((a, b) => a.number - b.number);
return migrations;
}
async function ensureMigrationsTable(pool: Pool): Promise<void> {
// Migrate to filename-based tracking (handles duplicate version numbers)
// Check if old version-based PK exists
const pkCheck = await pool.query(`
SELECT constraint_name FROM information_schema.table_constraints
WHERE table_name = 'schema_migrations' AND constraint_type = 'PRIMARY KEY'
`);
if (pkCheck.rows.length === 0) {
// Table doesn't exist, create with filename as PK
await pool.query(`
CREATE TABLE IF NOT EXISTS schema_migrations (
filename VARCHAR(255) NOT NULL PRIMARY KEY,
version VARCHAR(10),
name VARCHAR(255),
applied_at TIMESTAMPTZ DEFAULT NOW()
)
`);
} else {
// Table exists - add filename column if missing
await pool.query(`
ALTER TABLE schema_migrations ADD COLUMN IF NOT EXISTS filename VARCHAR(255)
`);
// Populate filename from version+name for existing rows
await pool.query(`
UPDATE schema_migrations SET filename = version || '_' || name || '.sql'
WHERE filename IS NULL
`);
}
}
async function getAppliedMigrations(pool: Pool): Promise<Set<string>> {
// Try filename first, fall back to version_name combo
const result = await pool.query(`
SELECT COALESCE(filename, version || '_' || name || '.sql') as filename
FROM schema_migrations
`);
return new Set(result.rows.map(r => r.filename));
}
async function applyMigration(pool: Pool, migration: MigrationFile): Promise<void> {
const sql = await fs.readFile(migration.path, 'utf-8');
// Extract version and name from filename like "005_api_tokens.sql"
const version = String(migration.number).padStart(3, '0');
const name = migration.filename.replace(/^\d+_/, '').replace(/\.sql$/, '');
const client = await pool.connect();
try {
await client.query('BEGIN');
// Run the migration SQL
await client.query(sql);
// Record that it was applied - use INSERT with ON CONFLICT for safety
await client.query(`
INSERT INTO schema_migrations (filename, version, name)
VALUES ($1, $2, $3)
ON CONFLICT DO NOTHING
`, [migration.filename, version, name]);
await client.query('COMMIT');
} catch (error) {
await client.query('ROLLBACK');
throw error;
} finally {
client.release();
}
}
async function main() {
const pool = new Pool({ connectionString: getConnectionString() });
// Migrations directory relative to this file
const migrationsDir = path.resolve(__dirname, '../../migrations');
console.log('╔════════════════════════════════════════════════════════════╗');
console.log('║ DATABASE MIGRATION RUNNER ║');
console.log('╚════════════════════════════════════════════════════════════╝');
console.log(`Migrations dir: ${migrationsDir}`);
console.log('');
try {
// Ensure tracking table exists
await ensureMigrationsTable(pool);
// Get all migration files
const allMigrations = await getMigrationFiles(migrationsDir);
console.log(`Found ${allMigrations.length} migration files`);
// Get already-applied migrations
const applied = await getAppliedMigrations(pool);
console.log(`Already applied: ${applied.size} migrations`);
console.log('');
// Find pending migrations (compare by filename)
const pending = allMigrations.filter(m => !applied.has(m.filename));
if (pending.length === 0) {
console.log('✅ No pending migrations. Database is up to date.');
await pool.end();
return;
}
console.log(`Pending migrations: ${pending.length}`);
console.log('─'.repeat(60));
// Apply each pending migration
for (const migration of pending) {
process.stdout.write(` ${migration.filename}... `);
try {
await applyMigration(pool, migration);
console.log('✅');
} catch (error: any) {
console.log('❌');
console.error(`\nError applying ${migration.filename}:`);
console.error(error.message);
process.exit(1);
}
}
console.log('');
console.log('═'.repeat(60));
console.log(`✅ Applied ${pending.length} migrations successfully`);
} catch (error: any) {
console.error('Migration runner failed:', error.message);
process.exit(1);
} finally {
await pool.end();
}
}
main();

View File

@@ -172,6 +172,9 @@ export async function runFullDiscovery(
console.log(`Errors: ${totalErrors}`);
}
// Per TASK_WORKFLOW_2024-12-10.md: Track new dispensary IDs for task chaining
let newDispensaryIds: number[] = [];
// Step 4: Auto-validate and promote discovered locations
if (!dryRun && totalLocationsUpserted > 0) {
console.log('\n[Discovery] Step 4: Auto-promoting discovered locations...');
@@ -180,6 +183,13 @@ export async function runFullDiscovery(
console.log(` Created: ${promotionResult.created} new dispensaries`);
console.log(` Updated: ${promotionResult.updated} existing dispensaries`);
console.log(` Rejected: ${promotionResult.rejected} (validation failed)`);
// Per TASK_WORKFLOW_2024-12-10.md: Capture new IDs for task chaining
newDispensaryIds = promotionResult.newDispensaryIds;
if (newDispensaryIds.length > 0) {
console.log(` New store IDs for crawl: [${newDispensaryIds.join(', ')}]`);
}
if (promotionResult.rejectedRecords.length > 0) {
console.log(` Rejection reasons:`);
promotionResult.rejectedRecords.slice(0, 5).forEach(r => {
@@ -191,12 +201,132 @@ export async function runFullDiscovery(
}
}
// Step 5: Detect dropped stores (in DB but not in discovery results)
if (!dryRun) {
console.log('\n[Discovery] Step 5: Detecting dropped stores...');
const droppedResult = await detectDroppedStores(pool, stateCode);
if (droppedResult.droppedCount > 0) {
console.log(`[Discovery] Found ${droppedResult.droppedCount} dropped stores:`);
droppedResult.droppedStores.slice(0, 10).forEach(s => {
console.log(` - ${s.name} (${s.city}, ${s.state}) - last seen: ${s.lastSeenAt}`);
});
if (droppedResult.droppedCount > 10) {
console.log(` ... and ${droppedResult.droppedCount - 10} more`);
}
} else {
console.log(`[Discovery] No dropped stores detected`);
}
}
return {
cities: cityResult,
locations: locationResults,
totalLocationsFound,
totalLocationsUpserted,
durationMs,
// Per TASK_WORKFLOW_2024-12-10.md: Return new IDs for task chaining
newDispensaryIds,
};
}
// ============================================================
// DROPPED STORE DETECTION
// ============================================================
export interface DroppedStoreResult {
droppedCount: number;
droppedStores: Array<{
id: number;
name: string;
city: string;
state: string;
platformDispensaryId: string;
lastSeenAt: string;
}>;
}
/**
* Detect stores that exist in dispensaries but were not found in discovery.
* Marks them as status='dropped' for manual review.
*
* A store is considered "dropped" if:
* 1. It has a platform_dispensary_id (was verified via Dutchie)
* 2. It was NOT seen in the latest discovery crawl (last_seen_at in discovery < 24h ago)
* 3. It's currently marked as 'open' status
*/
export async function detectDroppedStores(
pool: Pool,
stateCode?: string
): Promise<DroppedStoreResult> {
// Find dispensaries that:
// 1. Have platform_dispensary_id (verified Dutchie stores)
// 2. Are currently 'open' status
// 3. Have a linked discovery record that wasn't seen in the last discovery run
// (last_seen_at in dutchie_discovery_locations is older than 24 hours)
const params: any[] = [];
let stateFilter = '';
if (stateCode) {
stateFilter = ` AND d.state = $1`;
params.push(stateCode);
}
const query = `
WITH recently_seen AS (
SELECT DISTINCT platform_location_id
FROM dutchie_discovery_locations
WHERE last_seen_at > NOW() - INTERVAL '24 hours'
AND active = true
)
SELECT
d.id,
d.name,
d.city,
d.state,
d.platform_dispensary_id,
d.updated_at as last_seen_at
FROM dispensaries d
WHERE d.platform_dispensary_id IS NOT NULL
AND d.platform = 'dutchie'
AND (d.status = 'open' OR d.status IS NULL)
AND d.crawl_enabled = true
AND d.platform_dispensary_id NOT IN (SELECT platform_location_id FROM recently_seen)
${stateFilter}
ORDER BY d.name
`;
const result = await pool.query(query, params);
const droppedStores = result.rows;
// Mark these stores as 'dropped' status
if (droppedStores.length > 0) {
const ids = droppedStores.map(s => s.id);
await pool.query(`
UPDATE dispensaries
SET status = 'dropped', updated_at = NOW()
WHERE id = ANY($1::int[])
`, [ids]);
// Log to promotion log for audit
for (const store of droppedStores) {
await pool.query(`
INSERT INTO dutchie_promotion_log
(dispensary_id, action, state_code, store_name, triggered_by)
VALUES ($1, 'dropped', $2, $3, 'discovery_detection')
`, [store.id, store.state, store.name]);
}
}
return {
droppedCount: droppedStores.length,
droppedStores: droppedStores.map(s => ({
id: s.id,
name: s.name,
city: s.city,
state: s.state,
platformDispensaryId: s.platform_dispensary_id,
lastSeenAt: s.last_seen_at,
})),
};
}

View File

@@ -127,6 +127,8 @@ export interface PromotionSummary {
errors: string[];
}>;
durationMs: number;
// Per TASK_WORKFLOW_2024-12-10.md: Track new dispensary IDs for task chaining
newDispensaryIds: number[];
}
/**
@@ -469,6 +471,8 @@ export async function promoteDiscoveredLocations(
const results: PromotionResult[] = [];
const rejectedRecords: PromotionSummary['rejectedRecords'] = [];
// Per TASK_WORKFLOW_2024-12-10.md: Track new dispensary IDs for task chaining
const newDispensaryIds: number[] = [];
let created = 0;
let updated = 0;
let skipped = 0;
@@ -525,6 +529,8 @@ export async function promoteDiscoveredLocations(
if (promotionResult.action === 'created') {
created++;
// Per TASK_WORKFLOW_2024-12-10.md: Track new IDs for task chaining
newDispensaryIds.push(promotionResult.dispensaryId);
} else {
updated++;
}
@@ -548,6 +554,8 @@ export async function promoteDiscoveredLocations(
results,
rejectedRecords,
durationMs: Date.now() - startTime,
// Per TASK_WORKFLOW_2024-12-10.md: Return new IDs for task chaining
newDispensaryIds,
};
}

View File

@@ -211,6 +211,8 @@ export interface FullDiscoveryResult {
totalLocationsFound: number;
totalLocationsUpserted: number;
durationMs: number;
// Per TASK_WORKFLOW_2024-12-10.md: Track new dispensary IDs for task chaining
newDispensaryIds?: number[];
}
// ============================================================

View File

@@ -16,6 +16,12 @@ import {
NormalizedBrand,
NormalizationResult,
} from './types';
import {
downloadProductImage,
ProductImageContext,
isImageStorageReady,
LocalImageSizes,
} from '../utils/image-storage';
const BATCH_SIZE = 100;
@@ -23,10 +29,21 @@ const BATCH_SIZE = 100;
// PRODUCT UPSERTS
// ============================================================
export interface NewProductInfo {
id: number; // store_products.id
externalProductId: string; // provider_product_id
name: string;
brandName: string | null;
primaryImageUrl: string | null;
hasLocalImage?: boolean; // True if local_image_path is already set
}
export interface UpsertProductsResult {
upserted: number;
new: number;
updated: number;
newProducts: NewProductInfo[]; // Details of newly created products
productsNeedingImages: NewProductInfo[]; // Products (new or updated) that need image downloads
}
/**
@@ -41,12 +58,14 @@ export async function upsertStoreProducts(
options: { dryRun?: boolean } = {}
): Promise<UpsertProductsResult> {
if (products.length === 0) {
return { upserted: 0, new: 0, updated: 0 };
return { upserted: 0, new: 0, updated: 0, newProducts: [], productsNeedingImages: [] };
}
const { dryRun = false } = options;
let newCount = 0;
let updatedCount = 0;
const newProducts: NewProductInfo[] = [];
const productsNeedingImages: NewProductInfo[] = [];
// Process in batches
for (let i = 0; i < products.length; i += BATCH_SIZE) {
@@ -71,7 +90,7 @@ export async function upsertStoreProducts(
name_raw, brand_name_raw, category_raw, subcategory_raw,
price_rec, price_med, price_rec_special, price_med_special,
is_on_special, discount_percent,
is_in_stock, stock_status,
is_in_stock, stock_status, stock_quantity, total_quantity_available,
thc_percent, cbd_percent,
image_url,
first_seen_at, last_seen_at, updated_at
@@ -80,9 +99,9 @@ export async function upsertStoreProducts(
$5, $6, $7, $8,
$9, $10, $11, $12,
$13, $14,
$15, $16,
$17, $18,
$19,
$15, $16, $17, $17,
$18, $19,
$20,
NOW(), NOW(), NOW()
)
ON CONFLICT (dispensary_id, provider, provider_product_id)
@@ -99,12 +118,14 @@ export async function upsertStoreProducts(
discount_percent = EXCLUDED.discount_percent,
is_in_stock = EXCLUDED.is_in_stock,
stock_status = EXCLUDED.stock_status,
stock_quantity = EXCLUDED.stock_quantity,
total_quantity_available = EXCLUDED.total_quantity_available,
thc_percent = EXCLUDED.thc_percent,
cbd_percent = EXCLUDED.cbd_percent,
image_url = EXCLUDED.image_url,
last_seen_at = NOW(),
updated_at = NOW()
RETURNING (xmax = 0) as is_new`,
RETURNING id, (xmax = 0) as is_new, (local_image_path IS NOT NULL) as has_local_image`,
[
product.dispensaryId,
product.platform,
@@ -122,6 +143,7 @@ export async function upsertStoreProducts(
productPricing?.discountPercent,
productAvailability?.inStock ?? true,
productAvailability?.stockStatus || 'unknown',
productAvailability?.quantity ?? null, // stock_quantity and total_quantity_available
// Clamp THC/CBD to valid percentage range (0-100) - some products report mg as %
product.thcPercent !== null && product.thcPercent <= 100 ? product.thcPercent : null,
product.cbdPercent !== null && product.cbdPercent <= 100 ? product.cbdPercent : null,
@@ -129,10 +151,30 @@ export async function upsertStoreProducts(
]
);
if (result.rows[0]?.is_new) {
const row = result.rows[0];
const productInfo: NewProductInfo = {
id: row.id,
externalProductId: product.externalProductId,
name: product.name,
brandName: product.brandName,
primaryImageUrl: product.primaryImageUrl,
hasLocalImage: row.has_local_image,
};
if (row.is_new) {
newCount++;
// Track new products
newProducts.push(productInfo);
// New products always need images (if they have a source URL)
if (product.primaryImageUrl && !row.has_local_image) {
productsNeedingImages.push(productInfo);
}
} else {
updatedCount++;
// Updated products need images only if they don't have a local image yet
if (product.primaryImageUrl && !row.has_local_image) {
productsNeedingImages.push(productInfo);
}
}
}
@@ -149,6 +191,8 @@ export async function upsertStoreProducts(
upserted: newCount + updatedCount,
new: newCount,
updated: updatedCount,
newProducts,
productsNeedingImages,
};
}
@@ -564,6 +608,19 @@ export async function upsertBrands(
// FULL HYDRATION
// ============================================================
export interface ImageDownloadResult {
downloaded: number;
skipped: number;
failed: number;
bytesTotal: number;
}
export interface DispensaryContext {
stateCode: string;
storeSlug: string;
hasExistingProducts?: boolean; // True if store already has products with local images
}
export interface HydratePayloadResult {
productsUpserted: number;
productsNew: number;
@@ -574,6 +631,154 @@ export interface HydratePayloadResult {
variantsUpserted: number;
variantsNew: number;
variantSnapshotsCreated: number;
imagesDownloaded: number;
imagesSkipped: number;
imagesFailed: number;
imagesBytesTotal: number;
}
/**
* Helper to create slug from string
*/
function slugify(str: string): string {
return str
.toLowerCase()
.replace(/[^a-z0-9]+/g, '-')
.replace(/^-+|-+$/g, '')
.substring(0, 50) || 'unknown';
}
/**
* Download images for new products and update their local paths
*/
export async function downloadProductImages(
pool: Pool,
newProducts: NewProductInfo[],
dispensaryContext: DispensaryContext,
options: { dryRun?: boolean; concurrency?: number } = {}
): Promise<ImageDownloadResult> {
const { dryRun = false, concurrency = 5 } = options;
// Filter products that have images to download
const productsWithImages = newProducts.filter(p => p.primaryImageUrl);
if (productsWithImages.length === 0) {
return { downloaded: 0, skipped: 0, failed: 0, bytesTotal: 0 };
}
// Check if image storage is ready
if (!isImageStorageReady()) {
console.warn('[ImageDownload] Image storage not initialized, skipping downloads');
return { downloaded: 0, skipped: productsWithImages.length, failed: 0, bytesTotal: 0 };
}
if (dryRun) {
console.log(`[DryRun] Would download ${productsWithImages.length} images`);
return { downloaded: 0, skipped: productsWithImages.length, failed: 0, bytesTotal: 0 };
}
let downloaded = 0;
let skipped = 0;
let failed = 0;
let bytesTotal = 0;
// Process in batches with concurrency limit
for (let i = 0; i < productsWithImages.length; i += concurrency) {
const batch = productsWithImages.slice(i, i + concurrency);
const results = await Promise.allSettled(
batch.map(async (product) => {
const ctx: ProductImageContext = {
stateCode: dispensaryContext.stateCode,
storeSlug: dispensaryContext.storeSlug,
brandSlug: slugify(product.brandName || 'unknown'),
productId: product.externalProductId,
};
const result = await downloadProductImage(product.primaryImageUrl!, ctx, { skipIfExists: true });
if (result.success) {
// Update the database with local image path
const imagesJson = JSON.stringify({
full: result.urls!.full,
medium: result.urls!.medium,
thumb: result.urls!.thumb,
});
await pool.query(
`UPDATE store_products
SET local_image_path = $1, images = $2
WHERE id = $3`,
[result.urls!.full, imagesJson, product.id]
);
}
return result;
})
);
for (const result of results) {
if (result.status === 'fulfilled') {
const downloadResult = result.value;
if (downloadResult.success) {
if (downloadResult.skipped) {
skipped++;
} else {
downloaded++;
bytesTotal += downloadResult.bytesDownloaded || 0;
}
} else {
failed++;
console.warn(`[ImageDownload] Failed: ${downloadResult.error}`);
}
} else {
failed++;
console.error(`[ImageDownload] Error:`, result.reason);
}
}
}
console.log(`[ImageDownload] Downloaded: ${downloaded}, Skipped: ${skipped}, Failed: ${failed}, Bytes: ${bytesTotal}`);
return { downloaded, skipped, failed, bytesTotal };
}
/**
* Get dispensary context for image paths
* Also checks if this dispensary already has products with local images
* to skip unnecessary filesystem checks for existing stores
*/
async function getDispensaryContext(pool: Pool, dispensaryId: number): Promise<DispensaryContext | null> {
try {
const result = await pool.query(
`SELECT
d.state,
d.slug,
d.name,
EXISTS(
SELECT 1 FROM store_products sp
WHERE sp.dispensary_id = d.id
AND sp.local_image_path IS NOT NULL
LIMIT 1
) as has_local_images
FROM dispensaries d
WHERE d.id = $1`,
[dispensaryId]
);
if (result.rows.length === 0) {
return null;
}
const row = result.rows[0];
return {
stateCode: row.state || 'unknown',
storeSlug: row.slug || slugify(row.name || `store-${dispensaryId}`),
hasExistingProducts: row.has_local_images,
};
} catch (error) {
console.error('[getDispensaryContext] Error:', error);
return null;
}
}
/**
@@ -584,9 +789,9 @@ export async function hydrateToCanonical(
dispensaryId: number,
normResult: NormalizationResult,
crawlRunId: number | null,
options: { dryRun?: boolean } = {}
options: { dryRun?: boolean; downloadImages?: boolean } = {}
): Promise<HydratePayloadResult> {
const { dryRun = false } = options;
const { dryRun = false, downloadImages: shouldDownloadImages = true } = options;
// 1. Upsert brands
const brandResult = await upsertBrands(pool, normResult.brands, { dryRun });
@@ -634,6 +839,36 @@ export async function hydrateToCanonical(
{ dryRun }
);
// 6. Download images for products that need them
// This includes:
// - New products (always need images)
// - Updated products that don't have local images yet (backfill)
// This avoids:
// - Filesystem checks for products that already have local images
// - Unnecessary HTTP requests for products with existing images
let imageResult: ImageDownloadResult = { downloaded: 0, skipped: 0, failed: 0, bytesTotal: 0 };
if (shouldDownloadImages && productResult.productsNeedingImages.length > 0) {
const dispensaryContext = await getDispensaryContext(pool, dispensaryId);
if (dispensaryContext) {
const newCount = productResult.productsNeedingImages.filter(p => !p.hasLocalImage).length;
const backfillCount = productResult.productsNeedingImages.length - newCount;
console.log(`[Hydration] Downloading images for ${productResult.productsNeedingImages.length} products (${productResult.new} new, ${backfillCount} backfill)...`);
imageResult = await downloadProductImages(
pool,
productResult.productsNeedingImages,
dispensaryContext,
{ dryRun }
);
} else {
console.warn(`[Hydration] Could not get dispensary context for ID ${dispensaryId}, skipping image downloads`);
}
} else if (productResult.productsNeedingImages.length === 0 && productResult.upserted > 0) {
// All products already have local images
console.log(`[Hydration] All ${productResult.upserted} products already have local images, skipping downloads`);
}
return {
productsUpserted: productResult.upserted,
productsNew: productResult.new,
@@ -644,5 +879,9 @@ export async function hydrateToCanonical(
variantsUpserted: variantResult.upserted,
variantsNew: variantResult.new,
variantSnapshotsCreated: variantResult.snapshotsCreated,
imagesDownloaded: imageResult.downloaded,
imagesSkipped: imageResult.skipped,
imagesFailed: imageResult.failed,
imagesBytesTotal: imageResult.bytesTotal,
};
}

View File

@@ -6,7 +6,12 @@ import { initializeMinio, isMinioEnabled } from './utils/minio';
import { initializeImageStorage } from './utils/image-storage';
import { logger } from './services/logger';
import { cleanupOrphanedJobs } from './services/proxyTestQueue';
// Per TASK_WORKFLOW_2024-12-10.md: Database-driven task scheduler
import { taskScheduler } from './services/task-scheduler';
import { runAutoMigrations } from './db/auto-migrate';
import { getPool } from './db/pool';
import healthRoutes from './routes/health';
import imageProxyRoutes from './routes/image-proxy';
dotenv.config();
@@ -29,6 +34,10 @@ app.use(express.json());
const LOCAL_IMAGES_PATH = process.env.LOCAL_IMAGES_PATH || './public/images';
app.use('/images', express.static(LOCAL_IMAGES_PATH));
// Image proxy with on-demand resizing
// Usage: /img/products/az/store/brand/product/image.webp?w=200&h=200
app.use('/img', imageProxyRoutes);
// Serve static downloads (plugin files, etc.)
// Uses ./public/downloads relative to working directory (works for both Docker and local dev)
const LOCAL_DOWNLOADS_PATH = process.env.LOCAL_DOWNLOADS_PATH || './public/downloads';
@@ -100,8 +109,9 @@ import scraperMonitorRoutes from './routes/scraper-monitor';
import apiTokensRoutes from './routes/api-tokens';
import apiPermissionsRoutes from './routes/api-permissions';
import parallelScrapeRoutes from './routes/parallel-scrape';
import crawlerSandboxRoutes from './routes/crawler-sandbox';
// crawler-sandbox moved to _deprecated
import versionRoutes from './routes/version';
import deployStatusRoutes from './routes/deploy-status';
import publicApiRoutes from './routes/public-api';
import usersRoutes from './routes/users';
import staleProcessesRoutes from './routes/stale-processes';
@@ -121,7 +131,6 @@ import { createStatesRouter } from './routes/states';
import { createAnalyticsV2Router } from './routes/analytics-v2';
import { createDiscoveryRoutes } from './discovery';
import pipelineRoutes from './routes/pipeline';
import { getPool } from './db/pool';
// Consumer API routes (findadispo.com, findagram.co)
import consumerAuthRoutes from './routes/consumer-auth';
@@ -133,6 +142,11 @@ import eventsRoutes from './routes/events';
import clickAnalyticsRoutes from './routes/click-analytics';
import seoRoutes from './routes/seo';
import priceAnalyticsRoutes from './routes/price-analytics';
import tasksRoutes from './routes/tasks';
import workerRegistryRoutes from './routes/worker-registry';
// Per TASK_WORKFLOW_2024-12-10.md: Raw payload access API
import payloadsRoutes from './routes/payloads';
import k8sRoutes from './routes/k8s';
// Mark requests from trusted domains (cannaiq.co, findagram.co, findadispo.com)
// These domains can access the API without authentication
@@ -173,8 +187,10 @@ app.use('/api/scraper-monitor', scraperMonitorRoutes);
app.use('/api/api-tokens', apiTokensRoutes);
app.use('/api/api-permissions', apiPermissionsRoutes);
app.use('/api/parallel-scrape', parallelScrapeRoutes);
app.use('/api/crawler-sandbox', crawlerSandboxRoutes);
// crawler-sandbox moved to _deprecated
app.use('/api/version', versionRoutes);
app.use('/api/admin/deploy-status', deployStatusRoutes);
console.log('[DeployStatus] Routes registered at /api/admin/deploy-status');
app.use('/api/users', usersRoutes);
app.use('/api/stale-processes', staleProcessesRoutes);
// Admin routes - orchestrator actions
@@ -203,6 +219,22 @@ app.use('/api/monitor', workersRoutes);
app.use('/api/job-queue', jobQueueRoutes);
console.log('[Workers] Routes registered at /api/workers, /api/monitor, and /api/job-queue');
// Task queue management - worker tasks with capacity planning
app.use('/api/tasks', tasksRoutes);
console.log('[Tasks] Routes registered at /api/tasks');
// Worker registry - dynamic worker registration, heartbeats, and name management
app.use('/api/worker-registry', workerRegistryRoutes);
console.log('[WorkerRegistry] Routes registered at /api/worker-registry');
// Per TASK_WORKFLOW_2024-12-10.md: Raw payload access API
app.use('/api/payloads', payloadsRoutes);
console.log('[Payloads] Routes registered at /api/payloads');
// K8s control routes - worker scaling from admin UI
app.use('/api/k8s', k8sRoutes);
console.log('[K8s] Routes registered at /api/k8s');
// Phase 3: Analytics V2 - Enhanced analytics with rec/med state segmentation
try {
const analyticsV2Router = createAnalyticsV2Router(getPool());
@@ -289,6 +321,17 @@ async function startServer() {
try {
logger.info('system', 'Starting server...');
// Run auto-migrations before anything else
const pool = getPool();
const migrationsApplied = await runAutoMigrations(pool);
if (migrationsApplied > 0) {
logger.info('system', `Applied ${migrationsApplied} database migrations`);
} else if (migrationsApplied === 0) {
logger.info('system', 'Database schema up to date');
} else {
logger.warn('system', 'Some migrations failed - check logs');
}
await initializeMinio();
await initializeImageStorage();
logger.info('system', isMinioEnabled() ? 'MinIO storage initialized' : 'Local filesystem storage initialized');
@@ -296,6 +339,17 @@ async function startServer() {
// Clean up any orphaned proxy test jobs from previous server runs
await cleanupOrphanedJobs();
// Per TASK_WORKFLOW_2024-12-10.md: Start database-driven task scheduler
// This replaces node-cron - schedules are stored in DB and survive restarts
// Uses SELECT FOR UPDATE SKIP LOCKED for multi-replica safety
try {
await taskScheduler.start();
logger.info('system', 'Task scheduler started');
} catch (err: any) {
// Non-fatal - scheduler can recover on next poll
logger.warn('system', `Task scheduler startup warning: ${err.message}`);
}
app.listen(PORT, () => {
logger.info('system', `Server running on port ${PORT}`);
console.log(`🚀 Server running on port ${PORT}`);

View File

@@ -5,8 +5,8 @@ import { Request, Response, NextFunction } from 'express';
* These are our own frontends that should have unrestricted access.
*/
const TRUSTED_DOMAINS = [
'cannaiq.co',
'www.cannaiq.co',
'*.cannaiq.co',
'*.cannabrands.app',
'findagram.co',
'www.findagram.co',
'findadispo.com',
@@ -32,6 +32,24 @@ function extractDomain(header: string): string | null {
}
}
/**
* Checks if a domain matches any trusted domain (supports *.domain.com wildcards)
*/
function isTrustedDomain(domain: string): boolean {
for (const trusted of TRUSTED_DOMAINS) {
if (trusted.startsWith('*.')) {
// Wildcard: *.example.com matches example.com and any subdomain
const baseDomain = trusted.slice(2);
if (domain === baseDomain || domain.endsWith('.' + baseDomain)) {
return true;
}
} else if (domain === trusted) {
return true;
}
}
return false;
}
/**
* Checks if the request comes from a trusted domain
*/
@@ -42,7 +60,7 @@ function isRequestFromTrustedDomain(req: Request): boolean {
// Check Origin header first (preferred for CORS requests)
if (origin) {
const domain = extractDomain(origin);
if (domain && TRUSTED_DOMAINS.includes(domain)) {
if (domain && isTrustedDomain(domain)) {
return true;
}
}
@@ -50,7 +68,7 @@ function isRequestFromTrustedDomain(req: Request): boolean {
// Fallback to Referer header
if (referer) {
const domain = extractDomain(referer);
if (domain && TRUSTED_DOMAINS.includes(domain)) {
if (domain && isTrustedDomain(domain)) {
return true;
}
}

View File

@@ -702,12 +702,10 @@ export class StateQueryService {
async getNationalSummary(): Promise<NationalSummary> {
const stateMetrics = await this.getAllStateMetrics();
// Get all states count and aggregate metrics
const result = await this.pool.query(`
SELECT
COUNT(DISTINCT s.code) AS total_states,
COUNT(DISTINCT CASE WHEN EXISTS (
SELECT 1 FROM dispensaries d WHERE d.state = s.code AND d.menu_type IS NOT NULL
) THEN s.code END) AS active_states,
(SELECT COUNT(*) FROM dispensaries WHERE state IS NOT NULL) AS total_stores,
(SELECT COUNT(*) FROM store_products sp
JOIN dispensaries d ON sp.dispensary_id = d.id
@@ -725,7 +723,7 @@ export class StateQueryService {
return {
totalStates: parseInt(data.total_states),
activeStates: parseInt(data.active_states),
activeStates: parseInt(data.total_states), // Same as totalStates - all states shown
totalStores: parseInt(data.total_stores),
totalProducts: parseInt(data.total_products),
totalBrands: parseInt(data.total_brands),

View File

@@ -5,22 +5,35 @@
*
* DO NOT MODIFY THIS FILE WITHOUT EXPLICIT AUTHORIZATION.
*
* This is the canonical HTTP client for all Dutchie communication.
* All Dutchie workers (Alice, Bella, etc.) MUST use this client.
* Updated: 2025-12-10 per workflow-12102025.md
*
* KEY BEHAVIORS (per workflow-12102025.md):
* 1. startSession() gets identity from PROXY LOCATION, not task params
* 2. On 403: immediately get new IP + new fingerprint, then retry
* 3. After 3 consecutive 403s on same proxy → disable it (burned)
* 4. Language is always English (en-US)
*
* IMPLEMENTATION:
* - Uses curl via child_process.execSync (bypasses TLS fingerprinting)
* - NO Puppeteer, NO axios, NO fetch
* - Fingerprint rotation on 403
* - Uses intoli/user-agents via CrawlRotator for realistic fingerprints
* - Residential IP compatible
*
* USAGE:
* import { curlPost, curlGet, executeGraphQL } from '@dutchie/client';
* import { curlPost, curlGet, executeGraphQL, startSession } from '@dutchie/client';
*
* ============================================================
*/
import { execSync } from 'child_process';
import {
buildOrderedHeaders,
buildRefererFromMenuUrl,
getCurlBinary,
isCurlImpersonateAvailable,
HeaderContext,
BrowserType,
} from '../../services/http-fingerprint';
// ============================================================
// TYPES
@@ -32,6 +45,8 @@ export interface CurlResponse {
error?: string;
}
// Per workflow-12102025.md: fingerprint comes from CrawlRotator's BrowserFingerprint
// We keep a simplified interface here for header building
export interface Fingerprint {
userAgent: string;
acceptLanguage: string;
@@ -57,15 +72,13 @@ export const DUTCHIE_CONFIG = {
// ============================================================
// PROXY SUPPORT
// ============================================================
// Integrates with the CrawlRotator system from proxy-rotator.ts
// On 403 errors:
// 1. Record failure on current proxy
// 2. Rotate to next proxy
// 3. Retry with new proxy
// Per workflow-12102025.md:
// - On 403: recordBlock() → increment consecutive_403_count
// - After 3 consecutive 403s → proxy disabled
// - Immediately rotate to new IP + new fingerprint on 403
// ============================================================
import type { CrawlRotator, Proxy } from '../../services/crawl-rotator';
import type { CrawlRotator, BrowserFingerprint } from '../../services/crawl-rotator';
let currentProxy: string | null = null;
let crawlRotator: CrawlRotator | null = null;
@@ -92,13 +105,12 @@ export function getProxy(): string | null {
/**
* Set CrawlRotator for proxy rotation on 403s
* This enables automatic proxy rotation when blocked
* Per workflow-12102025.md: enables automatic rotation when blocked
*/
export function setCrawlRotator(rotator: CrawlRotator | null): void {
crawlRotator = rotator;
if (rotator) {
console.log('[Dutchie Client] CrawlRotator attached - proxy rotation enabled');
// Set initial proxy from rotator
const proxy = rotator.proxy.getCurrent();
if (proxy) {
currentProxy = rotator.proxy.getProxyUrl(proxy);
@@ -115,30 +127,41 @@ export function getCrawlRotator(): CrawlRotator | null {
}
/**
* Rotate to next proxy (called on 403)
* Handle 403 block - per workflow-12102025.md:
* 1. Record block on current proxy (increments consecutive_403_count)
* 2. Immediately rotate to new proxy (new IP)
* 3. Rotate fingerprint
* Returns false if no more proxies available
*/
async function rotateProxyOn403(error?: string): Promise<boolean> {
async function handle403Block(): Promise<boolean> {
if (!crawlRotator) {
console.warn('[Dutchie Client] No CrawlRotator - cannot handle 403');
return false;
}
// Record failure on current proxy
await crawlRotator.recordFailure(error || '403 Forbidden');
// Per workflow-12102025.md: record block (tracks consecutive 403s)
const wasDisabled = await crawlRotator.recordBlock();
if (wasDisabled) {
console.log('[Dutchie Client] Current proxy was disabled (3 consecutive 403s)');
}
// Per workflow-12102025.md: immediately get new IP + new fingerprint
const { proxy: nextProxy, fingerprint } = crawlRotator.rotateBoth();
// Rotate to next proxy
const nextProxy = crawlRotator.rotateProxy();
if (nextProxy) {
currentProxy = crawlRotator.proxy.getProxyUrl(nextProxy);
console.log(`[Dutchie Client] Rotated proxy: ${currentProxy.replace(/:[^:@]+@/, ':***@')}`);
console.log(`[Dutchie Client] Rotated to new proxy: ${currentProxy.replace(/:[^:@]+@/, ':***@')}`);
console.log(`[Dutchie Client] New fingerprint: ${fingerprint.userAgent.slice(0, 50)}...`);
return true;
}
console.warn('[Dutchie Client] No more proxies available');
console.error('[Dutchie Client] No more proxies available!');
return false;
}
/**
* Record success on current proxy
* Per workflow-12102025.md: resets consecutive_403_count
*/
async function recordProxySuccess(responseTimeMs?: number): Promise<void> {
if (crawlRotator) {
@@ -163,69 +186,89 @@ export const GRAPHQL_HASHES = {
};
// ============================================================
// FINGERPRINTS - Browser profiles for anti-detect
// SESSION MANAGEMENT
// Per workflow-12102025.md:
// - Session identity comes from PROXY LOCATION
// - NOT from task params (no stateCode/timezone params)
// - Language is always English
// ============================================================
const FINGERPRINTS: Fingerprint[] = [
// Chrome Windows (latest) - typical residential user, use first
{
userAgent: 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36',
acceptLanguage: 'en-US,en;q=0.9',
secChUa: '"Google Chrome";v="131", "Chromium";v="131", "Not_A Brand";v="24"',
secChUaPlatform: '"Windows"',
secChUaMobile: '?0',
},
// Chrome Mac (latest)
{
userAgent: 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36',
acceptLanguage: 'en-US,en;q=0.9',
secChUa: '"Google Chrome";v="131", "Chromium";v="131", "Not_A Brand";v="24"',
secChUaPlatform: '"macOS"',
secChUaMobile: '?0',
},
// Chrome Windows (120)
{
userAgent: 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
acceptLanguage: 'en-US,en;q=0.9',
secChUa: '"Chromium";v="120", "Google Chrome";v="120", "Not-A.Brand";v="99"',
secChUaPlatform: '"Windows"',
secChUaMobile: '?0',
},
// Firefox Windows
{
userAgent: 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:133.0) Gecko/20100101 Firefox/133.0',
acceptLanguage: 'en-US,en;q=0.5',
},
// Safari Mac
{
userAgent: 'Mozilla/5.0 (Macintosh; Intel Mac OS X 14_2) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.2 Safari/605.1.15',
acceptLanguage: 'en-US,en;q=0.9',
},
// Edge Windows
{
userAgent: 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36 Edg/131.0.0.0',
acceptLanguage: 'en-US,en;q=0.9',
secChUa: '"Microsoft Edge";v="131", "Chromium";v="131", "Not_A Brand";v="24"',
secChUaPlatform: '"Windows"',
secChUaMobile: '?0',
},
];
let currentFingerprintIndex = 0;
export function getFingerprint(): Fingerprint {
return FINGERPRINTS[currentFingerprintIndex];
export interface CrawlSession {
sessionId: string;
fingerprint: BrowserFingerprint;
proxyUrl: string | null;
proxyTimezone?: string;
proxyState?: string;
startedAt: Date;
// Per workflow-12102025.md: Dynamic Referer per dispensary
menuUrl?: string;
referer: string;
}
export function rotateFingerprint(): Fingerprint {
currentFingerprintIndex = (currentFingerprintIndex + 1) % FINGERPRINTS.length;
const fp = FINGERPRINTS[currentFingerprintIndex];
console.log(`[Dutchie Client] Rotated to fingerprint: ${fp.userAgent.slice(0, 50)}...`);
return fp;
let currentSession: CrawlSession | null = null;
/**
* Start a new crawl session
*
* Per workflow-12102025.md:
* - NO state/timezone params - identity comes from proxy location
* - Gets fingerprint from CrawlRotator (uses intoli/user-agents)
* - Language is always English (en-US)
* - Dynamic Referer per dispensary (from menuUrl)
*
* @param menuUrl - The dispensary's menu URL for dynamic Referer header
*/
export function startSession(menuUrl?: string): CrawlSession {
if (!crawlRotator) {
throw new Error('[Dutchie Client] Cannot start session without CrawlRotator');
}
// Per workflow-12102025.md: get identity from proxy location
const proxyLocation = crawlRotator.getProxyLocation();
const fingerprint = crawlRotator.userAgent.getCurrent();
// Per workflow-12102025.md: Dynamic Referer per dispensary
const referer = buildRefererFromMenuUrl(menuUrl);
currentSession = {
sessionId: `session_${Date.now()}_${Math.random().toString(36).slice(2, 8)}`,
fingerprint,
proxyUrl: currentProxy,
proxyTimezone: proxyLocation?.timezone,
proxyState: proxyLocation?.state,
startedAt: new Date(),
menuUrl,
referer,
};
console.log(`[Dutchie Client] Started session ${currentSession.sessionId}`);
console.log(`[Dutchie Client] Browser: ${fingerprint.browserName} (${fingerprint.deviceCategory})`);
console.log(`[Dutchie Client] DNT: ${fingerprint.httpFingerprint.hasDNT ? 'enabled' : 'disabled'}`);
console.log(`[Dutchie Client] TLS: ${fingerprint.httpFingerprint.curlImpersonateBinary}`);
console.log(`[Dutchie Client] Referer: ${referer}`);
if (proxyLocation?.timezone) {
console.log(`[Dutchie Client] Proxy: ${proxyLocation.state || 'unknown'} (${proxyLocation.timezone})`);
}
return currentSession;
}
export function resetFingerprint(): void {
currentFingerprintIndex = 0;
/**
* End the current crawl session
*/
export function endSession(): void {
if (currentSession) {
const duration = Math.round((Date.now() - currentSession.startedAt.getTime()) / 1000);
console.log(`[Dutchie Client] Ended session ${currentSession.sessionId} (${duration}s)`);
currentSession = null;
}
}
/**
* Get current active session
*/
export function getCurrentSession(): CrawlSession | null {
return currentSession;
}
// ============================================================
@@ -233,48 +276,80 @@ export function resetFingerprint(): void {
// ============================================================
/**
* Build headers for Dutchie requests
* Per workflow-12102025.md: Build headers using HTTP fingerprint system
* Returns headers in browser-specific order with all natural variations
*/
export function buildHeaders(refererPath: string, fingerprint?: Fingerprint): Record<string, string> {
const fp = fingerprint || getFingerprint();
const refererUrl = `https://dutchie.com${refererPath}`;
const headers: Record<string, string> = {
'accept': 'application/json, text/plain, */*',
'accept-language': fp.acceptLanguage,
'content-type': 'application/json',
'origin': 'https://dutchie.com',
'referer': refererUrl,
'user-agent': fp.userAgent,
'apollographql-client-name': 'Marketplace (production)',
};
if (fp.secChUa) {
headers['sec-ch-ua'] = fp.secChUa;
headers['sec-ch-ua-mobile'] = fp.secChUaMobile || '?0';
headers['sec-ch-ua-platform'] = fp.secChUaPlatform || '"Windows"';
headers['sec-fetch-dest'] = 'empty';
headers['sec-fetch-mode'] = 'cors';
headers['sec-fetch-site'] = 'same-site';
export function buildHeaders(isPost: boolean, contentLength?: number): { headers: Record<string, string>; orderedHeaders: string[] } {
if (!currentSession || !crawlRotator) {
throw new Error('[Dutchie Client] Cannot build headers without active session');
}
return headers;
const fp = currentSession.fingerprint;
const httpFp = fp.httpFingerprint;
// Per workflow-12102025.md: Build context for ordered headers
const context: HeaderContext = {
userAgent: fp.userAgent,
secChUa: fp.secChUa,
secChUaPlatform: fp.secChUaPlatform,
secChUaMobile: fp.secChUaMobile,
referer: currentSession.referer,
isPost,
contentLength,
};
// Per workflow-12102025.md: Get ordered headers from HTTP fingerprint service
return buildOrderedHeaders(httpFp, context);
}
/**
* Execute HTTP POST using curl (bypasses TLS fingerprinting)
* Per workflow-12102025.md: Get curl binary for current session's browser
* Uses curl-impersonate for TLS fingerprint matching
*/
export function curlPost(url: string, body: any, headers: Record<string, string>, timeout = 30000): CurlResponse {
const filteredHeaders = Object.entries(headers)
.filter(([k]) => k.toLowerCase() !== 'accept-encoding')
.map(([k, v]) => `-H '${k}: ${v}'`)
function getCurlBinaryForSession(): string {
if (!currentSession) {
return 'curl'; // Fallback to standard curl
}
const browserType = currentSession.fingerprint.browserName as BrowserType;
// Per workflow-12102025.md: Check if curl-impersonate is available
if (isCurlImpersonateAvailable(browserType)) {
return getCurlBinary(browserType);
}
// Fallback to standard curl with warning
console.warn(`[Dutchie Client] curl-impersonate not available for ${browserType}, using standard curl`);
return 'curl';
}
/**
* Per workflow-12102025.md: Execute HTTP POST using curl/curl-impersonate
* - Uses browser-specific TLS fingerprint via curl-impersonate
* - Headers sent in browser-specific order
* - Dynamic Referer per dispensary
*/
export function curlPost(url: string, body: any, timeout = 30000): CurlResponse {
const bodyJson = JSON.stringify(body);
// Per workflow-12102025.md: Build ordered headers for POST request
const { headers, orderedHeaders } = buildHeaders(true, bodyJson.length);
// Per workflow-12102025.md: Build header args in browser-specific order
const headerArgs = orderedHeaders
.filter(h => h !== 'Host' && h !== 'Content-Length') // curl handles these
.map(h => `-H '${h}: ${headers[h]}'`)
.join(' ');
const bodyJson = JSON.stringify(body).replace(/'/g, "'\\''");
const bodyEscaped = bodyJson.replace(/'/g, "'\\''");
const timeoutSec = Math.ceil(timeout / 1000);
const separator = '___HTTP_STATUS___';
const proxyArg = getProxyArg();
const cmd = `curl -s --compressed ${proxyArg} -w '${separator}%{http_code}' --max-time ${timeoutSec} ${filteredHeaders} -d '${bodyJson}' '${url}'`;
// Per workflow-12102025.md: Use curl-impersonate for TLS fingerprint matching
const curlBinary = getCurlBinaryForSession();
const cmd = `${curlBinary} -s --compressed ${proxyArg} -w '${separator}%{http_code}' --max-time ${timeoutSec} ${headerArgs} -d '${bodyEscaped}' '${url}'`;
try {
const output = execSync(cmd, {
@@ -313,19 +388,29 @@ export function curlPost(url: string, body: any, headers: Record<string, string>
}
/**
* Execute HTTP GET using curl (bypasses TLS fingerprinting)
* Returns HTML or JSON depending on response content-type
* Per workflow-12102025.md: Execute HTTP GET using curl/curl-impersonate
* - Uses browser-specific TLS fingerprint via curl-impersonate
* - Headers sent in browser-specific order
* - Dynamic Referer per dispensary
*/
export function curlGet(url: string, headers: Record<string, string>, timeout = 30000): CurlResponse {
const filteredHeaders = Object.entries(headers)
.filter(([k]) => k.toLowerCase() !== 'accept-encoding')
.map(([k, v]) => `-H '${k}: ${v}'`)
export function curlGet(url: string, timeout = 30000): CurlResponse {
// Per workflow-12102025.md: Build ordered headers for GET request
const { headers, orderedHeaders } = buildHeaders(false);
// Per workflow-12102025.md: Build header args in browser-specific order
const headerArgs = orderedHeaders
.filter(h => h !== 'Host' && h !== 'Content-Length') // curl handles these
.map(h => `-H '${h}: ${headers[h]}'`)
.join(' ');
const timeoutSec = Math.ceil(timeout / 1000);
const separator = '___HTTP_STATUS___';
const proxyArg = getProxyArg();
const cmd = `curl -s --compressed ${proxyArg} -w '${separator}%{http_code}' --max-time ${timeoutSec} ${filteredHeaders} '${url}'`;
// Per workflow-12102025.md: Use curl-impersonate for TLS fingerprint matching
const curlBinary = getCurlBinaryForSession();
const cmd = `${curlBinary} -s --compressed ${proxyArg} -w '${separator}%{http_code}' --max-time ${timeoutSec} ${headerArgs} '${url}'`;
try {
const output = execSync(cmd, {
@@ -345,7 +430,6 @@ export function curlGet(url: string, headers: Record<string, string>, timeout =
const responseBody = output.slice(0, separatorIndex);
const statusCode = parseInt(output.slice(separatorIndex + separator.length).trim(), 10);
// Try to parse as JSON, otherwise return as string (HTML)
try {
return { status: statusCode, data: JSON.parse(responseBody) };
} catch {
@@ -362,16 +446,22 @@ export function curlGet(url: string, headers: Record<string, string>, timeout =
// ============================================================
// GRAPHQL EXECUTION
// Per workflow-12102025.md:
// - On 403: immediately rotate IP + fingerprint (no delay first)
// - Then retry
// ============================================================
export interface ExecuteGraphQLOptions {
maxRetries?: number;
retryOn403?: boolean;
cName?: string; // Optional - used for Referer header, defaults to 'cities'
cName?: string;
}
/**
* Execute GraphQL query with curl (bypasses TLS fingerprinting)
* Per workflow-12102025.md: Execute GraphQL query with curl/curl-impersonate
* - Uses browser-specific TLS fingerprint
* - Headers in browser-specific order
* - On 403: immediately rotate IP + fingerprint, then retry
*/
export async function executeGraphQL(
operationName: string,
@@ -379,7 +469,12 @@ export async function executeGraphQL(
hash: string,
options: ExecuteGraphQLOptions
): Promise<any> {
const { maxRetries = 3, retryOn403 = true, cName = 'cities' } = options;
const { maxRetries = 3, retryOn403 = true } = options;
// Per workflow-12102025.md: Session must be active for requests
if (!currentSession) {
throw new Error('[Dutchie Client] Cannot execute GraphQL without active session - call startSession() first');
}
const body = {
operationName,
@@ -393,14 +488,14 @@ export async function executeGraphQL(
let attempt = 0;
while (attempt <= maxRetries) {
const fingerprint = getFingerprint();
const headers = buildHeaders(`/embedded-menu/${cName}`, fingerprint);
console.log(`[Dutchie Client] curl POST ${operationName} (attempt ${attempt + 1}/${maxRetries + 1})`);
const response = curlPost(DUTCHIE_CONFIG.graphqlEndpoint, body, headers, DUTCHIE_CONFIG.timeout);
const startTime = Date.now();
// Per workflow-12102025.md: curlPost now uses ordered headers and curl-impersonate
const response = curlPost(DUTCHIE_CONFIG.graphqlEndpoint, body, DUTCHIE_CONFIG.timeout);
const responseTime = Date.now() - startTime;
console.log(`[Dutchie Client] Response status: ${response.status}`);
console.log(`[Dutchie Client] Response status: ${response.status} (${responseTime}ms)`);
if (response.error) {
console.error(`[Dutchie Client] curl error: ${response.error}`);
@@ -413,6 +508,9 @@ export async function executeGraphQL(
}
if (response.status === 200) {
// Per workflow-12102025.md: success resets consecutive 403 count
await recordProxySuccess(responseTime);
if (response.data?.errors?.length > 0) {
console.warn(`[Dutchie Client] GraphQL errors: ${JSON.stringify(response.data.errors[0])}`);
}
@@ -420,10 +518,20 @@ export async function executeGraphQL(
}
if (response.status === 403 && retryOn403) {
console.warn(`[Dutchie Client] 403 blocked - rotating fingerprint...`);
rotateFingerprint();
// Per workflow-12102025.md: immediately rotate IP + fingerprint
console.warn(`[Dutchie Client] 403 blocked - immediately rotating proxy + fingerprint...`);
const hasMoreProxies = await handle403Block();
if (!hasMoreProxies) {
throw new Error('All proxies exhausted - no more IPs available');
}
// Per workflow-12102025.md: Update session referer after rotation
currentSession.referer = buildRefererFromMenuUrl(currentSession.menuUrl);
attempt++;
await sleep(1000 * attempt);
// Per workflow-12102025.md: small backoff after rotation
await sleep(500);
continue;
}
@@ -452,8 +560,10 @@ export interface FetchPageOptions {
}
/**
* Fetch HTML page from Dutchie (for city pages, dispensary pages, etc.)
* Returns raw HTML string
* Per workflow-12102025.md: Fetch HTML page from Dutchie
* - Uses browser-specific TLS fingerprint
* - Headers in browser-specific order
* - Same 403 handling as GraphQL
*/
export async function fetchPage(
path: string,
@@ -462,32 +572,22 @@ export async function fetchPage(
const { maxRetries = 3, retryOn403 = true } = options;
const url = `${DUTCHIE_CONFIG.baseUrl}${path}`;
// Per workflow-12102025.md: Session must be active for requests
if (!currentSession) {
throw new Error('[Dutchie Client] Cannot fetch page without active session - call startSession() first');
}
let attempt = 0;
while (attempt <= maxRetries) {
const fingerprint = getFingerprint();
const headers: Record<string, string> = {
'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8',
'accept-language': fingerprint.acceptLanguage,
'user-agent': fingerprint.userAgent,
};
if (fingerprint.secChUa) {
headers['sec-ch-ua'] = fingerprint.secChUa;
headers['sec-ch-ua-mobile'] = fingerprint.secChUaMobile || '?0';
headers['sec-ch-ua-platform'] = fingerprint.secChUaPlatform || '"Windows"';
headers['sec-fetch-dest'] = 'document';
headers['sec-fetch-mode'] = 'navigate';
headers['sec-fetch-site'] = 'none';
headers['sec-fetch-user'] = '?1';
headers['upgrade-insecure-requests'] = '1';
}
// Per workflow-12102025.md: curlGet now uses ordered headers and curl-impersonate
console.log(`[Dutchie Client] curl GET ${path} (attempt ${attempt + 1}/${maxRetries + 1})`);
const response = curlGet(url, headers, DUTCHIE_CONFIG.timeout);
const startTime = Date.now();
const response = curlGet(url, DUTCHIE_CONFIG.timeout);
const responseTime = Date.now() - startTime;
console.log(`[Dutchie Client] Response status: ${response.status}`);
console.log(`[Dutchie Client] Response status: ${response.status} (${responseTime}ms)`);
if (response.error) {
console.error(`[Dutchie Client] curl error: ${response.error}`);
@@ -499,14 +599,26 @@ export async function fetchPage(
}
if (response.status === 200) {
// Per workflow-12102025.md: success resets consecutive 403 count
await recordProxySuccess(responseTime);
return { html: response.data, status: response.status };
}
if (response.status === 403 && retryOn403) {
console.warn(`[Dutchie Client] 403 blocked - rotating fingerprint...`);
rotateFingerprint();
// Per workflow-12102025.md: immediately rotate IP + fingerprint
console.warn(`[Dutchie Client] 403 blocked - immediately rotating proxy + fingerprint...`);
const hasMoreProxies = await handle403Block();
if (!hasMoreProxies) {
throw new Error('All proxies exhausted - no more IPs available');
}
// Per workflow-12102025.md: Update session after rotation
currentSession.referer = buildRefererFromMenuUrl(currentSession.menuUrl);
attempt++;
await sleep(1000 * attempt);
// Per workflow-12102025.md: small backoff after rotation
await sleep(500);
continue;
}

View File

@@ -6,18 +6,20 @@
*/
export {
// HTTP Client
// HTTP Client (per workflow-12102025.md: uses curl-impersonate + ordered headers)
curlPost,
curlGet,
executeGraphQL,
fetchPage,
extractNextData,
// Headers & Fingerprints
// Headers (per workflow-12102025.md: browser-specific ordering)
buildHeaders,
getFingerprint,
rotateFingerprint,
resetFingerprint,
// Session Management (per workflow-12102025.md: menuUrl for dynamic Referer)
startSession,
endSession,
getCurrentSession,
// Proxy
setProxy,
@@ -32,6 +34,7 @@ export {
// Types
type CurlResponse,
type Fingerprint,
type CrawlSession,
type ExecuteGraphQLOptions,
type FetchPageOptions,
} from './client';

View File

@@ -7,15 +7,23 @@
* Routes are prefixed with /api/analytics/v2
*
* Phase 3: Analytics Engine + Rec/Med by State
*
* SECURITY: All routes require authentication via authMiddleware.
* Access is granted to:
* - Trusted origins (cannaiq.co, findadispo.com, etc.)
* - Trusted IPs (localhost, internal pods)
* - Valid JWT or API tokens
*/
import { Router, Request, Response } from 'express';
import { Pool } from 'pg';
import { authMiddleware } from '../auth/middleware';
import { PriceAnalyticsService } from '../services/analytics/PriceAnalyticsService';
import { BrandPenetrationService } from '../services/analytics/BrandPenetrationService';
import { CategoryAnalyticsService } from '../services/analytics/CategoryAnalyticsService';
import { StoreAnalyticsService } from '../services/analytics/StoreAnalyticsService';
import { StateAnalyticsService } from '../services/analytics/StateAnalyticsService';
import { BrandIntelligenceService } from '../services/analytics/BrandIntelligenceService';
import { TimeWindow, LegalType } from '../services/analytics/types';
function parseTimeWindow(window?: string): TimeWindow {
@@ -35,12 +43,17 @@ function parseLegalType(legalType?: string): LegalType {
export function createAnalyticsV2Router(pool: Pool): Router {
const router = Router();
// SECURITY: Apply auth middleware to ALL routes
// This gate ensures only authenticated requests can access analytics data
router.use(authMiddleware);
// Initialize services
const priceService = new PriceAnalyticsService(pool);
const brandService = new BrandPenetrationService(pool);
const categoryService = new CategoryAnalyticsService(pool);
const storeService = new StoreAnalyticsService(pool);
const stateService = new StateAnalyticsService(pool);
const brandIntelligenceService = new BrandIntelligenceService(pool);
// ============================================================
// PRICE ANALYTICS
@@ -231,6 +244,76 @@ export function createAnalyticsV2Router(pool: Pool): Router {
}
});
/**
* GET /brand/:name/promotions
* Get brand promotional history - tracks specials, discounts, duration, and sales estimates
*
* Query params:
* - window: 7d|30d|90d (default: 90d)
* - state: state code filter (e.g., AZ)
* - category: category filter (e.g., Flower)
*/
router.get('/brand/:name/promotions', async (req: Request, res: Response) => {
try {
const brandName = decodeURIComponent(req.params.name);
const window = parseTimeWindow(req.query.window as string) || '90d';
const stateCode = req.query.state as string | undefined;
const category = req.query.category as string | undefined;
const result = await brandService.getBrandPromotionalHistory(brandName, {
window,
stateCode,
category,
});
res.json(result);
} catch (error) {
console.error('[AnalyticsV2] Brand promotions error:', error);
res.status(500).json({ error: 'Failed to fetch brand promotional history' });
}
});
/**
* GET /brand/:name/intelligence
* Get comprehensive B2B brand intelligence dashboard data
*
* Returns all brand metrics in a single unified response:
* - Performance Snapshot (active SKUs, revenue, stores, market share)
* - Alerts/Slippage (lost stores, delisted SKUs, competitor takeovers)
* - Product Velocity (daily rates, velocity status)
* - Retail Footprint (penetration, whitespace opportunities)
* - Competitive Landscape (price position, market share trend)
* - Inventory Health (days of stock, risk levels)
* - Promotion Effectiveness (baseline vs promo velocity, ROI)
*
* Query params:
* - window: 7d|30d|90d (default: 30d)
* - state: state code filter (e.g., AZ)
* - category: category filter (e.g., Flower)
*/
router.get('/brand/:name/intelligence', async (req: Request, res: Response) => {
try {
const brandName = decodeURIComponent(req.params.name);
const window = parseTimeWindow(req.query.window as string);
const stateCode = req.query.state as string | undefined;
const category = req.query.category as string | undefined;
const result = await brandIntelligenceService.getBrandIntelligence(brandName, {
window,
stateCode,
category,
});
if (!result) {
return res.status(404).json({ error: 'Brand not found' });
}
res.json(result);
} catch (error) {
console.error('[AnalyticsV2] Brand intelligence error:', error);
res.status(500).json({ error: 'Failed to fetch brand intelligence' });
}
});
// ============================================================
// CATEGORY ANALYTICS
// ============================================================
@@ -400,6 +483,31 @@ export function createAnalyticsV2Router(pool: Pool): Router {
}
});
/**
* GET /store/:id/quantity-changes
* Get quantity changes for a store (increases/decreases)
* Useful for estimating sales (decreases) or restocks (increases)
*
* Query params:
* - window: 7d|30d|90d (default: 7d)
* - direction: increase|decrease|all (default: all)
* - limit: number (default: 100)
*/
router.get('/store/:id/quantity-changes', async (req: Request, res: Response) => {
try {
const dispensaryId = parseInt(req.params.id);
const window = parseTimeWindow(req.query.window as string);
const direction = (req.query.direction as 'increase' | 'decrease' | 'all') || 'all';
const limit = req.query.limit ? parseInt(req.query.limit as string) : 100;
const result = await storeService.getQuantityChanges(dispensaryId, { window, direction, limit });
res.json(result);
} catch (error) {
console.error('[AnalyticsV2] Store quantity changes error:', error);
res.status(500).json({ error: 'Failed to fetch store quantity changes' });
}
});
/**
* GET /store/:id/inventory
* Get store inventory composition

View File

@@ -47,4 +47,27 @@ router.post('/refresh', authMiddleware, async (req: AuthRequest, res) => {
res.json({ token });
});
// Verify password for sensitive actions (requires current user to be authenticated)
router.post('/verify-password', authMiddleware, async (req: AuthRequest, res) => {
try {
const { password } = req.body;
if (!password) {
return res.status(400).json({ error: 'Password required' });
}
// Re-authenticate the current user with the provided password
const user = await authenticateUser(req.user!.email, password);
if (!user) {
return res.status(401).json({ error: 'Invalid password', verified: false });
}
res.json({ verified: true });
} catch (error) {
console.error('Password verification error:', error);
res.status(500).json({ error: 'Internal server error' });
}
});
export default router;

View File

@@ -5,31 +5,35 @@ import { pool } from '../db/pool';
const router = Router();
router.use(authMiddleware);
// Get categories (flat list)
// Get categories (flat list) - derived from actual product data
router.get('/', async (req, res) => {
try {
const { store_id } = req.query;
const { store_id, in_stock_only } = req.query;
let query = `
SELECT
c.*,
COUNT(DISTINCT p.id) as product_count,
pc.name as parent_name
FROM categories c
LEFT JOIN store_products p ON c.name = p.category_raw
LEFT JOIN categories pc ON c.parent_id = pc.id
category_raw as name,
category_raw as slug,
COUNT(*) as product_count,
COUNT(*) FILTER (WHERE is_in_stock = true) as in_stock_count
FROM store_products
WHERE category_raw IS NOT NULL
`;
const params: any[] = [];
if (store_id) {
query += ' WHERE c.store_id = $1';
params.push(store_id);
query += ` AND dispensary_id = $${params.length}`;
}
if (in_stock_only === 'true') {
query += ` AND is_in_stock = true`;
}
query += `
GROUP BY c.id, pc.name
ORDER BY c.display_order, c.name
GROUP BY category_raw
ORDER BY category_raw
`;
const result = await pool.query(query, params);
@@ -40,49 +44,85 @@ router.get('/', async (req, res) => {
}
});
// Get category tree (hierarchical)
// Get category tree (hierarchical) - category -> subcategory structure from product data
router.get('/tree', async (req, res) => {
try {
const { store_id } = req.query;
const { store_id, in_stock_only } = req.query;
if (!store_id) {
return res.status(400).json({ error: 'store_id is required' });
// Get category + subcategory combinations with counts
let query = `
SELECT
category_raw as category,
subcategory_raw as subcategory,
COUNT(*) as product_count,
COUNT(*) FILTER (WHERE is_in_stock = true) as in_stock_count
FROM store_products
WHERE category_raw IS NOT NULL
`;
const params: any[] = [];
if (store_id) {
params.push(store_id);
query += ` AND dispensary_id = $${params.length}`;
}
// Get all categories for the store
const result = await pool.query(`
SELECT
c.*,
COUNT(DISTINCT p.id) as product_count
FROM categories c
LEFT JOIN store_products p ON c.name = p.category_raw AND p.is_in_stock = true AND p.dispensary_id = $1
WHERE c.store_id = $1
GROUP BY c.id
ORDER BY c.display_order, c.name
`, [store_id]);
if (in_stock_only === 'true') {
query += ` AND is_in_stock = true`;
}
// Build tree structure
const categories = result.rows;
const categoryMap = new Map();
const tree: any[] = [];
query += `
GROUP BY category_raw, subcategory_raw
ORDER BY category_raw, subcategory_raw
`;
// First pass: create map
categories.forEach((cat: { id: number; parent_id?: number }) => {
categoryMap.set(cat.id, { ...cat, children: [] });
});
const result = await pool.query(query, params);
// Second pass: build tree
categories.forEach((cat: { id: number; parent_id?: number }) => {
const node = categoryMap.get(cat.id);
if (cat.parent_id) {
const parent = categoryMap.get(cat.parent_id);
if (parent) {
parent.children.push(node);
}
} else {
tree.push(node);
// Build tree structure: category -> subcategories
const categoryMap = new Map<string, {
name: string;
slug: string;
product_count: number;
in_stock_count: number;
subcategories: Array<{
name: string;
slug: string;
product_count: number;
in_stock_count: number;
}>;
}>();
for (const row of result.rows) {
const category = row.category;
const subcategory = row.subcategory;
const count = parseInt(row.product_count);
const inStockCount = parseInt(row.in_stock_count);
if (!categoryMap.has(category)) {
categoryMap.set(category, {
name: category,
slug: category.toLowerCase().replace(/\s+/g, '-'),
product_count: 0,
in_stock_count: 0,
subcategories: []
});
}
});
const cat = categoryMap.get(category)!;
cat.product_count += count;
cat.in_stock_count += inStockCount;
if (subcategory) {
cat.subcategories.push({
name: subcategory,
slug: subcategory.toLowerCase().replace(/\s+/g, '-'),
product_count: count,
in_stock_count: inStockCount
});
}
}
const tree = Array.from(categoryMap.values());
res.json({ tree });
} catch (error) {
@@ -91,4 +131,91 @@ router.get('/tree', async (req, res) => {
}
});
// Get all unique subcategories for a category
router.get('/:category/subcategories', async (req, res) => {
try {
const { category } = req.params;
const { store_id, in_stock_only } = req.query;
let query = `
SELECT
subcategory_raw as name,
subcategory_raw as slug,
COUNT(*) as product_count,
COUNT(*) FILTER (WHERE is_in_stock = true) as in_stock_count
FROM store_products
WHERE category_raw = $1
AND subcategory_raw IS NOT NULL
`;
const params: any[] = [category];
if (store_id) {
params.push(store_id);
query += ` AND dispensary_id = $${params.length}`;
}
if (in_stock_only === 'true') {
query += ` AND is_in_stock = true`;
}
query += `
GROUP BY subcategory_raw
ORDER BY subcategory_raw
`;
const result = await pool.query(query, params);
res.json({
category,
subcategories: result.rows
});
} catch (error) {
console.error('Error fetching subcategories:', error);
res.status(500).json({ error: 'Failed to fetch subcategories' });
}
});
// Get global category summary (across all stores)
router.get('/summary', async (req, res) => {
try {
const { state } = req.query;
let query = `
SELECT
sp.category_raw as category,
COUNT(DISTINCT sp.id) as product_count,
COUNT(DISTINCT sp.dispensary_id) as store_count,
COUNT(*) FILTER (WHERE sp.is_in_stock = true) as in_stock_count
FROM store_products sp
`;
const params: any[] = [];
if (state) {
query += `
JOIN dispensaries d ON sp.dispensary_id = d.id
WHERE sp.category_raw IS NOT NULL
AND d.state = $1
`;
params.push(state);
} else {
query += ` WHERE sp.category_raw IS NOT NULL`;
}
query += `
GROUP BY sp.category_raw
ORDER BY product_count DESC
`;
const result = await pool.query(query, params);
res.json({
categories: result.rows,
total_categories: result.rows.length
});
} catch (error) {
console.error('Error fetching category summary:', error);
res.status(500).json({ error: 'Failed to fetch category summary' });
}
});
export default router;

View File

@@ -0,0 +1,269 @@
import { Router, Request, Response } from 'express';
import axios from 'axios';
const router = Router();
// Woodpecker API config - uses env vars or falls back
const WOODPECKER_SERVER = process.env.WOODPECKER_SERVER || 'https://ci.cannabrands.app';
const WOODPECKER_TOKEN = process.env.WOODPECKER_TOKEN;
const GITEA_SERVER = process.env.GITEA_SERVER || 'https://code.cannabrands.app';
const GITEA_TOKEN = process.env.GITEA_TOKEN;
const REPO_OWNER = 'Creationshop';
const REPO_NAME = 'dispensary-scraper';
interface PipelineStep {
name: string;
state: 'pending' | 'running' | 'success' | 'failure' | 'skipped';
started?: number;
stopped?: number;
}
interface PipelineInfo {
number: number;
status: string;
event: string;
branch: string;
message: string;
commit: string;
author: string;
created: number;
started?: number;
finished?: number;
steps?: PipelineStep[];
}
interface DeployStatusResponse {
running: {
sha: string;
sha_full: string;
build_time: string;
image_tag: string;
};
latest: {
sha: string;
sha_full: string;
message: string;
author: string;
timestamp: string;
} | null;
is_latest: boolean;
commits_behind: number;
pipeline: PipelineInfo | null;
error?: string;
}
/**
* Fetch latest commit from Gitea
*/
async function getLatestCommit(): Promise<{
sha: string;
message: string;
author: string;
timestamp: string;
} | null> {
if (!GITEA_TOKEN) {
console.warn('[DeployStatus] GITEA_TOKEN not set, skipping latest commit fetch');
return null;
}
try {
const response = await axios.get(
`${GITEA_SERVER}/api/v1/repos/${REPO_OWNER}/${REPO_NAME}/commits?limit=1`,
{
headers: { Authorization: `token ${GITEA_TOKEN}` },
timeout: 5000,
}
);
if (response.data && response.data.length > 0) {
const commit = response.data[0];
return {
sha: commit.sha,
message: commit.commit?.message?.split('\n')[0] || '',
author: commit.commit?.author?.name || commit.author?.login || 'unknown',
timestamp: commit.commit?.author?.date || commit.created,
};
}
} catch (error: any) {
console.error('[DeployStatus] Failed to fetch latest commit:', error.message);
}
return null;
}
/**
* Fetch latest pipeline from Woodpecker
*/
async function getLatestPipeline(): Promise<PipelineInfo | null> {
if (!WOODPECKER_TOKEN) {
console.warn('[DeployStatus] WOODPECKER_TOKEN not set, skipping pipeline fetch');
return null;
}
try {
// Get latest pipeline
const listResponse = await axios.get(
`${WOODPECKER_SERVER}/api/repos/${REPO_OWNER}/${REPO_NAME}/pipelines?page=1&per_page=1`,
{
headers: { Authorization: `Bearer ${WOODPECKER_TOKEN}` },
timeout: 5000,
}
);
if (!listResponse.data || listResponse.data.length === 0) {
return null;
}
const pipeline = listResponse.data[0];
// Get pipeline steps
let steps: PipelineStep[] = [];
try {
const stepsResponse = await axios.get(
`${WOODPECKER_SERVER}/api/repos/${REPO_OWNER}/${REPO_NAME}/pipelines/${pipeline.number}`,
{
headers: { Authorization: `Bearer ${WOODPECKER_TOKEN}` },
timeout: 5000,
}
);
if (stepsResponse.data?.workflows) {
for (const workflow of stepsResponse.data.workflows) {
if (workflow.children) {
for (const step of workflow.children) {
steps.push({
name: step.name,
state: step.state,
started: step.start_time,
stopped: step.end_time,
});
}
}
}
}
} catch (stepError) {
// Steps fetch failed, continue without them
}
return {
number: pipeline.number,
status: pipeline.status,
event: pipeline.event,
branch: pipeline.branch,
message: pipeline.message?.split('\n')[0] || '',
commit: pipeline.commit?.slice(0, 8) || '',
author: pipeline.author || 'unknown',
created: pipeline.created_at,
started: pipeline.started_at,
finished: pipeline.finished_at,
steps,
};
} catch (error: any) {
console.error('[DeployStatus] Failed to fetch pipeline:', error.message);
}
return null;
}
/**
* Count commits between two SHAs
*/
async function countCommitsBetween(fromSha: string, toSha: string): Promise<number> {
if (!GITEA_TOKEN || !fromSha || !toSha) return 0;
if (fromSha === toSha) return 0;
try {
const response = await axios.get(
`${GITEA_SERVER}/api/v1/repos/${REPO_OWNER}/${REPO_NAME}/commits?sha=${toSha}&limit=50`,
{
headers: { Authorization: `token ${GITEA_TOKEN}` },
timeout: 5000,
}
);
if (response.data) {
const commits = response.data;
for (let i = 0; i < commits.length; i++) {
if (commits[i].sha.startsWith(fromSha)) {
return i;
}
}
// If not found in first 50, assume more than 50 behind
return commits.length;
}
} catch (error: any) {
console.error('[DeployStatus] Failed to count commits:', error.message);
}
return 0;
}
/**
* GET /api/admin/deploy-status
* Returns deployment status with version comparison and CI info
*/
router.get('/', async (req: Request, res: Response) => {
try {
// Get running version from env vars (set during Docker build)
const runningSha = process.env.APP_GIT_SHA || 'unknown';
const running = {
sha: runningSha.slice(0, 8),
sha_full: runningSha,
build_time: process.env.APP_BUILD_TIME || new Date().toISOString(),
image_tag: process.env.CONTAINER_IMAGE_TAG?.slice(0, 8) || 'local',
};
// Fetch latest commit and pipeline in parallel
const [latestCommit, pipeline] = await Promise.all([
getLatestCommit(),
getLatestPipeline(),
]);
// Build latest info
const latest = latestCommit ? {
sha: latestCommit.sha.slice(0, 8),
sha_full: latestCommit.sha,
message: latestCommit.message,
author: latestCommit.author,
timestamp: latestCommit.timestamp,
} : null;
// Determine if running latest
const isLatest = latest
? runningSha.startsWith(latest.sha_full.slice(0, 8)) ||
latest.sha_full.startsWith(runningSha.slice(0, 8))
: true;
// Count commits behind
const commitsBehind = isLatest
? 0
: await countCommitsBetween(runningSha, latest?.sha_full || '');
const response: DeployStatusResponse = {
running,
latest,
is_latest: isLatest,
commits_behind: commitsBehind,
pipeline,
};
res.json(response);
} catch (error: any) {
console.error('[DeployStatus] Error:', error);
res.status(500).json({
error: error.message,
running: {
sha: process.env.APP_GIT_SHA?.slice(0, 8) || 'unknown',
sha_full: process.env.APP_GIT_SHA || 'unknown',
build_time: process.env.APP_BUILD_TIME || 'unknown',
image_tag: process.env.CONTAINER_IMAGE_TAG?.slice(0, 8) || 'local',
},
latest: null,
is_latest: true,
commits_behind: 0,
pipeline: null,
});
}
});
export default router;

View File

@@ -8,10 +8,12 @@ router.use(authMiddleware);
// Valid menu_type values
const VALID_MENU_TYPES = ['dutchie', 'treez', 'jane', 'weedmaps', 'leafly', 'meadow', 'blaze', 'flowhub', 'dispense', 'cova', 'other', 'unknown'];
// Get all dispensaries
// Get all dispensaries (with pagination)
router.get('/', async (req, res) => {
try {
const { menu_type, city, state, crawl_enabled, dutchie_verified } = req.query;
const { menu_type, city, state, crawl_enabled, dutchie_verified, status, limit, offset, search } = req.query;
const pageLimit = Math.min(parseInt(limit as string) || 50, 500);
const pageOffset = parseInt(offset as string) || 0;
let query = `
SELECT
@@ -98,15 +100,40 @@ router.get('/', async (req, res) => {
}
}
if (conditions.length > 0) {
query += ` WHERE ${conditions.join(' AND ')}`;
// Filter by status (e.g., 'dropped', 'open', 'closed')
if (status) {
conditions.push(`status = $${params.length + 1}`);
params.push(status);
}
// Search filter (name, dba_name, city, company_name)
if (search) {
conditions.push(`(name ILIKE $${params.length + 1} OR dba_name ILIKE $${params.length + 1} OR city ILIKE $${params.length + 1})`);
params.push(`%${search}%`);
}
// Build WHERE clause
const whereClause = conditions.length > 0 ? ` WHERE ${conditions.join(' AND ')}` : '';
// Get total count first
const countResult = await pool.query(`SELECT COUNT(*) FROM dispensaries${whereClause}`, params);
const total = parseInt(countResult.rows[0].count);
// Add pagination
query += whereClause;
query += ` ORDER BY name`;
query += ` LIMIT $${params.length + 1} OFFSET $${params.length + 2}`;
params.push(pageLimit, pageOffset);
const result = await pool.query(query, params);
res.json({ dispensaries: result.rows, total: result.rowCount });
res.json({
dispensaries: result.rows,
total,
limit: pageLimit,
offset: pageOffset,
hasMore: pageOffset + result.rows.length < total
});
} catch (error) {
console.error('Error fetching dispensaries:', error);
res.status(500).json({ error: 'Failed to fetch dispensaries' });
@@ -140,6 +167,7 @@ router.get('/stats/crawl-status', async (req, res) => {
COUNT(*) FILTER (WHERE crawl_enabled = false OR crawl_enabled IS NULL) as disabled_count,
COUNT(*) FILTER (WHERE dutchie_verified = true) as verified_count,
COUNT(*) FILTER (WHERE dutchie_verified = false OR dutchie_verified IS NULL) as unverified_count,
COUNT(*) FILTER (WHERE status = 'dropped') as dropped_count,
COUNT(*) as total_count
FROM dispensaries
`;
@@ -169,6 +197,34 @@ router.get('/stats/crawl-status', async (req, res) => {
}
});
// Get dropped stores count (for dashboard alert)
router.get('/stats/dropped', async (req, res) => {
try {
const result = await pool.query(`
SELECT
COUNT(*) as dropped_count,
json_agg(json_build_object(
'id', id,
'name', name,
'city', city,
'state', state,
'dropped_at', updated_at
) ORDER BY updated_at DESC) FILTER (WHERE status = 'dropped') as dropped_stores
FROM dispensaries
WHERE status = 'dropped'
`);
const row = result.rows[0];
res.json({
dropped_count: parseInt(row.dropped_count) || 0,
dropped_stores: row.dropped_stores || []
});
} catch (error) {
console.error('Error fetching dropped stores:', error);
res.status(500).json({ error: 'Failed to fetch dropped stores' });
}
});
// Get single dispensary by slug or ID
router.get('/:slugOrId', async (req, res) => {
try {

View File

@@ -22,11 +22,17 @@ interface ProductClickEventPayload {
store_id?: string;
brand_id?: string;
campaign_id?: string;
dispensary_name?: string;
action: 'view' | 'open_store' | 'open_product' | 'compare' | 'other';
source: string;
page_type?: string; // Page where event occurred (e.g., StoreDetailPage, BrandsIntelligence)
url_path?: string; // URL path for debugging
occurred_at?: string;
// Visitor location (from frontend IP geolocation)
visitor_city?: string;
visitor_state?: string;
visitor_lat?: number;
visitor_lng?: number;
}
/**
@@ -77,13 +83,14 @@ router.post('/product-click', optionalAuthMiddleware, async (req: Request, res:
// Insert the event with enhanced fields
await pool.query(
`INSERT INTO product_click_events
(product_id, store_id, brand_id, campaign_id, action, source, user_id, ip_address, user_agent, occurred_at, event_type, page_type, url_path, device_type)
VALUES ($1, $2, $3, $4, $5, $6, $7, $8, $9, $10, $11, $12, $13, $14)`,
(product_id, store_id, brand_id, campaign_id, dispensary_name, action, source, user_id, ip_address, user_agent, occurred_at, event_type, page_type, url_path, device_type, visitor_city, visitor_state, visitor_lat, visitor_lng)
VALUES ($1, $2, $3, $4, $5, $6, $7, $8, $9, $10, $11, $12, $13, $14, $15, $16, $17, $18, $19)`,
[
payload.product_id,
payload.store_id || null,
payload.brand_id || null,
payload.campaign_id || null,
payload.dispensary_name || null,
payload.action,
payload.source,
userId,
@@ -93,7 +100,11 @@ router.post('/product-click', optionalAuthMiddleware, async (req: Request, res:
'product_click', // event_type
payload.page_type || null,
payload.url_path || null,
deviceType
deviceType,
payload.visitor_city || null,
payload.visitor_state || null,
payload.visitor_lat || null,
payload.visitor_lng || null
]
);

View File

@@ -45,6 +45,8 @@ interface ApiHealth extends HealthStatus {
uptime: number;
timestamp: string;
version: string;
build_sha: string | null;
build_time: string | null;
}
interface DbHealth extends HealthStatus {
@@ -113,6 +115,8 @@ async function getApiHealth(): Promise<ApiHealth> {
uptime: Math.floor((Date.now() - serverStartTime) / 1000),
timestamp: new Date().toISOString(),
version: packageVersion,
build_sha: process.env.APP_GIT_SHA && process.env.APP_GIT_SHA !== 'unknown' ? process.env.APP_GIT_SHA : null,
build_time: process.env.APP_BUILD_TIME && process.env.APP_BUILD_TIME !== 'unknown' ? process.env.APP_BUILD_TIME : null,
};
}
@@ -138,14 +142,16 @@ async function getDbHealth(): Promise<DbHealth> {
async function getRedisHealth(): Promise<RedisHealth> {
const start = Date.now();
const isLocal = process.env.NODE_ENV === 'development' || process.env.NODE_ENV === 'local' || !process.env.NODE_ENV;
// Check if Redis is configured
if (!process.env.REDIS_URL && !process.env.REDIS_HOST) {
// Redis is optional in local dev, required in prod/staging
return {
status: 'ok', // Redis is optional
status: isLocal ? 'ok' : 'error',
connected: false,
latency_ms: 0,
error: 'Redis not configured',
error: isLocal ? 'Redis not configured (optional in local)' : 'Redis not configured (required in production)',
};
}

View File

@@ -0,0 +1,214 @@
/**
* Image Proxy Route
*
* On-demand image resizing service. Serves images with URL-based transforms.
*
* Usage:
* /img/<path>?w=200&h=200&q=80&fit=cover
*
* Parameters:
* w - width (pixels)
* h - height (pixels)
* q - quality (1-100, default 80)
* fit - resize fit: cover, contain, fill, inside, outside (default: inside)
* blur - blur sigma (0.3-1000)
* gray - grayscale (1 = enabled)
* format - output format: webp, jpeg, png, avif (default: webp)
*
* Examples:
* /img/products/az/store/brand/product/image.webp?w=200
* /img/products/az/store/brand/product/image.webp?w=600&h=400&fit=cover
* /img/products/az/store/brand/product/image.webp?w=100&blur=5&gray=1
*/
import { Router, Request, Response } from 'express';
import * as fs from 'fs/promises';
import * as path from 'path';
// @ts-ignore
const sharp = require('sharp');
const router = Router();
// Base path for images
function getImagesBasePath(): string {
if (process.env.IMAGES_PATH) {
return process.env.IMAGES_PATH;
}
if (process.env.STORAGE_BASE_PATH) {
return path.join(process.env.STORAGE_BASE_PATH, 'images');
}
return './storage/images';
}
const IMAGES_BASE_PATH = getImagesBasePath();
// Allowed fit modes
const ALLOWED_FITS = ['cover', 'contain', 'fill', 'inside', 'outside'] as const;
type FitMode = typeof ALLOWED_FITS[number];
// Allowed formats
const ALLOWED_FORMATS = ['webp', 'jpeg', 'jpg', 'png', 'avif'] as const;
type OutputFormat = typeof ALLOWED_FORMATS[number];
// Cache headers (1 year for immutable content-addressed images)
const CACHE_MAX_AGE = 31536000; // 1 year in seconds
interface TransformParams {
width?: number;
height?: number;
quality: number;
fit: FitMode;
blur?: number;
grayscale: boolean;
format: OutputFormat;
}
function parseTransformParams(query: any): TransformParams {
return {
width: query.w ? Math.min(Math.max(parseInt(query.w, 10), 1), 4000) : undefined,
height: query.h ? Math.min(Math.max(parseInt(query.h, 10), 1), 4000) : undefined,
quality: query.q ? Math.min(Math.max(parseInt(query.q, 10), 1), 100) : 80,
fit: ALLOWED_FITS.includes(query.fit) ? query.fit : 'inside',
blur: query.blur ? Math.min(Math.max(parseFloat(query.blur), 0.3), 1000) : undefined,
grayscale: query.gray === '1' || query.grayscale === '1',
format: ALLOWED_FORMATS.includes(query.format) ? query.format : 'webp',
};
}
function getContentType(format: OutputFormat): string {
switch (format) {
case 'jpeg':
case 'jpg':
return 'image/jpeg';
case 'png':
return 'image/png';
case 'avif':
return 'image/avif';
case 'webp':
default:
return 'image/webp';
}
}
/**
* Image proxy endpoint
* GET /img/*
*/
router.get('/*', async (req: Request, res: Response) => {
try {
// Get the image path from URL (everything after /img/)
const imagePath = req.params[0];
if (!imagePath) {
return res.status(400).json({ error: 'Image path required' });
}
// Security: prevent directory traversal
const normalizedPath = path.normalize(imagePath).replace(/^(\.\.(\/|\\|$))+/, '');
const basePath = path.resolve(IMAGES_BASE_PATH);
const fullPath = path.resolve(path.join(IMAGES_BASE_PATH, normalizedPath));
// Ensure path is within base directory
if (!fullPath.startsWith(basePath)) {
console.error(`[ImageProxy] Path traversal attempt: ${fullPath} not in ${basePath}`);
return res.status(403).json({ error: 'Access denied' });
}
// Check if file exists
try {
await fs.access(fullPath);
} catch {
return res.status(404).json({ error: 'Image not found' });
}
// Parse transform parameters
const params = parseTransformParams(req.query);
// Check if any transforms are requested
const hasTransforms = params.width || params.height || params.blur || params.grayscale;
// Read the original image
const imageBuffer = await fs.readFile(fullPath);
let outputBuffer: Buffer;
if (hasTransforms) {
// Apply transforms
let pipeline = sharp(imageBuffer);
// Resize
if (params.width || params.height) {
pipeline = pipeline.resize(params.width, params.height, {
fit: params.fit,
withoutEnlargement: true,
});
}
// Blur
if (params.blur) {
pipeline = pipeline.blur(params.blur);
}
// Grayscale
if (params.grayscale) {
pipeline = pipeline.grayscale();
}
// Output format
switch (params.format) {
case 'jpeg':
case 'jpg':
pipeline = pipeline.jpeg({ quality: params.quality });
break;
case 'png':
pipeline = pipeline.png({ quality: params.quality });
break;
case 'avif':
pipeline = pipeline.avif({ quality: params.quality });
break;
case 'webp':
default:
pipeline = pipeline.webp({ quality: params.quality });
}
outputBuffer = await pipeline.toBuffer();
} else {
// No transforms - serve original (but maybe convert format)
if (params.format !== 'webp' || params.quality !== 80) {
let pipeline = sharp(imageBuffer);
switch (params.format) {
case 'jpeg':
case 'jpg':
pipeline = pipeline.jpeg({ quality: params.quality });
break;
case 'png':
pipeline = pipeline.png({ quality: params.quality });
break;
case 'avif':
pipeline = pipeline.avif({ quality: params.quality });
break;
case 'webp':
default:
pipeline = pipeline.webp({ quality: params.quality });
}
outputBuffer = await pipeline.toBuffer();
} else {
outputBuffer = imageBuffer;
}
}
// Set headers
res.setHeader('Content-Type', getContentType(params.format));
res.setHeader('Cache-Control', `public, max-age=${CACHE_MAX_AGE}, immutable`);
res.setHeader('X-Image-Size', outputBuffer.length);
// Send image
res.send(outputBuffer);
} catch (error: any) {
console.error('[ImageProxy] Error:', error.message);
res.status(500).json({ error: 'Failed to process image' });
}
});
export default router;

View File

@@ -14,35 +14,56 @@ router.use(authMiddleware);
/**
* GET /api/admin/intelligence/brands
* List all brands with state presence, store counts, and pricing
* Query params:
* - state: Filter by state (e.g., "AZ")
* - limit: Max results (default 500)
* - offset: Pagination offset
*/
router.get('/brands', async (req: Request, res: Response) => {
try {
const { limit = '500', offset = '0' } = req.query;
const { limit = '500', offset = '0', state } = req.query;
const limitNum = Math.min(parseInt(limit as string, 10), 1000);
const offsetNum = parseInt(offset as string, 10);
// Build WHERE clause based on state filter
let stateFilter = '';
const params: any[] = [limitNum, offsetNum];
if (state && state !== 'all') {
stateFilter = 'AND d.state = $3';
params.push(state);
}
const { rows } = await pool.query(`
SELECT
sp.brand_name_raw as brand_name,
array_agg(DISTINCT d.state) FILTER (WHERE d.state IS NOT NULL) as states,
COUNT(DISTINCT d.id) as store_count,
COUNT(DISTINCT sp.id) as sku_count,
ROUND(AVG(sp.price_rec)::numeric, 2) FILTER (WHERE sp.price_rec > 0) as avg_price_rec,
ROUND(AVG(sp.price_med)::numeric, 2) FILTER (WHERE sp.price_med > 0) as avg_price_med
ROUND(AVG(sp.price_rec) FILTER (WHERE sp.price_rec > 0)::numeric, 2) as avg_price_rec,
ROUND(AVG(sp.price_med) FILTER (WHERE sp.price_med > 0)::numeric, 2) as avg_price_med
FROM store_products sp
JOIN dispensaries d ON sp.dispensary_id = d.id
WHERE sp.brand_name_raw IS NOT NULL AND sp.brand_name_raw != ''
${stateFilter}
GROUP BY sp.brand_name_raw
ORDER BY store_count DESC, sku_count DESC
LIMIT $1 OFFSET $2
`, [limitNum, offsetNum]);
`, params);
// Get total count
// Get total count with same state filter
const countParams: any[] = [];
let countStateFilter = '';
if (state && state !== 'all') {
countStateFilter = 'AND d.state = $1';
countParams.push(state);
}
const { rows: countRows } = await pool.query(`
SELECT COUNT(DISTINCT brand_name_raw) as total
FROM store_products
WHERE brand_name_raw IS NOT NULL AND brand_name_raw != ''
`);
SELECT COUNT(DISTINCT sp.brand_name_raw) as total
FROM store_products sp
JOIN dispensaries d ON sp.dispensary_id = d.id
WHERE sp.brand_name_raw IS NOT NULL AND sp.brand_name_raw != ''
${countStateFilter}
`, countParams);
res.json({
brands: rows.map((r: any) => ({
@@ -147,29 +168,63 @@ router.get('/brands/:brandName/penetration', async (req: Request, res: Response)
/**
* GET /api/admin/intelligence/pricing
* Get pricing analytics by category
* Query params:
* - state: Filter by state (e.g., "AZ")
*/
router.get('/pricing', async (req: Request, res: Response) => {
try {
const { rows: categoryRows } = await pool.query(`
SELECT
sp.category_raw as category,
ROUND(AVG(sp.price_rec)::numeric, 2) as avg_price,
MIN(sp.price_rec) FILTER (WHERE sp.price_rec > 0) as min_price,
MAX(sp.price_rec) as max_price,
ROUND(PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY sp.price_rec)::numeric, 2)
FILTER (WHERE sp.price_rec > 0) as median_price,
COUNT(*) as product_count
FROM store_products sp
WHERE sp.category_raw IS NOT NULL AND sp.price_rec > 0
GROUP BY sp.category_raw
ORDER BY product_count DESC
`);
const { state } = req.query;
// Build WHERE clause based on state filter
let stateFilter = '';
const categoryParams: any[] = [];
const stateQueryParams: any[] = [];
const overallParams: any[] = [];
if (state && state !== 'all') {
stateFilter = 'AND d.state = $1';
categoryParams.push(state);
overallParams.push(state);
}
// Category pricing with optional state filter
const categoryQuery = state && state !== 'all'
? `
SELECT
sp.category_raw as category,
ROUND(AVG(sp.price_rec)::numeric, 2) as avg_price,
MIN(sp.price_rec) as min_price,
MAX(sp.price_rec) as max_price,
ROUND(PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY sp.price_rec)::numeric, 2) as median_price,
COUNT(*) as product_count
FROM store_products sp
JOIN dispensaries d ON sp.dispensary_id = d.id
WHERE sp.category_raw IS NOT NULL AND sp.price_rec > 0 ${stateFilter}
GROUP BY sp.category_raw
ORDER BY product_count DESC
`
: `
SELECT
sp.category_raw as category,
ROUND(AVG(sp.price_rec)::numeric, 2) as avg_price,
MIN(sp.price_rec) as min_price,
MAX(sp.price_rec) as max_price,
ROUND(PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY sp.price_rec)::numeric, 2) as median_price,
COUNT(*) as product_count
FROM store_products sp
WHERE sp.category_raw IS NOT NULL AND sp.price_rec > 0
GROUP BY sp.category_raw
ORDER BY product_count DESC
`;
const { rows: categoryRows } = await pool.query(categoryQuery, categoryParams);
// State pricing
const { rows: stateRows } = await pool.query(`
SELECT
d.state,
ROUND(AVG(sp.price_rec)::numeric, 2) as avg_price,
MIN(sp.price_rec) FILTER (WHERE sp.price_rec > 0) as min_price,
MIN(sp.price_rec) as min_price,
MAX(sp.price_rec) as max_price,
COUNT(DISTINCT sp.id) as product_count
FROM store_products sp
@@ -179,6 +234,31 @@ router.get('/pricing', async (req: Request, res: Response) => {
ORDER BY avg_price DESC
`);
// Overall stats with optional state filter
const overallQuery = state && state !== 'all'
? `
SELECT
ROUND(AVG(sp.price_rec)::numeric, 2) as avg_price,
MIN(sp.price_rec) as min_price,
MAX(sp.price_rec) as max_price,
COUNT(*) as total_products
FROM store_products sp
JOIN dispensaries d ON sp.dispensary_id = d.id
WHERE sp.price_rec > 0 ${stateFilter}
`
: `
SELECT
ROUND(AVG(sp.price_rec)::numeric, 2) as avg_price,
MIN(sp.price_rec) as min_price,
MAX(sp.price_rec) as max_price,
COUNT(*) as total_products
FROM store_products sp
WHERE sp.price_rec > 0
`;
const { rows: overallRows } = await pool.query(overallQuery, overallParams);
const overall = overallRows[0];
res.json({
byCategory: categoryRows.map((r: any) => ({
category: r.category,
@@ -195,6 +275,12 @@ router.get('/pricing', async (req: Request, res: Response) => {
maxPrice: r.max_price ? parseFloat(r.max_price) : null,
productCount: parseInt(r.product_count, 10),
})),
overall: {
avgPrice: overall?.avg_price ? parseFloat(overall.avg_price) : null,
minPrice: overall?.min_price ? parseFloat(overall.min_price) : null,
maxPrice: overall?.max_price ? parseFloat(overall.max_price) : null,
totalProducts: parseInt(overall?.total_products || '0', 10),
},
});
} catch (error: any) {
console.error('[Intelligence] Error fetching pricing:', error.message);
@@ -205,9 +291,23 @@ router.get('/pricing', async (req: Request, res: Response) => {
/**
* GET /api/admin/intelligence/stores
* Get store intelligence summary
* Query params:
* - state: Filter by state (e.g., "AZ")
* - limit: Max results (default 200)
*/
router.get('/stores', async (req: Request, res: Response) => {
try {
const { state, limit = '200' } = req.query;
const limitNum = Math.min(parseInt(limit as string, 10), 500);
// Build WHERE clause based on state filter
let stateFilter = '';
const params: any[] = [limitNum];
if (state && state !== 'all') {
stateFilter = 'AND d.state = $2';
params.push(state);
}
const { rows: storeRows } = await pool.query(`
SELECT
d.id,
@@ -217,17 +317,22 @@ router.get('/stores', async (req: Request, res: Response) => {
d.state,
d.menu_type,
d.crawl_enabled,
COUNT(DISTINCT sp.id) as product_count,
c.name as chain_name,
COUNT(DISTINCT sp.id) as sku_count,
COUNT(DISTINCT sp.brand_name_raw) as brand_count,
ROUND(AVG(sp.price_rec)::numeric, 2) as avg_price,
MAX(sp.updated_at) as last_product_update
MAX(sp.updated_at) as last_crawl,
(SELECT COUNT(*) FROM store_product_snapshots sps
WHERE sps.store_product_id IN (SELECT id FROM store_products WHERE dispensary_id = d.id)) as snapshot_count
FROM dispensaries d
LEFT JOIN store_products sp ON sp.dispensary_id = d.id
WHERE d.state IS NOT NULL
GROUP BY d.id, d.name, d.dba_name, d.city, d.state, d.menu_type, d.crawl_enabled
ORDER BY product_count DESC
LIMIT 200
`);
LEFT JOIN chains c ON d.chain_id = c.id
WHERE d.state IS NOT NULL AND d.crawl_enabled = true
${stateFilter}
GROUP BY d.id, d.name, d.dba_name, d.city, d.state, d.menu_type, d.crawl_enabled, c.name
ORDER BY sku_count DESC
LIMIT $1
`, params);
res.json({
stores: storeRows.map((r: any) => ({
@@ -238,10 +343,13 @@ router.get('/stores', async (req: Request, res: Response) => {
state: r.state,
menuType: r.menu_type,
crawlEnabled: r.crawl_enabled,
productCount: parseInt(r.product_count || '0', 10),
chainName: r.chain_name || null,
skuCount: parseInt(r.sku_count || '0', 10),
snapshotCount: parseInt(r.snapshot_count || '0', 10),
brandCount: parseInt(r.brand_count || '0', 10),
avgPrice: r.avg_price ? parseFloat(r.avg_price) : null,
lastProductUpdate: r.last_product_update,
lastCrawl: r.last_crawl,
crawlFrequencyHours: 4, // Default crawl frequency
})),
total: storeRows.length,
});

Some files were not shown because too many files have changed in this diff Show More