Compare commits

...

40 Commits

Author SHA1 Message Date
Kelly
cdab44c757 fix(ui): Show git hash instead of version number in sidebar
🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-11 08:09:06 -07:00
Kelly
d6c602c567 fix(ui): Remove Cleanup Stale button from Workers page
🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-11 01:29:25 -07:00
Kelly
a252a7fefd feat(tasks): 25 workers, pool starts paused by default
- Increase worker replicas from 5 to 25
- Task pool now starts PAUSED on deploy, admin must click Start Pool
- Prevents workers from grabbing tasks before system is ready

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-11 01:19:02 -07:00
Kelly
83b06c21cc fix(tasks): Stop spinner in status cards when pool is paused
🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-11 01:14:25 -07:00
Kelly
e3d4dd0127 fix(ci): Fix Woodpecker config - remove invalid top-level when
🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-11 00:45:08 -07:00
kelly
d0ee0d72f5 Merge pull request 'feat(tasks): Add task pool start/stop toggle' (#29) from feat/task-pool-toggle into master
Reviewed-on: https://code.cannabrands.app/Creationshop/dispensary-scraper/pulls/29
2025-12-11 07:21:02 +00:00
kelly
521f0550cd Merge pull request 'feat: Admin UI cleanup and dispensary schedule page' (#28) from fix/worker-proxy-wait into master
Reviewed-on: https://code.cannabrands.app/Creationshop/dispensary-scraper/pulls/28
2025-12-11 07:08:03 +00:00
Kelly
8a09691e91 feat(tasks): Add task pool start/stop toggle
- Add task-pool-state.ts for shared pause/resume state
- Add /api/tasks/pool/status, pause, resume endpoints
- Add Start/Stop Pool toggle button to TasksDashboard
- Spinner stops when pool is closed
- Fix is_active column name in store-discovery.ts
- Fix missing active column in task-service.ts claimTask
- Auto-refresh every 15 seconds

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-11 00:07:14 -07:00
Kelly
459ad7d9c9 fix(tasks): Fix missing column errors in task queries
- Change 'active' to 'is_active' in states table query (store-discovery.ts)
- Remove non-existent 'active' column check from worker_tasks query (task-service.ts)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-10 23:54:28 -07:00
Kelly
d102d27731 feat(admin): Dispensary schedule page and UI cleanup
- Add DispensarySchedule page showing crawl history and upcoming schedule
- Add /dispensaries/:state/:city/:slug/schedule route
- Add API endpoint for store crawl history
- Update View Schedule link to use dispensary-specific route
- Remove colored badges from DispensaryDetail product table (plain text)
- Make Details button ghost style in product table
- Add "Sort by States" option to IntelligenceBrands
- Remove status filter dropdown from Dispensaries page

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-10 23:50:47 -07:00
kelly
01810c40a1 Merge pull request 'fix(worker): Wait for proxies instead of crashing' (#27) from fix/worker-proxy-wait into master
Reviewed-on: https://code.cannabrands.app/Creationshop/dispensary-scraper/pulls/27
2025-12-11 06:43:29 +00:00
Kelly
b7d33e1cbf fix(admin): Clean up store detail and intelligence pages
- Remove Update dropdown from DispensaryDetail page
- Remove Crawl Now button from StoreDetailPage
- Change "Last Crawl" to "Last Updated" on both detail pages
- Tone down emerald colors on StoreDetailPage (use gray borders/tabs)
- Simplify THC/CBD/Stock badges to plain text
- Remove duplicate state dropdown from IntelligenceStores filters
- Make store rows clickable in IntelligenceStores

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-10 23:42:50 -07:00
Kelly
5b34b5a78c fix(admin): Consistent navigation across Intelligence pages
- Add state selector dropdown to all three Intelligence pages (Brands, Stores, Pricing)
- Use consistent emerald-styled page navigation badges with current page highlighted
- Remove Refresh buttons from all Intelligence pages
- Update chart styling to use emerald gradient bars (matching Pricing page)
- Load all available states from orchestrator API instead of extracting from local data
- Fix z-index and styling on state dropdown for better visibility

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-10 23:32:57 -07:00
Kelly
c091d2316b fix(dashboard): Remove refresh button and HealthPanel
- Removed refresh button and refreshing state from Dashboard
- Removed HealthPanel component (deploy status auto-refresh)
- Simplified header layout

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-10 23:28:18 -07:00
Kelly
e8862b8a8b fix(national): Remove Refresh Metrics button
Removed unused refresh button and related state/handlers from
National Dashboard.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-10 23:27:01 -07:00
Kelly
1b46ab699d fix(national): Show all states count, not filtered "active" states
The "Active States" metric was arbitrary and confusing. Changed to
show total states count - all states in the system regardless of
whether they have data or not.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-10 23:25:50 -07:00
Kelly
ac1995f63f fix(pricing): Simplify category chart to prevent overflow
- Replace complex price range bars with simple horizontal bars
- Use overflow-hidden to prevent bars extending beyond container
- Calculate bar width as percentage of max avg price
- Limit to top 12 categories for cleaner display
- Fixed-width labels prevent layout shift

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-10 23:24:31 -07:00
Kelly
de93669652 fix(national): Count active states by product data, not crawl status
Active states should count states with actual product data, not just
states where crawling is enabled. A state can have historical data
even if crawling is currently disabled.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-10 23:23:20 -07:00
Kelly
dffc124920 fix(national): Fix active states count and remove StateBadge
- Change active_states to count states with crawl_enabled=true dispensaries
- Filter all national summary queries by crawl_enabled=true
- Remove unused StateBadge from National Dashboard header
- StateBadge was showing "Arizona" with no way to change it

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-10 23:22:19 -07:00
Kelly
932ceb0287 feat(intelligence): Add state filter to all Intelligence pages
- Add state filter to Intelligence Brands API and frontend
- Add state filter to Intelligence Pricing API and frontend
- Add state filter to Intelligence Stores API and frontend
- Fix null safety issues with toLocaleString() calls
- Update backend /stores endpoint to return skuCount, snapshotCount, chainName
- Add overall stats to pricing endpoint

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-10 23:19:54 -07:00
Kelly
824d48fd85 fix: Add curl to Docker, add active flag to worker_tasks
- Install curl in Docker container for Dutchie HTTP requests
- Add 'active' column to worker_tasks (default false) to prevent
  accidental task execution on startup
- Update task-service to only claim tasks where active=true

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-10 23:12:09 -07:00
Kelly
47fdab0382 fix: Filter orchestrator states by crawl_enabled
The states dropdown was showing count of ALL dispensaries instead of
just crawl-enabled ones. Now correctly filters to match the actual
stores that would be displayed.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-10 23:09:04 -07:00
Kelly
ed7ddc6375 ci: Add database migration step to deploy pipeline
Migrations now run automatically before deployments:
1. Build new Docker image
2. Run migrations using the new image
3. Deploy to Kubernetes

Requires new secrets: db_host, db_port, db_name, db_user, db_pass

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-10 23:07:25 -07:00
Kelly
cf06f4a8c0 feat(worker): Listen for proxy_added notifications
- Workers now use PostgreSQL LISTEN/NOTIFY to wake up immediately when proxies are added
- Added trigger on proxies table to NOTIFY 'proxy_added' when active proxy inserted/updated
- Falls back to 30s polling if LISTEN fails

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-10 22:58:00 -07:00
Kelly
a2fa21f65c fix(worker): Wait for proxies instead of crashing on startup
- Task worker now waits up to 60 minutes for active proxies
- Retries every 30 seconds with clear logging
- Updated K8s scraper-worker.yaml with Deployment definition
- Deployment uses task-worker.js entrypoint with correct liveness probe

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-10 22:55:04 -07:00
kelly
61e915968f Merge pull request 'feat(tasks): Refactor task workflow with payload/refresh separation' (#26) from feat/task-workflow-refactor into master
2025-12-11 05:24:11 +00:00
Kelly
4949b22457 feat(tasks): Refactor task workflow with payload/refresh separation
Major changes:
- Split crawl into payload_fetch (API → disk) and product_refresh (disk → DB)
- Add task chaining: store_discovery → product_discovery → payload_fetch → product_refresh
- Add payload storage utilities for gzipped JSON on filesystem
- Add /api/payloads endpoints for payload access and diffing
- Add DB-driven TaskScheduler with schedule persistence
- Track newDispensaryIds through discovery promotion for chaining
- Add stealth improvements: HTTP fingerprinting, proxy rotation enhancements
- Add Workers dashboard K8s scaling controls

New files:
- src/tasks/handlers/payload-fetch.ts - Fetches from API, saves to disk
- src/services/task-scheduler.ts - DB-driven schedule management
- src/utils/payload-storage.ts - Payload save/load utilities
- src/routes/payloads.ts - Payload API endpoints
- src/services/http-fingerprint.ts - Browser fingerprint generation
- docs/TASK_WORKFLOW_2024-12-10.md - Complete workflow documentation

Migrations:
- 078: Proxy consecutive 403 tracking
- 079: task_schedules table
- 080: raw_crawl_payloads table
- 081: payload column and last_fetch_at

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-10 22:15:35 -07:00
Kelly
1fb0eb94c2 security: Add authMiddleware to analytics-v2 routes
- Add authMiddleware to analytics-v2.ts to require authentication
- Add permanent rule #6 to CLAUDE.md: "ALL API ROUTES REQUIRE AUTHENTICATION"
- Add forbidden action #19: "Creating API routes without authMiddleware"
- Document authentication flow and trusted origins

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-10 19:01:44 -07:00
Kelly
9aefb554bc fix: Correct Analytics V2 SQL queries for schema alignment
- Fix JOIN path: store_products -> dispensaries -> states (was incorrectly joining sp.state_id which doesn't exist)
- Fix column names to use *_raw suffixes (category_raw, brand_name_raw, name_raw)
- Fix row mappings to read correct column names from query results
- Add ::timestamp casts for interval arithmetic in StoreAnalyticsService

All Analytics V2 endpoints now work correctly:
- /state/legal-breakdown
- /state/recreational
- /category/all
- /category/rec-vs-med
- /state/:code/summary
- /store/:id/summary

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-10 18:52:57 -07:00
kelly
a4338669a9 Merge pull request 'fix(auth): Prioritize JWT token over trusted origin bypass' (#24) from fix/auth-token-priority into master
Reviewed-on: https://code.cannabrands.app/Creationshop/dispensary-scraper/pulls/24
2025-12-11 01:34:10 +00:00
Kelly
1fa9ea496c fix(auth): Prioritize JWT token over trusted origin bypass
When a user logs in and has a Bearer token, use their actual identity
instead of falling back to internal@system. This ensures logged-in
users see their real email in the admin UI.

Order of auth:
1. If Bearer token provided → use JWT/API token (real user identity)
2. If no token → check trusted origins (for API access like WordPress)
3. Otherwise → 401 unauthorized

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-10 18:21:50 -07:00
kelly
31756a2233 Merge pull request 'chore: Add WordPress plugin v1.6.0 download files' (#23) from chore/wordpress-plugin-downloads into master
Reviewed-on: https://code.cannabrands.app/Creationshop/dispensary-scraper/pulls/23
2025-12-11 00:40:53 +00:00
Kelly
166583621b chore: Add WordPress plugin v1.6.0 download files
🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-10 17:23:25 -07:00
kelly
ca952c4674 Merge pull request 'fix(ci): Use YAML map format for docker-buildx build_args' (#21) from fix/ci-build-args-format into master
Reviewed-on: https://code.cannabrands.app/Creationshop/dispensary-scraper/pulls/21
2025-12-10 23:54:33 +00:00
kelly
4054778b6c Merge pull request 'feat: Add wildcard support for trusted domains' (#20) from fix/trusted-origins-wildcards into master
Reviewed-on: https://code.cannabrands.app/Creationshop/dispensary-scraper/pulls/20
2025-12-10 23:54:11 +00:00
Kelly
56a5f00015 fix(ci): Use YAML map format for docker-buildx build_args
The woodpeckerci/plugin-docker-buildx plugin expects build_args as a
YAML map (key: value), not a list. This was causing build args to not
be passed to the Docker build, resulting in unknown git SHA and build
info in the deployed application.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-10 16:42:05 -07:00
Kelly
a96d50c481 docs(wordpress): Add deprecation comments for legacy shortcode/migration code
Clarifies that crawlsy_* and dutchie_* shortcodes are deprecated aliases
for backward compatibility only. New implementations should use cannaiq_*.

Also documents the token migration logic that preserves old API tokens.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-10 16:24:56 -07:00
kelly
4806212f46 Merge pull request 'fix(ci): Use YAML list format for docker-buildx build_args' (#18) from fix/ci-build-args into master
Reviewed-on: https://code.cannabrands.app/Creationshop/dispensary-scraper/pulls/18
2025-12-10 22:29:41 +00:00
kelly
2486f3c6b2 Merge pull request 'feat(analytics): Add Brand Intelligence API endpoint' (#19) from feat/brand-intelligence-api into master
Reviewed-on: https://code.cannabrands.app/Creationshop/dispensary-scraper/pulls/19
2025-12-10 22:29:26 +00:00
Kelly
97b1ab23d8 fix(ci): Use YAML list format for docker-buildx build_args
The woodpecker docker-buildx plugin expects build_args as a YAML list,
not a comma-separated string. The previous format resulted in all args
being passed as a single malformed arg with "*=" prefix.

This fix ensures APP_GIT_SHA, APP_BUILD_TIME, etc. are properly passed
to the Dockerfile so the /api/version endpoint returns correct values.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-10 14:56:18 -07:00
69 changed files with 6298 additions and 1489 deletions

View File

@@ -1,6 +1,3 @@
-when:
-  - event: [push, pull_request]
steps:
# ===========================================
# PR VALIDATION: Parallel type checks (PRs only)
@@ -89,7 +86,11 @@ steps:
from_secret: registry_password
platforms: linux/amd64
provenance: false
-build_args: APP_BUILD_VERSION=${CI_COMMIT_SHA:0:8},APP_GIT_SHA=${CI_COMMIT_SHA},APP_BUILD_TIME=${CI_PIPELINE_CREATED},CONTAINER_IMAGE_TAG=${CI_COMMIT_SHA:0:8}
+build_args:
+  APP_BUILD_VERSION: ${CI_COMMIT_SHA:0:8}
+  APP_GIT_SHA: ${CI_COMMIT_SHA}
+  APP_BUILD_TIME: ${CI_PIPELINE_CREATED}
+  CONTAINER_IMAGE_TAG: ${CI_COMMIT_SHA:0:8}
depends_on: []
when:
branch: master
@@ -159,7 +160,32 @@ steps:
event: push
# ===========================================
-# STAGE 3: Deploy (after Docker builds)
+# STAGE 3: Run Database Migrations (before deploy)
# ===========================================
migrate:
image: code.cannabrands.app/creationshop/dispensary-scraper:${CI_COMMIT_SHA:0:8}
environment:
CANNAIQ_DB_HOST:
from_secret: db_host
CANNAIQ_DB_PORT:
from_secret: db_port
CANNAIQ_DB_NAME:
from_secret: db_name
CANNAIQ_DB_USER:
from_secret: db_user
CANNAIQ_DB_PASS:
from_secret: db_pass
commands:
- cd /app
- node dist/db/migrate.js
depends_on:
- docker-backend
when:
branch: master
event: push
# ===========================================
# STAGE 4: Deploy (after migrations)
# ===========================================
deploy:
image: bitnami/kubectl:latest
@@ -178,7 +204,7 @@ steps:
- kubectl rollout status deployment/scraper -n dispensary-scraper --timeout=300s
- kubectl rollout status deployment/cannaiq-frontend -n dispensary-scraper --timeout=120s
depends_on:
-  - docker-backend
+  - migrate
- docker-cannaiq
- docker-findadispo
- docker-findagram

View File

@@ -119,7 +119,42 @@ npx tsx src/db/migrate.ts
- Importing it at runtime causes startup crashes if env vars aren't perfect
- `pool.ts` uses lazy initialization - only validates when first query is made
-### 6. LOCAL DEVELOPMENT BY DEFAULT
+### 6. ALL API ROUTES REQUIRE AUTHENTICATION — NO EXCEPTIONS
**Every API router MUST apply `authMiddleware` at the router level.**
```typescript
import { authMiddleware } from '../auth/middleware';
const router = Router();
router.use(authMiddleware); // REQUIRED - first line after router creation
```
**Authentication flow (see `src/auth/middleware.ts`):**
1. Check Bearer token (JWT or API token) → grant access if valid
2. Check trusted origins (cannaiq.co, findadispo.com, localhost, etc.) → grant access
3. Check trusted IPs (127.0.0.1, ::1, internal pod IPs) → grant access
4. **Return 401 Unauthorized** if none of the above
**NEVER create API routes without auth middleware:**
- No "public" endpoints that bypass authentication
- No "read-only" exceptions
- No "analytics-only" exceptions
- If an endpoint exists under `/api/*`, it MUST be protected
**When creating new route files:**
1. Import `authMiddleware` from `../auth/middleware`
2. Add `router.use(authMiddleware)` immediately after creating the router
3. Document security requirements in file header comments
**Trusted origins (defined in middleware):**
- `https://cannaiq.co`
- `https://findadispo.com`
- `https://findagram.co`
- `*.cannabrands.app` domains
- `localhost:*` for development
### 7. LOCAL DEVELOPMENT BY DEFAULT
**Quick Start:**
```bash
@@ -452,6 +487,7 @@ const result = await pool.query(`
16. **Running `lsof -ti:PORT | xargs kill`** or similar process-killing commands
17. **Using hardcoded database names** in code or comments
18. **Creating or connecting to a second database**
19. **Creating API routes without authMiddleware** (all `/api/*` routes MUST be protected)
---

View File

@@ -25,8 +25,9 @@ ENV APP_GIT_SHA=${APP_GIT_SHA}
ENV APP_BUILD_TIME=${APP_BUILD_TIME}
ENV CONTAINER_IMAGE_TAG=${CONTAINER_IMAGE_TAG}
-# Install Chromium dependencies
+# Install Chromium dependencies and curl for HTTP requests
RUN apt-get update && apt-get install -y \
curl \
chromium \
fonts-liberation \
libnss3 \

View File

@@ -500,17 +500,18 @@ CREATE TABLE proxies (
Proxies are mandatory. There is no environment variable to disable them. Workers will refuse to start without active proxies in the database.
-### Fingerprints Available
+### User-Agent Generation
-The client includes 6 browser fingerprints:
-- Chrome 131 on Windows
-- Chrome 131 on macOS
-- Chrome 120 on Windows
-- Firefox 133 on Windows
-- Safari 17.2 on macOS
-- Edge 131 on Windows
+See `workflow-12102025.md` for full specification.
-Each includes proper `sec-ch-ua`, `sec-ch-ua-platform`, and `sec-ch-ua-mobile` headers.
+**Summary:**
+- Uses `intoli/user-agents` library (daily-updated market share data)
+- Device distribution: Mobile 62%, Desktop 36%, Tablet 2%
+- Browser whitelist: Chrome, Safari, Edge, Firefox only
+- UA sticks until IP rotates (403 or manual rotation)
+- Failure = alert admin + stop crawl (no fallback)
+Each fingerprint includes proper `sec-ch-ua`, `sec-ch-ua-platform`, and `sec-ch-ua-mobile` headers.
---

View File

@@ -0,0 +1,584 @@
# Task Workflow Documentation
**Date: 2024-12-10**
This document describes the complete task/job processing architecture after the 2024-12-10 rewrite.
---
## Complete Architecture
```
┌─────────────────────────────────────────────────────────────────────────────────┐
│ KUBERNETES CLUSTER │
├─────────────────────────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────────────────────────────────────────────────────────────────┐ │
│ │ API SERVER POD (scraper) │ │
│ │ │ │
│ │ ┌──────────────────┐ ┌────────────────────────────────────────┐ │ │
│ │ │ Express API │ │ TaskScheduler │ │ │
│ │ │ │ │ (src/services/task-scheduler.ts) │ │ │
│ │ │ /api/job-queue │ │ │ │ │
│ │ │ /api/tasks │ │ • Polls every 60s │ │ │
│ │ │ /api/schedules │ │ • Checks task_schedules table │ │ │
│ │ └────────┬─────────┘ │ • SELECT FOR UPDATE SKIP LOCKED │ │ │
│ │ │ │ • Generates tasks when due │ │ │
│ │ │ └──────────────────┬─────────────────────┘ │ │
│ │ │ │ │ │
│ └────────────┼──────────────────────────────────┼──────────────────────────┘ │
│ │ │ │
│ │ ┌────────────────────────┘ │
│ │ │ │
│ ▼ ▼ │
│ ┌─────────────────────────────────────────────────────────────────────────┐ │
│ │ POSTGRESQL DATABASE │ │
│ │ │ │
│ │ ┌─────────────────────┐ ┌─────────────────────┐ │ │
│ │ │ task_schedules │ │ worker_tasks │ │ │
│ │ │ │ │ │ │ │
│ │ │ • product_refresh │───────►│ • pending tasks │ │ │
│ │ │ • store_discovery │ create │ • claimed tasks │ │ │
│ │ │ • analytics_refresh │ tasks │ • running tasks │ │ │
│ │ │ │ │ • completed tasks │ │ │
│ │ │ next_run_at │ │ │ │ │
│ │ │ last_run_at │ │ role, dispensary_id │ │ │
│ │ │ interval_hours │ │ priority, status │ │ │
│ │ └─────────────────────┘ └──────────┬──────────┘ │ │
│ │ │ │ │
│ └─────────────────────────────────────────────┼────────────────────────────┘ │
│ │ │
│ ┌──────────────────────┘ │
│ │ Workers poll for tasks │
│ │ (SELECT FOR UPDATE SKIP LOCKED) │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────────────────┐ │
│ │ WORKER PODS (StatefulSet: scraper-worker) │ │
│ │ │ │
│ │ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │ │
│ │ │ Worker 0 │ │ Worker 1 │ │ Worker 2 │ │ Worker N │ │ │
│ │ │ │ │ │ │ │ │ │ │ │
│ │ │ task-worker │ │ task-worker │ │ task-worker │ │ task-worker │ │ │
│ │ │ .ts │ │ .ts │ │ .ts │ │ .ts │ │ │
│ │ └─────────────┘ └─────────────┘ └─────────────┘ └─────────────┘ │ │
│ │ │ │
│ └──────────────────────────────────────────────────────────────────────────┘ │
│ │
└──────────────────────────────────────────────────────────────────────────────────┘
```
---
## Startup Sequence
```
┌─────────────────────────────────────────────────────────────────────────────┐
│ API SERVER STARTUP │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ 1. Express app initializes │
│ │ │
│ ▼ │
│ 2. runAutoMigrations() │
│ • Runs pending migrations (including 079_task_schedules.sql) │
│ │ │
│ ▼ │
│ 3. initializeMinio() / initializeImageStorage() │
│ │ │
│ ▼ │
│ 4. cleanupOrphanedJobs() │
│ │ │
│ ▼ │
│ 5. taskScheduler.start() ◄─── NEW (per TASK_WORKFLOW_2024-12-10.md) │
│ │ │
│ ├── Recover stale tasks (workers that died) │
│ ├── Ensure default schedules exist in task_schedules │
│ ├── Check and run any due schedules immediately │
│ └── Start 60-second poll interval │
│ │ │
│ ▼ │
│ 6. app.listen(PORT) │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────────────────┐
│ WORKER POD STARTUP │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ 1. K8s starts pod from StatefulSet │
│ │ │
│ ▼ │
│ 2. TaskWorker.constructor() │
│ • Create DB pool │
│ • Create CrawlRotator │
│ │ │
│ ▼ │
│ 3. initializeStealth() │
│ • Load proxies from DB (REQUIRED - fails if none) │
│ • Wire rotator to Dutchie client │
│ │ │
│ ▼ │
│ 4. register() with API │
│ • Optional - continues if fails │
│ │ │
│ ▼ │
│ 5. startRegistryHeartbeat() every 30s │
│ │ │
│ ▼ │
│ 6. processNextTask() loop │
│ │ │
│ ├── Poll for pending task (FOR UPDATE SKIP LOCKED) │
│ ├── Claim task atomically │
│ ├── Execute handler (product_refresh, store_discovery, etc.) │
│ ├── Mark complete/failed │
│ ├── Chain next task if applicable │
│ └── Loop │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
```
---
## Schedule Flow
```
┌─────────────────────────────────────────────────────────────────────────────┐
│ SCHEDULER POLL (every 60 seconds) │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ BEGIN TRANSACTION │
│ │ │
│ ▼ │
│ SELECT * FROM task_schedules │
│ WHERE enabled = true AND next_run_at <= NOW() │
│ FOR UPDATE SKIP LOCKED ◄─── Prevents duplicate execution across replicas │
│ │ │
│ ▼ │
│ For each due schedule: │
│ │ │
│ ├── product_refresh_all │
│ │ └─► Query dispensaries needing crawl │
│ │ └─► Create product_refresh tasks in worker_tasks │
│ │ │
│ ├── store_discovery_dutchie │
│ │ └─► Create single store_discovery task │
│ │ │
│ └── analytics_refresh │
│ └─► Create single analytics_refresh task │
│ │ │
│ ▼ │
│ UPDATE task_schedules SET │
│ last_run_at = NOW(), │
│ next_run_at = NOW() + interval_hours │
│ │ │
│ ▼ │
│ COMMIT │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
```
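The poll loop above reduces to a due-check plus advancing `next_run_at`. A minimal sketch of that pure logic (illustrative only — `findDueSchedules` and the field names are assumptions; the real `src/services/task-scheduler.ts` runs this inside the `SELECT ... FOR UPDATE SKIP LOCKED` transaction shown in the diagram):

```typescript
// Sketch of the scheduler's due-check (hypothetical names, not the real API).
interface Schedule {
  name: string;
  enabled: boolean;
  intervalHours: number;
  nextRunAt: Date | null;
}

// Return schedules that are due now, each with its advanced next_run_at.
function findDueSchedules(
  schedules: Schedule[],
  now: Date
): Array<{ name: string; nextRunAt: Date }> {
  return schedules
    .filter((s) => s.enabled && s.nextRunAt !== null && s.nextRunAt <= now)
    .map((s) => ({
      name: s.name,
      // Mirrors: UPDATE task_schedules SET next_run_at = NOW() + interval_hours
      nextRunAt: new Date(now.getTime() + s.intervalHours * 3600 * 1000),
    }));
}
```

The `SKIP LOCKED` part matters only when multiple API replicas poll concurrently; the due-check itself is just this comparison.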
---
## Task Lifecycle
```
┌──────────┐
│ SCHEDULE │
│ DUE │
└────┬─────┘
┌──────────────┐ claim ┌──────────────┐ start ┌──────────────┐
│ PENDING │────────────►│ CLAIMED │────────────►│ RUNNING │
└──────────────┘ └──────────────┘ └──────┬───────┘
▲ │
│ ┌──────────────┼──────────────┐
│ retry │ │ │
│ (if retries < max) ▼ ▼ ▼
│ ┌──────────┐ ┌──────────┐ ┌──────────┐
└──────────────────────────────────│ FAILED │ │ COMPLETED│ │ STALE │
└──────────┘ └──────────┘ └────┬─────┘
recover_stale_tasks()
┌──────────┐
│ PENDING │
└──────────┘
```
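The STALE branch above can be sketched as a pure decision function (a sketch under assumptions: the function name and the 5-minute staleness threshold are hypothetical; the actual `recover_stale_tasks()` logic may differ):

```typescript
// Sketch of stale-task recovery: a running/claimed task whose worker stopped
// heartbeating is either requeued (retries remain) or marked failed.
interface TaskRow {
  status: 'pending' | 'claimed' | 'running' | 'completed' | 'failed';
  lastHeartbeatAt: Date | null;
  retryCount: number;
  maxRetries: number;
}

type Recovery = 'requeue' | 'fail' | 'leave';

function recoverStale(task: TaskRow, now: Date, staleAfterMs = 5 * 60 * 1000): Recovery {
  const active = task.status === 'running' || task.status === 'claimed';
  if (!active) return 'leave';
  const beat = task.lastHeartbeatAt?.getTime() ?? 0;
  if (now.getTime() - beat < staleAfterMs) return 'leave'; // heartbeat is fresh
  // Stale: back to PENDING if retries remain, otherwise FAILED.
  return task.retryCount < task.maxRetries ? 'requeue' : 'fail';
}
```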
---
## Database Tables
### task_schedules (NEW - migration 079)
Stores schedule definitions. Survives restarts.
```sql
CREATE TABLE task_schedules (
id SERIAL PRIMARY KEY,
name VARCHAR(100) NOT NULL UNIQUE,
role VARCHAR(50) NOT NULL, -- product_refresh, store_discovery, etc.
enabled BOOLEAN DEFAULT TRUE,
interval_hours INTEGER NOT NULL, -- How often to run
priority INTEGER DEFAULT 0, -- Task priority when created
state_code VARCHAR(2), -- Optional filter
last_run_at TIMESTAMPTZ, -- When it last ran
next_run_at TIMESTAMPTZ, -- When it's due next
last_task_count INTEGER, -- Tasks created last run
last_error TEXT -- Error message if failed
);
```
### worker_tasks (migration 074)
The task queue. Workers pull from here.
```sql
CREATE TABLE worker_tasks (
id SERIAL PRIMARY KEY,
role task_role NOT NULL, -- What type of work
dispensary_id INTEGER, -- Which store (if applicable)
platform VARCHAR(50), -- Which platform
status task_status DEFAULT 'pending',
priority INTEGER DEFAULT 0, -- Higher = process first
scheduled_for TIMESTAMP, -- Don't process before this time
worker_id VARCHAR(100), -- Which worker claimed it
claimed_at TIMESTAMP,
started_at TIMESTAMP,
completed_at TIMESTAMP,
last_heartbeat_at TIMESTAMP, -- For stale detection
result JSONB,
error_message TEXT,
retry_count INTEGER DEFAULT 0,
max_retries INTEGER DEFAULT 3
);
```
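Workers claim rows from this table atomically. A claim query of roughly this shape would match the lifecycle above (a sketch only — the actual query lives in `task-service.ts` and may order or filter differently):

```sql
-- Sketch: claim the highest-priority pending task without blocking on
-- rows other workers are claiming (SKIP LOCKED).
UPDATE worker_tasks SET
  status = 'claimed',
  worker_id = $1,
  claimed_at = NOW()
WHERE id = (
  SELECT id FROM worker_tasks
  WHERE status = 'pending'
    AND (scheduled_for IS NULL OR scheduled_for <= NOW())
  ORDER BY priority DESC, id
  LIMIT 1
  FOR UPDATE SKIP LOCKED
)
RETURNING *;
```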
---
## Default Schedules
| Name | Role | Interval | Priority | Description |
|------|------|----------|----------|-------------|
| `payload_fetch_all` | payload_fetch | 4 hours | 0 | Fetch payloads from Dutchie API (chains to product_refresh) |
| `store_discovery_dutchie` | store_discovery | 24 hours | 5 | Find new Dutchie stores |
| `analytics_refresh` | analytics_refresh | 6 hours | 0 | Refresh materialized views |
---
## Task Roles
| Role | Description | Creates Tasks For |
|------|-------------|-------------------|
| `payload_fetch` | **NEW** - Fetch from Dutchie API, save to disk | Each dispensary needing crawl |
| `product_refresh` | **CHANGED** - Read local payload, normalize, upsert to DB | Chained from payload_fetch |
| `store_discovery` | Find new dispensaries, returns newStoreIds[] | Single task per platform |
| `entry_point_discovery` | **DEPRECATED** - Resolve platform IDs | No longer used |
| `product_discovery` | Initial product fetch for new stores | Chained from store_discovery |
| `analytics_refresh` | Refresh materialized views | Single global task |
### Payload/Refresh Separation (2024-12-10)
The crawl workflow is now split into two phases:
```
payload_fetch (scheduled every 4h)
└─► Hit Dutchie GraphQL API
└─► Save raw JSON to /storage/payloads/{year}/{month}/{day}/store_{id}_{ts}.json.gz
└─► Record metadata in raw_crawl_payloads table
└─► Queue product_refresh task with payload_id
product_refresh (chained from payload_fetch)
└─► Load payload from filesystem (NOT from API)
└─► Normalize via DutchieNormalizer
└─► Upsert to store_products
└─► Create snapshots
└─► Track missing products
└─► Download images
```
**Benefits:**
- **Retry-friendly**: If normalize fails, re-run product_refresh without re-crawling
- **Replay-able**: Run product_refresh against any historical payload
- **Faster refreshes**: Local file read vs network call
- **Historical diffs**: Compare payloads to see what changed between crawls
- **Less API pressure**: Only payload_fetch hits Dutchie
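The save/load side of this split can be sketched with Node's `zlib` (illustrative; `payloadPath`, `compressPayload`, and `decompressPayload` are assumed names, not the actual `src/utils/payload-storage.ts` API — only the path scheme comes from the doc):

```typescript
import { gzipSync, gunzipSync } from 'zlib';

// Path scheme from above: /storage/payloads/{year}/{month}/{day}/store_{id}_{ts}.json.gz
function payloadPath(dispensaryId: number, ts: Date, root = '/storage/payloads'): string {
  const y = ts.getUTCFullYear();
  const m = String(ts.getUTCMonth() + 1).padStart(2, '0');
  const d = String(ts.getUTCDate()).padStart(2, '0');
  return `${root}/${y}/${m}/${d}/store_${dispensaryId}_${ts.getTime()}.json.gz`;
}

// payload_fetch side: raw API JSON -> gzipped bytes for disk.
function compressPayload(payload: unknown): Buffer {
  return gzipSync(JSON.stringify(payload));
}

// product_refresh side: gzipped bytes from disk -> JSON (no API call).
function decompressPayload(buf: Buffer): unknown {
  return JSON.parse(gunzipSync(buf).toString('utf8'));
}
```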
---
## Task Chaining
Tasks automatically queue follow-up tasks upon successful completion. This creates two main flows:
### Discovery Flow (New Stores)
When `store_discovery` finds new dispensaries, they automatically get their initial product data:
```
store_discovery
└─► Discovers new locations via Dutchie GraphQL
└─► Auto-promotes valid locations to dispensaries table
└─► Collects newDispensaryIds[] from promotions
└─► Returns { newStoreIds: [...] } in result
chainNextTask() detects newStoreIds
└─► Creates product_discovery task for each new store
product_discovery
└─► Calls handlePayloadFetch() internally
└─► payload_fetch hits Dutchie API
└─► Saves raw JSON to /storage/payloads/
└─► Queues product_refresh task with payload_id
product_refresh
└─► Loads payload from filesystem
└─► Normalizes and upserts to store_products
└─► Creates snapshots, downloads images
```
**Complete Discovery Chain:**
```
store_discovery → product_discovery → payload_fetch → product_refresh
(internal call) (queues next)
```
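The external chaining step in the flows above can be sketched as a pure function over a completed task's result (hypothetical shapes — the real `chainNextTask()` in `task-service.ts` may differ):

```typescript
// Sketch of chainNextTask(): map a completed task's result to follow-up tasks.
interface CompletedTask {
  role: string;
  result?: { newStoreIds?: number[]; payload_id?: number };
}
interface NextTask {
  role: string;
  dispensaryId?: number;
  payload?: { payload_id: number };
}

function chainNextTask(task: CompletedTask): NextTask[] {
  // store_discovery -> one product_discovery per newly promoted dispensary
  if (task.role === 'store_discovery' && task.result?.newStoreIds?.length) {
    return task.result.newStoreIds.map((id) => ({
      role: 'product_discovery',
      dispensaryId: id,
    }));
  }
  // payload_fetch -> product_refresh pointing at the saved payload
  if (task.role === 'payload_fetch' && task.result?.payload_id !== undefined) {
    return [{ role: 'product_refresh', payload: { payload_id: task.result.payload_id } }];
  }
  return []; // terminal roles chain nothing
}
```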
### Scheduled Flow (Existing Stores)
For existing stores, the `payload_fetch_all` schedule runs every 4 hours:
```
TaskScheduler (every 60s)
└─► Checks task_schedules for due schedules
└─► payload_fetch_all is due
└─► Generates payload_fetch task for each dispensary
payload_fetch
└─► Hits Dutchie GraphQL API
└─► Saves raw JSON to /storage/payloads/
└─► Queues product_refresh task with payload_id
product_refresh
└─► Loads payload from filesystem (NOT API)
└─► Normalizes via DutchieNormalizer
└─► Upserts to store_products
└─► Creates snapshots
```
**Complete Scheduled Chain:**
```
payload_fetch → product_refresh
(queues) (reads local)
```
### Chaining Implementation
Task chaining is handled in three places:
1. **Internal chaining (handler calls handler):**
- `product_discovery` calls `handlePayloadFetch()` directly
2. **External chaining (chainNextTask() in task-service.ts):**
- Called after task completion
- `store_discovery` → queues `product_discovery` for each newStoreId
3. **Queue-based chaining (taskService.createTask):**
- `payload_fetch` queues `product_refresh` with `payload: { payload_id }`
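The external and queue-based cases above can be sketched together as a pure dispatch function (the `TaskResult` shape and helper names are assumptions for illustration; `chainNextTask()` in `task-service.ts` is the real implementation):

```typescript
// Illustrative shapes, not the actual task-service.ts types
interface TaskResult { newStoreIds?: number[]; payloadId?: number }
interface QueuedTask { role: string; dispensaryId?: number; payload?: Record<string, unknown> }

export function chainNextTask(role: string, result: TaskResult): QueuedTask[] {
  const next: QueuedTask[] = [];
  if (role === "store_discovery" && result.newStoreIds?.length) {
    // One product_discovery task per newly promoted store
    for (const id of result.newStoreIds) {
      next.push({ role: "product_discovery", dispensaryId: id });
    }
  }
  if (role === "payload_fetch" && result.payloadId != null) {
    // Hand the saved payload off to product_refresh via the payload column
    next.push({ role: "product_refresh", payload: { payload_id: result.payloadId } });
  }
  return next;
}
```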
---
## Payload API Endpoints
Raw crawl payloads can be accessed via the Payloads API:
| Endpoint | Method | Description |
|----------|--------|-------------|
| `/api/payloads` | GET | List payload metadata (paginated) |
| `/api/payloads/:id` | GET | Get payload metadata by ID |
| `/api/payloads/:id/data` | GET | Get full payload JSON (decompressed) |
| `/api/payloads/store/:dispensaryId` | GET | List payloads for a store |
| `/api/payloads/store/:dispensaryId/latest` | GET | Get latest payload for a store |
| `/api/payloads/store/:dispensaryId/diff` | GET | Diff two payloads for changes |
### Payload Diff Response
The diff endpoint returns:
```json
{
  "success": true,
  "from": { "id": 123, "fetchedAt": "...", "productCount": 100 },
  "to": { "id": 456, "fetchedAt": "...", "productCount": 105 },
  "diff": {
    "added": 10,
    "removed": 5,
    "priceChanges": 8,
    "stockChanges": 12
  },
  "details": {
    "added": [...],
    "removed": [...],
    "priceChanges": [...],
    "stockChanges": [...]
  }
}
```
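The counts in `diff` can be derived by keying both payloads' products by id. A hedged sketch, assuming a simplified product shape (the real payloads carry full Dutchie records):

```typescript
// Simplified product shape for illustration only
interface Product { id: string; price: number; inStock: boolean }

export function diffPayloads(from: Product[], to: Product[]) {
  const fromById = new Map(from.map((p): [string, Product] => [p.id, p]));
  const toById = new Map(to.map((p): [string, Product] => [p.id, p]));
  let added = 0, removed = 0, priceChanges = 0, stockChanges = 0;
  for (const p of to) {
    const prev = fromById.get(p.id);
    if (!prev) { added++; continue; } // present in "to" but not "from"
    if (prev.price !== p.price) priceChanges++;
    if (prev.inStock !== p.inStock) stockChanges++;
  }
  for (const p of from) {
    if (!toById.has(p.id)) removed++; // present in "from" but not "to"
  }
  return { added, removed, priceChanges, stockChanges };
}
```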
---
## API Endpoints
### Schedules (NEW)
| Endpoint | Method | Description |
|----------|--------|-------------|
| `/api/schedules` | GET | List all schedules |
| `/api/schedules/:id` | PUT | Update schedule |
| `/api/schedules/:id/trigger` | POST | Run schedule immediately |
### Task Creation (rewired 2024-12-10)
| Endpoint | Method | Description |
|----------|--------|-------------|
| `/api/job-queue/enqueue` | POST | Create single task |
| `/api/job-queue/enqueue-batch` | POST | Create batch tasks |
| `/api/job-queue/enqueue-state` | POST | Create tasks for state |
| `/api/tasks` | POST | Direct task creation |
### Task Management
| Endpoint | Method | Description |
|----------|--------|-------------|
| `/api/tasks` | GET | List tasks |
| `/api/tasks/:id` | GET | Get single task |
| `/api/tasks/counts` | GET | Task counts by status |
| `/api/tasks/recover-stale` | POST | Recover stale tasks |
---
## Key Files
| File | Purpose |
|------|---------|
| `src/services/task-scheduler.ts` | **NEW** - DB-driven scheduler |
| `src/tasks/task-worker.ts` | Worker that processes tasks |
| `src/tasks/task-service.ts` | Task CRUD operations |
| `src/tasks/handlers/payload-fetch.ts` | **NEW** - Fetches from API, saves to disk |
| `src/tasks/handlers/product-refresh.ts` | **CHANGED** - Reads from disk, processes to DB |
| `src/utils/payload-storage.ts` | **NEW** - Payload save/load utilities |
| `src/routes/tasks.ts` | Task API endpoints |
| `src/routes/job-queue.ts` | Job Queue UI endpoints (rewired) |
| `migrations/079_task_schedules.sql` | Schedule table |
| `migrations/080_raw_crawl_payloads.sql` | Payload metadata table |
| `migrations/081_payload_fetch_columns.sql` | payload, last_fetch_at columns |
| `migrations/074_worker_task_queue.sql` | Task queue table |
---
## Legacy Code (DEPRECATED)
| File | Status | Replacement |
|------|--------|-------------|
| `src/services/scheduler.ts` | DEPRECATED | `task-scheduler.ts` |
| `dispensary_crawl_jobs` table | ORPHANED | `worker_tasks` |
| `job_schedules` table | LEGACY | `task_schedules` |
---
## Dashboard Integration
Both pages remain wired to the dashboard:
| Page | Data Source | Actions |
|------|-------------|---------|
| **Job Queue** | `worker_tasks`, `task_schedules` | Create tasks, view schedules |
| **Task Queue** | `worker_tasks` | View tasks, recover stale |
---
## Multi-Replica Safety
The scheduler is safe to run across multiple API replicas:
1. **Only one replica** executes a given schedule at a time (via `SELECT FOR UPDATE SKIP LOCKED`)
2. **No duplicate tasks** are created
3. **Survives pod restarts** - state lives in the DB, not in memory
4. **Self-healing** - stale tasks are recovered on startup
```sql
-- This query is atomic across all API server replicas
SELECT * FROM task_schedules
WHERE enabled = true AND next_run_at <= NOW()
FOR UPDATE SKIP LOCKED
```
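The claim-and-advance step can be sketched as a single transaction. The table and columns come from migration 079; the minimal `SqlClient` interface is an assumption so the sketch does not depend on a live `pg` pool (a real `PoolClient` satisfies it):

```typescript
// Anything with a query() method works; pg's PoolClient matches this shape
interface SqlClient {
  query(sql: string, params?: unknown[]): Promise<{ rows: any[] }>;
}

export async function claimDueSchedules(client: SqlClient): Promise<number> {
  await client.query("BEGIN");
  try {
    // SKIP LOCKED: schedules already claimed by another replica are skipped
    const { rows } = await client.query(
      `SELECT * FROM task_schedules
       WHERE enabled = true AND next_run_at <= NOW()
       FOR UPDATE SKIP LOCKED`
    );
    for (const schedule of rows) {
      // ...create worker_tasks rows for this schedule here...
      await client.query(
        `UPDATE task_schedules
         SET last_run_at = NOW(),
             next_run_at = NOW() + make_interval(hours => interval_hours)
         WHERE id = $1`,
        [schedule.id]
      );
    }
    await client.query("COMMIT");
    return rows.length;
  } catch (err) {
    await client.query("ROLLBACK");
    throw err;
  }
}
```

Because `next_run_at` is advanced inside the same transaction that locks the row, a crashed replica simply releases its locks and another replica picks the schedule up on the next 60s poll.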
---
## Worker Scaling (K8s)
Workers run as a StatefulSet in Kubernetes. You can scale from the admin UI or CLI.
### From Admin UI
The Workers page (`/admin/workers`) provides:
- Current replica count display
- Scale up/down buttons
- Target replica input
### API Endpoints
| Endpoint | Method | Description |
|----------|--------|-------------|
| `/api/workers/k8s/replicas` | GET | Get current/desired replica counts |
| `/api/workers/k8s/scale` | POST | Scale to N replicas (body: `{ "replicas": N }`) |
### From CLI
```bash
# View current replicas
kubectl get statefulset scraper-worker -n dispensary-scraper
# Scale to 10 workers
kubectl scale statefulset scraper-worker -n dispensary-scraper --replicas=10
# Scale down to 3 workers
kubectl scale statefulset scraper-worker -n dispensary-scraper --replicas=3
```
### Configuration
Environment variables for the API server:
| Variable | Default | Description |
|----------|---------|-------------|
| `K8S_NAMESPACE` | `dispensary-scraper` | Kubernetes namespace |
| `K8S_WORKER_STATEFULSET` | `scraper-worker` | StatefulSet name |
### RBAC Requirements
The API server pod needs these K8s permissions:
```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: worker-scaler
  namespace: dispensary-scraper
rules:
  - apiGroups: ["apps"]
    resources: ["statefulsets"]
    verbs: ["get", "patch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: scraper-worker-scaler
  namespace: dispensary-scraper
subjects:
  - kind: ServiceAccount
    name: default
    namespace: dispensary-scraper
roleRef:
  kind: Role
  name: worker-scaler
  apiGroup: rbac.authorization.k8s.io
```


@@ -0,0 +1,8 @@
-- Migration 078: Add consecutive_403_count to proxies table
-- Per workflow-12102025.md: Track consecutive 403s per proxy
-- After 3 consecutive 403s with different fingerprints → disable proxy
ALTER TABLE proxies ADD COLUMN IF NOT EXISTS consecutive_403_count INTEGER DEFAULT 0;
-- Add comment explaining the column
COMMENT ON COLUMN proxies.consecutive_403_count IS 'Tracks consecutive 403 blocks. Reset to 0 on success. Proxy disabled at 3.';
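-- Illustrative (not part of this migration): a worker could record a 403 and
-- disable the proxy at the threshold in a single statement, e.g.:
-- UPDATE proxies
--   SET consecutive_403_count = consecutive_403_count + 1,
--       active = (consecutive_403_count + 1 < 3)
--   WHERE id = $1;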


@@ -0,0 +1,49 @@
-- Migration 079: Task Schedules for Database-Driven Scheduler
-- Per TASK_WORKFLOW_2024-12-10.md: Replaces node-cron with DB-driven scheduling
--
-- 2024-12-10: Created for reliable, multi-replica-safe task scheduling
-- task_schedules: Stores schedule definitions and state
CREATE TABLE IF NOT EXISTS task_schedules (
  id SERIAL PRIMARY KEY,
  name VARCHAR(100) NOT NULL UNIQUE,
  role VARCHAR(50) NOT NULL, -- TaskRole: product_refresh, store_discovery, etc.
  description TEXT,

  -- Schedule configuration
  enabled BOOLEAN DEFAULT TRUE,
  interval_hours INTEGER NOT NULL DEFAULT 4,
  priority INTEGER DEFAULT 0,

  -- Optional scope filters
  state_code VARCHAR(2), -- NULL = all states
  platform VARCHAR(50), -- NULL = all platforms

  -- Execution state (updated by scheduler)
  last_run_at TIMESTAMPTZ,
  next_run_at TIMESTAMPTZ,
  last_task_count INTEGER DEFAULT 0,
  last_error TEXT,

  created_at TIMESTAMPTZ DEFAULT NOW(),
  updated_at TIMESTAMPTZ DEFAULT NOW()
);
-- Indexes for scheduler queries
CREATE INDEX IF NOT EXISTS idx_task_schedules_enabled ON task_schedules(enabled) WHERE enabled = TRUE;
CREATE INDEX IF NOT EXISTS idx_task_schedules_next_run ON task_schedules(next_run_at) WHERE enabled = TRUE;
-- Insert default schedules
INSERT INTO task_schedules (name, role, interval_hours, priority, description, next_run_at)
VALUES
('product_refresh_all', 'product_refresh', 4, 0, 'Generate product refresh tasks for all crawl-enabled stores every 4 hours', NOW()),
('store_discovery_dutchie', 'store_discovery', 24, 5, 'Discover new Dutchie stores daily', NOW()),
('analytics_refresh', 'analytics_refresh', 6, 0, 'Refresh analytics materialized views every 6 hours', NOW())
ON CONFLICT (name) DO NOTHING;
-- Comment for documentation
COMMENT ON TABLE task_schedules IS 'Database-driven task scheduler configuration. Per TASK_WORKFLOW_2024-12-10.md:
- Schedules persist in DB (survive restarts)
- Uses SELECT FOR UPDATE SKIP LOCKED for multi-replica safety
- Scheduler polls every 60s and executes due schedules
- Creates tasks in worker_tasks for task-worker.ts to process';


@@ -0,0 +1,58 @@
-- Migration 080: Raw Crawl Payloads Metadata Table
-- Per TASK_WORKFLOW_2024-12-10.md: Store full GraphQL payloads for historical analysis
--
-- Design Pattern: Metadata/Payload Separation
-- - Metadata (this table): Small, indexed, queryable
-- - Payload (filesystem): Gzipped JSON at storage_path
--
-- Benefits:
-- - Compare any two crawls to see what changed
-- - Replay/re-normalize historical data if logic changes
-- - Debug issues by seeing exactly what the API returned
-- - DB stays small, backups stay fast
--
-- Storage location: /storage/payloads/{year}/{month}/{day}/store_{id}_{timestamp}.json.gz
-- Compression: ~90% reduction (1.5MB -> 150KB per crawl)
CREATE TABLE IF NOT EXISTS raw_crawl_payloads (
  id SERIAL PRIMARY KEY,

  -- Links to crawl tracking
  crawl_run_id INTEGER REFERENCES crawl_runs(id) ON DELETE SET NULL,
  dispensary_id INTEGER NOT NULL REFERENCES dispensaries(id) ON DELETE CASCADE,

  -- File location (gzipped JSON)
  storage_path TEXT NOT NULL,

  -- Metadata for quick queries without loading file
  product_count INTEGER NOT NULL DEFAULT 0,
  size_bytes INTEGER, -- Compressed size
  size_bytes_raw INTEGER, -- Uncompressed size

  -- Timestamps
  fetched_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
  created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),

  -- Optional: checksum for integrity verification
  checksum_sha256 VARCHAR(64)
);
-- Indexes for common queries
CREATE INDEX IF NOT EXISTS idx_raw_crawl_payloads_dispensary
ON raw_crawl_payloads(dispensary_id);
CREATE INDEX IF NOT EXISTS idx_raw_crawl_payloads_dispensary_fetched
ON raw_crawl_payloads(dispensary_id, fetched_at DESC);
CREATE INDEX IF NOT EXISTS idx_raw_crawl_payloads_fetched
ON raw_crawl_payloads(fetched_at DESC);
CREATE INDEX IF NOT EXISTS idx_raw_crawl_payloads_crawl_run
ON raw_crawl_payloads(crawl_run_id)
WHERE crawl_run_id IS NOT NULL;
-- Comments
COMMENT ON TABLE raw_crawl_payloads IS 'Metadata for raw GraphQL payloads stored on filesystem. Per TASK_WORKFLOW_2024-12-10.md: Full payloads enable historical diffs and replay.';
COMMENT ON COLUMN raw_crawl_payloads.storage_path IS 'Path to gzipped JSON file, e.g. /storage/payloads/2024/12/10/store_123_1702234567.json.gz';
COMMENT ON COLUMN raw_crawl_payloads.size_bytes IS 'Compressed file size in bytes';
COMMENT ON COLUMN raw_crawl_payloads.size_bytes_raw IS 'Uncompressed payload size in bytes';
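-- Illustrative (not part of this migration): fetching the latest payload for a
-- store, served by idx_raw_crawl_payloads_dispensary_fetched:
-- SELECT id, storage_path, product_count, fetched_at
--   FROM raw_crawl_payloads
--   WHERE dispensary_id = 123
--   ORDER BY fetched_at DESC
--   LIMIT 1;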


@@ -0,0 +1,37 @@
-- Migration 081: Payload Fetch Columns
-- Per TASK_WORKFLOW_2024-12-10.md: Separates API fetch from data processing
--
-- New architecture:
-- - payload_fetch: Hits Dutchie API, saves raw payload to disk
-- - product_refresh: Reads local payload, normalizes, upserts to DB
--
-- This migration adds:
-- 1. payload column to worker_tasks (for task chaining data)
-- 2. processed_at column to raw_crawl_payloads (track when payload was processed)
-- 3. last_fetch_at column to dispensaries (track when last payload was fetched)
-- Add payload column to worker_tasks for task chaining
-- Used by payload_fetch to pass payload_id to product_refresh
ALTER TABLE worker_tasks
ADD COLUMN IF NOT EXISTS payload JSONB DEFAULT NULL;
COMMENT ON COLUMN worker_tasks.payload IS 'Per TASK_WORKFLOW_2024-12-10.md: Task chaining data (e.g., payload_id from payload_fetch to product_refresh)';
-- Add processed_at to raw_crawl_payloads
-- Tracks when the payload was processed by product_refresh
ALTER TABLE raw_crawl_payloads
ADD COLUMN IF NOT EXISTS processed_at TIMESTAMPTZ DEFAULT NULL;
COMMENT ON COLUMN raw_crawl_payloads.processed_at IS 'When this payload was processed by product_refresh handler';
-- Index for finding unprocessed payloads
CREATE INDEX IF NOT EXISTS idx_raw_crawl_payloads_unprocessed
ON raw_crawl_payloads(dispensary_id, fetched_at DESC)
WHERE processed_at IS NULL;
-- Add last_fetch_at to dispensaries
-- Tracks when the last payload was fetched (separate from last_crawl_at which is when processing completed)
ALTER TABLE dispensaries
ADD COLUMN IF NOT EXISTS last_fetch_at TIMESTAMPTZ DEFAULT NULL;
COMMENT ON COLUMN dispensaries.last_fetch_at IS 'Per TASK_WORKFLOW_2024-12-10.md: When last payload was fetched from API (separate from last_crawl_at which is when processing completed)';


@@ -0,0 +1,27 @@
-- Migration: 082_proxy_notification_trigger
-- Date: 2024-12-11
-- Description: Add PostgreSQL NOTIFY trigger to alert workers when proxies are added
-- Create function to notify workers when active proxy is added/activated
CREATE OR REPLACE FUNCTION notify_proxy_added()
RETURNS TRIGGER AS $$
BEGIN
  -- Only notify if proxy is active
  IF NEW.active = true THEN
    PERFORM pg_notify('proxy_added', NEW.id::text);
  END IF;
  RETURN NEW;
END;
$$ LANGUAGE plpgsql;
-- Drop existing trigger if any
DROP TRIGGER IF EXISTS proxy_added_trigger ON proxies;
-- Create trigger on insert and update of active column
CREATE TRIGGER proxy_added_trigger
  AFTER INSERT OR UPDATE OF active ON proxies
  FOR EACH ROW
  EXECUTE FUNCTION notify_proxy_added();
COMMENT ON FUNCTION notify_proxy_added() IS
'Sends PostgreSQL NOTIFY to proxy_added channel when an active proxy is added or activated. Workers LISTEN on this channel to wake up immediately.';
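-- Illustrative (not part of this migration; column names other than "active"
-- are assumptions): a worker session subscribes, then any activation fires it:
-- LISTEN proxy_added;
-- INSERT INTO proxies (host, port, active) VALUES ('203.0.113.5', 8080, true);
-- -- notification arrives on channel 'proxy_added' with the new proxy id as payload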


@@ -1,6 +1,6 @@
{
"name": "dutchie-menus-backend",
"version": "1.5.1",
"version": "1.6.0",
"lockfileVersion": 3,
"requires": true,
"packages": {
@@ -46,6 +46,97 @@
"resolved": "https://registry.npmjs.org/@ioredis/commands/-/commands-1.4.0.tgz",
"integrity": "sha512-aFT2yemJJo+TZCmieA7qnYGQooOS7QfNmYrzGtsYd3g9j5iDP8AimYYAesf79ohjbLG12XxC4nG5DyEnC88AsQ=="
},
"node_modules/@jsep-plugin/assignment": {
"version": "1.3.0",
"resolved": "https://registry.npmjs.org/@jsep-plugin/assignment/-/assignment-1.3.0.tgz",
"integrity": "sha512-VVgV+CXrhbMI3aSusQyclHkenWSAm95WaiKrMxRFam3JSUiIaQjoMIw2sEs/OX4XifnqeQUN4DYbJjlA8EfktQ==",
"engines": {
"node": ">= 10.16.0"
},
"peerDependencies": {
"jsep": "^0.4.0||^1.0.0"
}
},
"node_modules/@jsep-plugin/regex": {
"version": "1.0.4",
"resolved": "https://registry.npmjs.org/@jsep-plugin/regex/-/regex-1.0.4.tgz",
"integrity": "sha512-q7qL4Mgjs1vByCaTnDFcBnV9HS7GVPJX5vyVoCgZHNSC9rjwIlmbXG5sUuorR5ndfHAIlJ8pVStxvjXHbNvtUg==",
"engines": {
"node": ">= 10.16.0"
},
"peerDependencies": {
"jsep": "^0.4.0||^1.0.0"
}
},
"node_modules/@kubernetes/client-node": {
"version": "1.4.0",
"resolved": "https://registry.npmjs.org/@kubernetes/client-node/-/client-node-1.4.0.tgz",
"integrity": "sha512-Zge3YvF7DJi264dU1b3wb/GmzR99JhUpqTvp+VGHfwZT+g7EOOYNScDJNZwXy9cszyIGPIs0VHr+kk8e95qqrA==",
"dependencies": {
"@types/js-yaml": "^4.0.1",
"@types/node": "^24.0.0",
"@types/node-fetch": "^2.6.13",
"@types/stream-buffers": "^3.0.3",
"form-data": "^4.0.0",
"hpagent": "^1.2.0",
"isomorphic-ws": "^5.0.0",
"js-yaml": "^4.1.0",
"jsonpath-plus": "^10.3.0",
"node-fetch": "^2.7.0",
"openid-client": "^6.1.3",
"rfc4648": "^1.3.0",
"socks-proxy-agent": "^8.0.4",
"stream-buffers": "^3.0.2",
"tar-fs": "^3.0.9",
"ws": "^8.18.2"
}
},
"node_modules/@kubernetes/client-node/node_modules/@types/node": {
"version": "24.10.3",
"resolved": "https://registry.npmjs.org/@types/node/-/node-24.10.3.tgz",
"integrity": "sha512-gqkrWUsS8hcm0r44yn7/xZeV1ERva/nLgrLxFRUGb7aoNMIJfZJ3AC261zDQuOAKC7MiXai1WCpYc48jAHoShQ==",
"dependencies": {
"undici-types": "~7.16.0"
}
},
"node_modules/@kubernetes/client-node/node_modules/tar-fs": {
"version": "3.1.1",
"resolved": "https://registry.npmjs.org/tar-fs/-/tar-fs-3.1.1.tgz",
"integrity": "sha512-LZA0oaPOc2fVo82Txf3gw+AkEd38szODlptMYejQUhndHMLQ9M059uXR+AfS7DNo0NpINvSqDsvyaCrBVkptWg==",
"dependencies": {
"pump": "^3.0.0",
"tar-stream": "^3.1.5"
},
"optionalDependencies": {
"bare-fs": "^4.0.1",
"bare-path": "^3.0.0"
}
},
"node_modules/@kubernetes/client-node/node_modules/undici-types": {
"version": "7.16.0",
"resolved": "https://registry.npmjs.org/undici-types/-/undici-types-7.16.0.tgz",
"integrity": "sha512-Zz+aZWSj8LE6zoxD+xrjh4VfkIG8Ya6LvYkZqtUQGJPZjYl53ypCaUwWqo7eI0x66KBGeRo+mlBEkMSeSZ38Nw=="
},
"node_modules/@kubernetes/client-node/node_modules/ws": {
"version": "8.18.3",
"resolved": "https://registry.npmjs.org/ws/-/ws-8.18.3.tgz",
"integrity": "sha512-PEIGCY5tSlUt50cqyMXfCzX+oOPqN0vuGqWzbcJ2xvnkzkq46oOpz7dQaTDBdfICb4N14+GARUDw2XV2N4tvzg==",
"engines": {
"node": ">=10.0.0"
},
"peerDependencies": {
"bufferutil": "^4.0.1",
"utf-8-validate": ">=5.0.2"
},
"peerDependenciesMeta": {
"bufferutil": {
"optional": true
},
"utf-8-validate": {
"optional": true
}
}
},
"node_modules/@mapbox/node-pre-gyp": {
"version": "1.0.11",
"resolved": "https://registry.npmjs.org/@mapbox/node-pre-gyp/-/node-pre-gyp-1.0.11.tgz",
@@ -251,6 +342,11 @@
"integrity": "sha512-r8Tayk8HJnX0FztbZN7oVqGccWgw98T/0neJphO91KkmOzug1KkofZURD4UaD5uH8AqcFLfdPErnBod0u71/qg==",
"dev": true
},
"node_modules/@types/js-yaml": {
"version": "4.0.9",
"resolved": "https://registry.npmjs.org/@types/js-yaml/-/js-yaml-4.0.9.tgz",
"integrity": "sha512-k4MGaQl5TGo/iipqb2UDG2UwjXziSWkh0uysQelTlJpX1qGlpUZYm8PnO4DxG1qBomtJUdYJ6qR6xdIah10JLg=="
},
"node_modules/@types/jsonwebtoken": {
"version": "9.0.10",
"resolved": "https://registry.npmjs.org/@types/jsonwebtoken/-/jsonwebtoken-9.0.10.tgz",
@@ -276,7 +372,6 @@
"version": "20.19.25",
"resolved": "https://registry.npmjs.org/@types/node/-/node-20.19.25.tgz",
"integrity": "sha512-ZsJzA5thDQMSQO788d7IocwwQbI8B5OPzmqNvpf3NY/+MHDAS759Wo0gd2WQeXYt5AAAQjzcrTVC6SKCuYgoCQ==",
"devOptional": true,
"dependencies": {
"undici-types": "~6.21.0"
}
@@ -287,6 +382,15 @@
"integrity": "sha512-0ikrnug3/IyneSHqCBeslAhlK2aBfYek1fGo4bP4QnZPmiqSGRK+Oy7ZMisLWkesffJvQ1cqAcBnJC+8+nxIAg==",
"dev": true
},
"node_modules/@types/node-fetch": {
"version": "2.6.13",
"resolved": "https://registry.npmjs.org/@types/node-fetch/-/node-fetch-2.6.13.tgz",
"integrity": "sha512-QGpRVpzSaUs30JBSGPjOg4Uveu384erbHBoT1zeONvyCfwQxIkUshLAOqN/k9EjGviPRmWTTe6aH2qySWKTVSw==",
"dependencies": {
"@types/node": "*",
"form-data": "^4.0.4"
}
},
"node_modules/@types/pg": {
"version": "8.15.6",
"resolved": "https://registry.npmjs.org/@types/pg/-/pg-8.15.6.tgz",
@@ -340,6 +444,14 @@
"@types/node": "*"
}
},
"node_modules/@types/stream-buffers": {
"version": "3.0.8",
"resolved": "https://registry.npmjs.org/@types/stream-buffers/-/stream-buffers-3.0.8.tgz",
"integrity": "sha512-J+7VaHKNvlNPJPEJXX/fKa9DZtR/xPMwuIbe+yNOwp1YB+ApUOBv2aUpEoBJEi8nJgbgs1x8e73ttg0r1rSUdw==",
"dependencies": {
"@types/node": "*"
}
},
"node_modules/@types/uuid": {
"version": "9.0.8",
"resolved": "https://registry.npmjs.org/@types/uuid/-/uuid-9.0.8.tgz",
@@ -520,6 +632,78 @@
}
}
},
"node_modules/bare-fs": {
"version": "4.5.2",
"resolved": "https://registry.npmjs.org/bare-fs/-/bare-fs-4.5.2.tgz",
"integrity": "sha512-veTnRzkb6aPHOvSKIOy60KzURfBdUflr5VReI+NSaPL6xf+XLdONQgZgpYvUuZLVQ8dCqxpBAudaOM1+KpAUxw==",
"optional": true,
"dependencies": {
"bare-events": "^2.5.4",
"bare-path": "^3.0.0",
"bare-stream": "^2.6.4",
"bare-url": "^2.2.2",
"fast-fifo": "^1.3.2"
},
"engines": {
"bare": ">=1.16.0"
},
"peerDependencies": {
"bare-buffer": "*"
},
"peerDependenciesMeta": {
"bare-buffer": {
"optional": true
}
}
},
"node_modules/bare-os": {
"version": "3.6.2",
"resolved": "https://registry.npmjs.org/bare-os/-/bare-os-3.6.2.tgz",
"integrity": "sha512-T+V1+1srU2qYNBmJCXZkUY5vQ0B4FSlL3QDROnKQYOqeiQR8UbjNHlPa+TIbM4cuidiN9GaTaOZgSEgsvPbh5A==",
"optional": true,
"engines": {
"bare": ">=1.14.0"
}
},
"node_modules/bare-path": {
"version": "3.0.0",
"resolved": "https://registry.npmjs.org/bare-path/-/bare-path-3.0.0.tgz",
"integrity": "sha512-tyfW2cQcB5NN8Saijrhqn0Zh7AnFNsnczRcuWODH0eYAXBsJ5gVxAUuNr7tsHSC6IZ77cA0SitzT+s47kot8Mw==",
"optional": true,
"dependencies": {
"bare-os": "^3.0.1"
}
},
"node_modules/bare-stream": {
"version": "2.7.0",
"resolved": "https://registry.npmjs.org/bare-stream/-/bare-stream-2.7.0.tgz",
"integrity": "sha512-oyXQNicV1y8nc2aKffH+BUHFRXmx6VrPzlnaEvMhram0nPBrKcEdcyBg5r08D0i8VxngHFAiVyn1QKXpSG0B8A==",
"optional": true,
"dependencies": {
"streamx": "^2.21.0"
},
"peerDependencies": {
"bare-buffer": "*",
"bare-events": "*"
},
"peerDependenciesMeta": {
"bare-buffer": {
"optional": true
},
"bare-events": {
"optional": true
}
}
},
"node_modules/bare-url": {
"version": "2.3.2",
"resolved": "https://registry.npmjs.org/bare-url/-/bare-url-2.3.2.tgz",
"integrity": "sha512-ZMq4gd9ngV5aTMa5p9+UfY0b3skwhHELaDkhEHetMdX0LRkW9kzaym4oo/Eh+Ghm0CCDuMTsRIGM/ytUc1ZYmw==",
"optional": true,
"dependencies": {
"bare-path": "^3.0.0"
}
},
"node_modules/base64-js": {
"version": "1.5.1",
"resolved": "https://registry.npmjs.org/base64-js/-/base64-js-1.5.1.tgz",
@@ -2019,6 +2203,14 @@
"node": ">=16.0.0"
}
},
"node_modules/hpagent": {
"version": "1.2.0",
"resolved": "https://registry.npmjs.org/hpagent/-/hpagent-1.2.0.tgz",
"integrity": "sha512-A91dYTeIB6NoXG+PxTQpCCDDnfHsW9kc06Lvpu1TEe9gnd6ZFeiBoRO9JvzEv6xK7EX97/dUE8g/vBMTqTS3CA==",
"engines": {
"node": ">=14"
}
},
"node_modules/htmlparser2": {
"version": "10.0.0",
"resolved": "https://registry.npmjs.org/htmlparser2/-/htmlparser2-10.0.0.tgz",
@@ -2382,6 +2574,22 @@
"node": ">=0.10.0"
}
},
"node_modules/isomorphic-ws": {
"version": "5.0.0",
"resolved": "https://registry.npmjs.org/isomorphic-ws/-/isomorphic-ws-5.0.0.tgz",
"integrity": "sha512-muId7Zzn9ywDsyXgTIafTry2sV3nySZeUDe6YedVd1Hvuuep5AsIlqK+XefWpYTyJG5e503F2xIuT2lcU6rCSw==",
"peerDependencies": {
"ws": "*"
}
},
"node_modules/jose": {
"version": "6.1.3",
"resolved": "https://registry.npmjs.org/jose/-/jose-6.1.3.tgz",
"integrity": "sha512-0TpaTfihd4QMNwrz/ob2Bp7X04yuxJkjRGi4aKmOqwhov54i6u79oCv7T+C7lo70MKH6BesI3vscD1yb/yzKXQ==",
"funding": {
"url": "https://github.com/sponsors/panva"
}
},
"node_modules/js-tokens": {
"version": "4.0.0",
"resolved": "https://registry.npmjs.org/js-tokens/-/js-tokens-4.0.0.tgz",
@@ -2398,6 +2606,14 @@
"js-yaml": "bin/js-yaml.js"
}
},
"node_modules/jsep": {
"version": "1.4.0",
"resolved": "https://registry.npmjs.org/jsep/-/jsep-1.4.0.tgz",
"integrity": "sha512-B7qPcEVE3NVkmSJbaYxvv4cHkVW7DQsZz13pUMrfS8z8Q/BuShN+gcTXrUlPiGqM2/t/EEaI030bpxMqY8gMlw==",
"engines": {
"node": ">= 10.16.0"
}
},
"node_modules/json-parse-even-better-errors": {
"version": "2.3.1",
"resolved": "https://registry.npmjs.org/json-parse-even-better-errors/-/json-parse-even-better-errors-2.3.1.tgz",
@@ -2419,6 +2635,23 @@
"graceful-fs": "^4.1.6"
}
},
"node_modules/jsonpath-plus": {
"version": "10.3.0",
"resolved": "https://registry.npmjs.org/jsonpath-plus/-/jsonpath-plus-10.3.0.tgz",
"integrity": "sha512-8TNmfeTCk2Le33A3vRRwtuworG/L5RrgMvdjhKZxvyShO+mBu2fP50OWUjRLNtvw344DdDarFh9buFAZs5ujeA==",
"dependencies": {
"@jsep-plugin/assignment": "^1.3.0",
"@jsep-plugin/regex": "^1.0.4",
"jsep": "^1.4.0"
},
"bin": {
"jsonpath": "bin/jsonpath-cli.js",
"jsonpath-plus": "bin/jsonpath-cli.js"
},
"engines": {
"node": ">=18.0.0"
}
},
"node_modules/jsonwebtoken": {
"version": "9.0.2",
"resolved": "https://registry.npmjs.org/jsonwebtoken/-/jsonwebtoken-9.0.2.tgz",
@@ -2493,6 +2726,11 @@
"resolved": "https://registry.npmjs.org/lodash/-/lodash-4.17.21.tgz",
"integrity": "sha512-v2kDEe57lecTulaDIuNTPy3Ry4gLGJ6Z1O3vE1krgXZNrsQ+LFTGHVxVjcXPs17LhbZVGedAJv8XZ1tvj5FvSg=="
},
"node_modules/lodash.clonedeep": {
"version": "4.5.0",
"resolved": "https://registry.npmjs.org/lodash.clonedeep/-/lodash.clonedeep-4.5.0.tgz",
"integrity": "sha512-H5ZhCF25riFd9uB5UCkVKo61m3S/xZk1x4wA6yp/L3RFP6Z/eHH1ymQcGLo7J3GMPfm0V/7m1tryHuGVxpqEBQ=="
},
"node_modules/lodash.defaults": {
"version": "4.2.0",
"resolved": "https://registry.npmjs.org/lodash.defaults/-/lodash.defaults-4.2.0.tgz",
@@ -2942,6 +3180,14 @@
"url": "https://github.com/fb55/nth-check?sponsor=1"
}
},
"node_modules/oauth4webapi": {
"version": "3.8.3",
"resolved": "https://registry.npmjs.org/oauth4webapi/-/oauth4webapi-3.8.3.tgz",
"integrity": "sha512-pQ5BsX3QRTgnt5HxgHwgunIRaDXBdkT23tf8dfzmtTIL2LTpdmxgbpbBm0VgFWAIDlezQvQCTgnVIUmHupXHxw==",
"funding": {
"url": "https://github.com/sponsors/panva"
}
},
"node_modules/object-assign": {
"version": "4.1.1",
"resolved": "https://registry.npmjs.org/object-assign/-/object-assign-4.1.1.tgz",
@@ -2980,6 +3226,18 @@
"wrappy": "1"
}
},
"node_modules/openid-client": {
"version": "6.8.1",
"resolved": "https://registry.npmjs.org/openid-client/-/openid-client-6.8.1.tgz",
"integrity": "sha512-VoYT6enBo6Vj2j3Q5Ec0AezS+9YGzQo1f5Xc42lreMGlfP4ljiXPKVDvCADh+XHCV/bqPu/wWSiCVXbJKvrODw==",
"dependencies": {
"jose": "^6.1.0",
"oauth4webapi": "^3.8.2"
},
"funding": {
"url": "https://github.com/sponsors/panva"
}
},
"node_modules/pac-proxy-agent": {
"version": "7.2.0",
"resolved": "https://registry.npmjs.org/pac-proxy-agent/-/pac-proxy-agent-7.2.0.tgz",
@@ -3883,6 +4141,11 @@
"url": "https://github.com/privatenumber/resolve-pkg-maps?sponsor=1"
}
},
"node_modules/rfc4648": {
"version": "1.5.4",
"resolved": "https://registry.npmjs.org/rfc4648/-/rfc4648-1.5.4.tgz",
"integrity": "sha512-rRg/6Lb+IGfJqO05HZkN50UtY7K/JhxJag1kP23+zyMfrvoB0B7RWv06MbOzoc79RgCdNTiUaNsTT1AJZ7Z+cg=="
},
"node_modules/rimraf": {
"version": "3.0.2",
"resolved": "https://registry.npmjs.org/rimraf/-/rimraf-3.0.2.tgz",
@@ -4313,6 +4576,14 @@
"node": ">= 0.8"
}
},
"node_modules/stream-buffers": {
"version": "3.0.3",
"resolved": "https://registry.npmjs.org/stream-buffers/-/stream-buffers-3.0.3.tgz",
"integrity": "sha512-pqMqwQCso0PBJt2PQmDO0cFj0lyqmiwOMiMSkVtRokl7e+ZTRYgDHKnuZNbqjiJXgsg4nuqtD/zxuo9KqTp0Yw==",
"engines": {
"node": ">= 0.10.0"
}
},
"node_modules/streamx": {
"version": "2.23.0",
"resolved": "https://registry.npmjs.org/streamx/-/streamx-2.23.0.tgz",
@@ -4532,8 +4803,7 @@
"node_modules/undici-types": {
"version": "6.21.0",
"resolved": "https://registry.npmjs.org/undici-types/-/undici-types-6.21.0.tgz",
"integrity": "sha512-iwDZqg0QAGrg9Rav5H4n0M64c3mkR59cJ6wQp+7C4nI0gsmExaedaYLNO44eT4AtBBwjbTiGPMlt2Md0T9H9JQ==",
"devOptional": true
"integrity": "sha512-iwDZqg0QAGrg9Rav5H4n0M64c3mkR59cJ6wQp+7C4nI0gsmExaedaYLNO44eT4AtBBwjbTiGPMlt2Md0T9H9JQ=="
},
"node_modules/universalify": {
"version": "2.0.1",
@@ -4556,6 +4826,14 @@
"resolved": "https://registry.npmjs.org/urlpattern-polyfill/-/urlpattern-polyfill-10.0.0.tgz",
"integrity": "sha512-H/A06tKD7sS1O1X2SshBVeA5FLycRpjqiBeqGKmBwBDBy28EnRjORxTNe269KSSr5un5qyWi1iL61wLxpd+ZOg=="
},
"node_modules/user-agents": {
"version": "1.1.669",
"resolved": "https://registry.npmjs.org/user-agents/-/user-agents-1.1.669.tgz",
"integrity": "sha512-pbIzG+AOqCaIpySKJ4IAm1l0VyE4jMnK4y1thV8lm8PYxI+7X5uWcppOK7zY79TCKKTAnJH3/4gaVIZHsjrmJA==",
"dependencies": {
"lodash.clonedeep": "^4.5.0"
}
},
"node_modules/util": {
"version": "0.12.5",
"resolved": "https://registry.npmjs.org/util/-/util-0.12.5.tgz",


@@ -1,13 +1,14 @@
{
"name": "dutchie-menus-backend",
"version": "1.5.1",
"version": "1.6.0",
"lockfileVersion": 3,
"requires": true,
"packages": {
"": {
"name": "dutchie-menus-backend",
"version": "1.5.1",
"version": "1.6.0",
"dependencies": {
"@kubernetes/client-node": "^1.4.0",
"@types/bcryptjs": "^3.0.0",
"axios": "^1.6.2",
"bcrypt": "^5.1.1",
@@ -34,6 +35,7 @@
"puppeteer-extra-plugin-stealth": "^2.11.2",
"sharp": "^0.32.0",
"socks-proxy-agent": "^8.0.2",
"user-agents": "^1.1.669",
"uuid": "^9.0.1",
"zod": "^3.22.4"
},
@@ -492,6 +494,97 @@
"resolved": "https://registry.npmjs.org/@ioredis/commands/-/commands-1.4.0.tgz",
"integrity": "sha512-aFT2yemJJo+TZCmieA7qnYGQooOS7QfNmYrzGtsYd3g9j5iDP8AimYYAesf79ohjbLG12XxC4nG5DyEnC88AsQ=="
},
"node_modules/@jsep-plugin/assignment": {
"version": "1.3.0",
"resolved": "https://registry.npmjs.org/@jsep-plugin/assignment/-/assignment-1.3.0.tgz",
"integrity": "sha512-VVgV+CXrhbMI3aSusQyclHkenWSAm95WaiKrMxRFam3JSUiIaQjoMIw2sEs/OX4XifnqeQUN4DYbJjlA8EfktQ==",
"engines": {
"node": ">= 10.16.0"
},
"peerDependencies": {
"jsep": "^0.4.0||^1.0.0"
}
},
"node_modules/@jsep-plugin/regex": {
"version": "1.0.4",
"resolved": "https://registry.npmjs.org/@jsep-plugin/regex/-/regex-1.0.4.tgz",
"integrity": "sha512-q7qL4Mgjs1vByCaTnDFcBnV9HS7GVPJX5vyVoCgZHNSC9rjwIlmbXG5sUuorR5ndfHAIlJ8pVStxvjXHbNvtUg==",
"engines": {
"node": ">= 10.16.0"
},
"peerDependencies": {
"jsep": "^0.4.0||^1.0.0"
}
},
"node_modules/@kubernetes/client-node": {
"version": "1.4.0",
"resolved": "https://registry.npmjs.org/@kubernetes/client-node/-/client-node-1.4.0.tgz",
"integrity": "sha512-Zge3YvF7DJi264dU1b3wb/GmzR99JhUpqTvp+VGHfwZT+g7EOOYNScDJNZwXy9cszyIGPIs0VHr+kk8e95qqrA==",
"dependencies": {
"@types/js-yaml": "^4.0.1",
"@types/node": "^24.0.0",
"@types/node-fetch": "^2.6.13",
"@types/stream-buffers": "^3.0.3",
"form-data": "^4.0.0",
"hpagent": "^1.2.0",
"isomorphic-ws": "^5.0.0",
"js-yaml": "^4.1.0",
"jsonpath-plus": "^10.3.0",
"node-fetch": "^2.7.0",
"openid-client": "^6.1.3",
"rfc4648": "^1.3.0",
"socks-proxy-agent": "^8.0.4",
"stream-buffers": "^3.0.2",
"tar-fs": "^3.0.9",
"ws": "^8.18.2"
}
},
"node_modules/@kubernetes/client-node/node_modules/@types/node": {
"version": "24.10.3",
"resolved": "https://registry.npmjs.org/@types/node/-/node-24.10.3.tgz",
"integrity": "sha512-gqkrWUsS8hcm0r44yn7/xZeV1ERva/nLgrLxFRUGb7aoNMIJfZJ3AC261zDQuOAKC7MiXai1WCpYc48jAHoShQ==",
"dependencies": {
"undici-types": "~7.16.0"
}
},
"node_modules/@kubernetes/client-node/node_modules/tar-fs": {
"version": "3.1.1",
"resolved": "https://registry.npmjs.org/tar-fs/-/tar-fs-3.1.1.tgz",
"integrity": "sha512-LZA0oaPOc2fVo82Txf3gw+AkEd38szODlptMYejQUhndHMLQ9M059uXR+AfS7DNo0NpINvSqDsvyaCrBVkptWg==",
"dependencies": {
"pump": "^3.0.0",
"tar-stream": "^3.1.5"
},
"optionalDependencies": {
"bare-fs": "^4.0.1",
"bare-path": "^3.0.0"
}
},
"node_modules/@kubernetes/client-node/node_modules/undici-types": {
"version": "7.16.0",
"resolved": "https://registry.npmjs.org/undici-types/-/undici-types-7.16.0.tgz",
"integrity": "sha512-Zz+aZWSj8LE6zoxD+xrjh4VfkIG8Ya6LvYkZqtUQGJPZjYl53ypCaUwWqo7eI0x66KBGeRo+mlBEkMSeSZ38Nw=="
},
"node_modules/@kubernetes/client-node/node_modules/ws": {
"version": "8.18.3",
"resolved": "https://registry.npmjs.org/ws/-/ws-8.18.3.tgz",
"integrity": "sha512-PEIGCY5tSlUt50cqyMXfCzX+oOPqN0vuGqWzbcJ2xvnkzkq46oOpz7dQaTDBdfICb4N14+GARUDw2XV2N4tvzg==",
"engines": {
"node": ">=10.0.0"
},
"peerDependencies": {
"bufferutil": "^4.0.1",
"utf-8-validate": ">=5.0.2"
},
"peerDependenciesMeta": {
"bufferutil": {
"optional": true
},
"utf-8-validate": {
"optional": true
}
}
},
"node_modules/@mapbox/node-pre-gyp": {
"version": "1.0.11",
"resolved": "https://registry.npmjs.org/@mapbox/node-pre-gyp/-/node-pre-gyp-1.0.11.tgz",
@@ -757,6 +850,11 @@
"integrity": "sha512-r8Tayk8HJnX0FztbZN7oVqGccWgw98T/0neJphO91KkmOzug1KkofZURD4UaD5uH8AqcFLfdPErnBod0u71/qg==",
"dev": true
},
"node_modules/@types/js-yaml": {
"version": "4.0.9",
"resolved": "https://registry.npmjs.org/@types/js-yaml/-/js-yaml-4.0.9.tgz",
"integrity": "sha512-k4MGaQl5TGo/iipqb2UDG2UwjXziSWkh0uysQelTlJpX1qGlpUZYm8PnO4DxG1qBomtJUdYJ6qR6xdIah10JLg=="
},
"node_modules/@types/jsonwebtoken": {
"version": "9.0.10",
"resolved": "https://registry.npmjs.org/@types/jsonwebtoken/-/jsonwebtoken-9.0.10.tgz",
@@ -782,7 +880,6 @@
"version": "20.19.25",
"resolved": "https://registry.npmjs.org/@types/node/-/node-20.19.25.tgz",
"integrity": "sha512-ZsJzA5thDQMSQO788d7IocwwQbI8B5OPzmqNvpf3NY/+MHDAS759Wo0gd2WQeXYt5AAAQjzcrTVC6SKCuYgoCQ==",
"devOptional": true,
"dependencies": {
"undici-types": "~6.21.0"
}
@@ -793,6 +890,15 @@
"integrity": "sha512-0ikrnug3/IyneSHqCBeslAhlK2aBfYek1fGo4bP4QnZPmiqSGRK+Oy7ZMisLWkesffJvQ1cqAcBnJC+8+nxIAg==",
"dev": true
},
"node_modules/@types/node-fetch": {
"version": "2.6.13",
"resolved": "https://registry.npmjs.org/@types/node-fetch/-/node-fetch-2.6.13.tgz",
"integrity": "sha512-QGpRVpzSaUs30JBSGPjOg4Uveu384erbHBoT1zeONvyCfwQxIkUshLAOqN/k9EjGviPRmWTTe6aH2qySWKTVSw==",
"dependencies": {
"@types/node": "*",
"form-data": "^4.0.4"
}
},
"node_modules/@types/pg": {
"version": "8.15.6",
"resolved": "https://registry.npmjs.org/@types/pg/-/pg-8.15.6.tgz",
@@ -846,6 +952,14 @@
"@types/node": "*"
}
},
"node_modules/@types/stream-buffers": {
"version": "3.0.8",
"resolved": "https://registry.npmjs.org/@types/stream-buffers/-/stream-buffers-3.0.8.tgz",
"integrity": "sha512-J+7VaHKNvlNPJPEJXX/fKa9DZtR/xPMwuIbe+yNOwp1YB+ApUOBv2aUpEoBJEi8nJgbgs1x8e73ttg0r1rSUdw==",
"dependencies": {
"@types/node": "*"
}
},
"node_modules/@types/uuid": {
"version": "9.0.8",
"resolved": "https://registry.npmjs.org/@types/uuid/-/uuid-9.0.8.tgz",
@@ -1026,6 +1140,78 @@
}
}
},
"node_modules/bare-fs": {
"version": "4.5.2",
"resolved": "https://registry.npmjs.org/bare-fs/-/bare-fs-4.5.2.tgz",
"integrity": "sha512-veTnRzkb6aPHOvSKIOy60KzURfBdUflr5VReI+NSaPL6xf+XLdONQgZgpYvUuZLVQ8dCqxpBAudaOM1+KpAUxw==",
"optional": true,
"dependencies": {
"bare-events": "^2.5.4",
"bare-path": "^3.0.0",
"bare-stream": "^2.6.4",
"bare-url": "^2.2.2",
"fast-fifo": "^1.3.2"
},
"engines": {
"bare": ">=1.16.0"
},
"peerDependencies": {
"bare-buffer": "*"
},
"peerDependenciesMeta": {
"bare-buffer": {
"optional": true
}
}
},
"node_modules/bare-os": {
"version": "3.6.2",
"resolved": "https://registry.npmjs.org/bare-os/-/bare-os-3.6.2.tgz",
"integrity": "sha512-T+V1+1srU2qYNBmJCXZkUY5vQ0B4FSlL3QDROnKQYOqeiQR8UbjNHlPa+TIbM4cuidiN9GaTaOZgSEgsvPbh5A==",
"optional": true,
"engines": {
"bare": ">=1.14.0"
}
},
"node_modules/bare-path": {
"version": "3.0.0",
"resolved": "https://registry.npmjs.org/bare-path/-/bare-path-3.0.0.tgz",
"integrity": "sha512-tyfW2cQcB5NN8Saijrhqn0Zh7AnFNsnczRcuWODH0eYAXBsJ5gVxAUuNr7tsHSC6IZ77cA0SitzT+s47kot8Mw==",
"optional": true,
"dependencies": {
"bare-os": "^3.0.1"
}
},
"node_modules/bare-stream": {
"version": "2.7.0",
"resolved": "https://registry.npmjs.org/bare-stream/-/bare-stream-2.7.0.tgz",
"integrity": "sha512-oyXQNicV1y8nc2aKffH+BUHFRXmx6VrPzlnaEvMhram0nPBrKcEdcyBg5r08D0i8VxngHFAiVyn1QKXpSG0B8A==",
"optional": true,
"dependencies": {
"streamx": "^2.21.0"
},
"peerDependencies": {
"bare-buffer": "*",
"bare-events": "*"
},
"peerDependenciesMeta": {
"bare-buffer": {
"optional": true
},
"bare-events": {
"optional": true
}
}
},
"node_modules/bare-url": {
"version": "2.3.2",
"resolved": "https://registry.npmjs.org/bare-url/-/bare-url-2.3.2.tgz",
"integrity": "sha512-ZMq4gd9ngV5aTMa5p9+UfY0b3skwhHELaDkhEHetMdX0LRkW9kzaym4oo/Eh+Ghm0CCDuMTsRIGM/ytUc1ZYmw==",
"optional": true,
"dependencies": {
"bare-path": "^3.0.0"
}
},
"node_modules/base64-js": {
"version": "1.5.1",
"resolved": "https://registry.npmjs.org/base64-js/-/base64-js-1.5.1.tgz",
@@ -2539,6 +2725,14 @@
"node": ">=16.0.0"
}
},
"node_modules/hpagent": {
"version": "1.2.0",
"resolved": "https://registry.npmjs.org/hpagent/-/hpagent-1.2.0.tgz",
"integrity": "sha512-A91dYTeIB6NoXG+PxTQpCCDDnfHsW9kc06Lvpu1TEe9gnd6ZFeiBoRO9JvzEv6xK7EX97/dUE8g/vBMTqTS3CA==",
"engines": {
"node": ">=14"
}
},
"node_modules/htmlparser2": {
"version": "10.0.0",
"resolved": "https://registry.npmjs.org/htmlparser2/-/htmlparser2-10.0.0.tgz",
@@ -2902,6 +3096,22 @@
"node": ">=0.10.0"
}
},
"node_modules/isomorphic-ws": {
"version": "5.0.0",
"resolved": "https://registry.npmjs.org/isomorphic-ws/-/isomorphic-ws-5.0.0.tgz",
"integrity": "sha512-muId7Zzn9ywDsyXgTIafTry2sV3nySZeUDe6YedVd1Hvuuep5AsIlqK+XefWpYTyJG5e503F2xIuT2lcU6rCSw==",
"peerDependencies": {
"ws": "*"
}
},
"node_modules/jose": {
"version": "6.1.3",
"resolved": "https://registry.npmjs.org/jose/-/jose-6.1.3.tgz",
"integrity": "sha512-0TpaTfihd4QMNwrz/ob2Bp7X04yuxJkjRGi4aKmOqwhov54i6u79oCv7T+C7lo70MKH6BesI3vscD1yb/yzKXQ==",
"funding": {
"url": "https://github.com/sponsors/panva"
}
},
"node_modules/js-tokens": {
"version": "4.0.0",
"resolved": "https://registry.npmjs.org/js-tokens/-/js-tokens-4.0.0.tgz",
@@ -2918,6 +3128,14 @@
"js-yaml": "bin/js-yaml.js"
}
},
"node_modules/jsep": {
"version": "1.4.0",
"resolved": "https://registry.npmjs.org/jsep/-/jsep-1.4.0.tgz",
"integrity": "sha512-B7qPcEVE3NVkmSJbaYxvv4cHkVW7DQsZz13pUMrfS8z8Q/BuShN+gcTXrUlPiGqM2/t/EEaI030bpxMqY8gMlw==",
"engines": {
"node": ">= 10.16.0"
}
},
"node_modules/json-parse-even-better-errors": {
"version": "2.3.1",
"resolved": "https://registry.npmjs.org/json-parse-even-better-errors/-/json-parse-even-better-errors-2.3.1.tgz",
@@ -2939,6 +3157,23 @@
"graceful-fs": "^4.1.6"
}
},
"node_modules/jsonpath-plus": {
"version": "10.3.0",
"resolved": "https://registry.npmjs.org/jsonpath-plus/-/jsonpath-plus-10.3.0.tgz",
"integrity": "sha512-8TNmfeTCk2Le33A3vRRwtuworG/L5RrgMvdjhKZxvyShO+mBu2fP50OWUjRLNtvw344DdDarFh9buFAZs5ujeA==",
"dependencies": {
"@jsep-plugin/assignment": "^1.3.0",
"@jsep-plugin/regex": "^1.0.4",
"jsep": "^1.4.0"
},
"bin": {
"jsonpath": "bin/jsonpath-cli.js",
"jsonpath-plus": "bin/jsonpath-cli.js"
},
"engines": {
"node": ">=18.0.0"
}
},
"node_modules/jsonwebtoken": {
"version": "9.0.2",
"resolved": "https://registry.npmjs.org/jsonwebtoken/-/jsonwebtoken-9.0.2.tgz",
@@ -3013,6 +3248,11 @@
"resolved": "https://registry.npmjs.org/lodash/-/lodash-4.17.21.tgz",
"integrity": "sha512-v2kDEe57lecTulaDIuNTPy3Ry4gLGJ6Z1O3vE1krgXZNrsQ+LFTGHVxVjcXPs17LhbZVGedAJv8XZ1tvj5FvSg=="
},
"node_modules/lodash.clonedeep": {
"version": "4.5.0",
"resolved": "https://registry.npmjs.org/lodash.clonedeep/-/lodash.clonedeep-4.5.0.tgz",
"integrity": "sha512-H5ZhCF25riFd9uB5UCkVKo61m3S/xZk1x4wA6yp/L3RFP6Z/eHH1ymQcGLo7J3GMPfm0V/7m1tryHuGVxpqEBQ=="
},
"node_modules/lodash.defaults": {
"version": "4.2.0",
"resolved": "https://registry.npmjs.org/lodash.defaults/-/lodash.defaults-4.2.0.tgz",
@@ -3462,6 +3702,14 @@
"url": "https://github.com/fb55/nth-check?sponsor=1"
}
},
"node_modules/oauth4webapi": {
"version": "3.8.3",
"resolved": "https://registry.npmjs.org/oauth4webapi/-/oauth4webapi-3.8.3.tgz",
"integrity": "sha512-pQ5BsX3QRTgnt5HxgHwgunIRaDXBdkT23tf8dfzmtTIL2LTpdmxgbpbBm0VgFWAIDlezQvQCTgnVIUmHupXHxw==",
"funding": {
"url": "https://github.com/sponsors/panva"
}
},
"node_modules/object-assign": {
"version": "4.1.1",
"resolved": "https://registry.npmjs.org/object-assign/-/object-assign-4.1.1.tgz",
@@ -3500,6 +3748,18 @@
"wrappy": "1"
}
},
"node_modules/openid-client": {
"version": "6.8.1",
"resolved": "https://registry.npmjs.org/openid-client/-/openid-client-6.8.1.tgz",
"integrity": "sha512-VoYT6enBo6Vj2j3Q5Ec0AezS+9YGzQo1f5Xc42lreMGlfP4ljiXPKVDvCADh+XHCV/bqPu/wWSiCVXbJKvrODw==",
"dependencies": {
"jose": "^6.1.0",
"oauth4webapi": "^3.8.2"
},
"funding": {
"url": "https://github.com/sponsors/panva"
}
},
"node_modules/pac-proxy-agent": {
"version": "7.2.0",
"resolved": "https://registry.npmjs.org/pac-proxy-agent/-/pac-proxy-agent-7.2.0.tgz",
@@ -4416,6 +4676,11 @@
"url": "https://github.com/privatenumber/resolve-pkg-maps?sponsor=1"
}
},
"node_modules/rfc4648": {
"version": "1.5.4",
"resolved": "https://registry.npmjs.org/rfc4648/-/rfc4648-1.5.4.tgz",
"integrity": "sha512-rRg/6Lb+IGfJqO05HZkN50UtY7K/JhxJag1kP23+zyMfrvoB0B7RWv06MbOzoc79RgCdNTiUaNsTT1AJZ7Z+cg=="
},
"node_modules/rimraf": {
"version": "3.0.2",
"resolved": "https://registry.npmjs.org/rimraf/-/rimraf-3.0.2.tgz",
@@ -4846,6 +5111,14 @@
"node": ">= 0.8"
}
},
"node_modules/stream-buffers": {
"version": "3.0.3",
"resolved": "https://registry.npmjs.org/stream-buffers/-/stream-buffers-3.0.3.tgz",
"integrity": "sha512-pqMqwQCso0PBJt2PQmDO0cFj0lyqmiwOMiMSkVtRokl7e+ZTRYgDHKnuZNbqjiJXgsg4nuqtD/zxuo9KqTp0Yw==",
"engines": {
"node": ">= 0.10.0"
}
},
"node_modules/streamx": {
"version": "2.23.0",
"resolved": "https://registry.npmjs.org/streamx/-/streamx-2.23.0.tgz",
@@ -5065,8 +5338,7 @@
"node_modules/undici-types": {
"version": "6.21.0",
"resolved": "https://registry.npmjs.org/undici-types/-/undici-types-6.21.0.tgz",
"integrity": "sha512-iwDZqg0QAGrg9Rav5H4n0M64c3mkR59cJ6wQp+7C4nI0gsmExaedaYLNO44eT4AtBBwjbTiGPMlt2Md0T9H9JQ==",
"devOptional": true
"integrity": "sha512-iwDZqg0QAGrg9Rav5H4n0M64c3mkR59cJ6wQp+7C4nI0gsmExaedaYLNO44eT4AtBBwjbTiGPMlt2Md0T9H9JQ=="
},
"node_modules/universalify": {
"version": "2.0.1",
@@ -5089,6 +5361,14 @@
"resolved": "https://registry.npmjs.org/urlpattern-polyfill/-/urlpattern-polyfill-10.0.0.tgz",
"integrity": "sha512-H/A06tKD7sS1O1X2SshBVeA5FLycRpjqiBeqGKmBwBDBy28EnRjORxTNe269KSSr5un5qyWi1iL61wLxpd+ZOg=="
},
"node_modules/user-agents": {
"version": "1.1.669",
"resolved": "https://registry.npmjs.org/user-agents/-/user-agents-1.1.669.tgz",
"integrity": "sha512-pbIzG+AOqCaIpySKJ4IAm1l0VyE4jMnK4y1thV8lm8PYxI+7X5uWcppOK7zY79TCKKTAnJH3/4gaVIZHsjrmJA==",
"dependencies": {
"lodash.clonedeep": "^4.5.0"
}
},
"node_modules/util": {
"version": "0.12.5",
"resolved": "https://registry.npmjs.org/util/-/util-0.12.5.tgz",


@@ -22,6 +22,7 @@
"seed:dt:cities:bulk": "tsx src/scripts/seed-dt-cities-bulk.ts"
},
"dependencies": {
"@kubernetes/client-node": "^1.4.0",
"@types/bcryptjs": "^3.0.0",
"axios": "^1.6.2",
"bcrypt": "^5.1.1",
@@ -48,6 +49,7 @@
"puppeteer-extra-plugin-stealth": "^2.11.2",
"sharp": "^0.32.0",
"socks-proxy-agent": "^8.0.2",
"user-agents": "^1.1.669",
"uuid": "^9.0.1",
"zod": "^3.22.4"
},

Binary file not shown.


@@ -0,0 +1 @@
cannaiq-menus-1.6.0.zip


@@ -153,22 +153,10 @@ export async function authenticateUser(email: string, password: string): Promise
}
export async function authMiddleware(req: AuthRequest, res: Response, next: NextFunction) {
// Allow trusted origins/IPs to bypass auth (internal services, same-origin)
if (isTrustedRequest(req)) {
req.user = {
id: 0,
email: 'internal@system',
role: 'internal'
};
return next();
}
const authHeader = req.headers.authorization;
if (!authHeader || !authHeader.startsWith('Bearer ')) {
return res.status(401).json({ error: 'No token provided' });
}
// If a Bearer token is provided, always try to use it first (logged-in user)
if (authHeader && authHeader.startsWith('Bearer ')) {
const token = authHeader.substring(7);
// Try JWT first
@@ -187,56 +175,44 @@ export async function authMiddleware(req: AuthRequest, res: Response, next: Next
WHERE token = $1
`, [token]);
if (result.rows.length === 0) {
if (result.rows.length > 0) {
const apiToken = result.rows[0];
if (!apiToken.active) {
return res.status(401).json({ error: 'API token is inactive' });
}
if (apiToken.expires_at && new Date(apiToken.expires_at) < new Date()) {
return res.status(401).json({ error: 'API token has expired' });
}
req.user = {
id: 0,
email: `api:${apiToken.name}`,
role: 'api_token'
};
req.apiToken = apiToken;
return next();
}
} catch (err) {
console.error('API token lookup error:', err);
}
// Token provided but invalid
return res.status(401).json({ error: 'Invalid token' });
}
const apiToken = result.rows[0];
// Check if token is active
if (!apiToken.active) {
return res.status(401).json({ error: 'Token is disabled' });
}
// Check if token is expired
if (apiToken.expires_at && new Date(apiToken.expires_at) < new Date()) {
return res.status(401).json({ error: 'Token has expired' });
}
// Check allowed endpoints
if (apiToken.allowed_endpoints && apiToken.allowed_endpoints.length > 0) {
const isAllowed = apiToken.allowed_endpoints.some((pattern: string) => {
// Simple wildcard matching
const regex = new RegExp('^' + pattern.replace('*', '.*') + '$');
return regex.test(req.path);
});
if (!isAllowed) {
return res.status(403).json({ error: 'Endpoint not allowed for this token' });
}
}
// Set API token on request for tracking
req.apiToken = {
id: apiToken.id,
name: apiToken.name,
rate_limit: apiToken.rate_limit
};
// Set a generic user for compatibility with existing code
// No token provided - check trusted origins for API access (WordPress, etc.)
if (isTrustedRequest(req)) {
req.user = {
id: apiToken.id,
email: `api-token-${apiToken.id}@system`,
role: 'api'
id: 0,
email: 'internal@system',
role: 'internal'
};
next();
} catch (error) {
console.error('Error verifying API token:', error);
return res.status(500).json({ error: 'Authentication failed' });
return next();
}
return res.status(401).json({ error: 'No token provided' });
}
/**
* Require specific role(s) to access endpoint.
*


@@ -172,6 +172,9 @@ export async function runFullDiscovery(
console.log(`Errors: ${totalErrors}`);
}
// Per TASK_WORKFLOW_2024-12-10.md: Track new dispensary IDs for task chaining
let newDispensaryIds: number[] = [];
// Step 4: Auto-validate and promote discovered locations
if (!dryRun && totalLocationsUpserted > 0) {
console.log('\n[Discovery] Step 4: Auto-promoting discovered locations...');
@@ -180,6 +183,13 @@ export async function runFullDiscovery(
console.log(` Created: ${promotionResult.created} new dispensaries`);
console.log(` Updated: ${promotionResult.updated} existing dispensaries`);
console.log(` Rejected: ${promotionResult.rejected} (validation failed)`);
// Per TASK_WORKFLOW_2024-12-10.md: Capture new IDs for task chaining
newDispensaryIds = promotionResult.newDispensaryIds;
if (newDispensaryIds.length > 0) {
console.log(` New store IDs for crawl: [${newDispensaryIds.join(', ')}]`);
}
if (promotionResult.rejectedRecords.length > 0) {
console.log(` Rejection reasons:`);
promotionResult.rejectedRecords.slice(0, 5).forEach(r => {
@@ -214,6 +224,8 @@ export async function runFullDiscovery(
totalLocationsFound,
totalLocationsUpserted,
durationMs,
// Per TASK_WORKFLOW_2024-12-10.md: Return new IDs for task chaining
newDispensaryIds,
};
}


@@ -127,6 +127,8 @@ export interface PromotionSummary {
errors: string[];
}>;
durationMs: number;
// Per TASK_WORKFLOW_2024-12-10.md: Track new dispensary IDs for task chaining
newDispensaryIds: number[];
}
/**
@@ -469,6 +471,8 @@ export async function promoteDiscoveredLocations(
const results: PromotionResult[] = [];
const rejectedRecords: PromotionSummary['rejectedRecords'] = [];
// Per TASK_WORKFLOW_2024-12-10.md: Track new dispensary IDs for task chaining
const newDispensaryIds: number[] = [];
let created = 0;
let updated = 0;
let skipped = 0;
@@ -525,6 +529,8 @@ export async function promoteDiscoveredLocations(
if (promotionResult.action === 'created') {
created++;
// Per TASK_WORKFLOW_2024-12-10.md: Track new IDs for task chaining
newDispensaryIds.push(promotionResult.dispensaryId);
} else {
updated++;
}
@@ -548,6 +554,8 @@ export async function promoteDiscoveredLocations(
results,
rejectedRecords,
durationMs: Date.now() - startTime,
// Per TASK_WORKFLOW_2024-12-10.md: Return new IDs for task chaining
newDispensaryIds,
};
}


@@ -211,6 +211,8 @@ export interface FullDiscoveryResult {
totalLocationsFound: number;
totalLocationsUpserted: number;
durationMs: number;
// Per TASK_WORKFLOW_2024-12-10.md: Track new dispensary IDs for task chaining
newDispensaryIds?: number[];
}
// ============================================================


@@ -6,6 +6,8 @@ import { initializeMinio, isMinioEnabled } from './utils/minio';
import { initializeImageStorage } from './utils/image-storage';
import { logger } from './services/logger';
import { cleanupOrphanedJobs } from './services/proxyTestQueue';
// Per TASK_WORKFLOW_2024-12-10.md: Database-driven task scheduler
import { taskScheduler } from './services/task-scheduler';
import { runAutoMigrations } from './db/auto-migrate';
import { getPool } from './db/pool';
import healthRoutes from './routes/health';
@@ -142,6 +144,8 @@ import seoRoutes from './routes/seo';
import priceAnalyticsRoutes from './routes/price-analytics';
import tasksRoutes from './routes/tasks';
import workerRegistryRoutes from './routes/worker-registry';
// Per TASK_WORKFLOW_2024-12-10.md: Raw payload access API
import payloadsRoutes from './routes/payloads';
// Mark requests from trusted domains (cannaiq.co, findagram.co, findadispo.com)
// These domains can access the API without authentication
@@ -222,6 +226,10 @@ console.log('[Tasks] Routes registered at /api/tasks');
app.use('/api/worker-registry', workerRegistryRoutes);
console.log('[WorkerRegistry] Routes registered at /api/worker-registry');
// Per TASK_WORKFLOW_2024-12-10.md: Raw payload access API
app.use('/api/payloads', payloadsRoutes);
console.log('[Payloads] Routes registered at /api/payloads');
// Phase 3: Analytics V2 - Enhanced analytics with rec/med state segmentation
try {
const analyticsV2Router = createAnalyticsV2Router(getPool());
@@ -326,6 +334,17 @@ async function startServer() {
// Clean up any orphaned proxy test jobs from previous server runs
await cleanupOrphanedJobs();
// Per TASK_WORKFLOW_2024-12-10.md: Start database-driven task scheduler
// This replaces node-cron - schedules are stored in DB and survive restarts
// Uses SELECT FOR UPDATE SKIP LOCKED for multi-replica safety
try {
await taskScheduler.start();
logger.info('system', 'Task scheduler started');
} catch (err: any) {
// Non-fatal - scheduler can recover on next poll
logger.warn('system', `Task scheduler startup warning: ${err.message}`);
}
app.listen(PORT, () => {
logger.info('system', `Server running on port ${PORT}`);
console.log(`🚀 Server running on port ${PORT}`);


@@ -702,12 +702,10 @@ export class StateQueryService {
async getNationalSummary(): Promise<NationalSummary> {
const stateMetrics = await this.getAllStateMetrics();
// Get all states count and aggregate metrics
const result = await this.pool.query(`
SELECT
COUNT(DISTINCT s.code) AS total_states,
COUNT(DISTINCT CASE WHEN EXISTS (
SELECT 1 FROM dispensaries d WHERE d.state = s.code AND d.menu_type IS NOT NULL
) THEN s.code END) AS active_states,
(SELECT COUNT(*) FROM dispensaries WHERE state IS NOT NULL) AS total_stores,
(SELECT COUNT(*) FROM store_products sp
JOIN dispensaries d ON sp.dispensary_id = d.id
@@ -725,7 +723,7 @@ export class StateQueryService {
return {
totalStates: parseInt(data.total_states),
activeStates: parseInt(data.active_states),
activeStates: parseInt(data.total_states), // Same as totalStates - all states shown
totalStores: parseInt(data.total_stores),
totalProducts: parseInt(data.total_products),
totalBrands: parseInt(data.total_brands),


@@ -5,22 +5,35 @@
*
* DO NOT MODIFY THIS FILE WITHOUT EXPLICIT AUTHORIZATION.
*
* This is the canonical HTTP client for all Dutchie communication.
* All Dutchie workers (Alice, Bella, etc.) MUST use this client.
* Updated: 2025-12-10 per workflow-12102025.md
*
* KEY BEHAVIORS (per workflow-12102025.md):
* 1. startSession() gets identity from PROXY LOCATION, not task params
* 2. On 403: immediately get new IP + new fingerprint, then retry
* 3. After 3 consecutive 403s on same proxy → disable it (burned)
* 4. Language is always English (en-US)
*
* IMPLEMENTATION:
* - Uses curl via child_process.execSync (bypasses TLS fingerprinting)
* - NO Puppeteer, NO axios, NO fetch
* - Fingerprint rotation on 403
* - Uses intoli/user-agents via CrawlRotator for realistic fingerprints
* - Residential IP compatible
*
* USAGE:
* import { curlPost, curlGet, executeGraphQL } from '@dutchie/client';
* import { curlPost, curlGet, executeGraphQL, startSession } from '@dutchie/client';
*
* ============================================================
*/
import { execSync } from 'child_process';
import {
buildOrderedHeaders,
buildRefererFromMenuUrl,
getCurlBinary,
isCurlImpersonateAvailable,
HeaderContext,
BrowserType,
} from '../../services/http-fingerprint';
// ============================================================
// TYPES
@@ -32,6 +45,8 @@ export interface CurlResponse {
error?: string;
}
// Per workflow-12102025.md: fingerprint comes from CrawlRotator's BrowserFingerprint
// We keep a simplified interface here for header building
export interface Fingerprint {
userAgent: string;
acceptLanguage: string;
@@ -57,15 +72,13 @@ export const DUTCHIE_CONFIG = {
// ============================================================
// PROXY SUPPORT
// ============================================================
// Integrates with the CrawlRotator system from proxy-rotator.ts
// On 403 errors:
// 1. Record failure on current proxy
// 2. Rotate to next proxy
// 3. Retry with new proxy
// Per workflow-12102025.md:
// - On 403: recordBlock() → increment consecutive_403_count
// - After 3 consecutive 403s → proxy disabled
// - Immediately rotate to new IP + new fingerprint on 403
// ============================================================
import type { CrawlRotator, Proxy } from '../../services/crawl-rotator';
import type { CrawlRotator, BrowserFingerprint } from '../../services/crawl-rotator';
let currentProxy: string | null = null;
let crawlRotator: CrawlRotator | null = null;
@@ -92,13 +105,12 @@ export function getProxy(): string | null {
/**
* Set CrawlRotator for proxy rotation on 403s
* This enables automatic proxy rotation when blocked
* Per workflow-12102025.md: enables automatic rotation when blocked
*/
export function setCrawlRotator(rotator: CrawlRotator | null): void {
crawlRotator = rotator;
if (rotator) {
console.log('[Dutchie Client] CrawlRotator attached - proxy rotation enabled');
// Set initial proxy from rotator
const proxy = rotator.proxy.getCurrent();
if (proxy) {
currentProxy = rotator.proxy.getProxyUrl(proxy);
@@ -115,30 +127,41 @@ export function getCrawlRotator(): CrawlRotator | null {
}
/**
* Rotate to next proxy (called on 403)
* Handle 403 block - per workflow-12102025.md:
* 1. Record block on current proxy (increments consecutive_403_count)
* 2. Immediately rotate to new proxy (new IP)
* 3. Rotate fingerprint
* Returns false if no more proxies available
*/
async function rotateProxyOn403(error?: string): Promise<boolean> {
async function handle403Block(): Promise<boolean> {
if (!crawlRotator) {
console.warn('[Dutchie Client] No CrawlRotator - cannot handle 403');
return false;
}
// Record failure on current proxy
await crawlRotator.recordFailure(error || '403 Forbidden');
// Per workflow-12102025.md: record block (tracks consecutive 403s)
const wasDisabled = await crawlRotator.recordBlock();
if (wasDisabled) {
console.log('[Dutchie Client] Current proxy was disabled (3 consecutive 403s)');
}
// Per workflow-12102025.md: immediately get new IP + new fingerprint
const { proxy: nextProxy, fingerprint } = crawlRotator.rotateBoth();
// Rotate to next proxy
const nextProxy = crawlRotator.rotateProxy();
if (nextProxy) {
currentProxy = crawlRotator.proxy.getProxyUrl(nextProxy);
console.log(`[Dutchie Client] Rotated proxy: ${currentProxy.replace(/:[^:@]+@/, ':***@')}`);
console.log(`[Dutchie Client] Rotated to new proxy: ${currentProxy.replace(/:[^:@]+@/, ':***@')}`);
console.log(`[Dutchie Client] New fingerprint: ${fingerprint.userAgent.slice(0, 50)}...`);
return true;
}
console.warn('[Dutchie Client] No more proxies available');
console.error('[Dutchie Client] No more proxies available!');
return false;
}
/**
* Record success on current proxy
* Per workflow-12102025.md: resets consecutive_403_count
*/
async function recordProxySuccess(responseTimeMs?: number): Promise<void> {
if (crawlRotator) {
@@ -162,163 +185,69 @@ export const GRAPHQL_HASHES = {
GetAllCitiesByState: 'ae547a0466ace5a48f91e55bf6699eacd87e3a42841560f0c0eabed5a0a920e6',
};
// ============================================================
// FINGERPRINTS - Browser profiles for anti-detect
// ============================================================
const FINGERPRINTS: Fingerprint[] = [
// Chrome Windows (latest) - typical residential user, use first
{
userAgent: 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36',
acceptLanguage: 'en-US,en;q=0.9',
secChUa: '"Google Chrome";v="131", "Chromium";v="131", "Not_A Brand";v="24"',
secChUaPlatform: '"Windows"',
secChUaMobile: '?0',
},
// Chrome Mac (latest)
{
userAgent: 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36',
acceptLanguage: 'en-US,en;q=0.9',
secChUa: '"Google Chrome";v="131", "Chromium";v="131", "Not_A Brand";v="24"',
secChUaPlatform: '"macOS"',
secChUaMobile: '?0',
},
// Chrome Windows (120)
{
userAgent: 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
acceptLanguage: 'en-US,en;q=0.9',
secChUa: '"Chromium";v="120", "Google Chrome";v="120", "Not-A.Brand";v="99"',
secChUaPlatform: '"Windows"',
secChUaMobile: '?0',
},
// Firefox Windows
{
userAgent: 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:133.0) Gecko/20100101 Firefox/133.0',
acceptLanguage: 'en-US,en;q=0.5',
},
// Safari Mac
{
userAgent: 'Mozilla/5.0 (Macintosh; Intel Mac OS X 14_2) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.2 Safari/605.1.15',
acceptLanguage: 'en-US,en;q=0.9',
},
// Edge Windows
{
userAgent: 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36 Edg/131.0.0.0',
acceptLanguage: 'en-US,en;q=0.9',
secChUa: '"Microsoft Edge";v="131", "Chromium";v="131", "Not_A Brand";v="24"',
secChUaPlatform: '"Windows"',
secChUaMobile: '?0',
},
];
let currentFingerprintIndex = 0;
// Forward declaration for session (actual CrawlSession interface defined later)
let currentSession: {
sessionId: string;
fingerprint: Fingerprint;
proxyUrl: string | null;
stateCode?: string;
timezone?: string;
startedAt: Date;
} | null = null;
/**
* Get current fingerprint - returns session fingerprint if active, otherwise default
*/
export function getFingerprint(): Fingerprint {
// Use session fingerprint if a session is active
if (currentSession) {
return currentSession.fingerprint;
}
return FINGERPRINTS[currentFingerprintIndex];
}
export function rotateFingerprint(): Fingerprint {
currentFingerprintIndex = (currentFingerprintIndex + 1) % FINGERPRINTS.length;
const fp = FINGERPRINTS[currentFingerprintIndex];
console.log(`[Dutchie Client] Rotated to fingerprint: ${fp.userAgent.slice(0, 50)}...`);
return fp;
}
export function resetFingerprint(): void {
currentFingerprintIndex = 0;
}
/**
* Get a random fingerprint from the pool
*/
export function getRandomFingerprint(): Fingerprint {
const index = Math.floor(Math.random() * FINGERPRINTS.length);
return FINGERPRINTS[index];
}
// ============================================================
// SESSION MANAGEMENT
// Per-session fingerprint rotation for stealth
// Per workflow-12102025.md:
// - Session identity comes from PROXY LOCATION
// - NOT from task params (no stateCode/timezone params)
// - Language is always English
// ============================================================
export interface CrawlSession {
sessionId: string;
fingerprint: Fingerprint;
fingerprint: BrowserFingerprint;
proxyUrl: string | null;
stateCode?: string;
timezone?: string;
proxyTimezone?: string;
proxyState?: string;
startedAt: Date;
// Per workflow-12102025.md: Dynamic Referer per dispensary
menuUrl?: string;
referer: string;
}
// Note: currentSession variable declared earlier in file for proper scoping
let currentSession: CrawlSession | null = null;
/**
* Timezone to Accept-Language mapping
* US timezones all use en-US but this can be extended for international
* Start a new crawl session
*
* Per workflow-12102025.md:
* - NO state/timezone params - identity comes from proxy location
* - Gets fingerprint from CrawlRotator (uses intoli/user-agents)
* - Language is always English (en-US)
* - Dynamic Referer per dispensary (from menuUrl)
*
* @param menuUrl - The dispensary's menu URL for dynamic Referer header
*/
const TIMEZONE_TO_LOCALE: Record<string, string> = {
'America/Phoenix': 'en-US,en;q=0.9',
'America/Los_Angeles': 'en-US,en;q=0.9',
'America/Denver': 'en-US,en;q=0.9',
'America/Chicago': 'en-US,en;q=0.9',
'America/New_York': 'en-US,en;q=0.9',
'America/Detroit': 'en-US,en;q=0.9',
'America/Anchorage': 'en-US,en;q=0.9',
'Pacific/Honolulu': 'en-US,en;q=0.9',
};
export function startSession(menuUrl?: string): CrawlSession {
if (!crawlRotator) {
throw new Error('[Dutchie Client] Cannot start session without CrawlRotator');
}
/**
* Get Accept-Language header for a given timezone
*/
export function getLocaleForTimezone(timezone?: string): string {
if (!timezone) return 'en-US,en;q=0.9';
return TIMEZONE_TO_LOCALE[timezone] || 'en-US,en;q=0.9';
}
// Per workflow-12102025.md: get identity from proxy location
const proxyLocation = crawlRotator.getProxyLocation();
const fingerprint = crawlRotator.userAgent.getCurrent();
/**
* Start a new crawl session with a random fingerprint
* Call this before crawling a store to get a fresh identity
*/
export function startSession(stateCode?: string, timezone?: string): CrawlSession {
const baseFp = getRandomFingerprint();
// Override Accept-Language based on timezone for geographic consistency
const fingerprint: Fingerprint = {
...baseFp,
acceptLanguage: getLocaleForTimezone(timezone),
};
// Per workflow-12102025.md: Dynamic Referer per dispensary
const referer = buildRefererFromMenuUrl(menuUrl);
currentSession = {
sessionId: `session_${Date.now()}_${Math.random().toString(36).slice(2, 8)}`,
fingerprint,
proxyUrl: currentProxy,
stateCode,
timezone,
proxyTimezone: proxyLocation?.timezone,
proxyState: proxyLocation?.state,
startedAt: new Date(),
menuUrl,
referer,
};
console.log(`[Dutchie Client] Started session ${currentSession.sessionId}`);
console.log(`[Dutchie Client] Fingerprint: ${fingerprint.userAgent.slice(0, 50)}...`);
console.log(`[Dutchie Client] Accept-Language: ${fingerprint.acceptLanguage}`);
if (timezone) {
console.log(`[Dutchie Client] Timezone: ${timezone}`);
console.log(`[Dutchie Client] Browser: ${fingerprint.browserName} (${fingerprint.deviceCategory})`);
console.log(`[Dutchie Client] DNT: ${fingerprint.httpFingerprint.hasDNT ? 'enabled' : 'disabled'}`);
console.log(`[Dutchie Client] TLS: ${fingerprint.httpFingerprint.curlImpersonateBinary}`);
console.log(`[Dutchie Client] Referer: ${referer}`);
if (proxyLocation?.timezone) {
console.log(`[Dutchie Client] Proxy: ${proxyLocation.state || 'unknown'} (${proxyLocation.timezone})`);
}
return currentSession;
@@ -347,48 +276,80 @@ export function getCurrentSession(): CrawlSession | null {
// ============================================================
/**
* Build headers for Dutchie requests
* Per workflow-12102025.md: Build headers using HTTP fingerprint system
* Returns headers in browser-specific order with all natural variations
*/
export function buildHeaders(refererPath: string, fingerprint?: Fingerprint): Record<string, string> {
const fp = fingerprint || getFingerprint();
const refererUrl = `https://dutchie.com${refererPath}`;
const headers: Record<string, string> = {
'accept': 'application/json, text/plain, */*',
'accept-language': fp.acceptLanguage,
'content-type': 'application/json',
'origin': 'https://dutchie.com',
'referer': refererUrl,
'user-agent': fp.userAgent,
'apollographql-client-name': 'Marketplace (production)',
};
if (fp.secChUa) {
headers['sec-ch-ua'] = fp.secChUa;
headers['sec-ch-ua-mobile'] = fp.secChUaMobile || '?0';
headers['sec-ch-ua-platform'] = fp.secChUaPlatform || '"Windows"';
headers['sec-fetch-dest'] = 'empty';
headers['sec-fetch-mode'] = 'cors';
headers['sec-fetch-site'] = 'same-site';
export function buildHeaders(isPost: boolean, contentLength?: number): { headers: Record<string, string>; orderedHeaders: string[] } {
if (!currentSession || !crawlRotator) {
throw new Error('[Dutchie Client] Cannot build headers without active session');
}
return headers;
const fp = currentSession.fingerprint;
const httpFp = fp.httpFingerprint;
// Per workflow-12102025.md: Build context for ordered headers
const context: HeaderContext = {
userAgent: fp.userAgent,
secChUa: fp.secChUa,
secChUaPlatform: fp.secChUaPlatform,
secChUaMobile: fp.secChUaMobile,
referer: currentSession.referer,
isPost,
contentLength,
};
// Per workflow-12102025.md: Get ordered headers from HTTP fingerprint service
return buildOrderedHeaders(httpFp, context);
}
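For context, a minimal sketch of what an ordered-header builder like `buildOrderedHeaders` might do — emit only the headers actually set, in a fixed browser-specific sequence. The `HeaderContext` shape and `CHROME_HEADER_ORDER` list here are illustrative assumptions, not the real fingerprint service:

```typescript
// Hypothetical sketch of browser-ordered header emission.
interface SketchHeaderContext {
  userAgent: string;
  referer: string;
  isPost: boolean;
  contentLength?: number;
}

// Assumed canonical Chrome ordering (illustrative only).
const CHROME_HEADER_ORDER = [
  'Host',
  'Content-Length',
  'User-Agent',
  'Accept',
  'Origin',
  'Referer',
];

function buildOrderedHeadersSketch(
  ctx: SketchHeaderContext,
): { headers: Record<string, string>; orderedHeaders: string[] } {
  const headers: Record<string, string> = {
    'User-Agent': ctx.userAgent,
    'Accept': 'application/json, text/plain, */*',
    'Origin': 'https://dutchie.com',
    'Referer': ctx.referer,
  };
  if (ctx.isPost && ctx.contentLength !== undefined) {
    headers['Content-Length'] = String(ctx.contentLength);
  }
  // Keep only the headers we set, preserving the browser's order.
  const orderedHeaders = CHROME_HEADER_ORDER.filter(h => h in headers);
  return { headers, orderedHeaders };
}
```

The caller then iterates `orderedHeaders` (as `curlPost`/`curlGet` do) so the wire order matches the fingerprinted browser, skipping `Host` and `Content-Length` because curl supplies those itself.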
/**
* Execute HTTP POST using curl (bypasses TLS fingerprinting)
* Per workflow-12102025.md: Get curl binary for current session's browser
* Uses curl-impersonate for TLS fingerprint matching
*/
export function curlPost(url: string, body: any, headers: Record<string, string>, timeout = 30000): CurlResponse {
const filteredHeaders = Object.entries(headers)
.filter(([k]) => k.toLowerCase() !== 'accept-encoding')
.map(([k, v]) => `-H '${k}: ${v}'`)
function getCurlBinaryForSession(): string {
if (!currentSession) {
return 'curl'; // Fallback to standard curl
}
const browserType = currentSession.fingerprint.browserName as BrowserType;
// Per workflow-12102025.md: Check if curl-impersonate is available
if (isCurlImpersonateAvailable(browserType)) {
return getCurlBinary(browserType);
}
// Fallback to standard curl with warning
console.warn(`[Dutchie Client] curl-impersonate not available for ${browserType}, using standard curl`);
return 'curl';
}
/**
* Per workflow-12102025.md: Execute HTTP POST using curl/curl-impersonate
* - Uses browser-specific TLS fingerprint via curl-impersonate
* - Headers sent in browser-specific order
* - Dynamic Referer per dispensary
*/
export function curlPost(url: string, body: any, timeout = 30000): CurlResponse {
const bodyJson = JSON.stringify(body);
// Per workflow-12102025.md: Build ordered headers for POST request
const { headers, orderedHeaders } = buildHeaders(true, bodyJson.length);
// Per workflow-12102025.md: Build header args in browser-specific order
const headerArgs = orderedHeaders
.filter(h => h !== 'Host' && h !== 'Content-Length') // curl handles these
.map(h => `-H '${h}: ${headers[h]}'`)
.join(' ');
const bodyJson = JSON.stringify(body).replace(/'/g, "'\\''");
const bodyEscaped = bodyJson.replace(/'/g, "'\\''");
const timeoutSec = Math.ceil(timeout / 1000);
const separator = '___HTTP_STATUS___';
const proxyArg = getProxyArg();
const cmd = `curl -s --compressed ${proxyArg} -w '${separator}%{http_code}' --max-time ${timeoutSec} ${filteredHeaders} -d '${bodyJson}' '${url}'`;
// Per workflow-12102025.md: Use curl-impersonate for TLS fingerprint matching
const curlBinary = getCurlBinaryForSession();
const cmd = `${curlBinary} -s --compressed ${proxyArg} -w '${separator}%{http_code}' --max-time ${timeoutSec} ${headerArgs} -d '${bodyEscaped}' '${url}'`;
try {
const output = execSync(cmd, {
@@ -427,19 +388,29 @@ export function curlPost(url: string, body: any, headers: Record<string, string>
}
/**
* Execute HTTP GET using curl (bypasses TLS fingerprinting)
* Returns HTML or JSON depending on response content-type
* Per workflow-12102025.md: Execute HTTP GET using curl/curl-impersonate
* - Uses browser-specific TLS fingerprint via curl-impersonate
* - Headers sent in browser-specific order
* - Dynamic Referer per dispensary
*/
export function curlGet(url: string, headers: Record<string, string>, timeout = 30000): CurlResponse {
const filteredHeaders = Object.entries(headers)
.filter(([k]) => k.toLowerCase() !== 'accept-encoding')
.map(([k, v]) => `-H '${k}: ${v}'`)
export function curlGet(url: string, timeout = 30000): CurlResponse {
// Per workflow-12102025.md: Build ordered headers for GET request
const { headers, orderedHeaders } = buildHeaders(false);
// Per workflow-12102025.md: Build header args in browser-specific order
const headerArgs = orderedHeaders
.filter(h => h !== 'Host' && h !== 'Content-Length') // curl handles these
.map(h => `-H '${h}: ${headers[h]}'`)
.join(' ');
const timeoutSec = Math.ceil(timeout / 1000);
const separator = '___HTTP_STATUS___';
const proxyArg = getProxyArg();
const cmd = `curl -s --compressed ${proxyArg} -w '${separator}%{http_code}' --max-time ${timeoutSec} ${filteredHeaders} '${url}'`;
// Per workflow-12102025.md: Use curl-impersonate for TLS fingerprint matching
const curlBinary = getCurlBinaryForSession();
const cmd = `${curlBinary} -s --compressed ${proxyArg} -w '${separator}%{http_code}' --max-time ${timeoutSec} ${headerArgs} '${url}'`;
try {
const output = execSync(cmd, {
@@ -459,7 +430,6 @@ export function curlGet(url: string, headers: Record<string, string>, timeout =
const responseBody = output.slice(0, separatorIndex);
const statusCode = parseInt(output.slice(separatorIndex + separator.length).trim(), 10);
// Try to parse as JSON, otherwise return as string (HTML)
try {
return { status: statusCode, data: JSON.parse(responseBody) };
} catch {
@@ -476,16 +446,22 @@ export function curlGet(url: string, headers: Record<string, string>, timeout =
// ============================================================
// GRAPHQL EXECUTION
// Per workflow-12102025.md:
// - On 403: immediately rotate IP + fingerprint (no delay first)
// - Then retry
// ============================================================
export interface ExecuteGraphQLOptions {
maxRetries?: number;
retryOn403?: boolean;
cName?: string; // Optional - used for Referer header, defaults to 'cities'
cName?: string;
}
/**
* Execute GraphQL query with curl (bypasses TLS fingerprinting)
* Per workflow-12102025.md: Execute GraphQL query with curl/curl-impersonate
* - Uses browser-specific TLS fingerprint
* - Headers in browser-specific order
* - On 403: immediately rotate IP + fingerprint, then retry
*/
export async function executeGraphQL(
operationName: string,
@@ -493,7 +469,12 @@ export async function executeGraphQL(
hash: string,
options: ExecuteGraphQLOptions
): Promise<any> {
const { maxRetries = 3, retryOn403 = true, cName = 'cities' } = options;
const { maxRetries = 3, retryOn403 = true } = options;
// Per workflow-12102025.md: Session must be active for requests
if (!currentSession) {
throw new Error('[Dutchie Client] Cannot execute GraphQL without active session - call startSession() first');
}
const body = {
operationName,
@@ -507,14 +488,14 @@ export async function executeGraphQL(
let attempt = 0;
while (attempt <= maxRetries) {
const fingerprint = getFingerprint();
const headers = buildHeaders(`/embedded-menu/${cName}`, fingerprint);
console.log(`[Dutchie Client] curl POST ${operationName} (attempt ${attempt + 1}/${maxRetries + 1})`);
const response = curlPost(DUTCHIE_CONFIG.graphqlEndpoint, body, headers, DUTCHIE_CONFIG.timeout);
const startTime = Date.now();
// Per workflow-12102025.md: curlPost now uses ordered headers and curl-impersonate
const response = curlPost(DUTCHIE_CONFIG.graphqlEndpoint, body, DUTCHIE_CONFIG.timeout);
const responseTime = Date.now() - startTime;
console.log(`[Dutchie Client] Response status: ${response.status}`);
console.log(`[Dutchie Client] Response status: ${response.status} (${responseTime}ms)`);
if (response.error) {
console.error(`[Dutchie Client] curl error: ${response.error}`);
@@ -527,6 +508,9 @@ export async function executeGraphQL(
}
if (response.status === 200) {
// Per workflow-12102025.md: success resets consecutive 403 count
await recordProxySuccess(responseTime);
if (response.data?.errors?.length > 0) {
console.warn(`[Dutchie Client] GraphQL errors: ${JSON.stringify(response.data.errors[0])}`);
}
@@ -534,11 +518,20 @@ export async function executeGraphQL(
}
if (response.status === 403 && retryOn403) {
console.warn(`[Dutchie Client] 403 blocked - rotating proxy and fingerprint...`);
await rotateProxyOn403('403 Forbidden on GraphQL');
rotateFingerprint();
// Per workflow-12102025.md: immediately rotate IP + fingerprint
console.warn(`[Dutchie Client] 403 blocked - immediately rotating proxy + fingerprint...`);
const hasMoreProxies = await handle403Block();
if (!hasMoreProxies) {
throw new Error('All proxies exhausted - no more IPs available');
}
// Per workflow-12102025.md: Update session referer after rotation
currentSession.referer = buildRefererFromMenuUrl(currentSession.menuUrl);
attempt++;
await sleep(1000 * attempt);
// Per workflow-12102025.md: small backoff after rotation
await sleep(500);
continue;
}
@@ -567,8 +560,10 @@ export interface FetchPageOptions {
}
/**
* Fetch HTML page from Dutchie (for city pages, dispensary pages, etc.)
* Returns raw HTML string
* Per workflow-12102025.md: Fetch HTML page from Dutchie
* - Uses browser-specific TLS fingerprint
* - Headers in browser-specific order
* - Same 403 handling as GraphQL
*/
export async function fetchPage(
path: string,
@@ -577,32 +572,22 @@ export async function fetchPage(
const { maxRetries = 3, retryOn403 = true } = options;
const url = `${DUTCHIE_CONFIG.baseUrl}${path}`;
// Per workflow-12102025.md: Session must be active for requests
if (!currentSession) {
throw new Error('[Dutchie Client] Cannot fetch page without active session - call startSession() first');
}
let attempt = 0;
while (attempt <= maxRetries) {
const fingerprint = getFingerprint();
const headers: Record<string, string> = {
'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8',
'accept-language': fingerprint.acceptLanguage,
'user-agent': fingerprint.userAgent,
};
if (fingerprint.secChUa) {
headers['sec-ch-ua'] = fingerprint.secChUa;
headers['sec-ch-ua-mobile'] = fingerprint.secChUaMobile || '?0';
headers['sec-ch-ua-platform'] = fingerprint.secChUaPlatform || '"Windows"';
headers['sec-fetch-dest'] = 'document';
headers['sec-fetch-mode'] = 'navigate';
headers['sec-fetch-site'] = 'none';
headers['sec-fetch-user'] = '?1';
headers['upgrade-insecure-requests'] = '1';
}
// Per workflow-12102025.md: curlGet now uses ordered headers and curl-impersonate
console.log(`[Dutchie Client] curl GET ${path} (attempt ${attempt + 1}/${maxRetries + 1})`);
const response = curlGet(url, headers, DUTCHIE_CONFIG.timeout);
const startTime = Date.now();
const response = curlGet(url, DUTCHIE_CONFIG.timeout);
const responseTime = Date.now() - startTime;
console.log(`[Dutchie Client] Response status: ${response.status}`);
console.log(`[Dutchie Client] Response status: ${response.status} (${responseTime}ms)`);
if (response.error) {
console.error(`[Dutchie Client] curl error: ${response.error}`);
@@ -614,15 +599,26 @@ export async function fetchPage(
}
if (response.status === 200) {
// Per workflow-12102025.md: success resets consecutive 403 count
await recordProxySuccess(responseTime);
return { html: response.data, status: response.status };
}
if (response.status === 403 && retryOn403) {
console.warn(`[Dutchie Client] 403 blocked - rotating proxy and fingerprint...`);
await rotateProxyOn403('403 Forbidden on page fetch');
rotateFingerprint();
// Per workflow-12102025.md: immediately rotate IP + fingerprint
console.warn(`[Dutchie Client] 403 blocked - immediately rotating proxy + fingerprint...`);
const hasMoreProxies = await handle403Block();
if (!hasMoreProxies) {
throw new Error('All proxies exhausted - no more IPs available');
}
// Per workflow-12102025.md: Update session after rotation
currentSession.referer = buildRefererFromMenuUrl(currentSession.menuUrl);
attempt++;
await sleep(1000 * attempt);
// Per workflow-12102025.md: small backoff after rotation
await sleep(500);
continue;
}
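Both `executeGraphQL` and `fetchPage` now share the same 403 shape: rotate immediately, take a short fixed backoff, and bail once proxies are exhausted. A condensed sketch of that loop, with `doRequest` and `rotate` as assumed injectables standing in for the curl call and `handle403Block`:

```typescript
// Illustrative sketch of the shared rotate-on-403 retry loop.
async function withRotationOn403<T>(
  doRequest: () => Promise<{ status: number; data?: T }>,
  rotate: () => Promise<boolean>, // false => proxies exhausted
  maxRetries = 3,
): Promise<T> {
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    const res = await doRequest();
    if (res.status === 200) return res.data as T;
    if (res.status === 403) {
      // Per workflow-12102025.md: rotate first, no delay before rotation.
      const hasMoreProxies = await rotate();
      if (!hasMoreProxies) throw new Error('All proxies exhausted - no more IPs available');
      await new Promise(r => setTimeout(r, 500)); // small fixed backoff after rotation
      continue;
    }
    throw new Error(`Unexpected status ${res.status}`);
  }
  throw new Error('Max retries exceeded');
}
```

The fixed 500 ms backoff replaces the earlier `1000 * attempt` linear backoff, since the rotation itself is what clears the block rather than elapsed time.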


@@ -6,22 +6,17 @@
*/
export {
// HTTP Client
// HTTP Client (per workflow-12102025.md: uses curl-impersonate + ordered headers)
curlPost,
curlGet,
executeGraphQL,
fetchPage,
extractNextData,
// Headers & Fingerprints
// Headers (per workflow-12102025.md: browser-specific ordering)
buildHeaders,
getFingerprint,
rotateFingerprint,
resetFingerprint,
getRandomFingerprint,
getLocaleForTimezone,
// Session Management (per-store fingerprint rotation)
// Session Management (per workflow-12102025.md: menuUrl for dynamic Referer)
startSession,
endSession,
getCurrentSession,


@@ -7,10 +7,17 @@
* Routes are prefixed with /api/analytics/v2
*
* Phase 3: Analytics Engine + Rec/Med by State
*
* SECURITY: All routes require authentication via authMiddleware.
* Access is granted to:
* - Trusted origins (cannaiq.co, findadispo.com, etc.)
* - Trusted IPs (localhost, internal pods)
* - Valid JWT or API tokens
*/
import { Router, Request, Response } from 'express';
import { Pool } from 'pg';
import { authMiddleware } from '../auth/middleware';
import { PriceAnalyticsService } from '../services/analytics/PriceAnalyticsService';
import { BrandPenetrationService } from '../services/analytics/BrandPenetrationService';
import { CategoryAnalyticsService } from '../services/analytics/CategoryAnalyticsService';
@@ -36,6 +43,10 @@ function parseLegalType(legalType?: string): LegalType {
export function createAnalyticsV2Router(pool: Pool): Router {
const router = Router();
// SECURITY: Apply auth middleware to ALL routes
// This gate ensures only authenticated requests can access analytics data
router.use(authMiddleware);
// Initialize services
const priceService = new PriceAnalyticsService(pool);
const brandService = new BrandPenetrationService(pool);


@@ -14,13 +14,25 @@ router.use(authMiddleware);
/**
* GET /api/admin/intelligence/brands
* List all brands with state presence, store counts, and pricing
* Query params:
* - state: Filter by state (e.g., "AZ")
* - limit: Max results (default 500)
* - offset: Pagination offset
*/
router.get('/brands', async (req: Request, res: Response) => {
try {
const { limit = '500', offset = '0' } = req.query;
const { limit = '500', offset = '0', state } = req.query;
const limitNum = Math.min(parseInt(limit as string, 10), 1000);
const offsetNum = parseInt(offset as string, 10);
// Build WHERE clause based on state filter
let stateFilter = '';
const params: any[] = [limitNum, offsetNum];
if (state && state !== 'all') {
stateFilter = 'AND d.state = $3';
params.push(state);
}
const { rows } = await pool.query(`
SELECT
sp.brand_name_raw as brand_name,
@@ -32,17 +44,26 @@ router.get('/brands', async (req: Request, res: Response) => {
FROM store_products sp
JOIN dispensaries d ON sp.dispensary_id = d.id
WHERE sp.brand_name_raw IS NOT NULL AND sp.brand_name_raw != ''
${stateFilter}
GROUP BY sp.brand_name_raw
ORDER BY store_count DESC, sku_count DESC
LIMIT $1 OFFSET $2
`, [limitNum, offsetNum]);
`, params);
// Get total count
// Get total count with same state filter
const countParams: any[] = [];
let countStateFilter = '';
if (state && state !== 'all') {
countStateFilter = 'AND d.state = $1';
countParams.push(state);
}
const { rows: countRows } = await pool.query(`
SELECT COUNT(DISTINCT brand_name_raw) as total
FROM store_products
WHERE brand_name_raw IS NOT NULL AND brand_name_raw != ''
`);
SELECT COUNT(DISTINCT sp.brand_name_raw) as total
FROM store_products sp
JOIN dispensaries d ON sp.dispensary_id = d.id
WHERE sp.brand_name_raw IS NOT NULL AND sp.brand_name_raw != ''
${countStateFilter}
`, countParams);
res.json({
brands: rows.map((r: any) => ({
@@ -147,10 +168,42 @@ router.get('/brands/:brandName/penetration', async (req: Request, res: Response)
/**
* GET /api/admin/intelligence/pricing
* Get pricing analytics by category
* Query params:
* - state: Filter by state (e.g., "AZ")
*/
router.get('/pricing', async (req: Request, res: Response) => {
try {
const { rows: categoryRows } = await pool.query(`
const { state } = req.query;
// Build WHERE clause based on state filter
let stateFilter = '';
const categoryParams: any[] = [];
const stateQueryParams: any[] = [];
const overallParams: any[] = [];
if (state && state !== 'all') {
stateFilter = 'AND d.state = $1';
categoryParams.push(state);
overallParams.push(state);
}
// Category pricing with optional state filter
const categoryQuery = state && state !== 'all'
? `
SELECT
sp.category_raw as category,
ROUND(AVG(sp.price_rec)::numeric, 2) as avg_price,
MIN(sp.price_rec) as min_price,
MAX(sp.price_rec) as max_price,
ROUND((PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY sp.price_rec))::numeric, 2) as median_price,
COUNT(*) as product_count
FROM store_products sp
JOIN dispensaries d ON sp.dispensary_id = d.id
WHERE sp.category_raw IS NOT NULL AND sp.price_rec > 0 ${stateFilter}
GROUP BY sp.category_raw
ORDER BY product_count DESC
`
: `
SELECT
sp.category_raw as category,
ROUND(AVG(sp.price_rec)::numeric, 2) as avg_price,
@@ -162,8 +215,11 @@ router.get('/pricing', async (req: Request, res: Response) => {
WHERE sp.category_raw IS NOT NULL AND sp.price_rec > 0
GROUP BY sp.category_raw
ORDER BY product_count DESC
`);
`;
const { rows: categoryRows } = await pool.query(categoryQuery, categoryParams);
// State pricing
const { rows: stateRows } = await pool.query(`
SELECT
d.state,
@@ -178,6 +234,31 @@ router.get('/pricing', async (req: Request, res: Response) => {
ORDER BY avg_price DESC
`);
// Overall stats with optional state filter
const overallQuery = state && state !== 'all'
? `
SELECT
ROUND(AVG(sp.price_rec)::numeric, 2) as avg_price,
MIN(sp.price_rec) as min_price,
MAX(sp.price_rec) as max_price,
COUNT(*) as total_products
FROM store_products sp
JOIN dispensaries d ON sp.dispensary_id = d.id
WHERE sp.price_rec > 0 ${stateFilter}
`
: `
SELECT
ROUND(AVG(sp.price_rec)::numeric, 2) as avg_price,
MIN(sp.price_rec) as min_price,
MAX(sp.price_rec) as max_price,
COUNT(*) as total_products
FROM store_products sp
WHERE sp.price_rec > 0
`;
const { rows: overallRows } = await pool.query(overallQuery, overallParams);
const overall = overallRows[0];
res.json({
byCategory: categoryRows.map((r: any) => ({
category: r.category,
@@ -194,6 +275,12 @@ router.get('/pricing', async (req: Request, res: Response) => {
maxPrice: r.max_price ? parseFloat(r.max_price) : null,
productCount: parseInt(r.product_count, 10),
})),
overall: {
avgPrice: overall?.avg_price ? parseFloat(overall.avg_price) : null,
minPrice: overall?.min_price ? parseFloat(overall.min_price) : null,
maxPrice: overall?.max_price ? parseFloat(overall.max_price) : null,
totalProducts: parseInt(overall?.total_products || '0', 10),
},
});
} catch (error: any) {
console.error('[Intelligence] Error fetching pricing:', error.message);
@@ -204,9 +291,23 @@ router.get('/pricing', async (req: Request, res: Response) => {
/**
* GET /api/admin/intelligence/stores
* Get store intelligence summary
* Query params:
* - state: Filter by state (e.g., "AZ")
* - limit: Max results (default 200)
*/
router.get('/stores', async (req: Request, res: Response) => {
try {
const { state, limit = '200' } = req.query;
const limitNum = Math.min(parseInt(limit as string, 10), 500);
// Build WHERE clause based on state filter
let stateFilter = '';
const params: any[] = [limitNum];
if (state && state !== 'all') {
stateFilter = 'AND d.state = $2';
params.push(state);
}
const { rows: storeRows } = await pool.query(`
SELECT
d.id,
@@ -216,17 +317,22 @@ router.get('/stores', async (req: Request, res: Response) => {
d.state,
d.menu_type,
d.crawl_enabled,
COUNT(DISTINCT sp.id) as product_count,
c.name as chain_name,
COUNT(DISTINCT sp.id) as sku_count,
COUNT(DISTINCT sp.brand_name_raw) as brand_count,
ROUND(AVG(sp.price_rec)::numeric, 2) as avg_price,
MAX(sp.updated_at) as last_product_update
MAX(sp.updated_at) as last_crawl,
(SELECT COUNT(*) FROM store_product_snapshots sps
WHERE sps.store_product_id IN (SELECT id FROM store_products WHERE dispensary_id = d.id)) as snapshot_count
FROM dispensaries d
LEFT JOIN store_products sp ON sp.dispensary_id = d.id
WHERE d.state IS NOT NULL
GROUP BY d.id, d.name, d.dba_name, d.city, d.state, d.menu_type, d.crawl_enabled
ORDER BY product_count DESC
LIMIT 200
`);
LEFT JOIN chains c ON d.chain_id = c.id
WHERE d.state IS NOT NULL AND d.crawl_enabled = true
${stateFilter}
GROUP BY d.id, d.name, d.dba_name, d.city, d.state, d.menu_type, d.crawl_enabled, c.name
ORDER BY sku_count DESC
LIMIT $1
`, params);
res.json({
stores: storeRows.map((r: any) => ({
@@ -237,10 +343,13 @@ router.get('/stores', async (req: Request, res: Response) => {
state: r.state,
menuType: r.menu_type,
crawlEnabled: r.crawl_enabled,
productCount: parseInt(r.product_count || '0', 10),
chainName: r.chain_name || null,
skuCount: parseInt(r.sku_count || '0', 10),
snapshotCount: parseInt(r.snapshot_count || '0', 10),
brandCount: parseInt(r.brand_count || '0', 10),
avgPrice: r.avg_price ? parseFloat(r.avg_price) : null,
lastProductUpdate: r.last_product_update,
lastCrawl: r.last_crawl,
crawlFrequencyHours: 4, // Default crawl frequency
})),
total: storeRows.length,
});
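The intelligence routes above repeat one pattern: append an optional state filter while keeping the `$n` placeholder index in sync with the params array. A small sketch of that helper shape (hypothetical name; the routes inline it):

```typescript
// Sketch: build an optional "AND d.state = $n" clause, pushing the
// value onto params so the placeholder index always matches.
function buildStateFilter(state: string | undefined, params: any[]): string {
  if (state && state !== 'all') {
    params.push(state);
    return `AND d.state = $${params.length}`;
  }
  return '';
}
```

Usage mirrors the `/stores` route: start `params` with `[limitNum]`, then `buildStateFilter('AZ', params)` yields `AND d.state = $2` and `params` becomes `[limitNum, 'AZ']`.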


@@ -543,6 +543,9 @@ router.post('/bulk-priority', async (req: Request, res: Response) => {
/**
* POST /api/job-queue/enqueue - Add a new job to the queue
*
* 2024-12-10: Rewired to use worker_tasks via taskService.
* Legacy dispensary_crawl_jobs code commented out below.
*/
router.post('/enqueue', async (req: Request, res: Response) => {
try {
@@ -552,6 +555,59 @@ router.post('/enqueue', async (req: Request, res: Response) => {
return res.status(400).json({ success: false, error: 'dispensary_id is required' });
}
// 2024-12-10: Map legacy job_type to new task role
const roleMap: Record<string, string> = {
'dutchie_product_crawl': 'product_refresh',
'menu_detection': 'entry_point_discovery',
'menu_detection_single': 'entry_point_discovery',
'product_discovery': 'product_discovery',
'store_discovery': 'store_discovery',
};
const role = roleMap[job_type] || 'product_refresh';
// 2024-12-10: Use taskService to create task in worker_tasks table
const { taskService } = await import('../tasks/task-service');
// Check if task already pending for this dispensary
const existingTasks = await taskService.listTasks({
dispensary_id,
role: role as any,
status: ['pending', 'claimed', 'running'],
limit: 1,
});
if (existingTasks.length > 0) {
return res.json({
success: true,
task_id: existingTasks[0].id,
message: 'Task already queued'
});
}
const task = await taskService.createTask({
role: role as any,
dispensary_id,
priority,
});
res.json({ success: true, task_id: task.id, message: 'Task enqueued' });
} catch (error: any) {
console.error('[JobQueue] Error enqueuing task:', error);
res.status(500).json({ success: false, error: error.message });
}
});
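The legacy-to-task translation at the heart of this rewire is just a lookup with a `product_refresh` fallback. Extracted as a standalone sketch (the route inlines the map rather than using a helper like this):

```typescript
// Sketch of the legacy job_type -> worker_tasks role mapping.
const roleMap: Record<string, string> = {
  dutchie_product_crawl: 'product_refresh',
  menu_detection: 'entry_point_discovery',
  menu_detection_single: 'entry_point_discovery',
  product_discovery: 'product_discovery',
  store_discovery: 'store_discovery',
};

function mapJobTypeToRole(jobType: string): string {
  // Unknown legacy types default to a product refresh.
  return roleMap[jobType] ?? 'product_refresh';
}
```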
/*
* LEGACY CODE - 2024-12-10: Commented out, was using orphaned dispensary_crawl_jobs table
*
router.post('/enqueue', async (req: Request, res: Response) => {
try {
const { dispensary_id, job_type = 'dutchie_product_crawl', priority = 0 } = req.body;
if (!dispensary_id) {
return res.status(400).json({ success: false, error: 'dispensary_id is required' });
}
// Check if job already pending for this dispensary
const existing = await pool.query(`
SELECT id FROM dispensary_crawl_jobs
@@ -585,6 +641,7 @@ router.post('/enqueue', async (req: Request, res: Response) => {
res.status(500).json({ success: false, error: error.message });
}
});
*/
/**
* POST /api/job-queue/pause - Pause queue processing
@@ -612,6 +669,8 @@ router.get('/paused', async (_req: Request, res: Response) => {
/**
* POST /api/job-queue/enqueue-batch - Queue multiple dispensaries at once
* Body: { dispensary_ids: number[], job_type?: string, priority?: number }
*
* 2024-12-10: Rewired to use worker_tasks via taskService.
*/
router.post('/enqueue-batch', async (req: Request, res: Response) => {
try {
@@ -625,35 +684,30 @@ router.post('/enqueue-batch', async (req: Request, res: Response) => {
return res.status(400).json({ success: false, error: 'Maximum 500 dispensaries per batch' });
}
// Insert jobs, skipping duplicates
const { rows } = await pool.query(`
INSERT INTO dispensary_crawl_jobs (dispensary_id, job_type, priority, trigger_type, status, created_at)
SELECT
d.id,
$2::text,
$3::integer,
'api_batch',
'pending',
NOW()
FROM dispensaries d
WHERE d.id = ANY($1::int[])
AND d.crawl_enabled = true
AND d.platform_dispensary_id IS NOT NULL
AND NOT EXISTS (
SELECT 1 FROM dispensary_crawl_jobs cj
WHERE cj.dispensary_id = d.id
AND cj.job_type = $2::text
AND cj.status IN ('pending', 'running')
)
RETURNING id, dispensary_id
`, [dispensary_ids, job_type, priority]);
// 2024-12-10: Map legacy job_type to new task role
const roleMap: Record<string, string> = {
'dutchie_product_crawl': 'product_refresh',
'menu_detection': 'entry_point_discovery',
'product_discovery': 'product_discovery',
};
const role = roleMap[job_type] || 'product_refresh';
// 2024-12-10: Use taskService to create tasks in worker_tasks table
const { taskService } = await import('../tasks/task-service');
const tasks = dispensary_ids.map(dispensary_id => ({
role: role as any,
dispensary_id,
priority,
}));
const createdCount = await taskService.createTasks(tasks);
res.json({
success: true,
queued: rows.length,
queued: createdCount,
requested: dispensary_ids.length,
job_ids: rows.map(r => r.id),
message: `Queued ${rows.length} of ${dispensary_ids.length} dispensaries`
message: `Queued ${createdCount} of ${dispensary_ids.length} dispensaries`
});
} catch (error: any) {
console.error('[JobQueue] Error batch enqueuing:', error);
@@ -664,6 +718,8 @@ router.post('/enqueue-batch', async (req: Request, res: Response) => {
/**
* POST /api/job-queue/enqueue-state - Queue all crawl-enabled dispensaries for a state
* Body: { state_code: string, job_type?: string, priority?: number, limit?: number }
*
* 2024-12-10: Rewired to use worker_tasks via taskService.
*/
router.post('/enqueue-state', async (req: Request, res: Response) => {
try {
@@ -673,52 +729,55 @@ router.post('/enqueue-state', async (req: Request, res: Response) => {
return res.status(400).json({ success: false, error: 'state_code is required (e.g., "AZ")' });
}
// Get state_id and queue jobs
const { rows } = await pool.query(`
WITH target_state AS (
SELECT id FROM states WHERE code = $1
)
INSERT INTO dispensary_crawl_jobs (dispensary_id, job_type, priority, trigger_type, status, created_at)
SELECT
d.id,
$2::text,
$3::integer,
'api_state',
'pending',
NOW()
FROM dispensaries d, target_state
WHERE d.state_id = target_state.id
// 2024-12-10: Map legacy job_type to new task role
const roleMap: Record<string, string> = {
'dutchie_product_crawl': 'product_refresh',
'menu_detection': 'entry_point_discovery',
'product_discovery': 'product_discovery',
};
const role = roleMap[job_type] || 'product_refresh';
// Get dispensary IDs for the state
const dispensaryResult = await pool.query(`
SELECT d.id
FROM dispensaries d
JOIN states s ON s.id = d.state_id
WHERE s.code = $1
AND d.crawl_enabled = true
AND d.platform_dispensary_id IS NOT NULL
AND NOT EXISTS (
SELECT 1 FROM dispensary_crawl_jobs cj
WHERE cj.dispensary_id = d.id
AND cj.job_type = $2::text
AND cj.status IN ('pending', 'running')
)
LIMIT $4::integer
RETURNING id, dispensary_id
`, [state_code.toUpperCase(), job_type, priority, limit]);
LIMIT $2
`, [state_code.toUpperCase(), limit]);
const dispensary_ids = dispensaryResult.rows.map((r: any) => r.id);
// 2024-12-10: Use taskService to create tasks in worker_tasks table
const { taskService } = await import('../tasks/task-service');
const tasks = dispensary_ids.map((dispensary_id: number) => ({
role: role as any,
dispensary_id,
priority,
}));
const createdCount = await taskService.createTasks(tasks);
// Get total available count
const countResult = await pool.query(`
WITH target_state AS (
SELECT id FROM states WHERE code = $1
)
SELECT COUNT(*) as total
FROM dispensaries d, target_state
WHERE d.state_id = target_state.id
FROM dispensaries d
JOIN states s ON s.id = d.state_id
WHERE s.code = $1
AND d.crawl_enabled = true
AND d.platform_dispensary_id IS NOT NULL
`, [state_code.toUpperCase()]);
res.json({
success: true,
queued: rows.length,
queued: createdCount,
total_available: parseInt(countResult.rows[0].total, 10),
state: state_code.toUpperCase(),
job_type,
message: `Queued ${rows.length} dispensaries for ${state_code.toUpperCase()}`
role,
message: `Queued ${createdCount} dispensaries for ${state_code.toUpperCase()}`
});
} catch (error: any) {
console.error('[JobQueue] Error enqueuing state:', error);


@@ -291,6 +291,107 @@ router.get('/stores/:id/summary', async (req: Request, res: Response) => {
}
});
/**
* GET /api/markets/stores/:id/crawl-history
* Get crawl history for a specific store
*/
router.get('/stores/:id/crawl-history', async (req: Request, res: Response) => {
try {
const { id } = req.params;
const { limit = '50' } = req.query;
const dispensaryId = parseInt(id, 10);
const limitNum = Math.min(parseInt(limit as string, 10), 100);
// Get crawl history from crawl_orchestration_traces
const { rows: historyRows } = await pool.query(`
SELECT
id,
run_id,
profile_key,
crawler_module,
state_at_start,
state_at_end,
total_steps,
duration_ms,
success,
error_message,
products_found,
started_at,
completed_at
FROM crawl_orchestration_traces
WHERE dispensary_id = $1
ORDER BY started_at DESC
LIMIT $2
`, [dispensaryId, limitNum]);
// Get next scheduled crawl if available
const { rows: scheduleRows } = await pool.query(`
SELECT
js.id as schedule_id,
js.job_name,
js.enabled,
js.base_interval_minutes,
js.jitter_minutes,
js.next_run_at,
js.last_run_at,
js.last_status
FROM job_schedules js
WHERE js.enabled = true
AND js.job_config->>'dispensaryId' = $1::text
ORDER BY js.next_run_at
LIMIT 1
`, [dispensaryId.toString()]);
// Get dispensary info for slug
const { rows: dispRows } = await pool.query(`
SELECT
id,
name,
dba_name,
slug,
state,
city,
menu_type,
platform_dispensary_id,
last_menu_scrape
FROM dispensaries
WHERE id = $1
`, [dispensaryId]);
res.json({
dispensary: dispRows[0] || null,
history: historyRows.map(row => ({
id: row.id,
runId: row.run_id,
profileKey: row.profile_key,
crawlerModule: row.crawler_module,
stateAtStart: row.state_at_start,
stateAtEnd: row.state_at_end,
totalSteps: row.total_steps,
durationMs: row.duration_ms,
success: row.success,
errorMessage: row.error_message,
productsFound: row.products_found,
startedAt: row.started_at?.toISOString() || null,
completedAt: row.completed_at?.toISOString() || null,
})),
nextSchedule: scheduleRows[0] ? {
scheduleId: scheduleRows[0].schedule_id,
jobName: scheduleRows[0].job_name,
enabled: scheduleRows[0].enabled,
baseIntervalMinutes: scheduleRows[0].base_interval_minutes,
jitterMinutes: scheduleRows[0].jitter_minutes,
nextRunAt: scheduleRows[0].next_run_at?.toISOString() || null,
lastRunAt: scheduleRows[0].last_run_at?.toISOString() || null,
lastStatus: scheduleRows[0].last_status,
} : null,
});
} catch (error: any) {
console.error('[Markets] Error fetching crawl history:', error.message);
res.status(500).json({ error: error.message });
}
});
/**
* GET /api/markets/stores/:id/products
* Get products for a store with filtering and pagination


@@ -78,14 +78,14 @@ router.get('/metrics', async (_req: Request, res: Response) => {
/**
* GET /api/admin/orchestrator/states
* Returns array of states with at least one known dispensary
* Returns array of states with at least one crawl-enabled dispensary
*/
router.get('/states', async (_req: Request, res: Response) => {
try {
const { rows } = await pool.query(`
SELECT DISTINCT state, COUNT(*) as store_count
FROM dispensaries
WHERE state IS NOT NULL
WHERE state IS NOT NULL AND crawl_enabled = true
GROUP BY state
ORDER BY state
`);


@@ -0,0 +1,334 @@
/**
* Payload Routes
*
* Per TASK_WORKFLOW_2024-12-10.md: API access to raw crawl payloads.
*
* Endpoints:
* - GET /api/payloads - List payload metadata (paginated)
* - GET /api/payloads/:id - Get payload metadata by ID
* - GET /api/payloads/:id/data - Get full payload JSON
* - GET /api/payloads/store/:dispensaryId - List payloads for a store
* - GET /api/payloads/store/:dispensaryId/latest - Get latest payload for a store
* - GET /api/payloads/store/:dispensaryId/diff - Diff two payloads
*/
import { Router, Request, Response } from 'express';
import { getPool } from '../db/pool';
import {
loadRawPayloadById,
getLatestPayload,
getRecentPayloads,
listPayloadMetadata,
} from '../utils/payload-storage';
import { Pool } from 'pg';
const router = Router();
// Get pool instance for queries
const getDbPool = (): Pool => getPool() as unknown as Pool;
/**
* GET /api/payloads
* List payload metadata (paginated)
*/
router.get('/', async (req: Request, res: Response) => {
try {
const pool = getDbPool();
const limit = Math.min(parseInt(req.query.limit as string) || 50, 100);
const offset = parseInt(req.query.offset as string) || 0;
const dispensaryId = req.query.dispensary_id ? parseInt(req.query.dispensary_id as string) : undefined;
const payloads = await listPayloadMetadata(pool, {
dispensaryId,
limit,
offset,
});
res.json({
success: true,
payloads,
pagination: { limit, offset },
});
} catch (error: any) {
console.error('[Payloads] List error:', error.message);
res.status(500).json({ success: false, error: error.message });
}
});
/**
* GET /api/payloads/:id
* Get payload metadata by ID
*/
router.get('/:id', async (req: Request, res: Response) => {
try {
const pool = getDbPool();
    const id = parseInt(req.params.id, 10);
    if (isNaN(id)) {
      return res.status(400).json({ success: false, error: 'Invalid payload id' });
    }
const result = await pool.query(`
SELECT
p.id,
p.dispensary_id,
p.crawl_run_id,
p.storage_path,
p.product_count,
p.size_bytes,
p.size_bytes_raw,
p.fetched_at,
p.processed_at,
p.checksum_sha256,
d.name as dispensary_name
FROM raw_crawl_payloads p
LEFT JOIN dispensaries d ON d.id = p.dispensary_id
WHERE p.id = $1
`, [id]);
if (result.rows.length === 0) {
return res.status(404).json({ success: false, error: 'Payload not found' });
}
res.json({
success: true,
payload: result.rows[0],
});
} catch (error: any) {
console.error('[Payloads] Get error:', error.message);
res.status(500).json({ success: false, error: error.message });
}
});
/**
* GET /api/payloads/:id/data
* Get full payload JSON (decompressed from disk)
*/
router.get('/:id/data', async (req: Request, res: Response) => {
try {
const pool = getDbPool();
    const id = parseInt(req.params.id, 10);
    if (isNaN(id)) {
      return res.status(400).json({ success: false, error: 'Invalid payload id' });
    }
const result = await loadRawPayloadById(pool, id);
if (!result) {
return res.status(404).json({ success: false, error: 'Payload not found' });
}
res.json({
success: true,
metadata: result.metadata,
data: result.payload,
});
} catch (error: any) {
console.error('[Payloads] Get data error:', error.message);
res.status(500).json({ success: false, error: error.message });
}
});
/**
* GET /api/payloads/store/:dispensaryId
* List payloads for a specific store
*/
router.get('/store/:dispensaryId', async (req: Request, res: Response) => {
try {
const pool = getDbPool();
const dispensaryId = parseInt(req.params.dispensaryId);
const limit = Math.min(parseInt(req.query.limit as string) || 20, 100);
const offset = parseInt(req.query.offset as string) || 0;
const payloads = await listPayloadMetadata(pool, {
dispensaryId,
limit,
offset,
});
res.json({
success: true,
dispensaryId,
payloads,
pagination: { limit, offset },
});
} catch (error: any) {
console.error('[Payloads] Store list error:', error.message);
res.status(500).json({ success: false, error: error.message });
}
});
/**
* GET /api/payloads/store/:dispensaryId/latest
* Get the latest payload for a store (with full data)
*/
router.get('/store/:dispensaryId/latest', async (req: Request, res: Response) => {
try {
const pool = getDbPool();
const dispensaryId = parseInt(req.params.dispensaryId);
const result = await getLatestPayload(pool, dispensaryId);
if (!result) {
return res.status(404).json({
success: false,
error: `No payloads found for dispensary ${dispensaryId}`,
});
}
res.json({
success: true,
metadata: result.metadata,
data: result.payload,
});
} catch (error: any) {
console.error('[Payloads] Latest error:', error.message);
res.status(500).json({ success: false, error: error.message });
}
});
/**
* GET /api/payloads/store/:dispensaryId/diff
* Compare two payloads for a store
*
* Query params:
 * - from: payload ID (older) - optional
 * - to: payload ID (newer) - optional
 * If either is omitted, the two most recent payloads are compared.
*/
router.get('/store/:dispensaryId/diff', async (req: Request, res: Response) => {
try {
const pool = getDbPool();
const dispensaryId = parseInt(req.params.dispensaryId);
const fromId = req.query.from ? parseInt(req.query.from as string) : undefined;
const toId = req.query.to ? parseInt(req.query.to as string) : undefined;
let fromPayload: any;
let toPayload: any;
if (fromId && toId) {
// Load specific payloads
const [from, to] = await Promise.all([
loadRawPayloadById(pool, fromId),
loadRawPayloadById(pool, toId),
]);
fromPayload = from;
toPayload = to;
} else {
// Load two most recent
const recent = await getRecentPayloads(pool, dispensaryId, 2);
if (recent.length < 2) {
return res.status(400).json({
success: false,
error: 'Need at least 2 payloads to diff. Only found ' + recent.length,
});
}
toPayload = recent[0]; // Most recent
fromPayload = recent[1]; // Previous
}
if (!fromPayload || !toPayload) {
return res.status(404).json({ success: false, error: 'One or both payloads not found' });
}
// Build product maps by ID
const fromProducts = new Map<string, any>();
const toProducts = new Map<string, any>();
for (const p of fromPayload.payload.products || []) {
const id = p._id || p.id;
if (id) fromProducts.set(id, p);
}
for (const p of toPayload.payload.products || []) {
const id = p._id || p.id;
if (id) toProducts.set(id, p);
}
// Find differences
const added: any[] = [];
const removed: any[] = [];
const priceChanges: any[] = [];
const stockChanges: any[] = [];
// Products in "to" but not in "from" = added
for (const [id, product] of toProducts) {
if (!fromProducts.has(id)) {
added.push({
id,
name: product.name,
brand: product.brand?.name,
price: product.Prices?.[0]?.price,
});
}
}
// Products in "from" but not in "to" = removed
for (const [id, product] of fromProducts) {
if (!toProducts.has(id)) {
removed.push({
id,
name: product.name,
brand: product.brand?.name,
price: product.Prices?.[0]?.price,
});
}
}
// Products in both - check for changes
for (const [id, toProduct] of toProducts) {
const fromProduct = fromProducts.get(id);
if (!fromProduct) continue;
const fromPrice = fromProduct.Prices?.[0]?.price;
const toPrice = toProduct.Prices?.[0]?.price;
if (fromPrice !== toPrice) {
priceChanges.push({
id,
name: toProduct.name,
brand: toProduct.brand?.name,
oldPrice: fromPrice,
newPrice: toPrice,
change: toPrice && fromPrice ? toPrice - fromPrice : null,
});
}
const fromStock = fromProduct.Status || fromProduct.status;
const toStock = toProduct.Status || toProduct.status;
if (fromStock !== toStock) {
stockChanges.push({
id,
name: toProduct.name,
brand: toProduct.brand?.name,
oldStatus: fromStock,
newStatus: toStock,
});
}
}
res.json({
success: true,
from: {
id: fromPayload.metadata.id,
fetchedAt: fromPayload.metadata.fetchedAt,
productCount: fromPayload.metadata.productCount,
},
to: {
id: toPayload.metadata.id,
fetchedAt: toPayload.metadata.fetchedAt,
productCount: toPayload.metadata.productCount,
},
diff: {
added: added.length,
removed: removed.length,
priceChanges: priceChanges.length,
stockChanges: stockChanges.length,
},
details: {
added,
removed,
priceChanges,
stockChanges,
},
});
} catch (error: any) {
console.error('[Payloads] Diff error:', error.message);
res.status(500).json({ success: false, error: error.message });
}
});
export default router;
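The diff route above indexes both payloads into `Map`s keyed by product ID, then walks them in three passes (added, removed, changed). The same technique can be isolated as a small pure helper; this is a sketch, and `diffProducts` plus its minimal input shape are illustrative, not part of the route:

```typescript
interface ProductLike {
  _id?: string;
  id?: string;
  Prices?: { price: number }[];
}

// Diff two product lists by ID: one pass to index each list,
// then set-membership checks for added/removed and a field
// comparison for products present in both.
function diffProducts(fromList: ProductLike[], toList: ProductLike[]) {
  const byId = (list: ProductLike[]) => {
    const m = new Map<string, ProductLike>();
    for (const p of list) {
      const id = p._id || p.id;
      if (id) m.set(id, p);
    }
    return m;
  };
  const from = byId(fromList);
  const to = byId(toList);

  const added = [...to.keys()].filter((id) => !from.has(id));
  const removed = [...from.keys()].filter((id) => !to.has(id));
  const priceChanges = [...to.keys()].filter((id) => {
    const f = from.get(id);
    return f !== undefined && f.Prices?.[0]?.price !== to.get(id)!.Prices?.[0]?.price;
  });
  return { added, removed, priceChanges };
}
```

Because lookups are O(1), the whole diff is linear in the combined product count, which matters when payloads carry thousands of SKUs.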

View File

@@ -13,6 +13,12 @@ import {
TaskFilter,
} from '../tasks/task-service';
import { pool } from '../db/pool';
import {
isTaskPoolPaused,
pauseTaskPool,
resumeTaskPool,
getTaskPoolStatus,
} from '../tasks/task-pool-state';
const router = Router();
@@ -592,4 +598,42 @@ router.post('/migration/full-migrate', async (req: Request, res: Response) => {
}
});
/**
* GET /api/tasks/pool/status
* Check if task pool is paused
*/
router.get('/pool/status', async (_req: Request, res: Response) => {
const status = getTaskPoolStatus();
res.json({
success: true,
...status,
});
});
/**
* POST /api/tasks/pool/pause
* Pause the task pool - workers won't pick up new tasks
*/
router.post('/pool/pause', async (_req: Request, res: Response) => {
pauseTaskPool();
res.json({
success: true,
paused: true,
message: 'Task pool paused - workers will not pick up new tasks',
});
});
/**
* POST /api/tasks/pool/resume
* Resume the task pool - workers will pick up tasks again
*/
router.post('/pool/resume', async (_req: Request, res: Response) => {
resumeTaskPool();
res.json({
success: true,
paused: false,
message: 'Task pool resumed - workers will pick up new tasks',
});
});
export default router;
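The pool routes above delegate to `task-pool-state.ts`. A minimal in-memory version of that state module might look like the following sketch (assumptions: the flag lives in process memory and the real module may instead start paused on deploy, per the commit history; the exact exported shape of `getTaskPoolStatus` is illustrative):

```typescript
// Shared in-process pause flag for the task pool.
// Workers would check isTaskPoolPaused() before claiming a task.
// (The real module exports these functions.)
let paused = false;
let pausedAt: Date | null = null;

function isTaskPoolPaused(): boolean {
  return paused;
}

function pauseTaskPool(): void {
  paused = true;
  pausedAt = new Date();
}

function resumeTaskPool(): void {
  paused = false;
  pausedAt = null;
}

function getTaskPoolStatus(): { paused: boolean; pausedAt: Date | null } {
  return { paused, pausedAt };
}
```

Note that module-level state like this is per-process: with multiple API replicas the flag would need to move to the database or a shared cache to stay consistent.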

View File

@@ -17,13 +17,167 @@
* GET /api/monitor/jobs - Get recent job history
* GET /api/monitor/active-jobs - Get currently running jobs
* GET /api/monitor/summary - Get monitoring summary
*
* K8s Scaling (added 2024-12-10):
* GET /api/workers/k8s/replicas - Get current replica count
* POST /api/workers/k8s/scale - Scale worker replicas up/down
*/
import { Router, Request, Response } from 'express';
import { pool } from '../db/pool';
import * as k8s from '@kubernetes/client-node';
const router = Router();
// ============================================================
// K8S SCALING CONFIGURATION (added 2024-12-10)
// Per TASK_WORKFLOW_2024-12-10.md: Admin can scale workers from UI
// ============================================================
const K8S_NAMESPACE = process.env.K8S_NAMESPACE || 'dispensary-scraper';
const K8S_STATEFULSET_NAME = process.env.K8S_WORKER_STATEFULSET || 'scraper-worker';
// Initialize K8s client - uses in-cluster config when running in K8s,
// or kubeconfig when running locally
let k8sAppsApi: k8s.AppsV1Api | null = null;
function getK8sClient(): k8s.AppsV1Api | null {
if (k8sAppsApi) return k8sAppsApi;
try {
const kc = new k8s.KubeConfig();
// Try in-cluster config first (when running as a pod)
// Falls back to default kubeconfig (~/.kube/config) for local dev
try {
kc.loadFromCluster();
} catch {
kc.loadFromDefault();
}
k8sAppsApi = kc.makeApiClient(k8s.AppsV1Api);
return k8sAppsApi;
} catch (err: any) {
console.warn('[Workers] K8s client not available:', err.message);
return null;
}
}
// ============================================================
// K8S SCALING ROUTES (added 2024-12-10)
// Per TASK_WORKFLOW_2024-12-10.md: Admin can scale workers from UI
// ============================================================
/**
* GET /api/workers/k8s/replicas - Get current worker replica count
* Returns current and desired replica counts from the StatefulSet
*/
router.get('/k8s/replicas', async (_req: Request, res: Response) => {
const client = getK8sClient();
if (!client) {
return res.status(503).json({
success: false,
error: 'K8s client not available (not running in cluster or no kubeconfig)',
replicas: null,
});
}
try {
const response = await client.readNamespacedStatefulSet({
name: K8S_STATEFULSET_NAME,
namespace: K8S_NAMESPACE,
});
const statefulSet = response;
res.json({
success: true,
replicas: {
current: statefulSet.status?.readyReplicas || 0,
desired: statefulSet.spec?.replicas || 0,
available: statefulSet.status?.availableReplicas || 0,
updated: statefulSet.status?.updatedReplicas || 0,
},
statefulset: K8S_STATEFULSET_NAME,
namespace: K8S_NAMESPACE,
});
} catch (err: any) {
console.error('[Workers] K8s replicas error:', err.body?.message || err.message);
res.status(500).json({
success: false,
error: err.body?.message || err.message,
});
}
});
/**
* POST /api/workers/k8s/scale - Scale worker replicas
 * Body: { replicas: number } - desired replica count (0-20)
*/
router.post('/k8s/scale', async (req: Request, res: Response) => {
const client = getK8sClient();
if (!client) {
return res.status(503).json({
success: false,
error: 'K8s client not available (not running in cluster or no kubeconfig)',
});
}
const { replicas } = req.body;
// Validate replica count
if (typeof replicas !== 'number' || replicas < 0 || replicas > 20) {
return res.status(400).json({
success: false,
error: 'replicas must be a number between 0 and 20',
});
}
try {
// Get current state first
const currentResponse = await client.readNamespacedStatefulSetScale({
name: K8S_STATEFULSET_NAME,
namespace: K8S_NAMESPACE,
});
const currentReplicas = currentResponse.spec?.replicas || 0;
// Update scale using replaceNamespacedStatefulSetScale
await client.replaceNamespacedStatefulSetScale({
name: K8S_STATEFULSET_NAME,
namespace: K8S_NAMESPACE,
body: {
apiVersion: 'autoscaling/v1',
kind: 'Scale',
metadata: {
name: K8S_STATEFULSET_NAME,
namespace: K8S_NAMESPACE,
},
spec: {
replicas: replicas,
},
},
});
console.log(`[Workers] Scaled ${K8S_STATEFULSET_NAME} from ${currentReplicas} to ${replicas} replicas`);
res.json({
success: true,
message: `Scaled from ${currentReplicas} to ${replicas} replicas`,
previous: currentReplicas,
desired: replicas,
statefulset: K8S_STATEFULSET_NAME,
namespace: K8S_NAMESPACE,
});
} catch (err: any) {
console.error('[Workers] K8s scale error:', err.body?.message || err.message);
res.status(500).json({
success: false,
error: err.body?.message || err.message,
});
}
});
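The scale route validates `replicas` server-side; a client can mirror the same 0-20 bounds check before issuing the request. This is a sketch of what the admin UI might send; `buildScaleRequest` and the request shape are illustrative (and the guard is slightly stricter than the route, which accepts any number in range, not just integers):

```typescript
// Mirror of the server-side bounds check in POST /api/workers/k8s/scale.
function isValidReplicaCount(value: unknown): value is number {
  return typeof value === 'number' && Number.isInteger(value) && value >= 0 && value <= 20;
}

interface ScaleRequest {
  url: string;
  method: 'POST';
  headers: Record<string, string>;
  body: string;
}

// Build the request an admin UI could pass to fetch() (illustrative).
function buildScaleRequest(replicas: number): ScaleRequest {
  if (!isValidReplicaCount(replicas)) {
    throw new Error('replicas must be an integer between 0 and 20');
  }
  return {
    url: '/api/workers/k8s/scale',
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ replicas }),
  };
}
```

Pre-validating in the client avoids a round trip for the obvious 400 case while leaving the server check as the source of truth.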
// ============================================================
// STATIC ROUTES (must come before parameterized routes)
// ============================================================

View File

@@ -16,10 +16,11 @@ import {
executeGraphQL,
startSession,
endSession,
getFingerprint,
setCrawlRotator,
GRAPHQL_HASHES,
DUTCHIE_CONFIG,
} from '../platforms/dutchie';
import { CrawlRotator } from '../services/crawl-rotator';
dotenv.config();
@@ -108,19 +109,27 @@ async function main() {
// ============================================================
// STEP 2: Start stealth session
// Per workflow-12102025.md: Initialize CrawlRotator and start session with menuUrl
// ============================================================
console.log('┌─────────────────────────────────────────────────────────────┐');
console.log('│ STEP 2: Start Stealth Session │');
console.log('└─────────────────────────────────────────────────────────────┘');
// Use Arizona timezone for this store
const session = startSession(disp.state || 'AZ', 'America/Phoenix');
// Per workflow-12102025.md: Initialize CrawlRotator (required for sessions)
const rotator = new CrawlRotator();
setCrawlRotator(rotator);
const fp = getFingerprint();
// Per workflow-12102025.md: startSession takes menuUrl for dynamic Referer
const session = startSession(disp.menu_url);
const fp = session.fingerprint;
console.log(` Session ID: ${session.sessionId}`);
console.log(` Browser: ${fp.browserName} (${fp.deviceCategory})`);
console.log(` User-Agent: ${fp.userAgent.slice(0, 60)}...`);
console.log(` Accept-Language: ${fp.acceptLanguage}`);
console.log(` Sec-CH-UA: ${fp.secChUa || '(not set)'}`);
console.log(` Referer: ${session.referer}`);
console.log(` DNT: ${fp.httpFingerprint.hasDNT ? 'enabled' : 'disabled'}`);
console.log(` TLS: ${fp.httpFingerprint.curlImpersonateBinary}`);
console.log('');
// ============================================================

View File

@@ -1,10 +1,10 @@
/**
* Test script for stealth session management
*
* Tests:
* 1. Per-session fingerprint rotation
* 2. Geographic consistency (timezone → Accept-Language)
* 3. Proxy location loading from database
* Per workflow-12102025.md:
* - Tests HTTP fingerprinting (browser-specific headers + ordering)
* - Tests UA generation (device distribution, browser filtering)
* - Tests dynamic Referer per dispensary
*
* Usage:
* npx tsx src/scripts/test-stealth-session.ts
@@ -14,104 +14,142 @@ import {
startSession,
endSession,
getCurrentSession,
getFingerprint,
getRandomFingerprint,
getLocaleForTimezone,
buildHeaders,
setCrawlRotator,
} from '../platforms/dutchie';
import { CrawlRotator } from '../services/crawl-rotator';
import {
generateHTTPFingerprint,
buildRefererFromMenuUrl,
BrowserType,
} from '../services/http-fingerprint';
console.log('='.repeat(60));
console.log('STEALTH SESSION TEST');
console.log('STEALTH SESSION TEST (per workflow-12102025.md)');
console.log('='.repeat(60));
// Test 1: Timezone to Locale mapping
console.log('\n[Test 1] Timezone to Locale Mapping:');
const testTimezones = [
'America/Phoenix',
'America/Los_Angeles',
'America/New_York',
'America/Chicago',
// Initialize CrawlRotator (required for sessions)
console.log('\n[Setup] Initializing CrawlRotator...');
const rotator = new CrawlRotator();
setCrawlRotator(rotator);
console.log(' CrawlRotator initialized');
// Test 1: HTTP Fingerprint Generation
console.log('\n[Test 1] HTTP Fingerprint Generation:');
const browsers: BrowserType[] = ['Chrome', 'Firefox', 'Safari', 'Edge'];
for (const browser of browsers) {
const httpFp = generateHTTPFingerprint(browser);
console.log(` ${browser}:`);
console.log(` TLS binary: ${httpFp.curlImpersonateBinary}`);
console.log(` DNT: ${httpFp.hasDNT ? 'enabled' : 'disabled'}`);
console.log(` Header order: ${httpFp.headerOrder.slice(0, 5).join(', ')}...`);
}
// Test 2: Dynamic Referer from menu URLs
console.log('\n[Test 2] Dynamic Referer from Menu URLs:');
const testUrls = [
'https://dutchie.com/embedded-menu/harvest-of-tempe',
'https://dutchie.com/dispensary/zen-leaf-mesa',
'/embedded-menu/deeply-rooted',
'/dispensary/curaleaf-phoenix',
null,
undefined,
'Invalid/Timezone',
];
for (const tz of testTimezones) {
const locale = getLocaleForTimezone(tz);
console.log(` ${tz || '(undefined)'}${locale}`);
for (const url of testUrls) {
const referer = buildRefererFromMenuUrl(url);
console.log(` ${url || '(null/undefined)'}`);
console.log(`${referer}`);
}
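Test 2 above exercises `buildRefererFromMenuUrl` with absolute URLs, relative paths, and null/undefined. A sketch of that normalization logic follows; this is an assumed re-implementation (the real function lives in `http-fingerprint.ts` and may behave differently, e.g. in its fallback for missing URLs), so it is named distinctly:

```typescript
// Illustrative menu-URL → Referer normalizer (assumption: relative
// paths resolve against https://dutchie.com, and a missing URL falls
// back to the bare origin).
function refererFromMenuUrl(menuUrl: string | null | undefined): string {
  const ORIGIN = 'https://dutchie.com';
  if (!menuUrl) return ORIGIN;
  if (menuUrl.startsWith('http://') || menuUrl.startsWith('https://')) {
    return menuUrl; // already absolute
  }
  return menuUrl.startsWith('/') ? ORIGIN + menuUrl : `${ORIGIN}/${menuUrl}`;
}
```

The point of the dynamic Referer is that each request looks like it originated from that dispensary's own menu page rather than a fixed, fleet-wide value.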
// Test 2: Random fingerprint selection
console.log('\n[Test 2] Random Fingerprint Selection (5 samples):');
for (let i = 0; i < 5; i++) {
const fp = getRandomFingerprint();
console.log(` ${i + 1}. ${fp.userAgent.slice(0, 60)}...`);
}
// Test 3: Session with Dynamic Referer
console.log('\n[Test 3] Session with Dynamic Referer:');
const testMenuUrl = 'https://dutchie.com/dispensary/harvest-of-tempe';
console.log(` Starting session with menuUrl: ${testMenuUrl}`);
// Test 3: Session Management
console.log('\n[Test 3] Session Management:');
// Before session - should use default fingerprint
console.log(' Before session:');
const beforeFp = getFingerprint();
console.log(` getFingerprint(): ${beforeFp.userAgent.slice(0, 50)}...`);
console.log(` getCurrentSession(): ${getCurrentSession()}`);
// Start session with Arizona timezone
console.log('\n Starting session (AZ, America/Phoenix):');
const session1 = startSession('AZ', 'America/Phoenix');
const session1 = startSession(testMenuUrl);
console.log(` Session ID: ${session1.sessionId}`);
console.log(` Fingerprint UA: ${session1.fingerprint.userAgent.slice(0, 50)}...`);
console.log(` Accept-Language: ${session1.fingerprint.acceptLanguage}`);
console.log(` Timezone: ${session1.timezone}`);
console.log(` Browser: ${session1.fingerprint.browserName}`);
console.log(` Device: ${session1.fingerprint.deviceCategory}`);
console.log(` Referer: ${session1.referer}`);
console.log(` DNT: ${session1.fingerprint.httpFingerprint.hasDNT ? 'enabled' : 'disabled'}`);
console.log(` TLS: ${session1.fingerprint.httpFingerprint.curlImpersonateBinary}`);
// During session - should use session fingerprint
console.log('\n During session:');
const duringFp = getFingerprint();
console.log(` getFingerprint(): ${duringFp.userAgent.slice(0, 50)}...`);
console.log(` Same as session? ${duringFp.userAgent === session1.fingerprint.userAgent}`);
// Test 4: Build Headers (browser-specific order)
console.log('\n[Test 4] Build Headers (browser-specific order):');
const { headers, orderedHeaders } = buildHeaders(true, 1000);
console.log(` Headers built for ${session1.fingerprint.browserName}:`);
console.log(` Order: ${orderedHeaders.join(' → ')}`);
console.log(` Sample headers:`);
console.log(` User-Agent: ${headers['User-Agent']?.slice(0, 50)}...`);
console.log(` Accept: ${headers['Accept']}`);
console.log(` Accept-Language: ${headers['Accept-Language']}`);
console.log(` Referer: ${headers['Referer']}`);
if (headers['sec-ch-ua']) {
console.log(` sec-ch-ua: ${headers['sec-ch-ua']}`);
}
if (headers['DNT']) {
console.log(` DNT: ${headers['DNT']}`);
}
// Test buildHeaders with session
console.log('\n buildHeaders() during session:');
const headers = buildHeaders('/embedded-menu/test-store');
console.log(` User-Agent: ${headers['user-agent'].slice(0, 50)}...`);
console.log(` Accept-Language: ${headers['accept-language']}`);
console.log(` Origin: ${headers['origin']}`);
console.log(` Referer: ${headers['referer']}`);
// End session
console.log('\n Ending session:');
endSession();
console.log(` getCurrentSession(): ${getCurrentSession()}`);
// Test 4: Multiple sessions should have different fingerprints
console.log('\n[Test 4] Multiple Sessions (fingerprint variety):');
const fingerprints: string[] = [];
// Test 5: Multiple Sessions (UA variety)
console.log('\n[Test 5] Multiple Sessions (UA & fingerprint variety):');
const sessions: {
browser: string;
device: string;
hasDNT: boolean;
}[] = [];
for (let i = 0; i < 10; i++) {
const session = startSession('CA', 'America/Los_Angeles');
fingerprints.push(session.fingerprint.userAgent);
const session = startSession(`/dispensary/store-${i}`);
sessions.push({
browser: session.fingerprint.browserName,
device: session.fingerprint.deviceCategory,
hasDNT: session.fingerprint.httpFingerprint.hasDNT,
});
endSession();
}
const uniqueCount = new Set(fingerprints).size;
console.log(` 10 sessions created, ${uniqueCount} unique fingerprints`);
console.log(` Variety: ${uniqueCount >= 3 ? '✅ Good' : '⚠️ Low - may need more fingerprint options'}`);
// Count distribution
const browserCounts: Record<string, number> = {};
const deviceCounts: Record<string, number> = {};
let dntCount = 0;
// Test 5: Geographic consistency check
console.log('\n[Test 5] Geographic Consistency:');
const geoTests = [
{ state: 'AZ', tz: 'America/Phoenix' },
{ state: 'CA', tz: 'America/Los_Angeles' },
{ state: 'NY', tz: 'America/New_York' },
{ state: 'IL', tz: 'America/Chicago' },
];
for (const s of sessions) {
browserCounts[s.browser] = (browserCounts[s.browser] || 0) + 1;
deviceCounts[s.device] = (deviceCounts[s.device] || 0) + 1;
if (s.hasDNT) dntCount++;
}
for (const { state, tz } of geoTests) {
const session = startSession(state, tz);
const consistent = session.fingerprint.acceptLanguage.includes('en-US');
console.log(` ${state} (${tz}): Accept-Language=${session.fingerprint.acceptLanguage} ${consistent ? '✅' : '❌'}`);
console.log(` 10 sessions created:`);
console.log(` Browsers: ${JSON.stringify(browserCounts)}`);
console.log(` Devices: ${JSON.stringify(deviceCounts)}`);
console.log(` DNT enabled: ${dntCount}/10 (expected ~30%)`);
// Test 6: Device distribution check (per workflow-12102025.md: 62/36/2)
console.log('\n[Test 6] Device Distribution (larger sample):');
const deviceSamples: string[] = [];
for (let i = 0; i < 100; i++) {
const session = startSession();
deviceSamples.push(session.fingerprint.deviceCategory);
endSession();
}
const mobileCount = deviceSamples.filter(d => d === 'mobile').length;
const desktopCount = deviceSamples.filter(d => d === 'desktop').length;
const tabletCount = deviceSamples.filter(d => d === 'tablet').length;
console.log(` 100 sessions (expected: 62% mobile, 36% desktop, 2% tablet):`);
console.log(` Mobile: ${mobileCount}%`);
console.log(` Desktop: ${desktopCount}%`);
console.log(` Tablet: ${tabletCount}%`);
console.log(` Distribution: ${Math.abs(mobileCount - 62) < 15 && Math.abs(desktopCount - 36) < 15 ? '✅ Reasonable' : '⚠️ Off target'}`);
console.log('\n' + '='.repeat(60));
console.log('TEST COMPLETE');
console.log('='.repeat(60));
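The 62/36/2 device split that Test 6 checks comes from weighted random selection. A minimal sketch of that technique follows; the weights are taken from the test's stated targets, and the real generator in the UA module may implement the draw differently:

```typescript
type DeviceCategory = 'mobile' | 'desktop' | 'tablet';

// Target distribution per workflow-12102025.md: 62% mobile,
// 36% desktop, 2% tablet.
const DEVICE_WEIGHTS: [DeviceCategory, number][] = [
  ['mobile', 0.62],
  ['desktop', 0.36],
  ['tablet', 0.02],
];

// Weighted pick: subtract each bucket's weight from a uniform draw
// until it goes negative; the injectable rand makes this testable.
function pickDeviceCategory(rand: () => number = Math.random): DeviceCategory {
  let r = rand();
  for (const [device, weight] of DEVICE_WEIGHTS) {
    r -= weight;
    if (r < 0) return device;
  }
  return 'tablet'; // floating-point remainder falls into the last bucket
}
```

With a uniform `rand`, each category is chosen with probability equal to its weight, which is why the test tolerates roughly ±15 points over only 100 samples.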

View File

@@ -43,14 +43,14 @@ export class CategoryAnalyticsService {
// Get current category metrics
const currentResult = await this.pool.query(`
SELECT
sp.category,
sp.category_raw,
COUNT(*) AS sku_count,
COUNT(DISTINCT sp.dispensary_id) AS dispensary_count,
AVG(sp.price_rec) AS avg_price
FROM store_products sp
WHERE sp.category = $1
WHERE sp.category_raw = $1
AND sp.is_in_stock = TRUE
GROUP BY sp.category
GROUP BY sp.category_raw
`, [category]);
if (currentResult.rows.length === 0) {
@@ -70,7 +70,7 @@ export class CategoryAnalyticsService {
COUNT(DISTINCT sps.dispensary_id) AS dispensary_count,
AVG(sps.price_rec) AS avg_price
FROM store_product_snapshots sps
WHERE sps.category = $1
WHERE sps.category_raw = $1
AND sps.captured_at >= $2
AND sps.captured_at <= $3
AND sps.is_in_stock = TRUE
@@ -111,8 +111,9 @@ export class CategoryAnalyticsService {
COUNT(DISTINCT sp.dispensary_id) AS dispensary_count,
AVG(sp.price_rec) AS avg_price
FROM store_products sp
JOIN states s ON s.id = sp.state_id
WHERE sp.category = $1
JOIN dispensaries d ON d.id = sp.dispensary_id
JOIN states s ON s.id = d.state_id
WHERE sp.category_raw = $1
AND sp.is_in_stock = TRUE
GROUP BY s.code, s.name, s.recreational_legal
ORDER BY sku_count DESC
@@ -154,24 +155,25 @@ export class CategoryAnalyticsService {
const result = await this.pool.query(`
SELECT
sp.category,
sp.category_raw,
COUNT(*) AS sku_count,
COUNT(DISTINCT sp.dispensary_id) AS dispensary_count,
COUNT(DISTINCT sp.brand_name) AS brand_count,
COUNT(DISTINCT sp.brand_name_raw) AS brand_count,
AVG(sp.price_rec) AS avg_price,
COUNT(DISTINCT s.code) AS state_count
FROM store_products sp
LEFT JOIN states s ON s.id = sp.state_id
WHERE sp.category IS NOT NULL
LEFT JOIN dispensaries d ON d.id = sp.dispensary_id
JOIN states s ON s.id = d.state_id
WHERE sp.category_raw IS NOT NULL
AND sp.is_in_stock = TRUE
${stateFilter}
GROUP BY sp.category
GROUP BY sp.category_raw
ORDER BY sku_count DESC
LIMIT $1
`, params);
return result.rows.map((row: any) => ({
category: row.category,
category: row.category_raw,
sku_count: parseInt(row.sku_count),
dispensary_count: parseInt(row.dispensary_count),
brand_count: parseInt(row.brand_count),
@@ -188,14 +190,14 @@ export class CategoryAnalyticsService {
let categoryFilter = '';
if (category) {
categoryFilter = 'WHERE sp.category = $1';
categoryFilter = 'WHERE sp.category_raw = $1';
params.push(category);
}
const result = await this.pool.query(`
WITH category_stats AS (
SELECT
sp.category,
sp.category_raw,
CASE WHEN s.recreational_legal = TRUE THEN 'recreational' ELSE 'medical_only' END AS legal_type,
COUNT(DISTINCT s.code) AS state_count,
COUNT(DISTINCT sp.dispensary_id) AS dispensary_count,
@@ -203,13 +205,14 @@ export class CategoryAnalyticsService {
AVG(sp.price_rec) AS avg_price,
PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY sp.price_rec) AS median_price
FROM store_products sp
JOIN states s ON s.id = sp.state_id
JOIN dispensaries d ON d.id = sp.dispensary_id
JOIN states s ON s.id = d.state_id
${categoryFilter}
${category ? 'AND' : 'WHERE'} sp.category IS NOT NULL
${category ? 'AND' : 'WHERE'} sp.category_raw IS NOT NULL
AND sp.is_in_stock = TRUE
AND sp.price_rec IS NOT NULL
AND (s.recreational_legal = TRUE OR s.medical_legal = TRUE)
GROUP BY sp.category, CASE WHEN s.recreational_legal = TRUE THEN 'recreational' ELSE 'medical_only' END
GROUP BY sp.category_raw, CASE WHEN s.recreational_legal = TRUE THEN 'recreational' ELSE 'medical_only' END
),
rec_stats AS (
SELECT * FROM category_stats WHERE legal_type = 'recreational'
@@ -218,7 +221,7 @@ export class CategoryAnalyticsService {
SELECT * FROM category_stats WHERE legal_type = 'medical_only'
)
SELECT
COALESCE(r.category, m.category) AS category,
COALESCE(r.category_raw, m.category_raw) AS category,
r.state_count AS rec_state_count,
r.dispensary_count AS rec_dispensary_count,
r.sku_count AS rec_sku_count,
@@ -235,7 +238,7 @@ export class CategoryAnalyticsService {
ELSE NULL
END AS price_diff_percent
FROM rec_stats r
FULL OUTER JOIN med_stats m ON r.category = m.category
FULL OUTER JOIN med_stats m ON r.category_raw = m.category_raw
ORDER BY COALESCE(r.sku_count, 0) + COALESCE(m.sku_count, 0) DESC
`, params);
@@ -282,7 +285,7 @@ export class CategoryAnalyticsService {
COUNT(*) AS sku_count,
COUNT(DISTINCT sps.dispensary_id) AS dispensary_count
FROM store_product_snapshots sps
WHERE sps.category = $1
WHERE sps.category_raw = $1
AND sps.captured_at >= $2
AND sps.captured_at <= $3
AND sps.is_in_stock = TRUE
@@ -335,31 +338,33 @@ export class CategoryAnalyticsService {
WITH category_total AS (
SELECT COUNT(*) AS total
FROM store_products sp
LEFT JOIN states s ON s.id = sp.state_id
WHERE sp.category = $1
LEFT JOIN dispensaries d ON d.id = sp.dispensary_id
JOIN states s ON s.id = d.state_id
WHERE sp.category_raw = $1
AND sp.is_in_stock = TRUE
AND sp.brand_name IS NOT NULL
AND sp.brand_name_raw IS NOT NULL
${stateFilter}
)
SELECT
sp.brand_name,
sp.brand_name_raw,
COUNT(*) AS sku_count,
COUNT(DISTINCT sp.dispensary_id) AS dispensary_count,
AVG(sp.price_rec) AS avg_price,
ROUND(COUNT(*)::NUMERIC * 100 / NULLIF((SELECT total FROM category_total), 0), 2) AS category_share_percent
FROM store_products sp
LEFT JOIN states s ON s.id = sp.state_id
WHERE sp.category = $1
LEFT JOIN dispensaries d ON d.id = sp.dispensary_id
JOIN states s ON s.id = d.state_id
WHERE sp.category_raw = $1
AND sp.is_in_stock = TRUE
AND sp.brand_name IS NOT NULL
AND sp.brand_name_raw IS NOT NULL
${stateFilter}
GROUP BY sp.brand_name
GROUP BY sp.brand_name_raw
ORDER BY sku_count DESC
LIMIT $2
`, params);
return result.rows.map((row: any) => ({
brand_name: row.brand_name,
brand_name: row.brand_name_raw,
sku_count: parseInt(row.sku_count),
dispensary_count: parseInt(row.dispensary_count),
avg_price: row.avg_price ? parseFloat(row.avg_price) : null,
@@ -421,7 +426,7 @@ export class CategoryAnalyticsService {
`, [start, end, limit]);
return result.rows.map((row: any) => ({
category: row.category,
category: row.category_raw,
start_sku_count: parseInt(row.start_sku_count),
end_sku_count: parseInt(row.end_sku_count),
growth: parseInt(row.growth),

View File

@@ -43,9 +43,9 @@ export class PriceAnalyticsService {
const productResult = await this.pool.query(`
SELECT
sp.id,
sp.name,
sp.brand_name,
sp.category,
sp.name_raw,
sp.brand_name_raw,
sp.category_raw,
sp.dispensary_id,
sp.price_rec,
sp.price_med,
@@ -53,7 +53,7 @@ export class PriceAnalyticsService {
s.code AS state_code
FROM store_products sp
JOIN dispensaries d ON d.id = sp.dispensary_id
LEFT JOIN states s ON s.id = sp.state_id
JOIN states s ON s.id = d.state_id
WHERE sp.id = $1
`, [storeProductId]);
@@ -133,7 +133,7 @@ export class PriceAnalyticsService {
const result = await this.pool.query(`
SELECT
sp.category,
sp.category_raw,
s.code AS state_code,
s.name AS state_name,
CASE
@@ -148,18 +148,18 @@ export class PriceAnalyticsService {
COUNT(DISTINCT sp.dispensary_id) AS dispensary_count
FROM store_products sp
JOIN dispensaries d ON d.id = sp.dispensary_id
JOIN states s ON s.id = sp.state_id
WHERE sp.category = $1
JOIN states s ON s.id = d.state_id
WHERE sp.category_raw = $1
AND sp.price_rec IS NOT NULL
AND sp.is_in_stock = TRUE
AND (s.recreational_legal = TRUE OR s.medical_legal = TRUE)
${stateFilter}
GROUP BY sp.category, s.code, s.name, s.recreational_legal
GROUP BY sp.category_raw, s.code, s.name, s.recreational_legal
ORDER BY state_code
`, params);
return result.rows.map((row: any) => ({
category: row.category,
category: row.category_raw,
state_code: row.state_code,
state_name: row.state_name,
legal_type: row.legal_type,
@@ -189,7 +189,7 @@ export class PriceAnalyticsService {
const result = await this.pool.query(`
SELECT
sp.brand_name AS category,
sp.brand_name_raw AS category,
s.code AS state_code,
s.name AS state_name,
CASE
@@ -204,18 +204,18 @@ export class PriceAnalyticsService {
COUNT(DISTINCT sp.dispensary_id) AS dispensary_count
FROM store_products sp
JOIN dispensaries d ON d.id = sp.dispensary_id
JOIN states s ON s.id = sp.state_id
WHERE sp.brand_name = $1
JOIN states s ON s.id = d.state_id
WHERE sp.brand_name_raw = $1
AND sp.price_rec IS NOT NULL
AND sp.is_in_stock = TRUE
AND (s.recreational_legal = TRUE OR s.medical_legal = TRUE)
${stateFilter}
GROUP BY sp.brand_name, s.code, s.name, s.recreational_legal
GROUP BY sp.brand_name_raw, s.code, s.name, s.recreational_legal
ORDER BY state_code
`, params);
return result.rows.map((row: any) => ({
category: row.category,
category: row.category_raw,
state_code: row.state_code,
state_name: row.state_name,
legal_type: row.legal_type,
@@ -254,7 +254,7 @@ export class PriceAnalyticsService {
}
if (category) {
filters += ` AND sp.category = $${paramIdx}`;
filters += ` AND sp.category_raw = $${paramIdx}`;
params.push(category);
paramIdx++;
}
@@ -288,15 +288,16 @@ export class PriceAnalyticsService {
)
SELECT
v.store_product_id,
sp.name AS product_name,
sp.brand_name,
sp.name_raw AS product_name,
sp.brand_name_raw,
v.change_count,
v.avg_change_pct,
v.max_change_pct,
v.last_change_at
FROM volatility v
JOIN store_products sp ON sp.id = v.store_product_id
LEFT JOIN states s ON s.id = sp.state_id
LEFT JOIN dispensaries d ON d.id = sp.dispensary_id
JOIN states s ON s.id = d.state_id
WHERE 1=1 ${filters}
ORDER BY v.change_count DESC, v.avg_change_pct DESC
LIMIT $3
@@ -305,7 +306,7 @@ export class PriceAnalyticsService {
return result.rows.map((row: any) => ({
store_product_id: row.store_product_id,
product_name: row.product_name,
brand_name: row.brand_name,
brand_name: row.brand_name_raw,
change_count: parseInt(row.change_count),
avg_change_percent: row.avg_change_pct ? parseFloat(row.avg_change_pct) : 0,
max_change_percent: row.max_change_pct ? parseFloat(row.max_change_pct) : 0,
@@ -327,13 +328,13 @@ export class PriceAnalyticsService {
let categoryFilter = '';
if (category) {
categoryFilter = 'WHERE sp.category = $1';
categoryFilter = 'WHERE sp.category_raw = $1';
params.push(category);
}
const result = await this.pool.query(`
SELECT
sp.category,
sp.category_raw,
AVG(sp.price_rec) FILTER (WHERE s.recreational_legal = TRUE) AS rec_avg,
PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY sp.price_rec)
FILTER (WHERE s.recreational_legal = TRUE) AS rec_median,
@@ -343,17 +344,18 @@ export class PriceAnalyticsService {
PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY sp.price_rec)
FILTER (WHERE s.medical_legal = TRUE AND (s.recreational_legal = FALSE OR s.recreational_legal IS NULL)) AS med_median
FROM store_products sp
JOIN states s ON s.id = sp.state_id
JOIN dispensaries d ON d.id = sp.dispensary_id
JOIN states s ON s.id = d.state_id
${categoryFilter}
${category ? 'AND' : 'WHERE'} sp.price_rec IS NOT NULL
AND sp.is_in_stock = TRUE
AND sp.category IS NOT NULL
GROUP BY sp.category
ORDER BY sp.category
AND sp.category_raw IS NOT NULL
GROUP BY sp.category_raw
ORDER BY sp.category_raw
`, params);
return result.rows.map((row: any) => ({
category: row.category,
category: row.category_raw,
rec_avg: row.rec_avg ? parseFloat(row.rec_avg) : null,
rec_median: row.rec_median ? parseFloat(row.rec_median) : null,
med_avg: row.med_avg ? parseFloat(row.med_avg) : null,


@@ -108,14 +108,14 @@ export class StateAnalyticsService {
SELECT
COUNT(DISTINCT d.id) AS dispensary_count,
COUNT(DISTINCT sp.id) AS product_count,
COUNT(DISTINCT sp.brand_name) FILTER (WHERE sp.brand_name IS NOT NULL) AS brand_count,
COUNT(DISTINCT sp.category) FILTER (WHERE sp.category IS NOT NULL) AS category_count,
COUNT(DISTINCT sp.brand_name_raw) FILTER (WHERE sp.brand_name_raw IS NOT NULL) AS brand_count,
COUNT(DISTINCT sp.category_raw) FILTER (WHERE sp.category_raw IS NOT NULL) AS category_count,
COUNT(sps.id) AS snapshot_count,
MAX(sps.captured_at) AS last_crawl_at
FROM states s
LEFT JOIN dispensaries d ON d.state_id = s.id
LEFT JOIN store_products sp ON sp.state_id = s.id AND sp.is_in_stock = TRUE
LEFT JOIN store_product_snapshots sps ON sps.state_id = s.id
LEFT JOIN store_products sp ON sp.dispensary_id = d.id AND sp.is_in_stock = TRUE
LEFT JOIN store_product_snapshots sps ON sps.dispensary_id = d.id
WHERE s.code = $1
`, [stateCode]);
@@ -129,7 +129,8 @@ export class StateAnalyticsService {
MIN(price_rec) AS min_price,
MAX(price_rec) AS max_price
FROM store_products sp
JOIN states s ON s.id = sp.state_id
JOIN dispensaries d ON d.id = sp.dispensary_id
JOIN states s ON s.id = d.state_id
WHERE s.code = $1
AND sp.price_rec IS NOT NULL
AND sp.is_in_stock = TRUE
@@ -140,14 +141,15 @@ export class StateAnalyticsService {
// Get top categories
const topCategoriesResult = await this.pool.query(`
SELECT
sp.category,
sp.category_raw,
COUNT(*) AS count
FROM store_products sp
JOIN states s ON s.id = sp.state_id
JOIN dispensaries d ON d.id = sp.dispensary_id
JOIN states s ON s.id = d.state_id
WHERE s.code = $1
AND sp.category IS NOT NULL
AND sp.category_raw IS NOT NULL
AND sp.is_in_stock = TRUE
GROUP BY sp.category
GROUP BY sp.category_raw
ORDER BY count DESC
LIMIT 10
`, [stateCode]);
@@ -155,14 +157,15 @@ export class StateAnalyticsService {
// Get top brands
const topBrandsResult = await this.pool.query(`
SELECT
sp.brand_name AS brand,
sp.brand_name_raw AS brand,
COUNT(*) AS count
FROM store_products sp
JOIN states s ON s.id = sp.state_id
JOIN dispensaries d ON d.id = sp.dispensary_id
JOIN states s ON s.id = d.state_id
WHERE s.code = $1
AND sp.brand_name IS NOT NULL
AND sp.brand_name_raw IS NOT NULL
AND sp.is_in_stock = TRUE
GROUP BY sp.brand_name
GROUP BY sp.brand_name_raw
ORDER BY count DESC
LIMIT 10
`, [stateCode]);
@@ -191,7 +194,7 @@ export class StateAnalyticsService {
max_price: pricing.max_price ? parseFloat(pricing.max_price) : null,
},
top_categories: topCategoriesResult.rows.map((row: any) => ({
category: row.category,
category: row.category_raw,
count: parseInt(row.count),
})),
top_brands: topBrandsResult.rows.map((row: any) => ({
@@ -215,8 +218,8 @@ export class StateAnalyticsService {
COUNT(sps.id) AS snapshot_count
FROM states s
LEFT JOIN dispensaries d ON d.state_id = s.id
LEFT JOIN store_products sp ON sp.state_id = s.id AND sp.is_in_stock = TRUE
LEFT JOIN store_product_snapshots sps ON sps.state_id = s.id
LEFT JOIN store_products sp ON sp.dispensary_id = d.id AND sp.is_in_stock = TRUE
LEFT JOIN store_product_snapshots sps ON sps.dispensary_id = d.id
WHERE s.recreational_legal = TRUE
GROUP BY s.code, s.name
ORDER BY dispensary_count DESC
@@ -232,8 +235,8 @@ export class StateAnalyticsService {
COUNT(sps.id) AS snapshot_count
FROM states s
LEFT JOIN dispensaries d ON d.state_id = s.id
LEFT JOIN store_products sp ON sp.state_id = s.id AND sp.is_in_stock = TRUE
LEFT JOIN store_product_snapshots sps ON sps.state_id = s.id
LEFT JOIN store_products sp ON sp.dispensary_id = d.id AND sp.is_in_stock = TRUE
LEFT JOIN store_product_snapshots sps ON sps.dispensary_id = d.id
WHERE s.medical_legal = TRUE
AND (s.recreational_legal = FALSE OR s.recreational_legal IS NULL)
GROUP BY s.code, s.name
@@ -295,46 +298,48 @@ export class StateAnalyticsService {
let groupBy = 'NULL';
if (category) {
categoryFilter = 'AND sp.category = $1';
categoryFilter = 'AND sp.category_raw = $1';
params.push(category);
groupBy = 'sp.category';
groupBy = 'sp.category_raw';
} else {
groupBy = 'sp.category';
groupBy = 'sp.category_raw';
}
const result = await this.pool.query(`
WITH rec_prices AS (
SELECT
${category ? 'sp.category' : 'sp.category'},
${category ? 'sp.category_raw' : 'sp.category_raw'},
COUNT(DISTINCT s.code) AS state_count,
COUNT(*) AS product_count,
AVG(sp.price_rec) AS avg_price,
PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY sp.price_rec) AS median_price
FROM store_products sp
JOIN states s ON s.id = sp.state_id
JOIN dispensaries d ON d.id = sp.dispensary_id
JOIN states s ON s.id = d.state_id
WHERE s.recreational_legal = TRUE
AND sp.price_rec IS NOT NULL
AND sp.is_in_stock = TRUE
AND sp.category IS NOT NULL
AND sp.category_raw IS NOT NULL
${categoryFilter}
GROUP BY sp.category
GROUP BY sp.category_raw
),
med_prices AS (
SELECT
${category ? 'sp.category' : 'sp.category'},
${category ? 'sp.category_raw' : 'sp.category_raw'},
COUNT(DISTINCT s.code) AS state_count,
COUNT(*) AS product_count,
AVG(sp.price_rec) AS avg_price,
PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY sp.price_rec) AS median_price
FROM store_products sp
JOIN states s ON s.id = sp.state_id
JOIN dispensaries d ON d.id = sp.dispensary_id
JOIN states s ON s.id = d.state_id
WHERE s.medical_legal = TRUE
AND (s.recreational_legal = FALSE OR s.recreational_legal IS NULL)
AND sp.price_rec IS NOT NULL
AND sp.is_in_stock = TRUE
AND sp.category IS NOT NULL
AND sp.category_raw IS NOT NULL
${categoryFilter}
GROUP BY sp.category
GROUP BY sp.category_raw
)
SELECT
COALESCE(r.category, m.category) AS category,
@@ -357,7 +362,7 @@ export class StateAnalyticsService {
`, params);
return result.rows.map((row: any) => ({
category: row.category,
category: row.category_raw,
recreational: {
state_count: parseInt(row.rec_state_count) || 0,
product_count: parseInt(row.rec_product_count) || 0,
@@ -395,12 +400,12 @@ export class StateAnalyticsService {
COALESCE(s.medical_legal, FALSE) AS medical_legal,
COUNT(DISTINCT d.id) AS dispensary_count,
COUNT(DISTINCT sp.id) AS product_count,
COUNT(DISTINCT sp.brand_name) FILTER (WHERE sp.brand_name IS NOT NULL) AS brand_count,
COUNT(DISTINCT sp.brand_name_raw) FILTER (WHERE sp.brand_name_raw IS NOT NULL) AS brand_count,
MAX(sps.captured_at) AS last_crawl_at
FROM states s
LEFT JOIN dispensaries d ON d.state_id = s.id
LEFT JOIN store_products sp ON sp.state_id = s.id AND sp.is_in_stock = TRUE
LEFT JOIN store_product_snapshots sps ON sps.state_id = s.id
LEFT JOIN store_products sp ON sp.dispensary_id = d.id AND sp.is_in_stock = TRUE
LEFT JOIN store_product_snapshots sps ON sps.dispensary_id = d.id
GROUP BY s.code, s.name, s.recreational_legal, s.medical_legal
ORDER BY dispensary_count DESC, s.name
`);
@@ -451,8 +456,8 @@ export class StateAnalyticsService {
END AS gap_reason
FROM states s
LEFT JOIN dispensaries d ON d.state_id = s.id
LEFT JOIN store_products sp ON sp.state_id = s.id AND sp.is_in_stock = TRUE
LEFT JOIN store_product_snapshots sps ON sps.state_id = s.id
LEFT JOIN store_products sp ON sp.dispensary_id = d.id AND sp.is_in_stock = TRUE
LEFT JOIN store_product_snapshots sps ON sps.dispensary_id = d.id
WHERE s.recreational_legal = TRUE OR s.medical_legal = TRUE
GROUP BY s.code, s.name, s.recreational_legal, s.medical_legal
HAVING COUNT(DISTINCT d.id) = 0
@@ -499,7 +504,8 @@ export class StateAnalyticsService {
PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY sp.price_rec) AS median_price,
COUNT(*) AS product_count
FROM states s
JOIN store_products sp ON sp.state_id = s.id
JOIN dispensaries d ON d.state_id = s.id
JOIN store_products sp ON sp.dispensary_id = d.id
WHERE sp.price_rec IS NOT NULL
AND sp.is_in_stock = TRUE
AND (s.recreational_legal = TRUE OR s.medical_legal = TRUE)


@@ -89,22 +89,22 @@ export class StoreAnalyticsService {
// Get brands added/dropped
const brandsResult = await this.pool.query(`
WITH start_brands AS (
SELECT DISTINCT brand_name
SELECT DISTINCT brand_name_raw
FROM store_product_snapshots
WHERE dispensary_id = $1
AND captured_at >= $2 AND captured_at < $2 + INTERVAL '1 day'
AND brand_name IS NOT NULL
AND captured_at >= $2::timestamp AND captured_at < $2::timestamp + INTERVAL '1 day'
AND brand_name_raw IS NOT NULL
),
end_brands AS (
SELECT DISTINCT brand_name
SELECT DISTINCT brand_name_raw
FROM store_product_snapshots
WHERE dispensary_id = $1
AND captured_at >= $3 - INTERVAL '1 day' AND captured_at <= $3
AND brand_name IS NOT NULL
AND captured_at >= $3::timestamp - INTERVAL '1 day' AND captured_at <= $3::timestamp
AND brand_name_raw IS NOT NULL
)
SELECT
ARRAY(SELECT brand_name FROM end_brands EXCEPT SELECT brand_name FROM start_brands) AS added,
ARRAY(SELECT brand_name FROM start_brands EXCEPT SELECT brand_name FROM end_brands) AS dropped
ARRAY(SELECT brand_name_raw FROM end_brands EXCEPT SELECT brand_name_raw FROM start_brands) AS added,
ARRAY(SELECT brand_name_raw FROM start_brands EXCEPT SELECT brand_name_raw FROM end_brands) AS dropped
`, [dispensaryId, start, end]);
const brands = brandsResult.rows[0] || { added: [], dropped: [] };
@@ -184,9 +184,9 @@ export class StoreAnalyticsService {
-- Products added
SELECT
sp.id AS store_product_id,
sp.name AS product_name,
sp.brand_name,
sp.category,
sp.name_raw AS product_name,
sp.brand_name_raw,
sp.category_raw,
'added' AS event_type,
sp.first_seen_at AS event_date,
NULL::TEXT AS old_value,
@@ -201,9 +201,9 @@ export class StoreAnalyticsService {
-- Stock in/out from snapshots
SELECT
sps.store_product_id,
sp.name AS product_name,
sp.brand_name,
sp.category,
sp.name_raw AS product_name,
sp.brand_name_raw,
sp.category_raw,
CASE
WHEN sps.is_in_stock = TRUE AND LAG(sps.is_in_stock) OVER w = FALSE THEN 'stock_in'
WHEN sps.is_in_stock = FALSE AND LAG(sps.is_in_stock) OVER w = TRUE THEN 'stock_out'
@@ -224,9 +224,9 @@ export class StoreAnalyticsService {
-- Price changes from snapshots
SELECT
sps.store_product_id,
sp.name AS product_name,
sp.brand_name,
sp.category,
sp.name_raw AS product_name,
sp.brand_name_raw,
sp.category_raw,
'price_change' AS event_type,
sps.captured_at AS event_date,
LAG(sps.price_rec::TEXT) OVER w AS old_value,
@@ -250,8 +250,8 @@ export class StoreAnalyticsService {
return result.rows.map((row: any) => ({
store_product_id: row.store_product_id,
product_name: row.product_name,
brand_name: row.brand_name,
category: row.category,
brand_name: row.brand_name_raw,
category: row.category_raw,
event_type: row.event_type,
event_date: row.event_date ? row.event_date.toISOString() : null,
old_value: row.old_value,
@@ -364,8 +364,8 @@ export class StoreAnalyticsService {
changes: result.rows.map((row: any) => ({
store_product_id: row.store_product_id,
product_name: row.product_name,
brand_name: row.brand_name,
category: row.category,
brand_name: row.brand_name_raw,
category: row.category_raw,
old_quantity: row.old_quantity,
new_quantity: row.new_quantity,
quantity_delta: row.qty_delta,
@@ -415,14 +415,14 @@ export class StoreAnalyticsService {
// Get top brands
const brandsResult = await this.pool.query(`
SELECT
brand_name AS brand,
brand_name_raw AS brand,
COUNT(*) AS count,
ROUND(COUNT(*)::NUMERIC * 100 / NULLIF($2, 0), 2) AS percent
FROM store_products
WHERE dispensary_id = $1
AND brand_name IS NOT NULL
AND brand_name_raw IS NOT NULL
AND is_in_stock = TRUE
GROUP BY brand_name
GROUP BY brand_name_raw
ORDER BY count DESC
LIMIT 20
`, [dispensaryId, totalProducts]);
@@ -432,7 +432,7 @@ export class StoreAnalyticsService {
in_stock_count: parseInt(totals.in_stock) || 0,
out_of_stock_count: parseInt(totals.out_of_stock) || 0,
categories: categoriesResult.rows.map((row: any) => ({
category: row.category,
category: row.category_raw,
count: parseInt(row.count),
percent: parseFloat(row.percent) || 0,
})),
@@ -574,23 +574,24 @@ export class StoreAnalyticsService {
),
market_prices AS (
SELECT
sp.category,
sp.category_raw,
AVG(sp.price_rec) AS market_avg
FROM store_products sp
WHERE sp.state_id = $2
JOIN dispensaries d ON d.id = sp.dispensary_id
WHERE d.state_id = $2
AND sp.price_rec IS NOT NULL
AND sp.is_in_stock = TRUE
AND sp.category IS NOT NULL
GROUP BY sp.category
AND sp.category_raw IS NOT NULL
GROUP BY sp.category_raw
)
SELECT
sp.category,
sp.category_raw,
sp.store_avg AS store_avg_price,
mp.market_avg AS market_avg_price,
ROUND(((sp.store_avg - mp.market_avg) / NULLIF(mp.market_avg, 0) * 100)::NUMERIC, 2) AS price_vs_market_percent,
sp.product_count
FROM store_prices sp
LEFT JOIN market_prices mp ON mp.category = sp.category
LEFT JOIN market_prices mp ON mp.category = sp.category_raw
ORDER BY sp.product_count DESC
`, [dispensaryId, dispensary.state_id]);
@@ -602,9 +603,10 @@ export class StoreAnalyticsService {
WHERE dispensary_id = $1 AND price_rec IS NOT NULL AND is_in_stock = TRUE
),
market_avg AS (
SELECT AVG(price_rec) AS avg
FROM store_products
WHERE state_id = $2 AND price_rec IS NOT NULL AND is_in_stock = TRUE
SELECT AVG(sp.price_rec) AS avg
FROM store_products sp
JOIN dispensaries d ON d.id = sp.dispensary_id
WHERE d.state_id = $2 AND sp.price_rec IS NOT NULL AND sp.is_in_stock = TRUE
)
SELECT
ROUND(((sa.avg - ma.avg) / NULLIF(ma.avg, 0) * 100)::NUMERIC, 2) AS price_vs_market
@@ -615,7 +617,7 @@ export class StoreAnalyticsService {
dispensary_id: dispensaryId,
dispensary_name: dispensary.name,
categories: result.rows.map((row: any) => ({
category: row.category,
category: row.category_raw,
store_avg_price: parseFloat(row.store_avg_price),
market_avg_price: row.market_avg_price ? parseFloat(row.market_avg_price) : 0,
price_vs_market_percent: row.price_vs_market_percent ? parseFloat(row.price_vs_market_percent) : 0,


@@ -1,49 +1,53 @@
/**
* Crawl Rotator - Proxy & User Agent Rotation for Crawlers
*
* Manages rotation of proxies and user agents to avoid blocks.
* Used by platform-specific crawlers (Dutchie, Jane, etc.)
* Updated: 2025-12-10 per workflow-12102025.md
*
* KEY BEHAVIORS (per workflow-12102025.md):
* 1. Task determines WHAT work to do, proxy determines SESSION IDENTITY
* 2. Proxy location (timezone) sets Accept-Language headers (always English)
* 3. On 403: immediately get new IP, new fingerprint, retry
* 4. After 3 consecutive 403s on same proxy with different fingerprints → disable proxy
*
* USER-AGENT GENERATION (per workflow-12102025.md):
* - Device distribution: Mobile 62%, Desktop 36%, Tablet 2%
* - Browser whitelist: Chrome, Safari, Edge, Firefox only
* - UA sticks until IP rotates
* - Failure = alert admin + stop crawl (no fallback)
*
* Uses intoli/user-agents for realistic UA generation with daily-updated data.
*
* Canonical location: src/services/crawl-rotator.ts
*/
import { Pool } from 'pg';
import UserAgent from 'user-agents';
import {
HTTPFingerprint,
generateHTTPFingerprint,
BrowserType,
} from './http-fingerprint';
// ============================================================
// USER AGENT CONFIGURATION
// UA CONSTANTS (per workflow-12102025.md)
// ============================================================
/**
* Modern browser user agents (Chrome, Firefox, Safari, Edge on various platforms)
* Updated: 2024
* Per workflow-12102025.md: Device category distribution (hardcoded)
* Mobile: 62%, Desktop: 36%, Tablet: 2%
*/
export const USER_AGENTS = [
// Chrome on Windows
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/119.0.0.0 Safari/537.36',
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/118.0.0.0 Safari/537.36',
const DEVICE_WEIGHTS = {
mobile: 62,
desktop: 36,
tablet: 2,
} as const;
// Chrome on macOS
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/119.0.0.0 Safari/537.36',
// Firefox on Windows
'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:121.0) Gecko/20100101 Firefox/121.0',
'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:120.0) Gecko/20100101 Firefox/120.0',
// Firefox on macOS
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:121.0) Gecko/20100101 Firefox/121.0',
// Safari on macOS
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.2 Safari/605.1.15',
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.1 Safari/605.1.15',
// Edge on Windows
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36 Edg/120.0.0.0',
// Chrome on Linux
'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
];
/**
* Per workflow-12102025.md: Browser whitelist
* Only Chrome (67%), Safari (20%), Edge (6%), Firefox (3%)
* Samsung Internet, Opera, and other niche browsers are filtered out
*/
const ALLOWED_BROWSERS = ['Chrome', 'Safari', 'Edge', 'Firefox'] as const;
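The device-category roll implied by `DEVICE_WEIGHTS` maps a uniform draw in [0, 100) onto cumulative thresholds. A pure, deterministic sketch of that mapping (the rotator itself draws the roll from `Math.random() * 100`):

```typescript
// Illustrative: cumulative thresholds for the 62/36/2 split.
// Rolls in [0,62) are mobile, [62,98) desktop, [98,100) tablet.
type Device = 'mobile' | 'desktop' | 'tablet';
function categoryForRoll(roll: number): Device {
  if (roll < 62) return 'mobile';
  if (roll < 62 + 36) return 'desktop';
  return 'tablet';
}
```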
// ============================================================
// PROXY TYPES
@@ -61,8 +65,13 @@ export interface Proxy {
failureCount: number;
successCount: number;
avgResponseTimeMs: number | null;
maxConnections: number; // Number of concurrent connections allowed (for rotating proxies)
// Location info (if known)
maxConnections: number;
/**
* Per workflow-12102025.md: Track consecutive 403s with different fingerprints.
* After 3 consecutive 403s → disable proxy (it's burned).
*/
consecutive403Count: number;
// Location info - determines session headers per workflow-12102025.md
city?: string;
state?: string;
country?: string;
@@ -77,6 +86,40 @@ export interface ProxyStats {
avgSuccessRate: number;
}
// ============================================================
// FINGERPRINT TYPE
// Per workflow-12102025.md: Full browser fingerprint from user-agents
// ============================================================
export interface BrowserFingerprint {
userAgent: string;
platform: string;
screenWidth: number;
screenHeight: number;
viewportWidth: number;
viewportHeight: number;
deviceCategory: string;
browserName: string; // Per workflow-12102025.md: for session logging
// Derived headers for anti-detect
acceptLanguage: string;
secChUa?: string;
secChUaPlatform?: string;
secChUaMobile?: string;
// Per workflow-12102025.md: HTTP Fingerprinting section
httpFingerprint: HTTPFingerprint;
}
/**
* Per workflow-12102025.md: Session log entry for debugging blocked sessions
*/
export interface UASessionLog {
deviceCategory: string;
browserName: string;
userAgent: string;
proxyIp: string | null;
sessionStartedAt: Date;
}
// ============================================================
// PROXY ROTATOR CLASS
// ============================================================
@@ -91,9 +134,6 @@ export class ProxyRotator {
this.pool = pool || null;
}
/**
* Initialize with database pool
*/
setPool(pool: Pool): void {
this.pool = pool;
}
@@ -122,6 +162,7 @@ export class ProxyRotator {
0 as "successCount",
response_time_ms as "avgResponseTimeMs",
COALESCE(max_connections, 1) as "maxConnections",
COALESCE(consecutive_403_count, 0) as "consecutive403Count",
city,
state,
country,
@@ -134,11 +175,9 @@ export class ProxyRotator {
this.proxies = result.rows;
// Calculate total concurrent capacity
const totalCapacity = this.proxies.reduce((sum, p) => sum + p.maxConnections, 0);
console.log(`[ProxyRotator] Loaded ${this.proxies.length} active proxies (${totalCapacity} max concurrent connections)`);
} catch (error) {
// Table might not exist - that's okay
console.warn(`[ProxyRotator] Could not load proxies: ${error}`);
this.proxies = [];
}
@@ -150,7 +189,6 @@ export class ProxyRotator {
getNext(): Proxy | null {
if (this.proxies.length === 0) return null;
// Round-robin rotation
this.currentIndex = (this.currentIndex + 1) % this.proxies.length;
this.lastRotation = new Date();
@@ -185,23 +223,68 @@ export class ProxyRotator {
}
/**
* Mark proxy as failed (temporarily remove from rotation)
* Mark proxy as blocked (403 received)
* Per workflow-12102025.md:
* - Increment consecutive_403_count
* - After 3 consecutive 403s with different fingerprints → disable proxy
* - This is separate from general failures (timeouts, etc.)
*/
async markFailed(proxyId: number, error?: string): Promise<void> {
// Update in-memory
async markBlocked(proxyId: number): Promise<boolean> {
const proxy = this.proxies.find(p => p.id === proxyId);
if (proxy) {
proxy.failureCount++;
let shouldDisable = false;
// Deactivate if too many failures
if (proxy.failureCount >= 5) {
if (proxy) {
proxy.consecutive403Count++;
// Per workflow-12102025.md: 3 consecutive 403s → proxy is burned
if (proxy.consecutive403Count >= 3) {
proxy.isActive = false;
this.proxies = this.proxies.filter(p => p.id !== proxyId);
console.log(`[ProxyRotator] Proxy ${proxyId} deactivated after ${proxy.failureCount} failures`);
console.log(`[ProxyRotator] Proxy ${proxyId} DISABLED after ${proxy.consecutive403Count} consecutive 403s (burned)`);
shouldDisable = true;
} else {
console.log(`[ProxyRotator] Proxy ${proxyId} blocked (403 #${proxy.consecutive403Count}/3)`);
}
}
// Update database
if (this.pool) {
try {
await this.pool.query(`
UPDATE proxies
SET
consecutive_403_count = COALESCE(consecutive_403_count, 0) + 1,
last_failure_at = NOW(),
test_result = '403 Forbidden',
active = CASE WHEN COALESCE(consecutive_403_count, 0) >= 2 THEN false ELSE active END,
updated_at = NOW()
WHERE id = $1
`, [proxyId]);
} catch (err) {
console.error(`[ProxyRotator] Failed to update proxy ${proxyId}:`, err);
}
}
return shouldDisable;
}
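`markBlocked` and `markSuccess` together implement a small state machine: three consecutive 403s burn the proxy, and any success resets the counter. A minimal model of just that behavior (class name is illustrative, not from the codebase):

```typescript
// Models the consecutive-403 policy: on403() returns true once the
// proxy crosses the 3-strike threshold and gets disabled.
class Burn403Tracker {
  private count = 0;
  active = true;

  on403(): boolean {
    this.count++;
    if (this.count >= 3) this.active = false;
    return !this.active;
  }

  onSuccess(): void {
    this.count = 0; // a successful request clears the streak
  }
}
```

Note the DB side mirrors this with `>= 2` because the SQL `CASE` inspects the pre-increment value of `consecutive_403_count` in the same `UPDATE`.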
/**
* Mark proxy as failed (general error - timeout, connection error, etc.)
* Separate from 403 blocking per workflow-12102025.md
*/
async markFailed(proxyId: number, error?: string): Promise<void> {
const proxy = this.proxies.find(p => p.id === proxyId);
if (proxy) {
proxy.failureCount++;
// Deactivate if too many general failures
if (proxy.failureCount >= 5) {
proxy.isActive = false;
this.proxies = this.proxies.filter(p => p.id !== proxyId);
console.log(`[ProxyRotator] Proxy ${proxyId} deactivated after ${proxy.failureCount} general failures`);
}
}
if (this.pool) {
try {
await this.pool.query(`
@@ -220,23 +303,22 @@ export class ProxyRotator {
}
/**
* Mark proxy as successful
* Mark proxy as successful - resets consecutive 403 count
* Per workflow-12102025.md: successful request clears the 403 counter
*/
async markSuccess(proxyId: number, responseTimeMs?: number): Promise<void> {
// Update in-memory
const proxy = this.proxies.find(p => p.id === proxyId);
if (proxy) {
proxy.successCount++;
proxy.consecutive403Count = 0; // Reset on success per workflow-12102025.md
proxy.lastUsedAt = new Date();
if (responseTimeMs !== undefined) {
// Rolling average
proxy.avgResponseTimeMs = proxy.avgResponseTimeMs
? (proxy.avgResponseTimeMs * 0.8) + (responseTimeMs * 0.2)
: responseTimeMs;
}
}
// Update database
if (this.pool) {
try {
await this.pool.query(`
@@ -244,6 +326,7 @@ export class ProxyRotator {
SET
last_tested_at = NOW(),
test_result = 'success',
consecutive_403_count = 0,
response_time_ms = CASE
WHEN response_time_ms IS NULL THEN $2
ELSE (response_time_ms * 0.8 + $2 * 0.2)::integer
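Both the in-memory update and the SQL `CASE` above keep an exponentially weighted response-time average with an effective alpha of 0.2. As a standalone sketch:

```typescript
// EMA with alpha = 0.2: the first sample seeds the average, after
// which each new sample contributes 20% of its value.
function rollAvg(prev: number | null, sample: number): number {
  return prev === null ? sample : prev * 0.8 + sample * 0.2;
}
```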
@@ -272,8 +355,8 @@ export class ProxyRotator {
*/
getStats(): ProxyStats {
const totalProxies = this.proxies.length;
const activeProxies = this.proxies.reduce((sum, p) => sum + p.maxConnections, 0); // Total concurrent capacity
const blockedProxies = this.proxies.filter(p => p.failureCount >= 5).length;
const activeProxies = this.proxies.reduce((sum, p) => sum + p.maxConnections, 0);
const blockedProxies = this.proxies.filter(p => p.failureCount >= 5 || p.consecutive403Count >= 3).length;
const successRates = this.proxies
.filter(p => p.successCount + p.failureCount > 0)
@@ -285,15 +368,12 @@ export class ProxyRotator {
return {
totalProxies,
activeProxies, // Total concurrent capacity across all proxies
activeProxies,
blockedProxies,
avgSuccessRate,
};
}
/**
* Check if proxy pool has available proxies
*/
hasAvailableProxies(): boolean {
return this.proxies.length > 0;
}
@@ -301,53 +381,194 @@ export class ProxyRotator {
// ============================================================
// USER AGENT ROTATOR CLASS
// Per workflow-12102025.md: Uses intoli/user-agents for realistic fingerprints
// ============================================================
export class UserAgentRotator {
private userAgents: string[];
private currentIndex: number = 0;
private lastRotation: Date = new Date();
private currentFingerprint: BrowserFingerprint | null = null;
private sessionLog: UASessionLog | null = null;
constructor(userAgents: string[] = USER_AGENTS) {
this.userAgents = userAgents;
// Start at random index to avoid patterns
this.currentIndex = Math.floor(Math.random() * userAgents.length);
constructor() {
// Per workflow-12102025.md: Initialize with first fingerprint
this.rotate();
}
/**
* Get next user agent in rotation
* Per workflow-12102025.md: Roll device category based on distribution
* Mobile: 62%, Desktop: 36%, Tablet: 2%
*/
getNext(): string {
this.currentIndex = (this.currentIndex + 1) % this.userAgents.length;
this.lastRotation = new Date();
return this.userAgents[this.currentIndex];
private rollDeviceCategory(): 'mobile' | 'desktop' | 'tablet' {
const roll = Math.random() * 100;
if (roll < DEVICE_WEIGHTS.mobile) {
return 'mobile';
} else if (roll < DEVICE_WEIGHTS.mobile + DEVICE_WEIGHTS.desktop) {
return 'desktop';
} else {
return 'tablet';
}
}
/**
* Get current user agent without rotating
* Per workflow-12102025.md: Extract browser name from UA string
*/
getCurrent(): string {
return this.userAgents[this.currentIndex];
private extractBrowserName(userAgent: string): string {
if (userAgent.includes('Edg/')) return 'Edge';
if (userAgent.includes('Firefox/')) return 'Firefox';
if (userAgent.includes('Safari/') && !userAgent.includes('Chrome/')) return 'Safari';
if (userAgent.includes('Chrome/')) return 'Chrome';
return 'Unknown';
}
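`extractBrowserName` depends on check order: Edge UAs also contain `Chrome/`, and Chrome UAs also contain `Safari/`, so the most specific token must be tested first. A standalone copy for illustration, exercised against UA strings from the removed `USER_AGENTS` list:

```typescript
// Same ordering as extractBrowserName above: Edg/ before Chrome/,
// and Safari only when Chrome/ is absent.
function browserName(ua: string): string {
  if (ua.includes('Edg/')) return 'Edge';
  if (ua.includes('Firefox/')) return 'Firefox';
  if (ua.includes('Safari/') && !ua.includes('Chrome/')) return 'Safari';
  if (ua.includes('Chrome/')) return 'Chrome';
  return 'Unknown';
}
```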
/**
* Get a random user agent
* Per workflow-12102025.md: Check if browser is in whitelist
*/
getRandom(): string {
const index = Math.floor(Math.random() * this.userAgents.length);
return this.userAgents[index];
private isAllowedBrowser(userAgent: string): boolean {
const browserName = this.extractBrowserName(userAgent);
return ALLOWED_BROWSERS.includes(browserName as typeof ALLOWED_BROWSERS[number]);
}
/**
* Get total available user agents
* Generate a new random fingerprint
* Per workflow-12102025.md:
* - Roll device category (62/36/2)
* - Filter to top 4 browsers only
* - Failure = alert admin + stop (no fallback)
*/
rotate(proxyIp?: string): BrowserFingerprint {
// Per workflow-12102025.md: Roll device category
const deviceCategory = this.rollDeviceCategory();
// Per workflow-12102025.md: Generate UA filtered to device category
const generator = new UserAgent({ deviceCategory });
// Per workflow-12102025.md: Try to get an allowed browser (max 50 attempts)
let ua: ReturnType<typeof generator>;
let attempts = 0;
const maxAttempts = 50;
do {
ua = generator();
attempts++;
} while (!this.isAllowedBrowser(ua.data.userAgent) && attempts < maxAttempts);
// Per workflow-12102025.md: If we can't get allowed browser, this is a failure
if (!this.isAllowedBrowser(ua.data.userAgent)) {
const errorMsg = `[UserAgentRotator] CRITICAL: Failed to generate allowed browser after ${maxAttempts} attempts. Device: ${deviceCategory}. Last UA: ${ua.data.userAgent}`;
console.error(errorMsg);
// Per workflow-12102025.md: Alert admin + stop crawl
// TODO: Post alert to admin dashboard
throw new Error(errorMsg);
}
const data = ua.data;
const browserName = this.extractBrowserName(data.userAgent);
// Build sec-ch-ua headers from user agent string
const secChUa = this.buildSecChUa(data.userAgent, deviceCategory);
// Per workflow-12102025.md: HTTP Fingerprinting - generate full HTTP fingerprint
const httpFingerprint = generateHTTPFingerprint(browserName as BrowserType);
this.currentFingerprint = {
userAgent: data.userAgent,
platform: data.platform,
screenWidth: data.screenWidth,
screenHeight: data.screenHeight,
viewportWidth: data.viewportWidth,
viewportHeight: data.viewportHeight,
deviceCategory: data.deviceCategory,
browserName, // Per workflow-12102025.md: for session logging
// Per workflow-12102025.md: always English
acceptLanguage: 'en-US,en;q=0.9',
...secChUa,
// Per workflow-12102025.md: HTTP Fingerprinting section
httpFingerprint,
};
// Per workflow-12102025.md: Log session data
this.sessionLog = {
deviceCategory,
browserName,
userAgent: data.userAgent,
proxyIp: proxyIp || null,
sessionStartedAt: new Date(),
};
console.log(`[UserAgentRotator] New fingerprint: device=${deviceCategory}, browser=${browserName}, UA=${data.userAgent.slice(0, 50)}...`);
return this.currentFingerprint;
}
/**
* Get current fingerprint without rotating
*/
getCurrent(): BrowserFingerprint {
if (!this.currentFingerprint) {
return this.rotate();
}
return this.currentFingerprint;
}
/**
* Get a random fingerprint (rotates and returns)
*/
getRandom(proxyIp?: string): BrowserFingerprint {
return this.rotate(proxyIp);
}
/**
* Per workflow-12102025.md: Get session log for debugging
*/
getSessionLog(): UASessionLog | null {
return this.sessionLog;
}
/**
* Build sec-ch-ua headers from user agent string
* Per workflow-12102025.md: Include mobile indicator based on device category
*/
private buildSecChUa(userAgent: string, deviceCategory: string): { secChUa?: string; secChUaPlatform?: string; secChUaMobile?: string } {
const isMobile = deviceCategory === 'mobile' || deviceCategory === 'tablet';
// Extract Chrome version if present
const chromeMatch = userAgent.match(/Chrome\/(\d+)/);
const edgeMatch = userAgent.match(/Edg\/(\d+)/);
if (edgeMatch) {
const version = edgeMatch[1];
return {
secChUa: `"Microsoft Edge";v="${version}", "Chromium";v="${version}", "Not_A Brand";v="24"`,
secChUaPlatform: userAgent.includes('Windows') ? '"Windows"' : userAgent.includes('Android') ? '"Android"' : '"macOS"',
secChUaMobile: isMobile ? '?1' : '?0',
};
}
if (chromeMatch) {
const version = chromeMatch[1];
let platform = '"Linux"';
if (userAgent.includes('Windows')) platform = '"Windows"';
else if (userAgent.includes('Mac')) platform = '"macOS"';
else if (userAgent.includes('Android')) platform = '"Android"';
else if (userAgent.includes('iPhone') || userAgent.includes('iPad')) platform = '"iOS"';
return {
secChUa: `"Google Chrome";v="${version}", "Chromium";v="${version}", "Not_A Brand";v="24"`,
secChUaPlatform: platform,
secChUaMobile: isMobile ? '?1' : '?0',
};
}
// Firefox/Safari don't send sec-ch-ua
return {};
}
getCount(): number {
return 1; // user-agents generates dynamically
}
}
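The `buildSecChUa` helper above derives Chromium client-hint brands from the user-agent string, checking `Edg/` before `Chrome/` because Edge UAs also contain a `Chrome/` token. A minimal standalone sketch of the Chrome branch (logic mirrored from `buildSecChUa`; the function name here is illustrative):

```typescript
// Illustrative extraction of the Chrome sec-ch-ua branch from buildSecChUa.
// Real code must check for Edge first, since Edge UAs also include "Chrome/".
function secChUaForChrome(userAgent: string): string | undefined {
  const m = userAgent.match(/Chrome\/(\d+)/);
  if (!m) return undefined; // Firefox/Safari don't send sec-ch-ua
  const v = m[1];
  return `"Google Chrome";v="${v}", "Chromium";v="${v}", "Not_A Brand";v="24"`;
}

const ua =
  'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36';
console.log(secChUaForChrome(ua));
// -> "Google Chrome";v="131", "Chromium";v="131", "Not_A Brand";v="24"
```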
// ============================================================
// COMBINED ROTATOR
// Per workflow-12102025.md: Coordinates proxy + fingerprint rotation
// ============================================================
export class CrawlRotator {
@@ -359,49 +580,51 @@ export class CrawlRotator {
this.userAgent = new UserAgentRotator();
}
/**
* Initialize rotator (load proxies from DB)
*/
async initialize(): Promise<void> {
await this.proxy.loadProxies();
}
/**
* Rotate proxy only (get new IP)
*/
rotateProxy(): Proxy | null {
return this.proxy.getNext();
}
/**
* Rotate fingerprint only (new UA, screen size, etc.)
*/
rotateFingerprint(): BrowserFingerprint {
return this.userAgent.rotate();
}
/**
* Rotate both proxy and fingerprint
* Per workflow-12102025.md: called on 403 for fresh identity
* Passes proxy IP to UA rotation for session logging
*/
rotateBoth(): { proxy: Proxy | null; fingerprint: BrowserFingerprint } {
const proxy = this.proxy.getNext();
const proxyIp = proxy ? proxy.host : undefined;
return {
proxy,
fingerprint: this.userAgent.rotate(proxyIp),
};
}
/**
* Get current proxy and fingerprint without rotating
*/
getCurrent(): { proxy: Proxy | null; fingerprint: BrowserFingerprint } {
return {
proxy: this.proxy.getCurrent(),
fingerprint: this.userAgent.getCurrent(),
};
}
/**
* Record success for current proxy
* Per workflow-12102025.md: resets consecutive 403 count
*/
async recordSuccess(responseTimeMs?: number): Promise<void> {
const current = this.proxy.getCurrent();
@@ -411,7 +634,20 @@ export class CrawlRotator {
}
/**
* Record 403 block for current proxy
* Per workflow-12102025.md: increments consecutive_403_count, disables after 3
* Returns true if proxy was disabled
*/
async recordBlock(): Promise<boolean> {
const current = this.proxy.getCurrent();
if (current) {
return await this.proxy.markBlocked(current.id);
}
return false;
}
/**
* Record general failure (not 403)
*/
async recordFailure(error?: string): Promise<void> {
const current = this.proxy.getCurrent();
@@ -421,14 +657,13 @@ export class CrawlRotator {
}
/**
* Get current proxy location info
* Per workflow-12102025.md: proxy location determines session headers
*/
getProxyLocation(): { city?: string; state?: string; country?: string; timezone?: string; isRotating: boolean } | null {
const current = this.proxy.getCurrent();
if (!current) return null;
// Check if this is a rotating proxy (max_connections > 1 usually indicates rotating)
const isRotating = current.maxConnections > 1;
return {
@@ -439,6 +674,15 @@ export class CrawlRotator {
isRotating
};
}
/**
* Get timezone from current proxy
* Per workflow-12102025.md: used for Accept-Language header
*/
getProxyTimezone(): string | undefined {
const current = this.proxy.getCurrent();
return current?.timezone;
}
}
// ============================================================
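The `recordBlock`/`recordSuccess` contract described above (consecutive 403s disable a proxy after 3; any success resets the counter) can be sketched as a pure in-memory policy. This is an illustrative model, not the real `ProxyRotator` API, which persists these counters to the database:

```typescript
// Hypothetical in-memory model of the 403 policy from workflow-12102025.md:
// increment consecutive_403_count on each block, disable at 3, reset on success.
interface ProxyHealth {
  id: number;
  consecutive403Count: number;
  disabled: boolean;
}

const DISABLE_AFTER = 3; // per workflow-12102025.md

// Returns true when the proxy crosses the threshold and is disabled.
function recordBlock(p: ProxyHealth): boolean {
  p.consecutive403Count += 1;
  if (p.consecutive403Count >= DISABLE_AFTER) {
    p.disabled = true;
  }
  return p.disabled;
}

// A successful request resets the consecutive counter.
function recordSuccess(p: ProxyHealth): void {
  p.consecutive403Count = 0;
}

const proxy: ProxyHealth = { id: 1, consecutive403Count: 0, disabled: false };
recordBlock(proxy);                  // 1st 403
recordSuccess(proxy);                // success resets the streak
recordBlock(proxy);                  // 1st of a new streak
recordBlock(proxy);                  // 2nd
const disabled = recordBlock(proxy); // 3rd consecutive -> disabled
console.log(disabled); // -> true
```

The reset-on-success behavior is why `recordSuccess` matters: a proxy that intermittently succeeds never accumulates enough consecutive blocks to be disabled.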

View File

@@ -0,0 +1,315 @@
/**
* HTTP Fingerprinting Service
*
* Per workflow-12102025.md - HTTP Fingerprinting section:
* - Full header set per browser type
* - Browser-specific header ordering
* - Natural randomization (DNT, Accept quality)
* - Dynamic Referer per dispensary
*
* Canonical location: src/services/http-fingerprint.ts
*/
// ============================================================
// TYPES
// ============================================================
export type BrowserType = 'Chrome' | 'Firefox' | 'Safari' | 'Edge';
/**
* Per workflow-12102025.md: Full HTTP fingerprint for a session
*/
export interface HTTPFingerprint {
browserType: BrowserType;
headers: Record<string, string>;
headerOrder: string[];
curlImpersonateBinary: string;
hasDNT: boolean;
}
/**
* Per workflow-12102025.md: Context for building headers
*/
export interface HeaderContext {
userAgent: string;
secChUa?: string;
secChUaPlatform?: string;
secChUaMobile?: string;
referer: string;
isPost: boolean;
contentLength?: number;
}
// ============================================================
// CONSTANTS (per workflow-12102025.md)
// ============================================================
/**
* Per workflow-12102025.md: DNT header distribution (~30% of users)
*/
const DNT_PROBABILITY = 0.30;
/**
* Per workflow-12102025.md: Accept header variations for natural traffic
*/
const ACCEPT_VARIATIONS = [
'application/json, text/plain, */*',
'application/json,text/plain,*/*',
'*/*',
];
/**
* Per workflow-12102025.md: Accept-Language variations
*/
const ACCEPT_LANGUAGE_VARIATIONS = [
'en-US,en;q=0.9',
'en-US,en;q=0.8',
'en-US;q=0.9,en;q=0.8',
];
/**
* Per workflow-12102025.md: curl-impersonate binaries per browser
*/
const CURL_IMPERSONATE_BINARIES: Record<BrowserType, string> = {
Chrome: 'curl_chrome131',
Edge: 'curl_chrome131', // Edge uses Chromium
Firefox: 'curl_ff133',
Safari: 'curl_safari17',
};
// ============================================================
// HEADER ORDERING (per workflow-12102025.md)
// ============================================================
/**
* Per workflow-12102025.md: Chrome header order for GraphQL requests
*/
const CHROME_HEADER_ORDER = [
'Host',
'Connection',
'Content-Length',
'sec-ch-ua',
'DNT',
'sec-ch-ua-mobile',
'User-Agent',
'sec-ch-ua-platform',
'Content-Type',
'Accept',
'Origin',
'sec-fetch-site',
'sec-fetch-mode',
'sec-fetch-dest',
'Referer',
'Accept-Encoding',
'Accept-Language',
];
/**
* Per workflow-12102025.md: Firefox header order for GraphQL requests
*/
const FIREFOX_HEADER_ORDER = [
'Host',
'User-Agent',
'Accept',
'Accept-Language',
'Accept-Encoding',
'Content-Type',
'Content-Length',
'Origin',
'DNT',
'Connection',
'Referer',
'sec-fetch-dest',
'sec-fetch-mode',
'sec-fetch-site',
];
/**
* Per workflow-12102025.md: Safari header order for GraphQL requests
*/
const SAFARI_HEADER_ORDER = [
'Host',
'Connection',
'Content-Length',
'Accept',
'User-Agent',
'Content-Type',
'Origin',
'Referer',
'Accept-Encoding',
'Accept-Language',
];
/**
* Per workflow-12102025.md: Edge uses Chrome order (Chromium-based)
*/
const HEADER_ORDERS: Record<BrowserType, string[]> = {
Chrome: CHROME_HEADER_ORDER,
Edge: CHROME_HEADER_ORDER,
Firefox: FIREFOX_HEADER_ORDER,
Safari: SAFARI_HEADER_ORDER,
};
// ============================================================
// FINGERPRINT GENERATION
// ============================================================
/**
* Per workflow-12102025.md: Generate HTTP fingerprint for a session
* Randomization is done once per session for consistency
*/
export function generateHTTPFingerprint(browserType: BrowserType): HTTPFingerprint {
// Per workflow-12102025.md: DNT randomized per session (~30%)
const hasDNT = Math.random() < DNT_PROBABILITY;
return {
browserType,
headers: {}, // Built dynamically per request
headerOrder: HEADER_ORDERS[browserType],
curlImpersonateBinary: CURL_IMPERSONATE_BINARIES[browserType],
hasDNT,
};
}
/**
* Per workflow-12102025.md: Build complete headers for a request
* Returns headers in browser-specific order
*/
export function buildOrderedHeaders(
fingerprint: HTTPFingerprint,
context: HeaderContext
): { headers: Record<string, string>; orderedHeaders: string[] } {
const { browserType, hasDNT, headerOrder } = fingerprint;
const { userAgent, secChUa, secChUaPlatform, secChUaMobile, referer, isPost, contentLength } = context;
// Per workflow-12102025.md: Natural randomization for Accept
const accept = ACCEPT_VARIATIONS[Math.floor(Math.random() * ACCEPT_VARIATIONS.length)];
const acceptLanguage = ACCEPT_LANGUAGE_VARIATIONS[Math.floor(Math.random() * ACCEPT_LANGUAGE_VARIATIONS.length)];
// Build all possible headers
const allHeaders: Record<string, string> = {
'Connection': 'keep-alive',
'User-Agent': userAgent,
'Accept': accept,
'Accept-Language': acceptLanguage,
'Accept-Encoding': 'gzip, deflate, br',
};
// Per workflow-12102025.md: POST-only headers
if (isPost) {
allHeaders['Content-Type'] = 'application/json';
allHeaders['Origin'] = 'https://dutchie.com';
if (contentLength !== undefined) {
allHeaders['Content-Length'] = String(contentLength);
}
}
// Per workflow-12102025.md: Dynamic Referer per dispensary
allHeaders['Referer'] = referer;
// Per workflow-12102025.md: DNT randomized per session
if (hasDNT) {
allHeaders['DNT'] = '1';
}
// Per workflow-12102025.md: Chromium-only headers (Chrome, Edge)
if (browserType === 'Chrome' || browserType === 'Edge') {
if (secChUa) allHeaders['sec-ch-ua'] = secChUa;
if (secChUaMobile) allHeaders['sec-ch-ua-mobile'] = secChUaMobile;
if (secChUaPlatform) allHeaders['sec-ch-ua-platform'] = secChUaPlatform;
allHeaders['sec-fetch-site'] = 'same-origin';
allHeaders['sec-fetch-mode'] = 'cors';
allHeaders['sec-fetch-dest'] = 'empty';
}
// Per workflow-12102025.md: Firefox has sec-fetch but no sec-ch
if (browserType === 'Firefox') {
allHeaders['sec-fetch-site'] = 'same-origin';
allHeaders['sec-fetch-mode'] = 'cors';
allHeaders['sec-fetch-dest'] = 'empty';
}
// Per workflow-12102025.md: Safari has no sec-* headers
// Filter to only headers that exist and order them
const orderedHeaders: string[] = [];
const headers: Record<string, string> = {};
for (const headerName of headerOrder) {
if (allHeaders[headerName]) {
orderedHeaders.push(headerName);
headers[headerName] = allHeaders[headerName];
}
}
return { headers, orderedHeaders };
}
/**
* Per workflow-12102025.md: Build curl command arguments for headers
* Headers are added in browser-specific order
*/
export function buildCurlHeaderArgs(
fingerprint: HTTPFingerprint,
context: HeaderContext
): string[] {
const { headers, orderedHeaders } = buildOrderedHeaders(fingerprint, context);
const args: string[] = [];
for (const headerName of orderedHeaders) {
// Skip Host and Content-Length - curl handles these
if (headerName === 'Host' || headerName === 'Content-Length') continue;
args.push('-H', `${headerName}: ${headers[headerName]}`);
}
return args;
}
/**
* Per workflow-12102025.md: Extract Referer from dispensary menu_url
*/
export function buildRefererFromMenuUrl(menuUrl: string | null | undefined): string {
if (!menuUrl) {
return 'https://dutchie.com/';
}
// Extract slug from menu_url
// Formats: /embedded-menu/<slug> or /dispensary/<slug> or full URL
let slug: string | null = null;
const embeddedMatch = menuUrl.match(/\/embedded-menu\/([^/?]+)/);
const dispensaryMatch = menuUrl.match(/\/dispensary\/([^/?]+)/);
if (embeddedMatch) {
slug = embeddedMatch[1];
} else if (dispensaryMatch) {
slug = dispensaryMatch[1];
}
if (slug) {
return `https://dutchie.com/dispensary/${slug}`;
}
return 'https://dutchie.com/';
}
/**
* Per workflow-12102025.md: Get curl-impersonate binary for browser
*/
export function getCurlBinary(browserType: BrowserType): string {
return CURL_IMPERSONATE_BINARIES[browserType];
}
/**
* Per workflow-12102025.md: Check if curl-impersonate is available
*/
export function isCurlImpersonateAvailable(browserType: BrowserType): boolean {
const binary = CURL_IMPERSONATE_BINARIES[browserType];
try {
const { execSync } = require('child_process');
execSync(`which ${binary}`, { stdio: 'ignore' });
return true;
} catch {
return false;
}
}
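`buildRefererFromMenuUrl` is a pure function, so its behavior is easy to verify standalone. The snippet below reproduces its logic verbatim from the file above and exercises the three input shapes it documents:

```typescript
// Logic reproduced from buildRefererFromMenuUrl above for standalone testing.
function buildRefererFromMenuUrl(menuUrl: string | null | undefined): string {
  if (!menuUrl) return 'https://dutchie.com/';
  // Formats: /embedded-menu/<slug> or /dispensary/<slug> or full URL
  const embeddedMatch = menuUrl.match(/\/embedded-menu\/([^/?]+)/);
  const dispensaryMatch = menuUrl.match(/\/dispensary\/([^/?]+)/);
  const slug = embeddedMatch ? embeddedMatch[1]
    : dispensaryMatch ? dispensaryMatch[1]
    : null;
  return slug ? `https://dutchie.com/dispensary/${slug}` : 'https://dutchie.com/';
}

console.log(buildRefererFromMenuUrl('https://dutchie.com/embedded-menu/green-leaf?x=1'));
// -> https://dutchie.com/dispensary/green-leaf
console.log(buildRefererFromMenuUrl(null));
// -> https://dutchie.com/
```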

View File

@@ -1,116 +1,38 @@
/**
* LEGACY SCHEDULER - DEPRECATED 2024-12-10
*
* DO NOT USE THIS FILE.
*
* Per TASK_WORKFLOW_2024-12-10.md:
* This node-cron scheduler has been replaced by the database-driven
* task scheduler in src/services/task-scheduler.ts
*
* The new scheduler:
* - Stores schedules in PostgreSQL (survives restarts)
* - Uses SELECT FOR UPDATE SKIP LOCKED (multi-replica safe)
* - Creates tasks in worker_tasks table (processed by task-worker.ts)
*
* This file is kept for reference only. All exports are no-ops.
* Legacy code has been removed - see git history for original implementation.
*/
// 2024-12-10: All functions are now no-ops
export async function startScheduler(): Promise<void> {
console.warn('[DEPRECATED] startScheduler() called - use taskScheduler from task-scheduler.ts instead');
}
export function stopScheduler(): void {
console.warn('[DEPRECATED] stopScheduler() called - use taskScheduler from task-scheduler.ts instead');
}
export async function restartScheduler(): Promise<void> {
console.warn('[DEPRECATED] restartScheduler() called - use taskScheduler from task-scheduler.ts instead');
}
// Manual trigger functions for admin
export async function triggerStoreScrape(_storeId: number): Promise<void> {
console.warn('[DEPRECATED] triggerStoreScrape() called - use taskService.createTask() instead');
}
export async function triggerAllStoresScrape(): Promise<void> {
console.warn('[DEPRECATED] triggerAllStoresScrape() called - use taskScheduler.triggerSchedule() instead');
}

View File

@@ -0,0 +1,375 @@
/**
* Database-Driven Task Scheduler
*
* Per TASK_WORKFLOW_2024-12-10.md:
* - Schedules stored in DB (survives restarts)
* - Uses SELECT FOR UPDATE to prevent duplicate execution across replicas
* - Polls every 60s to check if schedules are due
* - Generates tasks into worker_tasks table for task-worker.ts to process
*
* 2024-12-10: Created to replace legacy node-cron scheduler
*/
import { pool } from '../db/pool';
import { taskService, TaskRole } from '../tasks/task-service';
// Per TASK_WORKFLOW_2024-12-10.md: Poll interval for checking schedules
const POLL_INTERVAL_MS = 60_000; // 60 seconds
interface TaskSchedule {
id: number;
name: string;
role: TaskRole;
enabled: boolean;
interval_hours: number;
last_run_at: Date | null;
next_run_at: Date | null;
state_code: string | null;
priority: number;
}
class TaskScheduler {
private pollTimer: NodeJS.Timeout | null = null;
private isRunning = false;
/**
* Start the scheduler
* Per TASK_WORKFLOW_2024-12-10.md: Called on API server startup
*/
async start(): Promise<void> {
if (this.isRunning) {
console.log('[TaskScheduler] Already running');
return;
}
console.log('[TaskScheduler] Starting database-driven scheduler...');
this.isRunning = true;
// Per TASK_WORKFLOW_2024-12-10.md: On startup, recover stale tasks
try {
const recovered = await taskService.recoverStaleTasks(10);
if (recovered > 0) {
console.log(`[TaskScheduler] Recovered ${recovered} stale tasks from dead workers`);
}
} catch (err: any) {
console.error('[TaskScheduler] Failed to recover stale tasks:', err.message);
}
// Per TASK_WORKFLOW_2024-12-10.md: Ensure default schedules exist
await this.ensureDefaultSchedules();
// Per TASK_WORKFLOW_2024-12-10.md: Check immediately on startup
await this.checkAndRunDueSchedules();
// Per TASK_WORKFLOW_2024-12-10.md: Then poll every 60 seconds
this.pollTimer = setInterval(async () => {
await this.checkAndRunDueSchedules();
}, POLL_INTERVAL_MS);
console.log('[TaskScheduler] Started - polling every 60s');
}
/**
* Stop the scheduler
*/
stop(): void {
if (this.pollTimer) {
clearInterval(this.pollTimer);
this.pollTimer = null;
}
this.isRunning = false;
console.log('[TaskScheduler] Stopped');
}
/**
* Ensure default schedules exist in the database
* Per TASK_WORKFLOW_2024-12-10.md: Creates schedules if they don't exist
*/
private async ensureDefaultSchedules(): Promise<void> {
// Per TASK_WORKFLOW_2024-12-10.md: Default schedules for task generation
// NOTE: payload_fetch replaces direct product_refresh - it chains to product_refresh
const defaults = [
{
name: 'payload_fetch_all',
role: 'payload_fetch' as TaskRole,
interval_hours: 4,
priority: 0,
description: 'Fetch payloads from Dutchie API for all crawl-enabled stores every 4 hours. Chains to product_refresh.',
},
{
name: 'store_discovery_dutchie',
role: 'store_discovery' as TaskRole,
interval_hours: 24,
priority: 5,
description: 'Discover new Dutchie stores daily',
},
{
name: 'analytics_refresh',
role: 'analytics_refresh' as TaskRole,
interval_hours: 6,
priority: 0,
description: 'Refresh analytics materialized views every 6 hours',
},
];
for (const sched of defaults) {
try {
await pool.query(`
INSERT INTO task_schedules (name, role, interval_hours, priority, description, enabled, next_run_at)
VALUES ($1, $2, $3, $4, $5, true, NOW())
ON CONFLICT (name) DO NOTHING
`, [sched.name, sched.role, sched.interval_hours, sched.priority, sched.description]);
} catch (err: any) {
// Table may not exist yet - will be created by migration
if (!err.message.includes('does not exist')) {
console.error(`[TaskScheduler] Failed to create default schedule ${sched.name}:`, err.message);
}
}
}
}
/**
* Check for and run any due schedules
* Per TASK_WORKFLOW_2024-12-10.md: Uses SELECT FOR UPDATE SKIP LOCKED to prevent duplicates
*/
private async checkAndRunDueSchedules(): Promise<void> {
const client = await pool.connect();
try {
await client.query('BEGIN');
// Per TASK_WORKFLOW_2024-12-10.md: Atomic claim of due schedules
const result = await client.query<TaskSchedule>(`
SELECT *
FROM task_schedules
WHERE enabled = true
AND (next_run_at IS NULL OR next_run_at <= NOW())
FOR UPDATE SKIP LOCKED
`);
for (const schedule of result.rows) {
console.log(`[TaskScheduler] Running schedule: ${schedule.name} (${schedule.role})`);
try {
const tasksCreated = await this.executeSchedule(schedule);
console.log(`[TaskScheduler] Schedule ${schedule.name} created ${tasksCreated} tasks`);
// Per TASK_WORKFLOW_2024-12-10.md: Update last_run_at and calculate next_run_at
await client.query(`
UPDATE task_schedules
SET
last_run_at = NOW(),
next_run_at = NOW() + ($1 || ' hours')::interval,
last_task_count = $2,
updated_at = NOW()
WHERE id = $3
`, [schedule.interval_hours, tasksCreated, schedule.id]);
} catch (err: any) {
console.error(`[TaskScheduler] Schedule ${schedule.name} failed:`, err.message);
// Still update next_run_at to prevent infinite retry loop
await client.query(`
UPDATE task_schedules
SET
next_run_at = NOW() + ($1 || ' hours')::interval,
last_error = $2,
updated_at = NOW()
WHERE id = $3
`, [schedule.interval_hours, err.message, schedule.id]);
}
}
await client.query('COMMIT');
} catch (err: any) {
await client.query('ROLLBACK');
console.error('[TaskScheduler] Failed to check schedules:', err.message);
} finally {
client.release();
}
}
/**
* Execute a schedule and create tasks
* Per TASK_WORKFLOW_2024-12-10.md: Different logic per role
*/
private async executeSchedule(schedule: TaskSchedule): Promise<number> {
switch (schedule.role) {
case 'payload_fetch':
// Per TASK_WORKFLOW_2024-12-10.md: payload_fetch replaces direct product_refresh
return this.generatePayloadFetchTasks(schedule);
case 'product_refresh':
// Legacy - kept for manual triggers, but scheduled crawls use payload_fetch
return this.generatePayloadFetchTasks(schedule);
case 'store_discovery':
return this.generateStoreDiscoveryTasks(schedule);
case 'analytics_refresh':
return this.generateAnalyticsRefreshTasks(schedule);
default:
console.warn(`[TaskScheduler] Unknown role: ${schedule.role}`);
return 0;
}
}
/**
* Generate payload_fetch tasks for stores that need crawling
* Per TASK_WORKFLOW_2024-12-10.md: payload_fetch hits API, saves to disk, chains to product_refresh
*/
private async generatePayloadFetchTasks(schedule: TaskSchedule): Promise<number> {
// Per TASK_WORKFLOW_2024-12-10.md: Find stores needing refresh
const result = await pool.query(`
SELECT d.id
FROM dispensaries d
WHERE d.crawl_enabled = true
AND d.platform_dispensary_id IS NOT NULL
-- No pending/running payload_fetch or product_refresh task already
AND NOT EXISTS (
SELECT 1 FROM worker_tasks t
WHERE t.dispensary_id = d.id
AND t.role IN ('payload_fetch', 'product_refresh')
AND t.status IN ('pending', 'claimed', 'running')
)
-- Never fetched OR last fetch > interval ago
AND (
d.last_fetch_at IS NULL
OR d.last_fetch_at < NOW() - ($1 || ' hours')::interval
)
${schedule.state_code ? 'AND d.state_id = (SELECT id FROM states WHERE code = $2)' : ''}
`, schedule.state_code ? [schedule.interval_hours, schedule.state_code] : [schedule.interval_hours]);
const dispensaryIds = result.rows.map((r: { id: number }) => r.id);
if (dispensaryIds.length === 0) {
return 0;
}
// Per TASK_WORKFLOW_2024-12-10.md: Create payload_fetch tasks (they chain to product_refresh)
const tasks = dispensaryIds.map((id: number) => ({
role: 'payload_fetch' as TaskRole,
dispensary_id: id,
priority: schedule.priority,
}));
return taskService.createTasks(tasks);
}
/**
* Generate store_discovery tasks
* Per TASK_WORKFLOW_2024-12-10.md: One task per platform
*/
private async generateStoreDiscoveryTasks(schedule: TaskSchedule): Promise<number> {
// Check if discovery task already pending
const existing = await taskService.listTasks({
role: 'store_discovery',
status: ['pending', 'claimed', 'running'],
limit: 1,
});
if (existing.length > 0) {
console.log('[TaskScheduler] Store discovery task already pending, skipping');
return 0;
}
await taskService.createTask({
role: 'store_discovery',
platform: 'dutchie',
priority: schedule.priority,
});
return 1;
}
/**
* Generate analytics_refresh tasks
* Per TASK_WORKFLOW_2024-12-10.md: Single task to refresh all MVs
*/
private async generateAnalyticsRefreshTasks(schedule: TaskSchedule): Promise<number> {
// Check if analytics task already pending
const existing = await taskService.listTasks({
role: 'analytics_refresh',
status: ['pending', 'claimed', 'running'],
limit: 1,
});
if (existing.length > 0) {
console.log('[TaskScheduler] Analytics refresh task already pending, skipping');
return 0;
}
await taskService.createTask({
role: 'analytics_refresh',
priority: schedule.priority,
});
return 1;
}
/**
* Get all schedules for dashboard display
*/
async getSchedules(): Promise<TaskSchedule[]> {
try {
const result = await pool.query(`
SELECT * FROM task_schedules ORDER BY name
`);
return result.rows as TaskSchedule[];
} catch {
return [];
}
}
/**
* Update a schedule
*/
async updateSchedule(id: number, updates: Partial<TaskSchedule>): Promise<void> {
const setClauses: string[] = [];
const values: any[] = [];
let paramIndex = 1;
if (updates.enabled !== undefined) {
setClauses.push(`enabled = $${paramIndex++}`);
values.push(updates.enabled);
}
if (updates.interval_hours !== undefined) {
setClauses.push(`interval_hours = $${paramIndex++}`);
values.push(updates.interval_hours);
}
if (updates.priority !== undefined) {
setClauses.push(`priority = $${paramIndex++}`);
values.push(updates.priority);
}
if (setClauses.length === 0) return;
setClauses.push('updated_at = NOW()');
values.push(id);
await pool.query(`
UPDATE task_schedules
SET ${setClauses.join(', ')}
WHERE id = $${paramIndex}
`, values);
}
/**
* Trigger a schedule to run immediately
*/
async triggerSchedule(id: number): Promise<number> {
const result = await pool.query(`
SELECT * FROM task_schedules WHERE id = $1
`, [id]);
if (result.rows.length === 0) {
throw new Error(`Schedule ${id} not found`);
}
return this.executeSchedule(result.rows[0] as TaskSchedule);
}
}
// Per TASK_WORKFLOW_2024-12-10.md: Singleton instance
export const taskScheduler = new TaskScheduler();
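The dynamic SET-clause construction in `updateSchedule` above is worth seeing in isolation: each provided field appends a `$n` placeholder and pushes its value, and the final placeholder binds the `id`. A sketch with the query-building logic extracted into a pure function (the function name is illustrative; the real method executes the query via `pool.query`):

```typescript
// Illustrative extraction of updateSchedule's dynamic SET-clause builder,
// returning SQL + bind values instead of executing against the pool.
interface ScheduleUpdates {
  enabled?: boolean;
  interval_hours?: number;
  priority?: number;
}

function buildScheduleUpdate(
  updates: ScheduleUpdates,
  id: number
): { sql: string; values: (boolean | number)[] } | null {
  const setClauses: string[] = [];
  const values: (boolean | number)[] = [];
  let paramIndex = 1;
  if (updates.enabled !== undefined) {
    setClauses.push(`enabled = $${paramIndex++}`);
    values.push(updates.enabled);
  }
  if (updates.interval_hours !== undefined) {
    setClauses.push(`interval_hours = $${paramIndex++}`);
    values.push(updates.interval_hours);
  }
  if (updates.priority !== undefined) {
    setClauses.push(`priority = $${paramIndex++}`);
    values.push(updates.priority);
  }
  if (setClauses.length === 0) return null; // nothing to update
  setClauses.push('updated_at = NOW()');    // always bump updated_at
  values.push(id);                          // id binds to the last placeholder
  return {
    sql: `UPDATE task_schedules SET ${setClauses.join(', ')} WHERE id = $${paramIndex}`,
    values,
  };
}

const q = buildScheduleUpdate({ enabled: true, interval_hours: 6 }, 42);
console.log(q?.sql);
// -> UPDATE task_schedules SET enabled = $1, interval_hours = $2, updated_at = NOW() WHERE id = $3
```

Using sequential placeholders with a parallel values array keeps the update parameterized, so no user-supplied value is ever interpolated into the SQL string.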

View File

@@ -94,7 +94,8 @@ export async function handleEntryPointDiscovery(ctx: TaskContext): Promise<TaskR
// ============================================================
// STEP 3: Start stealth session
// ============================================================
// Per workflow-12102025.md: session identity comes from proxy location, not task params
const session = startSession();
console.log(`[EntryPointDiscovery] Session started: ${session.sessionId}`);
try {

View File

@@ -0,0 +1,221 @@
/**
* Payload Fetch Handler
*
* Per TASK_WORKFLOW_2024-12-10.md: Separates API fetch from data processing.
*
* This handler ONLY:
* 1. Hits Dutchie GraphQL API
* 2. Saves raw payload to filesystem (gzipped)
* 3. Records metadata in raw_crawl_payloads table
* 4. Queues a product_refresh task to process the payload
*
* Benefits of separation:
* - Retry-friendly: If normalize fails, re-run refresh without re-crawling
* - Faster refreshes: Local file read vs network call
* - Replay-able: Run refresh against any historical payload
* - Less API pressure: Only this role hits Dutchie
*/
import { TaskContext, TaskResult } from '../task-worker';
import {
executeGraphQL,
startSession,
endSession,
GRAPHQL_HASHES,
DUTCHIE_CONFIG,
} from '../../platforms/dutchie';
import { saveRawPayload } from '../../utils/payload-storage';
import { taskService } from '../task-service';
export async function handlePayloadFetch(ctx: TaskContext): Promise<TaskResult> {
const { pool, task } = ctx;
const dispensaryId = task.dispensary_id;
if (!dispensaryId) {
return { success: false, error: 'No dispensary_id specified for payload_fetch task' };
}
try {
// ============================================================
// STEP 1: Load dispensary info
// ============================================================
const dispResult = await pool.query(`
SELECT
id, name, platform_dispensary_id, menu_url, menu_type, city, state
FROM dispensaries
WHERE id = $1 AND crawl_enabled = true
`, [dispensaryId]);
if (dispResult.rows.length === 0) {
return { success: false, error: `Dispensary ${dispensaryId} not found or not crawl_enabled` };
}
const dispensary = dispResult.rows[0];
const platformId = dispensary.platform_dispensary_id;
if (!platformId) {
return { success: false, error: `Dispensary ${dispensaryId} has no platform_dispensary_id` };
}
// Extract cName from menu_url
const cNameMatch = dispensary.menu_url?.match(/\/(?:embedded-menu|dispensary)\/([^/?]+)/);
const cName = cNameMatch ? cNameMatch[1] : 'dispensary';
console.log(`[PayloadFetch] Starting fetch for ${dispensary.name} (ID: ${dispensaryId})`);
console.log(`[PayloadFetch] Platform ID: ${platformId}, cName: ${cName}`);
// ============================================================
// STEP 2: Start stealth session
// ============================================================
const session = startSession();
console.log(`[PayloadFetch] Session started: ${session.sessionId}`);
await ctx.heartbeat();
// ============================================================
// STEP 3: Fetch products via GraphQL (Status: 'All')
// ============================================================
const allProducts: any[] = [];
let page = 0;
let totalCount = 0;
const perPage = DUTCHIE_CONFIG.perPage;
const maxPages = DUTCHIE_CONFIG.maxPages;
try {
while (page < maxPages) {
const variables = {
includeEnterpriseSpecials: false,
productsFilter: {
dispensaryId: platformId,
pricingType: 'rec',
Status: 'All',
types: [],
useCache: false,
isDefaultSort: true,
sortBy: 'popularSortIdx',
sortDirection: 1,
bypassOnlineThresholds: true,
isKioskMenu: false,
removeProductsBelowOptionThresholds: false,
},
page,
perPage,
};
console.log(`[PayloadFetch] Fetching page ${page + 1}...`);
const result = await executeGraphQL(
'FilteredProducts',
variables,
GRAPHQL_HASHES.FilteredProducts,
{ cName, maxRetries: 3 }
);
const data = result?.data?.filteredProducts;
if (!data || !data.products) {
if (page === 0) {
throw new Error('No product data returned from GraphQL');
}
break;
}
const products = data.products;
allProducts.push(...products);
if (page === 0) {
totalCount = data.queryInfo?.totalCount || products.length;
console.log(`[PayloadFetch] Total products reported: ${totalCount}`);
}
if (allProducts.length >= totalCount || products.length < perPage) {
break;
}
page++;
if (page < maxPages) {
await new Promise(r => setTimeout(r, DUTCHIE_CONFIG.pageDelayMs));
}
if (page % 5 === 0) {
await ctx.heartbeat();
}
}
console.log(`[PayloadFetch] Fetched ${allProducts.length} products in ${page + 1} pages`);
} finally {
endSession();
}
if (allProducts.length === 0) {
return {
success: false,
error: 'No products returned from GraphQL',
productsProcessed: 0,
};
}
await ctx.heartbeat();
// ============================================================
// STEP 4: Save raw payload to filesystem
// Per TASK_WORKFLOW_2024-12-10.md: Metadata/Payload separation
// ============================================================
const rawPayload = {
dispensaryId,
platformId,
cName,
fetchedAt: new Date().toISOString(),
productCount: allProducts.length,
products: allProducts,
};
const payloadResult = await saveRawPayload(
pool,
dispensaryId,
rawPayload,
null, // crawl_run_id - not using crawl_runs in new system
allProducts.length
);
console.log(`[PayloadFetch] Saved payload #${payloadResult.id} (${(payloadResult.sizeBytes / 1024).toFixed(1)}KB)`);
// ============================================================
// STEP 5: Update dispensary last_fetch_at
// ============================================================
await pool.query(`
UPDATE dispensaries
SET last_fetch_at = NOW()
WHERE id = $1
`, [dispensaryId]);
// ============================================================
// STEP 6: Queue product_refresh task to process the payload
// Per TASK_WORKFLOW_2024-12-10.md: Task chaining
// ============================================================
await taskService.createTask({
role: 'product_refresh',
dispensary_id: dispensaryId,
priority: task.priority || 0,
payload: { payload_id: payloadResult.id },
});
console.log(`[PayloadFetch] Queued product_refresh task for payload #${payloadResult.id}`);
return {
success: true,
payloadId: payloadResult.id,
productCount: allProducts.length,
sizeBytes: payloadResult.sizeBytes,
};
} catch (error: unknown) {
const errorMessage = error instanceof Error ? error.message : 'Unknown error';
console.error(`[PayloadFetch] Error for dispensary ${dispensaryId}:`, errorMessage);
return {
success: false,
error: errorMessage,
};
}
}
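The STEP 3 loop above stops paging on one of three conditions: every reported product has been collected, a page comes back short, or the hard `maxPages` cap is hit. A minimal sketch of that termination rule (hypothetical helper, not part of the repo):

```typescript
// Sketch of the STEP 3 termination rules from the payload_fetch loop:
// stop when we hold every reported product, when a page comes back short,
// or when the next page would exceed the hard safety cap.
function shouldStopPaging(
  fetchedSoFar: number,
  totalCount: number,
  lastPageSize: number,
  perPage: number,
  nextPage: number,
  maxPages: number
): boolean {
  if (fetchedSoFar >= totalCount) return true; // have everything the API reported
  if (lastPageSize < perPage) return true;     // short page => no more results
  return nextPage >= maxPages;                 // hard safety cap
}
```

The short-page check matters because `totalCount` comes from `queryInfo` on page 0 and can drift if products change mid-crawl.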

View File

@@ -1,16 +1,31 @@
/**
* Product Discovery Handler
*
* Initial product fetch for stores that have 0 products.
* Same logic as product_resync, but for initial discovery.
* Per TASK_WORKFLOW_2024-12-10.md: Initial product fetch for newly discovered stores.
*
* Flow:
* 1. Triggered after store_discovery promotes a new dispensary
* 2. Chains to payload_fetch to get initial product data
* 3. payload_fetch chains to product_refresh for DB upsert
*
* Chaining:
* store_discovery → (newStoreIds) → product_discovery → payload_fetch → product_refresh
*/
import { TaskContext, TaskResult } from '../task-worker';
import { handleProductRefresh } from './product-refresh';
import { handlePayloadFetch } from './payload-fetch';
export async function handleProductDiscovery(ctx: TaskContext): Promise<TaskResult> {
// Product discovery is essentially the same as refresh for the first time
// The main difference is in when this task is triggered (new store vs scheduled)
console.log(`[ProductDiscovery] Starting initial product fetch for dispensary ${ctx.task.dispensary_id}`);
return handleProductRefresh(ctx);
const { task } = ctx;
const dispensaryId = task.dispensary_id;
if (!dispensaryId) {
return { success: false, error: 'No dispensary_id provided' };
}
console.log(`[ProductDiscovery] Starting initial product discovery for dispensary ${dispensaryId}`);
// Per TASK_WORKFLOW_2024-12-10.md: Chain to payload_fetch for API → disk
// payload_fetch will then chain to product_refresh for disk → DB
return handlePayloadFetch(ctx);
}

View File

@@ -1,33 +1,32 @@
/**
* Product Refresh Handler
*
* Re-crawls a store to capture price/stock changes using the GraphQL pipeline.
* Per TASK_WORKFLOW_2024-12-10.md: Processes a locally-stored payload.
*
* This handler reads from the filesystem (NOT the Dutchie API).
* The payload_fetch handler is responsible for API calls.
*
* Flow:
* 1. Load dispensary info from database
* 2. Start stealth session (fingerprint + optional proxy)
* 3. Fetch products via GraphQL (Status: 'All')
* 4. Normalize data via DutchieNormalizer
* 5. Upsert to store_products and store_product_snapshots
* 6. Track missing products (increment consecutive_misses, mark OOS at 3)
* 7. Download new product images
* 8. End session
* 1. Load payload from filesystem (by payload_id or latest for dispensary)
* 2. Normalize data via DutchieNormalizer
* 3. Upsert to store_products and store_product_snapshots
* 4. Track missing products (increment consecutive_misses, mark OOS at 3)
* 5. Download new product images
*
* Benefits of separation:
* - Retry-friendly: If this fails, re-run without re-crawling
* - Replay-able: Run against any historical payload
* - Faster: Local file read vs network call
*/
import { TaskContext, TaskResult } from '../task-worker';
import {
executeGraphQL,
startSession,
endSession,
GRAPHQL_HASHES,
DUTCHIE_CONFIG,
} from '../../platforms/dutchie';
import { DutchieNormalizer } from '../../hydration/normalizers/dutchie';
import {
upsertStoreProducts,
createStoreProductSnapshots,
downloadProductImages,
} from '../../hydration/canonical-upsert';
import { loadRawPayloadById, getLatestPayload } from '../../utils/payload-storage';
const normalizer = new DutchieNormalizer();
@@ -47,129 +46,76 @@ export async function handleProductRefresh(ctx: TaskContext): Promise<TaskResult
SELECT
id, name, platform_dispensary_id, menu_url, menu_type, city, state
FROM dispensaries
WHERE id = $1 AND crawl_enabled = true
WHERE id = $1
`, [dispensaryId]);
if (dispResult.rows.length === 0) {
return { success: false, error: `Dispensary ${dispensaryId} not found or not crawl_enabled` };
return { success: false, error: `Dispensary ${dispensaryId} not found` };
}
const dispensary = dispResult.rows[0];
const platformId = dispensary.platform_dispensary_id;
if (!platformId) {
return { success: false, error: `Dispensary ${dispensaryId} has no platform_dispensary_id` };
}
// Extract cName from menu_url
// Extract cName from menu_url for image storage context
const cNameMatch = dispensary.menu_url?.match(/\/(?:embedded-menu|dispensary)\/([^/?]+)/);
const cName = cNameMatch ? cNameMatch[1] : 'dispensary';
console.log(`[ProductResync] Starting crawl for ${dispensary.name} (ID: ${dispensaryId})`);
console.log(`[ProductResync] Platform ID: ${platformId}, cName: ${cName}`);
// ============================================================
// STEP 2: Start stealth session
// ============================================================
const session = startSession(dispensary.state || 'AZ', 'America/Phoenix');
console.log(`[ProductResync] Session started: ${session.sessionId}`);
console.log(`[ProductRefresh] Starting refresh for ${dispensary.name} (ID: ${dispensaryId})`);
await ctx.heartbeat();
// ============================================================
// STEP 3: Fetch products via GraphQL (Status: 'All')
// STEP 2: Load payload from filesystem
// Per TASK_WORKFLOW_2024-12-10.md: Read local payload, not API
// ============================================================
const allProducts: any[] = [];
let page = 0;
let totalCount = 0;
const perPage = DUTCHIE_CONFIG.perPage;
const maxPages = DUTCHIE_CONFIG.maxPages;
let payloadData: any;
let payloadId: number;
try {
while (page < maxPages) {
const variables = {
includeEnterpriseSpecials: false,
productsFilter: {
dispensaryId: platformId,
pricingType: 'rec',
Status: 'All',
types: [],
useCache: false,
isDefaultSort: true,
sortBy: 'popularSortIdx',
sortDirection: 1,
bypassOnlineThresholds: true,
isKioskMenu: false,
removeProductsBelowOptionThresholds: false,
},
page,
perPage,
};
// Check if specific payload_id was provided (from task chaining)
const taskPayload = task.payload as { payload_id?: number } | null;
console.log(`[ProductResync] Fetching page ${page + 1}...`);
const result = await executeGraphQL(
'FilteredProducts',
variables,
GRAPHQL_HASHES.FilteredProducts,
{ cName, maxRetries: 3 }
);
const data = result?.data?.filteredProducts;
if (!data || !data.products) {
if (page === 0) {
throw new Error('No product data returned from GraphQL');
if (taskPayload?.payload_id) {
// Load specific payload (from payload_fetch chaining)
const result = await loadRawPayloadById(pool, taskPayload.payload_id);
if (!result) {
return { success: false, error: `Payload ${taskPayload.payload_id} not found` };
}
break;
payloadData = result.payload;
payloadId = result.metadata.id;
console.log(`[ProductRefresh] Loaded specific payload #${payloadId}`);
} else {
// Load latest payload for this dispensary
const result = await getLatestPayload(pool, dispensaryId);
if (!result) {
return { success: false, error: `No payload found for dispensary ${dispensaryId}` };
}
payloadData = result.payload;
payloadId = result.metadata.id;
console.log(`[ProductRefresh] Loaded latest payload #${payloadId} (${result.metadata.fetchedAt})`);
}
const products = data.products;
allProducts.push(...products);
if (page === 0) {
totalCount = data.queryInfo?.totalCount || products.length;
console.log(`[ProductResync] Total products reported: ${totalCount}`);
}
if (allProducts.length >= totalCount || products.length < perPage) {
break;
}
page++;
if (page < maxPages) {
await new Promise(r => setTimeout(r, DUTCHIE_CONFIG.pageDelayMs));
}
if (page % 5 === 0) {
await ctx.heartbeat();
}
}
console.log(`[ProductResync] Fetched ${allProducts.length} products in ${page + 1} pages`);
} finally {
endSession();
}
const allProducts = payloadData.products || [];
if (allProducts.length === 0) {
return {
success: false,
error: 'No products returned from GraphQL',
error: 'Payload contains no products',
payloadId,
productsProcessed: 0,
};
}
console.log(`[ProductRefresh] Processing ${allProducts.length} products from payload #${payloadId}`);
await ctx.heartbeat();
// ============================================================
// STEP 4: Normalize data
// STEP 3: Normalize data
// ============================================================
console.log(`[ProductResync] Normalizing ${allProducts.length} products...`);
console.log(`[ProductRefresh] Normalizing ${allProducts.length} products...`);
// Build RawPayload for the normalizer
const rawPayload = {
id: `resync-${dispensaryId}-${Date.now()}`,
id: `refresh-${dispensaryId}-${Date.now()}`,
dispensary_id: dispensaryId,
crawl_run_id: null,
platform: 'dutchie',
@@ -189,25 +135,26 @@ export async function handleProductRefresh(ctx: TaskContext): Promise<TaskResult
const normalizationResult = normalizer.normalize(rawPayload);
if (normalizationResult.errors.length > 0) {
console.warn(`[ProductResync] Normalization warnings: ${normalizationResult.errors.map(e => e.message).join(', ')}`);
console.warn(`[ProductRefresh] Normalization warnings: ${normalizationResult.errors.map(e => e.message).join(', ')}`);
}
if (normalizationResult.products.length === 0) {
return {
success: false,
error: 'Normalization produced no products',
payloadId,
productsProcessed: 0,
};
}
console.log(`[ProductResync] Normalized ${normalizationResult.products.length} products`);
console.log(`[ProductRefresh] Normalized ${normalizationResult.products.length} products`);
await ctx.heartbeat();
// ============================================================
// STEP 5: Upsert to canonical tables
// STEP 4: Upsert to canonical tables
// ============================================================
console.log(`[ProductResync] Upserting to store_products...`);
console.log(`[ProductRefresh] Upserting to store_products...`);
const upsertResult = await upsertStoreProducts(
pool,
@@ -216,12 +163,12 @@ export async function handleProductRefresh(ctx: TaskContext): Promise<TaskResult
normalizationResult.availability
);
console.log(`[ProductResync] Upserted: ${upsertResult.upserted} (${upsertResult.new} new, ${upsertResult.updated} updated)`);
console.log(`[ProductRefresh] Upserted: ${upsertResult.upserted} (${upsertResult.new} new, ${upsertResult.updated} updated)`);
await ctx.heartbeat();
// Create snapshots
console.log(`[ProductResync] Creating snapshots...`);
console.log(`[ProductRefresh] Creating snapshots...`);
const snapshotsResult = await createStoreProductSnapshots(
pool,
@@ -232,12 +179,12 @@ export async function handleProductRefresh(ctx: TaskContext): Promise<TaskResult
null // No crawl_run_id in new system
);
console.log(`[ProductResync] Created ${snapshotsResult.created} snapshots`);
console.log(`[ProductRefresh] Created ${snapshotsResult.created} snapshots`);
await ctx.heartbeat();
// ============================================================
// STEP 6: Track missing products (consecutive_misses logic)
// STEP 5: Track missing products (consecutive_misses logic)
// - Products in feed: reset consecutive_misses to 0
// - Products not in feed: increment consecutive_misses
// - At 3 consecutive misses: mark as OOS
@@ -270,7 +217,7 @@ export async function handleProductRefresh(ctx: TaskContext): Promise<TaskResult
const incrementedCount = incrementResult.rowCount || 0;
if (incrementedCount > 0) {
console.log(`[ProductResync] Incremented consecutive_misses for ${incrementedCount} products`);
console.log(`[ProductRefresh] Incremented consecutive_misses for ${incrementedCount} products`);
}
// Mark as OOS any products that hit 3 consecutive misses
@@ -286,16 +233,16 @@ export async function handleProductRefresh(ctx: TaskContext): Promise<TaskResult
const markedOosCount = oosResult.rowCount || 0;
if (markedOosCount > 0) {
console.log(`[ProductResync] Marked ${markedOosCount} products as OOS (3+ consecutive misses)`);
console.log(`[ProductRefresh] Marked ${markedOosCount} products as OOS (3+ consecutive misses)`);
}
await ctx.heartbeat();
// ============================================================
// STEP 7: Download images for new products
// STEP 6: Download images for new products
// ============================================================
if (upsertResult.productsNeedingImages.length > 0) {
console.log(`[ProductResync] Downloading images for ${upsertResult.productsNeedingImages.length} products...`);
console.log(`[ProductRefresh] Downloading images for ${upsertResult.productsNeedingImages.length} products...`);
try {
const dispensaryContext = {
@@ -309,12 +256,12 @@ export async function handleProductRefresh(ctx: TaskContext): Promise<TaskResult
);
} catch (imgError: any) {
// Image download errors shouldn't fail the whole task
console.warn(`[ProductResync] Image download error (non-fatal): ${imgError.message}`);
console.warn(`[ProductRefresh] Image download error (non-fatal): ${imgError.message}`);
}
}
// ============================================================
// STEP 8: Update dispensary last_crawl_at
// STEP 7: Update dispensary last_crawl_at
// ============================================================
await pool.query(`
UPDATE dispensaries
@@ -322,10 +269,20 @@ export async function handleProductRefresh(ctx: TaskContext): Promise<TaskResult
WHERE id = $1
`, [dispensaryId]);
console.log(`[ProductResync] Completed ${dispensary.name}`);
// ============================================================
// STEP 8: Mark payload as processed
// ============================================================
await pool.query(`
UPDATE raw_crawl_payloads
SET processed_at = NOW()
WHERE id = $1
`, [payloadId]);
console.log(`[ProductRefresh] Completed ${dispensary.name}`);
return {
success: true,
payloadId,
productsProcessed: normalizationResult.products.length,
snapshotsCreated: snapshotsResult.created,
newProducts: upsertResult.new,
@@ -335,7 +292,7 @@ export async function handleProductRefresh(ctx: TaskContext): Promise<TaskResult
} catch (error: unknown) {
const errorMessage = error instanceof Error ? error.message : 'Unknown error';
console.error(`[ProductResync] Error for dispensary ${dispensaryId}:`, errorMessage);
console.error(`[ProductRefresh] Error for dispensary ${dispensaryId}:`, errorMessage);
return {
success: false,
error: errorMessage,

View File

@@ -1,8 +1,16 @@
/**
* Store Discovery Handler
*
* Discovers new stores by crawling location APIs and adding them
* to discovery_locations table.
* Per TASK_WORKFLOW_2024-12-10.md: Discovers new stores and returns their IDs for task chaining.
*
* Flow:
* 1. For each active state, run Dutchie discovery
* 2. Discover locations via GraphQL
* 3. Auto-promote valid locations to dispensaries table
* 4. Return newStoreIds[] for chaining to payload_fetch
*
* Chaining:
* store_discovery → (returns newStoreIds) → payload_fetch → product_refresh
*/
import { TaskContext, TaskResult } from '../task-worker';
@@ -10,23 +18,25 @@ import { discoverState } from '../../discovery';
export async function handleStoreDiscovery(ctx: TaskContext): Promise<TaskResult> {
const { pool, task } = ctx;
const platform = task.platform || 'default';
const platform = task.platform || 'dutchie';
console.log(`[StoreDiscovery] Starting discovery for platform: ${platform}`);
try {
// Get states to discover
const statesResult = await pool.query(`
SELECT code FROM states WHERE active = true ORDER BY code
SELECT code FROM states WHERE is_active = true ORDER BY code
`);
const stateCodes = statesResult.rows.map(r => r.code);
if (stateCodes.length === 0) {
return { success: true, storesDiscovered: 0, message: 'No active states to discover' };
return { success: true, storesDiscovered: 0, newStoreIds: [], message: 'No active states to discover' };
}
let totalDiscovered = 0;
let totalPromoted = 0;
// Per TASK_WORKFLOW_2024-12-10.md: Collect all new store IDs for task chaining
const allNewStoreIds: number[] = [];
// Run discovery for each state
for (const stateCode of stateCodes) {
@@ -39,6 +49,13 @@ export async function handleStoreDiscovery(ctx: TaskContext): Promise<TaskResult
const result = await discoverState(pool, stateCode);
totalDiscovered += result.totalLocationsFound || 0;
totalPromoted += result.totalLocationsUpserted || 0;
// Per TASK_WORKFLOW_2024-12-10.md: Collect new IDs for chaining
if (result.newDispensaryIds && result.newDispensaryIds.length > 0) {
allNewStoreIds.push(...result.newDispensaryIds);
console.log(`[StoreDiscovery] ${stateCode}: ${result.newDispensaryIds.length} new stores`);
}
console.log(`[StoreDiscovery] ${stateCode}: found ${result.totalLocationsFound}, upserted ${result.totalLocationsUpserted}`);
} catch (error: unknown) {
const errorMessage = error instanceof Error ? error.message : 'Unknown error';
@@ -47,13 +64,15 @@ export async function handleStoreDiscovery(ctx: TaskContext): Promise<TaskResult
}
}
console.log(`[StoreDiscovery] Complete: ${totalDiscovered} discovered, ${totalPromoted} promoted`);
console.log(`[StoreDiscovery] Complete: ${totalDiscovered} discovered, ${totalPromoted} promoted, ${allNewStoreIds.length} new stores`);
return {
success: true,
storesDiscovered: totalDiscovered,
storesPromoted: totalPromoted,
statesProcessed: stateCodes.length,
// Per TASK_WORKFLOW_2024-12-10.md: Return new IDs for task chaining
newStoreIds: allNewStoreIds,
};
} catch (error: unknown) {
const errorMessage = error instanceof Error ? error.message : 'Unknown error';
@@ -61,6 +80,7 @@ export async function handleStoreDiscovery(ctx: TaskContext): Promise<TaskResult
return {
success: false,
error: errorMessage,
newStoreIds: [],
};
}
}

View File

@@ -0,0 +1,37 @@
/**
* Task Pool State
*
* Shared state for task pool pause/resume functionality.
* This is kept separate to avoid circular dependencies between
* task-service.ts and routes/tasks.ts.
*
* State is in-memory and resets on server restart.
* By default, the pool is PAUSED (closed) - admin must explicitly start it.
* This prevents workers from immediately grabbing tasks on deploy before
* the system is ready.
*/
let taskPoolPaused = true;
export function isTaskPoolPaused(): boolean {
return taskPoolPaused;
}
export function pauseTaskPool(): void {
taskPoolPaused = true;
console.log('[TaskPool] Task pool PAUSED - workers will not pick up new tasks');
}
export function resumeTaskPool(): void {
taskPoolPaused = false;
console.log('[TaskPool] Task pool RESUMED - workers can pick up tasks');
}
export function getTaskPoolStatus(): { paused: boolean; message: string } {
return {
paused: taskPoolPaused,
message: taskPoolPaused
? 'Task pool is paused - workers will not pick up new tasks'
: 'Task pool is open - workers are picking up tasks',
};
}
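The flag above gates task claiming: while paused, workers see an effectively empty pool. A self-contained illustration of that gate (the flag and `claimTask` body here are stand-ins for `task-pool-state.ts` and `taskService.claimTask`, not the real implementations):

```typescript
// Self-contained illustration of the pause gate. Mirrors the behavior of
// isTaskPoolPaused() + claimTask(): a paused pool returns null to every
// claim attempt, so workers idle instead of picking up work.
let paused = true; // pool starts PAUSED on deploy

function resumePool(): void { paused = false; }
function pausePool(): void { paused = true; }

function claimTask(queue: string[]): string | null {
  if (paused) return null;       // workers see an empty pool while paused
  return queue.shift() ?? null;  // otherwise claim the next queued task
}
```

Because the state is in-memory, a server restart re-pauses the pool, which is exactly the deploy-safety behavior the comment describes.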

View File

@@ -9,6 +9,7 @@
*/
import { pool } from '../db/pool';
import { isTaskPoolPaused } from './task-pool-state';
// Helper to check if a table exists
async function tableExists(tableName: string): Promise<boolean> {
@@ -21,11 +22,15 @@ async function tableExists(tableName: string): Promise<boolean> {
return result.rows[0].exists;
}
// Per TASK_WORKFLOW_2024-12-10.md: Task roles
// payload_fetch: Hits Dutchie API, saves raw payload to filesystem
// product_refresh: Reads local payload, normalizes, upserts to DB
export type TaskRole =
| 'store_discovery'
| 'entry_point_discovery'
| 'product_discovery'
| 'product_refresh'
| 'payload_fetch' // NEW: Fetches from API, saves to disk
| 'product_refresh' // CHANGED: Now reads from local payload
| 'analytics_refresh';
export type TaskStatus =
@@ -55,6 +60,7 @@ export interface WorkerTask {
error_message: string | null;
retry_count: number;
max_retries: number;
payload: Record<string, unknown> | null; // Per TASK_WORKFLOW_2024-12-10.md: Task chaining data
created_at: Date;
updated_at: Date;
}
@@ -65,6 +71,7 @@ export interface CreateTaskParams {
platform?: string;
priority?: number;
scheduled_for?: Date;
payload?: Record<string, unknown>; // Per TASK_WORKFLOW_2024-12-10.md: For task chaining data
}
export interface CapacityMetrics {
@@ -96,8 +103,8 @@ class TaskService {
*/
async createTask(params: CreateTaskParams): Promise<WorkerTask> {
const result = await pool.query(
`INSERT INTO worker_tasks (role, dispensary_id, platform, priority, scheduled_for)
VALUES ($1, $2, $3, $4, $5)
`INSERT INTO worker_tasks (role, dispensary_id, platform, priority, scheduled_for, payload)
VALUES ($1, $2, $3, $4, $5, $6)
RETURNING *`,
[
params.role,
@@ -105,6 +112,7 @@ class TaskService {
params.platform ?? null,
params.priority ?? 0,
params.scheduled_for ?? null,
params.payload ? JSON.stringify(params.payload) : null,
]
);
return result.rows[0] as WorkerTask;
@@ -142,8 +150,14 @@ class TaskService {
/**
* Claim a task atomically for a worker
* If role is null, claims ANY available task (role-agnostic worker)
* Returns null if task pool is paused.
*/
async claimTask(role: TaskRole | null, workerId: string): Promise<WorkerTask | null> {
// Check if task pool is paused - don't claim any tasks
if (isTaskPoolPaused()) {
return null;
}
if (role) {
// Role-specific claiming - use the SQL function
const result = await pool.query(
@@ -401,6 +415,17 @@ class TaskService {
/**
* Chain next task after completion
* Called automatically when a task completes successfully
*
* Per TASK_WORKFLOW_2024-12-10.md: Task chaining flow:
*
* Discovery flow (new stores):
* store_discovery → product_discovery → payload_fetch → product_refresh
*
* Scheduled flow (existing stores):
* payload_fetch → product_refresh
*
* Note: entry_point_discovery is deprecated since platform_dispensary_id
* is now resolved during store promotion.
*/
async chainNextTask(completedTask: WorkerTask): Promise<WorkerTask | null> {
if (completedTask.status !== 'completed') {
@@ -409,12 +434,14 @@ class TaskService {
switch (completedTask.role) {
case 'store_discovery': {
// New stores discovered -> create entry_point_discovery tasks
// Per TASK_WORKFLOW_2024-12-10.md: New stores discovered -> create product_discovery tasks
// Skip entry_point_discovery since platform_dispensary_id is set during promotion
const newStoreIds = (completedTask.result as { newStoreIds?: number[] })?.newStoreIds;
if (newStoreIds && newStoreIds.length > 0) {
console.log(`[TaskService] Chaining ${newStoreIds.length} product_discovery tasks for new stores`);
for (const storeId of newStoreIds) {
await this.createTask({
role: 'entry_point_discovery',
role: 'product_discovery',
dispensary_id: storeId,
platform: completedTask.platform ?? undefined,
priority: 10, // High priority for new stores
@@ -425,7 +452,8 @@ class TaskService {
}
case 'entry_point_discovery': {
// Entry point resolved -> create product_discovery task
// DEPRECATED: Entry point resolution now happens during store promotion
// Kept for backward compatibility with any in-flight tasks
const success = (completedTask.result as { success?: boolean })?.success;
if (success && completedTask.dispensary_id) {
return this.createTask({
@@ -439,8 +467,15 @@ class TaskService {
}
case 'product_discovery': {
// Product discovery done -> store is now ready for regular resync
// No immediate chaining needed; will be picked up by daily batch generation
// Per TASK_WORKFLOW_2024-12-10.md: Product discovery chains internally to payload_fetch
// No external chaining needed - handleProductDiscovery calls handlePayloadFetch directly
break;
}
case 'payload_fetch': {
// Per TASK_WORKFLOW_2024-12-10.md: payload_fetch chains to product_refresh
// This is handled internally by the payload_fetch handler via taskService.createTask
// No external chaining needed here
break;
}
}
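The switch above encodes the chain graph. A compact summary of the routing after this change (hypothetical lookup table, for illustration only; the real logic lives in `chainNextTask`):

```typescript
// Hypothetical summary of chainNextTask's routing after this change.
// 'internal' marks roles whose handler queues the next task itself,
// so chainNextTask deliberately does nothing for them.
type Role = 'store_discovery' | 'entry_point_discovery' | 'product_discovery'
  | 'payload_fetch' | 'product_refresh' | 'analytics_refresh';

const NEXT_ROLE: Record<Role, Role | 'internal' | null> = {
  store_discovery: 'product_discovery',       // one task per newStoreId
  entry_point_discovery: 'product_discovery', // deprecated; kept for in-flight tasks
  product_discovery: 'internal',              // delegates to handlePayloadFetch
  payload_fetch: 'internal',                  // handler queues product_refresh itself
  product_refresh: null,                      // end of the chain
  analytics_refresh: null,
};
```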

View File

@@ -52,6 +52,8 @@ import { CrawlRotator } from '../services/crawl-rotator';
import { setCrawlRotator } from '../platforms/dutchie';
// Task handlers by role
// Per TASK_WORKFLOW_2024-12-10.md: payload_fetch and product_refresh are now separate
import { handlePayloadFetch } from './handlers/payload-fetch';
import { handleProductRefresh } from './handlers/product-refresh';
import { handleProductDiscovery } from './handlers/product-discovery';
import { handleStoreDiscovery } from './handlers/store-discovery';
@@ -80,8 +82,12 @@ export interface TaskResult {
type TaskHandler = (ctx: TaskContext) => Promise<TaskResult>;
// Per TASK_WORKFLOW_2024-12-10.md: Handler registry
// payload_fetch: Fetches from Dutchie API, saves to disk, chains to product_refresh
// product_refresh: Reads local payload, normalizes, upserts to DB
const TASK_HANDLERS: Record<TaskRole, TaskHandler> = {
product_refresh: handleProductRefresh,
payload_fetch: handlePayloadFetch, // NEW: API fetch -> disk
product_refresh: handleProductRefresh, // CHANGED: disk -> DB
product_discovery: handleProductDiscovery,
store_discovery: handleStoreDiscovery,
entry_point_discovery: handleEntryPointDiscovery,
@@ -110,23 +116,80 @@ export class TaskWorker {
* Initialize stealth systems (proxy rotation, fingerprints)
* Called once on worker startup before processing any tasks.
*
* IMPORTANT: Proxies are REQUIRED. Workers will fail to start if no proxies are available.
* IMPORTANT: Proxies are REQUIRED. Workers will wait until proxies are available.
* Workers listen for PostgreSQL NOTIFY 'proxy_added' to wake up immediately when proxies are added.
*/
private async initializeStealth(): Promise<void> {
const MAX_WAIT_MINUTES = 60;
const POLL_INTERVAL_MS = 30000; // 30 seconds fallback polling
const maxAttempts = (MAX_WAIT_MINUTES * 60 * 1000) / POLL_INTERVAL_MS;
let attempts = 0;
let notifyClient: any = null;
// Set up PostgreSQL LISTEN for proxy notifications
try {
notifyClient = await this.pool.connect();
await notifyClient.query('LISTEN proxy_added');
console.log(`[TaskWorker] Listening for proxy_added notifications...`);
} catch (err: any) {
console.log(`[TaskWorker] Could not set up LISTEN (will poll): ${err.message}`);
}
// Create a promise that resolves when notified
let notifyResolve: (() => void) | null = null;
if (notifyClient) {
notifyClient.on('notification', (msg: any) => {
if (msg.channel === 'proxy_added') {
console.log(`[TaskWorker] Received proxy_added notification!`);
if (notifyResolve) notifyResolve();
}
});
}
try {
while (attempts < maxAttempts) {
try {
// Load proxies from database
await this.crawlRotator.initialize();
const stats = this.crawlRotator.proxy.getStats();
if (stats.activeProxies === 0) {
throw new Error('No active proxies available. Workers MUST use proxies for all requests. Add proxies to the database before starting workers.');
}
if (stats.activeProxies > 0) {
console.log(`[TaskWorker] Loaded ${stats.activeProxies} proxies (${stats.avgSuccessRate.toFixed(1)}% avg success rate)`);
// Wire rotator to Dutchie client - proxies will be used for ALL requests
setCrawlRotator(this.crawlRotator);
console.log(`[TaskWorker] Stealth initialized: ${this.crawlRotator.userAgent.getCount()} fingerprints, proxy REQUIRED for all requests`);
return;
}
attempts++;
console.log(`[TaskWorker] No active proxies available (attempt ${attempts}). Waiting for proxies...`);
// Wait for either notification or timeout
await new Promise<void>((resolve) => {
notifyResolve = resolve;
setTimeout(resolve, POLL_INTERVAL_MS);
});
} catch (error: any) {
attempts++;
console.log(`[TaskWorker] Error loading proxies (attempt ${attempts}): ${error.message}. Retrying...`);
await this.sleep(POLL_INTERVAL_MS);
}
}
throw new Error(`No active proxies available after waiting ${MAX_WAIT_MINUTES} minutes. Add proxies to the database.`);
} finally {
// Clean up LISTEN connection
if (notifyClient) {
try {
await notifyClient.query('UNLISTEN proxy_added');
notifyClient.release();
} catch {
// Ignore cleanup errors
}
}
}
}
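The retry loop above races a `proxy_added` notification against a 30-second poll timeout, so workers wake immediately when proxies arrive instead of sleeping out the interval. That wait-with-early-wake pattern can be isolated as (self-contained sketch, names hypothetical):

```typescript
// Sketch of the wake-early wait used in initializeStealth: the promise resolves
// when either wake() is called (e.g. from a pg 'notification' handler) or the
// poll interval elapses, whichever comes first.
function waitForWake(intervalMs: number): { promise: Promise<void>; wake: () => void } {
  let wake!: () => void;
  const promise = new Promise<void>((resolve) => {
    const timer = setTimeout(resolve, intervalMs);
    wake = () => { clearTimeout(timer); resolve(); }; // early wake cancels the timer
  });
  return { promise, wake };
}
```

In the worker, the notification callback simply calls `wake()`; resolving an already-resolved promise is a no-op, so a late timeout is harmless.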
/**
@@ -414,11 +477,13 @@ export class TaskWorker {
async function main(): Promise<void> {
const role = process.env.WORKER_ROLE as TaskRole | undefined;
// Per TASK_WORKFLOW_2024-12-10.md: Valid task roles
const validRoles: TaskRole[] = [
'store_discovery',
'entry_point_discovery',
'product_discovery',
'product_refresh',
'payload_fetch', // NEW: Fetches from API, saves to disk
'product_refresh', // CHANGED: Reads from disk, processes to DB
'analytics_refresh',
];

backend/src/types/user-agents.d.ts vendored Normal file
View File

@@ -0,0 +1,49 @@
/**
* Type declarations for user-agents npm package
* Per workflow-12102025.md: Used for realistic UA generation with market-share weighting
*/
declare module 'user-agents' {
interface UserAgentData {
userAgent: string;
platform: string;
screenWidth: number;
screenHeight: number;
viewportWidth: number;
viewportHeight: number;
deviceCategory: 'desktop' | 'mobile' | 'tablet';
appName: string;
connection?: {
downlink: number;
effectiveType: string;
rtt: number;
};
}
interface UserAgentOptions {
deviceCategory?: 'desktop' | 'mobile' | 'tablet';
platform?: RegExp | string;
screenWidth?: RegExp | { min?: number; max?: number };
screenHeight?: RegExp | { min?: number; max?: number };
}
interface UserAgentInstance {
data: UserAgentData;
toString(): string;
random(): UserAgentInstance;
}
class UserAgent {
constructor(options?: UserAgentOptions | UserAgentOptions[]);
data: UserAgentData;
toString(): string;
random(): UserAgentInstance;
}
// Make it callable
interface UserAgent {
(): UserAgentInstance;
}
export default UserAgent;
}
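The header comment mentions market-share weighting. The `user-agents` package ships its own weighting data; as a sketch of what weighted selection means (illustrative only, with assumed weights, not the package's implementation):

```typescript
// Illustrative weighted sampler. The real user-agents package bundles its own
// market-share data and selection logic; the weights here are assumptions.
interface WeightedUA { ua: string; weight: number }

function pickWeighted(entries: WeightedUA[], rand: () => number = Math.random): string {
  const total = entries.reduce((sum, e) => sum + e.weight, 0);
  let r = rand() * total;
  for (const e of entries) {
    r -= e.weight;                 // walk the cumulative distribution
    if (r <= 0) return e.ua;
  }
  return entries[entries.length - 1].ua; // guard against float rounding
}
```

A UA with weight 3 is drawn three times as often as one with weight 1, which is how browser market share skews the fingerprint pool toward common configurations.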

View File

@@ -0,0 +1,406 @@
/**
* Payload Storage Utility
*
* Per TASK_WORKFLOW_2024-12-10.md: Store raw GraphQL payloads for historical analysis.
*
* Design Pattern: Metadata/Payload Separation
* - Metadata in PostgreSQL (raw_crawl_payloads table): Small, indexed, queryable
* - Payload on filesystem: Gzipped JSON at storage_path
*
* Storage structure:
* /storage/payloads/{year}/{month}/{day}/store_{dispensary_id}_{timestamp}.json.gz
*
* Benefits:
* - Compare any two crawls to see what changed
* - Replay/re-normalize historical data if logic changes
* - Debug issues by seeing exactly what the API returned
* - DB stays small, backups stay fast
* - ~90% compression (1.5MB -> 150KB per crawl)
*/
import * as fs from 'fs';
import * as path from 'path';
import * as zlib from 'zlib';
import { promisify } from 'util';
import { Pool } from 'pg';
import * as crypto from 'crypto';
const gzip = promisify(zlib.gzip);
const gunzip = promisify(zlib.gunzip);
// Base path for payload storage (matches image storage pattern)
const PAYLOAD_BASE_PATH = process.env.PAYLOAD_STORAGE_PATH || './storage/payloads';
/**
* Result from saving a payload
*/
export interface SavePayloadResult {
id: number;
storagePath: string;
sizeBytes: number;
sizeBytesRaw: number;
checksum: string;
}
/**
* Result from loading a payload
*/
export interface LoadPayloadResult {
payload: any;
metadata: {
id: number;
dispensaryId: number;
crawlRunId: number | null;
productCount: number;
fetchedAt: Date;
storagePath: string;
};
}
/**
* Generate storage path for a payload
*
* Format: /storage/payloads/{year}/{month}/{day}/store_{dispensary_id}_{timestamp}.json.gz
*/
function generateStoragePath(dispensaryId: number, timestamp: Date): string {
const year = timestamp.getFullYear();
const month = String(timestamp.getMonth() + 1).padStart(2, '0');
const day = String(timestamp.getDate()).padStart(2, '0');
const ts = timestamp.getTime();
return path.join(
PAYLOAD_BASE_PATH,
String(year),
month,
day,
`store_${dispensaryId}_${ts}.json.gz`
);
}
/**
* Ensure directory exists for a file path
*/
async function ensureDir(filePath: string): Promise<void> {
const dir = path.dirname(filePath);
await fs.promises.mkdir(dir, { recursive: true });
}
/**
* Calculate SHA256 checksum of data
*/
function calculateChecksum(data: Buffer): string {
return crypto.createHash('sha256').update(data).digest('hex');
}
/**
* Save a raw crawl payload to filesystem and record metadata in DB
*
* @param pool - Database connection pool
* @param dispensaryId - ID of the dispensary
* @param payload - Raw JSON payload from GraphQL
* @param crawlRunId - Optional crawl_run ID for linking
* @param productCount - Number of products in payload
* @returns SavePayloadResult with file info and DB record ID
*/
export async function saveRawPayload(
pool: Pool,
dispensaryId: number,
payload: any,
crawlRunId: number | null = null,
productCount: number = 0
): Promise<SavePayloadResult> {
const timestamp = new Date();
const storagePath = generateStoragePath(dispensaryId, timestamp);
// Serialize and compress
const jsonStr = JSON.stringify(payload);
const rawSize = Buffer.byteLength(jsonStr, 'utf8');
const compressed = await gzip(Buffer.from(jsonStr, 'utf8'));
const compressedSize = compressed.length;
const checksum = calculateChecksum(compressed);
// Write to filesystem
await ensureDir(storagePath);
await fs.promises.writeFile(storagePath, compressed);
// Record metadata in DB
const result = await pool.query(`
INSERT INTO raw_crawl_payloads (
crawl_run_id,
dispensary_id,
storage_path,
product_count,
size_bytes,
size_bytes_raw,
fetched_at,
checksum_sha256
) VALUES ($1, $2, $3, $4, $5, $6, $7, $8)
RETURNING id
`, [
crawlRunId,
dispensaryId,
storagePath,
productCount,
compressedSize,
rawSize,
timestamp,
checksum
]);
console.log(`[PayloadStorage] Saved payload for store ${dispensaryId}: ${storagePath} (${(compressedSize / 1024).toFixed(1)}KB compressed, ${(rawSize / 1024).toFixed(1)}KB raw)`);
return {
id: result.rows[0].id,
storagePath,
sizeBytes: compressedSize,
sizeBytesRaw: rawSize,
checksum
};
}
/**
* Load a raw payload from filesystem by metadata ID
*
* @param pool - Database connection pool
* @param payloadId - ID from raw_crawl_payloads table
* @returns LoadPayloadResult with parsed payload and metadata
*/
export async function loadRawPayloadById(
pool: Pool,
payloadId: number
): Promise<LoadPayloadResult | null> {
const result = await pool.query(`
SELECT id, dispensary_id, crawl_run_id, storage_path, product_count, fetched_at
FROM raw_crawl_payloads
WHERE id = $1
`, [payloadId]);
if (result.rows.length === 0) {
return null;
}
const row = result.rows[0];
const payload = await loadPayloadFromPath(row.storage_path);
return {
payload,
metadata: {
id: row.id,
dispensaryId: row.dispensary_id,
crawlRunId: row.crawl_run_id,
productCount: row.product_count,
fetchedAt: row.fetched_at,
storagePath: row.storage_path
}
};
}
/**
* Load a raw payload directly from filesystem path
*
* @param storagePath - Path to gzipped JSON file
* @returns Parsed JSON payload
*/
export async function loadPayloadFromPath(storagePath: string): Promise<any> {
const compressed = await fs.promises.readFile(storagePath);
const decompressed = await gunzip(compressed);
return JSON.parse(decompressed.toString('utf8'));
}
/**
* Get the latest payload for a dispensary
*
* @param pool - Database connection pool
* @param dispensaryId - ID of the dispensary
* @returns LoadPayloadResult or null if none exists
*/
export async function getLatestPayload(
pool: Pool,
dispensaryId: number
): Promise<LoadPayloadResult | null> {
const result = await pool.query(`
SELECT id, dispensary_id, crawl_run_id, storage_path, product_count, fetched_at
FROM raw_crawl_payloads
WHERE dispensary_id = $1
ORDER BY fetched_at DESC
LIMIT 1
`, [dispensaryId]);
if (result.rows.length === 0) {
return null;
}
const row = result.rows[0];
const payload = await loadPayloadFromPath(row.storage_path);
return {
payload,
metadata: {
id: row.id,
dispensaryId: row.dispensary_id,
crawlRunId: row.crawl_run_id,
productCount: row.product_count,
fetchedAt: row.fetched_at,
storagePath: row.storage_path
}
};
}
/**
* Get recent payloads for comparison (latest and previous by default)
*
* @param pool - Database connection pool
* @param dispensaryId - ID of the dispensary
* @param limit - Number of recent payloads to retrieve (default 2)
* @returns Array of LoadPayloadResult, most recent first
*/
export async function getRecentPayloads(
pool: Pool,
dispensaryId: number,
limit: number = 2
): Promise<LoadPayloadResult[]> {
const result = await pool.query(`
SELECT id, dispensary_id, crawl_run_id, storage_path, product_count, fetched_at
FROM raw_crawl_payloads
WHERE dispensary_id = $1
ORDER BY fetched_at DESC
LIMIT $2
`, [dispensaryId, limit]);
const payloads: LoadPayloadResult[] = [];
for (const row of result.rows) {
const payload = await loadPayloadFromPath(row.storage_path);
payloads.push({
payload,
metadata: {
id: row.id,
dispensaryId: row.dispensary_id,
crawlRunId: row.crawl_run_id,
productCount: row.product_count,
fetchedAt: row.fetched_at,
storagePath: row.storage_path
}
});
}
return payloads;
}
/**
* List payload metadata without loading files (for browsing/pagination)
*
* @param pool - Database connection pool
* @param options - Query options
* @returns Array of metadata rows
*/
export async function listPayloadMetadata(
pool: Pool,
options: {
dispensaryId?: number;
startDate?: Date;
endDate?: Date;
limit?: number;
offset?: number;
} = {}
): Promise<Array<{
id: number;
dispensaryId: number;
crawlRunId: number | null;
storagePath: string;
productCount: number;
sizeBytes: number;
sizeBytesRaw: number;
fetchedAt: Date;
}>> {
const conditions: string[] = [];
const params: any[] = [];
let paramIndex = 1;
if (options.dispensaryId != null) {
conditions.push(`dispensary_id = $${paramIndex++}`);
params.push(options.dispensaryId);
}
if (options.startDate) {
conditions.push(`fetched_at >= $${paramIndex++}`);
params.push(options.startDate);
}
if (options.endDate) {
conditions.push(`fetched_at <= $${paramIndex++}`);
params.push(options.endDate);
}
const whereClause = conditions.length > 0 ? `WHERE ${conditions.join(' AND ')}` : '';
const limit = options.limit || 50;
const offset = options.offset || 0;
params.push(limit, offset);
const result = await pool.query(`
SELECT
id,
dispensary_id,
crawl_run_id,
storage_path,
product_count,
size_bytes,
size_bytes_raw,
fetched_at
FROM raw_crawl_payloads
${whereClause}
ORDER BY fetched_at DESC
LIMIT $${paramIndex++} OFFSET $${paramIndex}
`, params);
return result.rows.map(row => ({
id: row.id,
dispensaryId: row.dispensary_id,
crawlRunId: row.crawl_run_id,
storagePath: row.storage_path,
productCount: row.product_count,
sizeBytes: row.size_bytes,
sizeBytesRaw: row.size_bytes_raw,
fetchedAt: row.fetched_at
}));
}
/**
* Delete old payloads (for retention policy)
*
* @param pool - Database connection pool
* @param olderThan - Delete payloads older than this date
* @returns Number of payloads deleted
*/
export async function deleteOldPayloads(
pool: Pool,
olderThan: Date
): Promise<number> {
// Get paths first
const result = await pool.query(`
SELECT id, storage_path FROM raw_crawl_payloads
WHERE fetched_at < $1
`, [olderThan]);
// Delete files
for (const row of result.rows) {
try {
await fs.promises.unlink(row.storage_path);
} catch (err: any) {
if (err.code !== 'ENOENT') {
console.warn(`[PayloadStorage] Failed to delete ${row.storage_path}: ${err.message}`);
}
}
}
// Delete DB records
await pool.query(`
DELETE FROM raw_crawl_payloads
WHERE fetched_at < $1
`, [olderThan]);
console.log(`[PayloadStorage] Deleted ${result.rows.length} payloads older than ${olderThan.toISOString()}`);
return result.rows.length;
}

View File

@@ -7,8 +7,8 @@
<title>CannaIQ - Cannabis Menu Intelligence Platform</title>
<meta name="description" content="CannaIQ provides real-time cannabis dispensary menu data, product tracking, and analytics for dispensaries across Arizona." />
<meta name="keywords" content="cannabis, dispensary, menu, products, analytics, Arizona" />
<script type="module" crossorigin src="/assets/index-BML8-px1.js"></script>
<link rel="stylesheet" crossorigin href="/assets/index-B2gR-58G.css">
<script type="module" crossorigin src="/assets/index-Dq9S0rVi.js"></script>
<link rel="stylesheet" crossorigin href="/assets/index-DhM09B-d.css">
</head>
<body>
<div id="root"></div>

View File

@@ -8,6 +8,7 @@ import { ProductDetail } from './pages/ProductDetail';
import { Stores } from './pages/Stores';
import { Dispensaries } from './pages/Dispensaries';
import { DispensaryDetail } from './pages/DispensaryDetail';
import { DispensarySchedule } from './pages/DispensarySchedule';
import { StoreDetail } from './pages/StoreDetail';
import { StoreBrands } from './pages/StoreBrands';
import { StoreSpecials } from './pages/StoreSpecials';
@@ -66,6 +67,7 @@ export default function App() {
<Route path="/stores" element={<PrivateRoute><Stores /></PrivateRoute>} />
<Route path="/dispensaries" element={<PrivateRoute><Dispensaries /></PrivateRoute>} />
<Route path="/dispensaries/:state/:city/:slug" element={<PrivateRoute><DispensaryDetail /></PrivateRoute>} />
<Route path="/dispensaries/:state/:city/:slug/schedule" element={<PrivateRoute><DispensarySchedule /></PrivateRoute>} />
<Route path="/stores/:state/:storeName/:slug/brands" element={<PrivateRoute><StoreBrands /></PrivateRoute>} />
<Route path="/stores/:state/:storeName/:slug/specials" element={<PrivateRoute><StoreSpecials /></PrivateRoute>} />
<Route path="/stores/:state/:storeName/:slug" element={<PrivateRoute><StoreDetail /></PrivateRoute>} />

View File

@@ -131,7 +131,7 @@ export function Layout({ children }: LayoutProps) {
<span className="text-lg font-bold text-gray-900">CannaIQ</span>
{versionInfo && (
<p className="text-xs text-gray-400">
v{versionInfo.version} ({versionInfo.git_sha}) {versionInfo.build_time !== 'unknown' && `- ${new Date(versionInfo.build_time).toLocaleDateString()}`}
{versionInfo.git_sha || 'dev'}
</p>
)}
</div>

View File

@@ -983,6 +983,47 @@ class ApiClient {
}>(`/api/markets/stores/${id}/categories`);
}
async getStoreCrawlHistory(id: number, limit = 50) {
return this.request<{
dispensary: {
id: number;
name: string;
dba_name: string | null;
slug: string;
state: string;
city: string;
menu_type: string | null;
platform_dispensary_id: string | null;
last_menu_scrape: string | null;
} | null;
history: Array<{
id: number;
runId: string | null;
profileKey: string | null;
crawlerModule: string | null;
stateAtStart: string | null;
stateAtEnd: string | null;
totalSteps: number;
durationMs: number | null;
success: boolean;
errorMessage: string | null;
productsFound: number | null;
startedAt: string | null;
completedAt: string | null;
}>;
nextSchedule: {
scheduleId: number;
jobName: string;
enabled: boolean;
baseIntervalMinutes: number;
jitterMinutes: number;
nextRunAt: string | null;
lastRunAt: string | null;
lastStatus: string | null;
} | null;
}>(`/api/markets/stores/${id}/crawl-history?limit=${limit}`);
}
// Global Brands/Categories (from v_brands/v_categories views)
async getMarketBrands(params?: { limit?: number; offset?: number }) {
const searchParams = new URLSearchParams();
@@ -1518,10 +1559,11 @@ class ApiClient {
}
// Intelligence API
async getIntelligenceBrands(params?: { limit?: number; offset?: number }) {
async getIntelligenceBrands(params?: { limit?: number; offset?: number; state?: string }) {
const searchParams = new URLSearchParams();
if (params?.limit) searchParams.append('limit', params.limit.toString());
if (params?.offset) searchParams.append('offset', params.offset.toString());
if (params?.state) searchParams.append('state', params.state);
const queryString = searchParams.toString() ? `?${searchParams.toString()}` : '';
return this.request<{
brands: Array<{
@@ -1536,7 +1578,10 @@ class ApiClient {
}>(`/api/admin/intelligence/brands${queryString}`);
}
async getIntelligencePricing() {
async getIntelligencePricing(params?: { state?: string }) {
const searchParams = new URLSearchParams();
if (params?.state) searchParams.append('state', params.state);
const queryString = searchParams.toString() ? `?${searchParams.toString()}` : '';
return this.request<{
byCategory: Array<{
category: string;
@@ -1552,7 +1597,7 @@ class ApiClient {
maxPrice: number;
totalProducts: number;
};
}>('/api/admin/intelligence/pricing');
}>(`/api/admin/intelligence/pricing${queryString}`);
}
async getIntelligenceStoreActivity(params?: { state?: string; chainId?: number; limit?: number }) {
@@ -2884,6 +2929,27 @@ class ApiClient {
`/api/tasks/store/${dispensaryId}/active`
);
}
// Task Pool Control
async getTaskPoolStatus() {
return this.request<{ success: boolean; paused: boolean; message: string }>(
'/api/tasks/pool/status'
);
}
async pauseTaskPool() {
return this.request<{ success: boolean; paused: boolean; message: string }>(
'/api/tasks/pool/pause',
{ method: 'POST' }
);
}
async resumeTaskPool() {
return this.request<{ success: boolean; paused: boolean; message: string }>(
'/api/tasks/pool/resume',
{ method: 'POST' }
);
}
}
export const api = new ApiClient(API_URL);
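A Start/Stop Pool toggle built on these three methods would read the current status and then call the opposite endpoint. A hypothetical sketch, using a structural `PoolClient` interface as a stand-in for `ApiClient` (only the three pool methods are assumed):

```typescript
// Structural stand-in for the ApiClient pool methods.
interface PoolClient {
  getTaskPoolStatus(): Promise<{ success: boolean; paused: boolean; message: string }>;
  pauseTaskPool(): Promise<{ success: boolean; paused: boolean; message: string }>;
  resumeTaskPool(): Promise<{ success: boolean; paused: boolean; message: string }>;
}

// Flip the pool: resume when paused, pause when running.
// Returns the paused state reported after the toggle.
async function togglePool(client: PoolClient): Promise<boolean> {
  const { paused } = await client.getTaskPoolStatus();
  const result = paused ? await client.resumeTaskPool() : await client.pauseTaskPool();
  return result.paused;
}
```

Reading the status first (rather than tracking it client-side) keeps the button correct even if another admin toggled the pool in the meantime.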

View File

@@ -1,6 +1,5 @@
import { useEffect, useState } from 'react';
import { Layout } from '../components/Layout';
import { HealthPanel } from '../components/HealthPanel';
import { api } from '../lib/api';
import { useNavigate } from 'react-router-dom';
import {
@@ -42,7 +41,6 @@ export function Dashboard() {
const [activity, setActivity] = useState<any>(null);
const [nationalStats, setNationalStats] = useState<any>(null);
const [loading, setLoading] = useState(true);
const [refreshing, setRefreshing] = useState(false);
const [pendingChangesCount, setPendingChangesCount] = useState(0);
const [showNotification, setShowNotification] = useState(false);
const [taskCounts, setTaskCounts] = useState<Record<string, number> | null>(null);
@@ -93,10 +91,7 @@ export function Dashboard() {
}
};
const loadData = async (isRefresh = false) => {
if (isRefresh) {
setRefreshing(true);
}
const loadData = async () => {
try {
// Fetch dashboard data (primary data source)
const dashboard = await api.getMarketDashboard();
@@ -158,7 +153,6 @@ export function Dashboard() {
console.error('Failed to load dashboard:', error);
} finally {
setLoading(false);
setRefreshing(false);
}
};
@@ -271,23 +265,10 @@ export function Dashboard() {
<div className="space-y-8">
{/* Header */}
<div className="flex flex-col sm:flex-row sm:justify-between sm:items-center gap-4">
<div>
<h1 className="text-xl sm:text-2xl font-semibold text-gray-900">Dashboard</h1>
<p className="text-sm text-gray-500 mt-1">Monitor your dispensary data aggregation</p>
</div>
<button
onClick={() => loadData(true)}
disabled={refreshing}
className="inline-flex items-center justify-center gap-2 px-4 py-2 bg-white border border-gray-200 rounded-lg hover:bg-gray-50 transition-colors text-sm font-medium text-gray-700 self-start sm:self-auto disabled:opacity-50 disabled:cursor-not-allowed"
>
<RefreshCw className={`w-4 h-4 ${refreshing ? 'animate-spin' : ''}`} />
{refreshing ? 'Refreshing...' : 'Refresh'}
</button>
</div>
{/* System Health */}
<HealthPanel showQueues={false} refreshInterval={60000} />
{/* Stats Grid */}
<div className="grid grid-cols-2 lg:grid-cols-3 gap-3 sm:gap-6">

View File

@@ -161,23 +161,6 @@ export function Dispensaries() {
))}
</select>
</div>
<div>
<label className="block text-sm font-medium text-gray-700 mb-2">
Filter by Status
</label>
<select
value={filterStatus}
onChange={(e) => handleStatusFilter(e.target.value)}
className={`w-full px-3 py-2 border rounded-lg focus:ring-2 focus:ring-blue-500 focus:border-blue-500 ${
filterStatus === 'dropped' ? 'border-red-300 bg-red-50' : 'border-gray-300'
}`}
>
<option value="">All Statuses</option>
<option value="open">Open</option>
<option value="dropped">Dropped (Needs Review)</option>
<option value="closed">Closed</option>
</select>
</div>
</div>
</div>

View File

@@ -204,47 +204,6 @@ export function DispensaryDetail() {
Back to Dispensaries
</button>
{/* Update Dropdown */}
<div className="relative">
<button
onClick={() => setShowUpdateDropdown(!showUpdateDropdown)}
disabled={isUpdating}
className="flex items-center gap-2 px-4 py-2 text-sm font-medium text-white bg-blue-600 hover:bg-blue-700 rounded-lg disabled:opacity-50 disabled:cursor-not-allowed"
>
<RefreshCw className={`w-4 h-4 ${isUpdating ? 'animate-spin' : ''}`} />
{isUpdating ? 'Updating...' : 'Update'}
{!isUpdating && <ChevronDown className="w-4 h-4" />}
</button>
{showUpdateDropdown && !isUpdating && (
<div className="absolute right-0 mt-2 w-48 bg-white rounded-lg shadow-lg border border-gray-200 z-10">
<button
onClick={() => handleUpdate('products')}
className="w-full text-left px-4 py-2 text-sm text-gray-700 hover:bg-gray-100 rounded-t-lg"
>
Products
</button>
<button
onClick={() => handleUpdate('brands')}
className="w-full text-left px-4 py-2 text-sm text-gray-700 hover:bg-gray-100"
>
Brands
</button>
<button
onClick={() => handleUpdate('specials')}
className="w-full text-left px-4 py-2 text-sm text-gray-700 hover:bg-gray-100"
>
Specials
</button>
<button
onClick={() => handleUpdate('all')}
className="w-full text-left px-4 py-2 text-sm text-gray-700 hover:bg-gray-100 rounded-b-lg border-t border-gray-200"
>
All
</button>
</div>
)}
</div>
</div>
{/* Dispensary Header */}
@@ -266,7 +225,7 @@ export function DispensaryDetail() {
<div className="flex items-center gap-2 text-sm text-gray-600 bg-gray-50 px-4 py-2 rounded-lg">
<Calendar className="w-4 h-4" />
<div>
<span className="font-medium">Last Crawl Date:</span>
<span className="font-medium">Last Updated:</span>
<span className="ml-2">
{dispensary.last_menu_scrape
? new Date(dispensary.last_menu_scrape).toLocaleDateString('en-US', {
@@ -331,7 +290,7 @@ export function DispensaryDetail() {
</a>
)}
<Link
to="/schedule"
to={`/dispensaries/${state}/${city}/${slug}/schedule`}
className="flex items-center gap-2 text-sm text-blue-600 hover:text-blue-800"
>
<Clock className="w-4 h-4" />
@@ -533,57 +492,31 @@ export function DispensaryDetail() {
`$${product.regular_price}`
) : '-'}
</td>
<td className="text-center whitespace-nowrap">
{product.quantity != null ? (
<span className={`badge badge-sm ${product.quantity > 0 ? 'badge-info' : 'badge-error'}`}>
{product.quantity}
</span>
) : '-'}
<td className="text-center whitespace-nowrap text-sm text-gray-700">
{product.quantity != null ? product.quantity : '-'}
</td>
<td className="text-center whitespace-nowrap">
{product.thc_percentage ? (
<span className="badge badge-success badge-sm">{product.thc_percentage}%</span>
) : '-'}
<td className="text-center whitespace-nowrap text-sm text-gray-700">
{product.thc_percentage ? `${product.thc_percentage}%` : '-'}
</td>
<td className="text-center whitespace-nowrap">
{product.cbd_percentage ? (
<span className="badge badge-info badge-sm">{product.cbd_percentage}%</span>
) : '-'}
<td className="text-center whitespace-nowrap text-sm text-gray-700">
{product.cbd_percentage ? `${product.cbd_percentage}%` : '-'}
</td>
<td className="text-center whitespace-nowrap">
{product.strain_type ? (
<span className="badge badge-ghost badge-sm">{product.strain_type}</span>
) : '-'}
<td className="text-center whitespace-nowrap text-sm text-gray-700">
{product.strain_type || '-'}
</td>
<td className="text-center whitespace-nowrap">
{product.in_stock ? (
<span className="badge badge-success badge-sm">Yes</span>
) : product.in_stock === false ? (
<span className="badge badge-error badge-sm">No</span>
) : '-'}
<td className="text-center whitespace-nowrap text-sm text-gray-700">
{product.in_stock ? 'Yes' : product.in_stock === false ? 'No' : '-'}
</td>
<td className="whitespace-nowrap text-xs text-gray-500">
{product.updated_at ? formatDate(product.updated_at) : '-'}
</td>
<td>
<div className="flex gap-1">
{product.dutchie_url && (
<a
href={product.dutchie_url}
target="_blank"
rel="noopener noreferrer"
className="btn btn-xs btn-outline"
>
Dutchie
</a>
)}
<button
onClick={() => navigate(`/products/${product.id}`)}
className="btn btn-xs btn-primary"
className="btn btn-xs btn-ghost text-gray-500 hover:text-gray-700"
>
Details
</button>
</div>
</td>
</tr>
))}

View File

@@ -0,0 +1,378 @@
import { useEffect, useState } from 'react';
import { useParams, useNavigate, Link } from 'react-router-dom';
import { Layout } from '../components/Layout';
import { api } from '../lib/api';
import {
ArrowLeft,
Clock,
Calendar,
CheckCircle,
XCircle,
AlertCircle,
Package,
Timer,
Building2,
} from 'lucide-react';
interface CrawlHistoryItem {
id: number;
runId: string | null;
profileKey: string | null;
crawlerModule: string | null;
stateAtStart: string | null;
stateAtEnd: string | null;
totalSteps: number;
durationMs: number | null;
success: boolean;
errorMessage: string | null;
productsFound: number | null;
startedAt: string | null;
completedAt: string | null;
}
interface NextSchedule {
scheduleId: number;
jobName: string;
enabled: boolean;
baseIntervalMinutes: number;
jitterMinutes: number;
nextRunAt: string | null;
lastRunAt: string | null;
lastStatus: string | null;
}
interface Dispensary {
id: number;
name: string;
dba_name: string | null;
slug: string;
state: string;
city: string;
menu_type: string | null;
platform_dispensary_id: string | null;
last_menu_scrape: string | null;
}
export function DispensarySchedule() {
const { state, city, slug } = useParams();
const navigate = useNavigate();
const [dispensary, setDispensary] = useState<Dispensary | null>(null);
const [history, setHistory] = useState<CrawlHistoryItem[]>([]);
const [nextSchedule, setNextSchedule] = useState<NextSchedule | null>(null);
const [loading, setLoading] = useState(true);
useEffect(() => {
loadScheduleData();
}, [slug]);
const loadScheduleData = async () => {
setLoading(true);
try {
// First get the dispensary to get the ID
const dispData = await api.getDispensary(slug!);
if (dispData?.id) {
const data = await api.getStoreCrawlHistory(dispData.id);
setDispensary(data.dispensary);
setHistory(data.history || []);
setNextSchedule(data.nextSchedule);
}
} catch (error) {
console.error('Failed to load schedule data:', error);
} finally {
setLoading(false);
}
};
const formatDate = (dateStr: string | null) => {
if (!dateStr) return 'Never';
const date = new Date(dateStr);
return date.toLocaleDateString('en-US', {
year: 'numeric',
month: 'short',
day: 'numeric',
hour: '2-digit',
minute: '2-digit',
});
};
const formatTimeAgo = (dateStr: string | null) => {
if (!dateStr) return 'Never';
const date = new Date(dateStr);
const now = new Date();
const diffMs = now.getTime() - date.getTime();
const diffMinutes = Math.floor(diffMs / (1000 * 60));
const diffHours = Math.floor(diffMs / (1000 * 60 * 60));
const diffDays = Math.floor(diffMs / (1000 * 60 * 60 * 24));
if (diffMinutes < 1) return 'Just now';
if (diffMinutes < 60) return `${diffMinutes}m ago`;
if (diffHours < 24) return `${diffHours}h ago`;
if (diffDays === 1) return 'Yesterday';
if (diffDays < 7) return `${diffDays} days ago`;
return date.toLocaleDateString();
};
const formatTimeUntil = (dateStr: string | null) => {
if (!dateStr) return 'Not scheduled';
const date = new Date(dateStr);
const now = new Date();
const diffMs = date.getTime() - now.getTime();
if (diffMs < 0) return 'Overdue';
const diffMinutes = Math.floor(diffMs / (1000 * 60));
const diffHours = Math.floor(diffMinutes / 60);
if (diffMinutes < 60) return `in ${diffMinutes}m`;
return `in ${diffHours}h ${diffMinutes % 60}m`;
};
const formatDuration = (ms: number | null) => {
if (!ms) return '-';
if (ms < 1000) return `${ms}ms`;
const seconds = Math.floor(ms / 1000);
const minutes = Math.floor(seconds / 60);
if (minutes < 1) return `${seconds}s`;
return `${minutes}m ${seconds % 60}s`;
};
const formatInterval = (baseMinutes: number, jitterMinutes: number) => {
const hours = Math.floor(baseMinutes / 60);
const mins = baseMinutes % 60;
let base = hours > 0 ? `${hours}h` : '';
if (mins > 0) base += `${mins}m`;
return `Every ${base} (+/- ${jitterMinutes}m jitter)`;
};
if (loading) {
return (
<Layout>
<div className="text-center py-12">
<div className="inline-block animate-spin rounded-full h-8 w-8 border-4 border-gray-400 border-t-transparent"></div>
<p className="mt-2 text-sm text-gray-600">Loading schedule...</p>
</div>
</Layout>
);
}
if (!dispensary) {
return (
<Layout>
<div className="text-center py-12">
<p className="text-gray-600">Dispensary not found</p>
</div>
</Layout>
);
}
// Stats from history
const successCount = history.filter(h => h.success).length;
const failureCount = history.filter(h => !h.success).length;
const lastSuccess = history.find(h => h.success);
const durations = history.filter(h => h.durationMs != null).map(h => h.durationMs as number);
const avgDuration = durations.length > 0
? Math.round(durations.reduce((sum, d) => sum + d, 0) / durations.length)
: 0;
return (
<Layout>
<div className="space-y-6">
{/* Header */}
<div className="flex items-center justify-between gap-4">
<button
onClick={() => navigate(`/dispensaries/${state}/${city}/${slug}`)}
className="flex items-center gap-2 text-sm text-gray-600 hover:text-gray-900"
>
<ArrowLeft className="w-4 h-4" />
Back to {dispensary.dba_name || dispensary.name}
</button>
</div>
{/* Dispensary Info */}
<div className="bg-white rounded-lg border border-gray-200 p-6">
<div className="flex items-start gap-4">
<div className="p-3 bg-blue-50 rounded-lg">
<Building2 className="w-8 h-8 text-blue-600" />
</div>
<div>
<h1 className="text-2xl font-bold text-gray-900">
{dispensary.dba_name || dispensary.name}
</h1>
<p className="text-sm text-gray-600 mt-1">
{dispensary.city}, {dispensary.state} - Crawl Schedule & History
</p>
<div className="flex items-center gap-4 mt-2 text-sm text-gray-500">
<span>Slug: {dispensary.slug}</span>
{dispensary.menu_type && (
<span className="px-2 py-0.5 bg-gray-100 rounded text-xs">
{dispensary.menu_type}
</span>
)}
</div>
</div>
</div>
</div>
{/* Next Scheduled Crawl */}
{nextSchedule && (
<div className="bg-white rounded-lg border border-gray-200 p-6">
<h2 className="text-lg font-semibold text-gray-900 mb-4 flex items-center gap-2">
<Clock className="w-5 h-5 text-blue-500" />
Upcoming Schedule
</h2>
<div className="grid grid-cols-4 gap-6">
<div>
<p className="text-sm text-gray-500">Next Run</p>
<p className="text-xl font-semibold text-blue-600">
{formatTimeUntil(nextSchedule.nextRunAt)}
</p>
<p className="text-xs text-gray-400">
{formatDate(nextSchedule.nextRunAt)}
</p>
</div>
<div>
<p className="text-sm text-gray-500">Interval</p>
<p className="text-lg font-medium">
{formatInterval(nextSchedule.baseIntervalMinutes, nextSchedule.jitterMinutes)}
</p>
</div>
<div>
<p className="text-sm text-gray-500">Last Run</p>
<p className="text-lg font-medium">
{formatTimeAgo(nextSchedule.lastRunAt)}
</p>
</div>
<div>
<p className="text-sm text-gray-500">Last Status</p>
<p className={`text-lg font-medium ${
nextSchedule.lastStatus === 'success' ? 'text-green-600' :
nextSchedule.lastStatus === 'error' ? 'text-red-600' : 'text-gray-600'
}`}>
{nextSchedule.lastStatus || '-'}
</p>
</div>
</div>
</div>
)}
{/* Stats Summary */}
<div className="grid grid-cols-4 gap-4">
<div className="bg-white rounded-lg border border-gray-200 p-4">
<div className="flex items-center gap-3">
<CheckCircle className="w-8 h-8 text-green-500" />
<div>
<p className="text-sm text-gray-500">Successful Runs</p>
<p className="text-2xl font-bold text-green-600">{successCount}</p>
</div>
</div>
</div>
<div className="bg-white rounded-lg border border-gray-200 p-4">
<div className="flex items-center gap-3">
<XCircle className="w-8 h-8 text-red-500" />
<div>
<p className="text-sm text-gray-500">Failed Runs</p>
<p className="text-2xl font-bold text-red-600">{failureCount}</p>
</div>
</div>
</div>
<div className="bg-white rounded-lg border border-gray-200 p-4">
<div className="flex items-center gap-3">
<Timer className="w-8 h-8 text-blue-500" />
<div>
<p className="text-sm text-gray-500">Avg Duration</p>
<p className="text-2xl font-bold">{formatDuration(avgDuration)}</p>
</div>
</div>
</div>
<div className="bg-white rounded-lg border border-gray-200 p-4">
<div className="flex items-center gap-3">
<Package className="w-8 h-8 text-purple-500" />
<div>
<p className="text-sm text-gray-500">Last Products Found</p>
<p className="text-2xl font-bold">
{lastSuccess?.productsFound?.toLocaleString() || '-'}
</p>
</div>
</div>
</div>
</div>
{/* Crawl History Table */}
<div className="bg-white rounded-lg border border-gray-200">
<div className="p-4 border-b border-gray-200">
<h2 className="text-lg font-semibold text-gray-900 flex items-center gap-2">
<Calendar className="w-5 h-5 text-gray-500" />
Crawl History
</h2>
</div>
<div className="overflow-x-auto">
<table className="table table-sm w-full">
<thead className="bg-gray-50">
<tr>
<th>Status</th>
<th>Started</th>
<th>Duration</th>
<th className="text-right">Products</th>
<th>State</th>
<th>Error</th>
</tr>
</thead>
<tbody>
{history.length === 0 ? (
<tr>
<td colSpan={6} className="text-center py-8 text-gray-500">
No crawl history available
</td>
</tr>
) : (
history.map((item) => (
<tr key={item.id} className="hover:bg-gray-50">
<td>
<span className={`inline-flex items-center gap-1 px-2 py-1 rounded text-xs font-medium ${
item.success
? 'bg-green-100 text-green-700'
: 'bg-red-100 text-red-700'
}`}>
{item.success ? (
<CheckCircle className="w-3 h-3" />
) : (
<XCircle className="w-3 h-3" />
)}
{item.success ? 'Success' : 'Failed'}
</span>
</td>
<td>
<div className="text-sm">{formatDate(item.startedAt)}</div>
<div className="text-xs text-gray-400">{formatTimeAgo(item.startedAt)}</div>
</td>
<td className="font-mono text-sm">
{formatDuration(item.durationMs)}
</td>
<td className="text-right font-mono text-sm">
{item.productsFound?.toLocaleString() || '-'}
</td>
<td className="text-sm text-gray-600">
{item.stateAtEnd || item.stateAtStart || '-'}
</td>
<td className="max-w-[200px]">
{item.errorMessage ? (
<span
className="text-xs text-red-600 truncate block cursor-help"
title={item.errorMessage}
>
{item.errorMessage.length > 50 ? `${item.errorMessage.substring(0, 50)}...` : item.errorMessage}
</span>
) : '-'}
</td>
</tr>
))
)}
</tbody>
</table>
</div>
</div>
</div>
</Layout>
);
}
export default DispensarySchedule;

View File

@@ -3,15 +3,16 @@ import { useNavigate } from 'react-router-dom';
import { Layout } from '../components/Layout';
import { api } from '../lib/api';
import { trackProductClick } from '../lib/analytics';
import { useStateFilter } from '../hooks/useStateFilter';
import {
Building2,
MapPin,
Package,
DollarSign,
RefreshCw,
Search,
TrendingUp,
BarChart3,
ChevronDown,
} from 'lucide-react';
interface BrandData {
@@ -25,19 +26,28 @@ interface BrandData {
export function IntelligenceBrands() {
const navigate = useNavigate();
const { selectedState, setSelectedState, stateParam, stateLabel, isAllStates } = useStateFilter();
const [availableStates, setAvailableStates] = useState<string[]>([]);
const [brands, setBrands] = useState<BrandData[]>([]);
const [loading, setLoading] = useState(true);
const [searchTerm, setSearchTerm] = useState('');
const [sortBy, setSortBy] = useState<'stores' | 'skus' | 'name'>('stores');
const [sortBy, setSortBy] = useState<'stores' | 'skus' | 'name' | 'states'>('stores');
useEffect(() => {
loadBrands();
}, [stateParam]);
useEffect(() => {
// Load available states
api.getOrchestratorStates().then(data => {
setAvailableStates(data.states?.map((s: any) => s.state) || []);
}).catch(console.error);
}, []);
const loadBrands = async () => {
try {
setLoading(true);
const data = await api.getIntelligenceBrands({ limit: 500 });
const data = await api.getIntelligenceBrands({ limit: 500, state: stateParam });
setBrands(data.brands || []);
} catch (error) {
console.error('Failed to load brands:', error);
@@ -58,6 +68,8 @@ export function IntelligenceBrands() {
return b.skuCount - a.skuCount;
case 'name':
return a.brandName.localeCompare(b.brandName);
case 'states':
return b.states.length - a.states.length;
default:
return 0;
}
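Extracted as a standalone function, the comparator this hunk extends looks like the sketch below. The type and field names are assumptions read off the diff; only the four sort cases themselves come from the source.

```typescript
// Standalone sketch of the brand sort comparator, including the new
// 'states' case. Numeric keys sort descending; 'name' sorts ascending.
type BrandRow = {
  brandName: string;
  storeCount: number;
  skuCount: number;
  states: string[]; // list of state codes the brand appears in
};

type SortKey = 'stores' | 'skus' | 'name' | 'states';

function compareBrands(a: BrandRow, b: BrandRow, sortBy: SortKey): number {
  switch (sortBy) {
    case 'stores':
      return b.storeCount - a.storeCount;
    case 'skus':
      return b.skuCount - a.skuCount;
    case 'name':
      return a.brandName.localeCompare(b.brandName);
    case 'states':
      // new case: rank by market breadth (number of states)
      return b.states.length - a.states.length;
    default:
      return 0;
  }
}
```

The new `'states'` case mirrors the other numeric keys: larger counts sort first.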
@@ -89,37 +101,62 @@ export function IntelligenceBrands() {
<Layout>
<div className="space-y-6">
{/* Header */}
<div className="flex items-center justify-between">
<div className="flex flex-col gap-4 sm:flex-row sm:items-center sm:justify-between">
<div>
<h1 className="text-2xl font-bold text-gray-900">Brands Intelligence</h1>
<p className="text-sm text-gray-600 mt-1">
Brand penetration and pricing analytics across markets
</p>
</div>
<div className="flex gap-2">
<div className="flex flex-wrap gap-2 items-center">
{/* State Selector */}
<div className="dropdown dropdown-end">
<button tabIndex={0} className="btn btn-sm gap-2 bg-emerald-50 border-emerald-200 hover:bg-emerald-100">
{stateLabel}
<ChevronDown className="w-4 h-4" />
</button>
<ul tabIndex={0} className="dropdown-content z-50 menu p-2 shadow-lg bg-white rounded-box w-44 max-h-60 overflow-y-auto border border-gray-200">
<li>
<a onClick={() => setSelectedState(null)} className={isAllStates ? 'active bg-emerald-100' : ''}>
All States
</a>
</li>
<div className="divider my-1"></div>
{availableStates.map((state) => (
<li key={state}>
<a onClick={() => setSelectedState(state)} className={selectedState === state ? 'active bg-emerald-100' : ''}>
{state}
</a>
</li>
))}
</ul>
</div>
{/* Page Navigation */}
<div className="flex gap-1">
<button
onClick={() => navigate('/admin/intelligence/pricing')}
className="btn btn-sm btn-outline gap-1"
className="btn btn-sm gap-1 bg-emerald-600 text-white hover:bg-emerald-700 border-emerald-600"
>
<DollarSign className="w-4 h-4" />
Pricing
<Building2 className="w-4 h-4" />
<span>Brands</span>
</button>
<button
onClick={() => navigate('/admin/intelligence/stores')}
className="btn btn-sm btn-outline gap-1"
className="btn btn-sm gap-1 bg-white border-gray-300 text-gray-700 hover:bg-gray-100"
>
<MapPin className="w-4 h-4" />
Stores
<span>Stores</span>
</button>
<button
onClick={loadBrands}
className="btn btn-sm btn-outline gap-2"
onClick={() => navigate('/admin/intelligence/pricing')}
className="btn btn-sm gap-1 bg-white border-gray-300 text-gray-700 hover:bg-gray-100"
>
<RefreshCw className="w-4 h-4" />
Refresh
<DollarSign className="w-4 h-4" />
<span>Pricing</span>
</button>
</div>
</div>
</div>
{/* Summary Cards */}
<div className="grid grid-cols-4 gap-4">
@@ -169,28 +206,32 @@ export function IntelligenceBrands() {
{/* Top Brands Chart */}
<div className="bg-white rounded-lg border border-gray-200 p-4">
<h3 className="text-lg font-semibold text-gray-900 mb-4 flex items-center gap-2">
<BarChart3 className="w-5 h-5 text-blue-500" />
<h3 className="text-lg font-semibold text-gray-900 flex items-center gap-2 mb-4">
<BarChart3 className="w-5 h-5 text-emerald-500" />
Top 10 Brands by Store Count
</h3>
<div className="space-y-2">
{topBrands.map((brand, idx) => (
{topBrands.map((brand) => {
const barWidth = Math.min((brand.storeCount / maxStoreCount) * 100, 100);
return (
<div key={brand.brandName} className="flex items-center gap-3">
<span className="text-sm text-gray-500 w-6">{idx + 1}.</span>
<span className="text-sm font-medium w-40 truncate" title={brand.brandName}>
<span className="text-sm font-medium w-28 truncate shrink-0" title={brand.brandName}>
{brand.brandName}
</span>
<div className="flex-1 bg-gray-100 rounded-full h-4 relative">
<div className="flex-1 min-w-0">
<div className="bg-gray-100 rounded h-5 overflow-hidden">
<div
className="bg-blue-500 rounded-full h-4"
style={{ width: `${(brand.storeCount / maxStoreCount) * 100}%` }}
className="bg-gradient-to-r from-emerald-400 to-emerald-500 h-5 rounded transition-all"
style={{ width: `${barWidth}%` }}
/>
</div>
<span className="text-sm text-gray-600 w-16 text-right">
{brand.storeCount} stores
</div>
<span className="text-sm font-mono font-semibold text-emerald-600 w-16 text-right shrink-0">
{brand.storeCount}
</span>
</div>
))}
);
})}
</div>
</div>
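The `barWidth` expression introduced in this hunk is worth isolating: the width is a percentage of the max store count, capped at 100 so an inconsistent max can never overflow the track. A minimal sketch (the zero-max guard is an added assumption, not in the diff):

```typescript
// Clamped bar-width computation used by the Top 10 chart:
// percentage of the max value, never above 100.
function barWidthPct(value: number, maxValue: number): number {
  if (maxValue <= 0) return 0; // assumption: guard against division by zero
  return Math.min((value / maxValue) * 100, 100);
}
```

The same pattern recurs in the category-pricing chart further down.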
@@ -213,6 +254,7 @@ export function IntelligenceBrands() {
>
<option value="stores">Sort by Stores</option>
<option value="skus">Sort by SKUs</option>
<option value="states">Sort by States</option>
<option value="name">Sort by Name</option>
</select>
<span className="text-sm text-gray-500">

View File

@@ -2,15 +2,16 @@ import { useEffect, useState } from 'react';
import { useNavigate } from 'react-router-dom';
import { Layout } from '../components/Layout';
import { api } from '../lib/api';
import { useStateFilter } from '../hooks/useStateFilter';
import {
DollarSign,
Building2,
MapPin,
Package,
RefreshCw,
TrendingUp,
TrendingDown,
BarChart3,
ChevronDown,
} from 'lucide-react';
interface CategoryPricing {
@@ -31,18 +32,27 @@ interface OverallPricing {
export function IntelligencePricing() {
const navigate = useNavigate();
const { selectedState, setSelectedState, stateParam, stateLabel, isAllStates } = useStateFilter();
const [availableStates, setAvailableStates] = useState<string[]>([]);
const [categories, setCategories] = useState<CategoryPricing[]>([]);
const [overall, setOverall] = useState<OverallPricing | null>(null);
const [loading, setLoading] = useState(true);
useEffect(() => {
loadPricing();
}, [stateParam]);
useEffect(() => {
// Load available states
api.getOrchestratorStates().then(data => {
setAvailableStates(data.states?.map((s: any) => s.state) || []);
}).catch(console.error);
}, []);
const loadPricing = async () => {
try {
setLoading(true);
const data = await api.getIntelligencePricing();
const data = await api.getIntelligencePricing({ state: stateParam });
setCategories(data.byCategory || []);
setOverall(data.overall || null);
} catch (error) {
@@ -76,37 +86,62 @@ export function IntelligencePricing() {
<Layout>
<div className="space-y-6">
{/* Header */}
<div className="flex items-center justify-between">
<div className="flex flex-col gap-4 sm:flex-row sm:items-center sm:justify-between">
<div>
<h1 className="text-2xl font-bold text-gray-900">Pricing Intelligence</h1>
<p className="text-sm text-gray-600 mt-1">
Price distribution and trends by category
</p>
</div>
<div className="flex gap-2">
<div className="flex flex-wrap gap-2 items-center">
{/* State Selector */}
<div className="dropdown dropdown-end">
<button tabIndex={0} className="btn btn-sm gap-2 bg-emerald-50 border-emerald-200 hover:bg-emerald-100">
{stateLabel}
<ChevronDown className="w-4 h-4" />
</button>
<ul tabIndex={0} className="dropdown-content z-50 menu p-2 shadow-lg bg-white rounded-box w-44 max-h-60 overflow-y-auto border border-gray-200">
<li>
<a onClick={() => setSelectedState(null)} className={isAllStates ? 'active bg-emerald-100' : ''}>
All States
</a>
</li>
<div className="divider my-1"></div>
{availableStates.map((state) => (
<li key={state}>
<a onClick={() => setSelectedState(state)} className={selectedState === state ? 'active bg-emerald-100' : ''}>
{state}
</a>
</li>
))}
</ul>
</div>
{/* Page Navigation */}
<div className="flex gap-1">
<button
onClick={() => navigate('/admin/intelligence/brands')}
className="btn btn-sm btn-outline gap-1"
className="btn btn-sm gap-1 bg-white border-gray-300 text-gray-700 hover:bg-gray-100"
>
<Building2 className="w-4 h-4" />
Brands
<span>Brands</span>
</button>
<button
onClick={() => navigate('/admin/intelligence/stores')}
className="btn btn-sm btn-outline gap-1"
className="btn btn-sm gap-1 bg-white border-gray-300 text-gray-700 hover:bg-gray-100"
>
<MapPin className="w-4 h-4" />
Stores
<span>Stores</span>
</button>
<button
onClick={loadPricing}
className="btn btn-sm btn-outline gap-2"
className="btn btn-sm gap-1 bg-emerald-600 text-white hover:bg-emerald-700 border-emerald-600"
>
<RefreshCw className="w-4 h-4" />
Refresh
<DollarSign className="w-4 h-4" />
<span>Pricing</span>
</button>
</div>
</div>
</div>
{/* Overall Stats */}
{overall && (
@@ -150,7 +185,7 @@ export function IntelligencePricing() {
<div>
<p className="text-sm text-gray-500">Products Priced</p>
<p className="text-2xl font-bold">
{overall.totalProducts.toLocaleString()}
{(overall.totalProducts || 0).toLocaleString()}
</p>
</div>
</div>
@@ -164,43 +199,29 @@ export function IntelligencePricing() {
<BarChart3 className="w-5 h-5 text-green-500" />
Average Price by Category
</h3>
<div className="space-y-3">
{sortedCategories.map((cat) => (
<div className="space-y-2">
{sortedCategories.slice(0, 12).map((cat) => {
const maxPrice = Math.max(...sortedCategories.map(c => c.avgPrice || 0), 1);
const barWidth = Math.min(((cat.avgPrice || 0) / maxPrice) * 100, 100);
return (
<div key={cat.category} className="flex items-center gap-3">
<span className="text-sm font-medium w-32 truncate" title={cat.category}>
<span className="text-sm font-medium w-28 truncate shrink-0" title={cat.category}>
{cat.category || 'Unknown'}
</span>
<div className="flex-1 relative">
{/* Price range bar */}
<div className="bg-gray-100 rounded-full h-6 relative">
{/* Min-Max range */}
<div className="flex-1 min-w-0">
<div className="bg-gray-100 rounded h-5 overflow-hidden">
<div
className="absolute top-0 h-6 bg-blue-100 rounded-full"
style={{
left: `${(cat.minPrice / (overall?.maxPrice || 100)) * 100}%`,
width: `${((cat.maxPrice - cat.minPrice) / (overall?.maxPrice || 100)) * 100}%`,
}}
/>
{/* Average marker */}
<div
className="absolute top-0 h-6 w-1 bg-green-500 rounded"
style={{ left: `${(cat.avgPrice / (overall?.maxPrice || 100)) * 100}%` }}
className="bg-gradient-to-r from-emerald-400 to-emerald-500 h-5 rounded transition-all"
style={{ width: `${barWidth}%` }}
/>
</div>
</div>
<div className="flex gap-4 text-xs w-48">
<span className="text-gray-500">
Min: <span className="text-blue-600 font-mono">{formatPrice(cat.minPrice)}</span>
</span>
<span className="text-gray-500">
Avg: <span className="text-green-600 font-mono font-bold">{formatPrice(cat.avgPrice)}</span>
</span>
<span className="text-gray-500">
Max: <span className="text-orange-600 font-mono">{formatPrice(cat.maxPrice)}</span>
<span className="text-sm font-mono font-semibold text-emerald-600 w-16 text-right shrink-0">
{formatPrice(cat.avgPrice)}
</span>
</div>
</div>
))}
);
})}
</div>
</div>
@@ -236,7 +257,7 @@ export function IntelligencePricing() {
<span className="font-medium">{cat.category || 'Unknown'}</span>
</td>
<td className="text-center">
<span className="font-mono">{cat.productCount.toLocaleString()}</span>
<span className="font-mono">{(cat.productCount || 0).toLocaleString()}</span>
</td>
<td className="text-right">
<span className="font-mono text-blue-600">{formatPrice(cat.minPrice)}</span>

View File

@@ -8,7 +8,6 @@ import {
Building2,
DollarSign,
Package,
RefreshCw,
Search,
Clock,
Activity,
@@ -34,12 +33,19 @@ export function IntelligenceStores() {
const [stores, setStores] = useState<StoreActivity[]>([]);
const [loading, setLoading] = useState(true);
const [searchTerm, setSearchTerm] = useState('');
const [localStates, setLocalStates] = useState<string[]>([]);
const [availableStates, setAvailableStates] = useState<string[]>([]);
useEffect(() => {
loadStores();
}, [selectedState]);
useEffect(() => {
// Load available states from orchestrator API
api.getOrchestratorStates().then(data => {
setAvailableStates(data.states?.map((s: any) => s.state) || []);
}).catch(console.error);
}, []);
const loadStores = async () => {
try {
setLoading(true);
@@ -48,10 +54,6 @@ export function IntelligenceStores() {
limit: 500,
});
setStores(data.stores || []);
// Extract unique states from response for dropdown counts
const uniqueStates = [...new Set(data.stores.map((s: StoreActivity) => s.state))].sort();
setLocalStates(uniqueStates);
} catch (error) {
console.error('Failed to load stores:', error);
} finally {
@@ -97,49 +99,74 @@ export function IntelligenceStores() {
);
}
// Calculate stats
const totalSKUs = stores.reduce((sum, s) => sum + s.skuCount, 0);
const totalSnapshots = stores.reduce((sum, s) => sum + s.snapshotCount, 0);
const avgFrequency = stores.filter(s => s.crawlFrequencyHours).length > 0
? stores.filter(s => s.crawlFrequencyHours).reduce((sum, s) => sum + (s.crawlFrequencyHours || 0), 0) /
stores.filter(s => s.crawlFrequencyHours).length
// Calculate stats with null safety
const totalSKUs = stores.reduce((sum, s) => sum + (s.skuCount || 0), 0);
const totalSnapshots = stores.reduce((sum, s) => sum + (s.snapshotCount || 0), 0);
const storesWithFrequency = stores.filter(s => s.crawlFrequencyHours != null);
const avgFrequency = storesWithFrequency.length > 0
? storesWithFrequency.reduce((sum, s) => sum + (s.crawlFrequencyHours || 0), 0) / storesWithFrequency.length
: 0;
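The rewritten average filters once, reuses the filtered array, and falls back to 0 when no store has a frequency. As a pure function (the field name comes from the diff; the rest is a sketch):

```typescript
// Null-safe average crawl frequency: only stores that report a
// frequency count toward the average; empty input yields 0.
function avgCrawlFrequency(
  stores: { crawlFrequencyHours?: number | null }[]
): number {
  const withFrequency = stores.filter(s => s.crawlFrequencyHours != null);
  if (withFrequency.length === 0) return 0;
  const total = withFrequency.reduce(
    (sum, s) => sum + (s.crawlFrequencyHours || 0),
    0
  );
  return total / withFrequency.length;
}
```

This also fixes the old version's triple filtering of the same array.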
return (
<Layout>
<div className="space-y-6">
{/* Header */}
<div className="flex items-center justify-between">
<div className="flex flex-col gap-4 sm:flex-row sm:items-center sm:justify-between">
<div>
<h1 className="text-2xl font-bold text-gray-900">Store Activity</h1>
<p className="text-sm text-gray-600 mt-1">
Per-store SKU counts, snapshots, and crawl frequency
</p>
</div>
<div className="flex gap-2">
<div className="flex flex-wrap gap-2 items-center">
{/* State Selector */}
<div className="dropdown dropdown-end">
<button tabIndex={0} className="btn btn-sm gap-2 bg-emerald-50 border-emerald-200 hover:bg-emerald-100">
{stateLabel}
<ChevronDown className="w-4 h-4" />
</button>
<ul tabIndex={0} className="dropdown-content z-50 menu p-2 shadow-lg bg-white rounded-box w-44 max-h-60 overflow-y-auto border border-gray-200">
<li>
<a onClick={() => setSelectedState(null)} className={isAllStates ? 'active bg-emerald-100' : ''}>
All States
</a>
</li>
<div className="divider my-1"></div>
{availableStates.map((state) => (
<li key={state}>
<a onClick={() => setSelectedState(state)} className={selectedState === state ? 'active bg-emerald-100' : ''}>
{state}
</a>
</li>
))}
</ul>
</div>
{/* Page Navigation */}
<div className="flex gap-1">
<button
onClick={() => navigate('/admin/intelligence/brands')}
className="btn btn-sm btn-outline gap-1"
className="btn btn-sm gap-1 bg-white border-gray-300 text-gray-700 hover:bg-gray-100"
>
<Building2 className="w-4 h-4" />
Brands
<span>Brands</span>
</button>
<button
className="btn btn-sm gap-1 bg-emerald-600 text-white hover:bg-emerald-700 border-emerald-600"
>
<MapPin className="w-4 h-4" />
<span>Stores</span>
</button>
<button
onClick={() => navigate('/admin/intelligence/pricing')}
className="btn btn-sm btn-outline gap-1"
className="btn btn-sm gap-1 bg-white border-gray-300 text-gray-700 hover:bg-gray-100"
>
<DollarSign className="w-4 h-4" />
Pricing
</button>
<button
onClick={loadStores}
className="btn btn-sm btn-outline gap-2"
>
<RefreshCw className="w-4 h-4" />
Refresh
<span>Pricing</span>
</button>
</div>
</div>
</div>
{/* Summary Cards - Responsive: 2→4 columns */}
<div className="grid grid-cols-2 md:grid-cols-4 gap-4">
@@ -193,26 +220,6 @@ export function IntelligenceStores() {
className="input input-bordered input-sm w-full pl-10"
/>
</div>
<div className="dropdown">
<button tabIndex={0} className="btn btn-sm btn-outline gap-2">
{stateLabel}
<ChevronDown className="w-4 h-4" />
</button>
<ul tabIndex={0} className="dropdown-content z-[1] menu p-2 shadow bg-base-100 rounded-box w-40 max-h-60 overflow-y-auto">
<li>
<a onClick={() => setSelectedState(null)} className={isAllStates ? 'active' : ''}>
All States
</a>
</li>
{localStates.map(state => (
<li key={state}>
<a onClick={() => setSelectedState(state)} className={selectedState === state ? 'active' : ''}>
{state}
</a>
</li>
))}
</ul>
</div>
<span className="text-sm text-gray-500">
Showing {filteredStores.length} of {stores.length} stores
</span>
@@ -246,7 +253,7 @@ export function IntelligenceStores() {
<tr
key={store.id}
className="hover:bg-gray-50 cursor-pointer"
onClick={() => navigate(`/admin/orchestrator/stores?storeId=${store.id}`)}
onClick={() => navigate(`/stores/list/${store.id}`)}
>
<td>
<span className="font-medium">{store.name}</span>
@@ -262,10 +269,10 @@ export function IntelligenceStores() {
)}
</td>
<td className="text-center">
<span className="font-mono">{store.skuCount.toLocaleString()}</span>
<span className="font-mono">{(store.skuCount || 0).toLocaleString()}</span>
</td>
<td className="text-center">
<span className="font-mono">{store.snapshotCount.toLocaleString()}</span>
<span className="font-mono">{(store.snapshotCount || 0).toLocaleString()}</span>
</td>
<td>
<span className={store.lastCrawl ? 'text-green-600' : 'text-gray-400'}>

View File

@@ -8,7 +8,6 @@
import { useState, useEffect } from 'react';
import { useNavigate } from 'react-router-dom';
import { Layout } from '../components/Layout';
import { StateBadge } from '../components/StateSelector';
import { useStateStore } from '../store/stateStore';
import { api } from '../lib/api';
import {
@@ -21,7 +20,6 @@ import {
DollarSign,
MapPin,
ArrowRight,
RefreshCw,
AlertCircle
} from 'lucide-react';
@@ -205,7 +203,6 @@ export default function NationalDashboard() {
const [loading, setLoading] = useState(true);
const [error, setError] = useState<string | null>(null);
const [summary, setSummary] = useState<NationalSummary | null>(null);
const [refreshing, setRefreshing] = useState(false);
const fetchData = async () => {
setLoading(true);
@@ -230,18 +227,6 @@ export default function NationalDashboard() {
fetchData();
}, []);
const handleRefreshMetrics = async () => {
setRefreshing(true);
try {
await api.post('/api/admin/states/refresh-metrics');
await fetchData();
} catch (err) {
console.error('Failed to refresh metrics:', err);
} finally {
setRefreshing(false);
}
};
const handleStateClick = (stateCode: string) => {
setSelectedState(stateCode);
navigate(`/national/state/${stateCode}`);
@@ -278,32 +263,19 @@ export default function NationalDashboard() {
<Layout>
<div className="space-y-6">
{/* Header */}
<div className="flex items-center justify-between">
<div>
<h1 className="text-2xl font-bold text-gray-900">National Dashboard</h1>
<p className="text-gray-500 mt-1">
Multi-state cannabis market intelligence
</p>
</div>
<div className="flex items-center gap-3">
<StateBadge />
<button
onClick={handleRefreshMetrics}
disabled={refreshing}
className="flex items-center gap-2 px-3 py-2 text-sm text-gray-600 hover:text-gray-900 border border-gray-200 rounded-lg hover:bg-gray-50 disabled:opacity-50"
>
<RefreshCw className={`w-4 h-4 ${refreshing ? 'animate-spin' : ''}`} />
Refresh Metrics
</button>
</div>
</div>
{/* Summary Cards */}
{summary && (
<>
<div className="grid grid-cols-1 md:grid-cols-2 lg:grid-cols-4 gap-4">
<MetricCard
title="Active States"
title="States"
value={summary.activeStates}
icon={Globe}
/>

View File

@@ -153,29 +153,6 @@ export function StoreDetailPage() {
Back to Stores
</button>
{/* Update Button */}
<div className="relative">
<button
onClick={() => setShowUpdateDropdown(!showUpdateDropdown)}
disabled={isUpdating}
className="flex items-center gap-2 px-4 py-2 text-sm font-medium text-white bg-blue-600 hover:bg-blue-700 rounded-lg disabled:opacity-50 disabled:cursor-not-allowed"
>
<RefreshCw className={`w-4 h-4 ${isUpdating ? 'animate-spin' : ''}`} />
{isUpdating ? 'Crawling...' : 'Crawl Now'}
{!isUpdating && <ChevronDown className="w-4 h-4" />}
</button>
{showUpdateDropdown && !isUpdating && (
<div className="absolute right-0 mt-2 w-48 bg-white rounded-lg shadow-lg border border-gray-200 z-10">
<button
onClick={handleCrawl}
className="w-full text-left px-4 py-2 text-sm text-gray-700 hover:bg-gray-100 rounded-lg"
>
Start Full Crawl
</button>
</div>
)}
</div>
</div>
{/* Store Header */}
@@ -200,7 +177,7 @@ export function StoreDetailPage() {
<div className="flex items-center gap-2 text-sm text-gray-600 bg-gray-50 px-4 py-2 rounded-lg">
<Clock className="w-4 h-4" />
<div>
<span className="font-medium">Last Crawl:</span>
<span className="font-medium">Last Updated:</span>
<span className="ml-2">
{lastCrawl?.completed_at
? new Date(lastCrawl.completed_at).toLocaleDateString('en-US', {
@@ -212,15 +189,6 @@ export function StoreDetailPage() {
})
: 'Never'}
</span>
{lastCrawl?.status && (
<span className={`ml-2 px-2 py-0.5 rounded text-xs ${
lastCrawl.status === 'completed' ? 'bg-green-100 text-green-800' :
lastCrawl.status === 'failed' ? 'bg-red-100 text-red-800' :
'bg-yellow-100 text-yellow-800'
}`}>
{lastCrawl.status}
</span>
)}
</div>
</div>
</div>
@@ -282,8 +250,8 @@ export function StoreDetailPage() {
setStockFilter('in_stock');
setSearchQuery('');
}}
className={`bg-white rounded-lg border p-4 hover:border-blue-300 hover:shadow-md transition-all cursor-pointer text-left ${
stockFilter === 'in_stock' ? 'border-blue-500' : 'border-gray-200'
className={`bg-white rounded-lg border p-4 hover:border-gray-300 hover:shadow-md transition-all cursor-pointer text-left ${
stockFilter === 'in_stock' ? 'border-gray-400' : 'border-gray-200'
}`}
>
<div className="flex items-center gap-3">
@@ -303,8 +271,8 @@ export function StoreDetailPage() {
setStockFilter('out_of_stock');
setSearchQuery('');
}}
className={`bg-white rounded-lg border p-4 hover:border-blue-300 hover:shadow-md transition-all cursor-pointer text-left ${
stockFilter === 'out_of_stock' ? 'border-blue-500' : 'border-gray-200'
className={`bg-white rounded-lg border p-4 hover:border-gray-300 hover:shadow-md transition-all cursor-pointer text-left ${
stockFilter === 'out_of_stock' ? 'border-gray-400' : 'border-gray-200'
}`}
>
<div className="flex items-center gap-3">
@@ -320,8 +288,8 @@ export function StoreDetailPage() {
<button
onClick={() => setActiveTab('brands')}
className={`bg-white rounded-lg border p-4 hover:border-blue-300 hover:shadow-md transition-all cursor-pointer text-left ${
activeTab === 'brands' ? 'border-blue-500' : 'border-gray-200'
className={`bg-white rounded-lg border p-4 hover:border-gray-300 hover:shadow-md transition-all cursor-pointer text-left ${
activeTab === 'brands' ? 'border-gray-400' : 'border-gray-200'
}`}
>
<div className="flex items-center gap-3">
@@ -337,8 +305,8 @@ export function StoreDetailPage() {
<button
onClick={() => setActiveTab('categories')}
className={`bg-white rounded-lg border p-4 hover:border-blue-300 hover:shadow-md transition-all cursor-pointer text-left ${
activeTab === 'categories' ? 'border-blue-500' : 'border-gray-200'
className={`bg-white rounded-lg border p-4 hover:border-gray-300 hover:shadow-md transition-all cursor-pointer text-left ${
activeTab === 'categories' ? 'border-gray-400' : 'border-gray-200'
}`}
>
<div className="flex items-center gap-3">
@@ -364,7 +332,7 @@ export function StoreDetailPage() {
}}
className={`py-4 px-2 text-sm font-medium border-b-2 ${
activeTab === 'products'
? 'border-blue-600 text-blue-600'
? 'border-gray-800 text-gray-900'
: 'border-transparent text-gray-600 hover:text-gray-900'
}`}
>
@@ -374,7 +342,7 @@ export function StoreDetailPage() {
onClick={() => setActiveTab('brands')}
className={`py-4 px-2 text-sm font-medium border-b-2 ${
activeTab === 'brands'
? 'border-blue-600 text-blue-600'
? 'border-gray-800 text-gray-900'
: 'border-transparent text-gray-600 hover:text-gray-900'
}`}
>
@@ -384,7 +352,7 @@ export function StoreDetailPage() {
onClick={() => setActiveTab('categories')}
className={`py-4 px-2 text-sm font-medium border-b-2 ${
activeTab === 'categories'
? 'border-blue-600 text-blue-600'
? 'border-gray-800 text-gray-900'
: 'border-transparent text-gray-600 hover:text-gray-900'
}`}
>
@@ -433,7 +401,7 @@ export function StoreDetailPage() {
{productsLoading ? (
<div className="text-center py-8">
<div className="inline-block animate-spin rounded-full h-6 w-6 border-4 border-blue-500 border-t-transparent"></div>
<div className="inline-block animate-spin rounded-full h-6 w-6 border-4 border-gray-400 border-t-transparent"></div>
<p className="mt-2 text-sm text-gray-600">Loading products...</p>
</div>
) : products.length === 0 ? (
@@ -485,9 +453,9 @@ export function StoreDetailPage() {
<div className="line-clamp-2" title={product.brand || '-'}>{product.brand || '-'}</div>
</td>
<td className="whitespace-nowrap">
<span className="badge badge-ghost badge-sm">{product.type || '-'}</span>
<span className="text-xs text-gray-500 bg-gray-100 px-1.5 py-0.5 rounded">{product.type || '-'}</span>
{product.subcategory && (
<span className="badge badge-ghost badge-sm ml-1">{product.subcategory}</span>
<span className="text-xs text-gray-500 bg-gray-100 px-1.5 py-0.5 rounded ml-1">{product.subcategory}</span>
)}
</td>
<td className="text-right font-semibold whitespace-nowrap">
@@ -500,21 +468,14 @@ export function StoreDetailPage() {
`$${product.regular_price}`
) : '-'}
</td>
<td className="text-center whitespace-nowrap">
{product.thc_percentage ? (
<span className="badge badge-success badge-sm">{product.thc_percentage}%</span>
) : '-'}
<td className="text-center whitespace-nowrap text-sm text-gray-700">
{product.thc_percentage ? `${product.thc_percentage}%` : '-'}
</td>
<td className="text-center whitespace-nowrap">
{product.stock_status === 'in_stock' ? (
<span className="badge badge-success badge-sm">In Stock</span>
) : product.stock_status === 'out_of_stock' ? (
<span className="badge badge-error badge-sm">Out</span>
) : (
<span className="badge badge-warning badge-sm">Unknown</span>
)}
<td className="text-center whitespace-nowrap text-sm text-gray-700">
{product.stock_status === 'in_stock' ? 'In Stock' :
product.stock_status === 'out_of_stock' ? 'Out' : '-'}
</td>
<td className="text-center whitespace-nowrap">
<td className="text-center whitespace-nowrap text-sm text-gray-700">
{product.total_quantity != null ? product.total_quantity : '-'}
</td>
<td className="whitespace-nowrap text-xs text-gray-500">

View File

@@ -14,8 +14,9 @@ import {
ChevronUp,
Gauge,
Users,
Calendar,
Zap,
Power,
Play,
Square,
} from 'lucide-react';
interface Task {
@@ -82,6 +83,27 @@ const STATUS_COLORS: Record<string, string> = {
stale: 'bg-gray-100 text-gray-800',
};
const getStatusIcon = (status: string, poolPaused: boolean): React.ReactNode => {
switch (status) {
case 'pending':
return <Clock className="w-4 h-4" />;
case 'claimed':
return <PlayCircle className="w-4 h-4" />;
case 'running':
// Don't spin when pool is paused
return <RefreshCw className={`w-4 h-4 ${!poolPaused ? 'animate-spin' : ''}`} />;
case 'completed':
return <CheckCircle2 className="w-4 h-4" />;
case 'failed':
return <XCircle className="w-4 h-4" />;
case 'stale':
return <AlertTriangle className="w-4 h-4" />;
default:
return null;
}
};
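Stripped of the JSX, the pause-aware spinner added here reduces to one decision: a running task animates only while the pool is live. A sketch of that class selection (function name is an assumption):

```typescript
// The animation decision behind the "running" status icon:
// spin only when the task is running AND the pool is not paused.
function spinnerClass(status: string, poolPaused: boolean): string {
  return status === 'running' && !poolPaused ? 'animate-spin' : '';
}
```

This is what makes the status cards stop spinning when the pool is paused.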
// Static version for summary cards (always shows animation)
const STATUS_ICONS: Record<string, React.ReactNode> = {
pending: <Clock className="w-4 h-4" />,
claimed: <PlayCircle className="w-4 h-4" />,
@@ -116,6 +138,8 @@ export default function TasksDashboard() {
const [capacity, setCapacity] = useState<CapacityMetric[]>([]);
const [loading, setLoading] = useState(true);
const [error, setError] = useState<string | null>(null);
const [poolPaused, setPoolPaused] = useState(false);
const [poolLoading, setPoolLoading] = useState(false);
// Filters
const [roleFilter, setRoleFilter] = useState<string>('');
@@ -123,13 +147,10 @@ export default function TasksDashboard() {
const [searchQuery, setSearchQuery] = useState('');
const [showCapacity, setShowCapacity] = useState(true);
// Actions
const [actionLoading, setActionLoading] = useState(false);
const [actionMessage, setActionMessage] = useState<string | null>(null);
const fetchData = async () => {
try {
const [tasksRes, countsRes, capacityRes] = await Promise.all([
const [tasksRes, countsRes, capacityRes, poolStatus] = await Promise.all([
api.getTasks({
role: roleFilter || undefined,
status: statusFilter || undefined,
@@ -137,11 +158,13 @@ export default function TasksDashboard() {
}),
api.getTaskCounts(),
api.getTaskCapacity(),
api.getTaskPoolStatus(),
]);
setTasks(tasksRes.tasks || []);
setCounts(countsRes);
setCapacity(capacityRes.metrics || []);
setPoolPaused(poolStatus.paused);
setError(null);
} catch (err: any) {
setError(err.message || 'Failed to load tasks');
@@ -150,40 +173,29 @@ export default function TasksDashboard() {
}
};
const togglePool = async () => {
setPoolLoading(true);
try {
if (poolPaused) {
await api.resumeTaskPool();
setPoolPaused(false);
} else {
await api.pauseTaskPool();
setPoolPaused(true);
}
} catch (err: any) {
setError(err.message || 'Failed to toggle pool');
} finally {
setPoolLoading(false);
}
};
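Note the ordering in `togglePool`: `poolPaused` only flips after the API call resolves, so a failed request leaves the UI showing the true pool state. The decision itself can be sketched as a pure transition (names are assumptions mirroring the api methods in the diff):

```typescript
// Pause/resume transition: given the current paused flag, which
// endpoint to hit and what the flag becomes once that call succeeds.
function poolTransition(
  paused: boolean
): { endpoint: 'resume' | 'pause'; nextPaused: boolean } {
  return paused
    ? { endpoint: 'resume', nextPaused: false }
    : { endpoint: 'pause', nextPaused: true };
}
```

The component applies `nextPaused` only in the success path; the catch branch surfaces the error instead.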
useEffect(() => {
fetchData();
const interval = setInterval(fetchData, 10000); // Refresh every 10 seconds
const interval = setInterval(fetchData, 15000); // Auto-refresh every 15 seconds
return () => clearInterval(interval);
}, [roleFilter, statusFilter]);
const handleGenerateResync = async () => {
setActionLoading(true);
try {
const result = await api.generateResyncTasks();
setActionMessage(`Generated ${result.tasks_created} resync tasks`);
fetchData();
} catch (err: any) {
setActionMessage(`Error: ${err.message}`);
} finally {
setActionLoading(false);
setTimeout(() => setActionMessage(null), 5000);
}
};
const handleRecoverStale = async () => {
setActionLoading(true);
try {
const result = await api.recoverStaleTasks();
setActionMessage(`Recovered ${result.tasks_recovered} stale tasks`);
fetchData();
} catch (err: any) {
setActionMessage(`Error: ${err.message}`);
} finally {
setActionLoading(false);
setTimeout(() => setActionMessage(null), 5000);
}
};
const filteredTasks = tasks.filter((task) => {
if (searchQuery) {
const query = searchQuery.toLowerCase();
@@ -225,45 +237,32 @@ export default function TasksDashboard() {
</p>
</div>
<div className="flex gap-2">
<div className="flex items-center gap-4">
{/* Pool Toggle */}
<button
onClick={handleGenerateResync}
disabled={actionLoading}
className="flex items-center gap-2 px-4 py-2 bg-emerald-600 text-white rounded-lg hover:bg-emerald-700 disabled:opacity-50"
>
<Calendar className="w-4 h-4" />
Generate Resync
</button>
<button
onClick={handleRecoverStale}
disabled={actionLoading}
className="flex items-center gap-2 px-4 py-2 bg-gray-600 text-white rounded-lg hover:bg-gray-700 disabled:opacity-50"
>
<Zap className="w-4 h-4" />
Recover Stale
</button>
<button
onClick={fetchData}
className="flex items-center gap-2 px-4 py-2 bg-gray-100 text-gray-700 rounded-lg hover:bg-gray-200"
>
<RefreshCw className="w-4 h-4" />
Refresh
</button>
</div>
</div>
{/* Action Message */}
{actionMessage && (
<div
className={`p-4 rounded-lg ${
actionMessage.startsWith('Error')
? 'bg-red-50 text-red-700'
: 'bg-green-50 text-green-700'
onClick={togglePool}
disabled={poolLoading}
className={`flex items-center gap-2 px-4 py-2 rounded-lg font-medium transition-colors ${
poolPaused
? 'bg-emerald-100 text-emerald-700 hover:bg-emerald-200'
: 'bg-red-100 text-red-700 hover:bg-red-200'
}`}
>
{actionMessage}
</div>
{poolPaused ? (
<>
<Play className={`w-5 h-5 ${poolLoading ? 'animate-pulse' : ''}`} />
Start Pool
</>
) : (
<>
<Square className={`w-5 h-5 ${poolLoading ? 'animate-pulse' : ''}`} />
Stop Pool
</>
)}
</button>
<span className="text-sm text-gray-400">Auto-refreshes every 15s</span>
</div>
</div>
{error && (
<div className="p-4 bg-red-50 text-red-700 rounded-lg">{error}</div>
@@ -281,7 +280,7 @@ export default function TasksDashboard() {
>
<div className="flex items-center gap-2 mb-2">
<span className={`p-1.5 rounded ${STATUS_COLORS[status]}`}>
{STATUS_ICONS[status]}
{getStatusIcon(status, poolPaused)}
</span>
<span className="text-sm font-medium text-gray-600 capitalize">{status}</span>
</div>
@@ -496,7 +495,7 @@ export default function TasksDashboard() {
STATUS_COLORS[task.status]
}`}
>
{STATUS_ICONS[task.status]}
{getStatusIcon(task.status, poolPaused)}
{task.status}
</span>
</td>

View File

@@ -18,6 +18,9 @@ import {
Server,
MapPin,
Trash2,
Plus,
Minus,
Loader2,
} from 'lucide-react';
// Worker from registry
@@ -69,6 +72,14 @@ interface Task {
worker_id: string | null;
}
// K8s replica info (added 2024-12-10)
interface K8sReplicas {
current: number;
desired: number;
available: number;
updated: number;
}
function formatRelativeTime(dateStr: string | null): string {
if (!dateStr) return '-';
const date = new Date(dateStr);
@@ -215,10 +226,53 @@ export function WorkersDashboard() {
const [loading, setLoading] = useState(true);
const [error, setError] = useState<string | null>(null);
// K8s scaling state (added 2024-12-10)
const [k8sReplicas, setK8sReplicas] = useState<K8sReplicas | null>(null);
const [k8sError, setK8sError] = useState<string | null>(null);
const [scaling, setScaling] = useState(false);
const [targetReplicas, setTargetReplicas] = useState<number | null>(null);
// Pagination
const [page, setPage] = useState(0);
const workersPerPage = 15;
// Fetch K8s replica count (added 2025-12-10)
const fetchK8sReplicas = useCallback(async () => {
try {
const res = await api.get('/api/workers/k8s/replicas');
if (res.data.success && res.data.replicas) {
setK8sReplicas(res.data.replicas);
if (targetReplicas === null) {
setTargetReplicas(res.data.replicas.desired);
}
setK8sError(null);
}
} catch (err: any) {
// K8s not available (local dev or no RBAC)
setK8sError(err.response?.data?.error || 'K8s not available');
setK8sReplicas(null);
}
}, [targetReplicas]);
// Scale workers (added 2025-12-10)
const handleScale = useCallback(async (replicas: number) => {
if (replicas < 0 || replicas > 20) return;
setScaling(true);
try {
const res = await api.post('/api/workers/k8s/scale', { replicas });
if (res.data.success) {
setTargetReplicas(replicas);
// Refresh after a short delay to see the change
setTimeout(fetchK8sReplicas, 1000);
}
} catch (err: any) {
console.error('Scale error:', err);
setK8sError(err.response?.data?.error || 'Failed to scale');
} finally {
setScaling(false);
}
}, [fetchK8sReplicas]);
const fetchData = useCallback(async () => {
try {
// Fetch workers from registry
@@ -238,16 +292,6 @@ export function WorkersDashboard() {
}
}, []);
// Cleanup stale workers
const handleCleanupStale = async () => {
try {
await api.post('/api/worker-registry/cleanup', { stale_threshold_minutes: 2 });
fetchData();
} catch (err: any) {
console.error('Cleanup error:', err);
}
};
// Remove a single worker
const handleRemoveWorker = async (workerId: string) => {
if (!confirm('Remove this worker from the registry?')) return;
@@ -261,9 +305,14 @@ export function WorkersDashboard() {
useEffect(() => {
fetchData();
fetchK8sReplicas(); // Added 2025-12-10
const interval = setInterval(fetchData, 5000);
return () => clearInterval(interval);
}, [fetchData]);
const k8sInterval = setInterval(fetchK8sReplicas, 10000); // K8s refresh every 10s
return () => {
clearInterval(interval);
clearInterval(k8sInterval);
};
}, [fetchData, fetchK8sReplicas]);
// Paginated workers
const paginatedWorkers = workers.slice(
@@ -305,15 +354,6 @@ export function WorkersDashboard() {
{workers.length} registered workers ({busyWorkers.length} busy, {idleWorkers.length} idle)
</p>
</div>
<div className="flex items-center gap-2">
<button
onClick={handleCleanupStale}
className="flex items-center gap-2 px-4 py-2 bg-gray-100 text-gray-700 rounded-lg hover:bg-gray-200 transition-colors"
title="Mark stale workers (no heartbeat > 2 min) as offline"
>
<Trash2 className="w-4 h-4" />
Cleanup Stale
</button>
<button
onClick={() => fetchData()}
className="flex items-center gap-2 px-4 py-2 bg-emerald-600 text-white rounded-lg hover:bg-emerald-700 transition-colors"
@@ -322,7 +362,6 @@ export function WorkersDashboard() {
Refresh
</button>
</div>
</div>
{error && (
<div className="bg-red-50 border border-red-200 rounded-lg p-4">
@@ -330,6 +369,68 @@ export function WorkersDashboard() {
</div>
)}
{/* K8s Scaling Card (added 2025-12-10) */}
{k8sReplicas && (
<div className="bg-white rounded-lg border border-gray-200 p-4">
<div className="flex items-center justify-between">
<div className="flex items-center gap-3">
<div className="w-10 h-10 bg-purple-100 rounded-lg flex items-center justify-center">
<Server className="w-5 h-5 text-purple-600" />
</div>
<div>
<p className="text-sm text-gray-500">K8s Worker Pods</p>
<p className="text-xl font-semibold">
{k8sReplicas.current} / {k8sReplicas.desired}
{k8sReplicas.current !== k8sReplicas.desired && (
<span className="text-sm font-normal text-yellow-600 ml-2">scaling...</span>
)}
</p>
</div>
</div>
<div className="flex items-center gap-2">
<button
onClick={() => handleScale((targetReplicas || k8sReplicas.desired) - 1)}
disabled={scaling || (targetReplicas || k8sReplicas.desired) <= 0}
className="w-8 h-8 flex items-center justify-center bg-gray-100 text-gray-700 rounded-lg hover:bg-gray-200 disabled:opacity-50 disabled:cursor-not-allowed transition-colors"
title="Scale down"
>
<Minus className="w-4 h-4" />
</button>
<input
type="number"
min="0"
max="20"
value={targetReplicas ?? k8sReplicas.desired}
onChange={(e) => setTargetReplicas(Math.max(0, Math.min(20, parseInt(e.target.value) || 0)))}
onBlur={() => {
if (targetReplicas !== null && targetReplicas !== k8sReplicas.desired) {
handleScale(targetReplicas);
}
}}
onKeyDown={(e) => {
if (e.key === 'Enter' && targetReplicas !== null && targetReplicas !== k8sReplicas.desired) {
handleScale(targetReplicas);
}
}}
className="w-16 text-center border border-gray-300 rounded-lg px-2 py-1 text-lg font-semibold"
/>
<button
onClick={() => handleScale((targetReplicas || k8sReplicas.desired) + 1)}
disabled={scaling || (targetReplicas || k8sReplicas.desired) >= 20}
className="w-8 h-8 flex items-center justify-center bg-gray-100 text-gray-700 rounded-lg hover:bg-gray-200 disabled:opacity-50 disabled:cursor-not-allowed transition-colors"
title="Scale up"
>
<Plus className="w-4 h-4" />
</button>
{scaling && <Loader2 className="w-4 h-4 text-purple-600 animate-spin ml-2" />}
</div>
</div>
{k8sError && (
<p className="text-xs text-red-500 mt-2">{k8sError}</p>
)}
</div>
)}
{/* Stats Cards */}
<div className="grid grid-cols-5 gap-4">
<div className="bg-white rounded-lg border border-gray-200 p-4">

@@ -1,4 +1,67 @@
# Task Worker Pods
# Task Worker Deployment
#
# Simple Deployment that runs task-worker.js to process tasks from worker_tasks queue.
# Workers pull tasks using DB-level locking (FOR UPDATE SKIP LOCKED).
#
# The worker will wait up to 60 minutes for active proxies to be added before failing.
# This allows deployment to succeed even if proxies aren't configured yet.
---
apiVersion: apps/v1
kind: Deployment
metadata:
name: scraper-worker
namespace: dispensary-scraper
spec:
replicas: 25
selector:
matchLabels:
app: scraper-worker
template:
metadata:
labels:
app: scraper-worker
spec:
imagePullSecrets:
- name: regcred
containers:
- name: worker
image: code.cannabrands.app/creationshop/dispensary-scraper:latest
command: ["node"]
args: ["dist/tasks/task-worker.js"]
envFrom:
- configMapRef:
name: scraper-config
- secretRef:
name: scraper-secrets
env:
- name: WORKER_MODE
value: "true"
- name: POD_NAME
valueFrom:
fieldRef:
fieldPath: metadata.name
resources:
requests:
memory: "256Mi"
cpu: "100m"
limits:
memory: "512Mi"
cpu: "500m"
livenessProbe:
exec:
command:
- /bin/sh
- -c
- "pgrep -f 'task-worker' > /dev/null"
initialDelaySeconds: 60
periodSeconds: 30
failureThreshold: 3
terminationGracePeriodSeconds: 60
---
# =============================================================================
# ALTERNATIVE: StatefulSet with multiple workers per pod (not currently used)
# =============================================================================
# Task Worker Pods (StatefulSet)
# Each pod runs 5 role-agnostic workers that pull tasks from worker_tasks queue.
#
# Architecture:

@@ -46,14 +46,17 @@ class CannaIQ_Menus_Plugin {
// Initialize plugin
load_plugin_textdomain('cannaiq-menus', false, dirname(plugin_basename(__FILE__)) . '/languages');
// Register shortcodes
// Register shortcodes - primary CannaIQ shortcodes
add_shortcode('cannaiq_products', [$this, 'products_shortcode']);
add_shortcode('cannaiq_product', [$this, 'single_product_shortcode']);
// Legacy shortcode support (backward compatibility)
add_shortcode('crawlsy_products', [$this, 'products_shortcode']);
add_shortcode('crawlsy_product', [$this, 'single_product_shortcode']);
add_shortcode('dutchie_products', [$this, 'products_shortcode']);
add_shortcode('dutchie_product', [$this, 'single_product_shortcode']);
// DEPRECATED: Legacy shortcode aliases for backward compatibility only
// These allow sites that used the old plugin names to continue working
// New implementations should use [cannaiq_products] and [cannaiq_product]
add_shortcode('crawlsy_products', [$this, 'products_shortcode']); // deprecated
add_shortcode('crawlsy_product', [$this, 'single_product_shortcode']); // deprecated
add_shortcode('dutchie_products', [$this, 'products_shortcode']); // deprecated
add_shortcode('dutchie_product', [$this, 'single_product_shortcode']); // deprecated
}
/**
@@ -114,7 +117,9 @@ class CannaIQ_Menus_Plugin {
public function register_settings() {
register_setting('cannaiq_menus_settings', 'cannaiq_api_token');
// Migrate old settings if they exist
// MIGRATION: Auto-migrate API tokens from old plugin versions
// This runs once - if user had crawlsy or dutchie plugin, their token is preserved
// Can be removed in a future major version once all users have migrated
$old_crawlsy_token = get_option('crawlsy_api_token');
$old_dutchie_token = get_option('dutchie_api_token');

workflow-12102025.md (new file)
@@ -0,0 +1,365 @@
# Workflow Documentation - December 10, 2025
## Purpose
This document captures the intended behavior of the CannaIQ crawl system, specifically proxy rotation, fingerprinting, and anti-detection.
---
## Stealth & Anti-Detection Requirements
### 1. Task Determines Work, Proxy Determines Identity
The task payload contains:
- `dispensary_id` - which store to crawl
- `role` - what type of work (product_resync, entry_point_discovery, etc.)
The **proxy** determines the session identity:
- Proxy location (city, state, timezone) → sets Accept-Language and timezone headers
- Language is always English (`en-US`)
**Flow:**
```
Task claimed
└─► Get proxy from rotation
└─► Proxy has location (city, state, timezone)
└─► Build headers using proxy's timezone
- Accept-Language: en-US,en;q=0.9
- Timezone-consistent behavior
```
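The proxy-to-identity step above can be sketched as follows. The `ProxyLocation` shape and `identityFromProxy` helper are illustrative stand-ins; the real code path is `crawlRotator.getProxyLocation()` in `crawl-rotator.ts`:

```typescript
// Illustrative shapes -- the real interfaces live in crawl-rotator.ts.
interface ProxyLocation {
  city: string;
  state: string;
  timezone: string; // e.g. "America/Phoenix"
}

interface SessionIdentity {
  acceptLanguage: string; // always en-US regardless of location
  timezone: string;       // drives timezone-consistent behavior
}

// The proxy's location sets the session identity; language never varies.
function identityFromProxy(location: ProxyLocation): SessionIdentity {
  return {
    acceptLanguage: 'en-US,en;q=0.9',
    timezone: location.timezone,
  };
}

const identity = identityFromProxy({
  city: 'Tempe',
  state: 'AZ',
  timezone: 'America/Phoenix',
});
console.log(identity.acceptLanguage); // en-US,en;q=0.9
```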
### 2. On 403 Block - Immediate Backoff
When a 403 is received:
1. **Immediately** stop using current IP
2. Get a new proxy (new IP)
3. Get a new UA/fingerprint
4. Retry the request
**Per-proxy failure tracking:**
- Track UA rotation attempts per proxy
- After 3 UA/fingerprint rotations on the same proxy → disable that proxy
- This means: if we rotate UA 3 times and still get 403, the proxy is burned
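The per-proxy tracking can be sketched like this (class and field names are illustrative stand-ins for the `markBlocked()`/`markSuccess()` logic in `crawl-rotator.ts`):

```typescript
// Sketch of the per-proxy 403 tracking described above (illustrative).
class ProxyHealth {
  consecutive403Count = 0;
  active = true;

  // Called on each 403: another UA/fingerprint was burned on this IP.
  markBlocked(): void {
    this.consecutive403Count += 1;
    if (this.consecutive403Count >= 3) {
      this.active = false; // proxy is burned -- stop handing it out
    }
  }

  // Any successful request proves the IP still works; reset the streak.
  markSuccess(): void {
    this.consecutive403Count = 0;
  }
}

const proxy = new ProxyHealth();
proxy.markBlocked();
proxy.markBlocked();
proxy.markSuccess(); // streak resets before the third strike
proxy.markBlocked();
console.log(proxy.active); // true -- never reached 3 consecutive 403s
```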
### 3. Fingerprint Rotation Rules
Each request uses:
- Proxy (IP)
- User-Agent
- sec-ch-ua headers (Client Hints)
- Accept-Language (from proxy location)
On 403:
1. Record failure on current proxy
2. Rotate to new proxy
3. Pick new random fingerprint
4. If same proxy fails 3 times with different fingerprints → disable proxy
### 4. Proxy Table Schema
```sql
CREATE TABLE proxies (
id SERIAL PRIMARY KEY,
host VARCHAR(255) NOT NULL,
port INTEGER NOT NULL,
username VARCHAR(100),
password VARCHAR(100),
protocol VARCHAR(10) DEFAULT 'http',
active BOOLEAN DEFAULT true,
-- Location (determines session headers)
city VARCHAR(100),
state VARCHAR(50),
country VARCHAR(100),
country_code VARCHAR(10),
timezone VARCHAR(50),
-- Health tracking
failure_count INTEGER DEFAULT 0,
consecutive_403_count INTEGER DEFAULT 0, -- Track 403s specifically
last_used_at TIMESTAMPTZ,
last_failure_at TIMESTAMPTZ,
last_error TEXT,
-- Performance
response_time_ms INTEGER,
max_connections INTEGER DEFAULT 1
);
```
### 5. Failure Threshold
- **3 consecutive 403s** with different fingerprints → disable proxy
- Reset `consecutive_403_count` to 0 on successful request
- General `failure_count` tracks all errors (timeouts, connection errors, etc.)
---
## Implementation Status
### COMPLETED - December 10, 2025
All code changes have been implemented per this specification:
#### 1. crawl-rotator.ts ✅
- [x] Added `consecutive403Count` to Proxy interface
- [x] Added `markBlocked()` method that increments `consecutive_403_count` and disables proxy at 3
- [x] Added `getProxyTimezone()` to return current proxy's timezone
- [x] `markSuccess()` now resets `consecutive_403_count` to 0
- [x] Replaced hardcoded UA list with `intoli/user-agents` library for realistic fingerprints
- [x] `BrowserFingerprint` interface includes full fingerprint data (UA, platform, screen size, viewport, sec-ch-ua headers)
#### 2. client.ts ✅
- [x] `startSession()` no longer takes state/timezone params
- [x] `startSession()` gets identity from proxy via `crawlRotator.getProxyLocation()`
- [x] Added `handle403Block()` that:
- Calls `crawlRotator.recordBlock()` (tracks consecutive 403s)
- Immediately rotates both proxy and fingerprint via `rotateBoth()`
- Returns false if no more proxies available
- [x] `executeGraphQL()` calls `handle403Block()` on 403 (not `rotateProxyOn403`)
- [x] `fetchPage()` uses same 403 handling
- [x] 500ms backoff after rotation (not linear delay)
#### 3. Task Handlers ✅
- [x] `entry-point-discovery.ts`: `startSession()` called with no params
- [x] `product-refresh.ts`: `startSession()` called with no params
#### 4. Dependencies ✅
- [x] Added `user-agents` npm package for realistic UA generation
---
## Files Changed
| File | Changes |
|------|---------|
| `backend/src/services/crawl-rotator.ts` | Complete rewrite with `consecutive403Count`, `markBlocked()`, `intoli/user-agents` |
| `backend/src/platforms/dutchie/client.ts` | `startSession()` uses proxy location, `handle403Block()` for 403 handling |
| `backend/src/tasks/handlers/entry-point-discovery.ts` | `startSession()` no params |
| `backend/src/tasks/handlers/product-refresh.ts` | `startSession()` no params |
| `backend/package.json` | Added `user-agents` dependency |
---
## Migration Required
The `proxies` table needs `consecutive_403_count` column if not already present:
```sql
ALTER TABLE proxies ADD COLUMN IF NOT EXISTS consecutive_403_count INTEGER DEFAULT 0;
```
---
## Key Behaviors Summary
| Behavior | Implementation |
|----------|----------------|
| Session identity | From proxy location (`getProxyLocation()`) |
| Language | Always `en-US,en;q=0.9` |
| 403 handling | `handle403Block()` → `recordBlock()` → `rotateBoth()` |
| Proxy disable | After 3 consecutive 403s (`consecutive403Count >= 3`) |
| Success reset | `markSuccess()` resets `consecutive403Count` to 0 |
| UA generation | `intoli/user-agents` library (daily updated, realistic fingerprints) |
| Fingerprint data | Full: UA, platform, screen size, viewport, sec-ch-ua headers |
---
## User-Agent Generation
### Data Source
The `intoli/user-agents` npm library provides daily-updated market share data collected from Intoli's residential proxy network (millions of real users). The package auto-releases new versions daily to npm.
### Device Category Distribution (hardcoded)
| Category | Share |
|----------|-------|
| Mobile | 62% |
| Desktop | 36% |
| Tablet | 2% |
### Browser Filter (whitelist only)
Only these browsers are allowed:
- Chrome (67%)
- Safari (20%)
- Edge (6%)
- Firefox (3%)
Samsung Internet, Opera, and other niche browsers are filtered out.
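A rough UA-string classifier for this whitelist (illustrative only; the actual filtering uses the `user-agents` library's metadata rather than string matching). Token order matters because Edge UAs contain "Chrome" and Chrome UAs contain "Safari":

```typescript
// Illustrative classifier for the browser whitelist above.
type Browser = 'Chrome' | 'Safari' | 'Edge' | 'Firefox';

function detectBrowser(ua: string): Browser | null {
  // Niche Chromium forks carry a Chrome/ token too -- exclude them first.
  if (/OPR\/|SamsungBrowser\//.test(ua)) return null;
  if (ua.includes('Edg/')) return 'Edge';
  if (ua.includes('Firefox/')) return 'Firefox';
  if (ua.includes('Chrome/')) return 'Chrome';
  if (ua.includes('Safari/')) return 'Safari';
  return null;
}

const chromeUA =
  'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 ' +
  '(KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36';
console.log(detectBrowser(chromeUA)); // Chrome
```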
### Desktop OS Distribution (from library)
| OS | Share |
|----|-------|
| Windows | 72% |
| macOS | 17% |
| Linux | 4% |
### UA Lifecycle
1. **Session start** (new proxy IP obtained) → Roll device category (62/36/2) → Generate UA filtered to device + top 4 browsers → Store on session
2. **UA sticks** until IP rotates (403 block or manual rotation)
3. **IP rotation** triggers new UA generation
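The 62/36/2 roll at session start can be sketched with an injectable RNG for determinism (the `rollDeviceCategory` helper is hypothetical; production code would pass `Math.random`):

```typescript
// Weighted device-category roll per the distribution above (sketch).
const DEVICE_WEIGHTS: Array<[string, number]> = [
  ['mobile', 62],
  ['desktop', 36],
  ['tablet', 2],
];

function rollDeviceCategory(rng: () => number = Math.random): string {
  let roll = rng() * 100; // uniform in [0, 100)
  for (const [category, weight] of DEVICE_WEIGHTS) {
    if (roll < weight) return category;
    roll -= weight; // fall through to the next bucket
  }
  return 'desktop'; // floating-point guard; effectively unreachable
}

console.log(rollDeviceCategory(() => 0.5));  // mobile (50 < 62)
console.log(rollDeviceCategory(() => 0.97)); // desktop (97 - 62 = 35 < 36)
```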
### Failure Handling
- If UA generation fails → Alert admin dashboard, **stop crawl immediately**
- No fallback to static UA list
- This forces investigation rather than silent degradation
### Session Logging
Each session logs:
- Device category (mobile/desktop/tablet)
- Full UA string
- Browser name (Chrome/Safari/Edge/Firefox)
- IP address (from proxy)
- Session start timestamp
Logs are rotated monthly.
### Implementation
Located in `backend/src/services/crawl-rotator.ts`:
```typescript
// Per workflow-12102025.md: Device category distribution
const DEVICE_WEIGHTS = { mobile: 62, desktop: 36, tablet: 2 };
// Per workflow-12102025.md: Browser whitelist
const ALLOWED_BROWSERS = ['Chrome', 'Safari', 'Edge', 'Firefox'];
```
---
## HTTP Fingerprinting
### Goal
Make HTTP requests indistinguishable from real browser traffic. No repeatable footprint.
### Components
1. **Full Header Set** - All headers a real browser sends
2. **Header Ordering** - Browser-specific order (Chrome vs Firefox vs Safari)
3. **TLS Fingerprint** - Use `curl-impersonate` to match browser TLS signature
4. **Dynamic Referer** - Set per dispensary being crawled
5. **Natural Randomization** - Vary optional headers like real users
### Required Headers
| Header | Chrome | Firefox | Safari | Notes |
|--------|--------|---------|--------|-------|
| `User-Agent` | ✅ | ✅ | ✅ | From UA generation |
| `Accept` | ✅ | ✅ | ✅ | Content types |
| `Accept-Language` | ✅ | ✅ | ✅ | Always `en-US,en;q=0.9` |
| `Accept-Encoding` | ✅ | ✅ | ✅ | `gzip, deflate, br` |
| `Connection` | ✅ | ✅ | ✅ | `keep-alive` |
| `Origin` | ✅ | ✅ | ✅ | `https://dutchie.com` (POST only) |
| `Referer` | ✅ | ✅ | ✅ | Dynamic per dispensary |
| `sec-ch-ua` | ✅ | ❌ | ❌ | Chromium only |
| `sec-ch-ua-mobile` | ✅ | ❌ | ❌ | Chromium only |
| `sec-ch-ua-platform` | ✅ | ❌ | ❌ | Chromium only |
| `sec-fetch-dest` | ✅ | ✅ | ❌ | `empty` for XHR |
| `sec-fetch-mode` | ✅ | ✅ | ❌ | `cors` for XHR |
| `sec-fetch-site` | ✅ | ✅ | ❌ | `same-origin` |
| `Upgrade-Insecure-Requests` | ✅ | ✅ | ✅ | `1` (page loads only) |
| `DNT` | ~30% | ~30% | ~30% | Randomized per session |
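Assembling the Chrome XHR set from this table might look like the following sketch (the `chromeXhrHeaders` helper and literal values are illustrative; real values come from the generated fingerprint and the proxy-derived session):

```typescript
// Sketch: Chrome-profile XHR headers per the table above (illustrative).
function chromeXhrHeaders(opts: {
  userAgent: string;
  secChUa: string;   // from the generated fingerprint
  referer: string;   // dynamic, per dispensary
  sendDnt: boolean;  // rolled once per session (~30% true)
}): Record<string, string> {
  const headers: Record<string, string> = {
    'User-Agent': opts.userAgent,
    Accept: '*/*',
    'Accept-Language': 'en-US,en;q=0.9',
    'Accept-Encoding': 'gzip, deflate, br',
    Connection: 'keep-alive',
    Origin: 'https://dutchie.com',
    Referer: opts.referer,
    'sec-ch-ua': opts.secChUa,
    'sec-ch-ua-mobile': '?0',
    'sec-ch-ua-platform': '"Windows"',
    'sec-fetch-dest': 'empty',
    'sec-fetch-mode': 'cors',
    'sec-fetch-site': 'same-origin',
  };
  if (opts.sendDnt) headers['DNT'] = '1'; // only ~30% of sessions send it
  return headers;
}

const h = chromeXhrHeaders({
  userAgent: 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) ...',
  secChUa: '"Chromium";v="131"',
  referer: 'https://dutchie.com/dispensary/harvest-of-tempe',
  sendDnt: false,
});
console.log('DNT' in h); // false
```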
### Header Ordering
Each browser sends headers in a specific order. Fingerprinting services detect mismatches.
**Chrome order (GraphQL request):**
1. Host
2. Connection
3. Content-Length (POST)
4. sec-ch-ua
5. DNT (if enabled)
6. sec-ch-ua-mobile
7. User-Agent
8. sec-ch-ua-platform
9. Content-Type (POST)
10. Accept
11. Origin (POST)
12. sec-fetch-site
13. sec-fetch-mode
14. sec-fetch-dest
15. Referer
16. Accept-Encoding
17. Accept-Language
**Firefox order (GraphQL request):**
1. Host
2. User-Agent
3. Accept
4. Accept-Language
5. Accept-Encoding
6. Content-Type (POST)
7. Content-Length (POST)
8. Origin (POST)
9. DNT (if enabled)
10. Connection
11. Referer
12. sec-fetch-dest
13. sec-fetch-mode
14. sec-fetch-site
**Safari order (GraphQL request):**
1. Host
2. Connection
3. Content-Length (POST)
4. Accept
5. User-Agent
6. Content-Type (POST)
7. Origin (POST)
8. Referer
9. Accept-Encoding
10. Accept-Language
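Order-preserving header assembly might look like this (sketch only; `orderHeaders` is a hypothetical helper for what `http-fingerprint.ts` is meant to do, and only the Chrome order array from above is shown):

```typescript
// Chrome header order from the list above (sketch).
const CHROME_ORDER = [
  'Host', 'Connection', 'Content-Length', 'sec-ch-ua', 'DNT',
  'sec-ch-ua-mobile', 'User-Agent', 'sec-ch-ua-platform', 'Content-Type',
  'Accept', 'Origin', 'sec-fetch-site', 'sec-fetch-mode', 'sec-fetch-dest',
  'Referer', 'Accept-Encoding', 'Accept-Language',
];

// Emit [name, value] pairs in browser order, skipping headers the
// session didn't set (e.g. DNT for the ~70% of sessions without it).
function orderHeaders(
  order: string[],
  headers: Record<string, string>,
): Array<[string, string]> {
  return order
    .filter((name) => name in headers)
    .map((name) => [name, headers[name]] as [string, string]);
}

const ordered = orderHeaders(CHROME_ORDER, {
  'User-Agent': 'Mozilla/5.0 ...',
  'sec-ch-ua': '"Chromium";v="131"',
  Accept: '*/*',
  Host: 'dutchie.com',
});
console.log(ordered.map(([n]) => n)); // [ 'Host', 'sec-ch-ua', 'User-Agent', 'Accept' ]
```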
### TLS Fingerprinting
Use `curl-impersonate` instead of standard curl:
- `curl_chrome131` - Mimics Chrome 131 TLS handshake
- `curl_ff133` - Mimics Firefox 133 TLS handshake
- `curl_safari17` - Mimics Safari 17 TLS handshake
Match the TLS binary to the browser family in the session's User-Agent.
### Dynamic Referer
Set Referer to the dispensary's actual page URL:
```
Crawling "harvest-of-tempe" → Referer: https://dutchie.com/dispensary/harvest-of-tempe
Crawling "zen-leaf-mesa" → Referer: https://dutchie.com/dispensary/zen-leaf-mesa
```
Derived from dispensary's `menu_url` field.
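A sketch of that derivation, assuming the dispensary slug is the last path segment of `menu_url` (the `refererFor` helper is hypothetical; adjust if `menu_url` uses a different shape):

```typescript
// Derive the per-dispensary Referer from menu_url (illustrative).
function refererFor(menuUrl: string): string {
  const slug = new URL(menuUrl).pathname.split('/').filter(Boolean).pop();
  return `https://dutchie.com/dispensary/${slug}`;
}

console.log(refererFor('https://dutchie.com/embedded-menu/harvest-of-tempe'));
// https://dutchie.com/dispensary/harvest-of-tempe
```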
### Natural Randomization
Per-session randomization (set once when session starts, consistent for session):
| Feature | Distribution | Implementation |
|---------|--------------|----------------|
| DNT header | 30% have it | `Math.random() < 0.30` |
| Accept quality values | Slight variation | `q=0.9` vs `q=0.8` |
### Implementation Files
| File | Purpose |
|------|---------|
| `src/services/crawl-rotator.ts` | `BrowserFingerprint` includes full header config |
| `src/platforms/dutchie/client.ts` | Build headers from fingerprint, use curl-impersonate |
| `src/services/http-fingerprint.ts` | Header ordering per browser (NEW) |