Compare commits

7 Commits

Author SHA1 Message Date
Kelly
b1ab45f662 fix: Fix TypeScript errors for CI
- Add export {} to cli.ts and discover-and-import-store.ts to treat as modules
- Remove scrapeStore reference in crawler-jobs.ts (legacy scraper removed)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-08 10:45:47 -07:00
Kelly
20300edbb8 fix: Remove platform_id_source from harmonization INSERT
Production DB doesn't have this column - removing to allow creates.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-08 10:23:19 -07:00
Kelly
b7cfec0770 feat: AZ dispensary harmonization with Dutchie source of truth
Major changes:
- Add harmonize-az-dispensaries.ts script to sync dispensaries with Dutchie API
- Add migration 057 for crawl_enabled and dutchie_verified fields
- Remove legacy dutchie-az module (replaced by platforms/dutchie)
- Clean up deprecated crawlers, scrapers, and orchestrator code
- Update location-discovery to not fallback to slug when ID is missing
- Add crawl-rotator service for proxy rotation
- Add types/index.ts for shared type definitions
- Add woodpecker-agent k8s manifest

Harmonization script:
- Queries ConsumerDispensaries API for all 32 AZ cities
- Matches dispensaries by platform_dispensary_id (not slug)
- Updates existing records with full Dutchie data
- Creates new records for unmatched Dutchie dispensaries
- Disables dispensaries not found in Dutchie

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-08 10:19:49 -07:00
Kelly
948a732dd5 feat: Rename WordPress plugin to CannaIQ Menus v1.5.3
- Rename plugin from Crawlsy Menus to CannaIQ Menus
- Update version to 1.5.3
- Update text domain to cannaiq-menus
- Rename all CSS classes from crawlsy-* to cannaiq-*
- Update shortcodes to [cannaiq_products] and [cannaiq_product]
- Add backward compatibility for legacy shortcodes
- Update download links on Home and LandingPage
- Fix health panel Redis timeout issue
- Add clear error message when backend not running

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-08 00:24:43 -07:00
Kelly
bf4ceaf09e fix: Make all migrations idempotent
- 008: Add IF NOT EXISTS to ALTER TABLE ADD COLUMN
- 011: Add IF NOT EXISTS to CREATE TABLE and INDEX
- 012: Add IF NOT EXISTS, DROP TRIGGER IF EXISTS
- 013: Add ON CONFLICT (azdhs_id) DO NOTHING
- 014: Add IF NOT EXISTS to ALTER TABLE ADD COLUMN

All migrations can now be safely re-run without errors.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-07 23:48:35 -07:00
kelly
fda688b11a Merge pull request 'feat: Responsive UI, SEO pages, AI content generation' (#2) from feature/workers-dashboard into master
2025-12-08 06:43:36 +00:00
kelly
414b97b3c0 Merge pull request 'feature/workers-dashboard' (#1) from feature/workers-dashboard into master
Reviewed-on: https://code.cannabrands.app/Creationshop/dispensary-scraper/pulls/1
2025-12-08 02:54:26 +00:00
129 changed files with 4219 additions and 34970 deletions

@@ -193,6 +193,44 @@ CannaiQ has **TWO databases** with distinct purposes:
| `dutchie_menus` | **Canonical CannaiQ database** - All schema, migrations, and application data | READ/WRITE |
| `dutchie_legacy` | **Legacy read-only archive** - Historical data from old system | READ-ONLY |
### Store vs Dispensary Terminology
**"Store" and "Dispensary" are SYNONYMS in CannaiQ.**
| Term | Usage | DB Table |
|------|-------|----------|
| Store | API routes (`/api/stores`) | `dispensaries` |
| Dispensary | DB table, internal code | `dispensaries` |
- `/api/stores` and `/api/dispensaries` both query the `dispensaries` table
- There is NO `stores` table in use - it's a legacy empty table
- Use these terms interchangeably in code and documentation
### Canonical vs Legacy Tables
**CANONICAL TABLES (USE THESE):**
| Table | Purpose | Row Count |
|-------|---------|-----------|
| `dispensaries` | Store/dispensary records | ~188+ rows |
| `dutchie_products` | Product catalog | ~37,000+ rows |
| `dutchie_product_snapshots` | Price/stock history | ~millions |
| `store_products` | Canonical product schema | ~37,000+ rows |
| `store_product_snapshots` | Canonical snapshot schema | growing |
**LEGACY TABLES (EMPTY - DO NOT USE):**
| Table | Status | Action |
|-------|--------|--------|
| `stores` | EMPTY (0 rows) | Use `dispensaries` instead |
| `products` | EMPTY (0 rows) | Use `dutchie_products` or `store_products` |
| `categories` | EMPTY (0 rows) | Categories stored in product records |
**Code must NEVER:**
- Query the `stores` table (use `dispensaries`)
- Query the `products` table (use `dutchie_products` or `store_products`)
- Query the `categories` table (categories are in product records)
**CRITICAL RULES:**
- **Migrations ONLY run on `dutchie_menus`** - NEVER on `dutchie_legacy`
- **Application code connects ONLY to `dutchie_menus`**
@@ -615,15 +653,28 @@ export default defineConfig({
 ### Detailed Rules
-1) **Dispensary vs Store**
-- Dutchie pipeline uses `dispensaries` (not legacy `stores`). For dutchie crawls, always work with dispensary ID.
+1) **Dispensary = Store (SAME THING)**
+- "Dispensary" and "store" are synonyms in CannaiQ. Use interchangeably.
+- **API endpoint**: `/api/stores` (NOT `/api/dispensaries`)
+- **DB table**: `dispensaries`
+- When you need to create/query stores via API, use `/api/stores`
 - Use the record's `menu_url` and `platform_dispensary_id`.
-2) **Menu detection and platform IDs**
+2) **API Authentication**
+- **Trusted Origins (no auth needed)**:
+- IPs: `127.0.0.1`, `::1`, `::ffff:127.0.0.1`
+- Origins: `https://cannaiq.co`, `https://findadispo.com`, `https://findagram.co`
+- Also: `http://localhost:3010`, `http://localhost:8080`, `http://localhost:5173`
+- Requests from trusted IPs/origins get automatic admin access (`role: 'internal'`)
+- **Remote (non-trusted)**: Use Bearer token (JWT or API token). NO username/password auth.
+- Never try to login with username/password via API - use tokens only.
+- See `src/auth/middleware.ts` for `TRUSTED_ORIGINS` and `TRUSTED_IPS` lists.
+3) **Menu detection and platform IDs**
 - Set `menu_type` from `menu_url` detection; resolve `platform_dispensary_id` for `menu_type='dutchie'`.
 - Admin should have "refresh detection" and "resolve ID" actions; schedule/crawl only when `menu_type='dutchie'` AND `platform_dispensary_id` is set.
-3) **Queries and mapping**
+4) **Queries and mapping**
 - The DB returns snake_case; code expects camelCase. Always alias/map:
 - `platform_dispensary_id AS "platformDispensaryId"`
 - Map via `mapDbRowToDispensary` when loading dispensaries (scheduler, crawler, admin crawl).
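
The "Queries and mapping" and "Menu detection" rules above can be sketched as follows. This is a minimal illustration under assumptions, not the project's actual `mapDbRowToDispensary`: the row shape is reduced to a few columns, and `DispensaryRowSketch`, `DispensarySketch`, `mapRowSketch`, and `isCrawlableSketch` are hypothetical names.

```typescript
// Hypothetical, reduced row shape as returned by the DB (snake_case).
interface DispensaryRowSketch {
  id: number;
  menu_type: string | null;
  menu_url: string | null;
  platform_dispensary_id: string | null;
}

// camelCase shape the application code expects (illustrative subset).
interface DispensarySketch {
  id: number;
  menuType: string | null;
  menuUrl: string | null;
  platformDispensaryId: string | null;
}

// Map one snake_case row to the camelCase shape.
function mapRowSketch(row: DispensaryRowSketch): DispensarySketch {
  return {
    id: row.id,
    menuType: row.menu_type,
    menuUrl: row.menu_url,
    platformDispensaryId: row.platform_dispensary_id,
  };
}

// Crawl only when menu_type is 'dutchie' AND a platform ID is set.
function isCrawlableSketch(d: DispensarySketch): boolean {
  return d.menuType === 'dutchie' && !!d.platformDispensaryId;
}
```

Mapping once at the query boundary keeps the rest of the code on a single naming convention, so the crawl check never has to know about column names.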

@@ -1,30 +1,52 @@
# CannaiQ Backend Environment Configuration
# Copy this file to .env and fill in the values
# Server
PORT=3010
NODE_ENV=development
# =============================================================================
# CannaiQ Database (dutchie_menus) - PRIMARY DATABASE
# CANNAIQ DATABASE (dutchie_menus) - PRIMARY DATABASE
# =============================================================================
# This is where all schema migrations run and where canonical tables live.
# All CANNAIQ_DB_* variables are REQUIRED - connection will fail if missing.
# This is where ALL schema migrations run and where canonical tables live.
# All CANNAIQ_DB_* variables are REQUIRED - no defaults.
# The application will fail to start if any are missing.
CANNAIQ_DB_HOST=localhost
CANNAIQ_DB_PORT=54320
CANNAIQ_DB_NAME=dutchie_menus
CANNAIQ_DB_NAME=dutchie_menus # MUST be dutchie_menus - NOT dutchie_legacy
CANNAIQ_DB_USER=dutchie
CANNAIQ_DB_PASS=dutchie_local_pass
# Alternative: Use a full connection URL instead of individual vars
# If set, this takes priority over individual vars above
# CANNAIQ_DB_URL=postgresql://user:pass@host:port/dutchie_menus
# =============================================================================
# Legacy Database (dutchie_legacy) - READ-ONLY SOURCE
# LEGACY DATABASE (dutchie_legacy) - READ-ONLY FOR ETL
# =============================================================================
# Used ONLY by ETL scripts to read historical data.
# NEVER run migrations against this database.
# These are only needed when running 042_legacy_import.ts
LEGACY_DB_HOST=localhost
LEGACY_DB_PORT=54320
LEGACY_DB_NAME=dutchie_legacy
LEGACY_DB_NAME=dutchie_legacy # READ-ONLY - never migrated
LEGACY_DB_USER=dutchie
LEGACY_DB_PASS=dutchie_local_pass
LEGACY_DB_PASS=
# Local image storage (no MinIO per CLAUDE.md)
# Alternative: Use a full connection URL instead of individual vars
# LEGACY_DB_URL=postgresql://user:pass@host:port/dutchie_legacy
# =============================================================================
# LOCAL STORAGE
# =============================================================================
# Local image storage path (no MinIO)
LOCAL_IMAGES_PATH=./public/images
# JWT
# =============================================================================
# AUTHENTICATION
# =============================================================================
JWT_SECRET=your-secret-key-change-in-production
ANTHROPIC_API_KEY=your-anthropic-api-key
OPENAI_API_KEY=your-openai-api-key
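
The comments in this file describe two behaviours: `CANNAIQ_DB_URL` takes priority when set, and otherwise every `CANNAIQ_DB_*` variable is required with no defaults. A minimal sketch of that resolution logic follows. It assumes the environment is passed in as a plain object, and `EnvSketch`, `REQUIRED_CANNAIQ_VARS`, and `resolveCannaiqDbUrl` are hypothetical names rather than the project's actual loader.

```typescript
// Sketch only: resolve the primary database connection string from env vars.
type EnvSketch = Record<string, string | undefined>;

const REQUIRED_CANNAIQ_VARS = [
  'CANNAIQ_DB_HOST',
  'CANNAIQ_DB_PORT',
  'CANNAIQ_DB_NAME',
  'CANNAIQ_DB_USER',
  'CANNAIQ_DB_PASS',
];

function resolveCannaiqDbUrl(env: EnvSketch): string {
  // A full URL, if present, wins over the individual variables.
  const url = env['CANNAIQ_DB_URL'];
  if (url) return url;

  // No defaults: report every missing variable at once.
  const missing = REQUIRED_CANNAIQ_VARS.filter((key) => !env[key]);
  if (missing.length > 0) {
    throw new Error(`Missing required env vars: ${missing.join(', ')}`);
  }

  // Encode credentials so special characters do not break the URL.
  const user = encodeURIComponent(env['CANNAIQ_DB_USER'] ?? '');
  const pass = encodeURIComponent(env['CANNAIQ_DB_PASS'] ?? '');
  return `postgresql://${user}:${pass}@${env['CANNAIQ_DB_HOST']}:${env['CANNAIQ_DB_PORT']}/${env['CANNAIQ_DB_NAME']}`;
}
```

In an application this would typically be called once at startup with `process.env`, so a missing variable stops the process before any query runs.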

@@ -1,18 +1,18 @@
 -- Add location columns to proxies table
 ALTER TABLE proxies
-ADD COLUMN city VARCHAR(100),
-ADD COLUMN state VARCHAR(100),
-ADD COLUMN country VARCHAR(100),
-ADD COLUMN country_code VARCHAR(2),
-ADD COLUMN location_updated_at TIMESTAMP;
+ADD COLUMN IF NOT EXISTS city VARCHAR(100),
+ADD COLUMN IF NOT EXISTS state VARCHAR(100),
+ADD COLUMN IF NOT EXISTS country VARCHAR(100),
+ADD COLUMN IF NOT EXISTS country_code VARCHAR(2),
+ADD COLUMN IF NOT EXISTS location_updated_at TIMESTAMP;
 -- Add index for location-based queries
-CREATE INDEX idx_proxies_location ON proxies(country_code, state, city);
+CREATE INDEX IF NOT EXISTS idx_proxies_location ON proxies(country_code, state, city);
 -- Add the same to failed_proxies table
 ALTER TABLE failed_proxies
-ADD COLUMN city VARCHAR(100),
-ADD COLUMN state VARCHAR(100),
-ADD COLUMN country VARCHAR(100),
-ADD COLUMN country_code VARCHAR(2),
-ADD COLUMN location_updated_at TIMESTAMP;
+ADD COLUMN IF NOT EXISTS city VARCHAR(100),
+ADD COLUMN IF NOT EXISTS state VARCHAR(100),
+ADD COLUMN IF NOT EXISTS country VARCHAR(100),
+ADD COLUMN IF NOT EXISTS country_code VARCHAR(2),
+ADD COLUMN IF NOT EXISTS location_updated_at TIMESTAMP;

@@ -1,6 +1,6 @@
 -- Create dispensaries table as single source of truth
 -- This consolidates azdhs_list (official data) + stores (menu data) into one table
-CREATE TABLE dispensaries (
+CREATE TABLE IF NOT EXISTS dispensaries (
 -- Primary key
 id SERIAL PRIMARY KEY,
@@ -43,11 +43,11 @@ CREATE TABLE dispensaries (
 );
 -- Create indexes for common queries
-CREATE INDEX idx_dispensaries_city ON dispensaries(city);
-CREATE INDEX idx_dispensaries_state ON dispensaries(state);
-CREATE INDEX idx_dispensaries_slug ON dispensaries(slug);
-CREATE INDEX idx_dispensaries_azdhs_id ON dispensaries(azdhs_id);
-CREATE INDEX idx_dispensaries_menu_status ON dispensaries(menu_scrape_status);
+CREATE INDEX IF NOT EXISTS idx_dispensaries_city ON dispensaries(city);
+CREATE INDEX IF NOT EXISTS idx_dispensaries_state ON dispensaries(state);
+CREATE INDEX IF NOT EXISTS idx_dispensaries_slug ON dispensaries(slug);
+CREATE INDEX IF NOT EXISTS idx_dispensaries_azdhs_id ON dispensaries(azdhs_id);
+CREATE INDEX IF NOT EXISTS idx_dispensaries_menu_status ON dispensaries(menu_scrape_status);
 -- Create index for location-based queries
-CREATE INDEX idx_dispensaries_location ON dispensaries(latitude, longitude) WHERE latitude IS NOT NULL AND longitude IS NOT NULL;
+CREATE INDEX IF NOT EXISTS idx_dispensaries_location ON dispensaries(latitude, longitude) WHERE latitude IS NOT NULL AND longitude IS NOT NULL;

@@ -1,6 +1,6 @@
 -- Create dispensary_changes table for change approval workflow
 -- This protects against accidental data destruction by requiring manual review
-CREATE TABLE dispensary_changes (
+CREATE TABLE IF NOT EXISTS dispensary_changes (
 id SERIAL PRIMARY KEY,
 dispensary_id INTEGER NOT NULL REFERENCES dispensaries(id) ON DELETE CASCADE,
@@ -26,10 +26,10 @@ CREATE TABLE dispensary_changes (
 );
 -- Create indexes for common queries
-CREATE INDEX idx_dispensary_changes_status ON dispensary_changes(status);
-CREATE INDEX idx_dispensary_changes_dispensary_status ON dispensary_changes(dispensary_id, status);
-CREATE INDEX idx_dispensary_changes_created_at ON dispensary_changes(created_at DESC);
-CREATE INDEX idx_dispensary_changes_requires_recrawl ON dispensary_changes(requires_recrawl) WHERE requires_recrawl = TRUE;
+CREATE INDEX IF NOT EXISTS idx_dispensary_changes_status ON dispensary_changes(status);
+CREATE INDEX IF NOT EXISTS idx_dispensary_changes_dispensary_status ON dispensary_changes(dispensary_id, status);
+CREATE INDEX IF NOT EXISTS idx_dispensary_changes_created_at ON dispensary_changes(created_at DESC);
+CREATE INDEX IF NOT EXISTS idx_dispensary_changes_requires_recrawl ON dispensary_changes(requires_recrawl) WHERE requires_recrawl = TRUE;
 -- Create function to automatically set requires_recrawl for website/menu_url changes
 CREATE OR REPLACE FUNCTION set_requires_recrawl()
@@ -42,7 +42,8 @@ BEGIN
 END;
 $$ LANGUAGE plpgsql;
--- Create trigger to call the function
+-- Create trigger to call the function (drop first to make idempotent)
+DROP TRIGGER IF EXISTS trigger_set_requires_recrawl ON dispensary_changes;
 CREATE TRIGGER trigger_set_requires_recrawl
 BEFORE INSERT ON dispensary_changes
 FOR EACH ROW

@@ -1,6 +1,7 @@
 -- Populate dispensaries table from azdhs_list
 -- This migrates all 182 AZDHS records with their enriched Google Maps data
 -- For multi-location dispensaries with duplicate slugs, append city name to make unique
+-- IDEMPOTENT: Uses ON CONFLICT DO NOTHING to skip already-imported records
 WITH ranked_dispensaries AS (
 SELECT
@@ -78,9 +79,10 @@ SELECT
 created_at,
 updated_at
 FROM ranked_dispensaries
-ORDER BY id;
+ORDER BY id
+ON CONFLICT (azdhs_id) DO NOTHING;
--- Verify the migration
+-- Verify the migration (idempotent - just logs, doesn't fail)
 DO $$
 DECLARE
 source_count INTEGER;
@@ -89,9 +91,11 @@ BEGIN
 SELECT COUNT(*) INTO source_count FROM azdhs_list;
 SELECT COUNT(*) INTO dest_count FROM dispensaries;
-RAISE NOTICE 'Migration complete: % records from azdhs_list % records in dispensaries', source_count, dest_count;
+RAISE NOTICE 'Migration status: % records in azdhs_list, % records in dispensaries', source_count, dest_count;
-IF source_count != dest_count THEN
-RAISE EXCEPTION 'Record count mismatch! Expected %, got %', source_count, dest_count;
+IF dest_count >= source_count THEN
+RAISE NOTICE 'OK: dispensaries table has expected records';
+ELSE
+RAISE WARNING 'dispensaries has fewer records than azdhs_list (% vs %)', dest_count, source_count;
 END IF;
 END $$;

@@ -3,15 +3,15 @@
 -- Add dispensary_id to products table
 ALTER TABLE products
-ADD COLUMN dispensary_id INTEGER REFERENCES dispensaries(id) ON DELETE CASCADE;
+ADD COLUMN IF NOT EXISTS dispensary_id INTEGER REFERENCES dispensaries(id) ON DELETE CASCADE;
 -- Add dispensary_id to categories table
 ALTER TABLE categories
-ADD COLUMN dispensary_id INTEGER REFERENCES dispensaries(id) ON DELETE CASCADE;
+ADD COLUMN IF NOT EXISTS dispensary_id INTEGER REFERENCES dispensaries(id) ON DELETE CASCADE;
 -- Create indexes for the new foreign keys
-CREATE INDEX idx_products_dispensary_id ON products(dispensary_id);
-CREATE INDEX idx_categories_dispensary_id ON categories(dispensary_id);
+CREATE INDEX IF NOT EXISTS idx_products_dispensary_id ON products(dispensary_id);
+CREATE INDEX IF NOT EXISTS idx_categories_dispensary_id ON categories(dispensary_id);
 -- NOTE: We'll populate these FKs and migrate data from stores in a separate data migration
 -- For now, new scrapers should use dispensary_id, but old store_id still works

@@ -0,0 +1,42 @@
-- Migration 057: Add crawl_enabled and dutchie_verified fields to dispensaries
--
-- Purpose:
-- 1. Add crawl_enabled to control which dispensaries get crawled
-- 2. Add dutchie_verified to track Dutchie source-of-truth verification
-- 3. Default existing records to crawl_enabled = TRUE to preserve behavior
--
-- After this migration, run the harmonization script to:
-- - Match dispensaries to Dutchie discoveries
-- - Update platform_dispensary_id from Dutchie
-- - Set dutchie_verified = TRUE for matches
-- - Set crawl_enabled = FALSE for unverified records
-- Add crawl_enabled column (defaults to true to not break existing crawls)
ALTER TABLE dispensaries
ADD COLUMN IF NOT EXISTS crawl_enabled BOOLEAN DEFAULT TRUE;
-- Add dutchie_verified column to track if record is verified against Dutchie
ALTER TABLE dispensaries
ADD COLUMN IF NOT EXISTS dutchie_verified BOOLEAN DEFAULT FALSE;
-- Add dutchie_verified_at timestamp
ALTER TABLE dispensaries
ADD COLUMN IF NOT EXISTS dutchie_verified_at TIMESTAMP WITH TIME ZONE;
-- Add dutchie_discovery_id to link back to the discovery record
ALTER TABLE dispensaries
ADD COLUMN IF NOT EXISTS dutchie_discovery_id BIGINT REFERENCES dutchie_discovery_locations(id);
-- Create index for crawl queries (only crawl enabled dispensaries)
CREATE INDEX IF NOT EXISTS idx_dispensaries_crawl_enabled
ON dispensaries(crawl_enabled, state)
WHERE crawl_enabled = TRUE;
-- Create index for dutchie verification status
CREATE INDEX IF NOT EXISTS idx_dispensaries_dutchie_verified
ON dispensaries(dutchie_verified, state);
COMMENT ON COLUMN dispensaries.crawl_enabled IS 'Whether this dispensary should be included in crawl jobs. Set to FALSE for unverified or problematic records.';
COMMENT ON COLUMN dispensaries.dutchie_verified IS 'Whether this dispensary has been verified against Dutchie source of truth (matched by slug or manually linked).';
COMMENT ON COLUMN dispensaries.dutchie_verified_at IS 'Timestamp when Dutchie verification was completed.';
COMMENT ON COLUMN dispensaries.dutchie_discovery_id IS 'Link to the dutchie_discovery_locations record this was matched/verified against.';
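
The harmonization steps named in this migration's header and in the commit message (match by `platform_dispensary_id`, update matches, create records for unmatched Dutchie dispensaries, disable records not found in Dutchie) can be sketched as a pure planning function. The actual `harmonize-az-dispensaries.ts` is not shown on this page, so everything below is an assumption for illustration: `LocalDispensarySketch`, `DutchieDispensarySketch`, `HarmonizationPlanSketch`, and `planHarmonization` are hypothetical names, and no database or API access is modelled.

```typescript
// Hypothetical minimal shapes; the real records carry many more fields.
interface LocalDispensarySketch {
  id: number;
  platformDispensaryId: string | null;
}

interface DutchieDispensarySketch {
  platformId: string;
  name: string;
}

interface HarmonizationPlanSketch {
  toUpdate: Array<{ localId: number; source: DutchieDispensarySketch }>;
  toCreate: DutchieDispensarySketch[];
  toDisable: number[]; // local ids to set crawl_enabled = FALSE
}

function planHarmonization(
  local: LocalDispensarySketch[],
  remote: DutchieDispensarySketch[],
): HarmonizationPlanSketch {
  // Index Dutchie records by platform ID (matching is by ID, not slug).
  const remoteById = new Map<string, DutchieDispensarySketch>();
  for (const r of remote) remoteById.set(r.platformId, r);

  const matched = new Set<string>();
  const plan: HarmonizationPlanSketch = { toUpdate: [], toCreate: [], toDisable: [] };

  for (const row of local) {
    const source = row.platformDispensaryId
      ? remoteById.get(row.platformDispensaryId)
      : undefined;
    if (source) {
      plan.toUpdate.push({ localId: row.id, source });
      matched.add(source.platformId);
    } else {
      plan.toDisable.push(row.id);
    }
  }

  // Anything Dutchie reports that no local row matched becomes a new record.
  for (const r of remote) {
    if (!matched.has(r.platformId)) plan.toCreate.push(r);
  }
  return plan;
}
```

Separating the plan from its execution makes the three outcomes easy to inspect (or dry-run) before any `UPDATE` or `INSERT` is issued.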

@@ -0,0 +1,56 @@
-- Migration 065: Slug verification and data source tracking
-- Adds columns to track when slug/menu data was verified and from what source
-- Add slug verification columns to dispensaries
ALTER TABLE dispensaries
ADD COLUMN IF NOT EXISTS slug_source VARCHAR(50),
ADD COLUMN IF NOT EXISTS slug_verified_at TIMESTAMPTZ,
ADD COLUMN IF NOT EXISTS slug_status VARCHAR(20) DEFAULT 'unverified',
ADD COLUMN IF NOT EXISTS menu_url_source VARCHAR(50),
ADD COLUMN IF NOT EXISTS menu_url_verified_at TIMESTAMPTZ,
ADD COLUMN IF NOT EXISTS platform_id_source VARCHAR(50),
ADD COLUMN IF NOT EXISTS platform_id_verified_at TIMESTAMPTZ,
ADD COLUMN IF NOT EXISTS country VARCHAR(2) DEFAULT 'US';
-- Add index for finding unverified stores
CREATE INDEX IF NOT EXISTS idx_dispensaries_slug_status
ON dispensaries(slug_status)
WHERE slug_status != 'verified';
-- Add index for country
CREATE INDEX IF NOT EXISTS idx_dispensaries_country
ON dispensaries(country);
-- Comment on columns
COMMENT ON COLUMN dispensaries.slug_source IS 'Source of slug data: dutchie_api, manual, azdhs, discovery, etc.';
COMMENT ON COLUMN dispensaries.slug_verified_at IS 'When the slug was last verified against the source';
COMMENT ON COLUMN dispensaries.slug_status IS 'Status: unverified, verified, invalid, changed';
COMMENT ON COLUMN dispensaries.menu_url_source IS 'Source of menu_url: dutchie_api, website_scrape, manual, etc.';
COMMENT ON COLUMN dispensaries.menu_url_verified_at IS 'When the menu_url was last verified';
COMMENT ON COLUMN dispensaries.platform_id_source IS 'Source of platform_dispensary_id: dutchie_api, graphql_resolution, etc.';
COMMENT ON COLUMN dispensaries.platform_id_verified_at IS 'When the platform_dispensary_id was last verified';
COMMENT ON COLUMN dispensaries.country IS 'ISO 2-letter country code: US, CA, etc.';
-- Update Green Pharms Mesa with verified Dutchie data
UPDATE dispensaries
SET
slug = 'green-pharms-mesa',
menu_url = 'https://dutchie.com/embedded-menu/green-pharms-mesa',
menu_type = 'dutchie',
platform_dispensary_id = '68dc47a2af90f2e653f8df30',
slug_source = 'dutchie_api',
slug_verified_at = NOW(),
slug_status = 'verified',
menu_url_source = 'dutchie_api',
menu_url_verified_at = NOW(),
platform_id_source = 'dutchie_api',
platform_id_verified_at = NOW(),
updated_at = NOW()
WHERE id = 232;
-- Mark all other AZ dispensaries as needing verification
UPDATE dispensaries
SET slug_status = 'unverified'
WHERE state = 'AZ'
AND id != 232
AND (slug_status IS NULL OR slug_status = 'unverified');

Binary file not shown.

@@ -1,3 +1,14 @@
/**
* CannaiQ Authentication Middleware
*
* AUTH METHODS (in order of priority):
* 1. IP-based: Localhost/trusted IPs get 'internal' role (full access, no token needed)
* 2. Token-based: Bearer token (JWT or API token)
*
* NO username/password auth in API. Use tokens only.
*
* Localhost bypass: curl from 127.0.0.1 gets automatic admin access.
*/
import { Request, Response, NextFunction } from 'express';
import jwt from 'jsonwebtoken';
import bcrypt from 'bcrypt';
@@ -5,6 +16,61 @@ import { pool } from '../db/pool';
const JWT_SECRET = process.env.JWT_SECRET || 'change_this_in_production';
// Trusted origins that bypass auth for internal/same-origin requests
const TRUSTED_ORIGINS = [
'https://cannaiq.co',
'https://www.cannaiq.co',
'https://findadispo.com',
'https://www.findadispo.com',
'https://findagram.co',
'https://www.findagram.co',
'http://localhost:3010',
'http://localhost:8080',
'http://localhost:5173',
];
// Trusted IPs for internal pod-to-pod communication
const TRUSTED_IPS = [
'127.0.0.1',
'::1',
'::ffff:127.0.0.1',
];
/**
* Check if request is from a trusted origin/IP
*/
function isTrustedRequest(req: Request): boolean {
// Check origin header
const origin = req.headers.origin;
if (origin && TRUSTED_ORIGINS.includes(origin)) {
return true;
}
// Check referer header (for same-origin requests without CORS)
const referer = req.headers.referer;
if (referer) {
for (const trusted of TRUSTED_ORIGINS) {
if (referer.startsWith(trusted)) {
return true;
}
}
}
// Check IP for internal requests (pod-to-pod, localhost)
const clientIp = req.ip || req.socket.remoteAddress || '';
if (TRUSTED_IPS.includes(clientIp)) {
return true;
}
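// Check for Kubernetes internal header (set by ingress/service mesh).
// Only compare when the secret is configured: if both the env var and the
// header are undefined, a bare `===` would trust every request.
const internalSecret = process.env.INTERNAL_REQUEST_SECRET;
const internalHeader = req.headers['x-internal-request'];
if (internalSecret && internalHeader === internalSecret) {
return true;
}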
return false;
}
export interface AuthUser {
id: number;
email: string;
@@ -61,6 +127,16 @@ export async function authenticateUser(email: string, password: string): Promise
}
export async function authMiddleware(req: AuthRequest, res: Response, next: NextFunction) {
// Allow trusted origins/IPs to bypass auth (internal services, same-origin)
if (isTrustedRequest(req)) {
req.user = {
id: 0,
email: 'internal@system',
role: 'internal'
};
return next();
}
const authHeader = req.headers.authorization;
if (!authHeader || !authHeader.startsWith('Bearer ')) {
@@ -135,12 +211,23 @@ export async function authMiddleware(req: AuthRequest, res: Response, next: Next
}
}
/**
* Require specific role(s) to access endpoint.
*
* NOTE: 'internal' role (localhost/trusted IPs) bypasses all role checks.
* This allows local development and internal services full access.
*/
export function requireRole(...roles: string[]) {
return (req: AuthRequest, res: Response, next: NextFunction) => {
if (!req.user) {
return res.status(401).json({ error: 'Not authenticated' });
}
// Internal role (localhost) bypasses role checks
if (req.user.role === 'internal') {
return next();
}
if (!roles.includes(req.user.role)) {
return res.status(403).json({ error: 'Insufficient permissions' });
}
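
The trusted-request check above can be exercised in isolation with a reduced request shape. The sketch below is an illustration under assumptions, not the middleware itself: `RequestLikeSketch`, `originOfSketch`, and `isTrustedSketch` are hypothetical names, and the trusted lists are shortened. It differs from the code above in two deliberate ways, both labelled in comments: the referer is compared by parsed origin instead of `startsWith` (a prefix test also accepts `https://cannaiq.co.evil.example/`), and the internal header is only honoured when a secret is actually configured. `Origin` and `Referer` are client-supplied headers, so this shows the comparison logic only.

```typescript
// Reduced request shape for illustration (not the Express Request type).
interface RequestLikeSketch {
  headers: Record<string, string | string[] | undefined>;
  ip?: string;
}

// Shortened lists; the real TRUSTED_ORIGINS / TRUSTED_IPS are longer.
const TRUSTED_ORIGINS_SKETCH = new Set(['https://cannaiq.co', 'http://localhost:5173']);
const TRUSTED_IPS_SKETCH = new Set(['127.0.0.1', '::1', '::ffff:127.0.0.1']);

// Parse a URL and return its origin, or null if it is not a valid URL.
function originOfSketch(value: string): string | null {
  try {
    return new URL(value).origin;
  } catch (e) {
    return null;
  }
}

function isTrustedSketch(req: RequestLikeSketch, internalSecret: string | undefined): boolean {
  const originHeader = req.headers['origin'];
  if (typeof originHeader === 'string' && TRUSTED_ORIGINS_SKETCH.has(originHeader)) {
    return true;
  }

  // Difference from the original: exact origin match, not a prefix match.
  const referer = req.headers['referer'];
  if (typeof referer === 'string') {
    const refOrigin = originOfSketch(referer);
    if (refOrigin !== null && TRUSTED_ORIGINS_SKETCH.has(refOrigin)) return true;
  }

  if (req.ip !== undefined && TRUSTED_IPS_SKETCH.has(req.ip)) return true;

  // Difference from the original: an unset secret never matches.
  const internalHeader = req.headers['x-internal-request'];
  return typeof internalSecret === 'string'
    && internalSecret.length > 0
    && internalHeader === internalSecret;
}
```

Passing the secret as a parameter rather than reading `process.env` inside the function keeps the check easy to test with different configurations.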

@@ -472,7 +472,8 @@ export class CanonicalHydrationService {
 }
 // Step 3: Create initial snapshots from current product state
-const snapshotsWritten = await this.createInitialSnapshots(dispensaryId, crawlRunId);
+// crawlRunId is guaranteed to be set at this point (either from existing run or insert)
+const snapshotsWritten = await this.createInitialSnapshots(dispensaryId, crawlRunId!);
 result.snapshotsWritten += snapshotsWritten;
 // Update crawl run with snapshot count
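
The `!` in `crawlRunId!` is a compile-time assertion only; TypeScript emits no runtime check for it. Where the guarantee in the comment should also hold at runtime, an explicit guard is one alternative. The sketch below assumes `crawlRunId` is a numeric ID that may be `null` or `undefined` (the actual type is not visible in this hunk), and `requireCrawlRunId` is a hypothetical helper.

```typescript
// Narrow a possibly-missing crawl run ID to a number, failing loudly otherwise.
function requireCrawlRunId(crawlRunId: number | null | undefined): number {
  // `== null` is true for both null and undefined, but not for 0.
  if (crawlRunId == null) {
    throw new Error('crawlRunId was not set before creating snapshots');
  }
  return crawlRunId;
}
```

A call site would then pass `requireCrawlRunId(crawlRunId)` instead of `crawlRunId!`, turning a silent `undefined` into an immediate, descriptive error.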

@@ -1,6 +1,7 @@
 #!/usr/bin/env node
 /**
 * CLI Entrypoint for CannaIQ Backend
+* @module cli
 *
 * Usage:
 * npx tsx src/cli.ts # Start API server
@@ -50,18 +51,14 @@ async function main() {
 showHelp();
 }
-if (args.includes('--worker')) {
-console.log('[CLI] Starting worker process...');
-const { startWorker } = await import('./dutchie-az/services/worker');
-await startWorker();
-} else {
 // Default: start API server
 console.log('[CLI] Starting API server...');
 await import('./index');
-}
 }
 main().catch((error) => {
 console.error('[CLI] Fatal error:', error);
 process.exit(1);
 });
+export {};

@@ -1,657 +0,0 @@
/**
* Base Dutchie Crawler Template
*
* This is the base template for all Dutchie store crawlers.
* Per-store crawlers extend this by overriding specific methods.
*
* Exports:
* - crawlProducts(dispensary, options) - Main crawl entry point
* - detectStructure(page) - Detect page structure for sandbox mode
* - extractProducts(document) - Extract product data
* - extractImages(document) - Extract product images
* - extractStock(document) - Extract stock status
* - extractPagination(document) - Extract pagination info
*/
import {
crawlDispensaryProducts as baseCrawlDispensaryProducts,
CrawlResult,
} from '../../dutchie-az/services/product-crawler';
import { Dispensary, CrawlerProfileOptions } from '../../dutchie-az/types';
// Re-export CrawlResult for convenience
export { CrawlResult };
// ============================================================
// TYPES
// ============================================================
/**
* Options passed to the per-store crawler
*/
export interface StoreCrawlOptions {
pricingType?: 'rec' | 'med';
useBothModes?: boolean;
downloadImages?: boolean;
trackStock?: boolean;
timeoutMs?: number;
config?: Record<string, any>;
}
/**
* Progress callback for reporting crawl progress
*/
export interface CrawlProgressCallback {
phase: 'fetching' | 'processing' | 'saving' | 'images' | 'complete';
current: number;
total: number;
message?: string;
}
/**
* Structure detection result for sandbox mode
*/
export interface StructureDetectionResult {
success: boolean;
menuType: 'dutchie' | 'treez' | 'jane' | 'unknown';
iframeUrl?: string;
graphqlEndpoint?: string;
dispensaryId?: string;
selectors: {
productContainer?: string;
productName?: string;
productPrice?: string;
productImage?: string;
productCategory?: string;
pagination?: string;
loadMore?: string;
};
pagination: {
type: 'scroll' | 'click' | 'graphql' | 'none';
hasMore?: boolean;
pageSize?: number;
};
errors: string[];
metadata: Record<string, any>;
}
/**
* Product extraction result
*/
export interface ExtractedProduct {
externalId: string;
name: string;
brand?: string;
category?: string;
subcategory?: string;
price?: number;
priceRec?: number;
priceMed?: number;
weight?: string;
thcContent?: string;
cbdContent?: string;
description?: string;
imageUrl?: string;
stockStatus?: 'in_stock' | 'out_of_stock' | 'low_stock' | 'unknown';
quantity?: number;
raw?: Record<string, any>;
}
/**
* Image extraction result
*/
export interface ExtractedImage {
productId: string;
imageUrl: string;
isPrimary: boolean;
position: number;
}
/**
* Stock extraction result
*/
export interface ExtractedStock {
productId: string;
status: 'in_stock' | 'out_of_stock' | 'low_stock' | 'unknown';
quantity?: number;
lastChecked: Date;
}
/**
* Pagination extraction result
*/
export interface ExtractedPagination {
hasNextPage: boolean;
currentPage?: number;
totalPages?: number;
totalProducts?: number;
nextCursor?: string;
loadMoreSelector?: string;
}
/**
* Hook points that per-store crawlers can override
*/
export interface DutchieCrawlerHooks {
/**
* Called before fetching products
* Can be used to set up custom headers, cookies, etc.
*/
beforeFetch?: (dispensary: Dispensary) => Promise<void>;
/**
* Called after fetching products, before processing
* Can be used to filter or transform raw products
*/
afterFetch?: (products: any[], dispensary: Dispensary) => Promise<any[]>;
/**
* Called after all processing is complete
* Can be used for cleanup or post-processing
*/
afterComplete?: (result: CrawlResult, dispensary: Dispensary) => Promise<void>;
/**
* Custom selector resolver for iframe detection
*/
resolveIframe?: (page: any) => Promise<string | null>;
/**
* Custom product container selector
*/
getProductContainerSelector?: () => string;
/**
* Custom product extraction from container element
*/
extractProductFromElement?: (element: any) => Promise<ExtractedProduct | null>;
}
/**
* Selectors configuration for per-store overrides
*/
export interface DutchieSelectors {
iframe?: string;
productContainer?: string;
productName?: string;
productPrice?: string;
productPriceRec?: string;
productPriceMed?: string;
productImage?: string;
productCategory?: string;
productBrand?: string;
productWeight?: string;
productThc?: string;
productCbd?: string;
productDescription?: string;
productStock?: string;
loadMore?: string;
pagination?: string;
}
// ============================================================
// DEFAULT SELECTORS
// ============================================================
export const DEFAULT_DUTCHIE_SELECTORS: DutchieSelectors = {
iframe: 'iframe[src*="dutchie.com"]',
productContainer: '[data-testid="product-card"], .product-card, [class*="ProductCard"]',
productName: '[data-testid="product-title"], .product-title, [class*="ProductTitle"]',
productPrice: '[data-testid="product-price"], .product-price, [class*="ProductPrice"]',
productImage: 'img[src*="dutchie"], img[src*="product"], .product-image img',
productCategory: '[data-testid="category-name"], .category-name',
productBrand: '[data-testid="brand-name"], .brand-name, [class*="BrandName"]',
loadMore: 'button[data-testid="load-more"], .load-more-button',
pagination: '.pagination, [class*="Pagination"]',
};
// ============================================================
// BASE CRAWLER CLASS
// ============================================================
/**
* BaseDutchieCrawler - Base class for all Dutchie store crawlers
*
* Per-store crawlers extend this class and override methods as needed.
* The default implementation delegates to the existing shared Dutchie logic.
*/
export class BaseDutchieCrawler {
protected dispensary: Dispensary;
protected options: StoreCrawlOptions;
protected hooks: DutchieCrawlerHooks;
protected selectors: DutchieSelectors;
constructor(
dispensary: Dispensary,
options: StoreCrawlOptions = {},
hooks: DutchieCrawlerHooks = {},
selectors: DutchieSelectors = {}
) {
this.dispensary = dispensary;
this.options = {
pricingType: 'rec',
useBothModes: true,
downloadImages: true,
trackStock: true,
timeoutMs: 30000,
...options,
};
this.hooks = hooks;
this.selectors = { ...DEFAULT_DUTCHIE_SELECTORS, ...selectors };
}
/**
* Main entry point - crawl products for this dispensary
* Override this in per-store crawlers to customize behavior
*/
async crawlProducts(): Promise<CrawlResult> {
// Call beforeFetch hook if defined
if (this.hooks.beforeFetch) {
await this.hooks.beforeFetch(this.dispensary);
}
// Use the existing shared Dutchie crawl logic
const result = await baseCrawlDispensaryProducts(
this.dispensary,
this.options.pricingType || 'rec',
{
useBothModes: this.options.useBothModes,
downloadImages: this.options.downloadImages,
}
);
// Call afterComplete hook if defined
if (this.hooks.afterComplete) {
await this.hooks.afterComplete(result, this.dispensary);
}
return result;
}
/**
* Detect page structure for sandbox discovery mode
* Override in per-store crawlers if needed
*
* @param page - Puppeteer page object or HTML string
* @returns Structure detection result
*/
async detectStructure(page: any): Promise<StructureDetectionResult> {
const result: StructureDetectionResult = {
success: false,
menuType: 'unknown',
selectors: {},
pagination: { type: 'none' },
errors: [],
metadata: {},
};
try {
// Default implementation: check for Dutchie iframe
if (typeof page === 'string') {
// HTML string mode
if (page.includes('dutchie.com')) {
result.menuType = 'dutchie';
result.success = true;
}
} else if (page && typeof page.evaluate === 'function') {
// Puppeteer page mode
const detection = await page.evaluate((selectorConfig: DutchieSelectors) => {
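// querySelector('') throws, so only query when a selector is configured
const iframe = selectorConfig.iframe ? (document.querySelector(selectorConfig.iframe) as HTMLIFrameElement | null) : null;
const iframeUrl = iframe?.src || null;
// Check for product containers
const containers = selectorConfig.productContainer ? document.querySelectorAll(selectorConfig.productContainer) : [];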
return {
hasIframe: !!iframe,
iframeUrl,
productCount: containers.length,
isDutchie: !!iframeUrl?.includes('dutchie.com'),
};
}, this.selectors);
if (detection.isDutchie) {
result.menuType = 'dutchie';
result.iframeUrl = detection.iframeUrl;
result.success = true;
}
result.metadata = detection;
}
// Set default selectors for Dutchie
if (result.menuType === 'dutchie') {
result.selectors = {
productContainer: this.selectors.productContainer,
productName: this.selectors.productName,
productPrice: this.selectors.productPrice,
productImage: this.selectors.productImage,
productCategory: this.selectors.productCategory,
};
result.pagination = { type: 'graphql' };
}
} catch (error: any) {
result.errors.push(`Detection error: ${error.message}`);
}
return result;
}
/**
* Extract products from page/document
* Override in per-store crawlers for custom extraction
*
* @param document - Raw products array (e.g. from GraphQL) or Puppeteer page; any other input yields []
* @returns Array of extracted products
*/
async extractProducts(document: any): Promise<ExtractedProduct[]> {
// Default implementation: assume document is already an array of products
// from the GraphQL response
if (Array.isArray(document)) {
return document.map((product) => this.mapRawProduct(product));
}
// If document is a Puppeteer page, extract from DOM
if (document && typeof document.evaluate === 'function') {
return this.extractProductsFromPage(document);
}
return [];
}
/**
* Extract products from Puppeteer page
* Override for custom DOM extraction
*/
protected async extractProductsFromPage(page: any): Promise<ExtractedProduct[]> {
const products = await page.evaluate((selectors: DutchieSelectors) => {
// querySelector('') throws a SyntaxError, so treat an unset selector as "no match"
const pick = (root: ParentNode, sel?: string) => (sel ? root.querySelector(sel) : null);
if (!selectors.productContainer) return [];
const containers = document.querySelectorAll(selectors.productContainer);
return Array.from(containers).map((container) => {
const nameEl = pick(container, selectors.productName);
const priceEl = pick(container, selectors.productPrice);
const imageEl = pick(container, selectors.productImage) as HTMLImageElement | null;
const brandEl = pick(container, selectors.productBrand);
return {
name: nameEl?.textContent?.trim() || '',
price: priceEl?.textContent?.trim() || '',
imageUrl: imageEl?.src || '',
brand: brandEl?.textContent?.trim() || '',
};
});
}, this.selectors);
return products.map((p: any, i: number) => ({
externalId: `dom-product-${i}`,
name: p.name,
brand: p.brand,
price: this.parsePrice(p.price),
imageUrl: p.imageUrl,
stockStatus: 'unknown' as const,
}));
}
/**
* Map raw product from GraphQL to ExtractedProduct
* Override for custom mapping
*/
protected mapRawProduct(raw: any): ExtractedProduct {
return {
externalId: raw.id || raw._id || raw.externalId,
name: raw.name || raw.Name,
brand: raw.brand?.name || raw.brandName || raw.brand,
category: raw.type || raw.category || raw.Category,
subcategory: raw.subcategory || raw.Subcategory,
price: raw.recPrice || raw.price || raw.Price,
priceRec: raw.recPrice || raw.Prices?.rec,
priceMed: raw.medPrice || raw.Prices?.med,
weight: raw.weight || raw.Weight,
thcContent: raw.potencyThc?.formatted || raw.THCContent?.formatted,
cbdContent: raw.potencyCbd?.formatted || raw.CBDContent?.formatted,
description: raw.description || raw.Description,
imageUrl: raw.image || raw.Image,
stockStatus: this.mapStockStatus(raw),
quantity: raw.quantity ?? raw.Quantity, // ?? keeps a legitimate quantity of 0
raw,
};
}
/**
* Map raw stock status to standardized value
*/
protected mapStockStatus(raw: any): 'in_stock' | 'out_of_stock' | 'low_stock' | 'unknown' {
const status = raw.Status || raw.status || raw.stockStatus;
if (status === 'Active' || status === 'active' || status === 'in_stock') {
return 'in_stock';
}
if (status === 'Inactive' || status === 'inactive' || status === 'out_of_stock') {
return 'out_of_stock';
}
if (status === 'low_stock') {
return 'low_stock';
}
return 'unknown';
}
/**
* Parse price string to number
*/
protected parsePrice(priceStr: string): number | undefined {
if (!priceStr) return undefined;
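// Parse the first numeric token so a range like "$10.00 - $20.00" yields 10, not 10.002
const match = priceStr.replace(/,/g, '').match(/\d*\.?\d+/);
const num = match ? parseFloat(match[0]) : NaN;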
return isNaN(num) ? undefined : num;
}
/**
* Extract images from document
* Override for custom image extraction
*
* @param document - Products array or Puppeteer page; any other input yields []
* @returns Array of extracted images
*/
async extractImages(document: any): Promise<ExtractedImage[]> {
if (Array.isArray(document)) {
return document
.filter((p) => p.image || p.Image || p.imageUrl)
.map((p, i) => ({
productId: p.id || p._id || `product-${i}`,
imageUrl: p.image || p.Image || p.imageUrl,
isPrimary: true,
position: 0,
}));
}
// Puppeteer page extraction
if (document && typeof document.evaluate === 'function') {
return this.extractImagesFromPage(document);
}
return [];
}
/**
* Extract images from Puppeteer page
*/
protected async extractImagesFromPage(page: any): Promise<ExtractedImage[]> {
const images = await page.evaluate((selector: string) => {
const imgs = document.querySelectorAll(selector);
return Array.from(imgs).map((img, i) => ({
src: (img as HTMLImageElement).src,
position: i,
}));
}, this.selectors.productImage || 'img');
return images.map((img: any, i: number) => ({
productId: `dom-product-${i}`,
imageUrl: img.src,
isPrimary: i === 0,
position: img.position,
}));
}
/**
* Extract stock information from document
* Override for custom stock extraction
*
* @param document - Products array; any other input (including a Puppeteer page) yields []
* @returns Array of extracted stock statuses
*/
async extractStock(document: any): Promise<ExtractedStock[]> {
if (Array.isArray(document)) {
return document.map((p) => ({
productId: p.id || p._id || p.externalId,
status: this.mapStockStatus(p),
quantity: p.quantity ?? p.Quantity, // ?? keeps a legitimate quantity of 0
lastChecked: new Date(),
}));
}
return [];
}
/**
* Extract pagination information from document
* Override for custom pagination handling
*
* @param document - GraphQL response carrying a pageInfo object; any other input reports no next page
* @returns Pagination info
*/
async extractPagination(document: any): Promise<ExtractedPagination> {
// Default: check for page info in GraphQL response
if (document && document.pageInfo) {
return {
hasNextPage: document.pageInfo.hasNextPage || false,
currentPage: document.pageInfo.currentPage,
totalPages: document.pageInfo.totalPages,
totalProducts: document.pageInfo.totalCount || document.totalCount,
nextCursor: document.pageInfo.endCursor,
};
}
// Default: no pagination
return {
hasNextPage: false,
};
}
/**
* Get the cName (Dutchie slug) for this dispensary
* Override to customize cName extraction
*/
getCName(): string {
if (this.dispensary.menuUrl) {
try {
const url = new URL(this.dispensary.menuUrl);
const segments = url.pathname.split('/').filter(Boolean);
if (segments.length >= 2) {
return segments[segments.length - 1];
}
} catch {
// Fall through to default
}
}
return this.dispensary.slug || '';
}
/**
* Get custom headers for API requests
* Override for store-specific headers
*/
getCustomHeaders(): Record<string, string> {
const cName = this.getCName();
return {
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
Origin: 'https://dutchie.com',
Referer: `https://dutchie.com/embedded-menu/${cName}`,
};
}
}
// ============================================================
// FACTORY FUNCTION
// ============================================================
/**
* Create a base Dutchie crawler instance
* This is the default export used when no per-store override exists
*/
export function createCrawler(
dispensary: Dispensary,
options: StoreCrawlOptions = {},
hooks: DutchieCrawlerHooks = {},
selectors: DutchieSelectors = {}
): BaseDutchieCrawler {
return new BaseDutchieCrawler(dispensary, options, hooks, selectors);
}
// ============================================================
// STANDALONE FUNCTIONS (required exports for orchestrator)
// ============================================================
/**
* Crawl products using the base Dutchie logic
* Per-store files can call this or override it completely
*/
export async function crawlProducts(
dispensary: Dispensary,
options: StoreCrawlOptions = {}
): Promise<CrawlResult> {
const crawler = createCrawler(dispensary, options);
return crawler.crawlProducts();
}
/**
* Detect structure using the base Dutchie logic
*/
export async function detectStructure(
page: any,
dispensary?: Dispensary
): Promise<StructureDetectionResult> {
const crawler = createCrawler(dispensary || ({} as Dispensary));
return crawler.detectStructure(page);
}
/**
* Extract products using the base Dutchie logic
*/
export async function extractProducts(
document: any,
dispensary?: Dispensary
): Promise<ExtractedProduct[]> {
const crawler = createCrawler(dispensary || ({} as Dispensary));
return crawler.extractProducts(document);
}
/**
* Extract images using the base Dutchie logic
*/
export async function extractImages(
document: any,
dispensary?: Dispensary
): Promise<ExtractedImage[]> {
const crawler = createCrawler(dispensary || ({} as Dispensary));
return crawler.extractImages(document);
}
/**
* Extract stock using the base Dutchie logic
*/
export async function extractStock(
document: any,
dispensary?: Dispensary
): Promise<ExtractedStock[]> {
const crawler = createCrawler(dispensary || ({} as Dispensary));
return crawler.extractStock(document);
}
/**
* Extract pagination using the base Dutchie logic
*/
export async function extractPagination(
document: any,
dispensary?: Dispensary
): Promise<ExtractedPagination> {
const crawler = createCrawler(dispensary || ({} as Dispensary));
return crawler.extractPagination(document);
}
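A minimal sketch of the per-store override pattern described in the class comment, using the default `getCName()` rule as the example. `DispensaryLike`, `SketchBaseCrawler`, and `SketchStoreCrawler` are hypothetical stand-ins so the snippet runs on its own; they are not the real `Dispensary` type or `BaseDutchieCrawler`, and the menu URL is made up.

```typescript
// Hypothetical stand-in for the Dispensary fields that getCName() reads.
interface DispensaryLike {
  menuUrl?: string;
  slug?: string;
}

// Simplified stand-in for BaseDutchieCrawler; only the cName logic is reproduced.
class SketchBaseCrawler {
  protected dispensary: DispensaryLike;

  constructor(dispensary: DispensaryLike) {
    this.dispensary = dispensary;
  }

  // Same rule as the default: last path segment of menuUrl, else the slug.
  getCName(): string {
    if (this.dispensary.menuUrl) {
      try {
        const segments = new URL(this.dispensary.menuUrl).pathname.split('/').filter(Boolean);
        if (segments.length >= 2) {
          return segments[segments.length - 1];
        }
      } catch {
        // Invalid URL: fall through to the slug
      }
    }
    return this.dispensary.slug || '';
  }

  getReferer(): string {
    return `https://dutchie.com/embedded-menu/${this.getCName()}`;
  }
}

// Hypothetical per-store crawler: this store's menu URL ends in a section
// segment, so it prefers the slug over the last path segment.
class SketchStoreCrawler extends SketchBaseCrawler {
  getCName(): string {
    return this.dispensary.slug || super.getCName();
  }
}

const exampleDispensary: DispensaryLike = {
  menuUrl: 'https://dutchie.com/dispensary/example-store/products',
  slug: 'example-store',
};
const baseReferer = new SketchBaseCrawler(exampleDispensary).getReferer();
const storeReferer = new SketchStoreCrawler(exampleDispensary).getReferer();
```

Because the default takes the last path segment, a menu URL with a trailing section would resolve to that section name; overriding one method in a subclass is enough to correct it for a single store.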

@@ -1,330 +0,0 @@
/**
* Base Jane Crawler Template (PLACEHOLDER)
*
* This is the base template for all Jane (iheartjane) store crawlers.
* Per-store crawlers extend this by overriding specific methods.
*
* TODO: Implement Jane-specific crawling logic (Algolia-based)
*/
import { Dispensary } from '../../dutchie-az/types';
import {
StoreCrawlOptions,
CrawlResult,
StructureDetectionResult,
ExtractedProduct,
ExtractedImage,
ExtractedStock,
ExtractedPagination,
} from './base-dutchie';
// Re-export types
export type {
StoreCrawlOptions,
CrawlResult,
StructureDetectionResult,
ExtractedProduct,
ExtractedImage,
ExtractedStock,
ExtractedPagination,
};
// ============================================================
// JANE-SPECIFIC TYPES
// ============================================================
export interface JaneConfig {
algoliaAppId?: string;
algoliaApiKey?: string;
algoliaIndex?: string;
storeId?: string;
}
export interface JaneSelectors {
productContainer?: string;
productName?: string;
productPrice?: string;
productImage?: string;
productCategory?: string;
productBrand?: string;
pagination?: string;
loadMore?: string;
}
export const DEFAULT_JANE_SELECTORS: JaneSelectors = {
productContainer: '[data-testid="product-card"], .product-card',
productName: '[data-testid="product-name"], .product-name',
productPrice: '[data-testid="product-price"], .product-price',
productImage: '.product-image img, [data-testid="product-image"] img',
productCategory: '.product-category',
productBrand: '.product-brand, [data-testid="brand-name"]',
loadMore: '[data-testid="load-more"], .load-more-btn',
};
// ============================================================
// BASE JANE CRAWLER CLASS
// ============================================================
export class BaseJaneCrawler {
protected dispensary: Dispensary;
protected options: StoreCrawlOptions;
protected selectors: JaneSelectors;
protected janeConfig: JaneConfig;
constructor(
dispensary: Dispensary,
options: StoreCrawlOptions = {},
selectors: JaneSelectors = {},
janeConfig: JaneConfig = {}
) {
this.dispensary = dispensary;
this.options = {
pricingType: 'rec',
useBothModes: false,
downloadImages: true,
trackStock: true,
timeoutMs: 30000,
...options,
};
this.selectors = { ...DEFAULT_JANE_SELECTORS, ...selectors };
this.janeConfig = janeConfig;
}
/**
* Main entry point - crawl products for this dispensary
* TODO: Implement Jane/Algolia-specific crawling
*/
async crawlProducts(): Promise<CrawlResult> {
const startTime = Date.now();
console.warn(`[BaseJaneCrawler] Jane crawling not yet implemented for ${this.dispensary.name}`);
return {
success: false,
dispensaryId: this.dispensary.id || 0,
productsFound: 0,
productsFetched: 0,
productsUpserted: 0,
snapshotsCreated: 0,
imagesDownloaded: 0,
errorMessage: 'Jane crawler not yet implemented',
durationMs: Date.now() - startTime,
};
}
/**
* Detect page structure for sandbox discovery mode
* Jane uses Algolia, so we look for Algolia config
*/
async detectStructure(page: any): Promise<StructureDetectionResult> {
const result: StructureDetectionResult = {
success: false,
menuType: 'unknown',
selectors: {},
pagination: { type: 'none' },
errors: [],
metadata: {},
};
try {
if (page && typeof page.evaluate === 'function') {
// Look for Jane/Algolia indicators
const detection = await page.evaluate(() => {
// Check for iheartjane in page
const html = document.documentElement.innerHTML;
const hasJane = html.includes('iheartjane') || html.includes('jane-menu');
// Look for Algolia config
const scripts = Array.from(document.querySelectorAll('script'));
let algoliaConfig: any = null;
for (const script of scripts) {
const content = script.textContent || '';
if (content.includes('algolia') || content.includes('ALGOLIA')) {
// Try to extract config
const appIdMatch = content.match(/applicationId['":\s]+['"]([^'"]+)['"]/);
const apiKeyMatch = content.match(/apiKey['":\s]+['"]([^'"]+)['"]/);
if (appIdMatch && apiKeyMatch) {
algoliaConfig = {
appId: appIdMatch[1],
apiKey: apiKeyMatch[1],
};
}
}
}
return {
hasJane,
algoliaConfig,
};
});
if (detection.hasJane) {
result.menuType = 'jane';
result.success = true;
result.metadata = detection;
if (detection.algoliaConfig) {
result.metadata.algoliaAppId = detection.algoliaConfig.appId;
result.metadata.algoliaApiKey = detection.algoliaConfig.apiKey;
}
}
}
} catch (error: any) {
result.errors.push(`Detection error: ${error.message}`);
}
return result;
}
/**
* Extract products from Algolia response or page
*/
async extractProducts(document: any): Promise<ExtractedProduct[]> {
// If document is Algolia hits array
if (Array.isArray(document)) {
return document.map((hit) => this.mapAlgoliaHit(hit));
}
console.warn('[BaseJaneCrawler] extractProducts not yet fully implemented');
return [];
}
/**
* Map Algolia hit to ExtractedProduct
*/
protected mapAlgoliaHit(hit: any): ExtractedProduct {
return {
externalId: hit.objectID || hit.id || hit.product_id,
name: hit.name || hit.product_name,
brand: hit.brand || hit.brand_name,
category: hit.category || hit.kind,
subcategory: hit.subcategory,
price: hit.price || hit.bucket_price,
priceRec: hit.prices?.rec || hit.price_rec,
priceMed: hit.prices?.med || hit.price_med,
weight: hit.weight || hit.amount,
thcContent: hit.percent_thc ? `${hit.percent_thc}%` : undefined,
cbdContent: hit.percent_cbd ? `${hit.percent_cbd}%` : undefined,
description: hit.description,
imageUrl: hit.image_url || hit.product_image_url,
stockStatus: hit.available ? 'in_stock' : 'out_of_stock',
quantity: hit.quantity_available,
raw: hit,
};
}
/**
* Extract images from document
*/
async extractImages(document: any): Promise<ExtractedImage[]> {
if (Array.isArray(document)) {
return document
.filter((hit) => hit.image_url || hit.product_image_url)
.map((hit, i) => ({
productId: hit.objectID || hit.id || `jane-product-${i}`,
imageUrl: hit.image_url || hit.product_image_url,
isPrimary: true,
position: 0,
}));
}
return [];
}
/**
* Extract stock information from document
*/
async extractStock(document: any): Promise<ExtractedStock[]> {
if (Array.isArray(document)) {
return document.map((hit) => ({
productId: hit.objectID || hit.id,
status: hit.available ? 'in_stock' as const : 'out_of_stock' as const,
quantity: hit.quantity_available,
lastChecked: new Date(),
}));
}
return [];
}
/**
* Extract pagination information
* Algolia search responses are page-based (zero-based page index, nbPages total)
*/
async extractPagination(document: any): Promise<ExtractedPagination> {
if (document && typeof document === 'object' && !Array.isArray(document)) {
return {
hasNextPage: document.page < document.nbPages - 1,
currentPage: document.page,
totalPages: document.nbPages,
totalProducts: document.nbHits,
};
}
return { hasNextPage: false };
}
}
// ============================================================
// FACTORY FUNCTION
// ============================================================
export function createCrawler(
dispensary: Dispensary,
options: StoreCrawlOptions = {},
selectors: JaneSelectors = {},
janeConfig: JaneConfig = {}
): BaseJaneCrawler {
return new BaseJaneCrawler(dispensary, options, selectors, janeConfig);
}
// ============================================================
// STANDALONE FUNCTIONS
// ============================================================
export async function crawlProducts(
dispensary: Dispensary,
options: StoreCrawlOptions = {}
): Promise<CrawlResult> {
const crawler = createCrawler(dispensary, options);
return crawler.crawlProducts();
}
export async function detectStructure(
page: any,
dispensary?: Dispensary
): Promise<StructureDetectionResult> {
const crawler = createCrawler(dispensary || ({} as Dispensary));
return crawler.detectStructure(page);
}
export async function extractProducts(
document: any,
dispensary?: Dispensary
): Promise<ExtractedProduct[]> {
const crawler = createCrawler(dispensary || ({} as Dispensary));
return crawler.extractProducts(document);
}
export async function extractImages(
document: any,
dispensary?: Dispensary
): Promise<ExtractedImage[]> {
const crawler = createCrawler(dispensary || ({} as Dispensary));
return crawler.extractImages(document);
}
export async function extractStock(
document: any,
dispensary?: Dispensary
): Promise<ExtractedStock[]> {
const crawler = createCrawler(dispensary || ({} as Dispensary));
return crawler.extractStock(document);
}
export async function extractPagination(
document: any,
dispensary?: Dispensary
): Promise<ExtractedPagination> {
const crawler = createCrawler(dispensary || ({} as Dispensary));
return crawler.extractPagination(document);
}
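`extractPagination()` above derives `hasNextPage` from the `page` and `nbPages` fields of the response. A minimal sketch of a crawl loop driven by that rule follows. `fakeSearch` is a hypothetical in-memory stand-in for a real index query, and `SearchPage` only models the fields the loop reads.

```typescript
// Pagination fields the sketch reads from an Algolia-style search response.
interface SearchPage<T> {
  hits: T[];
  page: number;
  nbPages: number;
  nbHits: number;
}

// Same rule as extractPagination(): pages are zero-based, so the last page is nbPages - 1.
function hasNextPage(res: { page: number; nbPages: number }): boolean {
  return res.page < res.nbPages - 1;
}

// Hypothetical stand-in for a search call; a real crawler would query the index over HTTP.
function fakeSearch(all: string[], page: number, hitsPerPage: number): SearchPage<string> {
  const start = page * hitsPerPage;
  return {
    hits: all.slice(start, start + hitsPerPage),
    page,
    nbPages: Math.ceil(all.length / hitsPerPage),
    nbHits: all.length,
  };
}

// Request pages in order until the response reports no further page.
function collectAllHits(all: string[], hitsPerPage: number): string[] {
  const collected: string[] = [];
  let page = 0;
  while (true) {
    const res = fakeSearch(all, page, hitsPerPage);
    collected.push(...res.hits);
    if (!hasNextPage(res)) break;
    page += 1;
  }
  return collected;
}

const collectedHits = collectAllHits(['a', 'b', 'c', 'd', 'e'], 2);
```

An empty result set reports `nbPages` of 0, so the comparison is false on the first response and the loop stops after a single request.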

@@ -1,212 +0,0 @@
/**
* Base Treez Crawler Template (PLACEHOLDER)
*
* This is the base template for all Treez store crawlers.
* Per-store crawlers extend this by overriding specific methods.
*
* TODO: Implement Treez-specific crawling logic
*/
import { Dispensary } from '../../dutchie-az/types';
import {
StoreCrawlOptions,
CrawlResult,
StructureDetectionResult,
ExtractedProduct,
ExtractedImage,
ExtractedStock,
ExtractedPagination,
} from './base-dutchie';
// Re-export types
export type {
StoreCrawlOptions,
CrawlResult,
StructureDetectionResult,
ExtractedProduct,
ExtractedImage,
ExtractedStock,
ExtractedPagination,
};
// ============================================================
// TREEZ-SPECIFIC TYPES
// ============================================================
export interface TreezSelectors {
productContainer?: string;
productName?: string;
productPrice?: string;
productImage?: string;
productCategory?: string;
productBrand?: string;
addToCart?: string;
pagination?: string;
}
export const DEFAULT_TREEZ_SELECTORS: TreezSelectors = {
productContainer: '.product-tile, [class*="ProductCard"]',
productName: '.product-name, [class*="ProductName"]',
productPrice: '.product-price, [class*="ProductPrice"]',
productImage: '.product-image img',
productCategory: '.product-category',
productBrand: '.product-brand',
addToCart: '.add-to-cart-btn',
pagination: '.pagination',
};
// ============================================================
// BASE TREEZ CRAWLER CLASS
// ============================================================
export class BaseTreezCrawler {
protected dispensary: Dispensary;
protected options: StoreCrawlOptions;
protected selectors: TreezSelectors;
constructor(
dispensary: Dispensary,
options: StoreCrawlOptions = {},
selectors: TreezSelectors = {}
) {
this.dispensary = dispensary;
this.options = {
pricingType: 'rec',
useBothModes: false,
downloadImages: true,
trackStock: true,
timeoutMs: 30000,
...options,
};
this.selectors = { ...DEFAULT_TREEZ_SELECTORS, ...selectors };
}
/**
* Main entry point - crawl products for this dispensary
* TODO: Implement Treez-specific crawling
*/
async crawlProducts(): Promise<CrawlResult> {
const startTime = Date.now();
console.warn(`[BaseTreezCrawler] Treez crawling not yet implemented for ${this.dispensary.name}`);
return {
success: false,
dispensaryId: this.dispensary.id || 0,
productsFound: 0,
productsFetched: 0,
productsUpserted: 0,
snapshotsCreated: 0,
imagesDownloaded: 0,
errorMessage: 'Treez crawler not yet implemented',
durationMs: Date.now() - startTime,
};
}
/**
* Detect page structure for sandbox discovery mode
*/
async detectStructure(page: any): Promise<StructureDetectionResult> {
return {
success: false,
menuType: 'unknown',
selectors: {},
pagination: { type: 'none' },
errors: ['Treez structure detection not yet implemented'],
metadata: {},
};
}
/**
* Extract products from page/document
*/
async extractProducts(document: any): Promise<ExtractedProduct[]> {
console.warn('[BaseTreezCrawler] extractProducts not yet implemented');
return [];
}
/**
* Extract images from document
*/
async extractImages(document: any): Promise<ExtractedImage[]> {
console.warn('[BaseTreezCrawler] extractImages not yet implemented');
return [];
}
/**
* Extract stock information from document
*/
async extractStock(document: any): Promise<ExtractedStock[]> {
console.warn('[BaseTreezCrawler] extractStock not yet implemented');
return [];
}
/**
* Extract pagination information from document
*/
async extractPagination(document: any): Promise<ExtractedPagination> {
return { hasNextPage: false };
}
}
// ============================================================
// FACTORY FUNCTION
// ============================================================
export function createCrawler(
dispensary: Dispensary,
options: StoreCrawlOptions = {},
selectors: TreezSelectors = {}
): BaseTreezCrawler {
return new BaseTreezCrawler(dispensary, options, selectors);
}
// ============================================================
// STANDALONE FUNCTIONS
// ============================================================
export async function crawlProducts(
dispensary: Dispensary,
options: StoreCrawlOptions = {}
): Promise<CrawlResult> {
const crawler = createCrawler(dispensary, options);
return crawler.crawlProducts();
}
export async function detectStructure(
page: any,
dispensary?: Dispensary
): Promise<StructureDetectionResult> {
const crawler = createCrawler(dispensary || ({} as Dispensary));
return crawler.detectStructure(page);
}
export async function extractProducts(
document: any,
dispensary?: Dispensary
): Promise<ExtractedProduct[]> {
const crawler = createCrawler(dispensary || ({} as Dispensary));
return crawler.extractProducts(document);
}
export async function extractImages(
document: any,
dispensary?: Dispensary
): Promise<ExtractedImage[]> {
const crawler = createCrawler(dispensary || ({} as Dispensary));
return crawler.extractImages(document);
}
export async function extractStock(
document: any,
dispensary?: Dispensary
): Promise<ExtractedStock[]> {
const crawler = createCrawler(dispensary || ({} as Dispensary));
return crawler.extractStock(document);
}
export async function extractPagination(
document: any,
dispensary?: Dispensary
): Promise<ExtractedPagination> {
const crawler = createCrawler(dispensary || ({} as Dispensary));
return crawler.extractPagination(document);
}

@@ -1,27 +0,0 @@
/**
* Base Crawler Templates Index
*
* Exports all base crawler templates for easy importing.
*/
// Dutchie base (primary implementation)
export * from './base-dutchie';
// Treez base (placeholder)
export * as Treez from './base-treez';
// Jane base (placeholder)
export * as Jane from './base-jane';
// Re-export common types from dutchie for convenience
export type {
StoreCrawlOptions,
CrawlResult,
StructureDetectionResult,
ExtractedProduct,
ExtractedImage,
ExtractedStock,
ExtractedPagination,
DutchieCrawlerHooks,
DutchieSelectors,
} from './base-dutchie';

@@ -1,9 +0,0 @@
/**
* Base Dutchie Crawler Template (Re-export for backward compatibility)
*
* DEPRECATED: Import from '../base/base-dutchie' instead.
* This file re-exports everything from the new location for existing code.
*/
// Re-export everything from the new base location
export * from '../base/base-dutchie';

@@ -1,118 +0,0 @@
/**
* Trulieve Scottsdale - Per-Store Dutchie Crawler
*
* Store ID: 101
* Profile Key: trulieve-scottsdale
* Platform Dispensary ID: 5eaf489fa8a61801212577cc
*
* Phase 1: Identity implementation - no overrides, just uses base Dutchie logic.
* Future: Add store-specific selectors, timing, or custom logic as needed.
*/
import {
BaseDutchieCrawler,
StoreCrawlOptions,
CrawlResult,
DutchieSelectors,
crawlProducts as baseCrawlProducts,
} from '../../base/base-dutchie';
import { Dispensary } from '../../../dutchie-az/types';
// Re-export CrawlResult for the orchestrator
export type { CrawlResult };
// ============================================================
// STORE CONFIGURATION
// ============================================================
/**
* Store-specific configuration
* These can be used to customize crawler behavior for this store
*/
export const STORE_CONFIG = {
storeId: 101,
profileKey: 'trulieve-scottsdale',
name: 'Trulieve of Scottsdale Dispensary',
platformDispensaryId: '5eaf489fa8a61801212577cc',
// Store-specific overrides (none for Phase 1)
customOptions: {
// Example future overrides:
// pricingType: 'rec',
// useBothModes: true,
// customHeaders: {},
// maxRetries: 3,
},
};
// ============================================================
// STORE CRAWLER CLASS
// ============================================================
/**
* TrulieveScottsdaleCrawler - Per-store crawler for Trulieve Scottsdale
*
* Phase 1: Identity implementation - extends BaseDutchieCrawler with no overrides.
* Future phases can override methods like:
* - getCName() for custom slug handling
* - crawlProducts() for completely custom logic
* - Add hooks for pre/post processing
*/
export class TrulieveScottsdaleCrawler extends BaseDutchieCrawler {
constructor(dispensary: Dispensary, options: StoreCrawlOptions = {}) {
// Merge store-specific options with provided options
const mergedOptions: StoreCrawlOptions = {
...STORE_CONFIG.customOptions,
...options,
};
super(dispensary, mergedOptions);
}
// Phase 1: No overrides - use base implementation
// Future phases can add overrides here:
//
// async crawlProducts(): Promise<CrawlResult> {
// // Custom pre-processing
// // ...
// const result = await super.crawlProducts();
// // Custom post-processing
// // ...
// return result;
// }
}
// ============================================================
// EXPORTED CRAWL FUNCTION
// ============================================================
/**
* Main entry point for the orchestrator
*
* The orchestrator calls: mod.crawlProducts(dispensary, options)
* This function creates a TrulieveScottsdaleCrawler and runs it.
*/
export async function crawlProducts(
dispensary: Dispensary,
options: StoreCrawlOptions = {}
): Promise<CrawlResult> {
console.log(`[TrulieveScottsdale] Using per-store crawler for ${dispensary.name}`);
const crawler = new TrulieveScottsdaleCrawler(dispensary, options);
return crawler.crawlProducts();
}
// ============================================================
// FACTORY FUNCTION (alternative API)
// ============================================================
/**
* Create a crawler instance without running it
* Useful for testing or when you need to configure before running
*/
export function createCrawler(
dispensary: Dispensary,
options: StoreCrawlOptions = {}
): TrulieveScottsdaleCrawler {
return new TrulieveScottsdaleCrawler(dispensary, options);
}
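The per-store constructor spreads `STORE_CONFIG.customOptions` before the caller's options, and the base constructor then spreads its defaults before that merged object. A minimal sketch of the resulting precedence follows. `CrawlOptionsSketch` and `resolveOptions` are hypothetical names, the default values are copied from the base constructor, and the store overrides are invented for illustration.

```typescript
// Hypothetical stand-in for StoreCrawlOptions.
interface CrawlOptionsSketch {
  pricingType?: 'rec' | 'med'; // 'med' is an assumed second value; the source only shows 'rec'
  useBothModes?: boolean;
  downloadImages?: boolean;
  trackStock?: boolean;
  timeoutMs?: number;
}

// Defaults as set in the BaseDutchieCrawler constructor.
const BASE_DEFAULTS: CrawlOptionsSketch = {
  pricingType: 'rec',
  useBothModes: true,
  downloadImages: true,
  trackStock: true,
  timeoutMs: 30000,
};

function resolveOptions(
  storeOverrides: CrawlOptionsSketch,
  callerOptions: CrawlOptionsSketch
): CrawlOptionsSketch {
  // Step 1 (per-store constructor): caller options win over store overrides.
  const merged = { ...storeOverrides, ...callerOptions };
  // Step 2 (base constructor): the merged options win over base defaults.
  return { ...BASE_DEFAULTS, ...merged };
}

// Invented store overrides and caller options, for illustration only.
const resolvedOptions = resolveOptions(
  { useBothModes: false, timeoutMs: 60000 },
  { timeoutMs: 10000 }
);
```

The effective order is base defaults, then store overrides, then caller options. Note that object spread copies a key whose value is explicitly `undefined`, so passing `{ timeoutMs: undefined }` would erase the default rather than keep it.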

@@ -77,7 +77,9 @@ export function getPool(): Pool {
* This is a getter that lazily initializes on first access.
*/
export const pool = {
- query: (...args: Parameters<Pool['query']>) => getPool().query(...args),
+ query: (queryTextOrConfig: string | import('pg').QueryConfig, values?: any[]): Promise<import('pg').QueryResult<any>> => {
+ return getPool().query(queryTextOrConfig as any, values);
+ },
connect: () => getPool().connect(),
end: () => getPool().end(),
on: (event: 'error' | 'connect' | 'acquire' | 'remove' | 'release', listener: (...args: any[]) => void) => getPool().on(event as any, listener),
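A minimal sketch of the lazy-wrapper pattern this hunk touches: each method on the exported object forwards to a getter that creates the underlying pool on first use. `FakePool`, `getFakePool`, `poolCreations`, and `lazyPool` are hypothetical stand-ins so the snippet runs without `pg`; the real `getPool()` and `Pool` behave differently in detail.

```typescript
// Hypothetical stand-in for pg's Pool; query() just echoes its input.
interface FakePool {
  query(text: string, values?: unknown[]): string;
}

let createdCount = 0;
let instance: FakePool | null = null;

// Create the pool on first access and reuse it afterwards.
function getFakePool(): FakePool {
  if (!instance) {
    createdCount += 1;
    instance = {
      query: (text: string, values: unknown[] = []): string =>
        `${text} [${values.length} params]`,
    };
  }
  return instance;
}

// Exposed so callers can observe how many pools were created.
function poolCreations(): number {
  return createdCount;
}

// Wrapper with one explicit signature per method, so importing the module
// never opens a connection by itself.
const lazyPool = {
  query: (text: string, values?: unknown[]): string => getFakePool().query(text, values),
};
```

Spelling out a single `(text, values)` signature, as the new `query` does, sidesteps taking `Parameters<>` of an overloaded method, which resolves to only one of the overloads.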

@@ -26,13 +26,377 @@ import {
mapLocationRowToLocation,
} from './types';
import { DiscoveryCity } from './types';
import {
executeGraphQL,
fetchPage,
extractNextData,
GRAPHQL_HASHES,
setProxy,
} from '../platforms/dutchie/client';
import { getStateProxy, getRandomProxy } from '../utils/proxyManager';
puppeteer.use(StealthPlugin());
// ============================================================
// PROXY INITIALIZATION
// ============================================================
// Call initDiscoveryProxy() before any discovery operations to
// set up proxy if USE_PROXY=true environment variable is set.
// This is opt-in and does NOT break existing behavior.
// ============================================================
let proxyInitialized = false;
/**
* Initialize proxy for discovery operations
* Only runs if USE_PROXY=true is set in environment
* Safe to call multiple times - only initializes once
*
* @param stateCode - Optional state code for state-specific proxy (e.g., 'AZ', 'CA')
* @returns true if proxy was set, false if skipped or failed
*/
export async function initDiscoveryProxy(stateCode?: string): Promise<boolean> {
// Skip if already initialized
if (proxyInitialized) {
return true;
}
// Skip if USE_PROXY is not enabled
if (process.env.USE_PROXY !== 'true') {
console.log('[LocationDiscovery] Proxy disabled (USE_PROXY != true)');
return false;
}
try {
// Get proxy - prefer state-specific if state code provided
const proxyConfig = stateCode
? await getStateProxy(stateCode)
: await getRandomProxy();
if (!proxyConfig) {
console.warn('[LocationDiscovery] No proxy available, proceeding without proxy');
return false;
}
// Build proxy URL with auth if needed
let proxyUrl = proxyConfig.server;
if (proxyConfig.username && proxyConfig.password) {
const url = new URL(proxyConfig.server);
url.username = proxyConfig.username;
url.password = proxyConfig.password;
proxyUrl = url.toString();
}
// Set proxy on the Dutchie client
setProxy(proxyUrl);
proxyInitialized = true;
console.log(`[LocationDiscovery] Proxy initialized for ${stateCode || 'general'} discovery`);
return true;
} catch (error: any) {
console.error(`[LocationDiscovery] Failed to initialize proxy: ${error.message}`);
return false;
}
}
/**
* Reset proxy initialization flag (for testing or re-initialization)
*/
export function resetProxyInit(): void {
proxyInitialized = false;
setProxy(null);
}
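The credential handling in `initDiscoveryProxy` can be tried in isolation. The sketch below is a minimal, hypothetical helper: `buildProxyUrl` and the `ProxyConfig` shape are assumptions inferred from the fields used above, not part of the module. It illustrates that the standard `URL` setters percent-encode reserved characters in the username and password.

```typescript
// Assumed shape, inferred from the proxyConfig fields read in initDiscoveryProxy.
interface ProxyConfig {
  server: string;
  username?: string;
  password?: string;
}

// Hypothetical helper: returns the proxy URL, embedding credentials when both are present.
function buildProxyUrl(config: ProxyConfig): string {
  if (!config.username || !config.password) {
    // No credentials: hand back the server string untouched.
    return config.server;
  }
  const url = new URL(config.server);
  // The URL setters percent-encode characters such as '@' and ':'.
  url.username = config.username;
  url.password = config.password;
  return url.toString();
}
```

Note that, as in the function above, a proxy without credentials is passed through as-is, so any normalization `URL` would apply (such as a trailing slash) only happens on the authenticated path.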
const PLATFORM = 'dutchie';
// ============================================================
// GRAPHQL / API FETCHING
// CITY-BASED DISCOVERY (CANONICAL SOURCE OF TRUTH)
// ============================================================
// GraphQL with city+state filter is the SOURCE OF TRUTH for database data.
//
// Method:
// 1. Get city list from statesWithDispensaries (in __NEXT_DATA__)
// 2. Query stores per city using city + state GraphQL filter
// 3. This gives us complete, accurate dispensary data
//
// Geo-coordinate queries (nearLat/nearLng) are ONLY for showing search
// results to users (e.g., "stores within 20 miles of me").
// They are NOT a source of truth for establishing database records.
// ============================================================
/**
* State with dispensary cities from Dutchie's statesWithDispensaries data
*/
export interface StateWithCities {
name: string; // State code (e.g., "CA", "AZ")
country: string; // Country code (e.g., "US")
cities: string[]; // Array of city names
}
/**
* Fetch all states with their cities from Dutchie's __NEXT_DATA__
*
* This fetches a city page and extracts the statesWithDispensaries data
* which contains all states and their cities where Dutchie has dispensaries.
*/
export async function fetchStatesWithDispensaries(
options: { verbose?: boolean } = {}
): Promise<StateWithCities[]> {
const { verbose = false } = options;
// Initialize proxy if USE_PROXY=true
await initDiscoveryProxy();
console.log('[LocationDiscovery] Fetching statesWithDispensaries from Dutchie...');
// Fetch any city page to get the __NEXT_DATA__ with statesWithDispensaries
// Using a known city that's likely to exist
const result = await fetchPage('/dispensaries/az/phoenix', { maxRetries: 3 });
if (!result || result.status !== 200) {
console.error('[LocationDiscovery] Failed to fetch city page');
return [];
}
const nextData = extractNextData(result.html);
if (!nextData) {
console.error('[LocationDiscovery] No __NEXT_DATA__ found');
return [];
}
// Extract statesWithDispensaries from Apollo state
const apolloState = nextData.props?.pageProps?.initialApolloState;
if (!apolloState) {
console.error('[LocationDiscovery] No initialApolloState found');
return [];
}
// Find ROOT_QUERY.statesWithDispensaries
const rootQuery = apolloState['ROOT_QUERY'];
if (!rootQuery) {
console.error('[LocationDiscovery] No ROOT_QUERY found');
return [];
}
// The statesWithDispensaries is at ROOT_QUERY.statesWithDispensaries
const statesRefs = rootQuery.statesWithDispensaries;
if (!Array.isArray(statesRefs)) {
console.error('[LocationDiscovery] statesWithDispensaries not found or not an array');
return [];
}
// Resolve the references to actual state data
const states: StateWithCities[] = [];
for (const ref of statesRefs) {
// ref might be { __ref: "StateWithDispensaries:0" } or direct object
let stateData: any;
if (ref && ref.__ref) {
stateData = apolloState[ref.__ref];
} else {
stateData = ref;
}
if (stateData && stateData.name) {
// Parse cities JSON array if it's a string
let cities = stateData.cities;
if (typeof cities === 'string') {
try {
cities = JSON.parse(cities);
} catch {
cities = [];
}
}
states.push({
name: stateData.name,
country: stateData.country || 'US',
cities: Array.isArray(cities) ? cities : [],
});
}
}
if (verbose) {
console.log(`[LocationDiscovery] Found ${states.length} states`);
for (const state of states) {
console.log(` ${state.name}: ${state.cities.length} cities`);
}
}
console.log(`[LocationDiscovery] Loaded ${states.length} states with cities`);
return states;
}
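To make the reference resolution in `fetchStatesWithDispensaries` easier to follow, here is a small standalone sketch run against a made-up cache. The `sampleCache` contents and the names `resolveStateRefs`, `ApolloCache`, and `StateCities` are illustrative assumptions. Only the `{ __ref: ... }` pattern and the possibly JSON-encoded `cities` field come from the code above.

```typescript
// Hypothetical cache type; the real initialApolloState is untyped in the code above.
type ApolloCache = Record<string, any>;

interface StateCities {
  name: string;
  cities: string[];
}

function resolveStateRefs(cache: ApolloCache): StateCities[] {
  const refs = cache['ROOT_QUERY']?.statesWithDispensaries;
  if (!Array.isArray(refs)) return [];

  const states: StateCities[] = [];
  for (const ref of refs) {
    // Follow a { __ref } pointer into the cache, or use the inline object directly.
    const data = ref && ref.__ref ? cache[ref.__ref] : ref;
    if (!data || !data.name) continue;

    // cities may arrive as a JSON-encoded string rather than an array.
    let cities: unknown = data.cities;
    if (typeof cities === 'string') {
      try {
        cities = JSON.parse(cities);
      } catch {
        cities = [];
      }
    }
    states.push({
      name: data.name,
      cities: Array.isArray(cities) ? (cities as string[]) : [],
    });
  }
  return states;
}

// Made-up sample: one entry stored by reference, one stored inline with string-encoded cities.
const sampleCache: ApolloCache = {
  ROOT_QUERY: {
    statesWithDispensaries: [
      { __ref: 'StateWithDispensaries:0' },
      { name: 'CA', cities: '["Los Angeles","San Diego"]' },
    ],
  },
  'StateWithDispensaries:0': { name: 'AZ', cities: ['Phoenix', 'Tucson'] },
};
const resolvedStates = resolveStateRefs(sampleCache);
```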
/**
* Get cities for a specific state
*/
export async function getCitiesForState(
stateCode: string,
options: { verbose?: boolean } = {}
): Promise<string[]> {
const states = await fetchStatesWithDispensaries(options);
const state = states.find(s => s.name.toUpperCase() === stateCode.toUpperCase());
if (!state) {
console.warn(`[LocationDiscovery] No cities found for state: ${stateCode}`);
return [];
}
console.log(`[LocationDiscovery] Found ${state.cities.length} cities for ${stateCode}`);
return state.cities;
}
/**
* Fetch dispensaries for a specific city+state using GraphQL
*
* This is the CORRECT method for establishing database data:
* Uses city + state filter, NOT geo-coordinates.
*/
export async function fetchDispensariesByCityState(
city: string,
stateCode: string,
options: { verbose?: boolean; perPage?: number; maxPages?: number } = {}
): Promise<DutchieLocationResponse[]> {
const { verbose = false, perPage = 200, maxPages = 10 } = options;
// Initialize proxy if USE_PROXY=true (state-specific proxy preferred)
await initDiscoveryProxy(stateCode);
console.log(`[LocationDiscovery] Fetching dispensaries for ${city}, ${stateCode}...`);
const allDispensaries: any[] = [];
let page = 0;
let hasMore = true;
while (hasMore && page < maxPages) {
const variables = {
dispensaryFilter: {
activeOnly: true,
city: city,
state: stateCode,
},
page,
perPage,
};
try {
const result = await executeGraphQL(
'ConsumerDispensaries',
variables,
GRAPHQL_HASHES.ConsumerDispensaries,
{ cName: `${city.toLowerCase().replace(/\s+/g, '-')}-${stateCode.toLowerCase()}`, maxRetries: 2, retryOn403: true }
);
const dispensaries = result?.data?.filteredDispensaries || [];
if (verbose) {
console.log(`[LocationDiscovery] Page ${page}: ${dispensaries.length} dispensaries`);
}
if (dispensaries.length === 0) {
hasMore = false;
} else {
// Filter to ensure we only get dispensaries in the correct state
const stateFiltered = dispensaries.filter((d: any) =>
d.location?.state?.toUpperCase() === stateCode.toUpperCase()
);
allDispensaries.push(...stateFiltered);
if (dispensaries.length < perPage) {
hasMore = false;
} else {
page++;
}
}
} catch (error: any) {
console.error(`[LocationDiscovery] Error fetching page ${page}: ${error.message}`);
hasMore = false;
}
}
// Dedupe by ID
const uniqueMap = new Map<string, any>();
for (const d of allDispensaries) {
const id = d.id || d._id;
if (id && !uniqueMap.has(id)) {
uniqueMap.set(id, d);
}
}
const unique = Array.from(uniqueMap.values());
console.log(`[LocationDiscovery] Found ${unique.length} unique dispensaries in ${city}, ${stateCode}`);
return unique.map(d => normalizeLocationResponse(d));
}
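The paging and dedupe logic in `fetchDispensariesByCityState` reduces to a small pattern: keep requesting pages until one comes back short or the page cap is hit, then keep the first record per ID. The sketch below is a simplified, synchronous illustration with a stubbed fetcher. `collectPages` and `PageFetcher` are hypothetical names, and the real function is asynchronous and also filters by state.

```typescript
// Hypothetical fetcher signature: returns the items for one zero-based page.
type PageFetcher<T> = (page: number, perPage: number) => T[];

function collectPages<T extends { id: string }>(
  fetchOnePage: PageFetcher<T>,
  perPage: number,
  maxPages: number
): T[] {
  const seen = new Map<string, T>();
  for (let page = 0; page < maxPages; page++) {
    const items = fetchOnePage(page, perPage);
    for (const item of items) {
      // Keep the first occurrence of each ID.
      if (!seen.has(item.id)) seen.set(item.id, item);
    }
    // A short (or empty) page means there is nothing more to fetch.
    if (items.length < perPage) break;
  }
  return Array.from(seen.values());
}
```

The `maxPages` cap acts as a safety limit: if the upstream API keeps returning full pages, the loop still terminates.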
/**
* Fetch ALL dispensaries for a state by querying each city
*
* This is the canonical method for establishing state data:
* 1. Get city list from statesWithDispensaries
* 2. Query each city using city+state filter
* 3. Dedupe and return all dispensaries
*/
export async function fetchAllDispensariesForState(
stateCode: string,
options: { verbose?: boolean; progressCallback?: (city: string, count: number, total: number) => void } = {}
): Promise<{ dispensaries: DutchieLocationResponse[]; citiesQueried: number; citiesWithResults: number }> {
const { verbose = false, progressCallback } = options;
console.log(`[LocationDiscovery] Fetching all dispensaries for ${stateCode}...`);
// Step 1: Get city list
const cities = await getCitiesForState(stateCode, { verbose });
if (cities.length === 0) {
console.warn(`[LocationDiscovery] No cities found for ${stateCode}`);
return { dispensaries: [], citiesQueried: 0, citiesWithResults: 0 };
}
console.log(`[LocationDiscovery] Will query ${cities.length} cities for ${stateCode}`);
// Step 2: Query each city
const allDispensaries = new Map<string, DutchieLocationResponse>();
let citiesWithResults = 0;
for (let i = 0; i < cities.length; i++) {
const city = cities[i];
if (progressCallback) {
progressCallback(city, i + 1, cities.length);
}
try {
const dispensaries = await fetchDispensariesByCityState(city, stateCode, { verbose });
if (dispensaries.length > 0) {
citiesWithResults++;
for (const d of dispensaries) {
const id = d.id || d.slug;
if (id && !allDispensaries.has(id)) {
allDispensaries.set(id, d);
}
}
}
// Small delay between cities to avoid rate limiting
await new Promise(r => setTimeout(r, 300));
} catch (error: any) {
console.error(`[LocationDiscovery] Error querying ${city}: ${error.message}`);
}
}
const result = Array.from(allDispensaries.values());
console.log(`[LocationDiscovery] Total: ${result.length} unique dispensaries across ${citiesWithResults}/${cities.length} cities`);
return {
dispensaries: result,
citiesQueried: cities.length,
citiesWithResults,
};
}
// ============================================================
// GRAPHQL / API FETCHING (LEGACY - PUPPETEER-BASED)
// ============================================================
interface SessionCredentials {
@@ -91,16 +455,38 @@ async function closeSession(session: SessionCredentials): Promise<void> {
}
/**
* Fetch locations for a city using Dutchie's internal search API.
* Fetch locations for a city.
*
* PRIMARY METHOD: Uses city+state GraphQL filter (source of truth)
* FALLBACK: Legacy Puppeteer-based methods for edge cases
*/
export async function fetchLocationsForCity(
city: DiscoveryCity,
options: {
session?: SessionCredentials;
verbose?: boolean;
useLegacyMethods?: boolean;
} = {}
): Promise<DutchieLocationResponse[]> {
const { verbose = false } = options;
const { verbose = false, useLegacyMethods = false } = options;
console.log(`[LocationDiscovery] Fetching locations for ${city.cityName}, ${city.stateCode}...`);
// PRIMARY METHOD: City+State GraphQL query (SOURCE OF TRUTH)
if (city.cityName && city.stateCode) {
try {
const locations = await fetchDispensariesByCityState(city.cityName, city.stateCode, { verbose });
if (locations.length > 0) {
console.log(`[LocationDiscovery] Found ${locations.length} locations via GraphQL city+state`);
return locations;
}
} catch (error: any) {
console.warn(`[LocationDiscovery] GraphQL city+state failed: ${error.message}`);
}
}
// FALLBACK: Legacy Puppeteer-based methods (only if explicitly enabled)
if (useLegacyMethods) {
let session = options.session;
let shouldCloseSession = false;
@@ -110,38 +496,36 @@ export async function fetchLocationsForCity(
}
try {
console.log(`[LocationDiscovery] Fetching locations for ${city.cityName}, ${city.stateCode}...`);
// Try multiple approaches to get location data
// Approach 1: Extract from page __NEXT_DATA__ or similar
// Legacy Approach 1: Extract from page __NEXT_DATA__
const locations = await extractLocationsFromPage(session.page, verbose);
if (locations.length > 0) {
console.log(`[LocationDiscovery] Found ${locations.length} locations from page data`);
console.log(`[LocationDiscovery] Found ${locations.length} locations from page data (legacy)`);
return locations;
}
// Approach 2: Try the geo-based GraphQL query
// Legacy Approach 2: Try the geo-based GraphQL query
// NOTE: Geo queries are for SEARCH RESULTS only, not source of truth
const geoLocations = await fetchLocationsViaGraphQL(session, city, verbose);
if (geoLocations.length > 0) {
console.log(`[LocationDiscovery] Found ${geoLocations.length} locations from GraphQL`);
console.log(`[LocationDiscovery] Found ${geoLocations.length} locations from geo GraphQL (legacy)`);
return geoLocations;
}
// Approach 3: Scrape visible location cards
// Legacy Approach 3: Scrape visible location cards
const scrapedLocations = await scrapeLocationCards(session.page, verbose);
if (scrapedLocations.length > 0) {
console.log(`[LocationDiscovery] Found ${scrapedLocations.length} locations from scraping`);
console.log(`[LocationDiscovery] Found ${scrapedLocations.length} locations from scraping (legacy)`);
return scrapedLocations;
}
console.log(`[LocationDiscovery] No locations found for ${city.cityName}`);
return [];
} finally {
if (shouldCloseSession) {
await closeSession(session);
}
}
}
console.log(`[LocationDiscovery] No locations found for ${city.cityName}`);
return [];
}
/**
@@ -202,33 +586,52 @@ async function extractLocationsFromPage(
/**
* Fetch locations via GraphQL geo-based query.
*
* Uses ConsumerDispensaries with geo filtering:
* - dispensaryFilter.nearLat/nearLng for center point
* - dispensaryFilter.distance for radius in miles
* - Response at data.filteredDispensaries
*/
async function fetchLocationsViaGraphQL(
session: SessionCredentials,
city: DiscoveryCity,
verbose: boolean
): Promise<DutchieLocationResponse[]> {
// Use a known center point for the city or default to a central US location
const CITY_COORDS: Record<string, { lat: number; lng: number }> = {
'phoenix': { lat: 33.4484, lng: -112.074 },
'tucson': { lat: 32.2226, lng: -110.9747 },
'scottsdale': { lat: 33.4942, lng: -111.9261 },
'mesa': { lat: 33.4152, lng: -111.8315 },
'tempe': { lat: 33.4255, lng: -111.94 },
'flagstaff': { lat: 35.1983, lng: -111.6513 },
// Add more as needed
// City center coordinates with appropriate radius
const CITY_COORDS: Record<string, { lat: number; lng: number; radius: number }> = {
'phoenix': { lat: 33.4484, lng: -112.074, radius: 50 },
'tucson': { lat: 32.2226, lng: -110.9747, radius: 50 },
'scottsdale': { lat: 33.4942, lng: -111.9261, radius: 30 },
'mesa': { lat: 33.4152, lng: -111.8315, radius: 30 },
'tempe': { lat: 33.4255, lng: -111.94, radius: 30 },
'flagstaff': { lat: 35.1983, lng: -111.6513, radius: 50 },
};
const coords = CITY_COORDS[city.citySlug] || { lat: 33.4484, lng: -112.074 };
// State-wide coordinates for full coverage
const STATE_COORDS: Record<string, { lat: number; lng: number; radius: number }> = {
'AZ': { lat: 33.4484, lng: -112.074, radius: 200 },
'CA': { lat: 36.7783, lng: -119.4179, radius: 400 },
'CO': { lat: 39.5501, lng: -105.7821, radius: 200 },
'FL': { lat: 27.6648, lng: -81.5158, radius: 400 },
'MI': { lat: 44.3148, lng: -85.6024, radius: 250 },
'NV': { lat: 36.1699, lng: -115.1398, radius: 200 },
};
// Try city-specific coords first, then state-wide, then default
const coords = CITY_COORDS[city.citySlug]
|| (city.stateCode && STATE_COORDS[city.stateCode])
|| { lat: 33.4484, lng: -112.074, radius: 200 };
// Correct GraphQL variables for ConsumerDispensaries
const variables = {
dispensariesFilter: {
latitude: coords.lat,
longitude: coords.lng,
distance: 50, // miles
state: city.stateCode,
city: city.cityName,
dispensaryFilter: {
activeOnly: true,
nearLat: coords.lat,
nearLng: coords.lng,
distance: coords.radius,
},
page: 0,
perPage: 200,
};
const hash = '0a5bfa6ca1d64ae47bcccb7c8077c87147cbc4e6982c17ceec97a2a4948b311b';
@@ -263,8 +666,19 @@ async function fetchLocationsViaGraphQL(
return [];
}
const dispensaries = response.data?.data?.consumerDispensaries || [];
return dispensaries.map((d: any) => normalizeLocationResponse(d));
// Response is at data.filteredDispensaries
const dispensaries = response.data?.data?.filteredDispensaries || [];
// Filter to specific state if needed (radius may include neighboring states)
const filtered = city.stateCode
? dispensaries.filter((d: any) => d.location?.state?.toUpperCase() === city.stateCode?.toUpperCase())
: dispensaries;
if (verbose) {
console.log(`[LocationDiscovery] GraphQL returned ${dispensaries.length} total, ${filtered.length} in ${city.stateCode || 'all states'}`);
}
return filtered.map((d: any) => normalizeLocationResponse(d));
} catch (error: any) {
if (verbose) {
console.log(`[LocationDiscovery] GraphQL error: ${error.message}`);
@@ -373,13 +787,20 @@ function normalizeLocationResponse(raw: any): DutchieLocationResponse {
/**
* Upsert a location into dutchie_discovery_locations.
* REQUIRES a valid platform ID (MongoDB ObjectId) - will skip records without one.
*/
export async function upsertLocation(
pool: Pool,
location: DutchieLocationResponse,
cityId: number | null
): Promise<{ id: number; isNew: boolean }> {
const platformLocationId = location.id || location.slug;
): Promise<{ id: number; isNew: boolean } | null> {
// REQUIRE actual platform ID - NO fallback to slug
const platformLocationId = location.id;
if (!platformLocationId) {
console.warn(`[LocationDiscovery] Skipping location without platform ID: ${location.name} (${location.slug})`);
return null;
}
const menuUrl = location.menuUrl || `https://dutchie.com/dispensary/${location.slug}`;
const result = await pool.query(
@@ -642,6 +1063,12 @@ export async function discoverLocationsForCity(
const result = await upsertLocation(pool, location, city.id);
// Skip locations without valid platform ID
if (!result) {
errors.push(`Location ${location.slug}: No valid platform ID - skipped`);
continue;
}
if (result.isNew) {
newCount++;
} else {


@@ -1,199 +0,0 @@
# Dutchie AZ Pipeline
## Overview
The Dutchie AZ pipeline is the **only** authorized way to crawl Dutchie dispensary menus. It uses Dutchie's GraphQL API directly (no DOM scraping) and writes to an isolated database with a proper snapshot model.
## Key Principles
1. **GraphQL Only** - All Dutchie data is fetched via their FilteredProducts GraphQL API
2. **Isolated Database** - Data lives in `dutchie_az_*` tables, NOT the legacy `products` table
3. **Append-Only Snapshots** - Every crawl creates snapshots, never overwrites historical data
4. **Stock Status Tracking** - Derived from `POSMetaData.children` inventory data
5. **Missing Product Detection** - Products not in feed are marked with `isPresentInFeed=false`
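Principles 3 and 5 can be illustrated together. The following is a minimal sketch, not the pipeline's actual implementation: `buildSnapshotRows` and `SnapshotRow` are hypothetical names, and the only detail taken from this document is the `isPresentInFeed` flag. Each crawl yields one new row per known product, and products absent from the feed are flagged rather than deleted.

```typescript
// Hypothetical row shape; only isPresentInFeed is named in the principles above.
interface SnapshotRow {
  productId: string;
  isPresentInFeed: boolean;
  crawledAt: Date;
}

// Build one append-only snapshot row per product for a single crawl.
function buildSnapshotRows(
  knownProductIds: string[],
  feedProductIds: string[],
  crawledAt: Date
): SnapshotRow[] {
  const inFeed = new Set(feedProductIds);
  // Union of previously known products and products seen in this crawl.
  const allIds = new Set([...knownProductIds, ...feedProductIds]);
  return Array.from(allIds).map((productId) => ({
    productId,
    isPresentInFeed: inFeed.has(productId),
    crawledAt,
  }));
}
```

Because rows are only ever appended, a product that disappears and later returns leaves a visible gap in its history instead of losing it.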
## Directory Structure
```
src/dutchie-az/
├── db/
│   ├── connection.ts       # Database connection pool
│   └── schema.ts           # Table definitions and migrations
├── routes/
│   └── index.ts            # REST API endpoints
├── services/
│   ├── graphql-client.ts   # Direct GraphQL fetch (Mode A + Mode B)
│   ├── product-crawler.ts  # Main crawler orchestration
│   └── scheduler.ts        # Jittered scheduling with wandering intervals
└── types/
    └── index.ts            # TypeScript interfaces
```
## Data Model
### Tables
- **dispensaries** - Arizona Dutchie stores with `platform_dispensary_id`
- **dutchie_products** - Canonical product identity (one row per product per store)
- **dutchie_product_snapshots** - Historical state per crawl (append-only)
- **job_schedules** - Scheduler configuration with jitter support
- **job_run_logs** - Execution history
### Stock Status
The `stock_status` field is derived from `POSMetaData.children`:
```typescript
function deriveStockStatus(children?: POSChild[]): StockStatus {
  if (!children || children.length === 0) return 'unknown';
  const totalAvailable = children.reduce((sum, c) =>
    sum + (c.quantityAvailable || 0), 0);
  return totalAvailable > 0 ? 'in_stock' : 'out_of_stock';
}
```
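For experimenting with the function outside the pipeline, the version below adds stand-in type definitions. The `StockStatus` and `POSChild` shapes here are assumptions for illustration, since the real ones live in the pipeline's types module and may carry more fields.

```typescript
// Assumed shapes, for illustration only.
type StockStatus = 'in_stock' | 'out_of_stock' | 'unknown';

interface POSChild {
  quantityAvailable?: number | null;
}

function deriveStockStatus(children?: POSChild[]): StockStatus {
  // No inventory records at all: stock cannot be determined.
  if (!children || children.length === 0) return 'unknown';
  // Sum availability across variants, treating missing values as 0.
  const totalAvailable = children.reduce(
    (sum, c) => sum + (c.quantityAvailable || 0),
    0
  );
  return totalAvailable > 0 ? 'in_stock' : 'out_of_stock';
}

const noInventory = deriveStockStatus(undefined);
const soldOut = deriveStockStatus([{ quantityAvailable: 0 }, { quantityAvailable: null }]);
const available = deriveStockStatus([{ quantityAvailable: 0 }, { quantityAvailable: 3 }]);
```

A product counts as in stock as soon as any single variant has a positive quantity.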
### Two-Mode Crawling
Mode A (UI Parity):
- `Status: null` - Returns what the UI shows
- Best for "current inventory" snapshot
Mode B (Max Coverage):
- `Status: 'Active'` - Returns all active products
- Catches items with `isBelowThreshold: true`
Both modes are merged to get maximum product coverage.
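One plausible way to merge the two result sets is to key products by ID and let Mode A win on conflicts. This is a sketch under assumptions: `mergeModes` and `CrawledProduct` are hypothetical names, and the precedence rule is an illustrative choice rather than confirmed pipeline behavior.

```typescript
// Hypothetical minimal product shape.
interface CrawledProduct {
  id: string;
  name: string;
}

// Union of both modes, deduplicated by product ID.
function mergeModes(modeA: CrawledProduct[], modeB: CrawledProduct[]): CrawledProduct[] {
  const byId = new Map<string, CrawledProduct>();
  // Mode A is processed first, so its version is kept when both modes return a product
  // (an assumed precedence rule).
  for (const product of [...modeA, ...modeB]) {
    if (!byId.has(product.id)) byId.set(product.id, product);
  }
  return Array.from(byId.values());
}
```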
## API Endpoints
All endpoints are mounted at `/api/dutchie-az/`:
```
GET /api/dutchie-az/dispensaries - List all dispensaries
GET /api/dutchie-az/dispensaries/:id - Get dispensary details
GET /api/dutchie-az/products - List products (with filters)
GET /api/dutchie-az/products/:id - Get product with snapshots
GET /api/dutchie-az/products/:id/snapshots - Get product snapshot history
POST /api/dutchie-az/crawl/:dispensaryId - Trigger manual crawl
GET /api/dutchie-az/schedule - Get scheduler status
POST /api/dutchie-az/schedule/run - Manually run scheduled jobs
GET /api/dutchie-az/stats - Dashboard statistics
```
## Scheduler
The scheduler uses **jitter** to avoid detection patterns:
```typescript
// Each job has independent "wandering" timing
interface JobSchedule {
  base_interval_minutes: number; // e.g., 240 (4 hours)
  jitter_minutes: number;        // e.g., 30 (±30 min)
  next_run_at: Date;             // Calculated with jitter after each run
}
```
Jobs run when `next_run_at <= NOW()`. After completion, the next run is calculated:
```
next_run_at = NOW() + base_interval + random(-jitter, +jitter)
```
This prevents crawls from clustering at predictable times.
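The formula above could be implemented roughly as follows. This is a minimal sketch: `computeNextRunAt` and `JitterConfig` are hypothetical names, and the random source is injectable purely so the calculation can be checked deterministically.

```typescript
// Hypothetical config mirroring base_interval_minutes and jitter_minutes.
interface JitterConfig {
  baseIntervalMinutes: number;
  jitterMinutes: number;
}

function computeNextRunAt(
  now: Date,
  config: JitterConfig,
  random: () => number = Math.random
): Date {
  // Map random() in [0, 1) onto an offset in [-jitter, +jitter).
  const offsetMinutes = (random() * 2 - 1) * config.jitterMinutes;
  const totalMinutes = config.baseIntervalMinutes + offsetMinutes;
  return new Date(now.getTime() + totalMinutes * 60000);
}
```

With a 240-minute base and 30 minutes of jitter, every run lands between 210 and 270 minutes after the previous one finishes.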
## Manual Testing
### Run a single dispensary crawl:
```bash
DATABASE_URL="..." npx tsx -e "
const { crawlDispensaryProducts } = require('./src/dutchie-az/services/product-crawler');
const { query } = require('./src/dutchie-az/db/connection');
async function test() {
  const { rows } = await query('SELECT * FROM dispensaries LIMIT 1');
  if (!rows[0]) return console.log('No dispensaries found');
  const result = await crawlDispensaryProducts(rows[0], 'rec', { useBothModes: true });
  console.log(JSON.stringify(result, null, 2));
}
test();
"
```
### Check stock status distribution:
```sql
SELECT stock_status, COUNT(*)
FROM dutchie_products
GROUP BY stock_status;
```
### View recent snapshots:
```sql
SELECT
  p.name,
  s.stock_status,
  s.is_present_in_feed,
  s.crawled_at
FROM dutchie_product_snapshots s
JOIN dutchie_products p ON p.id = s.dutchie_product_id
ORDER BY s.crawled_at DESC
LIMIT 20;
```
## Deprecated Code
The following files are **DEPRECATED** and will throw errors if called:
- `src/scrapers/dutchie-graphql.ts` - Wrote to legacy `products` table
- `src/scrapers/dutchie-graphql-direct.ts` - Wrote to legacy `products` table
- `src/scrapers/templates/dutchie.ts` - HTML/DOM scraper (unreliable)
- `src/scraper-v2/engine.ts` DutchieSpider - DOM-based extraction
If `store-crawl-orchestrator.ts` detects `provider='dutchie'` with `mode='production'`, it now routes to this dutchie-az pipeline automatically.
## Integration with Legacy System
The `store-crawl-orchestrator.ts` bridges the legacy stores system with dutchie-az:
1. When a store has `product_provider='dutchie'` and `product_crawler_mode='production'`
2. The orchestrator looks up the corresponding dispensary in `dutchie_az.dispensaries`
3. It calls `crawlDispensaryProducts()` from the dutchie-az pipeline
4. Results are logged but data stays in the dutchie_az tables
To use the dutchie-az pipeline independently:
- Navigate to `/dutchie-az-schedule` in the UI
- Use the REST API endpoints directly
- Run the scheduler service
## Environment Variables
```bash
# Database connection for dutchie-az (same DB, separate tables)
DATABASE_URL=postgresql://user:pass@host:port/database
```
## Troubleshooting
### "Dispensary not found in dutchie-az database"
The dispensary must exist in `dutchie_az.dispensaries` before crawling. Either:
1. Run discovery to populate dispensaries
2. Manually insert the dispensary with `platform_dispensary_id`
### GraphQL returns empty products
1. Check `platform_dispensary_id` is correct (the internal Dutchie ID, not slug)
2. Verify the dispensary is online and has menu data
3. Try both `rec` and `med` pricing types
### Snapshots show `stock_status='unknown'`
The product likely has no `POSMetaData.children` array. This happens for:
- Products without inventory tracking
- Manually managed inventory
---
Last updated: December 2025


@@ -1,129 +0,0 @@
/**
* Dutchie Configuration
*
* Centralized configuration for Dutchie GraphQL API interaction.
* Update hashes here when Dutchie changes their persisted query system.
*/
export const dutchieConfig = {
// ============================================================
// GRAPHQL ENDPOINT
// ============================================================
/** GraphQL endpoint - must be the api-3 graphql endpoint (NOT api-gw.dutchie.com which no longer exists) */
graphqlEndpoint: 'https://dutchie.com/api-3/graphql',
// ============================================================
// GRAPHQL PERSISTED QUERY HASHES
// ============================================================
//
// These hashes identify specific GraphQL operations.
// If Dutchie changes their schema, you may need to capture
// new hashes from live browser traffic (Network tab → graphql requests).
/** FilteredProducts - main product listing query */
filteredProductsHash: 'ee29c060826dc41c527e470e9ae502c9b2c169720faa0a9f5d25e1b9a530a4a0',
/** GetAddressBasedDispensaryData - resolve slug to internal ID */
getDispensaryDataHash: '13461f73abf7268770dfd05fe7e10c523084b2bb916a929c08efe3d87531977b',
/**
* ConsumerDispensaries - geo-based discovery
* NOTE: This is a placeholder guess. If discovery fails, either:
* 1. Capture the real hash from live traffic
* 2. Rely on known AZDHS slugs instead (set useDiscovery: false)
*/
consumerDispensariesHash: '0a5bfa6ca1d64ae47bcccb7c8077c87147cbc4e6982c17ceec97a2a4948b311b',
// ============================================================
// BEHAVIOR FLAGS
// ============================================================
/** Enable geo-based discovery (false = use known AZDHS slugs only) */
useDiscovery: true,
/** Prefer GET requests (true) or POST (false). GET is default. */
preferGet: true,
/**
* Enable POST fallback when GET fails with 405 or blocked.
* If true, will retry failed GETs as POSTs.
*/
enablePostFallback: true,
// ============================================================
// PAGINATION & RETRY
// ============================================================
/** Products per page for pagination */
perPage: 100,
/** Maximum pages to fetch (safety limit) */
maxPages: 200,
/** Number of retries for failed page fetches */
maxRetries: 1,
/** Delay between pages in ms */
pageDelayMs: 500,
/** Delay between modes in ms */
modeDelayMs: 2000,
// ============================================================
// HTTP HEADERS
// ============================================================
/** Default headers to mimic browser requests */
defaultHeaders: {
'accept': 'application/json, text/plain, */*',
'accept-language': 'en-US,en;q=0.9',
'apollographql-client-name': 'Marketplace (production)',
} as Record<string, string>,
/** User agent string */
userAgent:
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
// ============================================================
// BROWSER LAUNCH OPTIONS
// ============================================================
browserArgs: [
'--no-sandbox',
'--disable-setuid-sandbox',
'--disable-dev-shm-usage',
'--disable-blink-features=AutomationControlled',
],
/** Navigation timeout in ms */
navigationTimeout: 60000,
/** Initial page load delay in ms */
pageLoadDelay: 2000,
};
/**
* Get GraphQL hashes object for backward compatibility
*/
export const GRAPHQL_HASHES = {
FilteredProducts: dutchieConfig.filteredProductsHash,
GetAddressBasedDispensaryData: dutchieConfig.getDispensaryDataHash,
ConsumerDispensaries: dutchieConfig.consumerDispensariesHash,
};
/**
* Arizona geo centerpoints for discovery scans
*/
export const ARIZONA_CENTERPOINTS = [
{ name: 'Phoenix', lat: 33.4484, lng: -112.074 },
{ name: 'Tucson', lat: 32.2226, lng: -110.9747 },
{ name: 'Flagstaff', lat: 35.1983, lng: -111.6513 },
{ name: 'Mesa', lat: 33.4152, lng: -111.8315 },
{ name: 'Scottsdale', lat: 33.4942, lng: -111.9261 },
{ name: 'Tempe', lat: 33.4255, lng: -111.94 },
{ name: 'Yuma', lat: 32.6927, lng: -114.6277 },
{ name: 'Prescott', lat: 34.54, lng: -112.4685 },
{ name: 'Lake Havasu', lat: 34.4839, lng: -114.3224 },
{ name: 'Sierra Vista', lat: 31.5455, lng: -110.2773 },
];


@@ -1,131 +0,0 @@
/**
* CannaiQ Database Connection
*
* All database access for the CannaiQ platform goes through this module.
*
* SINGLE DATABASE ARCHITECTURE:
* - All services (auth, orchestrator, crawlers, admin) use this ONE database
* - States are modeled via states table + state_id on dispensaries (not separate DBs)
*
* CONFIGURATION (in priority order):
* 1. CANNAIQ_DB_URL - Full connection string (preferred)
* 2. Individual vars: CANNAIQ_DB_HOST, CANNAIQ_DB_PORT, CANNAIQ_DB_NAME, CANNAIQ_DB_USER, CANNAIQ_DB_PASS
* 3. DATABASE_URL - Legacy fallback for K8s compatibility
*
* IMPORTANT:
* - Do NOT create separate pools elsewhere
* - All services should import from this module
*/
import { Pool, PoolClient } from 'pg';
/**
* Get the database connection string from environment variables.
* Supports multiple configuration methods with fallback for legacy compatibility.
*/
function getConnectionString(): string {
// Priority 1: Full CANNAIQ connection URL
if (process.env.CANNAIQ_DB_URL) {
return process.env.CANNAIQ_DB_URL;
}
// Priority 2: Build from individual CANNAIQ env vars
const host = process.env.CANNAIQ_DB_HOST;
const port = process.env.CANNAIQ_DB_PORT;
const name = process.env.CANNAIQ_DB_NAME;
const user = process.env.CANNAIQ_DB_USER;
const pass = process.env.CANNAIQ_DB_PASS;
if (host && port && name && user && pass) {
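// Encode credentials so characters such as '@' or ':' do not break the URL
return `postgresql://${encodeURIComponent(user)}:${encodeURIComponent(pass)}@${host}:${port}/${name}`;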
}
// Priority 3: Fallback to DATABASE_URL for legacy/K8s compatibility
if (process.env.DATABASE_URL) {
return process.env.DATABASE_URL;
}
// Report what's missing
const required = ['CANNAIQ_DB_HOST', 'CANNAIQ_DB_PORT', 'CANNAIQ_DB_NAME', 'CANNAIQ_DB_USER', 'CANNAIQ_DB_PASS'];
const missing = required.filter((key) => !process.env[key]);
throw new Error(
`[CannaiQ DB] Missing database configuration.\n` +
`Set CANNAIQ_DB_URL, DATABASE_URL, or all of: ${missing.join(', ')}`
);
}
let pool: Pool | null = null;
/**
* Get the CannaiQ database pool (singleton)
*
* This is the canonical pool for all CannaiQ services.
* Do NOT create separate pools elsewhere.
*/
export function getPool(): Pool {
if (!pool) {
pool = new Pool({
connectionString: getConnectionString(),
max: 10,
idleTimeoutMillis: 30000,
connectionTimeoutMillis: 5000,
});
pool.on('error', (err) => {
console.error('[CannaiQ DB] Unexpected error on idle client:', err);
});
console.log('[CannaiQ DB] Pool initialized');
}
return pool;
}
/**
* @deprecated Use getPool() instead
*/
export function getDutchieAZPool(): Pool {
console.warn('[CannaiQ DB] getDutchieAZPool() is deprecated. Use getPool() instead.');
return getPool();
}
/**
* Execute a query on the CannaiQ database
*/
export async function query<T = any>(text: string, params?: any[]): Promise<{ rows: T[]; rowCount: number }> {
const p = getPool();
const result = await p.query(text, params);
return { rows: result.rows as T[], rowCount: result.rowCount || 0 };
}
/**
* Get a client from the pool for transaction use
*/
export async function getClient(): Promise<PoolClient> {
const p = getPool();
return p.connect();
}
/**
* Close the pool connection
*/
export async function closePool(): Promise<void> {
if (pool) {
await pool.end();
pool = null;
console.log('[CannaiQ DB] Pool closed');
}
}
/**
* Check if the database is accessible
*/
export async function healthCheck(): Promise<boolean> {
try {
const result = await query('SELECT 1 as ok');
return result.rows.length > 0 && result.rows[0].ok === 1;
} catch (error) {
console.error('[CannaiQ DB] Health check failed:', error);
return false;
}
}


@@ -1,137 +0,0 @@
/**
* Dispensary Column Definitions
*
* Centralized column list for dispensaries table queries.
* Handles optional columns that may not exist in all environments.
*
* USAGE:
* import { DISPENSARY_COLUMNS, DISPENSARY_COLUMNS_WITH_FAILED } from '../db/dispensary-columns';
* const result = await query(`SELECT ${DISPENSARY_COLUMNS} FROM dispensaries WHERE ...`);
*/
/**
* Core dispensary columns that always exist.
* These are guaranteed to be present in all environments.
*/
const CORE_COLUMNS = `
id, name, slug, city, state, zip, address, latitude, longitude,
menu_type, menu_url, platform_dispensary_id, website,
created_at, updated_at
`;
/**
* Optional columns with NULL fallback.
*
* provider_detection_data: Added in migration 044
* active_crawler_profile_id: Added in migration 041
*
* Selecting a column that does not exist makes the whole query fail, and
* COALESCE cannot help because the column reference itself is invalid.
*
* For pre-migration compatibility, we therefore select NULL::jsonb, which
* works whether or not the column exists.
* After migration 044 is applied, this can be changed to the real column.
*/
// TEMPORARY: Use NULL fallback until migration 044 is applied
// After running 044, change this to: provider_detection_data
const PROVIDER_DETECTION_COLUMN = `NULL::jsonb AS provider_detection_data`;
// After migration 044 is applied, uncomment this line and remove the above:
// const PROVIDER_DETECTION_COLUMN = `provider_detection_data`;
/**
* Standard dispensary columns for most queries.
* Includes provider_detection_data with NULL fallback for pre-migration compatibility.
*/
export const DISPENSARY_COLUMNS = `${CORE_COLUMNS.trim()},
${PROVIDER_DETECTION_COLUMN}`;
/**
* Dispensary columns including active_crawler_profile_id.
* Used by routes that need profile information.
*/
export const DISPENSARY_COLUMNS_WITH_PROFILE = `${CORE_COLUMNS.trim()},
${PROVIDER_DETECTION_COLUMN},
active_crawler_profile_id`;
/**
* Dispensary columns including failed_at.
* Used by worker for compatibility checks.
*/
export const DISPENSARY_COLUMNS_WITH_FAILED = `${CORE_COLUMNS.trim()},
${PROVIDER_DETECTION_COLUMN},
failed_at`;
/**
* NOTE: After migration 044 is applied, update PROVIDER_DETECTION_COLUMN above
* to use the real column instead of NULL fallback.
*
* To verify migration status:
* SELECT column_name FROM information_schema.columns
* WHERE table_name = 'dispensaries' AND column_name = 'provider_detection_data';
*/
// Cache for column existence check
let _providerDetectionColumnExists: boolean | null = null;
/**
* Check if provider_detection_data column exists in dispensaries table.
* Result is cached after first check.
*/
export async function hasProviderDetectionColumn(pool: { query: (sql: string) => Promise<{ rows: any[] }> }): Promise<boolean> {
if (_providerDetectionColumnExists !== null) {
return _providerDetectionColumnExists;
}
try {
const result = await pool.query(`
SELECT 1 FROM information_schema.columns
WHERE table_name = 'dispensaries' AND column_name = 'provider_detection_data'
`);
_providerDetectionColumnExists = result.rows.length > 0;
} catch {
_providerDetectionColumnExists = false;
}
return _providerDetectionColumnExists;
}
/**
* Safely update provider_detection_data column.
* If column doesn't exist, logs a warning but doesn't crash.
*
* @param pool - Database pool with query method
* @param dispensaryId - ID of dispensary to update
* @param data - JSONB data to merge into provider_detection_data
* @returns true if update succeeded, false if column doesn't exist
*/
export async function safeUpdateProviderDetectionData(
pool: { query: (sql: string, params?: any[]) => Promise<any> },
dispensaryId: number,
data: Record<string, any>
): Promise<boolean> {
const hasColumn = await hasProviderDetectionColumn(pool);
if (!hasColumn) {
console.warn(`[DispensaryColumns] provider_detection_data column not found. Run migration 044 to add it.`);
return false;
}
try {
await pool.query(
`UPDATE dispensaries
SET provider_detection_data = COALESCE(provider_detection_data, '{}'::jsonb) || $1::jsonb,
updated_at = NOW()
WHERE id = $2`,
[JSON.stringify(data), dispensaryId]
);
return true;
} catch (error: any) {
if (error.message?.includes('provider_detection_data')) {
console.warn(`[DispensaryColumns] Failed to update provider_detection_data: ${error.message}`);
return false;
}
throw error;
}
}
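
The file above hard-codes a `NULL::jsonb` fallback and, separately, probes `information_schema` for the column. A minimal sketch of how the two ideas could be combined so the column list is chosen at runtime rather than by editing a constant after migration 044. The names `QueryablePool`, `buildDispensaryColumns`, `providerDetectionColumnExists`, and the shortened `CORE` list are hypothetical and not part of the original module.

```typescript
// Hypothetical sketch: pick the provider_detection_data expression at runtime.
type QueryablePool = {
  query: (sql: string) => Promise<{ rows: unknown[] }>;
};

// Abbreviated stand-in for the CORE_COLUMNS list in the original file.
const CORE = 'id, name, slug, city, state';

// Pure helper: real column when it exists, typed NULL placeholder otherwise.
function buildDispensaryColumns(hasProviderDetection: boolean): string {
  const providerColumn = hasProviderDetection
    ? 'provider_detection_data'
    : 'NULL::jsonb AS provider_detection_data';
  return `${CORE}, ${providerColumn}`;
}

// Memoize the promise itself so concurrent callers share a single lookup.
let columnCheck: Promise<boolean> | null = null;

function providerDetectionColumnExists(pool: QueryablePool): Promise<boolean> {
  if (columnCheck === null) {
    columnCheck = pool
      .query(
        `SELECT 1 FROM information_schema.columns
         WHERE table_name = 'dispensaries' AND column_name = 'provider_detection_data'`
      )
      .then((result) => result.rows.length > 0)
      .catch(() => false); // treat a failed probe as "column missing"
  }
  return columnCheck;
}
```

A caller would await `providerDetectionColumnExists(pool)` once and pass the result to `buildDispensaryColumns`. Caching the promise rather than the boolean is a design choice that avoids duplicate probes when several requests arrive before the first lookup finishes.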

@@ -1,29 +0,0 @@
/**
* Dutchie AZ Schema Bootstrap
*
* Run this to create/update the dutchie_az tables (dutchie_products, dutchie_product_snapshots, etc.)
* in the AZ pipeline database. This is separate from the legacy schema.
*
* Usage:
* TS_NODE_TRANSPILE_ONLY=1 npx ts-node src/dutchie-az/db/migrate.ts
* or (after build)
* node dist/dutchie-az/db/migrate.js
*/
import { createSchema } from './schema';
import { closePool } from './connection';
async function main() {
try {
console.log('[DutchieAZ] Running schema migration...');
await createSchema();
console.log('[DutchieAZ] Schema migration complete.');
} catch (err: any) {
console.error('[DutchieAZ] Schema migration failed:', err.message);
process.exitCode = 1;
} finally {
await closePool();
}
}
main();

@@ -1,408 +0,0 @@
/**
* Dutchie AZ Database Schema
*
* Creates all tables for the isolated Dutchie Arizona data pipeline.
* Run this to initialize the dutchie_az database.
*/
import { query, getClient } from './connection';
/**
* SQL statements to create all tables
*/
const SCHEMA_SQL = `
-- ============================================================
-- DISPENSARIES TABLE
-- Stores discovered Dutchie dispensaries in Arizona
-- ============================================================
CREATE TABLE IF NOT EXISTS dispensaries (
id SERIAL PRIMARY KEY,
platform VARCHAR(20) NOT NULL DEFAULT 'dutchie',
name VARCHAR(255) NOT NULL,
slug VARCHAR(255) NOT NULL,
city VARCHAR(100) NOT NULL,
state VARCHAR(10) NOT NULL DEFAULT 'AZ',
postal_code VARCHAR(20),
address TEXT,
latitude DECIMAL(10, 7),
longitude DECIMAL(10, 7),
platform_dispensary_id VARCHAR(100),
is_delivery BOOLEAN DEFAULT false,
is_pickup BOOLEAN DEFAULT true,
raw_metadata JSONB,
last_crawled_at TIMESTAMPTZ,
product_count INTEGER DEFAULT 0,
created_at TIMESTAMPTZ DEFAULT NOW(),
updated_at TIMESTAMPTZ DEFAULT NOW(),
CONSTRAINT uk_dispensaries_platform_slug UNIQUE (platform, slug, city, state)
);
CREATE INDEX IF NOT EXISTS idx_dispensaries_platform ON dispensaries(platform);
CREATE INDEX IF NOT EXISTS idx_dispensaries_platform_id ON dispensaries(platform_dispensary_id);
CREATE INDEX IF NOT EXISTS idx_dispensaries_state ON dispensaries(state);
CREATE INDEX IF NOT EXISTS idx_dispensaries_city ON dispensaries(city);
-- ============================================================
-- DUTCHIE_PRODUCTS TABLE
-- Canonical product identity per store
-- ============================================================
CREATE TABLE IF NOT EXISTS dutchie_products (
id SERIAL PRIMARY KEY,
dispensary_id INTEGER NOT NULL REFERENCES dispensaries(id) ON DELETE CASCADE,
platform VARCHAR(20) NOT NULL DEFAULT 'dutchie',
external_product_id VARCHAR(100) NOT NULL,
platform_dispensary_id VARCHAR(100) NOT NULL,
c_name VARCHAR(500),
name VARCHAR(500) NOT NULL,
-- Brand
brand_name VARCHAR(255),
brand_id VARCHAR(100),
brand_logo_url TEXT,
-- Classification
type VARCHAR(100),
subcategory VARCHAR(100),
strain_type VARCHAR(50),
provider VARCHAR(100),
-- Potency
thc DECIMAL(10, 4),
thc_content DECIMAL(10, 4),
cbd DECIMAL(10, 4),
cbd_content DECIMAL(10, 4),
cannabinoids_v2 JSONB,
effects JSONB,
-- Status / flags
status VARCHAR(50),
medical_only BOOLEAN DEFAULT false,
rec_only BOOLEAN DEFAULT false,
featured BOOLEAN DEFAULT false,
coming_soon BOOLEAN DEFAULT false,
certificate_of_analysis_enabled BOOLEAN DEFAULT false,
is_below_threshold BOOLEAN DEFAULT false,
is_below_kiosk_threshold BOOLEAN DEFAULT false,
options_below_threshold BOOLEAN DEFAULT false,
options_below_kiosk_threshold BOOLEAN DEFAULT false,
-- Derived stock status: 'in_stock', 'out_of_stock', 'unknown'
stock_status VARCHAR(20) DEFAULT 'unknown',
total_quantity_available INTEGER DEFAULT 0,
-- Images
primary_image_url TEXT,
images JSONB,
-- Misc
measurements JSONB,
weight VARCHAR(50),
past_c_names TEXT[],
created_at_dutchie TIMESTAMPTZ,
updated_at_dutchie TIMESTAMPTZ,
latest_raw_payload JSONB,
created_at TIMESTAMPTZ DEFAULT NOW(),
updated_at TIMESTAMPTZ DEFAULT NOW(),
CONSTRAINT uk_dutchie_products UNIQUE (dispensary_id, external_product_id)
);
CREATE INDEX IF NOT EXISTS idx_dutchie_products_dispensary ON dutchie_products(dispensary_id);
CREATE INDEX IF NOT EXISTS idx_dutchie_products_external_id ON dutchie_products(external_product_id);
CREATE INDEX IF NOT EXISTS idx_dutchie_products_platform_disp ON dutchie_products(platform_dispensary_id);
CREATE INDEX IF NOT EXISTS idx_dutchie_products_brand ON dutchie_products(brand_name);
CREATE INDEX IF NOT EXISTS idx_dutchie_products_type ON dutchie_products(type);
CREATE INDEX IF NOT EXISTS idx_dutchie_products_subcategory ON dutchie_products(subcategory);
CREATE INDEX IF NOT EXISTS idx_dutchie_products_status ON dutchie_products(status);
CREATE INDEX IF NOT EXISTS idx_dutchie_products_strain ON dutchie_products(strain_type);
CREATE INDEX IF NOT EXISTS idx_dutchie_products_stock_status ON dutchie_products(stock_status);
-- ============================================================
-- DUTCHIE_PRODUCT_SNAPSHOTS TABLE
-- Historical state per crawl, includes options[]
-- ============================================================
CREATE TABLE IF NOT EXISTS dutchie_product_snapshots (
id SERIAL PRIMARY KEY,
dutchie_product_id INTEGER NOT NULL REFERENCES dutchie_products(id) ON DELETE CASCADE,
dispensary_id INTEGER NOT NULL REFERENCES dispensaries(id) ON DELETE CASCADE,
platform_dispensary_id VARCHAR(100) NOT NULL,
external_product_id VARCHAR(100) NOT NULL,
pricing_type VARCHAR(20) DEFAULT 'unknown',
crawl_mode VARCHAR(20) DEFAULT 'mode_a', -- 'mode_a' (UI parity) or 'mode_b' (max coverage)
status VARCHAR(50),
featured BOOLEAN DEFAULT false,
special BOOLEAN DEFAULT false,
medical_only BOOLEAN DEFAULT false,
rec_only BOOLEAN DEFAULT false,
-- Flag indicating if product was present in feed (false = missing_from_feed snapshot)
is_present_in_feed BOOLEAN DEFAULT true,
-- Derived stock status
stock_status VARCHAR(20) DEFAULT 'unknown',
-- Price summary (in cents)
rec_min_price_cents INTEGER,
rec_max_price_cents INTEGER,
rec_min_special_price_cents INTEGER,
med_min_price_cents INTEGER,
med_max_price_cents INTEGER,
med_min_special_price_cents INTEGER,
wholesale_min_price_cents INTEGER,
-- Inventory summary
total_quantity_available INTEGER,
total_kiosk_quantity_available INTEGER,
manual_inventory BOOLEAN DEFAULT false,
is_below_threshold BOOLEAN DEFAULT false,
is_below_kiosk_threshold BOOLEAN DEFAULT false,
-- Option-level data (from POSMetaData.children)
options JSONB,
-- Full raw product node
raw_payload JSONB NOT NULL,
crawled_at TIMESTAMPTZ NOT NULL,
created_at TIMESTAMPTZ DEFAULT NOW(),
updated_at TIMESTAMPTZ DEFAULT NOW()
);
CREATE INDEX IF NOT EXISTS idx_snapshots_product ON dutchie_product_snapshots(dutchie_product_id);
CREATE INDEX IF NOT EXISTS idx_snapshots_dispensary ON dutchie_product_snapshots(dispensary_id);
CREATE INDEX IF NOT EXISTS idx_snapshots_crawled_at ON dutchie_product_snapshots(crawled_at);
CREATE INDEX IF NOT EXISTS idx_snapshots_platform_disp ON dutchie_product_snapshots(platform_dispensary_id);
CREATE INDEX IF NOT EXISTS idx_snapshots_external_id ON dutchie_product_snapshots(external_product_id);
CREATE INDEX IF NOT EXISTS idx_snapshots_special ON dutchie_product_snapshots(special) WHERE special = true;
CREATE INDEX IF NOT EXISTS idx_snapshots_stock_status ON dutchie_product_snapshots(stock_status);
CREATE INDEX IF NOT EXISTS idx_snapshots_crawl_mode ON dutchie_product_snapshots(crawl_mode);
-- ============================================================
-- CRAWL_JOBS TABLE
-- Tracks crawl execution status
-- ============================================================
CREATE TABLE IF NOT EXISTS crawl_jobs (
id SERIAL PRIMARY KEY,
job_type VARCHAR(50) NOT NULL,
dispensary_id INTEGER REFERENCES dispensaries(id) ON DELETE SET NULL,
status VARCHAR(20) NOT NULL DEFAULT 'pending',
started_at TIMESTAMPTZ,
completed_at TIMESTAMPTZ,
error_message TEXT,
products_found INTEGER,
snapshots_created INTEGER,
metadata JSONB,
created_at TIMESTAMPTZ DEFAULT NOW(),
updated_at TIMESTAMPTZ DEFAULT NOW()
);
CREATE INDEX IF NOT EXISTS idx_crawl_jobs_type ON crawl_jobs(job_type);
CREATE INDEX IF NOT EXISTS idx_crawl_jobs_status ON crawl_jobs(status);
CREATE INDEX IF NOT EXISTS idx_crawl_jobs_dispensary ON crawl_jobs(dispensary_id);
CREATE INDEX IF NOT EXISTS idx_crawl_jobs_created ON crawl_jobs(created_at);
-- ============================================================
-- JOB_SCHEDULES TABLE
-- Stores schedule configuration for recurring jobs with jitter support
-- Each job has independent timing that "wanders" over time
-- ============================================================
CREATE TABLE IF NOT EXISTS job_schedules (
id SERIAL PRIMARY KEY,
job_name VARCHAR(100) NOT NULL UNIQUE,
description TEXT,
enabled BOOLEAN DEFAULT true,
-- Timing configuration (jitter makes times "wander")
base_interval_minutes INTEGER NOT NULL DEFAULT 240, -- e.g., 4 hours
jitter_minutes INTEGER NOT NULL DEFAULT 30, -- e.g., ±30 min
-- Last run tracking
last_run_at TIMESTAMPTZ,
last_status VARCHAR(20), -- 'success', 'error', 'partial', 'running'
last_error_message TEXT,
last_duration_ms INTEGER,
-- Next run (calculated with jitter after each run)
next_run_at TIMESTAMPTZ,
-- Additional config
job_config JSONB, -- e.g., { pricingType: 'rec', useBothModes: true }
created_at TIMESTAMPTZ DEFAULT NOW(),
updated_at TIMESTAMPTZ DEFAULT NOW()
);
CREATE INDEX IF NOT EXISTS idx_job_schedules_enabled ON job_schedules(enabled);
CREATE INDEX IF NOT EXISTS idx_job_schedules_next_run ON job_schedules(next_run_at);
-- ============================================================
-- JOB_RUN_LOGS TABLE
-- Stores history of job runs for monitoring
-- ============================================================
CREATE TABLE IF NOT EXISTS job_run_logs (
id SERIAL PRIMARY KEY,
schedule_id INTEGER NOT NULL REFERENCES job_schedules(id) ON DELETE CASCADE,
job_name VARCHAR(100) NOT NULL,
status VARCHAR(20) NOT NULL, -- 'pending', 'running', 'success', 'error', 'partial'
started_at TIMESTAMPTZ,
completed_at TIMESTAMPTZ,
duration_ms INTEGER,
error_message TEXT,
-- Results summary
items_processed INTEGER,
items_succeeded INTEGER,
items_failed INTEGER,
metadata JSONB, -- Additional run details
created_at TIMESTAMPTZ DEFAULT NOW()
);
CREATE INDEX IF NOT EXISTS idx_job_run_logs_schedule ON job_run_logs(schedule_id);
CREATE INDEX IF NOT EXISTS idx_job_run_logs_job_name ON job_run_logs(job_name);
CREATE INDEX IF NOT EXISTS idx_job_run_logs_status ON job_run_logs(status);
CREATE INDEX IF NOT EXISTS idx_job_run_logs_created ON job_run_logs(created_at);
-- ============================================================
-- VIEWS FOR EASY QUERYING
-- ============================================================
-- Categories derived from products
CREATE OR REPLACE VIEW v_categories AS
SELECT
type,
subcategory,
COUNT(DISTINCT id) as product_count,
COUNT(DISTINCT dispensary_id) as dispensary_count,
AVG(thc) as avg_thc,
MIN(thc) as min_thc,
MAX(thc) as max_thc
FROM dutchie_products
WHERE type IS NOT NULL
GROUP BY type, subcategory
ORDER BY type, subcategory;
-- Brands derived from products
CREATE OR REPLACE VIEW v_brands AS
SELECT
brand_name,
brand_id,
MAX(brand_logo_url) as brand_logo_url,
COUNT(DISTINCT id) as product_count,
COUNT(DISTINCT dispensary_id) as dispensary_count,
ARRAY_AGG(DISTINCT type) FILTER (WHERE type IS NOT NULL) as product_types
FROM dutchie_products
WHERE brand_name IS NOT NULL
GROUP BY brand_name, brand_id
ORDER BY product_count DESC;
-- Latest snapshot per product (most recent crawl data)
CREATE OR REPLACE VIEW v_latest_snapshots AS
SELECT DISTINCT ON (dutchie_product_id)
s.*
FROM dutchie_product_snapshots s
ORDER BY dutchie_product_id, crawled_at DESC;
-- Dashboard stats
CREATE OR REPLACE VIEW v_dashboard_stats AS
SELECT
(SELECT COUNT(*) FROM dispensaries WHERE state = 'AZ') as dispensary_count,
(SELECT COUNT(*) FROM dutchie_products) as product_count,
(SELECT COUNT(*) FROM dutchie_product_snapshots WHERE crawled_at > NOW() - INTERVAL '24 hours') as snapshots_24h,
(SELECT MAX(crawled_at) FROM dutchie_product_snapshots) as last_crawl_time,
(SELECT COUNT(*) FROM crawl_jobs WHERE status = 'failed' AND created_at > NOW() - INTERVAL '24 hours') as failed_jobs_24h,
(SELECT COUNT(DISTINCT brand_name) FROM dutchie_products WHERE brand_name IS NOT NULL) as brand_count,
(SELECT COUNT(DISTINCT (type, subcategory)) FROM dutchie_products WHERE type IS NOT NULL) as category_count;
`;
/**
* Run the schema migration
*/
export async function createSchema(): Promise<void> {
console.log('[DutchieAZ Schema] Creating database schema...');
const client = await getClient();
try {
await client.query('BEGIN');
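// Split into individual statements and execute.
// Strip full-line comments first: most statements are preceded by a comment
// banner, so filtering on startsWith('--') would drop the statement itself.
const statements = SCHEMA_SQL
.split(';')
.map(s => s.replace(/^\s*--.*$/gm, '').trim())
.filter(s => s.length > 0);
for (const statement of statements) {
await client.query(statement + ';');
}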
await client.query('COMMIT');
console.log('[DutchieAZ Schema] Schema created successfully');
} catch (error) {
await client.query('ROLLBACK');
console.error('[DutchieAZ Schema] Failed to create schema:', error);
throw error;
} finally {
client.release();
}
}
/**
* Drop all tables (for development/testing)
*/
export async function dropSchema(): Promise<void> {
console.log('[DutchieAZ Schema] Dropping all tables...');
await query(`
DROP VIEW IF EXISTS v_dashboard_stats CASCADE;
DROP VIEW IF EXISTS v_latest_snapshots CASCADE;
DROP VIEW IF EXISTS v_brands CASCADE;
DROP VIEW IF EXISTS v_categories CASCADE;
DROP TABLE IF EXISTS job_run_logs CASCADE;
DROP TABLE IF EXISTS job_schedules CASCADE;
DROP TABLE IF EXISTS crawl_schedule CASCADE;
DROP TABLE IF EXISTS crawl_jobs CASCADE;
DROP TABLE IF EXISTS dutchie_product_snapshots CASCADE;
DROP TABLE IF EXISTS dutchie_products CASCADE;
DROP TABLE IF EXISTS dispensaries CASCADE;
`);
console.log('[DutchieAZ Schema] All tables dropped');
}
/**
* Check if schema exists
*/
export async function schemaExists(): Promise<boolean> {
try {
const result = await query(`
SELECT EXISTS (
SELECT FROM information_schema.tables
WHERE table_name = 'dispensaries'
) as exists
`);
return result.rows[0]?.exists === true;
} catch (error) {
return false;
}
}
/**
* Initialize schema if it doesn't exist
*/
export async function ensureSchema(): Promise<void> {
const exists = await schemaExists();
if (!exists) {
await createSchema();
} else {
console.log('[DutchieAZ Schema] Schema already exists');
}
}
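
`createSchema` splits one large SQL string on `;` and runs each piece. A minimal sketch of that splitting step in isolation, showing why comment banners need to be stripped before empty chunks are filtered out. `splitSqlStatements` and `sample` are hypothetical names, and the helper assumes no semicolons appear inside string literals, comments, or dollar-quoted bodies (which holds for the schema above but not for SQL in general).

```typescript
// Hypothetical helper: split a multi-statement SQL script into runnable statements.
function splitSqlStatements(sql: string): string[] {
  return sql
    .split(';')
    .map((chunk) =>
      chunk
        .split('\n')
        .filter((line) => !line.trim().startsWith('--')) // drop full-line comments
        .join('\n')
        .trim()
    )
    .filter((statement) => statement.length > 0);
}

const sample = `
-- DISPENSARIES TABLE
CREATE TABLE IF NOT EXISTS dispensaries (
  id SERIAL PRIMARY KEY
);
-- trailing comment only
`;

const statements = splitSqlStatements(sample);
console.log(statements.length);
```

If chunks were instead discarded whenever they start with `--`, the `CREATE TABLE` in `sample` would be thrown away along with its banner. Inline comments that follow code on the same line are left in place, since PostgreSQL accepts them.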

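The `job_schedules` table stores `base_interval_minutes` and `jitter_minutes` and says `next_run_at` is recalculated with jitter after each run, but the calculation itself is not part of this file. A minimal sketch of one way to do it, assuming a uniform offset in the range ±`jitter_minutes`. `computeNextRunAt` and its injectable `random` parameter are hypothetical.

```typescript
// Hypothetical sketch of a "wandering" schedule: base interval plus a uniform
// random offset in [-jitterMinutes, +jitterMinutes).
function computeNextRunAt(
  lastRunAt: Date,
  baseIntervalMinutes: number,
  jitterMinutes: number,
  random: () => number = Math.random // injectable for deterministic tests
): Date {
  // Map [0, 1) onto [-jitterMinutes, +jitterMinutes).
  const offsetMinutes = (random() * 2 - 1) * jitterMinutes;
  const delayMs = (baseIntervalMinutes + offsetMinutes) * 60000;
  return new Date(lastRunAt.getTime() + delayMs);
}
```

With the table defaults (240 and 30), each run would land between 3.5 and 4.5 hours after the previous one, so start times drift rather than repeating at fixed clock times.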
@@ -1,403 +0,0 @@
/**
* DtCityDiscoveryService
*
* Core service for Dutchie city discovery.
* Contains shared logic used by multiple entrypoints.
*
* Responsibilities:
* - Browser/API-based city fetching
* - Manual city seeding
* - City upsert operations
*/
import { Pool } from 'pg';
import axios from 'axios';
import puppeteer from 'puppeteer-extra';
import StealthPlugin from 'puppeteer-extra-plugin-stealth';
puppeteer.use(StealthPlugin());
// ============================================================
// TYPES
// ============================================================
export interface DutchieCity {
name: string;
slug: string;
stateCode: string | null;
countryCode: string;
url?: string;
}
export interface CityDiscoveryResult {
citiesFound: number;
citiesInserted: number;
citiesUpdated: number;
errors: string[];
durationMs: number;
}
export interface ManualSeedResult {
city: DutchieCity;
id: number;
wasInserted: boolean;
}
// ============================================================
// US STATE CODE MAPPING
// ============================================================
export const US_STATE_MAP: Record<string, string> = {
'alabama': 'AL', 'alaska': 'AK', 'arizona': 'AZ', 'arkansas': 'AR',
'california': 'CA', 'colorado': 'CO', 'connecticut': 'CT', 'delaware': 'DE',
'florida': 'FL', 'georgia': 'GA', 'hawaii': 'HI', 'idaho': 'ID',
'illinois': 'IL', 'indiana': 'IN', 'iowa': 'IA', 'kansas': 'KS',
'kentucky': 'KY', 'louisiana': 'LA', 'maine': 'ME', 'maryland': 'MD',
'massachusetts': 'MA', 'michigan': 'MI', 'minnesota': 'MN', 'mississippi': 'MS',
'missouri': 'MO', 'montana': 'MT', 'nebraska': 'NE', 'nevada': 'NV',
'new-hampshire': 'NH', 'new-jersey': 'NJ', 'new-mexico': 'NM', 'new-york': 'NY',
'north-carolina': 'NC', 'north-dakota': 'ND', 'ohio': 'OH', 'oklahoma': 'OK',
'oregon': 'OR', 'pennsylvania': 'PA', 'rhode-island': 'RI', 'south-carolina': 'SC',
'south-dakota': 'SD', 'tennessee': 'TN', 'texas': 'TX', 'utah': 'UT',
'vermont': 'VT', 'virginia': 'VA', 'washington': 'WA', 'west-virginia': 'WV',
'wisconsin': 'WI', 'wyoming': 'WY', 'district-of-columbia': 'DC',
};
// Canadian province mapping
export const CA_PROVINCE_MAP: Record<string, string> = {
'alberta': 'AB', 'british-columbia': 'BC', 'manitoba': 'MB',
'new-brunswick': 'NB', 'newfoundland-and-labrador': 'NL',
'northwest-territories': 'NT', 'nova-scotia': 'NS', 'nunavut': 'NU',
'ontario': 'ON', 'prince-edward-island': 'PE', 'quebec': 'QC',
'saskatchewan': 'SK', 'yukon': 'YT',
};
// ============================================================
// CITY FETCHING (AUTO DISCOVERY)
// ============================================================
/**
* Fetch cities from Dutchie's /cities page using Puppeteer.
*/
export async function fetchCitiesFromBrowser(): Promise<DutchieCity[]> {
console.log('[DtCityDiscoveryService] Launching browser to fetch cities...');
const browser = await puppeteer.launch({
headless: 'new',
args: ['--no-sandbox', '--disable-setuid-sandbox', '--disable-dev-shm-usage'],
});
try {
const page = await browser.newPage();
await page.setUserAgent(
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36'
);
console.log('[DtCityDiscoveryService] Navigating to https://dutchie.com/cities...');
await page.goto('https://dutchie.com/cities', {
waitUntil: 'networkidle2',
timeout: 60000,
});
await new Promise((r) => setTimeout(r, 3000));
const cities = await page.evaluate(() => {
const cityLinks: Array<{
name: string;
slug: string;
url: string;
stateSlug: string | null;
}> = [];
const links = document.querySelectorAll('a[href*="/city/"]');
links.forEach((link) => {
const href = (link as HTMLAnchorElement).href;
const text = (link as HTMLElement).innerText?.trim();
const match = href.match(/\/city\/([^/]+)\/([^/?]+)/);
if (match && text) {
cityLinks.push({
name: text,
slug: match[2],
url: href,
stateSlug: match[1],
});
}
});
return cityLinks;
});
console.log(`[DtCityDiscoveryService] Extracted ${cities.length} city links from page`);
return cities.map((city) => {
let countryCode = 'US';
let stateCode: string | null = null;
if (city.stateSlug) {
if (US_STATE_MAP[city.stateSlug]) {
stateCode = US_STATE_MAP[city.stateSlug];
countryCode = 'US';
} else if (CA_PROVINCE_MAP[city.stateSlug]) {
stateCode = CA_PROVINCE_MAP[city.stateSlug];
countryCode = 'CA';
} else if (city.stateSlug.length === 2) {
stateCode = city.stateSlug.toUpperCase();
if (Object.values(CA_PROVINCE_MAP).includes(stateCode)) {
countryCode = 'CA';
}
}
}
return {
name: city.name,
slug: city.slug,
stateCode,
countryCode,
url: city.url,
};
});
} finally {
await browser.close();
}
}
/**
* Fetch cities via API endpoints (fallback).
*/
export async function fetchCitiesFromAPI(): Promise<DutchieCity[]> {
console.log('[DtCityDiscoveryService] Attempting API-based city discovery...');
const apiEndpoints = [
'https://dutchie.com/api/cities',
'https://api.dutchie.com/v1/cities',
];
for (const endpoint of apiEndpoints) {
try {
const response = await axios.get(endpoint, {
headers: {
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 Chrome/120.0.0.0',
Accept: 'application/json',
},
timeout: 15000,
});
if (response.data && Array.isArray(response.data)) {
console.log(`[DtCityDiscoveryService] API returned ${response.data.length} cities`);
return response.data.map((c: any) => ({
name: c.name || c.city,
slug: c.slug || c.citySlug,
stateCode: c.stateCode || c.state || null,
countryCode: c.countryCode || c.country || 'US',
}));
}
} catch (error: any) {
console.log(`[DtCityDiscoveryService] API ${endpoint} failed: ${error.message}`);
}
}
return [];
}
// ============================================================
// DATABASE OPERATIONS
// ============================================================
/**
* Upsert a city into dutchie_discovery_cities
*/
export async function upsertCity(
pool: Pool,
city: DutchieCity
): Promise<{ id: number; inserted: boolean; updated: boolean }> {
const result = await pool.query(
`
INSERT INTO dutchie_discovery_cities (
platform,
city_name,
city_slug,
state_code,
country_code,
crawl_enabled,
created_at,
updated_at
) VALUES (
'dutchie',
$1,
$2,
$3,
$4,
TRUE,
NOW(),
NOW()
)
ON CONFLICT (platform, country_code, state_code, city_slug)
DO UPDATE SET
city_name = EXCLUDED.city_name,
crawl_enabled = TRUE,
updated_at = NOW()
RETURNING id, (xmax = 0) AS inserted
`,
[city.name, city.slug, city.stateCode, city.countryCode]
);
const inserted = result.rows[0]?.inserted === true;
return {
id: result.rows[0]?.id,
inserted,
updated: !inserted,
};
}
// ============================================================
// MAIN SERVICE CLASS
// ============================================================
export class DtCityDiscoveryService {
constructor(private pool: Pool) {}
/**
* Run auto-discovery (browser + API fallback)
*/
async runAutoDiscovery(): Promise<CityDiscoveryResult> {
const startTime = Date.now();
const errors: string[] = [];
let citiesFound = 0;
let citiesInserted = 0;
let citiesUpdated = 0;
console.log('[DtCityDiscoveryService] Starting auto city discovery...');
try {
let cities = await fetchCitiesFromBrowser();
if (cities.length === 0) {
console.log('[DtCityDiscoveryService] Browser returned 0 cities, trying API...');
cities = await fetchCitiesFromAPI();
}
citiesFound = cities.length;
console.log(`[DtCityDiscoveryService] Found ${citiesFound} cities`);
for (const city of cities) {
try {
const result = await upsertCity(this.pool, city);
if (result.inserted) citiesInserted++;
else if (result.updated) citiesUpdated++;
} catch (error: any) {
const msg = `Failed to upsert city ${city.slug}: ${error.message}`;
console.error(`[DtCityDiscoveryService] ${msg}`);
errors.push(msg);
}
}
} catch (error: any) {
const msg = `Auto discovery failed: ${error.message}`;
console.error(`[DtCityDiscoveryService] ${msg}`);
errors.push(msg);
}
const durationMs = Date.now() - startTime;
return {
citiesFound,
citiesInserted,
citiesUpdated,
errors,
durationMs,
};
}
/**
* Seed a single city manually
*/
async seedCity(city: DutchieCity): Promise<ManualSeedResult> {
console.log(`[DtCityDiscoveryService] Seeding city: ${city.name} (${city.slug}), ${city.stateCode}, ${city.countryCode}`);
const result = await upsertCity(this.pool, city);
return {
city,
id: result.id,
wasInserted: result.inserted,
};
}
/**
* Seed multiple cities from a list
*/
async seedCities(cities: DutchieCity[]): Promise<{
results: ManualSeedResult[];
errors: string[];
}> {
const results: ManualSeedResult[] = [];
const errors: string[] = [];
for (const city of cities) {
try {
const result = await this.seedCity(city);
results.push(result);
} catch (error: any) {
errors.push(`${city.slug}: ${error.message}`);
}
}
return { results, errors };
}
/**
* Get statistics about discovered cities
*/
async getStats(): Promise<{
total: number;
byCountry: Array<{ countryCode: string; count: number }>;
byState: Array<{ stateCode: string; countryCode: string; count: number }>;
crawlEnabled: number;
neverCrawled: number;
}> {
const [totalRes, byCountryRes, byStateRes, enabledRes, neverRes] = await Promise.all([
this.pool.query('SELECT COUNT(*) as cnt FROM dutchie_discovery_cities WHERE platform = \'dutchie\''),
this.pool.query(`
SELECT country_code, COUNT(*) as cnt
FROM dutchie_discovery_cities
WHERE platform = 'dutchie'
GROUP BY country_code
ORDER BY cnt DESC
`),
this.pool.query(`
SELECT state_code, country_code, COUNT(*) as cnt
FROM dutchie_discovery_cities
WHERE platform = 'dutchie' AND state_code IS NOT NULL
GROUP BY state_code, country_code
ORDER BY cnt DESC
`),
this.pool.query(`
SELECT COUNT(*) as cnt
FROM dutchie_discovery_cities
WHERE platform = 'dutchie' AND crawl_enabled = TRUE
`),
this.pool.query(`
SELECT COUNT(*) as cnt
FROM dutchie_discovery_cities
WHERE platform = 'dutchie' AND last_crawled_at IS NULL
`),
]);
return {
total: parseInt(totalRes.rows[0]?.cnt || '0', 10),
byCountry: byCountryRes.rows.map((r) => ({
countryCode: r.country_code,
count: parseInt(r.cnt, 10),
})),
byState: byStateRes.rows.map((r) => ({
stateCode: r.state_code,
countryCode: r.country_code,
count: parseInt(r.cnt, 10),
})),
crawlEnabled: parseInt(enabledRes.rows[0]?.cnt || '0', 10),
neverCrawled: parseInt(neverRes.rows[0]?.cnt || '0', 10),
};
}
}
export default DtCityDiscoveryService;
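
The browser-based discovery extracts `/city/{state}/{city}` links and then resolves the state slug to a code and country. A minimal sketch of that resolution as a pure function, which makes the fallback order easy to test without launching Puppeteer. `parseCityUrl`, `ParsedCityUrl`, and the abbreviated `US_STATES` and `CA_PROVINCES` tables are hypothetical stand-ins for the full maps in the service.

```typescript
// Hypothetical, abbreviated lookup tables; the service's maps cover every state and province.
const US_STATES: Record<string, string> = { arizona: 'AZ', 'new-york': 'NY' };
const CA_PROVINCES: Record<string, string> = { ontario: 'ON', 'british-columbia': 'BC' };

interface ParsedCityUrl {
  citySlug: string;
  stateCode: string | null;
  countryCode: string;
}

// Resolution order mirrors the service: full US state name, then full Canadian
// province name, then a bare two-letter code.
function parseCityUrl(url: string): ParsedCityUrl | null {
  const match = url.match(/\/city\/([^/]+)\/([^/?]+)/);
  if (!match) return null;
  const stateSlug = match[1];
  const citySlug = match[2];

  let stateCode: string | null = null;
  let countryCode = 'US'; // default, as in the original code
  if (US_STATES[stateSlug]) {
    stateCode = US_STATES[stateSlug];
  } else if (CA_PROVINCES[stateSlug]) {
    stateCode = CA_PROVINCES[stateSlug];
    countryCode = 'CA';
  } else if (stateSlug.length === 2) {
    stateCode = stateSlug.toUpperCase();
    if (Object.values(CA_PROVINCES).includes(stateCode)) countryCode = 'CA';
  }
  return { citySlug, stateCode, countryCode };
}
```

Note that an unrecognized slug that is not two letters long yields `stateCode: null` with `countryCode: 'US'`, matching the behavior of the mapping code in the service.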

File diff suppressed because it is too large

@@ -1,390 +0,0 @@
/**
* DutchieCityDiscovery
*
* Discovers cities from Dutchie's /cities page and upserts to dutchie_discovery_cities.
*
* Responsibilities:
* - Fetch all cities available on Dutchie
* - For each city derive: city_name, city_slug, state_code, country_code
* - Upsert into dutchie_discovery_cities
*/
import { Pool } from 'pg';
import axios from 'axios';
import puppeteer from 'puppeteer-extra';
import StealthPlugin from 'puppeteer-extra-plugin-stealth';
import type { Browser, Page } from 'puppeteer';
puppeteer.use(StealthPlugin());
// ============================================================
// TYPES
// ============================================================
export interface DutchieCity {
name: string;
slug: string;
stateCode: string | null;
countryCode: string;
url?: string;
}
export interface CityDiscoveryResult {
citiesFound: number;
citiesInserted: number;
citiesUpdated: number;
errors: string[];
durationMs: number;
}
// ============================================================
// US STATE CODE MAPPING
// ============================================================
const US_STATE_MAP: Record<string, string> = {
'alabama': 'AL', 'alaska': 'AK', 'arizona': 'AZ', 'arkansas': 'AR',
'california': 'CA', 'colorado': 'CO', 'connecticut': 'CT', 'delaware': 'DE',
'florida': 'FL', 'georgia': 'GA', 'hawaii': 'HI', 'idaho': 'ID',
'illinois': 'IL', 'indiana': 'IN', 'iowa': 'IA', 'kansas': 'KS',
'kentucky': 'KY', 'louisiana': 'LA', 'maine': 'ME', 'maryland': 'MD',
'massachusetts': 'MA', 'michigan': 'MI', 'minnesota': 'MN', 'mississippi': 'MS',
'missouri': 'MO', 'montana': 'MT', 'nebraska': 'NE', 'nevada': 'NV',
'new-hampshire': 'NH', 'new-jersey': 'NJ', 'new-mexico': 'NM', 'new-york': 'NY',
'north-carolina': 'NC', 'north-dakota': 'ND', 'ohio': 'OH', 'oklahoma': 'OK',
'oregon': 'OR', 'pennsylvania': 'PA', 'rhode-island': 'RI', 'south-carolina': 'SC',
'south-dakota': 'SD', 'tennessee': 'TN', 'texas': 'TX', 'utah': 'UT',
'vermont': 'VT', 'virginia': 'VA', 'washington': 'WA', 'west-virginia': 'WV',
'wisconsin': 'WI', 'wyoming': 'WY', 'district-of-columbia': 'DC',
};
// Canadian province mapping
const CA_PROVINCE_MAP: Record<string, string> = {
'alberta': 'AB', 'british-columbia': 'BC', 'manitoba': 'MB',
'new-brunswick': 'NB', 'newfoundland-and-labrador': 'NL',
'northwest-territories': 'NT', 'nova-scotia': 'NS', 'nunavut': 'NU',
'ontario': 'ON', 'prince-edward-island': 'PE', 'quebec': 'QC',
'saskatchewan': 'SK', 'yukon': 'YT',
};
// ============================================================
// CITY FETCHING
// ============================================================
/**
* Fetch cities from Dutchie's /cities page using Puppeteer to extract data.
*/
async function fetchCitiesFromDutchie(): Promise<DutchieCity[]> {
console.log('[DutchieCityDiscovery] Launching browser to fetch cities...');
const browser = await puppeteer.launch({
headless: 'new',
args: ['--no-sandbox', '--disable-setuid-sandbox', '--disable-dev-shm-usage'],
});
try {
const page = await browser.newPage();
await page.setUserAgent(
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36'
);
// Navigate to cities page
console.log('[DutchieCityDiscovery] Navigating to https://dutchie.com/cities...');
await page.goto('https://dutchie.com/cities', {
waitUntil: 'networkidle2',
timeout: 60000,
});
// Wait for content to load
await new Promise((r) => setTimeout(r, 3000));
// Extract city links from the page
const cities = await page.evaluate(() => {
const cityLinks: Array<{
name: string;
slug: string;
url: string;
stateSlug: string | null;
}> = [];
// Find all city links - they typically follow pattern /city/{state}/{city}
const links = document.querySelectorAll('a[href*="/city/"]');
links.forEach((link) => {
const href = (link as HTMLAnchorElement).href;
const text = (link as HTMLElement).innerText?.trim();
// Parse URL: https://dutchie.com/city/{state}/{city}
const match = href.match(/\/city\/([^/]+)\/([^/?]+)/);
if (match && text) {
cityLinks.push({
name: text,
slug: match[2],
url: href,
stateSlug: match[1],
});
}
});
return cityLinks;
});
console.log(`[DutchieCityDiscovery] Extracted ${cities.length} city links from page`);
// Convert to DutchieCity format
const result: DutchieCity[] = [];
for (const city of cities) {
// Determine country and state code
let countryCode = 'US';
let stateCode: string | null = null;
if (city.stateSlug) {
// Check if it's a US state
if (US_STATE_MAP[city.stateSlug]) {
stateCode = US_STATE_MAP[city.stateSlug];
countryCode = 'US';
}
// Check if it's a Canadian province
else if (CA_PROVINCE_MAP[city.stateSlug]) {
stateCode = CA_PROVINCE_MAP[city.stateSlug];
countryCode = 'CA';
}
// Check if it's already a 2-letter code
else if (city.stateSlug.length === 2) {
stateCode = city.stateSlug.toUpperCase();
// Determine country based on state code
if (Object.values(CA_PROVINCE_MAP).includes(stateCode)) {
countryCode = 'CA';
}
}
}
result.push({
name: city.name,
slug: city.slug,
stateCode,
countryCode,
url: city.url,
});
}
return result;
} finally {
await browser.close();
}
}
/**
* Alternative: fetch cities with plain HTTP GET requests against candidate JSON endpoints.
* Used as a fallback when scraping yields no cities.
*/
async function fetchCitiesFromAPI(): Promise<DutchieCity[]> {
console.log('[DutchieCityDiscovery] Attempting API-based city discovery...');
// These endpoint URLs are unverified guesses at where a cities API might live.
// Each is tried in order; a failure is logged and the next one is attempted.
const apiEndpoints = [
'https://dutchie.com/api/cities',
'https://api.dutchie.com/v1/cities',
];
for (const endpoint of apiEndpoints) {
try {
const response = await axios.get(endpoint, {
headers: {
'User-Agent':
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 Chrome/120.0.0.0',
Accept: 'application/json',
},
timeout: 15000,
});
if (response.data && Array.isArray(response.data)) {
console.log(`[DutchieCityDiscovery] API returned ${response.data.length} cities`);
return response.data.map((c: any) => ({
name: c.name || c.city,
slug: c.slug || c.citySlug,
stateCode: c.stateCode || c.state,
countryCode: c.countryCode || c.country || 'US',
}));
}
} catch (error: any) {
console.log(`[DutchieCityDiscovery] API ${endpoint} failed: ${error.message}`);
}
}
return [];
}
// ============================================================
// DATABASE OPERATIONS
// ============================================================
/**
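* Upsert a city into dutchie_discovery_cities.
*
* RETURNING (xmax = 0) reports whether the row was newly inserted: xmax is 0 for a
* fresh insert and non-zero when ON CONFLICT updated an existing row. This relies on
* PostgreSQL system-column behavior rather than a documented guarantee.
*
* Note: with an ordinary unique index, rows whose state_code is NULL never conflict
* with each other, so such cities can be inserted again on every run unless the
* index is declared NULLS NOT DISTINCT.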
*/
async function upsertCity(
pool: Pool,
city: DutchieCity
): Promise<{ inserted: boolean; updated: boolean }> {
const result = await pool.query(
`
INSERT INTO dutchie_discovery_cities (
platform,
city_name,
city_slug,
state_code,
country_code,
last_crawled_at,
updated_at
) VALUES (
'dutchie',
$1,
$2,
$3,
$4,
NOW(),
NOW()
)
ON CONFLICT (platform, country_code, state_code, city_slug)
DO UPDATE SET
city_name = EXCLUDED.city_name,
last_crawled_at = NOW(),
updated_at = NOW()
RETURNING (xmax = 0) AS inserted
`,
[city.name, city.slug, city.stateCode, city.countryCode]
);
const inserted = result.rows[0]?.inserted === true;
return { inserted, updated: !inserted };
}
// ============================================================
// MAIN DISCOVERY FUNCTION
// ============================================================
export class DutchieCityDiscovery {
private pool: Pool;
constructor(pool: Pool) {
this.pool = pool;
}
/**
* Run the city discovery process
*/
async run(): Promise<CityDiscoveryResult> {
const startTime = Date.now();
const errors: string[] = [];
let citiesFound = 0;
let citiesInserted = 0;
let citiesUpdated = 0;
console.log('[DutchieCityDiscovery] Starting city discovery...');
try {
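// Try scraping first; fall back to the API if scraping throws or returns nothing
let cities: DutchieCity[] = [];
try {
cities = await fetchCitiesFromDutchie();
} catch (error: any) {
const msg = `Scraping failed: ${error.message}`;
console.error(`[DutchieCityDiscovery] ${msg}`);
errors.push(msg);
}
if (cities.length === 0) {
console.log('[DutchieCityDiscovery] Scraping returned 0 cities, trying API...');
cities = await fetchCitiesFromAPI();
}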
citiesFound = cities.length;
console.log(`[DutchieCityDiscovery] Found ${citiesFound} cities`);
// Upsert each city
for (const city of cities) {
try {
const result = await upsertCity(this.pool, city);
if (result.inserted) {
citiesInserted++;
} else if (result.updated) {
citiesUpdated++;
}
} catch (error: any) {
const msg = `Failed to upsert city ${city.slug}: ${error.message}`;
console.error(`[DutchieCityDiscovery] ${msg}`);
errors.push(msg);
}
}
} catch (error: any) {
const msg = `City discovery failed: ${error.message}`;
console.error(`[DutchieCityDiscovery] ${msg}`);
errors.push(msg);
}
const durationMs = Date.now() - startTime;
console.log('[DutchieCityDiscovery] Discovery complete:');
console.log(` Cities found: ${citiesFound}`);
console.log(` Inserted: ${citiesInserted}`);
console.log(` Updated: ${citiesUpdated}`);
console.log(` Errors: ${errors.length}`);
console.log(` Duration: ${(durationMs / 1000).toFixed(1)}s`);
return {
citiesFound,
citiesInserted,
citiesUpdated,
errors,
durationMs,
};
}
/**
* Get statistics about discovered cities
*/
async getStats(): Promise<{
total: number;
byCountry: Array<{ countryCode: string; count: number }>;
byState: Array<{ stateCode: string; countryCode: string; count: number }>;
crawlEnabled: number;
neverCrawled: number;
}> {
const [totalRes, byCountryRes, byStateRes, enabledRes, neverRes] = await Promise.all([
this.pool.query('SELECT COUNT(*) as cnt FROM dutchie_discovery_cities'),
this.pool.query(`
SELECT country_code, COUNT(*) as cnt
FROM dutchie_discovery_cities
GROUP BY country_code
ORDER BY cnt DESC
`),
this.pool.query(`
SELECT state_code, country_code, COUNT(*) as cnt
FROM dutchie_discovery_cities
WHERE state_code IS NOT NULL
GROUP BY state_code, country_code
ORDER BY cnt DESC
`),
this.pool.query(`
SELECT COUNT(*) as cnt
FROM dutchie_discovery_cities
WHERE crawl_enabled = TRUE
`),
this.pool.query(`
SELECT COUNT(*) as cnt
FROM dutchie_discovery_cities
WHERE last_crawled_at IS NULL
`),
]);
return {
total: parseInt(totalRes.rows[0]?.cnt || '0', 10),
byCountry: byCountryRes.rows.map((r) => ({
countryCode: r.country_code,
count: parseInt(r.cnt, 10),
})),
byState: byStateRes.rows.map((r) => ({
stateCode: r.state_code,
countryCode: r.country_code,
count: parseInt(r.cnt, 10),
})),
crawlEnabled: parseInt(enabledRes.rows[0]?.cnt || '0', 10),
neverCrawled: parseInt(neverRes.rows[0]?.cnt || '0', 10),
};
}
}
export default DutchieCityDiscovery;
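
The state-slug handling inside fetchCitiesFromDutchie can be pulled out into a pure function, which makes the branch order testable without launching a browser. The following is a minimal sketch, not the author's code: `resolveRegion`, `US_STATES`, and `CA_PROVINCES` are hypothetical names, and the two lookup tables are tiny stand-ins because the real US_STATE_MAP and CA_PROVINCE_MAP are not shown in this diff.

```typescript
// Stand-in lookup tables. The real US_STATE_MAP / CA_PROVINCE_MAP are defined
// elsewhere, so these entries are illustrative assumptions only.
const US_STATES: Record<string, string> = { arizona: 'AZ', 'new-york': 'NY' };
const CA_PROVINCES: Record<string, string> = { ontario: 'ON', 'british-columbia': 'BC' };

interface Region {
  stateCode: string | null;
  countryCode: string;
}

// Hypothetical helper mirroring the branch order used in fetchCitiesFromDutchie:
// full US state slug, then full Canadian province slug, then a bare 2-letter code.
function resolveRegion(stateSlug: string | null): Region {
  if (!stateSlug) return { stateCode: null, countryCode: 'US' };
  if (US_STATES[stateSlug]) return { stateCode: US_STATES[stateSlug], countryCode: 'US' };
  if (CA_PROVINCES[stateSlug]) return { stateCode: CA_PROVINCES[stateSlug], countryCode: 'CA' };
  if (stateSlug.length === 2) {
    const code = stateSlug.toUpperCase();
    // A 2-letter code counts as Canadian only if it appears among the province codes.
    const isCanadian = Object.values(CA_PROVINCES).includes(code);
    return { stateCode: code, countryCode: isCanadian ? 'CA' : 'US' };
  }
  // Unrecognized slug: keep the default country and leave the state unset.
  return { stateCode: null, countryCode: 'US' };
}
```

With these stand-in tables, `resolveRegion('bc')` resolves to British Columbia in Canada, while an unknown two-letter code such as `'tx'` falls through to the US default.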


@@ -1,639 +0,0 @@
/**
* DutchieLocationDiscovery
*
* Discovers store locations for each city from Dutchie and upserts to dutchie_discovery_locations.
*
* Responsibilities:
* - Given a dutchie_discovery_cities row, call Dutchie's location/search endpoint
* - For each store: extract platform_location_id, platform_slug, platform_menu_url, name, address, coords
* - Upsert into dutchie_discovery_locations
* - DO NOT overwrite status if already verified/merged/rejected
* - DO NOT overwrite dispensary_id if already set
*/
import { Pool } from 'pg';
import axios from 'axios';
import puppeteer from 'puppeteer-extra';
import StealthPlugin from 'puppeteer-extra-plugin-stealth';
puppeteer.use(StealthPlugin());
// ============================================================
// TYPES
// ============================================================
export interface DiscoveryCity {
id: number;
platform: string;
cityName: string;
citySlug: string;
stateCode: string | null;
countryCode: string;
crawlEnabled: boolean;
}
export interface DutchieLocation {
platformLocationId: string;
platformSlug: string;
platformMenuUrl: string;
name: string;
rawAddress: string | null;
addressLine1: string | null;
addressLine2: string | null;
city: string | null;
stateCode: string | null;
postalCode: string | null;
countryCode: string | null;
latitude: number | null;
longitude: number | null;
timezone: string | null;
offersDelivery: boolean | null;
offersPickup: boolean | null;
isRecreational: boolean | null;
isMedical: boolean | null;
metadata: Record<string, any>;
}
export interface LocationDiscoveryResult {
cityId: number;
citySlug: string;
locationsFound: number;
locationsInserted: number;
locationsUpdated: number;
locationsSkipped: number;
errors: string[];
durationMs: number;
}
// ============================================================
// LOCATION FETCHING
// ============================================================
/**
* Fetch locations for a city using Puppeteer to scrape the city page
*/
async function fetchLocationsForCity(city: DiscoveryCity): Promise<DutchieLocation[]> {
console.log(`[DutchieLocationDiscovery] Fetching locations for ${city.cityName}, ${city.stateCode}...`);
const browser = await puppeteer.launch({
headless: 'new',
args: ['--no-sandbox', '--disable-setuid-sandbox', '--disable-dev-shm-usage'],
});
try {
const page = await browser.newPage();
await page.setUserAgent(
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36'
);
// Navigate to city page - use /us/dispensaries/{city_slug} pattern
const cityUrl = `https://dutchie.com/us/dispensaries/${city.citySlug}`;
console.log(`[DutchieLocationDiscovery] Navigating to ${cityUrl}...`);
await page.goto(cityUrl, {
waitUntil: 'networkidle2',
timeout: 60000,
});
// Wait for content
await new Promise((r) => setTimeout(r, 3000));
// Try to extract __NEXT_DATA__ which often contains store data
const nextData = await page.evaluate(() => {
const script = document.querySelector('script#__NEXT_DATA__');
if (script) {
try {
return JSON.parse(script.textContent || '{}');
} catch {
return null;
}
}
return null;
});
let locations: DutchieLocation[] = [];
if (nextData?.props?.pageProps?.dispensaries) {
// Extract from Next.js data
const dispensaries = nextData.props.pageProps.dispensaries;
console.log(`[DutchieLocationDiscovery] Found ${dispensaries.length} dispensaries in __NEXT_DATA__`);
locations = dispensaries.map((d: any) => parseDispensaryData(d, city));
} else {
// Fall back to DOM scraping
console.log('[DutchieLocationDiscovery] No __NEXT_DATA__, trying DOM scraping...');
const scrapedData = await page.evaluate(() => {
const stores: Array<{
name: string;
href: string;
address: string | null;
}> = [];
// Look for dispensary cards/links
const cards = document.querySelectorAll('[data-testid="dispensary-card"], .dispensary-card, a[href*="/dispensary/"]');
cards.forEach((card) => {
const link = card.querySelector('a[href*="/dispensary/"]') || (card as HTMLAnchorElement);
const href = (link as HTMLAnchorElement).href || '';
const name =
card.querySelector('[data-testid="dispensary-name"]')?.textContent ||
card.querySelector('h2, h3, .name')?.textContent ||
link.textContent ||
'';
const address = card.querySelector('[data-testid="dispensary-address"], .address')?.textContent || null;
if (href && name) {
stores.push({
name: name.trim(),
href,
address: address?.trim() || null,
});
}
});
return stores;
});
console.log(`[DutchieLocationDiscovery] DOM scraping found ${scrapedData.length} stores`);
locations = scrapedData.map((s) => {
// Parse slug from URL
const match = s.href.match(/\/dispensary\/([^/?]+)/);
const slug = match ? match[1] : s.name.toLowerCase().replace(/\s+/g, '-');
return {
platformLocationId: slug, // Placeholder: the slug stands in until the real platform ID is resolved
platformSlug: slug,
platformMenuUrl: `https://dutchie.com/dispensary/${slug}`,
name: s.name,
rawAddress: s.address,
addressLine1: null,
addressLine2: null,
city: city.cityName,
stateCode: city.stateCode,
postalCode: null,
countryCode: city.countryCode,
latitude: null,
longitude: null,
timezone: null,
offersDelivery: null,
offersPickup: null,
isRecreational: null,
isMedical: null,
metadata: { source: 'dom_scrape', originalUrl: s.href },
};
});
}
return locations;
} finally {
await browser.close();
}
}
/**
* Parse dispensary data from Dutchie's API/JSON response
*/
function parseDispensaryData(d: any, city: DiscoveryCity): DutchieLocation {
const id = d.id || d._id || d.dispensaryId || '';
const slug = d.slug || d.cName || d.name?.toLowerCase().replace(/\s+/g, '-') || '';
// Build menu URL
let menuUrl = `https://dutchie.com/dispensary/${slug}`;
if (d.menuUrl) {
menuUrl = d.menuUrl;
} else if (d.embeddedMenuUrl) {
menuUrl = d.embeddedMenuUrl;
}
// Parse address
const address = d.address || d.location?.address || {};
const rawAddress = [
address.line1 || address.street1 || d.address1,
address.line2 || address.street2 || d.address2,
[
address.city || d.city,
address.state || address.stateCode || d.state,
address.zip || address.zipCode || address.postalCode || d.zip,
]
.filter(Boolean)
.join(' '),
]
.filter(Boolean)
.join(', ');
return {
platformLocationId: id,
platformSlug: slug,
platformMenuUrl: menuUrl,
name: d.name || d.dispensaryName || '',
rawAddress: rawAddress || null,
addressLine1: address.line1 || address.street1 || d.address1 || null,
addressLine2: address.line2 || address.street2 || d.address2 || null,
city: address.city || d.city || city.cityName,
stateCode: address.state || address.stateCode || d.state || city.stateCode,
postalCode: address.zip || address.zipCode || address.postalCode || d.zip || null,
countryCode: address.country || address.countryCode || d.country || city.countryCode,
latitude: d.latitude ?? d.location?.latitude ?? d.location?.lat ?? null,
longitude: d.longitude ?? d.location?.longitude ?? d.location?.lng ?? null,
timezone: d.timezone || d.timeZone || null,
offersDelivery: d.offerDelivery ?? d.offersDelivery ?? d.delivery ?? null,
offersPickup: d.offerPickup ?? d.offersPickup ?? d.pickup ?? null,
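// Return null (unknown) rather than false when retailType is absent, so a later
// COALESCE in the UPDATE does not overwrite a known value with a guessed false.
isRecreational: d.isRecreational ?? d.recreational ?? (d.retailType ? d.retailType === 'recreational' || d.retailType === 'both' : null),
isMedical: d.isMedical ?? d.medical ?? (d.retailType ? d.retailType === 'medical' || d.retailType === 'both' : null),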
metadata: {
source: 'next_data',
retailType: d.retailType,
brand: d.brand,
logo: d.logo || d.logoUrl,
raw: d,
},
};
}
/**
* Placeholder for GraphQL-based location discovery.
* Not implemented: always returns an empty list.
*/
async function fetchLocationsViaGraphQL(city: DiscoveryCity): Promise<DutchieLocation[]> {
console.log(`[DutchieLocationDiscovery] Trying GraphQL for ${city.cityName}...`);
// A geo-based search would need the city's coordinates, which are not available here.
// Until that exists, return an empty list and rely on page scraping.
return [];
}
// ============================================================
// DATABASE OPERATIONS
// ============================================================
/**
* Upsert a location into dutchie_discovery_locations
* Does NOT overwrite status if already verified/merged/rejected
* Does NOT overwrite dispensary_id if already set
*/
async function upsertLocation(
pool: Pool,
location: DutchieLocation,
cityId: number
): Promise<{ inserted: boolean; updated: boolean; skipped: boolean }> {
// First check if this location exists and has a protected status
const existing = await pool.query(
`
SELECT id, status, dispensary_id
FROM dutchie_discovery_locations
WHERE platform = 'dutchie' AND platform_location_id = $1
`,
[location.platformLocationId]
);
if (existing.rows.length > 0) {
const row = existing.rows[0];
const protectedStatuses = ['verified', 'merged', 'rejected'];
if (protectedStatuses.includes(row.status)) {
// Only update last_seen_at for protected statuses
await pool.query(
`
UPDATE dutchie_discovery_locations
SET last_seen_at = NOW(), updated_at = NOW()
WHERE id = $1
`,
[row.id]
);
return { inserted: false, updated: false, skipped: true };
}
// Update existing discovered location (but preserve dispensary_id if set)
await pool.query(
`
UPDATE dutchie_discovery_locations
SET
platform_slug = $2,
platform_menu_url = $3,
name = $4,
raw_address = COALESCE($5, raw_address),
address_line1 = COALESCE($6, address_line1),
address_line2 = COALESCE($7, address_line2),
city = COALESCE($8, city),
state_code = COALESCE($9, state_code),
postal_code = COALESCE($10, postal_code),
country_code = COALESCE($11, country_code),
latitude = COALESCE($12, latitude),
longitude = COALESCE($13, longitude),
timezone = COALESCE($14, timezone),
offers_delivery = COALESCE($15, offers_delivery),
offers_pickup = COALESCE($16, offers_pickup),
is_recreational = COALESCE($17, is_recreational),
is_medical = COALESCE($18, is_medical),
metadata = COALESCE($19, metadata),
discovery_city_id = $20,
last_seen_at = NOW(),
updated_at = NOW()
WHERE id = $1
`,
[
row.id,
location.platformSlug,
location.platformMenuUrl,
location.name,
location.rawAddress,
location.addressLine1,
location.addressLine2,
location.city,
location.stateCode,
location.postalCode,
location.countryCode,
location.latitude,
location.longitude,
location.timezone,
location.offersDelivery,
location.offersPickup,
location.isRecreational,
location.isMedical,
JSON.stringify(location.metadata),
cityId,
]
);
return { inserted: false, updated: true, skipped: false };
}
// Insert new location
await pool.query(
`
INSERT INTO dutchie_discovery_locations (
platform,
platform_location_id,
platform_slug,
platform_menu_url,
name,
raw_address,
address_line1,
address_line2,
city,
state_code,
postal_code,
country_code,
latitude,
longitude,
timezone,
status,
offers_delivery,
offers_pickup,
is_recreational,
is_medical,
metadata,
discovery_city_id,
first_seen_at,
last_seen_at,
active,
created_at,
updated_at
) VALUES (
'dutchie',
$1, $2, $3, $4, $5, $6, $7, $8, $9, $10, $11, $12, $13, $14,
'discovered',
$15, $16, $17, $18, $19, $20,
NOW(), NOW(), TRUE, NOW(), NOW()
)
`,
[
location.platformLocationId,
location.platformSlug,
location.platformMenuUrl,
location.name,
location.rawAddress,
location.addressLine1,
location.addressLine2,
location.city,
location.stateCode,
location.postalCode,
location.countryCode,
location.latitude,
location.longitude,
location.timezone,
location.offersDelivery,
location.offersPickup,
location.isRecreational,
location.isMedical,
JSON.stringify(location.metadata),
cityId,
]
);
return { inserted: true, updated: false, skipped: false };
}
// ============================================================
// MAIN DISCOVERY CLASS
// ============================================================
export class DutchieLocationDiscovery {
private pool: Pool;
constructor(pool: Pool) {
this.pool = pool;
}
/**
* Get a city by slug
*/
async getCityBySlug(citySlug: string): Promise<DiscoveryCity | null> {
const { rows } = await this.pool.query(
`
SELECT id, platform, city_name, city_slug, state_code, country_code, crawl_enabled
FROM dutchie_discovery_cities
WHERE platform = 'dutchie' AND city_slug = $1
LIMIT 1
`,
[citySlug]
);
if (rows.length === 0) return null;
const r = rows[0];
return {
id: r.id,
platform: r.platform,
cityName: r.city_name,
citySlug: r.city_slug,
stateCode: r.state_code,
countryCode: r.country_code,
crawlEnabled: r.crawl_enabled,
};
}
/**
* Get all crawl-enabled cities
*/
async getEnabledCities(limit?: number): Promise<DiscoveryCity[]> {
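// Pass the limit as a bind parameter instead of interpolating it into the SQL text.
// PostgreSQL treats LIMIT NULL as "no limit", so a missing or zero limit returns all rows.
const { rows } = await this.pool.query(
`
SELECT id, platform, city_name, city_slug, state_code, country_code, crawl_enabled
FROM dutchie_discovery_cities
WHERE platform = 'dutchie' AND crawl_enabled = TRUE
ORDER BY last_crawled_at ASC NULLS FIRST, city_name ASC
LIMIT $1
`,
[limit && limit > 0 ? limit : null]
);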
return rows.map((r) => ({
id: r.id,
platform: r.platform,
cityName: r.city_name,
citySlug: r.city_slug,
stateCode: r.state_code,
countryCode: r.country_code,
crawlEnabled: r.crawl_enabled,
}));
}
/**
* Discover locations for a single city
*/
async discoverForCity(city: DiscoveryCity): Promise<LocationDiscoveryResult> {
const startTime = Date.now();
const errors: string[] = [];
let locationsFound = 0;
let locationsInserted = 0;
let locationsUpdated = 0;
let locationsSkipped = 0;
console.log(`[DutchieLocationDiscovery] Discovering locations for ${city.cityName}, ${city.stateCode}...`);
try {
// Fetch locations
let locations = await fetchLocationsForCity(city);
// If scraping returns no locations, try GraphQL (currently a stub that returns [])
if (locations.length === 0) {
locations = await fetchLocationsViaGraphQL(city);
}
locationsFound = locations.length;
console.log(`[DutchieLocationDiscovery] Found ${locationsFound} locations`);
// Upsert each location
for (const location of locations) {
try {
const result = await upsertLocation(this.pool, location, city.id);
if (result.inserted) locationsInserted++;
else if (result.updated) locationsUpdated++;
else if (result.skipped) locationsSkipped++;
} catch (error: any) {
const msg = `Failed to upsert location ${location.platformSlug}: ${error.message}`;
console.error(`[DutchieLocationDiscovery] ${msg}`);
errors.push(msg);
}
}
// Update city's last_crawled_at and location_count
await this.pool.query(
`
UPDATE dutchie_discovery_cities
SET last_crawled_at = NOW(),
location_count = $1,
updated_at = NOW()
WHERE id = $2
`,
[locationsFound, city.id]
);
} catch (error: any) {
const msg = `Location discovery failed for ${city.citySlug}: ${error.message}`;
console.error(`[DutchieLocationDiscovery] ${msg}`);
errors.push(msg);
}
const durationMs = Date.now() - startTime;
console.log(`[DutchieLocationDiscovery] City ${city.citySlug} complete:`);
console.log(` Locations found: ${locationsFound}`);
console.log(` Inserted: ${locationsInserted}`);
console.log(` Updated: ${locationsUpdated}`);
console.log(` Skipped (protected): ${locationsSkipped}`);
console.log(` Errors: ${errors.length}`);
console.log(` Duration: ${(durationMs / 1000).toFixed(1)}s`);
return {
cityId: city.id,
citySlug: city.citySlug,
locationsFound,
locationsInserted,
locationsUpdated,
locationsSkipped,
errors,
durationMs,
};
}
/**
* Discover locations for all enabled cities
*/
async discoverAllEnabled(options: {
limit?: number;
delayMs?: number;
} = {}): Promise<{
totalCities: number;
totalLocationsFound: number;
totalInserted: number;
totalUpdated: number;
totalSkipped: number;
errors: string[];
durationMs: number;
}> {
const { limit, delayMs = 2000 } = options;
const startTime = Date.now();
let totalLocationsFound = 0;
let totalInserted = 0;
let totalUpdated = 0;
let totalSkipped = 0;
const allErrors: string[] = [];
const cities = await this.getEnabledCities(limit);
console.log(`[DutchieLocationDiscovery] Discovering locations for ${cities.length} cities...`);
for (let i = 0; i < cities.length; i++) {
const city = cities[i];
console.log(`\n[DutchieLocationDiscovery] City ${i + 1}/${cities.length}: ${city.cityName}, ${city.stateCode}`);
try {
const result = await this.discoverForCity(city);
totalLocationsFound += result.locationsFound;
totalInserted += result.locationsInserted;
totalUpdated += result.locationsUpdated;
totalSkipped += result.locationsSkipped;
allErrors.push(...result.errors);
} catch (error: any) {
allErrors.push(`City ${city.citySlug} failed: ${error.message}`);
}
// Delay between cities
if (i < cities.length - 1 && delayMs > 0) {
await new Promise((r) => setTimeout(r, delayMs));
}
}
const durationMs = Date.now() - startTime;
console.log('\n[DutchieLocationDiscovery] All cities complete:');
console.log(` Total cities: ${cities.length}`);
console.log(` Total locations found: ${totalLocationsFound}`);
console.log(` Total inserted: ${totalInserted}`);
console.log(` Total updated: ${totalUpdated}`);
console.log(` Total skipped: ${totalSkipped}`);
console.log(` Total errors: ${allErrors.length}`);
console.log(` Duration: ${(durationMs / 1000).toFixed(1)}s`);
return {
totalCities: cities.length,
totalLocationsFound,
totalInserted,
totalUpdated,
totalSkipped,
errors: allErrors,
durationMs,
};
}
}
export default DutchieLocationDiscovery;
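
The status-protection rule in upsertLocation reduces to a three-way decision that can be expressed without any database access. The following is a rough sketch under stated assumptions: `decideUpsertAction` and `UpsertAction` are hypothetical names introduced for illustration, and only the list of protected statuses is taken from the code above.

```typescript
type UpsertAction = 'insert' | 'update' | 'touch_only';

// Statuses that upsertLocation treats as protected.
const PROTECTED_STATUSES = ['verified', 'merged', 'rejected'];

// Hypothetical helper. `existingStatus` is undefined when no row matched
// (platform, platform_location_id); otherwise it is that row's status column.
function decideUpsertAction(existingStatus: string | undefined): UpsertAction {
  // No existing row: a new record is inserted with status 'discovered'.
  if (existingStatus === undefined) return 'insert';
  // Protected row: only last_seen_at / updated_at are refreshed.
  if (PROTECTED_STATUSES.includes(existingStatus)) return 'touch_only';
  // Any other status: the discovered fields are refreshed in place.
  return 'update';
}
```

Keeping the decision separate from the queries would also make it easier to move the lookup and the write into a single statement or transaction, since the current check-then-write sequence is not atomic.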


@@ -1,73 +0,0 @@
#!/usr/bin/env npx tsx
/**
* Discovery Entrypoint: Dutchie Cities (Auto)
*
* Attempts browser/API-based /cities discovery.
* Even if currently blocked (403), this runner preserves the auto-discovery path.
*
* Usage:
* npm run discovery:dt:cities:auto
* DATABASE_URL="..." npx tsx src/dutchie-az/discovery/discovery-dt-cities-auto.ts
*/
import { Pool } from 'pg';
import { DtCityDiscoveryService } from './DtCityDiscoveryService';
const DB_URL = process.env.DATABASE_URL || process.env.CANNAIQ_DB_URL ||
'postgresql://dutchie:dutchie_local_pass@localhost:54320/dutchie_menus';
async function main() {
console.log('╔══════════════════════════════════════════════════╗');
console.log('║ Dutchie City Discovery (AUTO) ║');
console.log('║ Browser + API fallback ║');
console.log('╚══════════════════════════════════════════════════╝');
console.log(`\nDatabase: ${DB_URL.replace(/:[^:@]+@/, ':****@')}`);
const pool = new Pool({ connectionString: DB_URL });
try {
const { rows } = await pool.query('SELECT NOW() as time');
console.log(`Connected at: ${rows[0].time}\n`);
const service = new DtCityDiscoveryService(pool);
const result = await service.runAutoDiscovery();
console.log('\n' + '═'.repeat(50));
console.log('SUMMARY');
console.log('═'.repeat(50));
console.log(`Cities found: ${result.citiesFound}`);
console.log(`Cities inserted: ${result.citiesInserted}`);
console.log(`Cities updated: ${result.citiesUpdated}`);
console.log(`Errors: ${result.errors.length}`);
console.log(`Duration: ${(result.durationMs / 1000).toFixed(1)}s`);
if (result.errors.length > 0) {
console.log('\nErrors:');
result.errors.forEach((e, i) => console.log(` ${i + 1}. ${e}`));
}
const stats = await service.getStats();
console.log('\nCurrent Database Stats:');
console.log(` Total cities: ${stats.total}`);
console.log(` Crawl enabled: ${stats.crawlEnabled}`);
console.log(` Never crawled: ${stats.neverCrawled}`);
if (result.citiesFound === 0) {
console.log('\n⚠ No cities found via auto-discovery.');
console.log(' This may be due to Dutchie blocking scraping/API access.');
console.log(' Use manual seeding instead:');
console.log(' npm run discovery:dt:cities:manual -- --city-slug=ny-hudson --city-name=Hudson --state-code=NY');
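// Set the exit code and return so the `finally` block can close the pool;
// process.exit() would terminate before `finally` runs.
process.exitCode = 1;
return;
}
console.log('\n✅ Auto city discovery completed');
} catch (error: any) {
console.error('\n❌ Auto city discovery failed:', error.message);
process.exitCode = 1;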
} finally {
await pool.end();
}
}
main();
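
The runner scripts mask the database password before logging the connection string. The expression they use is shown below as a standalone function so its behavior is easier to see; `maskDbUrl` is a hypothetical name, and the URLs in the usage note are made-up examples.

```typescript
// Replace the password segment of a connection string with asterisks.
// Same regular expression as the runner scripts: the first ":<text>@" run whose
// text contains neither ":" nor "@" is treated as the password.
function maskDbUrl(url: string): string {
  return url.replace(/:[^:@]+@/, ':****@');
}
```

For `postgresql://user:secret@localhost:5432/db` this yields `postgresql://user:****@localhost:5432/db`, and a URL without credentials is returned unchanged. One limitation worth knowing: a password that itself contains `@` or `:` is only partly masked, so parsing the URL and replacing its password field would be a more robust choice for anything beyond local logging.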


@@ -1,137 +0,0 @@
#!/usr/bin/env npx tsx
/**
* Discovery Entrypoint: Dutchie Cities (Manual Seed)
*
* Manually seeds cities into dutchie_discovery_cities via CLI args.
* Use this when auto-discovery is blocked (403).
*
* Usage:
* npm run discovery:dt:cities:manual -- --city-slug=ny-hudson --city-name=Hudson --state-code=NY
* npm run discovery:dt:cities:manual -- --city-slug=ma-boston --city-name=Boston --state-code=MA --country-code=US
*
* Options:
* --city-slug Required. URL slug (e.g., "ny-hudson")
* --city-name Required. Display name (e.g., "Hudson")
* --state-code Required. State/province code (e.g., "NY", "CA", "ON")
* --country-code Optional. Country code (default: "US")
*
* After seeding, run location discovery:
* npm run discovery:dt:locations
*/
import { Pool } from 'pg';
import { DtCityDiscoveryService, DutchieCity } from './DtCityDiscoveryService';
const DB_URL = process.env.DATABASE_URL || process.env.CANNAIQ_DB_URL ||
'postgresql://dutchie:dutchie_local_pass@localhost:54320/dutchie_menus';
interface Args {
citySlug?: string;
cityName?: string;
stateCode?: string;
countryCode: string;
}
function parseArgs(): Args {
const args: Args = { countryCode: 'US' };
for (const arg of process.argv.slice(2)) {
const citySlugMatch = arg.match(/--city-slug=(.+)/);
if (citySlugMatch) args.citySlug = citySlugMatch[1];
const cityNameMatch = arg.match(/--city-name=(.+)/);
if (cityNameMatch) args.cityName = cityNameMatch[1];
const stateCodeMatch = arg.match(/--state-code=(.+)/);
if (stateCodeMatch) args.stateCode = stateCodeMatch[1].toUpperCase();
const countryCodeMatch = arg.match(/--country-code=(.+)/);
if (countryCodeMatch) args.countryCode = countryCodeMatch[1].toUpperCase();
}
return args;
}
function printUsage() {
console.log(`
Usage:
npm run discovery:dt:cities:manual -- --city-slug=<slug> --city-name=<name> --state-code=<state>
Required arguments:
--city-slug URL slug for the city (e.g., "ny-hudson", "ma-boston")
--city-name Display name (e.g., "Hudson", "Boston")
--state-code State/province code (e.g., "NY", "CA", "ON")
Optional arguments:
--country-code Country code (default: "US")
Examples:
npm run discovery:dt:cities:manual -- --city-slug=ny-hudson --city-name=Hudson --state-code=NY
npm run discovery:dt:cities:manual -- --city-slug=ca-los-angeles --city-name="Los Angeles" --state-code=CA
npm run discovery:dt:cities:manual -- --city-slug=on-toronto --city-name=Toronto --state-code=ON --country-code=CA
After seeding, run location discovery:
npm run discovery:dt:locations
`);
}
async function main() {
const args = parseArgs();
console.log('╔══════════════════════════════════════════════════╗');
console.log('║ Dutchie City Discovery (MANUAL SEED) ║');
console.log('╚══════════════════════════════════════════════════╝');
if (!args.citySlug || !args.cityName || !args.stateCode) {
console.error('\n❌ Error: Missing required arguments\n');
printUsage();
process.exit(1);
}
console.log(`\nCity Slug: ${args.citySlug}`);
console.log(`City Name: ${args.cityName}`);
console.log(`State Code: ${args.stateCode}`);
console.log(`Country Code: ${args.countryCode}`);
console.log(`Database: ${DB_URL.replace(/:[^:@]+@/, ':****@')}`);
const pool = new Pool({ connectionString: DB_URL });
try {
const { rows } = await pool.query('SELECT NOW() as time');
console.log(`\nConnected at: ${rows[0].time}`);
const service = new DtCityDiscoveryService(pool);
const city: DutchieCity = {
slug: args.citySlug,
name: args.cityName,
stateCode: args.stateCode,
countryCode: args.countryCode,
};
const result = await service.seedCity(city);
const action = result.wasInserted ? 'INSERTED' : 'UPDATED';
console.log(`\n✅ City ${action}:`);
console.log(` ID: ${result.id}`);
console.log(` City Slug: ${result.city.slug}`);
console.log(` City Name: ${result.city.name}`);
console.log(` State Code: ${result.city.stateCode}`);
console.log(` Country Code: ${result.city.countryCode}`);
const stats = await service.getStats();
console.log(`\nTotal Dutchie cities: ${stats.total} (${stats.crawlEnabled} enabled)`);
console.log('\n📍 Next step: Run location discovery');
console.log(' npm run discovery:dt:locations');
} catch (error: any) {
console.error('\n❌ Failed to seed city:', error.message);
// Set the exit code instead of calling process.exit() so `finally` can close the pool.
process.exitCode = 1;
} finally {
await pool.end();
}
}
main();
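
The `--key=value` parsing in the manual seed script can be written so that it takes the argument list as a parameter, which makes it testable without touching `process.argv`. This is a sketch of a slightly stricter variant, not the script's own implementation: `parseSeedArgs` and `SeedArgs` are hypothetical names, and unlike the original regular expressions it only accepts flags that start with `--` at the beginning of the argument.

```typescript
interface SeedArgs {
  citySlug?: string;
  cityName?: string;
  stateCode?: string;
  countryCode: string;
}

// Hypothetical helper: parse "--key=value" flags from an explicit argument list.
function parseSeedArgs(argv: string[]): SeedArgs {
  const args: SeedArgs = { countryCode: 'US' }; // same default as the script
  for (const arg of argv) {
    const eq = arg.indexOf('=');
    if (!arg.startsWith('--') || eq === -1) continue; // not a --key=value flag
    const key = arg.slice(2, eq);
    const value = arg.slice(eq + 1);
    if (value === '') continue; // ignore empty values such as "--city-slug="
    switch (key) {
      case 'city-slug':
        args.citySlug = value;
        break;
      case 'city-name':
        args.cityName = value;
        break;
      case 'state-code':
        args.stateCode = value.toUpperCase();
        break;
      case 'country-code':
        args.countryCode = value.toUpperCase();
        break;
    }
  }
  return args;
}
```

A caller would pass `process.argv.slice(2)` and then check that `citySlug`, `cityName`, and `stateCode` are all present before connecting to the database, as the script already does.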


@@ -1,73 +0,0 @@
#!/usr/bin/env npx tsx
/**
* Discovery Runner: Dutchie Cities
*
* Discovers cities from Dutchie's /cities page and upserts to dutchie_discovery_cities.
*
* Usage:
* npm run discovery:platforms:dt:cities
* DATABASE_URL="..." npx tsx src/dutchie-az/discovery/discovery-dt-cities.ts
*/
import { Pool } from 'pg';
import { DutchieCityDiscovery } from './DutchieCityDiscovery';
const DB_URL = process.env.DATABASE_URL || process.env.CANNAIQ_DB_URL ||
'postgresql://dutchie:dutchie_local_pass@localhost:54320/dutchie_menus';
async function main() {
console.log('╔══════════════════════════════════════════════════╗');
console.log('║ Dutchie City Discovery Runner ║');
console.log('╚══════════════════════════════════════════════════╝');
console.log(`\nDatabase: ${DB_URL.replace(/:[^:@]+@/, ':****@')}`);
const pool = new Pool({ connectionString: DB_URL });
try {
// Test DB connection
const { rows } = await pool.query('SELECT NOW() as time');
console.log(`Connected at: ${rows[0].time}\n`);
// Run city discovery
const discovery = new DutchieCityDiscovery(pool);
const result = await discovery.run();
// Print summary
console.log('\n' + '═'.repeat(50));
console.log('SUMMARY');
console.log('═'.repeat(50));
console.log(`Cities found: ${result.citiesFound}`);
console.log(`Cities inserted: ${result.citiesInserted}`);
console.log(`Cities updated: ${result.citiesUpdated}`);
console.log(`Errors: ${result.errors.length}`);
console.log(`Duration: ${(result.durationMs / 1000).toFixed(1)}s`);
if (result.errors.length > 0) {
console.log('\nErrors:');
result.errors.forEach((e, i) => console.log(` ${i + 1}. ${e}`));
}
// Get final stats
const stats = await discovery.getStats();
console.log('\nCurrent Database Stats:');
console.log(` Total cities: ${stats.total}`);
console.log(` Crawl enabled: ${stats.crawlEnabled}`);
console.log(` Never crawled: ${stats.neverCrawled}`);
console.log(` By country: ${stats.byCountry.map(c => `${c.countryCode}=${c.count}`).join(', ')}`);
if (result.errors.length > 0) {
console.log('\n⚠ Completed with errors');
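// Set the exit code and return so the `finally` block can close the pool;
// process.exit() would terminate before `finally` runs.
process.exitCode = 1;
return;
}
console.log('\n✅ City discovery completed successfully');
} catch (error: any) {
console.error('\n❌ City discovery failed:', error.message);
process.exitCode = 1;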
} finally {
await pool.end();
}
}
main();


@@ -1,113 +0,0 @@
#!/usr/bin/env npx tsx
/**
* Discovery Entrypoint: Dutchie Locations (From Cities)
*
* Reads from dutchie_discovery_cities (crawl_enabled = true)
* and discovers store locations for each city.
*
* Geo coordinates are captured when available from Dutchie's payloads.
*
* Usage:
* npm run discovery:dt:locations
* npm run discovery:dt:locations -- --limit=10
* npm run discovery:dt:locations -- --delay=3000
* DATABASE_URL="..." npx tsx src/dutchie-az/discovery/discovery-dt-locations-from-cities.ts
*
* Options:
* --limit=N Only process N cities (default: all)
* --delay=N Delay between cities in ms (default: 2000)
*/
import { Pool } from 'pg';
import { DtLocationDiscoveryService } from './DtLocationDiscoveryService';
const DB_URL = process.env.DATABASE_URL || process.env.CANNAIQ_DB_URL ||
'postgresql://dutchie:dutchie_local_pass@localhost:54320/dutchie_menus';
function parseArgs(): { limit?: number; delay?: number } {
const args: { limit?: number; delay?: number } = {};
for (const arg of process.argv.slice(2)) {
const limitMatch = arg.match(/--limit=(\d+)/);
if (limitMatch) args.limit = parseInt(limitMatch[1], 10);
const delayMatch = arg.match(/--delay=(\d+)/);
if (delayMatch) args.delay = parseInt(delayMatch[1], 10);
}
return args;
}
async function main() {
const args = parseArgs();
console.log('╔══════════════════════════════════════════════════╗');
console.log('║ Dutchie Location Discovery (From Cities) ║');
console.log('║ Reads crawl_enabled cities, discovers stores ║');
console.log('╚══════════════════════════════════════════════════╝');
console.log(`\nDatabase: ${DB_URL.replace(/:[^:@]+@/, ':****@')}`);
if (args.limit) console.log(`City limit: ${args.limit}`);
if (args.delay) console.log(`Delay: ${args.delay}ms`);
const pool = new Pool({ connectionString: DB_URL });
try {
const { rows } = await pool.query('SELECT NOW() as time');
console.log(`Connected at: ${rows[0].time}\n`);
const service = new DtLocationDiscoveryService(pool);
const result = await service.discoverAllEnabled({
limit: args.limit,
delayMs: args.delay ?? 2000,
});
console.log('\n' + '═'.repeat(50));
console.log('SUMMARY');
console.log('═'.repeat(50));
console.log(`Cities processed: ${result.totalCities}`);
console.log(`Locations found: ${result.totalLocationsFound}`);
console.log(`Locations inserted: ${result.totalInserted}`);
console.log(`Locations updated: ${result.totalUpdated}`);
console.log(`Locations skipped: ${result.totalSkipped} (protected status)`);
console.log(`Errors: ${result.errors.length}`);
console.log(`Duration: ${(result.durationMs / 1000).toFixed(1)}s`);
if (result.errors.length > 0) {
console.log('\nErrors (first 10):');
result.errors.slice(0, 10).forEach((e, i) => console.log(` ${i + 1}. ${e}`));
if (result.errors.length > 10) {
console.log(` ... and ${result.errors.length - 10} more`);
}
}
// Get location stats including coordinates
const stats = await service.getStats();
console.log('\nCurrent Database Stats:');
console.log(` Total locations: ${stats.total}`);
console.log(` With coordinates: ${stats.withCoordinates}`);
console.log(` By status:`);
stats.byStatus.forEach(s => console.log(` ${s.status}: ${s.count}`));
if (result.totalCities === 0) {
console.log('\n⚠ No crawl-enabled cities found.');
console.log(' Seed cities first:');
console.log(' npm run discovery:dt:cities:manual -- --city-slug=ny-hudson --city-name=Hudson --state-code=NY');
process.exit(1);
}
if (result.errors.length > 0) {
console.log('\n⚠ Completed with errors');
process.exit(1);
}
console.log('\n✅ Location discovery completed successfully');
process.exit(0);
} catch (error: any) {
console.error('\n❌ Location discovery failed:', error.message);
process.exit(1);
} finally {
await pool.end();
}
}
main();
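
The `--limit` and `--delay` handling in this entrypoint can be sketched as a pure function that takes the argument list explicitly, which makes it easier to exercise in isolation. This is an illustrative variant, not the removed code: the `CliOptions` name is an assumption, the function accepts `argv` as a parameter instead of reading `process.argv`, and the patterns are anchored with `^` and `$` (the original patterns are unanchored).

```typescript
// Hypothetical shape for the parsed options.
interface CliOptions {
  limit?: number;
  delay?: number;
}

// Parses "--limit=N" and "--delay=N" from an explicit argument list.
function parseArgs(argv: string[]): CliOptions {
  const options: CliOptions = {};
  for (const arg of argv) {
    const limitMatch = arg.match(/^--limit=(\d+)$/);
    if (limitMatch) options.limit = parseInt(limitMatch[1], 10);
    const delayMatch = arg.match(/^--delay=(\d+)$/);
    if (delayMatch) options.delay = parseInt(delayMatch[1], 10);
  }
  return options;
}

const options = parseArgs(['--limit=10', '--delay=3000']);
// Same fallback the entrypoint applies when no delay is given.
const delayMs = parseArgs([]).delay ?? 2000;
console.log(options, delayMs);
```

In the script itself the call would be `parseArgs(process.argv.slice(2))`.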


@@ -1,117 +0,0 @@
#!/usr/bin/env npx tsx
/**
* Discovery Runner: Dutchie Locations
*
* Discovers store locations for all crawl-enabled cities and upserts to dutchie_discovery_locations.
*
* Usage:
* npm run discovery:platforms:dt:locations
* npm run discovery:platforms:dt:locations -- --limit=10
* DATABASE_URL="..." npx tsx src/dutchie-az/discovery/discovery-dt-locations.ts
*
* Options (via args):
* --limit=N Only process N cities (default: all)
* --delay=N Delay between cities in ms (default: 2000)
*/
import { Pool } from 'pg';
import { DutchieLocationDiscovery } from './DutchieLocationDiscovery';
const DB_URL = process.env.DATABASE_URL || process.env.CANNAIQ_DB_URL ||
'postgresql://dutchie:dutchie_local_pass@localhost:54320/dutchie_menus';
// Parse CLI args
function parseArgs(): { limit?: number; delay?: number } {
const args: { limit?: number; delay?: number } = {};
for (const arg of process.argv.slice(2)) {
const limitMatch = arg.match(/--limit=(\d+)/);
if (limitMatch) args.limit = parseInt(limitMatch[1], 10);
const delayMatch = arg.match(/--delay=(\d+)/);
if (delayMatch) args.delay = parseInt(delayMatch[1], 10);
}
return args;
}
async function main() {
const args = parseArgs();
console.log('╔══════════════════════════════════════════════════╗');
console.log('║ Dutchie Location Discovery Runner ║');
console.log('╚══════════════════════════════════════════════════╝');
console.log(`\nDatabase: ${DB_URL.replace(/:[^:@]+@/, ':****@')}`);
if (args.limit) console.log(`City limit: ${args.limit}`);
if (args.delay) console.log(`Delay: ${args.delay}ms`);
const pool = new Pool({ connectionString: DB_URL });
try {
// Test DB connection
const { rows } = await pool.query('SELECT NOW() as time');
console.log(`Connected at: ${rows[0].time}\n`);
// Run location discovery
const discovery = new DutchieLocationDiscovery(pool);
const result = await discovery.discoverAllEnabled({
limit: args.limit,
delayMs: args.delay ?? 2000,
});
// Print summary
console.log('\n' + '═'.repeat(50));
console.log('SUMMARY');
console.log('═'.repeat(50));
console.log(`Cities processed: ${result.totalCities}`);
console.log(`Locations found: ${result.totalLocationsFound}`);
console.log(`Locations inserted: ${result.totalInserted}`);
console.log(`Locations updated: ${result.totalUpdated}`);
console.log(`Locations skipped: ${result.totalSkipped} (protected status)`);
console.log(`Errors: ${result.errors.length}`);
console.log(`Duration: ${(result.durationMs / 1000).toFixed(1)}s`);
if (result.errors.length > 0) {
console.log('\nErrors (first 10):');
result.errors.slice(0, 10).forEach((e, i) => console.log(` ${i + 1}. ${e}`));
if (result.errors.length > 10) {
console.log(` ... and ${result.errors.length - 10} more`);
}
}
// Get DB counts
const { rows: countRows } = await pool.query(`
SELECT
COUNT(*) as total,
COUNT(*) FILTER (WHERE status = 'discovered') as discovered,
COUNT(*) FILTER (WHERE status = 'verified') as verified,
COUNT(*) FILTER (WHERE status = 'merged') as merged,
COUNT(*) FILTER (WHERE status = 'rejected') as rejected
FROM dutchie_discovery_locations
WHERE platform = 'dutchie' AND active = TRUE
`);
const counts = countRows[0];
console.log('\nCurrent Database Stats:');
console.log(` Total locations: ${counts.total}`);
console.log(` Status discovered: ${counts.discovered}`);
console.log(` Status verified: ${counts.verified}`);
console.log(` Status merged: ${counts.merged}`);
console.log(` Status rejected: ${counts.rejected}`);
if (result.errors.length > 0) {
console.log('\n⚠ Completed with errors');
process.exit(1);
}
console.log('\n✅ Location discovery completed successfully');
process.exit(0);
} catch (error: any) {
console.error('\n❌ Location discovery failed:', error.message);
process.exit(1);
} finally {
await pool.end();
}
}
main();


@@ -1,10 +0,0 @@
/**
* Dutchie Discovery Module
*
* Store discovery pipeline for Dutchie platform.
*/
export { DutchieCityDiscovery } from './DutchieCityDiscovery';
export { DutchieLocationDiscovery } from './DutchieLocationDiscovery';
export { createDutchieDiscoveryRoutes } from './routes';
export { promoteDiscoveryLocation } from './promoteDiscoveryLocation';


@@ -1,248 +0,0 @@
/**
* Promote Discovery Location to Crawlable Dispensary
*
* When a discovery location is verified or merged:
* 1. Ensure a crawl profile exists for the dispensary
* 2. Seed/update crawl schedule (not implemented in this file; scheduleUpdated is always returned as false)
* 3. Create initial crawl job
*/
import { Pool } from 'pg';
export interface PromotionResult {
success: boolean;
discoveryId: number;
dispensaryId: number;
crawlProfileId?: number;
scheduleUpdated?: boolean;
crawlJobCreated?: boolean;
error?: string;
}
/**
* Promote a verified/merged discovery location to a crawlable dispensary.
*
* This function:
* 1. Verifies the discovery location is verified/merged and has a dispensary_id
* 2. Ensures the dispensary has platform info (menu_type, platform_dispensary_id)
* 3. Creates/updates a crawler profile if the profile table exists
* 4. Queues an initial crawl job
*/
export async function promoteDiscoveryLocation(
pool: Pool,
discoveryLocationId: number
): Promise<PromotionResult> {
console.log(`[Promote] Starting promotion for discovery location ${discoveryLocationId}...`);
// Get the discovery location
const { rows: locRows } = await pool.query(
`
SELECT
dl.*,
d.id as disp_id,
d.name as disp_name,
d.menu_type as disp_menu_type,
d.platform_dispensary_id as disp_platform_id
FROM dutchie_discovery_locations dl
JOIN dispensaries d ON dl.dispensary_id = d.id
WHERE dl.id = $1
`,
[discoveryLocationId]
);
if (locRows.length === 0) {
return {
success: false,
discoveryId: discoveryLocationId,
dispensaryId: 0,
error: 'Discovery location not found or not linked to a dispensary',
};
}
const location = locRows[0];
// Verify status
if (!['verified', 'merged'].includes(location.status)) {
return {
success: false,
discoveryId: discoveryLocationId,
dispensaryId: location.dispensary_id || 0,
error: `Cannot promote: location status is '${location.status}', must be 'verified' or 'merged'`,
};
}
const dispensaryId = location.dispensary_id;
console.log(`[Promote] Location ${discoveryLocationId} -> Dispensary ${dispensaryId} (${location.disp_name})`);
// Ensure dispensary has platform info
if (!location.disp_platform_id) {
console.log(`[Promote] Updating dispensary with platform info...`);
await pool.query(
`
UPDATE dispensaries
SET platform_dispensary_id = COALESCE(platform_dispensary_id, $1),
menu_url = COALESCE(menu_url, $2),
menu_type = COALESCE(menu_type, 'dutchie'),
updated_at = NOW()
WHERE id = $3
`,
[location.platform_location_id, location.platform_menu_url, dispensaryId]
);
}
let crawlProfileId: number | undefined;
// Never reassigned below: schedule seeding is not performed by this function
let scheduleUpdated = false;
let crawlJobCreated = false;
// Check if dispensary_crawler_profiles table exists
const { rows: tableCheck } = await pool.query(`
SELECT EXISTS (
SELECT FROM information_schema.tables
WHERE table_name = 'dispensary_crawler_profiles'
) as exists
`);
if (tableCheck[0]?.exists) {
// Create or get crawler profile
console.log(`[Promote] Checking crawler profile...`);
const { rows: profileRows } = await pool.query(
`
SELECT id FROM dispensary_crawler_profiles
WHERE dispensary_id = $1 AND platform = 'dutchie'
`,
[dispensaryId]
);
if (profileRows.length > 0) {
crawlProfileId = profileRows[0].id;
console.log(`[Promote] Using existing profile ${crawlProfileId}`);
} else {
// Create new profile
const profileKey = `dutchie-${location.platform_slug}`;
const { rows: newProfile } = await pool.query(
`
INSERT INTO dispensary_crawler_profiles (
dispensary_id,
profile_key,
profile_name,
platform,
config,
status,
enabled,
created_at,
updated_at
) VALUES (
$1, $2, $3, 'dutchie', $4, 'sandbox', TRUE, NOW(), NOW()
)
ON CONFLICT (dispensary_id, platform) DO UPDATE SET
enabled = TRUE,
updated_at = NOW()
RETURNING id
`,
[
dispensaryId,
profileKey,
`${location.name} (Dutchie)`,
JSON.stringify({
platformDispensaryId: location.platform_location_id,
platformSlug: location.platform_slug,
menuUrl: location.platform_menu_url,
pricingType: 'rec',
useBothModes: true,
}),
]
);
crawlProfileId = newProfile[0]?.id;
console.log(`[Promote] Created new profile ${crawlProfileId}`);
}
// Link profile to dispensary if not already linked
await pool.query(
`
UPDATE dispensaries
SET active_crawler_profile_id = COALESCE(active_crawler_profile_id, $1),
updated_at = NOW()
WHERE id = $2
`,
[crawlProfileId, dispensaryId]
);
}
// Check if crawl_jobs table exists and create initial job
const { rows: jobsTableCheck } = await pool.query(`
SELECT EXISTS (
SELECT FROM information_schema.tables
WHERE table_name = 'crawl_jobs'
) as exists
`);
if (jobsTableCheck[0]?.exists) {
// Check if there's already a pending job
const { rows: existingJobs } = await pool.query(
`
SELECT id FROM crawl_jobs
WHERE dispensary_id = $1 AND status IN ('pending', 'running')
LIMIT 1
`,
[dispensaryId]
);
if (existingJobs.length === 0) {
// Create initial crawl job
console.log(`[Promote] Creating initial crawl job...`);
await pool.query(
`
INSERT INTO crawl_jobs (
dispensary_id,
job_type,
status,
priority,
config,
created_at,
updated_at
) VALUES (
$1, 'dutchie_product_crawl', 'pending', 1, $2, NOW(), NOW()
)
`,
[
dispensaryId,
JSON.stringify({
source: 'discovery_promotion',
discoveryLocationId,
pricingType: 'rec',
useBothModes: true,
}),
]
);
crawlJobCreated = true;
} else {
console.log(`[Promote] Crawl job already exists for dispensary`);
}
}
// Update discovery location notes
await pool.query(
`
UPDATE dutchie_discovery_locations
SET notes = COALESCE(notes || E'\n', '') || $1,
updated_at = NOW()
WHERE id = $2
`,
[`Promoted to crawlable at ${new Date().toISOString()}`, discoveryLocationId]
);
console.log(`[Promote] Promotion complete for discovery location ${discoveryLocationId}`);
return {
success: true,
discoveryId: discoveryLocationId,
dispensaryId,
crawlProfileId,
scheduleUpdated,
crawlJobCreated,
};
}
export default promoteDiscoveryLocation;
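
The final UPDATE in this function appends a promotion note with `COALESCE(notes || E'\n', '') || $1`. In PostgreSQL, concatenating NULL with `||` yields NULL, so the COALESCE falls back to an empty string when no notes exist yet, and otherwise the new note goes on its own line. The sketch below mirrors that behaviour in TypeScript; `appendNote` is a hypothetical name used only for illustration.

```typescript
// Hypothetical mirror of: COALESCE(notes || E'\n', '') || $1
function appendNote(existing: string | null, note: string): string {
  // NULL notes: the result is just the new note.
  // Non-NULL notes (including an empty string): old text, newline, new note.
  return existing === null ? note : `${existing}\n${note}`;
}

console.log(appendNote(null, 'Promoted to crawlable'));
console.log(appendNote('Checked by admin', 'Promoted to crawlable'));
```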


@@ -1,973 +0,0 @@
/**
* Platform Discovery API Routes (DT = Dutchie)
*
* Routes for the platform-specific store discovery pipeline.
* Mount at /api/discovery/platforms/dt
*
* Platform Slug Mapping (for trademark-safe URLs):
* dt = Dutchie
* jn = Jane (future)
* wm = Weedmaps (future)
* lf = Leafly (future)
* tz = Treez (future)
*
* Note: The actual platform value stored in the DB remains 'dutchie'.
* Only the URL paths use neutral slugs.
*/
import { Router, Request, Response } from 'express';
import { Pool } from 'pg';
import { DutchieCityDiscovery } from './DutchieCityDiscovery';
import { DutchieLocationDiscovery } from './DutchieLocationDiscovery';
import { DiscoveryGeoService } from '../../services/DiscoveryGeoService';
import { GeoValidationService } from '../../services/GeoValidationService';
export function createDutchieDiscoveryRoutes(pool: Pool): Router {
const router = Router();
// ============================================================
// LOCATIONS
// ============================================================
/**
* GET /api/discovery/platforms/dt/locations
*
* List discovered locations with filtering.
*
* Query params:
* - status: 'discovered' | 'verified' | 'rejected' | 'merged'
* - state_code: e.g., 'AZ', 'CA'
* - country_code: 'US' | 'CA'
* - unlinked_only: 'true' to show only locations without dispensary_id
* - search: search by name
* - limit: number (default 50)
* - offset: number (default 0)
*/
router.get('/locations', async (req: Request, res: Response) => {
try {
const {
status,
state_code,
country_code,
unlinked_only,
search,
limit = '50',
offset = '0',
} = req.query;
// Columns are qualified with the dl alias: the list query joins dispensaries,
// which also has name and active columns, so unqualified references would be ambiguous.
let whereClause = "WHERE dl.platform = 'dutchie' AND dl.active = TRUE";
const params: any[] = [];
let paramIndex = 1;
if (status) {
whereClause += ` AND dl.status = $${paramIndex}`;
params.push(status);
paramIndex++;
}
if (state_code) {
whereClause += ` AND dl.state_code = $${paramIndex}`;
params.push(state_code);
paramIndex++;
}
if (country_code) {
whereClause += ` AND dl.country_code = $${paramIndex}`;
params.push(country_code);
paramIndex++;
}
if (unlinked_only === 'true') {
whereClause += ' AND dl.dispensary_id IS NULL';
}
if (search) {
whereClause += ` AND (dl.name ILIKE $${paramIndex} OR dl.platform_slug ILIKE $${paramIndex})`;
params.push(`%${search}%`);
paramIndex++;
}
const limitVal = parseInt(limit as string, 10);
const offsetVal = parseInt(offset as string, 10);
params.push(limitVal, offsetVal);
const { rows } = await pool.query(
`
SELECT
dl.id,
dl.platform,
dl.platform_location_id,
dl.platform_slug,
dl.platform_menu_url,
dl.name,
dl.raw_address,
dl.address_line1,
dl.city,
dl.state_code,
dl.postal_code,
dl.country_code,
dl.latitude,
dl.longitude,
dl.status,
dl.dispensary_id,
dl.offers_delivery,
dl.offers_pickup,
dl.is_recreational,
dl.is_medical,
dl.first_seen_at,
dl.last_seen_at,
dl.verified_at,
dl.verified_by,
dl.notes,
d.name as dispensary_name
FROM dutchie_discovery_locations dl
LEFT JOIN dispensaries d ON dl.dispensary_id = d.id
${whereClause}
ORDER BY dl.first_seen_at DESC
LIMIT $${paramIndex} OFFSET $${paramIndex + 1}
`,
params
);
// Get total count
const countParams = params.slice(0, -2);
const { rows: countRows } = await pool.query(
`SELECT COUNT(*) as total FROM dutchie_discovery_locations dl ${whereClause}`,
countParams
);
res.json({
success: true,
locations: rows.map((r) => ({
id: r.id,
platform: r.platform,
platformLocationId: r.platform_location_id,
platformSlug: r.platform_slug,
platformMenuUrl: r.platform_menu_url,
name: r.name,
rawAddress: r.raw_address,
addressLine1: r.address_line1,
city: r.city,
stateCode: r.state_code,
postalCode: r.postal_code,
countryCode: r.country_code,
latitude: r.latitude,
longitude: r.longitude,
status: r.status,
dispensaryId: r.dispensary_id,
dispensaryName: r.dispensary_name,
offersDelivery: r.offers_delivery,
offersPickup: r.offers_pickup,
isRecreational: r.is_recreational,
isMedical: r.is_medical,
firstSeenAt: r.first_seen_at,
lastSeenAt: r.last_seen_at,
verifiedAt: r.verified_at,
verifiedBy: r.verified_by,
notes: r.notes,
})),
total: parseInt(countRows[0]?.total || '0', 10),
limit: limitVal,
offset: offsetVal,
});
} catch (error: any) {
console.error('[Discovery Routes] Error fetching locations:', error);
res.status(500).json({ success: false, error: error.message });
}
});
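
The filter-building part of the handler above (optional conditions, each with its own numbered placeholder) can be sketched as a standalone function. This is a minimal illustration under assumptions: `LocationFilterQuery`, `BuiltFilter`, and `buildLocationFilter` are hypothetical names, and columns are qualified with the `dl` alias so the clause can be reused in both the joined list query and the count query.

```typescript
// Hypothetical input shape: the query-string fields the route reads.
interface LocationFilterQuery {
  status?: string;
  state_code?: string;
  country_code?: string;
  unlinked_only?: string;
  search?: string;
}

interface BuiltFilter {
  whereClause: string;
  params: unknown[];
}

function buildLocationFilter(query: LocationFilterQuery): BuiltFilter {
  const conditions: string[] = ["dl.platform = 'dutchie'", 'dl.active = TRUE'];
  const params: unknown[] = [];

  // Pushes a value and returns its 1-based placeholder, e.g. "$1".
  const addParam = (value: unknown): string => {
    params.push(value);
    return `$${params.length}`;
  };

  if (query.status) conditions.push(`dl.status = ${addParam(query.status)}`);
  if (query.state_code) conditions.push(`dl.state_code = ${addParam(query.state_code)}`);
  if (query.country_code) conditions.push(`dl.country_code = ${addParam(query.country_code)}`);
  if (query.unlinked_only === 'true') conditions.push('dl.dispensary_id IS NULL');
  if (query.search) {
    // One placeholder is shared by both ILIKE comparisons.
    const placeholder = addParam(`%${query.search}%`);
    conditions.push(`(dl.name ILIKE ${placeholder} OR dl.platform_slug ILIKE ${placeholder})`);
  }

  return { whereClause: `WHERE ${conditions.join(' AND ')}`, params };
}

const filter = buildLocationFilter({ status: 'verified', search: 'green' });
console.log(filter.whereClause);
console.log(filter.params);
```

Deriving the placeholder number from `params.length` removes the separate `paramIndex` counter, so the two cannot drift apart when a new filter is added.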
/**
* GET /api/discovery/platforms/dt/locations/:id
*
* Get a single location by ID.
*/
router.get('/locations/:id', async (req: Request, res: Response) => {
try {
const { id } = req.params;
const { rows } = await pool.query(
`
SELECT
dl.*,
d.name as dispensary_name,
d.menu_url as dispensary_menu_url
FROM dutchie_discovery_locations dl
LEFT JOIN dispensaries d ON dl.dispensary_id = d.id
WHERE dl.id = $1
`,
[parseInt(id, 10)]
);
if (rows.length === 0) {
return res.status(404).json({ success: false, error: 'Location not found' });
}
const r = rows[0];
res.json({
success: true,
location: {
id: r.id,
platform: r.platform,
platformLocationId: r.platform_location_id,
platformSlug: r.platform_slug,
platformMenuUrl: r.platform_menu_url,
name: r.name,
rawAddress: r.raw_address,
addressLine1: r.address_line1,
addressLine2: r.address_line2,
city: r.city,
stateCode: r.state_code,
postalCode: r.postal_code,
countryCode: r.country_code,
latitude: r.latitude,
longitude: r.longitude,
timezone: r.timezone,
status: r.status,
dispensaryId: r.dispensary_id,
dispensaryName: r.dispensary_name,
dispensaryMenuUrl: r.dispensary_menu_url,
offersDelivery: r.offers_delivery,
offersPickup: r.offers_pickup,
isRecreational: r.is_recreational,
isMedical: r.is_medical,
firstSeenAt: r.first_seen_at,
lastSeenAt: r.last_seen_at,
verifiedAt: r.verified_at,
verifiedBy: r.verified_by,
notes: r.notes,
metadata: r.metadata,
},
});
} catch (error: any) {
console.error('[Discovery Routes] Error fetching location:', error);
res.status(500).json({ success: false, error: error.message });
}
});
// ============================================================
// VERIFICATION ACTIONS
// ============================================================
/**
* POST /api/discovery/platforms/dt/locations/:id/verify-create
*
* Verify a discovered location and create a new canonical dispensary.
*/
router.post('/locations/:id/verify-create', async (req: Request, res: Response) => {
const client = await pool.connect();
try {
const { id } = req.params;
const { verifiedBy = 'admin' } = req.body;
await client.query('BEGIN');
// Get the discovery location
const { rows: locRows } = await client.query(
`SELECT * FROM dutchie_discovery_locations WHERE id = $1 FOR UPDATE`,
[parseInt(id, 10)]
);
if (locRows.length === 0) {
await client.query('ROLLBACK');
return res.status(404).json({ success: false, error: 'Location not found' });
}
const location = locRows[0];
if (location.status !== 'discovered') {
await client.query('ROLLBACK');
return res.status(400).json({
success: false,
error: `Cannot verify: location status is '${location.status}'`,
});
}
// Look up state_id if we have a state_code
let stateId: number | null = null;
if (location.state_code) {
const { rows: stateRows } = await client.query(
`SELECT id FROM states WHERE code = $1`,
[location.state_code]
);
if (stateRows.length > 0) {
stateId = stateRows[0].id;
}
}
// Create the canonical dispensary
const { rows: dispRows } = await client.query(
`
INSERT INTO dispensaries (
name,
slug,
address,
city,
state,
zip,
latitude,
longitude,
timezone,
menu_type,
menu_url,
platform_dispensary_id,
state_id,
active,
created_at,
updated_at
) VALUES (
$1, $2, $3, $4, $5, $6, $7, $8, $9, $10, $11, $12, $13, TRUE, NOW(), NOW()
)
RETURNING id
`,
[
location.name,
location.platform_slug,
location.address_line1,
location.city,
location.state_code,
location.postal_code,
location.latitude,
location.longitude,
location.timezone,
'dutchie',
location.platform_menu_url,
location.platform_location_id,
stateId,
]
);
const dispensaryId = dispRows[0].id;
// Update the discovery location
await client.query(
`
UPDATE dutchie_discovery_locations
SET status = 'verified',
dispensary_id = $1,
verified_at = NOW(),
verified_by = $2,
updated_at = NOW()
WHERE id = $3
`,
[dispensaryId, verifiedBy, id]
);
await client.query('COMMIT');
res.json({
success: true,
action: 'created',
discoveryId: parseInt(id, 10),
dispensaryId,
message: `Created new dispensary (ID: ${dispensaryId})`,
});
} catch (error: any) {
await client.query('ROLLBACK');
console.error('[Discovery Routes] Error in verify-create:', error);
res.status(500).json({ success: false, error: error.message });
} finally {
client.release();
}
});
/**
* POST /api/discovery/platforms/dt/locations/:id/verify-link
*
* Link a discovered location to an existing dispensary.
*
* Body:
* - dispensaryId: number (required)
* - verifiedBy: string (optional)
*/
router.post('/locations/:id/verify-link', async (req: Request, res: Response) => {
const client = await pool.connect();
try {
const { id } = req.params;
const { dispensaryId, verifiedBy = 'admin' } = req.body;
if (!dispensaryId) {
return res.status(400).json({ success: false, error: 'dispensaryId is required' });
}
await client.query('BEGIN');
// Verify dispensary exists
const { rows: dispRows } = await client.query(
`SELECT id, name FROM dispensaries WHERE id = $1`,
[dispensaryId]
);
if (dispRows.length === 0) {
await client.query('ROLLBACK');
return res.status(404).json({ success: false, error: 'Dispensary not found' });
}
// Get the discovery location
const { rows: locRows } = await client.query(
`SELECT * FROM dutchie_discovery_locations WHERE id = $1 FOR UPDATE`,
[parseInt(id, 10)]
);
if (locRows.length === 0) {
await client.query('ROLLBACK');
return res.status(404).json({ success: false, error: 'Location not found' });
}
const location = locRows[0];
if (location.status !== 'discovered') {
await client.query('ROLLBACK');
return res.status(400).json({
success: false,
error: `Cannot link: location status is '${location.status}'`,
});
}
// Update dispensary with platform info if missing
await client.query(
`
UPDATE dispensaries
SET platform_dispensary_id = COALESCE(platform_dispensary_id, $1),
menu_url = COALESCE(menu_url, $2),
menu_type = COALESCE(menu_type, 'dutchie'),
updated_at = NOW()
WHERE id = $3
`,
[location.platform_location_id, location.platform_menu_url, dispensaryId]
);
// Update the discovery location
await client.query(
`
UPDATE dutchie_discovery_locations
SET status = 'merged',
dispensary_id = $1,
verified_at = NOW(),
verified_by = $2,
updated_at = NOW()
WHERE id = $3
`,
[dispensaryId, verifiedBy, id]
);
await client.query('COMMIT');
res.json({
success: true,
action: 'linked',
discoveryId: parseInt(id, 10),
dispensaryId,
dispensaryName: dispRows[0].name,
message: `Linked to existing dispensary: ${dispRows[0].name}`,
});
} catch (error: any) {
await client.query('ROLLBACK');
console.error('[Discovery Routes] Error in verify-link:', error);
res.status(500).json({ success: false, error: error.message });
} finally {
client.release();
}
});
/**
* POST /api/discovery/platforms/dt/locations/:id/reject
*
* Reject a discovered location.
*
* Body:
* - reason: string (optional)
* - verifiedBy: string (optional)
*/
router.post('/locations/:id/reject', async (req: Request, res: Response) => {
try {
const { id } = req.params;
const { reason, verifiedBy = 'admin' } = req.body;
// Get current status
const { rows } = await pool.query(
`SELECT status FROM dutchie_discovery_locations WHERE id = $1`,
[parseInt(id, 10)]
);
if (rows.length === 0) {
return res.status(404).json({ success: false, error: 'Location not found' });
}
if (rows[0].status !== 'discovered') {
return res.status(400).json({
success: false,
error: `Cannot reject: location status is '${rows[0].status}'`,
});
}
await pool.query(
`
UPDATE dutchie_discovery_locations
SET status = 'rejected',
verified_at = NOW(),
verified_by = $1,
notes = COALESCE($2, notes),
updated_at = NOW()
WHERE id = $3
`,
[verifiedBy, reason, id]
);
res.json({
success: true,
action: 'rejected',
discoveryId: parseInt(id, 10),
message: 'Location rejected',
});
} catch (error: any) {
console.error('[Discovery Routes] Error in reject:', error);
res.status(500).json({ success: false, error: error.message });
}
});
/**
* POST /api/discovery/platforms/dt/locations/:id/unreject
*
* Restore a rejected location to discovered status.
*/
router.post('/locations/:id/unreject', async (req: Request, res: Response) => {
try {
const { id } = req.params;
// Get current status
const { rows } = await pool.query(
`SELECT status FROM dutchie_discovery_locations WHERE id = $1`,
[parseInt(id, 10)]
);
if (rows.length === 0) {
return res.status(404).json({ success: false, error: 'Location not found' });
}
if (rows[0].status !== 'rejected') {
return res.status(400).json({
success: false,
error: `Cannot unreject: location status is '${rows[0].status}'`,
});
}
await pool.query(
`
UPDATE dutchie_discovery_locations
SET status = 'discovered',
verified_at = NULL,
verified_by = NULL,
updated_at = NOW()
WHERE id = $1
`,
[id]
);
res.json({
success: true,
action: 'unrejected',
discoveryId: parseInt(id, 10),
message: 'Location restored to discovered status',
});
} catch (error: any) {
console.error('[Discovery Routes] Error in unreject:', error);
res.status(500).json({ success: false, error: error.message });
}
});
// ============================================================
// SUMMARY / REPORTING
// ============================================================
/**
* GET /api/discovery/platforms/dt/summary
*
* Get discovery summary statistics.
*/
router.get('/summary', async (_req: Request, res: Response) => {
try {
// Total counts by status
const { rows: statusRows } = await pool.query(`
SELECT status, COUNT(*) as cnt
FROM dutchie_discovery_locations
WHERE platform = 'dutchie' AND active = TRUE
GROUP BY status
`);
const statusCounts: Record<string, number> = {};
let totalLocations = 0;
for (const row of statusRows) {
statusCounts[row.status] = parseInt(row.cnt, 10);
totalLocations += parseInt(row.cnt, 10);
}
// By state
const { rows: stateRows } = await pool.query(`
SELECT
state_code,
COUNT(*) as total,
COUNT(*) FILTER (WHERE status = 'verified') as verified,
COUNT(*) FILTER (WHERE dispensary_id IS NULL AND status = 'discovered') as unlinked
FROM dutchie_discovery_locations
WHERE platform = 'dutchie' AND active = TRUE AND state_code IS NOT NULL
GROUP BY state_code
ORDER BY total DESC
`);
res.json({
success: true,
summary: {
total_locations: totalLocations,
discovered: statusCounts['discovered'] || 0,
verified: statusCounts['verified'] || 0,
merged: statusCounts['merged'] || 0,
rejected: statusCounts['rejected'] || 0,
},
by_state: stateRows.map((r) => ({
state_code: r.state_code,
total: parseInt(r.total, 10),
verified: parseInt(r.verified, 10),
unlinked: parseInt(r.unlinked, 10),
})),
});
} catch (error: any) {
console.error('[Discovery Routes] Error in summary:', error);
res.status(500).json({ success: false, error: error.message });
}
});
// ============================================================
// CITIES
// ============================================================
/**
* GET /api/discovery/platforms/dt/cities
*
* List discovery cities.
*/
router.get('/cities', async (req: Request, res: Response) => {
try {
const { state_code, country_code, crawl_enabled, limit = '100', offset = '0' } = req.query;
let whereClause = "WHERE platform = 'dutchie'";
const params: any[] = [];
let paramIndex = 1;
if (state_code) {
whereClause += ` AND state_code = $${paramIndex}`;
params.push(state_code);
paramIndex++;
}
if (country_code) {
whereClause += ` AND country_code = $${paramIndex}`;
params.push(country_code);
paramIndex++;
}
if (crawl_enabled === 'true') {
whereClause += ' AND crawl_enabled = TRUE';
} else if (crawl_enabled === 'false') {
whereClause += ' AND crawl_enabled = FALSE';
}
params.push(parseInt(limit as string, 10), parseInt(offset as string, 10));
const { rows } = await pool.query(
`
SELECT
id,
platform,
city_name,
city_slug,
state_code,
country_code,
last_crawled_at,
crawl_enabled,
location_count
FROM dutchie_discovery_cities
${whereClause}
ORDER BY country_code, state_code, city_name
LIMIT $${paramIndex} OFFSET $${paramIndex + 1}
`,
params
);
const { rows: countRows } = await pool.query(
`SELECT COUNT(*) as total FROM dutchie_discovery_cities ${whereClause}`,
params.slice(0, -2)
);
res.json({
success: true,
cities: rows.map((r) => ({
id: r.id,
platform: r.platform,
cityName: r.city_name,
citySlug: r.city_slug,
stateCode: r.state_code,
countryCode: r.country_code,
lastCrawledAt: r.last_crawled_at,
crawlEnabled: r.crawl_enabled,
locationCount: r.location_count,
})),
total: parseInt(countRows[0]?.total || '0', 10),
});
} catch (error: any) {
console.error('[Discovery Routes] Error fetching cities:', error);
res.status(500).json({ success: false, error: error.message });
}
});
// ============================================================
// MATCH CANDIDATES
// ============================================================
/**
* GET /api/discovery/platforms/dt/locations/:id/match-candidates
*
* Find potential dispensary matches for a discovery location.
*/
router.get('/locations/:id/match-candidates', async (req: Request, res: Response) => {
try {
const { id } = req.params;
// Get the discovery location
const { rows: locRows } = await pool.query(
`SELECT * FROM dutchie_discovery_locations WHERE id = $1`,
[parseInt(id, 10)]
);
if (locRows.length === 0) {
return res.status(404).json({ success: false, error: 'Location not found' });
}
const location = locRows[0];
// Find potential matches
const { rows: candidates } = await pool.query(
`
SELECT
d.id,
d.name,
d.city,
d.state,
d.address,
d.menu_type,
d.platform_dispensary_id,
d.menu_url,
d.latitude,
d.longitude,
CASE
WHEN d.name ILIKE $1 THEN 'exact_name'
WHEN d.name ILIKE $2 THEN 'partial_name'
WHEN d.city ILIKE $3 AND d.state = $4 THEN 'same_city'
ELSE 'location_match'
END as match_type,
CASE
WHEN d.latitude IS NOT NULL AND d.longitude IS NOT NULL
AND $5::float IS NOT NULL AND $6::float IS NOT NULL
THEN (3959 * acos(
LEAST(1.0, GREATEST(-1.0,
cos(radians($5::float)) * cos(radians(d.latitude)) *
cos(radians(d.longitude) - radians($6::float)) +
sin(radians($5::float)) * sin(radians(d.latitude))
))
))
ELSE NULL
END as distance_miles
FROM dispensaries d
WHERE d.state = $4
AND (
d.name ILIKE $1
OR d.name ILIKE $2
OR d.city ILIKE $3
OR (
d.latitude IS NOT NULL
AND d.longitude IS NOT NULL
AND $5::float IS NOT NULL
AND $6::float IS NOT NULL
)
)
ORDER BY
CASE
WHEN d.name ILIKE $1 THEN 1
WHEN d.name ILIKE $2 THEN 2
ELSE 3
END,
distance_miles NULLS LAST
LIMIT 10
`,
[
location.name,
`%${location.name.split(' ')[0]}%`,
location.city,
location.state_code,
location.latitude,
location.longitude,
]
);
res.json({
success: true,
location: {
id: location.id,
name: location.name,
city: location.city,
stateCode: location.state_code,
},
candidates: candidates.map((c) => ({
id: c.id,
name: c.name,
city: c.city,
state: c.state,
address: c.address,
menuType: c.menu_type,
platformDispensaryId: c.platform_dispensary_id,
menuUrl: c.menu_url,
matchType: c.match_type,
distanceMiles: c.distance_miles != null ? Math.round(c.distance_miles * 10) / 10 : null,
})),
});
} catch (error: any) {
console.error('[Discovery Routes] Error fetching match candidates:', error);
res.status(500).json({ success: false, error: error.message });
}
});
// ============================================================
// GEO / NEARBY (Admin/Debug Only)
// ============================================================
/**
* GET /api/discovery/platforms/dt/nearby
*
* Find discovery locations near a given coordinate.
* This is an internal/debug endpoint for admin use.
*
* Query params:
* - lat: number (required)
* - lon: number (required)
* - radiusKm: number (optional, default 50)
* - limit: number (optional, default 20)
* - status: string (optional, filter by status)
*/
router.get('/nearby', async (req: Request, res: Response) => {
try {
const { lat, lon, radiusKm = '50', limit = '20', status } = req.query;
// Validate required params
if (!lat || !lon) {
return res.status(400).json({
success: false,
error: 'lat and lon are required query parameters',
});
}
const latNum = parseFloat(lat as string);
const lonNum = parseFloat(lon as string);
const radiusNum = parseFloat(radiusKm as string);
const limitNum = parseInt(limit as string, 10);
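if (isNaN(latNum) || isNaN(lonNum)) {
return res.status(400).json({
success: false,
error: 'lat and lon must be valid numbers',
});
}
// Reject malformed optional params instead of passing NaN to the geo service
if (isNaN(radiusNum) || isNaN(limitNum)) {
return res.status(400).json({ success: false, error: 'radiusKm and limit must be valid numbers' });
}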
const geoService = new DiscoveryGeoService(pool);
const locations = await geoService.findNearbyDiscoveryLocations(latNum, lonNum, {
radiusKm: radiusNum,
limit: limitNum,
platform: 'dutchie',
status: status as string | undefined,
});
res.json({
success: true,
center: { lat: latNum, lon: lonNum },
radiusKm: radiusNum,
count: locations.length,
locations,
});
} catch (error: any) {
console.error('[Discovery Routes] Error in nearby:', error);
res.status(500).json({ success: false, error: error.message });
}
});
/**
* GET /api/discovery/platforms/dt/geo-stats
*
* Get coordinate coverage statistics for discovery locations.
* This is an internal/debug endpoint for admin use.
*/
router.get('/geo-stats', async (_req: Request, res: Response) => {
try {
const geoService = new DiscoveryGeoService(pool);
const stats = await geoService.getCoordinateCoverageStats();
res.json({
success: true,
stats,
});
} catch (error: any) {
console.error('[Discovery Routes] Error in geo-stats:', error);
res.status(500).json({ success: false, error: error.message });
}
});
/**
* GET /api/discovery/platforms/dt/locations/:id/validate-geo
*
* Validate the geographic data for a discovery location.
* This is an internal/debug endpoint for admin use.
*/
router.get('/locations/:id/validate-geo', async (req: Request, res: Response) => {
try {
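const { id } = req.params;
// A non-numeric id would otherwise reach Postgres as NaN and surface as a 500
if (isNaN(parseInt(id, 10))) {
return res.status(400).json({ success: false, error: 'Invalid location id' });
}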
// Get the location
const { rows } = await pool.query(
`SELECT latitude, longitude, state_code, country_code, name
FROM dutchie_discovery_locations WHERE id = $1`,
[parseInt(id, 10)]
);
if (rows.length === 0) {
return res.status(404).json({ success: false, error: 'Location not found' });
}
const location = rows[0];
const geoValidation = new GeoValidationService();
const result = geoValidation.validateLocationState({
latitude: location.latitude,
longitude: location.longitude,
state_code: location.state_code,
country_code: location.country_code,
});
res.json({
success: true,
location: {
id: parseInt(id, 10),
name: location.name,
latitude: location.latitude,
longitude: location.longitude,
stateCode: location.state_code,
countryCode: location.country_code,
},
validation: result,
});
} catch (error: any) {
console.error('[Discovery Routes] Error in validate-geo:', error);
res.status(500).json({ success: false, error: error.message });
}
});
return router;
}
export default createDutchieDiscoveryRoutes;
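
The `distance_miles` expression in the match-candidates query is the spherical law of cosines with the cosine clamped to [-1, 1] before `acos`. A minimal TypeScript sketch of the same calculation follows, mainly as a way to sanity-check values returned by the endpoint. The names `distanceMiles`, `roundToTenth`, and `EARTH_RADIUS_MILES` are hypothetical and do not appear in the repository; only the 3959-mile radius and the one-decimal rounding come from the route above.

```typescript
// Earth radius in miles, the same constant the SQL expression uses.
const EARTH_RADIUS_MILES = 3959;

const toRadians = (degrees: number): number => (degrees * Math.PI) / 180;

/**
 * Great-circle distance via the spherical law of cosines.
 * Returns null when any coordinate is missing, mirroring the SQL CASE.
 */
function distanceMiles(
  lat1: number | null,
  lon1: number | null,
  lat2: number | null,
  lon2: number | null,
): number | null {
  if (lat1 === null || lon1 === null || lat2 === null || lon2 === null) {
    return null;
  }
  const cosAngle =
    Math.cos(toRadians(lat1)) * Math.cos(toRadians(lat2)) *
      Math.cos(toRadians(lon2) - toRadians(lon1)) +
    Math.sin(toRadians(lat1)) * Math.sin(toRadians(lat2));
  // Clamp to [-1, 1]: floating-point rounding can push the value slightly
  // out of range, and acos would then return NaN.
  const clamped = Math.min(1, Math.max(-1, cosAngle));
  return EARTH_RADIUS_MILES * Math.acos(clamped);
}

/** Round to one decimal place, keeping a genuine 0 distinct from null. */
function roundToTenth(miles: number | null): number | null {
  return miles === null ? null : Math.round(miles * 10) / 10;
}
```

The explicit `null` comparison in `roundToTenth` matters for co-located records: a truthiness check would report a distance of exactly 0 as "unknown".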

@@ -1,92 +0,0 @@
/**
* Dutchie AZ Data Pipeline
*
* Isolated data pipeline for crawling and storing Dutchie Arizona dispensary data.
* This module is completely separate from the main application database.
*
* Features:
* - Two-mode crawling (Mode A: UI parity, Mode B: MAX COVERAGE)
* - Derived stockStatus field (in_stock, out_of_stock, unknown)
* - Full raw payload storage for 100% data preservation
* - AZDHS dispensary list as canonical source
*/
// Types
export * from './types';
// Database
export {
getDutchieAZPool,
query,
getClient,
closePool,
healthCheck,
} from './db/connection';
export {
createSchema,
dropSchema,
schemaExists,
ensureSchema,
} from './db/schema';
// Services - GraphQL Client
export {
GRAPHQL_HASHES,
ARIZONA_CENTERPOINTS,
resolveDispensaryId,
fetchAllProducts,
fetchAllProductsBothModes,
discoverArizonaDispensaries,
// Alias for backward compatibility
discoverArizonaDispensaries as discoverDispensaries,
} from './services/graphql-client';
// Services - Discovery
export {
importFromExistingDispensaries,
discoverDispensaries as discoverAndSaveDispensaries,
resolvePlatformDispensaryIds,
getAllDispensaries,
getDispensaryById,
getDispensariesWithPlatformIds,
} from './services/discovery';
// Services - Product Crawler
export {
normalizeProduct,
normalizeSnapshot,
crawlDispensaryProducts,
crawlAllArizonaDispensaries,
} from './services/product-crawler';
export type { CrawlResult } from './services/product-crawler';
// Services - Scheduler
export {
startScheduler,
stopScheduler,
triggerImmediateCrawl,
getSchedulerStatus,
crawlSingleDispensary,
// Schedule config CRUD
getAllSchedules,
getScheduleById,
createSchedule,
updateSchedule,
deleteSchedule,
triggerScheduleNow,
initializeDefaultSchedules,
// Run logs
getRunLogs,
} from './services/scheduler';
// Services - AZDHS Import
export {
importAZDHSDispensaries,
importFromJSON,
getImportStats,
} from './services/azdhs-import';
// Routes
export { default as dutchieAZRouter } from './routes';

@@ -1,682 +0,0 @@
/**
* Analytics API Routes
*
* Provides REST API endpoints for all analytics services.
* All routes are prefixed with /api/analytics
*
* Phase 3: Analytics Dashboards
*/
import { Router, Request, Response } from 'express';
import { Pool } from 'pg';
import {
AnalyticsCache,
PriceTrendService,
PenetrationService,
CategoryAnalyticsService,
StoreChangeService,
BrandOpportunityService,
} from '../services/analytics';
export function createAnalyticsRouter(pool: Pool): Router {
const router = Router();
// Initialize services
const cache = new AnalyticsCache(pool, { defaultTtlMinutes: 15 });
const priceService = new PriceTrendService(pool, cache);
const penetrationService = new PenetrationService(pool, cache);
const categoryService = new CategoryAnalyticsService(pool, cache);
const storeService = new StoreChangeService(pool, cache);
const brandOpportunityService = new BrandOpportunityService(pool, cache);
// ============================================================
// PRICE ANALYTICS
// ============================================================
/**
* GET /api/analytics/price/product/:id
* Get price trend for a specific product
*/
router.get('/price/product/:id', async (req: Request, res: Response) => {
try {
const productId = parseInt(req.params.id);
const storeId = req.query.storeId ? parseInt(req.query.storeId as string) : undefined;
const days = req.query.days ? parseInt(req.query.days as string) : 30;
const result = await priceService.getProductPriceTrend(productId, storeId, days);
res.json(result);
} catch (error) {
console.error('[Analytics] Price product error:', error);
res.status(500).json({ error: 'Failed to fetch product price trend' });
}
});
/**
* GET /api/analytics/price/brand/:name
* Get price trend for a brand
*/
router.get('/price/brand/:name', async (req: Request, res: Response) => {
try {
// Express already URL-decodes route params; decoding again throws URIError on a literal '%'
const brandName = req.params.name;
const filters = {
storeId: req.query.storeId ? parseInt(req.query.storeId as string) : undefined,
category: req.query.category as string | undefined,
state: req.query.state as string | undefined,
days: req.query.days ? parseInt(req.query.days as string) : 30,
};
const result = await priceService.getBrandPriceTrend(brandName, filters);
res.json(result);
} catch (error) {
console.error('[Analytics] Price brand error:', error);
res.status(500).json({ error: 'Failed to fetch brand price trend' });
}
});
/**
* GET /api/analytics/price/category/:name
* Get price trend for a category
*/
router.get('/price/category/:name', async (req: Request, res: Response) => {
try {
const category = req.params.name;
const filters = {
storeId: req.query.storeId ? parseInt(req.query.storeId as string) : undefined,
brandName: req.query.brand as string | undefined,
state: req.query.state as string | undefined,
days: req.query.days ? parseInt(req.query.days as string) : 30,
};
const result = await priceService.getCategoryPriceTrend(category, filters);
res.json(result);
} catch (error) {
console.error('[Analytics] Price category error:', error);
res.status(500).json({ error: 'Failed to fetch category price trend' });
}
});
/**
* GET /api/analytics/price/summary
* Get price summary statistics
*/
router.get('/price/summary', async (req: Request, res: Response) => {
try {
const filters = {
storeId: req.query.storeId ? parseInt(req.query.storeId as string) : undefined,
brandName: req.query.brand as string | undefined,
category: req.query.category as string | undefined,
state: req.query.state as string | undefined,
};
const result = await priceService.getPriceSummary(filters);
res.json(result);
} catch (error) {
console.error('[Analytics] Price summary error:', error);
res.status(500).json({ error: 'Failed to fetch price summary' });
}
});
/**
* GET /api/analytics/price/compression/:category
* Get price compression analysis for a category
*/
router.get('/price/compression/:category', async (req: Request, res: Response) => {
try {
const category = req.params.category;
const state = req.query.state as string | undefined;
const result = await priceService.detectPriceCompression(category, state);
res.json(result);
} catch (error) {
console.error('[Analytics] Price compression error:', error);
res.status(500).json({ error: 'Failed to analyze price compression' });
}
});
/**
* GET /api/analytics/price/global
* Get global price statistics
*/
router.get('/price/global', async (_req: Request, res: Response) => {
try {
const result = await priceService.getGlobalPriceStats();
res.json(result);
} catch (error) {
console.error('[Analytics] Global price error:', error);
res.status(500).json({ error: 'Failed to fetch global price stats' });
}
});
// ============================================================
// PENETRATION ANALYTICS
// ============================================================
/**
* GET /api/analytics/penetration/brand/:name
* Get penetration data for a brand
*/
router.get('/penetration/brand/:name', async (req: Request, res: Response) => {
try {
const brandName = req.params.name;
const filters = {
state: req.query.state as string | undefined,
category: req.query.category as string | undefined,
};
const result = await penetrationService.getBrandPenetration(brandName, filters);
res.json(result);
} catch (error) {
console.error('[Analytics] Brand penetration error:', error);
res.status(500).json({ error: 'Failed to fetch brand penetration' });
}
});
/**
* GET /api/analytics/penetration/top
* Get top brands by penetration
*/
router.get('/penetration/top', async (req: Request, res: Response) => {
try {
const limit = req.query.limit ? parseInt(req.query.limit as string) : 20;
const filters = {
state: req.query.state as string | undefined,
category: req.query.category as string | undefined,
minStores: req.query.minStores ? parseInt(req.query.minStores as string) : 2,
minSkus: req.query.minSkus ? parseInt(req.query.minSkus as string) : 5,
};
const result = await penetrationService.getTopBrandsByPenetration(limit, filters);
res.json(result);
} catch (error) {
console.error('[Analytics] Top penetration error:', error);
res.status(500).json({ error: 'Failed to fetch top brands' });
}
});
/**
* GET /api/analytics/penetration/trend/:brand
* Get penetration trend for a brand
*/
router.get('/penetration/trend/:brand', async (req: Request, res: Response) => {
try {
const brandName = req.params.brand;
const days = req.query.days ? parseInt(req.query.days as string) : 30;
const result = await penetrationService.getPenetrationTrend(brandName, days);
res.json(result);
} catch (error) {
console.error('[Analytics] Penetration trend error:', error);
res.status(500).json({ error: 'Failed to fetch penetration trend' });
}
});
/**
* GET /api/analytics/penetration/shelf-share/:brand
* Get shelf share by category for a brand
*/
router.get('/penetration/shelf-share/:brand', async (req: Request, res: Response) => {
try {
const brandName = req.params.brand;
const result = await penetrationService.getShelfShareByCategory(brandName);
res.json(result);
} catch (error) {
console.error('[Analytics] Shelf share error:', error);
res.status(500).json({ error: 'Failed to fetch shelf share' });
}
});
/**
* GET /api/analytics/penetration/by-state/:brand
* Get brand presence by state
*/
router.get('/penetration/by-state/:brand', async (req: Request, res: Response) => {
try {
const brandName = req.params.brand;
const result = await penetrationService.getBrandPresenceByState(brandName);
res.json(result);
} catch (error) {
console.error('[Analytics] Brand by state error:', error);
res.status(500).json({ error: 'Failed to fetch brand presence by state' });
}
});
/**
* GET /api/analytics/penetration/stores/:brand
* Get stores carrying a brand
*/
router.get('/penetration/stores/:brand', async (req: Request, res: Response) => {
try {
const brandName = req.params.brand;
const result = await penetrationService.getStoresCarryingBrand(brandName);
res.json(result);
} catch (error) {
console.error('[Analytics] Stores carrying brand error:', error);
res.status(500).json({ error: 'Failed to fetch stores' });
}
});
/**
* GET /api/analytics/penetration/heatmap
* Get penetration heatmap data
*/
router.get('/penetration/heatmap', async (req: Request, res: Response) => {
try {
const brandName = req.query.brand as string | undefined;
const result = await penetrationService.getPenetrationHeatmap(brandName);
res.json(result);
} catch (error) {
console.error('[Analytics] Heatmap error:', error);
res.status(500).json({ error: 'Failed to fetch heatmap data' });
}
});
// ============================================================
// CATEGORY ANALYTICS
// ============================================================
/**
* GET /api/analytics/category/summary
* Get category summary
*/
router.get('/category/summary', async (req: Request, res: Response) => {
try {
const category = req.query.category as string | undefined;
const filters = {
state: req.query.state as string | undefined,
storeId: req.query.storeId ? parseInt(req.query.storeId as string) : undefined,
};
const result = await categoryService.getCategorySummary(category, filters);
res.json(result);
} catch (error) {
console.error('[Analytics] Category summary error:', error);
res.status(500).json({ error: 'Failed to fetch category summary' });
}
});
/**
* GET /api/analytics/category/growth
* Get category growth data
*/
router.get('/category/growth', async (req: Request, res: Response) => {
try {
const days = req.query.days ? parseInt(req.query.days as string) : 7;
const filters = {
state: req.query.state as string | undefined,
storeId: req.query.storeId ? parseInt(req.query.storeId as string) : undefined,
minSkus: req.query.minSkus ? parseInt(req.query.minSkus as string) : 10,
};
const result = await categoryService.getCategoryGrowth(days, filters);
res.json(result);
} catch (error) {
console.error('[Analytics] Category growth error:', error);
res.status(500).json({ error: 'Failed to fetch category growth' });
}
});
/**
* GET /api/analytics/category/trend/:category
* Get category growth trend over time
*/
router.get('/category/trend/:category', async (req: Request, res: Response) => {
try {
const category = req.params.category;
const days = req.query.days ? parseInt(req.query.days as string) : 90;
const result = await categoryService.getCategoryGrowthTrend(category, days);
res.json(result);
} catch (error) {
console.error('[Analytics] Category trend error:', error);
res.status(500).json({ error: 'Failed to fetch category trend' });
}
});
/**
* GET /api/analytics/category/heatmap
* Get category heatmap data
*/
router.get('/category/heatmap', async (req: Request, res: Response) => {
try {
const metric = (req.query.metric as 'skus' | 'growth' | 'price') || 'skus';
const periods = req.query.periods ? parseInt(req.query.periods as string) : 12;
const result = await categoryService.getCategoryHeatmap(metric, periods);
res.json(result);
} catch (error) {
console.error('[Analytics] Category heatmap error:', error);
res.status(500).json({ error: 'Failed to fetch heatmap' });
}
});
/**
* GET /api/analytics/category/top-movers
* Get top growing and declining categories
*/
router.get('/category/top-movers', async (req: Request, res: Response) => {
try {
const limit = req.query.limit ? parseInt(req.query.limit as string) : 5;
const days = req.query.days ? parseInt(req.query.days as string) : 30;
const result = await categoryService.getTopMovers(limit, days);
res.json(result);
} catch (error) {
console.error('[Analytics] Top movers error:', error);
res.status(500).json({ error: 'Failed to fetch top movers' });
}
});
/**
* GET /api/analytics/category/:category/subcategories
* Get subcategory breakdown
*/
router.get('/category/:category/subcategories', async (req: Request, res: Response) => {
try {
const category = req.params.category;
const result = await categoryService.getSubcategoryBreakdown(category);
res.json(result);
} catch (error) {
console.error('[Analytics] Subcategory error:', error);
res.status(500).json({ error: 'Failed to fetch subcategories' });
}
});
// ============================================================
// STORE CHANGE TRACKING
// ============================================================
/**
* GET /api/analytics/store/:id/summary
* Get change summary for a store
*/
router.get('/store/:id/summary', async (req: Request, res: Response) => {
try {
const storeId = parseInt(req.params.id);
const result = await storeService.getStoreChangeSummary(storeId);
if (!result) {
return res.status(404).json({ error: 'Store not found' });
}
res.json(result);
} catch (error) {
console.error('[Analytics] Store summary error:', error);
res.status(500).json({ error: 'Failed to fetch store summary' });
}
});
/**
* GET /api/analytics/store/:id/events
* Get recent change events for a store
*/
router.get('/store/:id/events', async (req: Request, res: Response) => {
try {
const storeId = parseInt(req.params.id);
const filters = {
eventType: req.query.type as string | undefined,
days: req.query.days ? parseInt(req.query.days as string) : 30,
limit: req.query.limit ? parseInt(req.query.limit as string) : 100,
};
const result = await storeService.getStoreChangeEvents(storeId, filters);
res.json(result);
} catch (error) {
console.error('[Analytics] Store events error:', error);
res.status(500).json({ error: 'Failed to fetch store events' });
}
});
/**
* GET /api/analytics/store/:id/brands/new
* Get new brands added to a store
*/
router.get('/store/:id/brands/new', async (req: Request, res: Response) => {
try {
const storeId = parseInt(req.params.id);
const days = req.query.days ? parseInt(req.query.days as string) : 30;
const result = await storeService.getNewBrands(storeId, days);
res.json(result);
} catch (error) {
console.error('[Analytics] New brands error:', error);
res.status(500).json({ error: 'Failed to fetch new brands' });
}
});
/**
* GET /api/analytics/store/:id/brands/lost
* Get brands lost from a store
*/
router.get('/store/:id/brands/lost', async (req: Request, res: Response) => {
try {
const storeId = parseInt(req.params.id);
const days = req.query.days ? parseInt(req.query.days as string) : 30;
const result = await storeService.getLostBrands(storeId, days);
res.json(result);
} catch (error) {
console.error('[Analytics] Lost brands error:', error);
res.status(500).json({ error: 'Failed to fetch lost brands' });
}
});
/**
* GET /api/analytics/store/:id/products/changes
* Get product changes for a store
*/
router.get('/store/:id/products/changes', async (req: Request, res: Response) => {
try {
const storeId = parseInt(req.params.id);
const changeType = req.query.type as 'added' | 'discontinued' | 'price_drop' | 'price_increase' | 'restocked' | 'out_of_stock' | undefined;
const days = req.query.days ? parseInt(req.query.days as string) : 7;
const result = await storeService.getProductChanges(storeId, changeType, days);
res.json(result);
} catch (error) {
console.error('[Analytics] Product changes error:', error);
res.status(500).json({ error: 'Failed to fetch product changes' });
}
});
/**
* GET /api/analytics/store/leaderboard/:category
* Get category leaderboard across stores
*/
router.get('/store/leaderboard/:category', async (req: Request, res: Response) => {
try {
const category = req.params.category;
const limit = req.query.limit ? parseInt(req.query.limit as string) : 20;
const result = await storeService.getCategoryLeaderboard(category, limit);
res.json(result);
} catch (error) {
console.error('[Analytics] Leaderboard error:', error);
res.status(500).json({ error: 'Failed to fetch leaderboard' });
}
});
/**
* GET /api/analytics/store/most-active
* Get most active stores (by changes)
*/
router.get('/store/most-active', async (req: Request, res: Response) => {
try {
const days = req.query.days ? parseInt(req.query.days as string) : 7;
const limit = req.query.limit ? parseInt(req.query.limit as string) : 10;
const result = await storeService.getMostActiveStores(days, limit);
res.json(result);
} catch (error) {
console.error('[Analytics] Most active error:', error);
res.status(500).json({ error: 'Failed to fetch active stores' });
}
});
/**
* GET /api/analytics/store/compare
* Compare two stores
*/
router.get('/store/compare', async (req: Request, res: Response) => {
try {
const store1 = parseInt(req.query.store1 as string);
const store2 = parseInt(req.query.store2 as string);
if (!store1 || !store2) {
return res.status(400).json({ error: 'Both store1 and store2 are required' });
}
const result = await storeService.compareStores(store1, store2);
res.json(result);
} catch (error) {
console.error('[Analytics] Compare stores error:', error);
res.status(500).json({ error: 'Failed to compare stores' });
}
});
// ============================================================
// BRAND OPPORTUNITY / RISK
// ============================================================
/**
* GET /api/analytics/brand/:name/opportunity
* Get full opportunity analysis for a brand
*/
router.get('/brand/:name/opportunity', async (req: Request, res: Response) => {
try {
const brandName = req.params.name;
const result = await brandOpportunityService.getBrandOpportunity(brandName);
res.json(result);
} catch (error) {
console.error('[Analytics] Brand opportunity error:', error);
res.status(500).json({ error: 'Failed to fetch brand opportunity' });
}
});
/**
* GET /api/analytics/brand/:name/position
* Get market position summary for a brand
*/
router.get('/brand/:name/position', async (req: Request, res: Response) => {
try {
const brandName = req.params.name;
const result = await brandOpportunityService.getMarketPositionSummary(brandName);
res.json(result);
} catch (error) {
console.error('[Analytics] Brand position error:', error);
res.status(500).json({ error: 'Failed to fetch brand position' });
}
});
// ============================================================
// ALERTS
// ============================================================
/**
* GET /api/analytics/alerts
* Get analytics alerts
*/
router.get('/alerts', async (req: Request, res: Response) => {
try {
const filters = {
brandName: req.query.brand as string | undefined,
storeId: req.query.storeId ? parseInt(req.query.storeId as string) : undefined,
alertType: req.query.type as string | undefined,
unreadOnly: req.query.unreadOnly === 'true',
limit: req.query.limit ? parseInt(req.query.limit as string) : 50,
};
const result = await brandOpportunityService.getAlerts(filters);
res.json(result);
} catch (error) {
console.error('[Analytics] Alerts error:', error);
res.status(500).json({ error: 'Failed to fetch alerts' });
}
});
/**
* POST /api/analytics/alerts/mark-read
* Mark alerts as read
*/
router.post('/alerts/mark-read', async (req: Request, res: Response) => {
try {
const { alertIds } = req.body;
if (!Array.isArray(alertIds)) {
return res.status(400).json({ error: 'alertIds must be an array' });
}
await brandOpportunityService.markAlertsRead(alertIds);
res.json({ success: true });
} catch (error) {
console.error('[Analytics] Mark read error:', error);
res.status(500).json({ error: 'Failed to mark alerts as read' });
}
});
// ============================================================
// CACHE MANAGEMENT
// ============================================================
/**
* GET /api/analytics/cache/stats
* Get cache statistics
*/
router.get('/cache/stats', async (_req: Request, res: Response) => {
try {
const stats = await cache.getStats();
res.json(stats);
} catch (error) {
console.error('[Analytics] Cache stats error:', error);
res.status(500).json({ error: 'Failed to get cache stats' });
}
});
/**
* POST /api/analytics/cache/clear
* Clear cache (admin only)
*/
router.post('/cache/clear', async (req: Request, res: Response) => {
try {
const pattern = req.query.pattern as string | undefined;
if (pattern) {
const cleared = await cache.invalidatePattern(pattern);
res.json({ success: true, clearedCount: cleared });
} else {
await cache.cleanExpired();
res.json({ success: true, message: 'Expired entries cleaned' });
}
} catch (error) {
console.error('[Analytics] Cache clear error:', error);
res.status(500).json({ error: 'Failed to clear cache' });
}
});
// ============================================================
// SNAPSHOT CAPTURE (for cron/scheduled jobs)
// ============================================================
/**
* POST /api/analytics/snapshots/capture
* Capture daily snapshots (run by scheduler)
*/
router.post('/snapshots/capture', async (_req: Request, res: Response) => {
try {
const [brandResult, categoryResult] = await Promise.all([
pool.query('SELECT capture_brand_snapshots() as count'),
pool.query('SELECT capture_category_snapshots() as count'),
]);
res.json({
success: true,
brandSnapshots: parseInt(brandResult.rows[0]?.count || '0'),
categorySnapshots: parseInt(categoryResult.rows[0]?.count || '0'),
});
} catch (error) {
console.error('[Analytics] Snapshot capture error:', error);
res.status(500).json({ error: 'Failed to capture snapshots' });
}
});
return router;
}
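
Most handlers in this router parse optional numeric query params inline with `parseInt` and a ternary default, which lets a value such as `?days=abc` reach the service layer as `NaN`. One possible way to centralize that parsing is sketched below. The helper names `parseIntParam` and `parseOptionalIntParam` are hypothetical and are not part of the repository; falling back to the default on malformed input is an assumption about the desired behavior, and a stricter variant could return a 400 instead.

```typescript
/**
 * Parse an optional integer query value, falling back to a default.
 * `raw` is typed as unknown because Express query values may be strings,
 * arrays, or nested objects.
 */
function parseIntParam(raw: unknown, fallback: number): number {
  if (typeof raw !== 'string' || raw.trim() === '') {
    return fallback;
  }
  const parsed = Number.parseInt(raw, 10);
  return Number.isNaN(parsed) ? fallback : parsed;
}

/** Same idea for values with no default, such as an optional storeId filter. */
function parseOptionalIntParam(raw: unknown): number | undefined {
  if (typeof raw !== 'string' || raw.trim() === '') {
    return undefined;
  }
  const parsed = Number.parseInt(raw, 10);
  return Number.isNaN(parsed) ? undefined : parsed;
}
```

With helpers like these, a handler line such as `const days = req.query.days ? parseInt(req.query.days as string) : 30;` could become `const days = parseIntParam(req.query.days, 30);`, and the explicit radix avoids any ambiguity in how the string is interpreted.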

File diff suppressed because it is too large

@@ -1,486 +0,0 @@
#!/usr/bin/env npx tsx
/**
* Crawler Reliability Stress Test
*
* Simulates various failure scenarios to test:
* - Retry logic with exponential backoff
* - Error taxonomy classification
* - Self-healing (proxy/UA rotation)
* - Status transitions (active -> degraded -> failed)
* - Minimum crawl gap enforcement
*
* Phase 1: Crawler Reliability & Stabilization
*
* Usage:
* DATABASE_URL="postgresql://..." npx tsx src/dutchie-az/scripts/stress-test.ts [test-name]
*
* Available tests:
* retry - Test retry manager with various error types
* backoff - Test exponential backoff calculation
* status - Test status transitions
* gap - Test minimum crawl gap enforcement
* rotation - Test proxy/UA rotation
* all - Run all tests
*/
import {
CrawlErrorCode,
classifyError,
isRetryable,
shouldRotateProxy,
shouldRotateUserAgent,
getBackoffMultiplier,
getErrorMetadata,
} from '../services/error-taxonomy';
import {
RetryManager,
withRetry,
calculateNextCrawlDelay,
calculateNextCrawlAt,
determineCrawlStatus,
shouldAttemptRecovery,
sleep,
} from '../services/retry-manager';
import {
UserAgentRotator,
USER_AGENTS,
} from '../services/proxy-rotator';
import {
validateStoreConfig,
isCrawlable,
DEFAULT_CONFIG,
RawStoreConfig,
} from '../services/store-validator';
// ============================================================
// TEST UTILITIES
// ============================================================
let testsPassed = 0;
let testsFailed = 0;
function assert(condition: boolean, message: string): void {
if (condition) {
console.log(`  PASS: ${message}`);
testsPassed++;
} else {
console.log(`  FAIL: ${message}`);
testsFailed++;
}
}
function section(name: string): void {
console.log(`\n${'='.repeat(60)}`);
console.log(`TEST: ${name}`);
console.log('='.repeat(60));
}
// ============================================================
// TEST: Error Classification
// ============================================================
function testErrorClassification(): void {
section('Error Classification');
// HTTP status codes
assert(classifyError(null, 429) === CrawlErrorCode.RATE_LIMITED, '429 -> RATE_LIMITED');
assert(classifyError(null, 407) === CrawlErrorCode.BLOCKED_PROXY, '407 -> BLOCKED_PROXY');
assert(classifyError(null, 401) === CrawlErrorCode.AUTH_FAILED, '401 -> AUTH_FAILED');
assert(classifyError(null, 403) === CrawlErrorCode.AUTH_FAILED, '403 -> AUTH_FAILED');
assert(classifyError(null, 503) === CrawlErrorCode.SERVICE_UNAVAILABLE, '503 -> SERVICE_UNAVAILABLE');
assert(classifyError(null, 500) === CrawlErrorCode.SERVER_ERROR, '500 -> SERVER_ERROR');
// Error messages
assert(classifyError('rate limit exceeded') === CrawlErrorCode.RATE_LIMITED, 'rate limit message -> RATE_LIMITED');
assert(classifyError('request timed out') === CrawlErrorCode.TIMEOUT, 'timeout message -> TIMEOUT');
assert(classifyError('proxy blocked') === CrawlErrorCode.BLOCKED_PROXY, 'proxy blocked -> BLOCKED_PROXY');
assert(classifyError('ECONNREFUSED') === CrawlErrorCode.NETWORK_ERROR, 'ECONNREFUSED -> NETWORK_ERROR');
assert(classifyError('ENOTFOUND') === CrawlErrorCode.DNS_ERROR, 'ENOTFOUND -> DNS_ERROR');
assert(classifyError('selector not found') === CrawlErrorCode.HTML_CHANGED, 'selector error -> HTML_CHANGED');
assert(classifyError('JSON parse error') === CrawlErrorCode.PARSE_ERROR, 'parse error -> PARSE_ERROR');
assert(classifyError('0 products found') === CrawlErrorCode.NO_PRODUCTS, 'no products -> NO_PRODUCTS');
// Retryability
assert(isRetryable(CrawlErrorCode.RATE_LIMITED) === true, 'RATE_LIMITED is retryable');
assert(isRetryable(CrawlErrorCode.TIMEOUT) === true, 'TIMEOUT is retryable');
assert(isRetryable(CrawlErrorCode.HTML_CHANGED) === false, 'HTML_CHANGED is NOT retryable');
assert(isRetryable(CrawlErrorCode.INVALID_CONFIG) === false, 'INVALID_CONFIG is NOT retryable');
// Rotation decisions
assert(shouldRotateProxy(CrawlErrorCode.BLOCKED_PROXY) === true, 'BLOCKED_PROXY -> rotate proxy');
assert(shouldRotateProxy(CrawlErrorCode.RATE_LIMITED) === true, 'RATE_LIMITED -> rotate proxy');
assert(shouldRotateUserAgent(CrawlErrorCode.AUTH_FAILED) === true, 'AUTH_FAILED -> rotate UA');
}
// ============================================================
// TEST: Retry Manager
// ============================================================
function testRetryManager(): void {
section('Retry Manager');
const manager = new RetryManager({ maxRetries: 3, baseBackoffMs: 100 });
// Initial state
assert(manager.shouldAttempt() === true, 'Should attempt initially');
assert(manager.getAttemptNumber() === 1, 'Attempt number starts at 1');
// First attempt
manager.recordAttempt();
assert(manager.getAttemptNumber() === 2, 'Attempt number increments');
// Evaluate retryable error
const decision1 = manager.evaluateError(new Error('rate limit exceeded'), 429);
assert(decision1.shouldRetry === true, 'Should retry on rate limit');
assert(decision1.errorCode === CrawlErrorCode.RATE_LIMITED, 'Error code is RATE_LIMITED');
assert(decision1.rotateProxy === true, 'Should rotate proxy');
assert(decision1.backoffMs > 0, 'Backoff is positive');
// More attempts
manager.recordAttempt();
manager.recordAttempt();
// Now at max retries
const decision2 = manager.evaluateError(new Error('timeout'), 504);
assert(decision2.shouldRetry === true, 'Should still retry (at limit but not exceeded)');
manager.recordAttempt();
const decision3 = manager.evaluateError(new Error('timeout'));
assert(decision3.shouldRetry === false, 'Should NOT retry after max');
assert(decision3.reason.includes('exhausted'), 'Reason mentions exhausted');
// Reset
manager.reset();
assert(manager.shouldAttempt() === true, 'Should attempt after reset');
assert(manager.getAttemptNumber() === 1, 'Attempt number resets');
// Non-retryable error
const manager2 = new RetryManager({ maxRetries: 3 });
manager2.recordAttempt();
const nonRetryable = manager2.evaluateError(new Error('HTML structure changed'));
assert(nonRetryable.shouldRetry === false, 'Non-retryable error stops immediately');
assert(nonRetryable.errorCode === CrawlErrorCode.HTML_CHANGED, 'Error code is HTML_CHANGED');
}
// ============================================================
// TEST: Exponential Backoff
// ============================================================
function testExponentialBackoff(): void {
section('Exponential Backoff');
// Calculate next crawl delay
const delay0 = calculateNextCrawlDelay(0, 240); // No failures
const delay1 = calculateNextCrawlDelay(1, 240); // 1 failure
const delay2 = calculateNextCrawlDelay(2, 240); // 2 failures
const delay3 = calculateNextCrawlDelay(3, 240); // 3 failures
const delay5 = calculateNextCrawlDelay(5, 240); // 5 failures (should cap)
console.log(` Delay with 0 failures: ${delay0} minutes`);
console.log(` Delay with 1 failure: ${delay1} minutes`);
console.log(` Delay with 2 failures: ${delay2} minutes`);
console.log(` Delay with 3 failures: ${delay3} minutes`);
console.log(` Delay with 5 failures: ${delay5} minutes`);
assert(delay1 > delay0, 'Delay increases with failures');
assert(delay2 > delay1, 'Delay keeps increasing');
assert(delay3 > delay2, 'More delay with more failures');
// Jitter makes exact values vary, so allow 20% headroom above the 4x cap
assert(delay5 <= 240 * 4 * 1.2, 'Delay is capped at max multiplier');
// Next crawl time calculation
const now = new Date();
const nextAt = calculateNextCrawlAt(2, 240);
assert(nextAt > now, 'Next crawl is in future');
assert(nextAt.getTime() - now.getTime() > 240 * 60 * 1000, 'Includes backoff');
}
// ============================================================
// TEST: Status Transitions
// ============================================================
function testStatusTransitions(): void {
section('Status Transitions');
// Active status
assert(determineCrawlStatus(0) === 'active', '0 failures -> active');
assert(determineCrawlStatus(1) === 'active', '1 failure -> active');
assert(determineCrawlStatus(2) === 'active', '2 failures -> active');
// Degraded status
assert(determineCrawlStatus(3) === 'degraded', '3 failures -> degraded');
assert(determineCrawlStatus(5) === 'degraded', '5 failures -> degraded');
assert(determineCrawlStatus(9) === 'degraded', '9 failures -> degraded');
// Failed status
assert(determineCrawlStatus(10) === 'failed', '10 failures -> failed');
assert(determineCrawlStatus(15) === 'failed', '15 failures -> failed');
// Custom thresholds
const customStatus = determineCrawlStatus(5, { degraded: 5, failed: 8 });
assert(customStatus === 'degraded', 'Custom threshold: 5 -> degraded');
// Recovery check
const recentFailure = new Date(Date.now() - 1 * 60 * 60 * 1000); // 1 hour ago
const oldFailure = new Date(Date.now() - 48 * 60 * 60 * 1000); // 48 hours ago
assert(shouldAttemptRecovery(recentFailure, 1) === false, 'No recovery for recent failure');
assert(shouldAttemptRecovery(oldFailure, 1) === true, 'Recovery allowed for old failure');
assert(shouldAttemptRecovery(null, 0) === true, 'Recovery allowed if no previous failure');
}
// ============================================================
// TEST: Store Validation
// ============================================================
function testStoreValidation(): void {
section('Store Validation');
// Valid config
const validConfig: RawStoreConfig = {
id: 1,
name: 'Test Store',
platformDispensaryId: '123abc',
menuType: 'dutchie',
};
const validResult = validateStoreConfig(validConfig);
assert(validResult.isValid === true, 'Valid config passes');
assert(validResult.config !== null, 'Valid config returns config');
assert(validResult.config?.slug === 'test-store', 'Slug is generated');
// Missing required fields
const missingId: RawStoreConfig = {
id: 0,
name: 'Test',
platformDispensaryId: '123',
menuType: 'dutchie',
};
const missingIdResult = validateStoreConfig(missingId);
assert(missingIdResult.isValid === false, 'Missing ID fails');
// Missing platform ID
const missingPlatform: RawStoreConfig = {
id: 1,
name: 'Test',
menuType: 'dutchie',
};
const missingPlatformResult = validateStoreConfig(missingPlatform);
assert(missingPlatformResult.isValid === false, 'Missing platform ID fails');
// Unknown menu type
const unknownMenu: RawStoreConfig = {
id: 1,
name: 'Test',
platformDispensaryId: '123',
menuType: 'unknown',
};
const unknownMenuResult = validateStoreConfig(unknownMenu);
assert(unknownMenuResult.isValid === false, 'Unknown menu type fails');
// Crawlable check
assert(isCrawlable(validConfig) === true, 'Valid config is crawlable');
assert(isCrawlable(missingPlatform) === false, 'Missing platform not crawlable');
assert(isCrawlable({ ...validConfig, crawlStatus: 'failed' }) === false, 'Failed status not crawlable');
assert(isCrawlable({ ...validConfig, crawlStatus: 'paused' }) === false, 'Paused status not crawlable');
}
// ============================================================
// TEST: User Agent Rotation
// ============================================================
function testUserAgentRotation(): void {
section('User Agent Rotation');
const rotator = new UserAgentRotator();
const first = rotator.getCurrent();
const second = rotator.getNext();
const third = rotator.getNext();
assert(first !== second, 'User agents rotate');
assert(second !== third, 'User agents keep rotating');
assert(USER_AGENTS.includes(first), 'Returns valid UA');
assert(USER_AGENTS.includes(second), 'Returns valid UA');
// Random UA
const random = rotator.getRandom();
assert(USER_AGENTS.includes(random), 'Random returns valid UA');
// Count
assert(rotator.getCount() === USER_AGENTS.length, 'Reports correct count');
}
// ============================================================
// TEST: WithRetry Helper
// ============================================================
async function testWithRetryHelper(): Promise<void> {
section('WithRetry Helper');
// Successful on first try
let attempts = 0;
const successResult = await withRetry(async () => {
attempts++;
return 'success';
}, { maxRetries: 3 });
assert(attempts === 1, 'Succeeds on first try');
assert(successResult.result === 'success', 'Returns result');
// Fails then succeeds
let failThenSucceedAttempts = 0;
const failThenSuccessResult = await withRetry(async () => {
failThenSucceedAttempts++;
if (failThenSucceedAttempts < 3) {
throw new Error('temporary error');
}
return 'finally succeeded';
}, { maxRetries: 5, baseBackoffMs: 10 });
assert(failThenSucceedAttempts === 3, 'Retries until success');
assert(failThenSuccessResult.result === 'finally succeeded', 'Returns final result');
assert(failThenSuccessResult.summary.attemptsMade === 3, 'Summary tracks attempts');
// Exhausts retries
let alwaysFailAttempts = 0;
try {
await withRetry(async () => {
alwaysFailAttempts++;
throw new Error('always fails');
}, { maxRetries: 2, baseBackoffMs: 10 });
assert(false, 'Should have thrown');
} catch (error: any) {
assert(alwaysFailAttempts === 3, 'Attempts all retries'); // 1 initial + 2 retries
assert(error.name === 'RetryExhaustedError', 'Throws RetryExhaustedError');
}
// Non-retryable error stops immediately
let nonRetryableAttempts = 0;
try {
await withRetry(async () => {
nonRetryableAttempts++;
const err = new Error('HTML structure changed - selector not found');
throw err;
}, { maxRetries: 3, baseBackoffMs: 10 });
assert(false, 'Should have thrown');
} catch {
assert(nonRetryableAttempts === 1, 'Non-retryable stops immediately');
}
}
// ============================================================
// TEST: Minimum Crawl Gap
// ============================================================
function testMinimumCrawlGap(): void {
section('Minimum Crawl Gap');
// Default config
assert(DEFAULT_CONFIG.minCrawlGapMinutes === 2, 'Default gap is 2 minutes');
assert(DEFAULT_CONFIG.crawlFrequencyMinutes === 240, 'Default frequency is 4 hours');
// Gap calculation
const gapMs = DEFAULT_CONFIG.minCrawlGapMinutes * 60 * 1000;
assert(gapMs === 120000, 'Gap is 2 minutes in ms');
console.log(' Note: Gap enforcement is tested at DB level (trigger) and application level');
}
// ============================================================
// TEST: Error Metadata
// ============================================================
function testErrorMetadata(): void {
section('Error Metadata');
// RATE_LIMITED
const rateLimited = getErrorMetadata(CrawlErrorCode.RATE_LIMITED);
assert(rateLimited.retryable === true, 'RATE_LIMITED is retryable');
assert(rateLimited.rotateProxy === true, 'RATE_LIMITED rotates proxy');
assert(rateLimited.backoffMultiplier === 2.0, 'RATE_LIMITED has 2x backoff');
assert(rateLimited.severity === 'medium', 'RATE_LIMITED is medium severity');
// HTML_CHANGED
const htmlChanged = getErrorMetadata(CrawlErrorCode.HTML_CHANGED);
assert(htmlChanged.retryable === false, 'HTML_CHANGED is NOT retryable');
assert(htmlChanged.severity === 'high', 'HTML_CHANGED is high severity');
// INVALID_CONFIG
const invalidConfig = getErrorMetadata(CrawlErrorCode.INVALID_CONFIG);
assert(invalidConfig.retryable === false, 'INVALID_CONFIG is NOT retryable');
assert(invalidConfig.severity === 'critical', 'INVALID_CONFIG is critical');
}
// ============================================================
// MAIN
// ============================================================
async function runTests(testName?: string): Promise<void> {
console.log('\n');
console.log('╔══════════════════════════════════════════════════════════╗');
console.log('║ CRAWLER RELIABILITY STRESS TEST - PHASE 1 ║');
console.log('╚══════════════════════════════════════════════════════════╝');
const allTests = !testName || testName === 'all';
if (allTests || testName === 'error' || testName === 'classification') {
testErrorClassification();
}
if (allTests || testName === 'retry') {
testRetryManager();
}
if (allTests || testName === 'backoff') {
testExponentialBackoff();
}
if (allTests || testName === 'status') {
testStatusTransitions();
}
if (allTests || testName === 'validation' || testName === 'store') {
testStoreValidation();
}
if (allTests || testName === 'rotation' || testName === 'ua') {
testUserAgentRotation();
}
if (allTests || testName === 'withRetry' || testName === 'helper') {
await testWithRetryHelper();
}
if (allTests || testName === 'gap') {
testMinimumCrawlGap();
}
if (allTests || testName === 'metadata') {
testErrorMetadata();
}
// Summary
console.log('\n');
console.log('═'.repeat(60));
console.log('SUMMARY');
console.log('═'.repeat(60));
console.log(` Passed: ${testsPassed}`);
console.log(` Failed: ${testsFailed}`);
console.log(` Total: ${testsPassed + testsFailed}`);
if (testsFailed > 0) {
console.log('\n❌ SOME TESTS FAILED\n');
process.exit(1);
} else {
console.log('\n✅ ALL TESTS PASSED\n');
process.exit(0);
}
}
// Run tests
const testName = process.argv[2];
runTests(testName).catch((error) => {
console.error('Fatal error:', error);
process.exit(1);
});
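
The thresholds asserted in testStatusTransitions (3 failures -> degraded, 10 failures -> failed, overridable per call) can be sketched as a small pure function. This is a hypothetical reimplementation for illustration only: the real determineCrawlStatus is not shown in this diff, and the names sketchCrawlStatus, StatusThresholds and ASSUMED_THRESHOLDS are assumptions, with defaults inferred from the assertions rather than read from the source.

```typescript
// Hypothetical sketch; the real determineCrawlStatus is not part of this diff.
type CrawlStatus = 'active' | 'degraded' | 'failed';

interface StatusThresholds {
  degraded: number; // consecutive failures at which a store becomes degraded
  failed: number;   // consecutive failures at which a store becomes failed
}

// Assumed defaults, inferred from the test assertions (3 -> degraded, 10 -> failed).
const ASSUMED_THRESHOLDS: StatusThresholds = { degraded: 3, failed: 10 };

function sketchCrawlStatus(
  consecutiveFailures: number,
  thresholds: StatusThresholds = ASSUMED_THRESHOLDS
): CrawlStatus {
  // Check the stricter threshold first so 'failed' wins over 'degraded'.
  if (consecutiveFailures >= thresholds.failed) return 'failed';
  if (consecutiveFailures >= thresholds.degraded) return 'degraded';
  return 'active';
}
```

Checking the higher threshold first keeps the function correct even when a caller passes custom thresholds that sit close together, as in the `{ degraded: 5, failed: 8 }` case the tests use.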


@@ -1,659 +0,0 @@
/**
* Brand Opportunity / Risk Analytics Service
*
* Provides brand-level opportunity and risk analysis including:
* - Under/overpriced vs market
* - Missing SKU opportunities
* - Stores with declining/growing shelf share
* - Competitor intrusion alerts
*
* Phase 3: Analytics Dashboards
*/
import { Pool } from 'pg';
import { AnalyticsCache, cacheKey } from './cache';
export interface BrandOpportunity {
brandName: string;
underpricedVsMarket: PricePosition[];
overpricedVsMarket: PricePosition[];
missingSkuOpportunities: MissingSkuOpportunity[];
storesWithDecliningShelfShare: StoreShelfShareChange[];
storesWithGrowingShelfShare: StoreShelfShareChange[];
competitorIntrusionAlerts: CompetitorAlert[];
overallScore: number; // 0-100, higher = more opportunity
riskScore: number; // 0-100, higher = more risk
}
export interface PricePosition {
category: string;
brandAvgPrice: number;
marketAvgPrice: number;
priceDifferencePercent: number;
skuCount: number;
suggestion: string;
}
export interface MissingSkuOpportunity {
category: string;
subcategory: string | null;
marketSkuCount: number;
brandSkuCount: number;
gapPercent: number;
topCompetitors: string[];
opportunityScore: number; // 0-100
}
export interface StoreShelfShareChange {
storeId: number;
storeName: string;
city: string;
state: string;
currentShelfShare: number;
previousShelfShare: number;
changePercent: number;
currentSkus: number;
competitors: string[];
}
export interface CompetitorAlert {
competitorBrand: string;
storeId: number;
storeName: string;
alertType: 'new_entry' | 'expanding' | 'price_undercut';
details: string;
severity: 'low' | 'medium' | 'high';
date: string;
}
export interface MarketPositionSummary {
brandName: string;
marketSharePercent: number;
avgPriceVsMarket: number; // -X% to +X%
categoryStrengths: Array<{ category: string; shelfSharePercent: number }>;
categoryWeaknesses: Array<{ category: string; shelfSharePercent: number; marketLeader: string }>;
growthTrend: 'growing' | 'stable' | 'declining';
competitorThreats: string[];
}
export class BrandOpportunityService {
private pool: Pool;
private cache: AnalyticsCache;
constructor(pool: Pool, cache: AnalyticsCache) {
this.pool = pool;
this.cache = cache;
}
/**
* Get full opportunity analysis for a brand
*/
async getBrandOpportunity(brandName: string): Promise<BrandOpportunity> {
const key = cacheKey('brand_opportunity', { brandName });
return (await this.cache.getOrCompute(key, async () => {
const [
underpriced,
overpriced,
missingSkus,
decliningStores,
growingStores,
alerts,
] = await Promise.all([
this.getUnderpricedPositions(brandName),
this.getOverpricedPositions(brandName),
this.getMissingSkuOpportunities(brandName),
this.getStoresWithDecliningShare(brandName),
this.getStoresWithGrowingShare(brandName),
this.getCompetitorAlerts(brandName),
]);
// Calculate opportunity score (higher = more opportunity)
const opportunityFactors = [
missingSkus.length > 0 ? 20 : 0,
underpriced.length > 0 ? 15 : 0,
growingStores.length > 5 ? 20 : growingStores.length * 3,
missingSkus.reduce((sum, m) => sum + m.opportunityScore, 0) / Math.max(1, missingSkus.length) * 0.3,
];
const opportunityScore = Math.min(100, opportunityFactors.reduce((a, b) => a + b, 0));
// Calculate risk score (higher = more risk)
const riskFactors = [
decliningStores.length > 5 ? 30 : decliningStores.length * 5,
alerts.filter(a => a.severity === 'high').length * 15,
alerts.filter(a => a.severity === 'medium').length * 8,
overpriced.length > 3 ? 15 : overpriced.length * 3,
];
const riskScore = Math.min(100, riskFactors.reduce((a, b) => a + b, 0));
return {
brandName,
underpricedVsMarket: underpriced,
overpricedVsMarket: overpriced,
missingSkuOpportunities: missingSkus,
storesWithDecliningShelfShare: decliningStores,
storesWithGrowingShelfShare: growingStores,
competitorIntrusionAlerts: alerts,
overallScore: Math.round(opportunityScore),
riskScore: Math.round(riskScore),
};
}, 30)).data;
}
/**
* Get categories where brand is underpriced vs market
*/
async getUnderpricedPositions(brandName: string): Promise<PricePosition[]> {
const result = await this.pool.query(`
WITH brand_prices AS (
SELECT
type as category,
AVG(extract_min_price(latest_raw_payload)) as brand_avg,
COUNT(*) as sku_count
FROM dutchie_products
WHERE brand_name = $1 AND type IS NOT NULL
GROUP BY type
HAVING COUNT(*) >= 3
),
market_prices AS (
SELECT
type as category,
AVG(extract_min_price(latest_raw_payload)) as market_avg
FROM dutchie_products
WHERE type IS NOT NULL AND brand_name != $1
GROUP BY type
)
SELECT
bp.category,
bp.brand_avg,
mp.market_avg,
bp.sku_count,
((bp.brand_avg - mp.market_avg) / NULLIF(mp.market_avg, 0)) * 100 as diff_pct
FROM brand_prices bp
JOIN market_prices mp ON bp.category = mp.category
WHERE bp.brand_avg < mp.market_avg * 0.9 -- 10% or more below market
AND bp.brand_avg IS NOT NULL
AND mp.market_avg IS NOT NULL
ORDER BY diff_pct
`, [brandName]);
return result.rows.map(row => ({
category: row.category,
brandAvgPrice: Math.round(parseFloat(row.brand_avg) * 100) / 100,
marketAvgPrice: Math.round(parseFloat(row.market_avg) * 100) / 100,
priceDifferencePercent: Math.round(parseFloat(row.diff_pct) * 10) / 10,
skuCount: parseInt(row.sku_count) || 0,
suggestion: `Consider price increase - ${Math.abs(Math.round(parseFloat(row.diff_pct)))}% below market average`,
}));
}
/**
* Get categories where brand is overpriced vs market
*/
async getOverpricedPositions(brandName: string): Promise<PricePosition[]> {
const result = await this.pool.query(`
WITH brand_prices AS (
SELECT
type as category,
AVG(extract_min_price(latest_raw_payload)) as brand_avg,
COUNT(*) as sku_count
FROM dutchie_products
WHERE brand_name = $1 AND type IS NOT NULL
GROUP BY type
HAVING COUNT(*) >= 3
),
market_prices AS (
SELECT
type as category,
AVG(extract_min_price(latest_raw_payload)) as market_avg
FROM dutchie_products
WHERE type IS NOT NULL AND brand_name != $1
GROUP BY type
)
SELECT
bp.category,
bp.brand_avg,
mp.market_avg,
bp.sku_count,
((bp.brand_avg - mp.market_avg) / NULLIF(mp.market_avg, 0)) * 100 as diff_pct
FROM brand_prices bp
JOIN market_prices mp ON bp.category = mp.category
WHERE bp.brand_avg > mp.market_avg * 1.15 -- 15% or more above market
AND bp.brand_avg IS NOT NULL
AND mp.market_avg IS NOT NULL
ORDER BY diff_pct DESC
`, [brandName]);
return result.rows.map(row => ({
category: row.category,
brandAvgPrice: Math.round(parseFloat(row.brand_avg) * 100) / 100,
marketAvgPrice: Math.round(parseFloat(row.market_avg) * 100) / 100,
priceDifferencePercent: Math.round(parseFloat(row.diff_pct) * 10) / 10,
skuCount: parseInt(row.sku_count) || 0,
suggestion: `Price sensitivity risk - ${Math.round(parseFloat(row.diff_pct))}% above market average`,
}));
}
/**
* Get missing SKU opportunities (category gaps)
*/
async getMissingSkuOpportunities(brandName: string): Promise<MissingSkuOpportunity[]> {
const result = await this.pool.query(`
WITH market_categories AS (
SELECT
type as category,
subcategory,
COUNT(*) as market_skus,
ARRAY_AGG(DISTINCT brand_name ORDER BY brand_name) FILTER (WHERE brand_name IS NOT NULL) as top_brands
FROM dutchie_products
WHERE type IS NOT NULL
GROUP BY type, subcategory
HAVING COUNT(*) >= 20
),
brand_presence AS (
SELECT
type as category,
subcategory,
COUNT(*) as brand_skus
FROM dutchie_products
WHERE brand_name = $1 AND type IS NOT NULL
GROUP BY type, subcategory
)
SELECT
mc.category,
mc.subcategory,
mc.market_skus,
COALESCE(bp.brand_skus, 0) as brand_skus,
mc.top_brands[1:5] as competitors
FROM market_categories mc
LEFT JOIN brand_presence bp ON mc.category = bp.category
AND (mc.subcategory = bp.subcategory OR (mc.subcategory IS NULL AND bp.subcategory IS NULL))
WHERE COALESCE(bp.brand_skus, 0) < mc.market_skus * 0.05 -- Brand has <5% of market presence
ORDER BY mc.market_skus DESC
LIMIT 10
`, [brandName]);
return result.rows.map(row => {
const marketSkus = parseInt(row.market_skus) || 0;
const brandSkus = parseInt(row.brand_skus) || 0;
const gapPercent = marketSkus > 0 ? ((marketSkus - brandSkus) / marketSkus) * 100 : 100;
const opportunityScore = Math.min(100, Math.round((marketSkus / 100) * (gapPercent / 100) * 100));
return {
category: row.category,
subcategory: row.subcategory,
marketSkuCount: marketSkus,
brandSkuCount: brandSkus,
gapPercent: Math.round(gapPercent),
topCompetitors: (row.competitors || []).filter((c: string) => c !== brandName).slice(0, 5),
opportunityScore,
};
});
}
/**
* Get stores where brand's shelf share is declining
*/
async getStoresWithDecliningShare(brandName: string): Promise<StoreShelfShareChange[]> {
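// NOTE: no historical data is queried here (brand_snapshots is not used yet).
// Stores with a low current SKU count (< 10) stand in for "declining" share.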
const result = await this.pool.query(`
WITH current_share AS (
SELECT
dp.dispensary_id as store_id,
d.name as store_name,
d.city,
d.state,
COUNT(*) FILTER (WHERE dp.brand_name = $1) as brand_skus,
COUNT(*) as total_skus,
ARRAY_AGG(DISTINCT dp.brand_name) FILTER (WHERE dp.brand_name != $1 AND dp.brand_name IS NOT NULL) as competitors
FROM dutchie_products dp
JOIN dispensaries d ON dp.dispensary_id = d.id
GROUP BY dp.dispensary_id, d.name, d.city, d.state
HAVING COUNT(*) FILTER (WHERE dp.brand_name = $1) > 0
)
SELECT
cs.store_id,
cs.store_name,
cs.city,
cs.state,
cs.brand_skus as current_skus,
cs.total_skus,
ROUND((cs.brand_skus::NUMERIC / cs.total_skus) * 100, 2) as current_share,
cs.competitors[1:5] as top_competitors
FROM current_share cs
WHERE cs.brand_skus < 10 -- Low presence
ORDER BY cs.brand_skus
LIMIT 10
`, [brandName]);
return result.rows.map(row => ({
storeId: row.store_id,
storeName: row.store_name,
city: row.city,
state: row.state,
currentShelfShare: parseFloat(row.current_share) || 0,
previousShelfShare: parseFloat(row.current_share) || 0, // Would need historical data
changePercent: 0,
currentSkus: parseInt(row.current_skus) || 0,
competitors: row.top_competitors || [],
}));
}
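/**
* Get stores where the brand has its highest current shelf share.
* No historical comparison is made yet, so this returns the top 10 stores by current share rather than confirmed growth.
*/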
async getStoresWithGrowingShare(brandName: string): Promise<StoreShelfShareChange[]> {
const result = await this.pool.query(`
WITH store_share AS (
SELECT
dp.dispensary_id as store_id,
d.name as store_name,
d.city,
d.state,
COUNT(*) FILTER (WHERE dp.brand_name = $1) as brand_skus,
COUNT(*) as total_skus,
ARRAY_AGG(DISTINCT dp.brand_name) FILTER (WHERE dp.brand_name != $1 AND dp.brand_name IS NOT NULL) as competitors
FROM dutchie_products dp
JOIN dispensaries d ON dp.dispensary_id = d.id
GROUP BY dp.dispensary_id, d.name, d.city, d.state
HAVING COUNT(*) FILTER (WHERE dp.brand_name = $1) > 0
)
SELECT
ss.store_id,
ss.store_name,
ss.city,
ss.state,
ss.brand_skus as current_skus,
ss.total_skus,
ROUND((ss.brand_skus::NUMERIC / ss.total_skus) * 100, 2) as current_share,
ss.competitors[1:5] as top_competitors
FROM store_share ss
ORDER BY current_share DESC
LIMIT 10
`, [brandName]);
return result.rows.map(row => ({
storeId: row.store_id,
storeName: row.store_name,
city: row.city,
state: row.state,
currentShelfShare: parseFloat(row.current_share) || 0,
previousShelfShare: parseFloat(row.current_share) || 0,
changePercent: 0,
currentSkus: parseInt(row.current_skus) || 0,
competitors: row.top_competitors || [],
}));
}
/**
* Get competitor intrusion alerts
*/
async getCompetitorAlerts(brandName: string): Promise<CompetitorAlert[]> {
// Check for competitor entries in stores where this brand has presence
const result = await this.pool.query(`
WITH brand_stores AS (
SELECT DISTINCT dispensary_id
FROM dutchie_products
WHERE brand_name = $1
),
competitor_presence AS (
SELECT
dp.brand_name as competitor,
dp.dispensary_id as store_id,
d.name as store_name,
COUNT(*) as sku_count,
MAX(dp.created_at) as latest_add
FROM dutchie_products dp
JOIN dispensaries d ON dp.dispensary_id = d.id
WHERE dp.dispensary_id IN (SELECT dispensary_id FROM brand_stores)
AND dp.brand_name != $1
AND dp.brand_name IS NOT NULL
AND dp.created_at >= NOW() - INTERVAL '30 days'
GROUP BY dp.brand_name, dp.dispensary_id, d.name
HAVING COUNT(*) >= 5
)
SELECT
competitor,
store_id,
store_name,
sku_count,
latest_add
FROM competitor_presence
ORDER BY sku_count DESC
LIMIT 10
`, [brandName]);
return result.rows.map(row => {
const skuCount = parseInt(row.sku_count) || 0;
let severity: 'low' | 'medium' | 'high' = 'low';
if (skuCount >= 20) severity = 'high';
else if (skuCount >= 10) severity = 'medium';
return {
competitorBrand: row.competitor,
storeId: row.store_id,
storeName: row.store_name,
alertType: 'expanding' as const,
details: `${row.competitor} has ${skuCount} SKUs in ${row.store_name}`,
severity,
date: new Date(row.latest_add).toISOString().split('T')[0],
};
});
}
/**
* Get market position summary for a brand
*/
async getMarketPositionSummary(brandName: string): Promise<MarketPositionSummary> {
const key = cacheKey('market_position', { brandName });
return (await this.cache.getOrCompute(key, async () => {
const [shareResult, priceResult, categoryResult, threatResult] = await Promise.all([
// Market share
this.pool.query(`
SELECT
(SELECT COUNT(*) FROM dutchie_products WHERE brand_name = $1) as brand_count,
(SELECT COUNT(*) FROM dutchie_products) as total_count
`, [brandName]),
// Price vs market
this.pool.query(`
SELECT
(SELECT AVG(extract_min_price(latest_raw_payload)) FROM dutchie_products WHERE brand_name = $1) as brand_avg,
(SELECT AVG(extract_min_price(latest_raw_payload)) FROM dutchie_products WHERE brand_name != $1) as market_avg
`, [brandName]),
// Category strengths/weaknesses
this.pool.query(`
WITH brand_by_cat AS (
SELECT type as category, COUNT(*) as brand_count
FROM dutchie_products
WHERE brand_name = $1 AND type IS NOT NULL
GROUP BY type
),
market_by_cat AS (
SELECT type as category, COUNT(*) as total_count
FROM dutchie_products WHERE type IS NOT NULL
GROUP BY type
),
leaders AS (
SELECT type as category, brand_name, COUNT(*) as cnt,
RANK() OVER (PARTITION BY type ORDER BY COUNT(*) DESC) as rnk
FROM dutchie_products WHERE type IS NOT NULL AND brand_name IS NOT NULL
GROUP BY type, brand_name
)
SELECT
mc.category,
COALESCE(bc.brand_count, 0) as brand_count,
mc.total_count,
ROUND((COALESCE(bc.brand_count, 0)::NUMERIC / mc.total_count) * 100, 2) as share_pct,
(SELECT brand_name FROM leaders WHERE category = mc.category AND rnk = 1) as leader
FROM market_by_cat mc
LEFT JOIN brand_by_cat bc ON mc.category = bc.category
ORDER BY share_pct DESC
`, [brandName]),
// Top competitors
this.pool.query(`
SELECT brand_name, COUNT(*) as cnt
FROM dutchie_products
WHERE brand_name IS NOT NULL AND brand_name != $1
GROUP BY brand_name
ORDER BY cnt DESC
LIMIT 5
`, [brandName]),
]);
const brandCount = parseInt(shareResult.rows[0]?.brand_count) || 0;
const totalCount = parseInt(shareResult.rows[0]?.total_count) || 1;
const marketSharePercent = Math.round((brandCount / totalCount) * 1000) / 10;
const brandAvg = parseFloat(priceResult.rows[0]?.brand_avg) || 0;
const marketAvg = parseFloat(priceResult.rows[0]?.market_avg) || 1;
const avgPriceVsMarket = Math.round(((brandAvg - marketAvg) / marketAvg) * 1000) / 10;
const categories = categoryResult.rows;
const strengths = categories
.filter(c => parseFloat(c.share_pct) > 5)
.map(c => ({ category: c.category, shelfSharePercent: parseFloat(c.share_pct) }));
const weaknesses = categories
.filter(c => parseFloat(c.share_pct) < 2 && c.leader !== brandName)
.map(c => ({
category: c.category,
shelfSharePercent: parseFloat(c.share_pct),
marketLeader: c.leader || 'Unknown',
}));
return {
brandName,
marketSharePercent,
avgPriceVsMarket,
categoryStrengths: strengths.slice(0, 5),
categoryWeaknesses: weaknesses.slice(0, 5),
growthTrend: 'stable' as const, // Would need historical data
competitorThreats: threatResult.rows.map(r => r.brand_name),
};
}, 30)).data;
}
/**
* Create an analytics alert
*/
async createAlert(alert: {
alertType: string;
severity: 'info' | 'warning' | 'critical';
title: string;
description?: string;
storeId?: number;
brandName?: string;
productId?: number;
category?: string;
metadata?: Record<string, unknown>;
}): Promise<void> {
await this.pool.query(`
INSERT INTO analytics_alerts
(alert_type, severity, title, description, store_id, brand_name, product_id, category, metadata)
VALUES ($1, $2, $3, $4, $5, $6, $7, $8, $9)
`, [
alert.alertType,
alert.severity,
alert.title,
alert.description || null,
alert.storeId || null,
alert.brandName || null,
alert.productId || null,
alert.category || null,
alert.metadata ? JSON.stringify(alert.metadata) : null,
]);
}
/**
* Get recent alerts
*/
async getAlerts(filters: {
brandName?: string;
storeId?: number;
alertType?: string;
unreadOnly?: boolean;
limit?: number;
} = {}): Promise<Array<{
id: number;
alertType: string;
severity: string;
title: string;
description: string | null;
storeName: string | null;
brandName: string | null;
createdAt: string;
isRead: boolean;
}>> {
const { brandName, storeId, alertType, unreadOnly = false, limit = 50 } = filters;
// $1 is reserved for LIMIT, so filter placeholders start at $2
const params: (string | number | boolean)[] = [limit];
const conditions: string[] = [];
let paramIndex = 2;
if (brandName) {
conditions.push(`a.brand_name = $${paramIndex++}`);
params.push(brandName);
}
if (storeId) {
conditions.push(`a.store_id = $${paramIndex++}`);
params.push(storeId);
}
if (alertType) {
conditions.push(`a.alert_type = $${paramIndex++}`);
params.push(alertType);
}
if (unreadOnly) {
conditions.push('a.is_read = false');
}
const whereClause = conditions.length > 0
? 'WHERE ' + conditions.join(' AND ')
: '';
const result = await this.pool.query(`
SELECT
a.id,
a.alert_type,
a.severity,
a.title,
a.description,
d.name as store_name,
a.brand_name,
a.created_at,
a.is_read
FROM analytics_alerts a
LEFT JOIN dispensaries d ON a.store_id = d.id
${whereClause}
ORDER BY a.created_at DESC
LIMIT $1
`, params);
return result.rows.map(row => ({
id: row.id,
alertType: row.alert_type,
severity: row.severity,
title: row.title,
description: row.description,
storeName: row.store_name,
brandName: row.brand_name,
createdAt: row.created_at.toISOString(),
isRead: row.is_read,
}));
}
/**
* Mark alerts as read
*/
async markAlertsRead(alertIds: number[]): Promise<void> {
if (alertIds.length === 0) return;
await this.pool.query(`
UPDATE analytics_alerts
SET is_read = true
WHERE id = ANY($1)
`, [alertIds]);
}
}
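
The opportunity and risk scores computed inside getBrandOpportunity depend only on a handful of counts, so the weighting can be pulled out into a pure function that is easy to test without a database. The sketch below mirrors the weights in the deleted service (20/15/20 plus 0.3 x average gap score for opportunity; 30, 15 per high alert, 8 per medium alert and 15 for risk), but the ScoreInputs shape and the sketchScores name are hypothetical and do not exist in the original code.

```typescript
// Hypothetical input shape: only the counts the scoring needs.
interface ScoreInputs {
  missingSkuScores: number[]; // opportunityScore of each missing-SKU gap
  underpricedCount: number;
  overpricedCount: number;
  growingStoreCount: number;
  decliningStoreCount: number;
  highAlerts: number;
  mediumAlerts: number;
}

const sum = (xs: number[]): number => xs.reduce((a, b) => a + b, 0);

// Assumed helper name; weights copied from getBrandOpportunity.
function sketchScores(i: ScoreInputs): { opportunity: number; risk: number } {
  // Average gap score, guarding against division by zero for an empty list.
  const avgGap = sum(i.missingSkuScores) / Math.max(1, i.missingSkuScores.length);
  const opportunity = sum([
    i.missingSkuScores.length > 0 ? 20 : 0,
    i.underpricedCount > 0 ? 15 : 0,
    i.growingStoreCount > 5 ? 20 : i.growingStoreCount * 3,
    avgGap * 0.3,
  ]);
  const risk = sum([
    i.decliningStoreCount > 5 ? 30 : i.decliningStoreCount * 5,
    i.highAlerts * 15,
    i.mediumAlerts * 8,
    i.overpricedCount > 3 ? 15 : i.overpricedCount * 3,
  ]);
  // Both scores are clamped to 100 and rounded, as in the service.
  return {
    opportunity: Math.round(Math.min(100, opportunity)),
    risk: Math.round(Math.min(100, risk)),
  };
}
```

Because the alert terms are unbounded, a store set with many high-severity alerts saturates the risk score at 100; the clamp is what keeps the result inside the documented 0-100 range.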


@@ -1,227 +0,0 @@
/**
* Analytics Cache Service
*
* Provides caching layer for expensive analytics queries.
* Uses PostgreSQL for persistence with configurable TTLs.
*
* Phase 3: Analytics Dashboards
*/
import { Pool } from 'pg';
export interface CacheEntry<T = unknown> {
key: string;
data: T;
computedAt: Date;
expiresAt: Date;
queryTimeMs?: number;
}
export interface CacheConfig {
defaultTtlMinutes: number;
}
const DEFAULT_CONFIG: CacheConfig = {
defaultTtlMinutes: 15,
};
export class AnalyticsCache {
private pool: Pool;
private config: CacheConfig;
private memoryCache: Map<string, CacheEntry> = new Map();
constructor(pool: Pool, config: Partial<CacheConfig> = {}) {
this.pool = pool;
this.config = { ...DEFAULT_CONFIG, ...config };
}
/**
* Get cached data or compute and cache it
*/
async getOrCompute<T>(
key: string,
computeFn: () => Promise<T>,
ttlMinutes?: number
): Promise<{ data: T; fromCache: boolean; queryTimeMs: number }> {
const ttl = ttlMinutes ?? this.config.defaultTtlMinutes;
// Check memory cache first
const memEntry = this.memoryCache.get(key);
if (memEntry && new Date() < memEntry.expiresAt) {
return { data: memEntry.data as T, fromCache: true, queryTimeMs: memEntry.queryTimeMs || 0 };
}
// Check database cache
const dbEntry = await this.getFromDb<T>(key);
if (dbEntry && new Date() < dbEntry.expiresAt) {
this.memoryCache.set(key, dbEntry);
return { data: dbEntry.data, fromCache: true, queryTimeMs: dbEntry.queryTimeMs || 0 };
}
// Compute fresh data
const startTime = Date.now();
const data = await computeFn();
const queryTimeMs = Date.now() - startTime;
// Cache result
const entry: CacheEntry<T> = {
key,
data,
computedAt: new Date(),
expiresAt: new Date(Date.now() + ttl * 60 * 1000),
queryTimeMs,
};
await this.saveToDb(entry);
this.memoryCache.set(key, entry);
return { data, fromCache: false, queryTimeMs };
}
/**
* Get from database cache
*/
private async getFromDb<T>(key: string): Promise<CacheEntry<T> | null> {
try {
const result = await this.pool.query(`
SELECT cache_data, computed_at, expires_at, query_time_ms
FROM analytics_cache
WHERE cache_key = $1
AND expires_at > NOW()
`, [key]);
if (result.rows.length === 0) return null;
const row = result.rows[0];
return {
key,
data: row.cache_data as T,
computedAt: row.computed_at,
expiresAt: row.expires_at,
queryTimeMs: row.query_time_ms,
};
} catch (error) {
console.warn(`[AnalyticsCache] Failed to get from DB: ${error}`);
return null;
}
}
/**
* Save to database cache
*/
private async saveToDb<T>(entry: CacheEntry<T>): Promise<void> {
try {
await this.pool.query(`
INSERT INTO analytics_cache (cache_key, cache_data, computed_at, expires_at, query_time_ms)
VALUES ($1, $2, $3, $4, $5)
ON CONFLICT (cache_key)
DO UPDATE SET
cache_data = EXCLUDED.cache_data,
computed_at = EXCLUDED.computed_at,
expires_at = EXCLUDED.expires_at,
query_time_ms = EXCLUDED.query_time_ms
`, [entry.key, JSON.stringify(entry.data), entry.computedAt, entry.expiresAt, entry.queryTimeMs]);
} catch (error) {
console.warn(`[AnalyticsCache] Failed to save to DB: ${error}`);
}
}
/**
* Invalidate a cache entry
*/
async invalidate(key: string): Promise<void> {
this.memoryCache.delete(key);
try {
await this.pool.query('DELETE FROM analytics_cache WHERE cache_key = $1', [key]);
} catch (error) {
console.warn(`[AnalyticsCache] Failed to invalidate: ${error}`);
}
}
/**
* Invalidate all entries matching a pattern
*/
async invalidatePattern(pattern: string): Promise<number> {
// Clear memory cache
for (const key of this.memoryCache.keys()) {
if (key.includes(pattern)) {
this.memoryCache.delete(key);
}
}
try {
const result = await this.pool.query(
'DELETE FROM analytics_cache WHERE cache_key LIKE $1',
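// Escape LIKE wildcards so the DB delete is a literal substring match, like the memory path above
[`%${pattern.replace(/[\\%_]/g, '\\$&')}%`]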
);
return result.rowCount || 0;
} catch (error) {
console.warn(`[AnalyticsCache] Failed to invalidate pattern: ${error}`);
return 0;
}
}
/**
* Clean expired entries
*/
async cleanExpired(): Promise<number> {
// Clean memory cache
const now = new Date();
for (const [key, entry] of this.memoryCache.entries()) {
if (now >= entry.expiresAt) {
this.memoryCache.delete(key);
}
}
try {
const result = await this.pool.query('DELETE FROM analytics_cache WHERE expires_at < NOW()');
return result.rowCount || 0;
} catch (error) {
console.warn(`[AnalyticsCache] Failed to clean expired: ${error}`);
return 0;
}
}
/**
* Get cache statistics
*/
async getStats(): Promise<{
memoryCacheSize: number;
dbCacheSize: number;
expiredCount: number;
}> {
try {
const result = await this.pool.query(`
SELECT
COUNT(*) FILTER (WHERE expires_at > NOW()) as active,
COUNT(*) FILTER (WHERE expires_at <= NOW()) as expired
FROM analytics_cache
`);
return {
memoryCacheSize: this.memoryCache.size,
dbCacheSize: parseInt(result.rows[0]?.active || '0'),
expiredCount: parseInt(result.rows[0]?.expired || '0'),
};
} catch (error) {
console.warn(`[AnalyticsCache] Failed to get stats: ${error}`);
return {
memoryCacheSize: this.memoryCache.size,
dbCacheSize: 0,
expiredCount: 0,
};
}
}
}
/**
* Generate cache key with parameters
*/
export function cacheKey(prefix: string, params: Record<string, unknown> = {}): string {
const sortedParams = Object.keys(params)
.sort()
.filter(k => params[k] !== undefined && params[k] !== null)
.map(k => `${k}=${params[k]}`)
.join('&');
return sortedParams ? `${prefix}:${sortedParams}` : prefix;
}
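
The key format produced by `cacheKey` above can be checked in isolation. The sketch below is a standalone copy of the same logic under a hypothetical name (`buildCacheKey`), written only for illustration; it shows that parameter order and `undefined` values do not affect the key, and that nested objects collapse to the same string.

```typescript
// Standalone copy of the key-building logic; buildCacheKey is a hypothetical name.
function buildCacheKey(prefix: string, params: Record<string, unknown> = {}): string {
  const sortedParams = Object.keys(params)
    .sort()
    .filter(k => params[k] !== undefined && params[k] !== null)
    .map(k => `${k}=${String(params[k])}`)
    .join('&');
  return sortedParams ? `${prefix}:${sortedParams}` : prefix;
}

// Same parameters in a different order, plus an undefined value: identical keys.
const keyA = buildCacheKey('category_summary', { state: 'AZ', category: 'Flower' });
const keyB = buildCacheKey('category_summary', { category: 'Flower', state: 'AZ', storeId: undefined });
// keyA === keyB === 'category_summary:category=Flower&state=AZ'

// No parameters: the prefix alone is the key.
const bareKey = buildCacheKey('category_summary');

// Caveat: object values are stringified, so different objects share one key.
const nestedKey1 = buildCacheKey('report', { filters: { a: 1 } });
const nestedKey2 = buildCacheKey('report', { filters: { a: 2 } });
// both are 'report:filters=[object Object]'
```

Callers that need object-valued parameters would have to flatten or serialize them before building the key; the services in this diff only pass strings and numbers.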


@@ -1,530 +0,0 @@
/**
* Category Growth Analytics Service
*
* Provides category-level analytics including:
* - SKU count growth
* - Price growth trends
* - New product additions
* - Category shrinkage
* - Seasonality patterns
*
* Phase 3: Analytics Dashboards
*/
import { Pool } from 'pg';
import { AnalyticsCache, cacheKey } from './cache';
export interface CategoryGrowth {
category: string;
currentSkuCount: number;
previousSkuCount: number;
skuGrowthPercent: number;
currentBrandCount: number;
previousBrandCount: number;
brandGrowthPercent: number;
currentAvgPrice: number | null;
previousAvgPrice: number | null;
priceChangePercent: number | null;
newProducts: number;
discontinuedProducts: number;
trend: 'growing' | 'declining' | 'stable';
}
export interface CategorySummary {
category: string;
totalSkus: number;
brandCount: number;
storeCount: number;
avgPrice: number | null;
minPrice: number | null;
maxPrice: number | null;
inStockSkus: number;
outOfStockSkus: number;
stockHealthPercent: number;
}
export interface CategoryGrowthTrend {
category: string;
dataPoints: Array<{
date: string;
skuCount: number;
brandCount: number;
avgPrice: number | null;
storeCount: number;
}>;
growth7d: number | null;
growth30d: number | null;
growth90d: number | null;
}
export interface CategoryHeatmapData {
categories: string[];
periods: string[];
data: Array<{
category: string;
period: string;
value: number; // SKU count, growth %, or price
changeFromPrevious: number | null;
}>;
}
export interface SeasonalityPattern {
category: string;
monthlyPattern: Array<{
month: number;
monthName: string;
avgSkuCount: number;
avgPrice: number | null;
seasonalityIndex: number; // 100 = average, >100 = above, <100 = below
}>;
peakMonth: number;
troughMonth: number;
}
export interface CategoryFilters {
state?: string;
storeId?: number;
minSkus?: number;
}
export class CategoryAnalyticsService {
private pool: Pool;
private cache: AnalyticsCache;
constructor(pool: Pool, cache: AnalyticsCache) {
this.pool = pool;
this.cache = cache;
}
/**
* Get current category summary
*/
async getCategorySummary(
category?: string,
filters: CategoryFilters = {}
): Promise<CategorySummary[]> {
const { state, storeId } = filters;
const key = cacheKey('category_summary', { category, state, storeId });
return (await this.cache.getOrCompute(key, async () => {
const params: (string | number)[] = [];
const conditions: string[] = [];
let paramIndex = 1;
if (category) {
conditions.push(`dp.type = $${paramIndex++}`);
params.push(category);
}
if (state) {
conditions.push(`d.state = $${paramIndex++}`);
params.push(state);
}
if (storeId) {
conditions.push(`dp.dispensary_id = $${paramIndex++}`);
params.push(storeId);
}
const whereClause = conditions.length > 0
? 'WHERE dp.type IS NOT NULL AND ' + conditions.join(' AND ')
: 'WHERE dp.type IS NOT NULL';
const result = await this.pool.query(`
SELECT
dp.type as category,
COUNT(*) as total_skus,
COUNT(DISTINCT dp.brand_name) as brand_count,
COUNT(DISTINCT dp.dispensary_id) as store_count,
AVG(extract_min_price(dp.latest_raw_payload)) as avg_price,
MIN(extract_min_price(dp.latest_raw_payload)) as min_price,
MAX(extract_max_price(dp.latest_raw_payload)) as max_price,
SUM(CASE WHEN dp.stock_status = 'in_stock' THEN 1 ELSE 0 END) as in_stock,
SUM(CASE WHEN dp.stock_status != 'in_stock' OR dp.stock_status IS NULL THEN 1 ELSE 0 END) as out_of_stock
FROM dutchie_products dp
JOIN dispensaries d ON dp.dispensary_id = d.id
${whereClause}
GROUP BY dp.type
ORDER BY total_skus DESC
`, params);
return result.rows.map(row => {
const totalSkus = parseInt(row.total_skus) || 0;
const inStock = parseInt(row.in_stock) || 0;
return {
category: row.category,
totalSkus,
brandCount: parseInt(row.brand_count) || 0,
storeCount: parseInt(row.store_count) || 0,
avgPrice: row.avg_price ? Math.round(parseFloat(row.avg_price) * 100) / 100 : null,
minPrice: row.min_price ? Math.round(parseFloat(row.min_price) * 100) / 100 : null,
maxPrice: row.max_price ? Math.round(parseFloat(row.max_price) * 100) / 100 : null,
inStockSkus: inStock,
outOfStockSkus: parseInt(row.out_of_stock) || 0,
stockHealthPercent: totalSkus > 0
? Math.round((inStock / totalSkus) * 100)
: 0,
};
});
}, 15)).data;
}
/**
* Get category growth (comparing periods)
*/
async getCategoryGrowth(
days: number = 7,
filters: CategoryFilters = {}
): Promise<CategoryGrowth[]> {
const { state, storeId, minSkus = 10 } = filters;
const key = cacheKey('category_growth', { days, state, storeId, minSkus });
return (await this.cache.getOrCompute(key, async () => {
// Use category_snapshots for historical comparison.
// NOTE: state and storeId are part of the cache key but are not applied to this query.
const result = await this.pool.query(`
WITH current_data AS (
SELECT
category,
total_skus,
brand_count,
avg_price,
store_count
FROM category_snapshots
WHERE snapshot_date = (SELECT MAX(snapshot_date) FROM category_snapshots)
),
previous_data AS (
SELECT
category,
total_skus,
brand_count,
avg_price,
store_count
FROM category_snapshots
WHERE snapshot_date = (
SELECT MAX(snapshot_date)
FROM category_snapshots
WHERE snapshot_date < (SELECT MAX(snapshot_date) FROM category_snapshots) - ($1 || ' days')::INTERVAL
)
)
SELECT
c.category,
c.total_skus as current_skus,
COALESCE(p.total_skus, c.total_skus) as previous_skus,
c.brand_count as current_brands,
COALESCE(p.brand_count, c.brand_count) as previous_brands,
c.avg_price as current_price,
p.avg_price as previous_price
FROM current_data c
LEFT JOIN previous_data p ON c.category = p.category
WHERE c.total_skus >= $2
ORDER BY c.total_skus DESC
`, [days, minSkus]);
// If no snapshots exist, use current data
if (result.rows.length === 0) {
const fallbackResult = await this.pool.query(`
SELECT
type as category,
COUNT(*) as total_skus,
COUNT(DISTINCT brand_name) as brand_count,
AVG(extract_min_price(latest_raw_payload)) as avg_price
FROM dutchie_products
WHERE type IS NOT NULL
GROUP BY type
HAVING COUNT(*) >= $1
ORDER BY total_skus DESC
`, [minSkus]);
return fallbackResult.rows.map(row => ({
category: row.category,
currentSkuCount: parseInt(row.total_skus) || 0,
previousSkuCount: parseInt(row.total_skus) || 0,
skuGrowthPercent: 0,
currentBrandCount: parseInt(row.brand_count) || 0,
previousBrandCount: parseInt(row.brand_count) || 0,
brandGrowthPercent: 0,
currentAvgPrice: row.avg_price ? Math.round(parseFloat(row.avg_price) * 100) / 100 : null,
previousAvgPrice: row.avg_price ? Math.round(parseFloat(row.avg_price) * 100) / 100 : null,
priceChangePercent: null,
newProducts: 0,
discontinuedProducts: 0,
trend: 'stable' as const,
}));
}
return result.rows.map(row => {
const currentSkus = parseInt(row.current_skus) || 0;
const previousSkus = parseInt(row.previous_skus) || currentSkus;
const currentBrands = parseInt(row.current_brands) || 0;
const previousBrands = parseInt(row.previous_brands) || currentBrands;
const currentPrice = row.current_price ? parseFloat(row.current_price) : null;
const previousPrice = row.previous_price ? parseFloat(row.previous_price) : null;
const skuGrowth = previousSkus > 0
? ((currentSkus - previousSkus) / previousSkus) * 100
: 0;
const brandGrowth = previousBrands > 0
? ((currentBrands - previousBrands) / previousBrands) * 100
: 0;
const priceChange = previousPrice && currentPrice
? ((currentPrice - previousPrice) / previousPrice) * 100
: null;
let trend: 'growing' | 'declining' | 'stable' = 'stable';
if (skuGrowth > 5) trend = 'growing';
else if (skuGrowth < -5) trend = 'declining';
return {
category: row.category,
currentSkuCount: currentSkus,
previousSkuCount: previousSkus,
skuGrowthPercent: Math.round(skuGrowth * 10) / 10,
currentBrandCount: currentBrands,
previousBrandCount: previousBrands,
brandGrowthPercent: Math.round(brandGrowth * 10) / 10,
currentAvgPrice: currentPrice ? Math.round(currentPrice * 100) / 100 : null,
previousAvgPrice: previousPrice ? Math.round(previousPrice * 100) / 100 : null,
priceChangePercent: priceChange !== null ? Math.round(priceChange * 10) / 10 : null,
// Net SKU delta between snapshots; individual product additions and removals are not tracked here
newProducts: Math.max(0, currentSkus - previousSkus),
discontinuedProducts: Math.max(0, previousSkus - currentSkus),
trend,
};
});
}, 15)).data;
}
/**
* Get category growth trend over time
*/
async getCategoryGrowthTrend(
category: string,
days: number = 90
): Promise<CategoryGrowthTrend> {
const key = cacheKey('category_growth_trend', { category, days });
return (await this.cache.getOrCompute(key, async () => {
const result = await this.pool.query(`
SELECT
snapshot_date as date,
total_skus as sku_count,
brand_count,
avg_price,
store_count
FROM category_snapshots
WHERE category = $1
AND snapshot_date >= CURRENT_DATE - ($2 || ' days')::INTERVAL
ORDER BY snapshot_date
`, [category, days]);
const dataPoints = result.rows.map(row => ({
date: row.date.toISOString().split('T')[0],
skuCount: parseInt(row.sku_count) || 0,
brandCount: parseInt(row.brand_count) || 0,
avgPrice: row.avg_price ? Math.round(parseFloat(row.avg_price) * 100) / 100 : null,
storeCount: parseInt(row.store_count) || 0,
}));
// Calculate growth rates
const calculateGrowth = (daysBack: number): number | null => {
if (dataPoints.length < 2) return null;
const targetDate = new Date();
targetDate.setDate(targetDate.getDate() - daysBack);
const targetDateStr = targetDate.toISOString().split('T')[0];
const recent = dataPoints[dataPoints.length - 1];
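// dataPoints is sorted ascending, so search from the end for the newest point at or before the target date
const older = [...dataPoints].reverse().find(d => d.date <= targetDateStr) || dataPoints[0];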
if (older.skuCount === 0) return null;
return Math.round(((recent.skuCount - older.skuCount) / older.skuCount) * 1000) / 10;
};
return {
category,
dataPoints,
growth7d: calculateGrowth(7),
growth30d: calculateGrowth(30),
growth90d: calculateGrowth(90),
};
}, 15)).data;
}
/**
* Get category heatmap data
*/
async getCategoryHeatmap(
metric: 'skus' | 'growth' | 'price' = 'skus',
periods: number = 12 // weeks
): Promise<CategoryHeatmapData> {
const key = cacheKey('category_heatmap', { metric, periods });
return (await this.cache.getOrCompute(key, async () => {
const result = await this.pool.query(`
SELECT
category,
snapshot_date,
total_skus,
avg_price
FROM category_snapshots
WHERE snapshot_date >= CURRENT_DATE - ($1 * 7 || ' days')::INTERVAL
ORDER BY category, snapshot_date
`, [periods]);
// Get unique categories and generate weekly periods
const categoriesSet = new Set<string>();
const periodsSet = new Set<string>();
result.rows.forEach(row => {
categoriesSet.add(row.category);
// Group by week
const date = new Date(row.snapshot_date);
const weekStart = new Date(date);
weekStart.setDate(date.getDate() - date.getDay());
periodsSet.add(weekStart.toISOString().split('T')[0]);
});
const categories = Array.from(categoriesSet).sort();
const periodsList = Array.from(periodsSet).sort();
// Aggregate data by category and week
const dataMap = new Map<string, Map<string, { skus: number; price: number | null }>>();
result.rows.forEach(row => {
const date = new Date(row.snapshot_date);
const weekStart = new Date(date);
weekStart.setDate(date.getDate() - date.getDay());
const period = weekStart.toISOString().split('T')[0];
if (!dataMap.has(row.category)) {
dataMap.set(row.category, new Map());
}
const categoryData = dataMap.get(row.category)!;
if (!categoryData.has(period)) {
categoryData.set(period, { skus: 0, price: null });
}
const existing = categoryData.get(period)!;
existing.skus = Math.max(existing.skus, parseInt(row.total_skus) || 0);
if (row.avg_price) {
existing.price = parseFloat(row.avg_price);
}
});
// Build heatmap data
const data: CategoryHeatmapData['data'] = [];
categories.forEach(category => {
let previousValue: number | null = null;
periodsList.forEach(period => {
const categoryData = dataMap.get(category)?.get(period);
let value = 0;
if (categoryData) {
switch (metric) {
case 'skus':
value = categoryData.skus;
break;
case 'price':
value = categoryData.price || 0;
break;
case 'growth':
value = previousValue !== null && previousValue > 0
? ((categoryData.skus - previousValue) / previousValue) * 100
: 0;
break;
}
}
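// In 'growth' mode, value is already the percent change (previousValue holds the prior SKU count), so report it directly
const changeFromPrevious = previousValue !== null && previousValue > 0
? (metric === 'growth' ? value : ((value - previousValue) / previousValue) * 100)
: null;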
data.push({
category,
period,
value: Math.round(value * 100) / 100,
changeFromPrevious: changeFromPrevious !== null
? Math.round(changeFromPrevious * 10) / 10
: null,
});
if (metric !== 'growth') {
previousValue = value;
} else if (categoryData) {
previousValue = categoryData.skus;
}
});
});
return {
categories,
periods: periodsList,
data,
};
}, 30)).data;
}
/**
* Get top growing/declining categories
*/
async getTopMovers(
limit: number = 5,
days: number = 30
): Promise<{
growing: CategoryGrowth[];
declining: CategoryGrowth[];
}> {
const key = cacheKey('top_movers', { limit, days });
return (await this.cache.getOrCompute(key, async () => {
const allGrowth = await this.getCategoryGrowth(days);
const sorted = [...allGrowth].sort((a, b) => b.skuGrowthPercent - a.skuGrowthPercent);
return {
growing: sorted.filter(c => c.skuGrowthPercent > 0).slice(0, limit),
declining: sorted.filter(c => c.skuGrowthPercent < 0).slice(-limit).reverse(),
};
}, 15)).data;
}
/**
* Get category subcategory breakdown
*/
async getSubcategoryBreakdown(category: string): Promise<Array<{
subcategory: string;
skuCount: number;
brandCount: number;
avgPrice: number | null;
percentOfCategory: number;
}>> {
const key = cacheKey('subcategory_breakdown', { category });
return (await this.cache.getOrCompute(key, async () => {
const result = await this.pool.query(`
WITH category_total AS (
SELECT COUNT(*) as total FROM dutchie_products WHERE type = $1
)
SELECT
COALESCE(dp.subcategory, 'Other') as subcategory,
COUNT(*) as sku_count,
COUNT(DISTINCT dp.brand_name) as brand_count,
AVG(extract_min_price(dp.latest_raw_payload)) as avg_price,
ct.total as category_total
FROM dutchie_products dp, category_total ct
WHERE dp.type = $1
GROUP BY dp.subcategory, ct.total
ORDER BY sku_count DESC
`, [category]);
return result.rows.map(row => ({
subcategory: row.subcategory,
skuCount: parseInt(row.sku_count) || 0,
brandCount: parseInt(row.brand_count) || 0,
avgPrice: row.avg_price ? Math.round(parseFloat(row.avg_price) * 100) / 100 : null,
percentOfCategory: parseInt(row.category_total) > 0
? Math.round((parseInt(row.sku_count) / parseInt(row.category_total)) * 1000) / 10
: 0,
}));
}, 15)).data;
}
}
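
The growth-rate lookup in `getCategoryGrowthTrend` depends on picking the right baseline point. The sketch below is a hedged, standalone version of that calculation; `growthOver`, `SkuPoint`, and the sample data are hypothetical and not part of the original file. It assumes points are sorted by ascending date and selects the newest point on or before the target date, falling back to the oldest point when none is old enough.

```typescript
interface SkuPoint {
  date: string; // 'YYYY-MM-DD'
  skuCount: number;
}

// Percent change between the latest point and the newest point dated on or
// before (today - daysBack). Assumes points are sorted by ascending date.
function growthOver(points: SkuPoint[], daysBack: number, today: Date): number | null {
  if (points.length < 2) return null;
  const target = new Date(today.getTime() - daysBack * 24 * 60 * 60 * 1000);
  const targetStr = target.toISOString().split('T')[0];
  const recent = points[points.length - 1];
  // Walk backwards so the first match is the newest eligible baseline.
  let older = points[0];
  for (let i = points.length - 1; i >= 0; i--) {
    if (points[i].date <= targetStr) {
      older = points[i];
      break;
    }
  }
  if (older.skuCount === 0) return null;
  return Math.round(((recent.skuCount - older.skuCount) / older.skuCount) * 1000) / 10;
}

// Made-up sample data for illustration.
const samplePoints: SkuPoint[] = [
  { date: '2025-09-10', skuCount: 100 },
  { date: '2025-11-08', skuCount: 150 },
  { date: '2025-12-01', skuCount: 180 },
  { date: '2025-12-08', skuCount: 198 },
];
const sampleToday = new Date('2025-12-08T00:00:00Z');

const sampleGrowth7d = growthOver(samplePoints, 7, sampleToday);   // 10 (198 vs 180)
const sampleGrowth30d = growthOver(samplePoints, 30, sampleToday); // 32 (198 vs 150)
const sampleGrowth90d = growthOver(samplePoints, 90, sampleToday); // 98 (falls back to the oldest point)
```

A forward `find` over ascending data would return the oldest point for every window in this sample, which is why the search direction matters.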


@@ -1,57 +0,0 @@
/**
* Analytics Module Index
*
* Exports all analytics services for CannaIQ dashboards.
*
* Phase 3: Analytics Dashboards
*/
export { AnalyticsCache, cacheKey, type CacheEntry, type CacheConfig } from './cache';
export {
PriceTrendService,
type PricePoint,
type PriceTrend,
type PriceSummary,
type PriceCompressionResult,
type PriceFilters,
} from './price-trends';
export {
PenetrationService,
type BrandPenetration,
type PenetrationTrend,
type ShelfShare,
type BrandPresenceByState,
type PenetrationFilters,
} from './penetration';
export {
CategoryAnalyticsService,
type CategoryGrowth,
type CategorySummary,
type CategoryGrowthTrend,
type CategoryHeatmapData,
type SeasonalityPattern,
type CategoryFilters,
} from './category-analytics';
export {
StoreChangeService,
type StoreChangeSummary,
type StoreChangeEvent,
type BrandChange,
type ProductChange,
type CategoryLeaderboard,
type StoreFilters,
} from './store-changes';
export {
BrandOpportunityService,
type BrandOpportunity,
type PricePosition,
type MissingSkuOpportunity,
type StoreShelfShareChange,
type CompetitorAlert,
type MarketPositionSummary,
} from './brand-opportunity';


@@ -1,556 +0,0 @@
/**
* Brand Penetration Analytics Service
*
* Provides analytics for brand market penetration including:
* - Stores carrying brand
* - SKU counts per brand
* - Percentage of stores carrying
* - Shelf share calculations
* - Penetration trends and momentum
*
* Phase 3: Analytics Dashboards
*/
import { Pool } from 'pg';
import { AnalyticsCache, cacheKey } from './cache';
export interface BrandPenetration {
brandName: string;
brandId: string | null;
totalStores: number;
storesCarrying: number;
penetrationPercent: number;
totalSkus: number;
avgSkusPerStore: number;
shelfSharePercent: number;
categories: string[];
avgPrice: number | null;
inStockSkus: number;
}
export interface PenetrationTrend {
brandName: string;
dataPoints: Array<{
date: string;
storeCount: number;
skuCount: number;
penetrationPercent: number;
}>;
momentumScore: number; // -100 to +100
riskScore: number; // 0 to 100, higher = more risk
trend: 'growing' | 'declining' | 'stable';
}
export interface ShelfShare {
brandName: string;
category: string;
skuCount: number;
categoryTotalSkus: number;
shelfSharePercent: number;
rank: number;
}
export interface BrandPresenceByState {
state: string;
storeCount: number;
skuCount: number;
avgPrice: number | null;
}
export interface PenetrationFilters {
state?: string;
category?: string;
minStores?: number;
minSkus?: number;
}
export class PenetrationService {
private pool: Pool;
private cache: AnalyticsCache;
constructor(pool: Pool, cache: AnalyticsCache) {
this.pool = pool;
this.cache = cache;
}
/**
* Get penetration data for a specific brand
*/
async getBrandPenetration(
brandName: string,
filters: PenetrationFilters = {}
): Promise<BrandPenetration> {
const { state, category } = filters;
const key = cacheKey('brand_penetration', { brandName, state, category });
return (await this.cache.getOrCompute(key, async () => {
// Build parameterized filters; each placeholder index is taken from params.length right after the value is pushed.
// State is pushed first, so it is always $2 when present (the total_stores CTE below relies on this).
const params: (string | number)[] = [brandName];
let stateCondition = '';
let categoryCondition = '';
if (state) {
params.push(state);
stateCondition = `AND d.state = $${params.length}`;
}
if (category) {
params.push(category);
categoryCondition = `AND dp.type = $${params.length}`;
}
const result = await this.pool.query(`
WITH total_stores AS (
SELECT COUNT(DISTINCT id) as total
FROM dispensaries
WHERE 1=1 ${state ? `AND state = $2` : ''}
),
brand_data AS (
SELECT
dp.brand_name,
dp.brand_id,
COUNT(DISTINCT dp.dispensary_id) as stores_carrying,
COUNT(*) as total_skus,
AVG(extract_min_price(dp.latest_raw_payload)) as avg_price,
SUM(CASE WHEN dp.stock_status = 'in_stock' THEN 1 ELSE 0 END) as in_stock,
ARRAY_AGG(DISTINCT dp.type) FILTER (WHERE dp.type IS NOT NULL) as categories
FROM dutchie_products dp
JOIN dispensaries d ON dp.dispensary_id = d.id
WHERE dp.brand_name = $1
${stateCondition}
${categoryCondition}
GROUP BY dp.brand_name, dp.brand_id
),
total_skus AS (
SELECT COUNT(*) as total
FROM dutchie_products dp
JOIN dispensaries d ON dp.dispensary_id = d.id
WHERE 1=1 ${stateCondition} ${categoryCondition}
)
SELECT
bd.brand_name,
bd.brand_id,
ts.total as total_stores,
bd.stores_carrying,
bd.total_skus,
bd.avg_price,
bd.in_stock,
bd.categories,
tsk.total as market_total_skus
FROM brand_data bd, total_stores ts, total_skus tsk
`, params);
if (result.rows.length === 0) {
return {
brandName,
brandId: null,
totalStores: 0,
storesCarrying: 0,
penetrationPercent: 0,
totalSkus: 0,
avgSkusPerStore: 0,
shelfSharePercent: 0,
categories: [],
avgPrice: null,
inStockSkus: 0,
};
}
const row = result.rows[0];
const totalStores = parseInt(row.total_stores) || 1;
const storesCarrying = parseInt(row.stores_carrying) || 0;
const totalSkus = parseInt(row.total_skus) || 0;
const marketTotalSkus = parseInt(row.market_total_skus) || 1;
return {
brandName: row.brand_name,
brandId: row.brand_id,
totalStores,
storesCarrying,
penetrationPercent: Math.round((storesCarrying / totalStores) * 1000) / 10,
totalSkus,
avgSkusPerStore: storesCarrying > 0
? Math.round((totalSkus / storesCarrying) * 10) / 10
: 0,
shelfSharePercent: Math.round((totalSkus / marketTotalSkus) * 1000) / 10,
categories: row.categories || [],
avgPrice: row.avg_price ? Math.round(parseFloat(row.avg_price) * 100) / 100 : null,
inStockSkus: parseInt(row.in_stock) || 0,
};
}, 15)).data;
}
/**
* Get top brands by penetration
*/
async getTopBrandsByPenetration(
limit: number = 20,
filters: PenetrationFilters = {}
): Promise<BrandPenetration[]> {
const { state, category, minStores = 2, minSkus = 5 } = filters;
const key = cacheKey('top_brands_penetration', { limit, state, category, minStores, minSkus });
return (await this.cache.getOrCompute(key, async () => {
const params: (string | number)[] = [limit, minStores, minSkus];
let paramIndex = 4;
let stateCondition = '';
let categoryCondition = '';
if (state) {
stateCondition = `AND d.state = $${paramIndex++}`;
params.push(state);
}
if (category) {
categoryCondition = `AND dp.type = $${paramIndex++}`;
params.push(category);
}
const result = await this.pool.query(`
WITH total_stores AS (
SELECT COUNT(DISTINCT id) as total
FROM dispensaries
WHERE 1=1 ${state ? `AND state = $${params.indexOf(state) + 1}` : ''}
),
total_skus AS (
SELECT COUNT(*) as total
FROM dutchie_products dp
JOIN dispensaries d ON dp.dispensary_id = d.id
WHERE 1=1 ${stateCondition} ${categoryCondition}
),
brand_data AS (
SELECT
dp.brand_name,
dp.brand_id,
COUNT(DISTINCT dp.dispensary_id) as stores_carrying,
COUNT(*) as total_skus,
AVG(extract_min_price(dp.latest_raw_payload)) as avg_price,
SUM(CASE WHEN dp.stock_status = 'in_stock' THEN 1 ELSE 0 END) as in_stock,
ARRAY_AGG(DISTINCT dp.type) FILTER (WHERE dp.type IS NOT NULL) as categories
FROM dutchie_products dp
JOIN dispensaries d ON dp.dispensary_id = d.id
WHERE dp.brand_name IS NOT NULL
${stateCondition}
${categoryCondition}
GROUP BY dp.brand_name, dp.brand_id
HAVING COUNT(DISTINCT dp.dispensary_id) >= $2
AND COUNT(*) >= $3
)
SELECT
bd.*,
ts.total as total_stores,
tsk.total as market_total_skus
FROM brand_data bd, total_stores ts, total_skus tsk
ORDER BY bd.stores_carrying DESC, bd.total_skus DESC
LIMIT $1
`, params);
return result.rows.map(row => {
const totalStores = parseInt(row.total_stores) || 1;
const storesCarrying = parseInt(row.stores_carrying) || 0;
const totalSkus = parseInt(row.total_skus) || 0;
const marketTotalSkus = parseInt(row.market_total_skus) || 1;
return {
brandName: row.brand_name,
brandId: row.brand_id,
totalStores,
storesCarrying,
penetrationPercent: Math.round((storesCarrying / totalStores) * 1000) / 10,
totalSkus,
avgSkusPerStore: storesCarrying > 0
? Math.round((totalSkus / storesCarrying) * 10) / 10
: 0,
shelfSharePercent: Math.round((totalSkus / marketTotalSkus) * 1000) / 10,
categories: row.categories || [],
avgPrice: row.avg_price ? Math.round(parseFloat(row.avg_price) * 100) / 100 : null,
inStockSkus: parseInt(row.in_stock) || 0,
};
});
}, 15)).data;
}
/**
* Get penetration trend for a brand (requires historical snapshots)
*/
async getPenetrationTrend(
brandName: string,
days: number = 30
): Promise<PenetrationTrend> {
const key = cacheKey('penetration_trend', { brandName, days });
return (await this.cache.getOrCompute(key, async () => {
// Use brand_snapshots table for historical data
const result = await this.pool.query(`
SELECT
snapshot_date as date,
store_count,
total_skus
FROM brand_snapshots
WHERE brand_name = $1
AND snapshot_date >= CURRENT_DATE - ($2 || ' days')::INTERVAL
ORDER BY snapshot_date
`, [brandName, days]);
// Get total stores for penetration calculation
const totalResult = await this.pool.query(
'SELECT COUNT(*) as total FROM dispensaries'
);
const totalStores = parseInt(totalResult.rows[0]?.total) || 1;
const dataPoints = result.rows.map(row => ({
date: row.date.toISOString().split('T')[0],
storeCount: parseInt(row.store_count) || 0,
skuCount: parseInt(row.total_skus) || 0,
penetrationPercent: Math.round(((parseInt(row.store_count) || 0) / totalStores) * 1000) / 10,
}));
// Calculate momentum and risk scores
let momentumScore = 0;
let riskScore = 0;
let trend: 'growing' | 'declining' | 'stable' = 'stable';
if (dataPoints.length >= 2) {
const first = dataPoints[0];
const last = dataPoints[dataPoints.length - 1];
// Momentum: change in store count
const storeChange = last.storeCount - first.storeCount;
const storeChangePercent = first.storeCount > 0
? (storeChange / first.storeCount) * 100
: 0;
// Momentum score: -100 to +100
momentumScore = Math.max(-100, Math.min(100, storeChangePercent * 10));
// Risk score: higher if losing stores
if (storeChange < 0) {
riskScore = Math.min(100, Math.abs(storeChangePercent) * 5);
}
// Determine trend
if (storeChangePercent > 5) trend = 'growing';
else if (storeChangePercent < -5) trend = 'declining';
}
return {
brandName,
dataPoints,
momentumScore: Math.round(momentumScore),
riskScore: Math.round(riskScore),
trend,
};
}, 15)).data;
}
/**
* Get shelf share by category for a brand
*/
async getShelfShareByCategory(brandName: string): Promise<ShelfShare[]> {
const key = cacheKey('shelf_share_category', { brandName });
return (await this.cache.getOrCompute(key, async () => {
const result = await this.pool.query(`
WITH category_totals AS (
SELECT
type as category,
COUNT(*) as total_skus
FROM dutchie_products
WHERE type IS NOT NULL
GROUP BY type
),
brand_by_category AS (
SELECT
type as category,
COUNT(*) as sku_count
FROM dutchie_products
WHERE brand_name = $1
AND type IS NOT NULL
GROUP BY type
),
ranked AS (
SELECT
ct.category,
COALESCE(bc.sku_count, 0) as sku_count,
ct.total_skus,
RANK() OVER (PARTITION BY ct.category ORDER BY bc.sku_count DESC NULLS LAST) as rank
FROM category_totals ct
LEFT JOIN brand_by_category bc ON ct.category = bc.category
)
SELECT
r.category,
r.sku_count,
r.total_skus as category_total_skus,
ROUND((r.sku_count::NUMERIC / r.total_skus) * 100, 2) as shelf_share_pct,
(SELECT COUNT(*) + 1 FROM (
SELECT brand_name, COUNT(*) as cnt
FROM dutchie_products
WHERE type = r.category AND brand_name IS NOT NULL
GROUP BY brand_name
HAVING COUNT(*) > r.sku_count
) t) as rank
FROM ranked r
WHERE r.sku_count > 0
ORDER BY shelf_share_pct DESC
`, [brandName]);
return result.rows.map(row => ({
brandName,
category: row.category,
skuCount: parseInt(row.sku_count) || 0,
categoryTotalSkus: parseInt(row.category_total_skus) || 0,
shelfSharePercent: parseFloat(row.shelf_share_pct) || 0,
rank: parseInt(row.rank) || 0,
}));
}, 15)).data;
}
/**
* Get brand presence by state/region
*/
async getBrandPresenceByState(brandName: string): Promise<BrandPresenceByState[]> {
const key = cacheKey('brand_presence_state', { brandName });
return (await this.cache.getOrCompute(key, async () => {
const result = await this.pool.query(`
SELECT
d.state,
COUNT(DISTINCT dp.dispensary_id) as store_count,
COUNT(*) as sku_count,
AVG(extract_min_price(dp.latest_raw_payload)) as avg_price
FROM dutchie_products dp
JOIN dispensaries d ON dp.dispensary_id = d.id
WHERE dp.brand_name = $1
GROUP BY d.state
ORDER BY store_count DESC
`, [brandName]);
return result.rows.map(row => ({
state: row.state,
storeCount: parseInt(row.store_count) || 0,
skuCount: parseInt(row.sku_count) || 0,
avgPrice: row.avg_price ? Math.round(parseFloat(row.avg_price) * 100) / 100 : null,
}));
}, 15)).data;
}
/**
* Get stores carrying a brand
*/
async getStoresCarryingBrand(brandName: string): Promise<Array<{
storeId: number;
storeName: string;
city: string;
state: string;
skuCount: number;
avgPrice: number | null;
categories: string[];
}>> {
const key = cacheKey('stores_carrying_brand', { brandName });
return (await this.cache.getOrCompute(key, async () => {
const result = await this.pool.query(`
SELECT
d.id as store_id,
d.name as store_name,
d.city,
d.state,
COUNT(*) as sku_count,
AVG(extract_min_price(dp.latest_raw_payload)) as avg_price,
ARRAY_AGG(DISTINCT dp.type) FILTER (WHERE dp.type IS NOT NULL) as categories
FROM dutchie_products dp
JOIN dispensaries d ON dp.dispensary_id = d.id
WHERE dp.brand_name = $1
GROUP BY d.id, d.name, d.city, d.state
ORDER BY sku_count DESC
`, [brandName]);
return result.rows.map(row => ({
storeId: row.store_id,
storeName: row.store_name,
city: row.city,
state: row.state,
skuCount: parseInt(row.sku_count) || 0,
avgPrice: row.avg_price ? Math.round(parseFloat(row.avg_price) * 100) / 100 : null,
categories: row.categories || [],
}));
}, 15)).data;
}
/**
* Get penetration heatmap data (state-based)
*/
async getPenetrationHeatmap(
brandName?: string
): Promise<Array<{
state: string;
totalStores: number;
storesWithBrand: number;
penetrationPercent: number;
totalSkus: number;
}>> {
const key = cacheKey('penetration_heatmap', { brandName });
return (await this.cache.getOrCompute(key, async () => {
if (brandName) {
const result = await this.pool.query(`
WITH state_totals AS (
SELECT state, COUNT(*) as total_stores
FROM dispensaries
GROUP BY state
),
brand_by_state AS (
SELECT
d.state,
COUNT(DISTINCT dp.dispensary_id) as stores_with_brand,
COUNT(*) as total_skus
FROM dutchie_products dp
JOIN dispensaries d ON dp.dispensary_id = d.id
WHERE dp.brand_name = $1
GROUP BY d.state
)
SELECT
st.state,
st.total_stores,
COALESCE(bs.stores_with_brand, 0) as stores_with_brand,
ROUND(COALESCE(bs.stores_with_brand, 0)::NUMERIC / st.total_stores * 100, 1) as penetration_pct,
COALESCE(bs.total_skus, 0) as total_skus
FROM state_totals st
LEFT JOIN brand_by_state bs ON st.state = bs.state
ORDER BY penetration_pct DESC
`, [brandName]);
return result.rows.map(row => ({
state: row.state,
totalStores: parseInt(row.total_stores) || 0,
storesWithBrand: parseInt(row.stores_with_brand) || 0,
penetrationPercent: parseFloat(row.penetration_pct) || 0,
totalSkus: parseInt(row.total_skus) || 0,
}));
} else {
// Overall market data by state
const result = await this.pool.query(`
SELECT
d.state,
COUNT(DISTINCT d.id) as total_stores,
COUNT(DISTINCT dp.brand_name) as brand_count,
COUNT(dp.id) as total_skus
FROM dispensaries d
LEFT JOIN dutchie_products dp ON d.id = dp.dispensary_id
GROUP BY d.state
ORDER BY total_stores DESC
`);
return result.rows.map(row => ({
state: row.state,
totalStores: parseInt(row.total_stores) || 0,
storesWithBrand: parseInt(row.brand_count) || 0, // Using brand count here
penetrationPercent: 100, // Full penetration for overall view
totalSkus: parseInt(row.total_skus) || 0,
}));
}
}, 30)).data;
}
}
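The brand branch of `getPenetrationHeatmap` divides the number of stores carrying a brand by the total stores in each state, and treats states with no match as zero. A minimal sketch of that arithmetic outside SQL follows. The `StatePenetration` type and `computePenetration` function are hypothetical names, not part of the removed service, and the rounding is only assumed to approximate `ROUND(..., 1)`.

```typescript
// Hypothetical type and function names, for illustration only.
interface StatePenetration {
  state: string;
  totalStores: number;
  storesWithBrand: number;
  penetrationPercent: number;
}

function computePenetration(
  totalStoresByState: Record<string, number>,
  storesWithBrandByState: Record<string, number>
): StatePenetration[] {
  const rows = Object.keys(totalStoresByState).map((state) => {
    const totalStores = totalStoresByState[state];
    // A state with no stores carrying the brand counts as zero,
    // like COALESCE(bs.stores_with_brand, 0) in the query.
    const storesWithBrand = storesWithBrandByState[state] || 0;
    // One decimal place; guard against an empty state to avoid dividing by zero.
    const penetrationPercent = totalStores > 0
      ? Math.round((storesWithBrand / totalStores) * 1000) / 10
      : 0;
    return { state, totalStores, storesWithBrand, penetrationPercent };
  });
  // Highest penetration first, matching ORDER BY penetration_pct DESC.
  return rows.sort((a, b) => b.penetrationPercent - a.penetrationPercent);
}

const heatmap = computePenetration({ AZ: 4, CA: 3 }, { AZ: 1 });
console.log(heatmap);
```

Keeping the zero-fill explicit is what makes states without the brand still appear in the heatmap, which is the role the `LEFT JOIN` plays in the query.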

@@ -1,534 +0,0 @@
/**
* Price Trend Analytics Service
*
* Provides time-series price analytics including:
* - Price over time for products
* - Average MSRP/Wholesale by period
* - Price volatility scoring
* - Price compression detection
*
* Phase 3: Analytics Dashboards
*/
import { Pool } from 'pg';
import { AnalyticsCache, cacheKey } from './cache';
export interface PricePoint {
date: string;
minPrice: number | null;
maxPrice: number | null;
avgPrice: number | null;
wholesalePrice: number | null;
sampleSize: number;
}
export interface PriceTrend {
productId?: number;
storeId?: number;
brandName?: string;
category?: string;
dataPoints: PricePoint[];
summary: {
currentAvg: number | null;
previousAvg: number | null;
changePercent: number | null;
trend: 'up' | 'down' | 'stable';
volatilityScore: number | null;
};
}
export interface PriceSummary {
avg7d: number | null;
avg30d: number | null;
avg90d: number | null;
wholesaleAvg7d: number | null;
wholesaleAvg30d: number | null;
wholesaleAvg90d: number | null;
minPrice: number | null;
maxPrice: number | null;
priceRange: number | null;
volatilityScore: number | null;
}
export interface PriceCompressionResult {
category: string;
brands: Array<{
brandName: string;
avgPrice: number;
priceDistance: number; // distance from category mean
}>;
compressionScore: number; // 0-100, higher = more compressed
standardDeviation: number;
}
export interface PriceFilters {
storeId?: number;
brandName?: string;
category?: string;
state?: string;
days?: number;
}
export class PriceTrendService {
private pool: Pool;
private cache: AnalyticsCache;
constructor(pool: Pool, cache: AnalyticsCache) {
this.pool = pool;
this.cache = cache;
}
/**
* Get price trend for a specific product
*/
async getProductPriceTrend(
productId: number,
storeId?: number,
days: number = 30
): Promise<PriceTrend> {
const key = cacheKey('price_trend_product', { productId, storeId, days });
return (await this.cache.getOrCompute(key, async () => {
// Try to get from snapshots first
const snapshotResult = await this.pool.query(`
SELECT
DATE(crawled_at) as date,
MIN(rec_min_price_cents) / 100.0 as min_price,
MAX(rec_max_price_cents) / 100.0 as max_price,
AVG(rec_min_price_cents) / 100.0 as avg_price,
AVG(wholesale_min_price_cents) / 100.0 as wholesale_price,
COUNT(*) as sample_size
FROM dutchie_product_snapshots
WHERE dutchie_product_id = $1
AND crawled_at >= NOW() - ($2 || ' days')::INTERVAL
${storeId ? 'AND dispensary_id = $3' : ''}
GROUP BY DATE(crawled_at)
ORDER BY date
`, storeId ? [productId, days, storeId] : [productId, days]);
let dataPoints: PricePoint[] = snapshotResult.rows.map(row => ({
date: row.date.toISOString().split('T')[0],
minPrice: parseFloat(row.min_price) || null,
maxPrice: parseFloat(row.max_price) || null,
avgPrice: parseFloat(row.avg_price) || null,
wholesalePrice: parseFloat(row.wholesale_price) || null,
sampleSize: parseInt(row.sample_size),
}));
// If no snapshots, get current price from product
if (dataPoints.length === 0) {
const productResult = await this.pool.query(`
SELECT
extract_min_price(latest_raw_payload) as min_price,
extract_max_price(latest_raw_payload) as max_price,
extract_wholesale_price(latest_raw_payload) as wholesale_price
FROM dutchie_products
WHERE id = $1
`, [productId]);
if (productResult.rows.length > 0) {
const row = productResult.rows[0];
dataPoints = [{
date: new Date().toISOString().split('T')[0],
minPrice: parseFloat(row.min_price) || null,
maxPrice: parseFloat(row.max_price) || null,
avgPrice: parseFloat(row.min_price) || null,
wholesalePrice: parseFloat(row.wholesale_price) || null,
sampleSize: 1,
}];
}
}
const summary = this.calculatePriceSummary(dataPoints);
return {
productId,
storeId,
dataPoints,
summary,
};
}, 15)).data;
}
/**
* Get price trends by brand
*/
async getBrandPriceTrend(
brandName: string,
filters: PriceFilters = {}
): Promise<PriceTrend> {
const { storeId, category, state, days = 30 } = filters;
const key = cacheKey('price_trend_brand', { brandName, storeId, category, state, days });
return (await this.cache.getOrCompute(key, async () => {
// Use current product data aggregated by date
const result = await this.pool.query(`
SELECT
DATE(dp.updated_at) as date,
MIN(extract_min_price(dp.latest_raw_payload)) as min_price,
MAX(extract_max_price(dp.latest_raw_payload)) as max_price,
AVG(extract_min_price(dp.latest_raw_payload)) as avg_price,
AVG(extract_wholesale_price(dp.latest_raw_payload)) as wholesale_price,
COUNT(*) as sample_size
FROM dutchie_products dp
JOIN dispensaries d ON dp.dispensary_id = d.id
WHERE dp.brand_name = $1
AND dp.updated_at >= NOW() - ($2 || ' days')::INTERVAL
${storeId ? 'AND dp.dispensary_id = $3' : ''}
${category ? `AND dp.type = $${storeId ? 4 : 3}` : ''}
${state ? `AND d.state = $${storeId ? (category ? 5 : 4) : (category ? 4 : 3)}` : ''}
GROUP BY DATE(dp.updated_at)
ORDER BY date
`, this.buildParams([brandName, days], { storeId, category, state }));
const dataPoints: PricePoint[] = result.rows.map(row => ({
date: row.date.toISOString().split('T')[0],
minPrice: parseFloat(row.min_price) || null,
maxPrice: parseFloat(row.max_price) || null,
avgPrice: parseFloat(row.avg_price) || null,
wholesalePrice: parseFloat(row.wholesale_price) || null,
sampleSize: parseInt(row.sample_size),
}));
return {
brandName,
storeId,
category,
dataPoints,
summary: this.calculatePriceSummary(dataPoints),
};
}, 15)).data;
}
/**
* Get price trends by category
*/
async getCategoryPriceTrend(
category: string,
filters: PriceFilters = {}
): Promise<PriceTrend> {
const { storeId, brandName, state, days = 30 } = filters;
const key = cacheKey('price_trend_category', { category, storeId, brandName, state, days });
return (await this.cache.getOrCompute(key, async () => {
const result = await this.pool.query(`
SELECT
DATE(dp.updated_at) as date,
MIN(extract_min_price(dp.latest_raw_payload)) as min_price,
MAX(extract_max_price(dp.latest_raw_payload)) as max_price,
AVG(extract_min_price(dp.latest_raw_payload)) as avg_price,
AVG(extract_wholesale_price(dp.latest_raw_payload)) as wholesale_price,
COUNT(*) as sample_size
FROM dutchie_products dp
JOIN dispensaries d ON dp.dispensary_id = d.id
WHERE dp.type = $1
AND dp.updated_at >= NOW() - ($2 || ' days')::INTERVAL
${storeId ? 'AND dp.dispensary_id = $3' : ''}
${brandName ? `AND dp.brand_name = $${storeId ? 4 : 3}` : ''}
${state ? `AND d.state = $${storeId ? (brandName ? 5 : 4) : (brandName ? 4 : 3)}` : ''}
GROUP BY DATE(dp.updated_at)
ORDER BY date
`, this.buildParams([category, days], { storeId, brandName, state }));
const dataPoints: PricePoint[] = result.rows.map(row => ({
date: row.date.toISOString().split('T')[0],
minPrice: parseFloat(row.min_price) || null,
maxPrice: parseFloat(row.max_price) || null,
avgPrice: parseFloat(row.avg_price) || null,
wholesalePrice: parseFloat(row.wholesale_price) || null,
sampleSize: parseInt(row.sample_size),
}));
return {
category,
storeId,
brandName,
dataPoints,
summary: this.calculatePriceSummary(dataPoints),
};
}, 15)).data;
}
/**
* Get price summary statistics
*/
async getPriceSummary(filters: PriceFilters = {}): Promise<PriceSummary> {
const { storeId, brandName, category, state } = filters;
const key = cacheKey('price_summary', filters as Record<string, unknown>);
return (await this.cache.getOrCompute(key, async () => {
const whereConditions: string[] = [];
const params: (string | number)[] = [];
let paramIndex = 1;
if (storeId) {
whereConditions.push(`dp.dispensary_id = $${paramIndex++}`);
params.push(storeId);
}
if (brandName) {
whereConditions.push(`dp.brand_name = $${paramIndex++}`);
params.push(brandName);
}
if (category) {
whereConditions.push(`dp.type = $${paramIndex++}`);
params.push(category);
}
if (state) {
whereConditions.push(`d.state = $${paramIndex++}`);
params.push(state);
}
const whereClause = whereConditions.length > 0
? 'WHERE ' + whereConditions.join(' AND ')
: '';
const result = await this.pool.query(`
WITH prices AS (
SELECT
extract_min_price(dp.latest_raw_payload) as min_price,
extract_max_price(dp.latest_raw_payload) as max_price,
extract_wholesale_price(dp.latest_raw_payload) as wholesale_price
FROM dutchie_products dp
JOIN dispensaries d ON dp.dispensary_id = d.id
${whereClause}
)
SELECT
AVG(min_price) as avg_price,
AVG(wholesale_price) as avg_wholesale,
MIN(min_price) as min_price,
MAX(max_price) as max_price,
STDDEV(min_price) as std_dev
FROM prices
WHERE min_price IS NOT NULL
`, params);
const row = result.rows[0];
const avgPrice = parseFloat(row.avg_price) || null;
const stdDev = parseFloat(row.std_dev) || null;
const volatility = avgPrice && stdDev ? (stdDev / avgPrice) * 100 : null;
return {
avg7d: avgPrice, // No windowed history is queried here: the 7d/30d/90d fields all reuse the current average as a proxy
avg30d: avgPrice,
avg90d: avgPrice,
wholesaleAvg7d: parseFloat(row.avg_wholesale) || null,
wholesaleAvg30d: parseFloat(row.avg_wholesale) || null,
wholesaleAvg90d: parseFloat(row.avg_wholesale) || null,
minPrice: parseFloat(row.min_price) || null,
maxPrice: parseFloat(row.max_price) || null,
priceRange: row.max_price && row.min_price
? parseFloat(row.max_price) - parseFloat(row.min_price)
: null,
volatilityScore: volatility ? Math.round(volatility * 10) / 10 : null,
};
}, 30)).data;
}
/**
* Detect price compression in a category
*/
async detectPriceCompression(
category: string,
state?: string
): Promise<PriceCompressionResult> {
const key = cacheKey('price_compression', { category, state });
return (await this.cache.getOrCompute(key, async () => {
const result = await this.pool.query(`
WITH brand_prices AS (
SELECT
dp.brand_name,
AVG(extract_min_price(dp.latest_raw_payload)) as avg_price,
COUNT(*) as sku_count
FROM dutchie_products dp
JOIN dispensaries d ON dp.dispensary_id = d.id
WHERE dp.type = $1
AND dp.brand_name IS NOT NULL
${state ? 'AND d.state = $2' : ''}
GROUP BY dp.brand_name
HAVING COUNT(*) >= 3
),
stats AS (
SELECT
AVG(avg_price) as category_avg,
STDDEV(avg_price) as std_dev
FROM brand_prices
WHERE avg_price IS NOT NULL
)
SELECT
bp.brand_name,
bp.avg_price,
ABS(bp.avg_price - s.category_avg) as price_distance,
s.category_avg,
s.std_dev
FROM brand_prices bp, stats s
WHERE bp.avg_price IS NOT NULL
ORDER BY bp.avg_price
`, state ? [category, state] : [category]);
if (result.rows.length === 0) {
return {
category,
brands: [],
compressionScore: 0,
standardDeviation: 0,
};
}
const categoryAvg = parseFloat(result.rows[0].category_avg) || 0;
const stdDev = parseFloat(result.rows[0].std_dev) || 0;
// Compression score: lower std dev relative to mean = more compression
// Scale to 0-100 where 100 = very compressed
const cv = categoryAvg > 0 ? (stdDev / categoryAvg) * 100 : 0;
const compressionScore = Math.max(0, Math.min(100, 100 - cv));
const brands = result.rows.map(row => ({
brandName: row.brand_name,
avgPrice: parseFloat(row.avg_price) || 0,
priceDistance: parseFloat(row.price_distance) || 0,
}));
return {
category,
brands,
compressionScore: Math.round(compressionScore),
standardDeviation: Math.round(stdDev * 100) / 100,
};
}, 30)).data;
}
/**
* Get global price statistics
*/
async getGlobalPriceStats(): Promise<{
totalProductsWithPrice: number;
avgPrice: number | null;
medianPrice: number | null;
priceByCategory: Array<{ category: string; avgPrice: number; count: number }>;
priceByState: Array<{ state: string; avgPrice: number; count: number }>;
}> {
const key = 'global_price_stats';
return (await this.cache.getOrCompute(key, async () => {
const [countResult, categoryResult, stateResult] = await Promise.all([
this.pool.query(`
SELECT
COUNT(*) FILTER (WHERE extract_min_price(latest_raw_payload) IS NOT NULL) as with_price,
AVG(extract_min_price(latest_raw_payload)) as avg_price,
PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY extract_min_price(latest_raw_payload)) as median
FROM dutchie_products
`),
this.pool.query(`
SELECT
type as category,
AVG(extract_min_price(latest_raw_payload)) as avg_price,
COUNT(*) as count
FROM dutchie_products
WHERE type IS NOT NULL
AND extract_min_price(latest_raw_payload) IS NOT NULL
GROUP BY type
ORDER BY avg_price DESC
`),
this.pool.query(`
SELECT
d.state,
AVG(extract_min_price(dp.latest_raw_payload)) as avg_price,
COUNT(*) as count
FROM dutchie_products dp
JOIN dispensaries d ON dp.dispensary_id = d.id
WHERE extract_min_price(dp.latest_raw_payload) IS NOT NULL
GROUP BY d.state
ORDER BY avg_price DESC
`),
]);
return {
totalProductsWithPrice: parseInt(countResult.rows[0]?.with_price || '0'),
avgPrice: parseFloat(countResult.rows[0]?.avg_price) || null,
medianPrice: parseFloat(countResult.rows[0]?.median) || null,
priceByCategory: categoryResult.rows.map(r => ({
category: r.category,
avgPrice: parseFloat(r.avg_price) || 0,
count: parseInt(r.count),
})),
priceByState: stateResult.rows.map(r => ({
state: r.state,
avgPrice: parseFloat(r.avg_price) || 0,
count: parseInt(r.count),
})),
};
}, 30)).data;
}
// ============================================================
// HELPER METHODS
// ============================================================
private calculatePriceSummary(dataPoints: PricePoint[]): PriceTrend['summary'] {
if (dataPoints.length === 0) {
return {
currentAvg: null,
previousAvg: null,
changePercent: null,
trend: 'stable',
volatilityScore: null,
};
}
const prices = dataPoints
.map(d => d.avgPrice)
.filter((p): p is number => p !== null);
if (prices.length === 0) {
return {
currentAvg: null,
previousAvg: null,
changePercent: null,
trend: 'stable',
volatilityScore: null,
};
}
const currentAvg = prices[prices.length - 1];
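// "Previous" is the price at the midpoint of the window. Using (length - 1) / 2
// keeps the midpoint distinct from the last element when only two points exist;
// otherwise the change would always come out as 0%.
const midpoint = Math.floor((prices.length - 1) / 2);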
const previousAvg = prices.length > 1 ? prices[midpoint] : currentAvg;
const changePercent = previousAvg > 0
? ((currentAvg - previousAvg) / previousAvg) * 100
: null;
// Calculate volatility (coefficient of variation)
const mean = prices.reduce((a, b) => a + b, 0) / prices.length;
const variance = prices.reduce((sum, p) => sum + Math.pow(p - mean, 2), 0) / prices.length;
const stdDev = Math.sqrt(variance);
const volatilityScore = mean > 0 ? (stdDev / mean) * 100 : null;
let trend: 'up' | 'down' | 'stable' = 'stable';
if (changePercent !== null) {
if (changePercent > 5) trend = 'up';
else if (changePercent < -5) trend = 'down';
}
return {
currentAvg: Math.round(currentAvg * 100) / 100,
previousAvg: Math.round(previousAvg * 100) / 100,
changePercent: changePercent !== null ? Math.round(changePercent * 10) / 10 : null,
trend,
volatilityScore: volatilityScore !== null ? Math.round(volatilityScore * 10) / 10 : null,
};
}
private buildParams(
baseParams: (string | number)[],
optionalParams: Record<string, string | number | undefined>
): (string | number)[] {
const params = [...baseParams];
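// Values are appended in the object's key order, which must match the
// placeholder numbering in the calling query. The SQL templates include a
// condition only when the filter is truthy, so use the same check here;
// otherwise an empty string or 0 would add a parameter with no placeholder.
for (const value of Object.values(optionalParams)) {
if (value) {
params.push(value);
}
}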
return params;
}
}
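The trend queries in `PriceTrendService` compute placeholder numbers with nested ternaries and build the parameter array separately, so the two have to be kept in sync by hand. One way to avoid that coupling, in the spirit of what `getPriceSummary` already does with `paramIndex`, is to emit each condition and its parameter together. This is a sketch under assumptions: `buildFilterClause` and its types are hypothetical, and the column names are assumed to be trusted constants, never user input.

```typescript
// Hypothetical helper, not part of the removed service.
type FilterValue = string | number;

interface BuiltFilters {
  clause: string;        // "AND col = $n" lines, ready to append to a WHERE
  params: FilterValue[]; // base params followed by the filter values, in order
}

function buildFilterClause(
  baseParams: FilterValue[],
  filters: Array<[string, FilterValue | undefined]>
): BuiltFilters {
  const params: FilterValue[] = [...baseParams];
  const conditions: string[] = [];
  for (const [column, value] of filters) {
    if (value === undefined) continue;
    // Push first so params.length is the 1-based placeholder index.
    params.push(value);
    conditions.push(`AND ${column} = $${params.length}`);
  }
  return { clause: conditions.join('\n'), params };
}

// Example: brand and days are the base parameters ($1, $2).
const built = buildFilterClause(['Some Brand', 30], [
  ['dp.dispensary_id', undefined],
  ['dp.type', 'Flower'],
  ['d.state', 'AZ'],
]);
console.log(built.clause);
console.log(built.params);
```

Because the placeholder index is derived from the array length at the moment the value is pushed, a skipped filter cannot shift the numbering of the ones that follow.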

@@ -1,587 +0,0 @@
/**
* Store Change Tracking Service
*
* Tracks changes at the store level including:
* - New/lost brands
* - New/discontinued products
* - Stock status transitions
* - Price changes
* - Category movement leaderboards
*
* Phase 3: Analytics Dashboards
*/
import { Pool } from 'pg';
import { AnalyticsCache, cacheKey } from './cache';
export interface StoreChangeSummary {
storeId: number;
storeName: string;
city: string;
state: string;
brandsAdded7d: number;
brandsAdded30d: number;
brandsLost7d: number;
brandsLost30d: number;
productsAdded7d: number;
productsAdded30d: number;
productsDiscontinued7d: number;
productsDiscontinued30d: number;
priceDrops7d: number;
priceIncreases7d: number;
restocks7d: number;
stockOuts7d: number;
}
export interface StoreChangeEvent {
id: number;
storeId: number;
storeName: string;
eventType: string;
eventDate: string;
brandName: string | null;
productName: string | null;
category: string | null;
oldValue: string | null;
newValue: string | null;
metadata: Record<string, unknown> | null;
}
export interface BrandChange {
brandName: string;
changeType: 'added' | 'removed';
date: string;
skuCount: number;
categories: string[];
}
export interface ProductChange {
productId: number;
productName: string;
brandName: string | null;
category: string | null;
changeType: 'added' | 'discontinued' | 'price_drop' | 'price_increase' | 'restocked' | 'out_of_stock';
date: string;
oldValue?: string;
newValue?: string;
}
export interface CategoryLeaderboard {
category: string;
storeId: number;
storeName: string;
skuCount: number;
brandCount: number;
avgPrice: number | null;
changePercent7d: number;
rank: number;
}
export interface StoreFilters {
storeId?: number;
state?: string;
days?: number;
eventType?: string;
}
export class StoreChangeService {
private pool: Pool;
private cache: AnalyticsCache;
constructor(pool: Pool, cache: AnalyticsCache) {
this.pool = pool;
this.cache = cache;
}
/**
* Get change summary for a store
*/
async getStoreChangeSummary(
storeId: number
): Promise<StoreChangeSummary | null> {
const key = cacheKey('store_change_summary', { storeId });
return (await this.cache.getOrCompute(key, async () => {
// Get store info
const storeResult = await this.pool.query(`
SELECT id, name, city, state FROM dispensaries WHERE id = $1
`, [storeId]);
if (storeResult.rows.length === 0) return null;
const store = storeResult.rows[0];
// Get change events counts
const eventsResult = await this.pool.query(`
SELECT
event_type,
COUNT(*) FILTER (WHERE event_date >= CURRENT_DATE - INTERVAL '7 days') as count_7d,
COUNT(*) FILTER (WHERE event_date >= CURRENT_DATE - INTERVAL '30 days') as count_30d
FROM store_change_events
WHERE store_id = $1
GROUP BY event_type
`, [storeId]);
const counts: Record<string, { count_7d: number; count_30d: number }> = {};
eventsResult.rows.forEach(row => {
counts[row.event_type] = {
count_7d: parseInt(row.count_7d) || 0,
count_30d: parseInt(row.count_30d) || 0,
};
});
return {
storeId: store.id,
storeName: store.name,
city: store.city,
state: store.state,
brandsAdded7d: counts['brand_added']?.count_7d || 0,
brandsAdded30d: counts['brand_added']?.count_30d || 0,
brandsLost7d: counts['brand_removed']?.count_7d || 0,
brandsLost30d: counts['brand_removed']?.count_30d || 0,
productsAdded7d: counts['product_added']?.count_7d || 0,
productsAdded30d: counts['product_added']?.count_30d || 0,
productsDiscontinued7d: counts['product_removed']?.count_7d || 0,
productsDiscontinued30d: counts['product_removed']?.count_30d || 0,
priceDrops7d: counts['price_drop']?.count_7d || 0,
priceIncreases7d: counts['price_increase']?.count_7d || 0,
restocks7d: counts['restocked']?.count_7d || 0,
stockOuts7d: counts['out_of_stock']?.count_7d || 0,
};
}, 15)).data;
}
/**
* Get recent change events for a store
*/
async getStoreChangeEvents(
storeId: number,
filters: { eventType?: string; days?: number; limit?: number } = {}
): Promise<StoreChangeEvent[]> {
const { eventType, days = 30, limit = 100 } = filters;
const key = cacheKey('store_change_events', { storeId, eventType, days, limit });
return (await this.cache.getOrCompute(key, async () => {
const params: (string | number)[] = [storeId, days, limit];
let eventTypeCondition = '';
if (eventType) {
eventTypeCondition = 'AND event_type = $4';
params.push(eventType);
}
const result = await this.pool.query(`
SELECT
sce.id,
sce.store_id,
d.name as store_name,
sce.event_type,
sce.event_date,
sce.brand_name,
sce.product_name,
sce.category,
sce.old_value,
sce.new_value,
sce.metadata
FROM store_change_events sce
JOIN dispensaries d ON sce.store_id = d.id
WHERE sce.store_id = $1
AND sce.event_date >= CURRENT_DATE - ($2 || ' days')::INTERVAL
${eventTypeCondition}
ORDER BY sce.event_date DESC, sce.id DESC
LIMIT $3
`, params);
return result.rows.map(row => ({
id: row.id,
storeId: row.store_id,
storeName: row.store_name,
eventType: row.event_type,
eventDate: row.event_date.toISOString().split('T')[0],
brandName: row.brand_name,
productName: row.product_name,
category: row.category,
oldValue: row.old_value,
newValue: row.new_value,
metadata: row.metadata,
}));
}, 5)).data;
}
/**
* Get new brands added to a store
*/
async getNewBrands(
storeId: number,
days: number = 30
): Promise<BrandChange[]> {
const key = cacheKey('new_brands', { storeId, days });
return (await this.cache.getOrCompute(key, async () => {
const result = await this.pool.query(`
SELECT
brand_name,
event_date,
metadata
FROM store_change_events
WHERE store_id = $1
AND event_type = 'brand_added'
AND event_date >= CURRENT_DATE - ($2 || ' days')::INTERVAL
ORDER BY event_date DESC
`, [storeId, days]);
return result.rows.map(row => ({
brandName: row.brand_name,
changeType: 'added' as const,
date: row.event_date.toISOString().split('T')[0],
skuCount: row.metadata?.sku_count || 0,
categories: row.metadata?.categories || [],
}));
}, 15)).data;
}
/**
* Get brands lost from a store
*/
async getLostBrands(
storeId: number,
days: number = 30
): Promise<BrandChange[]> {
const key = cacheKey('lost_brands', { storeId, days });
return (await this.cache.getOrCompute(key, async () => {
const result = await this.pool.query(`
SELECT
brand_name,
event_date,
metadata
FROM store_change_events
WHERE store_id = $1
AND event_type = 'brand_removed'
AND event_date >= CURRENT_DATE - ($2 || ' days')::INTERVAL
ORDER BY event_date DESC
`, [storeId, days]);
return result.rows.map(row => ({
brandName: row.brand_name,
changeType: 'removed' as const,
date: row.event_date.toISOString().split('T')[0],
skuCount: row.metadata?.sku_count || 0,
categories: row.metadata?.categories || [],
}));
}, 15)).data;
}
/**
* Get product changes for a store
*/
async getProductChanges(
storeId: number,
changeType?: 'added' | 'discontinued' | 'price_drop' | 'price_increase' | 'restocked' | 'out_of_stock',
days: number = 7
): Promise<ProductChange[]> {
const key = cacheKey('product_changes', { storeId, changeType, days });
return (await this.cache.getOrCompute(key, async () => {
const eventTypeMap: Record<string, string> = {
'added': 'product_added',
'discontinued': 'product_removed',
'price_drop': 'price_drop',
'price_increase': 'price_increase',
'restocked': 'restocked',
'out_of_stock': 'out_of_stock',
};
const params: (string | number)[] = [storeId, days];
let eventCondition = '';
if (changeType) {
eventCondition = 'AND event_type = $3';
params.push(eventTypeMap[changeType]);
}
const result = await this.pool.query(`
SELECT
product_id,
product_name,
brand_name,
category,
event_type,
event_date,
old_value,
new_value
FROM store_change_events
WHERE store_id = $1
AND event_date >= CURRENT_DATE - ($2 || ' days')::INTERVAL
AND product_id IS NOT NULL
${eventCondition}
ORDER BY event_date DESC
LIMIT 100
`, params);
const reverseMap: Record<string, ProductChange['changeType']> = {
'product_added': 'added',
'product_removed': 'discontinued',
'price_drop': 'price_drop',
'price_increase': 'price_increase',
'restocked': 'restocked',
'out_of_stock': 'out_of_stock',
};
return result.rows.map(row => ({
productId: row.product_id,
productName: row.product_name,
brandName: row.brand_name,
category: row.category,
changeType: reverseMap[row.event_type] || 'added',
date: row.event_date.toISOString().split('T')[0],
oldValue: row.old_value,
newValue: row.new_value,
}));
}, 5)).data;
}
/**
* Get category leaderboard across stores
*/
async getCategoryLeaderboard(
category: string,
limit: number = 20
): Promise<CategoryLeaderboard[]> {
const key = cacheKey('category_leaderboard', { category, limit });
return (await this.cache.getOrCompute(key, async () => {
const result = await this.pool.query(`
WITH store_category_stats AS (
SELECT
dp.dispensary_id as store_id,
d.name as store_name,
COUNT(*) as sku_count,
COUNT(DISTINCT dp.brand_name) as brand_count,
AVG(extract_min_price(dp.latest_raw_payload)) as avg_price
FROM dutchie_products dp
JOIN dispensaries d ON dp.dispensary_id = d.id
WHERE dp.type = $1
GROUP BY dp.dispensary_id, d.name
)
SELECT
scs.*,
RANK() OVER (ORDER BY scs.sku_count DESC) as rank
FROM store_category_stats scs
ORDER BY scs.sku_count DESC
LIMIT $2
`, [category, limit]);
return result.rows.map(row => ({
category,
storeId: row.store_id,
storeName: row.store_name,
skuCount: parseInt(row.sku_count) || 0,
brandCount: parseInt(row.brand_count) || 0,
avgPrice: row.avg_price ? Math.round(parseFloat(row.avg_price) * 100) / 100 : null,
changePercent7d: 0, // Would need historical data
rank: parseInt(row.rank) || 0,
}));
}, 15)).data;
}
/**
* Get stores with most activity (changes)
*/
async getMostActiveStores(
days: number = 7,
limit: number = 10
): Promise<Array<{
storeId: number;
storeName: string;
city: string;
state: string;
totalChanges: number;
brandsChanged: number;
productsChanged: number;
priceChanges: number;
stockChanges: number;
}>> {
const key = cacheKey('most_active_stores', { days, limit });
return (await this.cache.getOrCompute(key, async () => {
const result = await this.pool.query(`
SELECT
d.id as store_id,
d.name as store_name,
d.city,
d.state,
COUNT(*) as total_changes,
COUNT(*) FILTER (WHERE sce.event_type IN ('brand_added', 'brand_removed')) as brands_changed,
COUNT(*) FILTER (WHERE sce.event_type IN ('product_added', 'product_removed')) as products_changed,
COUNT(*) FILTER (WHERE sce.event_type IN ('price_drop', 'price_increase')) as price_changes,
COUNT(*) FILTER (WHERE sce.event_type IN ('restocked', 'out_of_stock')) as stock_changes
FROM store_change_events sce
JOIN dispensaries d ON sce.store_id = d.id
WHERE sce.event_date >= CURRENT_DATE - ($1 || ' days')::INTERVAL
GROUP BY d.id, d.name, d.city, d.state
ORDER BY total_changes DESC
LIMIT $2
`, [days, limit]);
return result.rows.map(row => ({
storeId: row.store_id,
storeName: row.store_name,
city: row.city,
state: row.state,
totalChanges: parseInt(row.total_changes) || 0,
brandsChanged: parseInt(row.brands_changed) || 0,
productsChanged: parseInt(row.products_changed) || 0,
priceChanges: parseInt(row.price_changes) || 0,
stockChanges: parseInt(row.stock_changes) || 0,
}));
}, 15)).data;
}
/**
* Compare two stores
*/
async compareStores(
storeId1: number,
storeId2: number
): Promise<{
store1: { id: number; name: string; brands: string[]; categories: string[]; skuCount: number };
store2: { id: number; name: string; brands: string[]; categories: string[]; skuCount: number };
sharedBrands: string[];
uniqueToStore1: string[];
uniqueToStore2: string[];
categoryComparison: Array<{
category: string;
store1Skus: number;
store2Skus: number;
difference: number;
}>;
}> {
const key = cacheKey('compare_stores', { storeId1, storeId2 });
return (await this.cache.getOrCompute(key, async () => {
const [store1Data, store2Data] = await Promise.all([
this.pool.query(`
SELECT
d.id, d.name,
ARRAY_AGG(DISTINCT dp.brand_name) FILTER (WHERE dp.brand_name IS NOT NULL) as brands,
ARRAY_AGG(DISTINCT dp.type) FILTER (WHERE dp.type IS NOT NULL) as categories,
COUNT(dp.id) as sku_count
FROM dispensaries d
LEFT JOIN dutchie_products dp ON d.id = dp.dispensary_id
WHERE d.id = $1
GROUP BY d.id, d.name
`, [storeId1]),
this.pool.query(`
SELECT
d.id, d.name,
ARRAY_AGG(DISTINCT dp.brand_name) FILTER (WHERE dp.brand_name IS NOT NULL) as brands,
ARRAY_AGG(DISTINCT dp.type) FILTER (WHERE dp.type IS NOT NULL) as categories,
COUNT(dp.id) as sku_count
FROM dispensaries d
LEFT JOIN dutchie_products dp ON d.id = dp.dispensary_id
WHERE d.id = $1
GROUP BY d.id, d.name
`, [storeId2]),
]);
const s1 = store1Data.rows[0];
const s2 = store2Data.rows[0];
const brands1Array: string[] = (s1?.brands || []).filter((b: string | null): b is string => b !== null);
const brands2Array: string[] = (s2?.brands || []).filter((b: string | null): b is string => b !== null);
const brands1 = new Set(brands1Array);
const brands2 = new Set(brands2Array);
const sharedBrands: string[] = brands1Array.filter(b => brands2.has(b));
const uniqueToStore1: string[] = brands1Array.filter(b => !brands2.has(b));
const uniqueToStore2: string[] = brands2Array.filter(b => !brands1.has(b));
// Category comparison
const categoryResult = await this.pool.query(`
WITH store1_cats AS (
SELECT type as category, COUNT(*) as sku_count
FROM dutchie_products WHERE dispensary_id = $1 AND type IS NOT NULL
GROUP BY type
),
store2_cats AS (
SELECT type as category, COUNT(*) as sku_count
FROM dutchie_products WHERE dispensary_id = $2 AND type IS NOT NULL
GROUP BY type
),
all_cats AS (
SELECT category FROM store1_cats
UNION
SELECT category FROM store2_cats
)
SELECT
ac.category,
COALESCE(s1.sku_count, 0) as store1_skus,
COALESCE(s2.sku_count, 0) as store2_skus
FROM all_cats ac
LEFT JOIN store1_cats s1 ON ac.category = s1.category
LEFT JOIN store2_cats s2 ON ac.category = s2.category
ORDER BY (COALESCE(s1.sku_count, 0) + COALESCE(s2.sku_count, 0)) DESC
`, [storeId1, storeId2]);
return {
store1: {
id: s1?.id || storeId1,
name: s1?.name || 'Unknown',
brands: s1?.brands || [],
categories: s1?.categories || [],
skuCount: parseInt(s1?.sku_count) || 0,
},
store2: {
id: s2?.id || storeId2,
name: s2?.name || 'Unknown',
brands: s2?.brands || [],
categories: s2?.categories || [],
skuCount: parseInt(s2?.sku_count) || 0,
},
sharedBrands,
uniqueToStore1,
uniqueToStore2,
categoryComparison: categoryResult.rows.map(row => ({
category: row.category,
store1Skus: parseInt(row.store1_skus) || 0,
store2Skus: parseInt(row.store2_skus) || 0,
difference: (parseInt(row.store1_skus) || 0) - (parseInt(row.store2_skus) || 0),
})),
};
}, 15)).data;
}
/**
* Record a change event (used by crawler/worker)
*/
async recordChangeEvent(event: {
storeId: number;
eventType: string;
brandName?: string;
productId?: number;
productName?: string;
category?: string;
oldValue?: string;
newValue?: string;
metadata?: Record<string, unknown>;
}): Promise<void> {
await this.pool.query(`
INSERT INTO store_change_events
(store_id, event_type, event_date, brand_name, product_id, product_name, category, old_value, new_value, metadata)
VALUES ($1, $2, CURRENT_DATE, $3, $4, $5, $6, $7, $8, $9)
`, [
event.storeId,
event.eventType,
event.brandName || null,
event.productId || null,
event.productName || null,
event.category || null,
event.oldValue || null,
event.newValue || null,
event.metadata ? JSON.stringify(event.metadata) : null,
]);
// Invalidate cache
await this.cache.invalidatePattern(`store_change_summary:storeId=${event.storeId}`);
}
}
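`compareStores` splits two brand lists into shared brands and brands unique to each store, after dropping nulls. The same partition can be sketched as a standalone function. `compareBrandLists` and `BrandComparison` are hypothetical names used only for illustration.

```typescript
// Hypothetical names, for illustration only.
interface BrandComparison {
  sharedBrands: string[];
  uniqueToStore1: string[];
  uniqueToStore2: string[];
}

function compareBrandLists(
  brands1: Array<string | null>,
  brands2: Array<string | null>
): BrandComparison {
  // Drop nulls first so the sets only contain real brand names.
  const list1 = brands1.filter((b): b is string => b !== null);
  const list2 = brands2.filter((b): b is string => b !== null);
  const set1 = new Set(list1);
  const set2 = new Set(list2);
  return {
    // Filtering the arrays (not the sets) preserves the input order.
    sharedBrands: list1.filter((b) => set2.has(b)),
    uniqueToStore1: list1.filter((b) => !set2.has(b)),
    uniqueToStore2: list2.filter((b) => !set1.has(b)),
  };
}

const comparison = compareBrandLists(['A', 'B', null], ['B', 'C']);
console.log(comparison);
```

The sets are used only for membership checks, so each lookup stays constant time even when the brand lists are long.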

@@ -1,266 +0,0 @@
/**
* LEGACY SERVICE - AZDHS Import
*
* DEPRECATED: This service creates its own database pool.
* Future implementations should use the canonical CannaiQ connection.
*
* Imports Arizona dispensaries from the main database's dispensaries table
* (which was populated from AZDHS data) into the isolated Dutchie AZ database.
*
* This establishes the canonical list of AZ dispensaries to match against Dutchie.
*
* DO NOT:
* - Run this in automated jobs
* - Use DATABASE_URL directly
*/
import { Pool } from 'pg';
import { query as dutchieQuery } from '../db/connection';
import { Dispensary } from '../types';
// Single database connection (cannaiq in cannaiq-postgres container)
// Use CANNAIQ_DB_* env vars or defaults
const MAIN_DB_CONNECTION = process.env.CANNAIQ_DB_URL ||
`postgresql://${process.env.CANNAIQ_DB_USER || 'dutchie'}:${process.env.CANNAIQ_DB_PASS || 'dutchie_local_pass'}@${process.env.CANNAIQ_DB_HOST || 'localhost'}:${process.env.CANNAIQ_DB_PORT || '54320'}/${process.env.CANNAIQ_DB_NAME || 'cannaiq'}`;
/**
* AZDHS dispensary record from the main database
*/
interface AZDHSDispensary {
id: number;
azdhs_id: number;
name: string;
company_name?: string;
address?: string;
city: string;
state: string;
zip?: string;
latitude?: number;
longitude?: number;
dba_name?: string;
phone?: string;
email?: string;
website?: string;
google_rating?: string;
google_review_count?: number;
slug: string;
menu_provider?: string;
product_provider?: string;
created_at: Date;
updated_at: Date;
}
/**
* Import result statistics
*/
interface ImportResult {
total: number;
imported: number;
skipped: number;
errors: string[];
}
/**
* Create a temporary connection to the main database
*/
function getMainDBPool(): Pool {
console.warn('[AZDHS Import] LEGACY: Using separate pool. Should use canonical CannaiQ connection.');
return new Pool({
connectionString: MAIN_DB_CONNECTION,
max: 5,
idleTimeoutMillis: 30000,
connectionTimeoutMillis: 5000,
});
}
/**
* Fetch all AZ dispensaries from the main database
*/
async function fetchAZDHSDispensaries(): Promise<AZDHSDispensary[]> {
const pool = getMainDBPool();
try {
const result = await pool.query<AZDHSDispensary>(`
SELECT
id, azdhs_id, name, company_name, address, city, state, zip,
latitude, longitude, dba_name, phone, email, website,
google_rating, google_review_count, slug,
menu_provider, product_provider,
created_at, updated_at
FROM dispensaries
WHERE state = 'AZ'
ORDER BY id
`);
return result.rows;
} finally {
await pool.end();
}
}
/**
* Import a single dispensary into the Dutchie AZ database
*/
async function importDispensary(disp: AZDHSDispensary): Promise<number> {
const result = await dutchieQuery<{ id: number }>(
`
INSERT INTO dispensaries (
platform, name, slug, city, state, postal_code, address,
latitude, longitude, is_delivery, is_pickup, raw_metadata, updated_at
) VALUES (
$1, $2, $3, $4, $5, $6, $7,
$8, $9, $10, $11, $12, NOW()
)
ON CONFLICT (platform, slug, city, state) DO UPDATE SET
name = EXCLUDED.name,
postal_code = EXCLUDED.postal_code,
address = EXCLUDED.address,
latitude = EXCLUDED.latitude,
longitude = EXCLUDED.longitude,
raw_metadata = EXCLUDED.raw_metadata,
updated_at = NOW()
RETURNING id
`,
[
'dutchie', // Will be updated when Dutchie match is found
disp.dba_name || disp.name,
disp.slug,
disp.city,
disp.state,
disp.zip,
disp.address,
disp.latitude,
disp.longitude,
false, // is_delivery - unknown
true, // is_pickup - assume true
JSON.stringify({
azdhs_id: disp.azdhs_id,
main_db_id: disp.id,
company_name: disp.company_name,
phone: disp.phone,
email: disp.email,
website: disp.website,
google_rating: disp.google_rating,
google_review_count: disp.google_review_count,
menu_provider: disp.menu_provider,
product_provider: disp.product_provider,
}),
]
);
return result.rows[0].id;
}
/**
* Import all AZDHS dispensaries into the Dutchie AZ database
*/
export async function importAZDHSDispensaries(): Promise<ImportResult> {
console.log('[AZDHS Import] Starting import from main database...');
const result: ImportResult = {
total: 0,
imported: 0,
skipped: 0,
errors: [],
};
try {
const dispensaries = await fetchAZDHSDispensaries();
result.total = dispensaries.length;
console.log(`[AZDHS Import] Found ${dispensaries.length} AZ dispensaries in main DB`);
for (const disp of dispensaries) {
try {
const id = await importDispensary(disp);
result.imported++;
console.log(`[AZDHS Import] Imported: ${disp.name} (${disp.city}) -> id=${id}`);
} catch (error: any) {
if (error.message.includes('duplicate')) {
result.skipped++;
} else {
result.errors.push(`${disp.name}: ${error.message}`);
}
}
}
} catch (error: any) {
result.errors.push(`Failed to fetch from main DB: ${error.message}`);
}
console.log(`[AZDHS Import] Complete: ${result.imported} imported, ${result.skipped} skipped, ${result.errors.length} errors`);
return result;
}
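// Illustrative sketch, not part of the original service: the loop above isolates failures
// per record so that one bad row does not abort the whole import. `importAll`, `Summary`
// and the `label` callback are hypothetical names used only to show that pattern without
// a database.

```typescript
interface Summary {
  total: number;
  imported: number;
  errors: string[];
}

// Run `importOne` for every item, counting successes and collecting error messages.
async function importAll<T>(
  items: T[],
  label: (item: T) => string,
  importOne: (item: T) => Promise<number>
): Promise<Summary> {
  const summary: Summary = { total: items.length, imported: 0, errors: [] };
  for (const item of items) {
    try {
      await importOne(item);
      summary.imported++;
    } catch (error) {
      const message = error instanceof Error ? error.message : String(error);
      summary.errors.push(`${label(item)}: ${message}`);
    }
  }
  return summary;
}

// Stand-in importer: fails for one specific item, succeeds for the rest.
const fakeImport = async (item: string): Promise<number> => {
  if (item === 'bad') throw new Error('boom');
  return 1;
};

importAll(['a', 'bad', 'c'], (s) => s, fakeImport).then((summary) => {
  console.log(summary);
});
```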
/**
* Import dispensaries from JSON file (backup export)
*/
export async function importFromJSON(jsonPath: string): Promise<ImportResult> {
console.log(`[AZDHS Import] Importing from JSON: ${jsonPath}`);
const result: ImportResult = {
total: 0,
imported: 0,
skipped: 0,
errors: [],
};
try {
const fs = await import('fs/promises');
const data = await fs.readFile(jsonPath, 'utf-8');
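const parsed: unknown = JSON.parse(data);
// Fail early with a clear message if the file is not a JSON array
if (!Array.isArray(parsed)) {
throw new Error('Expected a JSON array of dispensaries');
}
const dispensaries = parsed as AZDHSDispensary[];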
result.total = dispensaries.length;
console.log(`[AZDHS Import] Found ${dispensaries.length} dispensaries in JSON file`);
for (const disp of dispensaries) {
try {
await importDispensary(disp);
result.imported++;
} catch (error: any) {
if (error.message.includes('duplicate')) {
result.skipped++;
} else {
result.errors.push(`${disp.name}: ${error.message}`);
}
}
}
} catch (error: any) {
result.errors.push(`Failed to read JSON file: ${error.message}`);
}
console.log(`[AZDHS Import] Complete: ${result.imported} imported, ${result.skipped} skipped, ${result.errors.length} errors`);
return result;
}
/**
* Get import statistics
*/
export async function getImportStats(): Promise<{
totalDispensaries: number;
withPlatformIds: number;
withoutPlatformIds: number;
lastImportedAt?: Date;
}> {
const { rows } = await dutchieQuery<{
total: string;
with_platform_id: string;
without_platform_id: string;
last_updated: Date;
}>(`
SELECT
COUNT(*) as total,
COUNT(platform_dispensary_id) as with_platform_id,
COUNT(*) - COUNT(platform_dispensary_id) as without_platform_id,
MAX(updated_at) as last_updated
FROM dispensaries
WHERE state = 'AZ'
`);
const stats = rows[0];
return {
totalDispensaries: parseInt(stats.total, 10),
withPlatformIds: parseInt(stats.with_platform_id, 10),
withoutPlatformIds: parseInt(stats.without_platform_id, 10),
lastImportedAt: stats.last_updated,
};
}
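// Illustrative sketch, not part of the original service: COUNT(*) is a bigint in
// PostgreSQL and node-postgres returns bigint values as strings by default, which is why
// the counts above go through parseInt. `StatsRow`, `ImportStats` and `toImportStats` are
// hypothetical names; the sketch also assumes MAX(updated_at) is null on an empty table.

```typescript
interface StatsRow {
  total: string;
  with_platform_id: string;
  without_platform_id: string;
  last_updated: Date | null;
}

interface ImportStats {
  totalDispensaries: number;
  withPlatformIds: number;
  withoutPlatformIds: number;
  lastImportedAt?: Date;
}

// Convert the string counts from the driver into numbers.
function toImportStats(row: StatsRow): ImportStats {
  return {
    totalDispensaries: parseInt(row.total, 10),
    withPlatformIds: parseInt(row.with_platform_id, 10),
    withoutPlatformIds: parseInt(row.without_platform_id, 10),
    lastImportedAt: row.last_updated ?? undefined,
  };
}

const exampleStats = toImportStats({
  total: '42',
  with_platform_id: '30',
  without_platform_id: '12',
  last_updated: null,
});
console.log(exampleStats);
```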

@@ -1,481 +0,0 @@
/**
* Directory-Based Store Matcher
*
* Scrapes provider directory pages (Curaleaf, Sol, etc.) to get store lists,
* then matches them to existing dispensaries by fuzzy name/city/address matching.
*
* This allows us to:
* 1. Find specific store URLs for directory-style websites
* 2. Match stores confidently by name+city
* 3. Mark non-Dutchie providers as not_crawlable until we build crawlers
*/
import { query } from '../db/connection';
// ============================================================
// TYPES
// ============================================================
export interface DirectoryStore {
name: string;
city: string;
state: string;
address: string | null;
storeUrl: string;
}
export interface MatchResult {
directoryStore: DirectoryStore;
dispensaryId: number | null;
dispensaryName: string | null;
confidence: 'high' | 'medium' | 'low' | 'none';
matchReason: string;
}
export interface DirectoryMatchReport {
provider: string;
totalDirectoryStores: number;
highConfidenceMatches: number;
mediumConfidenceMatches: number;
lowConfidenceMatches: number;
unmatched: number;
results: MatchResult[];
}
// ============================================================
// NORMALIZATION FUNCTIONS
// ============================================================
/**
* Normalize a string for comparison:
* - Lowercase
* - Remove common suffixes (dispensary, cannabis, etc.)
* - Remove punctuation
* - Collapse whitespace
*/
function normalizeForComparison(str: string): string {
if (!str) return '';
return str
.toLowerCase()
// Lookahead keeps the trailing space unconsumed so back-to-back suffix words are all removed
.replace(/\s+(?:dispensary|cannabis|marijuana|medical|recreational|shop|store|flower|wellness)(?=\s|$)/g, '')
.replace(/[^\w\s]/g, ' ') // Remove punctuation
.replace(/\s+/g, ' ') // Collapse whitespace
.trim();
}
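// Illustrative sketch, not part of the original module: a standalone copy of the
// normalization steps with a few sample inputs. `normalizeName` and `SUFFIX_WORDS` are
// hypothetical names; this variant uses a lookahead so that two suffix words in a row
// (for example "Cannabis Dispensary") are both stripped.

```typescript
const SUFFIX_WORDS =
  /\s+(?:dispensary|cannabis|marijuana|medical|recreational|shop|store|flower|wellness)(?=\s|$)/g;

function normalizeName(str: string): string {
  if (!str) return '';
  return str
    .toLowerCase()
    .replace(SUFFIX_WORDS, '') // drop generic suffix words (never the first word)
    .replace(/[^\w\s]/g, ' ') // punctuation becomes whitespace
    .replace(/\s+/g, ' ') // collapse runs of whitespace
    .trim();
}

console.log(normalizeName('Sol Flower Dispensary - Tempe'));
console.log(normalizeName('Curaleaf Cannabis Dispensary'));
console.log(normalizeName("Nature's Medicine"));
```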
/**
* Normalize city name for comparison
*/
function normalizeCity(city: string): string {
if (!city) return '';
return city
.toLowerCase()
.replace(/[^\w\s]/g, '')
.trim();
}
/**
* Calculate similarity between two strings (0-1)
* Uses Levenshtein distance normalized by max length
*/
function stringSimilarity(a: string, b: string): number {
if (!a || !b) return 0;
if (a === b) return 1;
const longer = a.length > b.length ? a : b;
const shorter = a.length > b.length ? b : a;
if (longer.length === 0) return 1;
const distance = levenshteinDistance(longer, shorter);
return (longer.length - distance) / longer.length;
}
/**
* Levenshtein distance between two strings
*/
function levenshteinDistance(a: string, b: string): number {
const matrix: number[][] = [];
for (let i = 0; i <= b.length; i++) {
matrix[i] = [i];
}
for (let j = 0; j <= a.length; j++) {
matrix[0][j] = j;
}
for (let i = 1; i <= b.length; i++) {
for (let j = 1; j <= a.length; j++) {
if (b.charAt(i - 1) === a.charAt(j - 1)) {
matrix[i][j] = matrix[i - 1][j - 1];
} else {
matrix[i][j] = Math.min(
matrix[i - 1][j - 1] + 1, // substitution
matrix[i][j - 1] + 1, // insertion
matrix[i - 1][j] + 1 // deletion
);
}
}
}
return matrix[b.length][a.length];
}
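// Illustrative sketch, not part of the original module: a worked example of the
// similarity formula, (maxLength - distance) / maxLength. `editDistance` and `similarity`
// are hypothetical names; the distance here keeps only two rows of the matrix, which
// should give the same result as the full-matrix version above.

```typescript
function editDistance(a: string, b: string): number {
  // prev[j] holds the distance between the first i-1 chars of a and the first j chars of b
  let prev: number[] = [];
  for (let j = 0; j <= b.length; j++) prev.push(j);
  for (let i = 1; i <= a.length; i++) {
    const curr: number[] = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a.charAt(i - 1) === b.charAt(j - 1) ? 0 : 1;
      curr.push(
        Math.min(
          prev[j - 1] + cost, // substitution (or match)
          prev[j] + 1, // deletion
          curr[j - 1] + 1 // insertion
        )
      );
    }
    prev = curr;
  }
  return prev[b.length];
}

function similarity(a: string, b: string): number {
  if (!a || !b) return 0;
  if (a === b) return 1;
  const maxLength = Math.max(a.length, b.length);
  return (maxLength - editDistance(a, b)) / maxLength;
}

// "kitten" -> "sitting" needs 3 edits over a max length of 7, so similarity is 4/7.
console.log(editDistance('kitten', 'sitting'), similarity('kitten', 'sitting'));
```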
/**
* Check if string contains another (with normalization)
*/
function containsNormalized(haystack: string, needle: string): boolean {
return normalizeForComparison(haystack).includes(normalizeForComparison(needle));
}
// ============================================================
// PROVIDER DIRECTORY SCRAPERS
// ============================================================
/**
* Hardcoded Sol Flower locations, used when the live directory cannot be scraped
*/
const SOL_FALLBACK_STORES: DirectoryStore[] = [
{ name: 'Sol Flower 32nd & Shea', city: 'Phoenix', state: 'AZ', address: '3217 E Shea Blvd Suite 1 A', storeUrl: 'https://www.livewithsol.com/locations/deer-valley/' },
{ name: 'Sol Flower Scottsdale Airpark', city: 'Scottsdale', state: 'AZ', address: '14980 N 78th Way Ste 204', storeUrl: 'https://www.livewithsol.com/locations/scottsdale-airpark/' },
{ name: 'Sol Flower Sun City', city: 'Sun City', state: 'AZ', address: '13650 N 99th Ave', storeUrl: 'https://www.livewithsol.com/locations/sun-city/' },
{ name: 'Sol Flower Tempe McClintock', city: 'Tempe', state: 'AZ', address: '1322 N McClintock Dr', storeUrl: 'https://www.livewithsol.com/locations/tempe-mcclintock/' },
{ name: 'Sol Flower Tempe University', city: 'Tempe', state: 'AZ', address: '2424 W University Dr', storeUrl: 'https://www.livewithsol.com/locations/tempe-university/' },
{ name: 'Sol Flower Foothills Tucson', city: 'Tucson', state: 'AZ', address: '6026 N Oracle Rd', storeUrl: 'https://www.livewithsol.com/locations/foothills-tucson/' },
{ name: 'Sol Flower South Tucson', city: 'Tucson', state: 'AZ', address: '3000 W Valencia Rd Ste 210', storeUrl: 'https://www.livewithsol.com/locations/south-tucson/' },
{ name: 'Sol Flower North Tucson', city: 'Tucson', state: 'AZ', address: '4837 N 1st Ave', storeUrl: 'https://www.livewithsol.com/locations/north-tucson/' },
{ name: 'Sol Flower Casas Adobes', city: 'Tucson', state: 'AZ', address: '6437 N Oracle Rd', storeUrl: 'https://www.livewithsol.com/locations/casas-adobes/' },
];
/**
* Sol Flower (livewithsol.com) - Static HTML, easy to scrape
*/
export async function scrapeSolDirectory(): Promise<DirectoryStore[]> {
console.log('[DirectoryMatcher] Scraping Sol Flower directory...');
try {
const response = await fetch('https://www.livewithsol.com/locations/', {
headers: {
'User-Agent':
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
Accept: 'text/html',
},
});
if (!response.ok) {
throw new Error(`HTTP ${response.status}`);
}
const html = await response.text();
// Extract store entries from HTML
// Sol's structure: Each location has name, address in specific divs
const stores: DirectoryStore[] = [];
// Pattern to find location cards
// Format: <a href="/locations/slug/">NAME</a> with address nearby
const locationRegex =
/<a[^>]+href="(\/locations\/[^"]+)"[^>]*>([^<]+)<\/a>[\s\S]*?(\d+[^<]+(?:Ave|St|Blvd|Dr|Rd|Way)[^<]*)/gi;
let match;
while ((match = locationRegex.exec(html)) !== null) {
const [, path, name, address] = match;
// Extract city from common Arizona cities
let city = 'Unknown';
const cityPatterns = [
{ pattern: /phoenix/i, city: 'Phoenix' },
{ pattern: /scottsdale/i, city: 'Scottsdale' },
{ pattern: /tempe/i, city: 'Tempe' },
{ pattern: /tucson/i, city: 'Tucson' },
{ pattern: /mesa/i, city: 'Mesa' },
{ pattern: /sun city/i, city: 'Sun City' },
{ pattern: /glendale/i, city: 'Glendale' },
];
for (const { pattern, city: cityName } of cityPatterns) {
if (pattern.test(name) || pattern.test(address)) {
city = cityName;
break;
}
}
stores.push({
name: name.trim(),
city,
state: 'AZ',
address: address.trim(),
storeUrl: `https://www.livewithsol.com${path}`,
});
}
// If the regex found nothing, fall back to the known hardcoded locations
if (stores.length === 0) {
console.log('[DirectoryMatcher] Using hardcoded Sol locations');
return [...SOL_FALLBACK_STORES];
}
console.log(`[DirectoryMatcher] Found ${stores.length} Sol Flower locations`);
return stores;
} catch (error: any) {
console.error('[DirectoryMatcher] Error scraping Sol directory:', error.message);
// Return hardcoded fallback
return [...SOL_FALLBACK_STORES];
}
}
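// Illustrative sketch, not part of the original module: how the location regex behaves on
// a small HTML fragment. The markup in `sampleHtml` is an assumption made up for this
// example and is not the real livewithsol.com page; `extractLocations` and `LocationHit`
// are hypothetical names.

```typescript
interface LocationHit {
  path: string;
  name: string;
  address: string;
}

function extractLocations(html: string): LocationHit[] {
  // Same pattern as above: link to /locations/<slug>/, link text, then the first street-like address
  const locationRegex =
    /<a[^>]+href="(\/locations\/[^"]+)"[^>]*>([^<]+)<\/a>[\s\S]*?(\d+[^<]+(?:Ave|St|Blvd|Dr|Rd|Way)[^<]*)/gi;
  const hits: LocationHit[] = [];
  let match: RegExpExecArray | null;
  while ((match = locationRegex.exec(html)) !== null) {
    hits.push({ path: match[1], name: match[2].trim(), address: match[3].trim() });
  }
  return hits;
}

const sampleHtml = `
<div class="location">
  <a href="/locations/sun-city/">Sol Flower Sun City</a>
  <p>13650 N 99th Ave</p>
</div>
<div class="location">
  <a href="/locations/tempe-university/">Sol Flower Tempe University</a>
  <p>2424 W University Dr</p>
</div>`;

const sampleHits = extractLocations(sampleHtml);
console.log(sampleHits);
```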
/**
* Curaleaf - Has age-gate, so we need hardcoded AZ locations
* In production, this would use Playwright to bypass age-gate
*/
export async function scrapeCuraleafDirectory(): Promise<DirectoryStore[]> {
console.log('[DirectoryMatcher] Using hardcoded Curaleaf AZ locations (age-gate blocks simple fetch)...');
// Hardcoded Arizona Curaleaf locations from public knowledge
// These would be scraped via Playwright in production
return [
{ name: 'Curaleaf Phoenix Camelback', city: 'Phoenix', state: 'AZ', address: '4811 E Camelback Rd', storeUrl: 'https://curaleaf.com/stores/curaleaf-az-phoenix-camelback' },
{ name: 'Curaleaf Phoenix Midtown', city: 'Phoenix', state: 'AZ', address: '1928 E Highland Ave', storeUrl: 'https://curaleaf.com/stores/curaleaf-az-phoenix-midtown' },
{ name: 'Curaleaf Glendale East', city: 'Glendale', state: 'AZ', address: '5150 W Glendale Ave', storeUrl: 'https://curaleaf.com/stores/curaleaf-az-glendale-east' },
{ name: 'Curaleaf Glendale West', city: 'Glendale', state: 'AZ', address: '6501 W Glendale Ave', storeUrl: 'https://curaleaf.com/stores/curaleaf-az-glendale-west' },
{ name: 'Curaleaf Gilbert', city: 'Gilbert', state: 'AZ', address: '1736 E Williams Field Rd', storeUrl: 'https://curaleaf.com/stores/curaleaf-az-gilbert' },
{ name: 'Curaleaf Mesa', city: 'Mesa', state: 'AZ', address: '1540 S Power Rd', storeUrl: 'https://curaleaf.com/stores/curaleaf-az-mesa' },
{ name: 'Curaleaf Tempe', city: 'Tempe', state: 'AZ', address: '1815 E Broadway Rd', storeUrl: 'https://curaleaf.com/stores/curaleaf-az-tempe' },
{ name: 'Curaleaf Scottsdale', city: 'Scottsdale', state: 'AZ', address: '8904 E Indian Bend Rd', storeUrl: 'https://curaleaf.com/stores/curaleaf-az-scottsdale' },
{ name: 'Curaleaf Tucson Prince', city: 'Tucson', state: 'AZ', address: '3955 W Prince Rd', storeUrl: 'https://curaleaf.com/stores/curaleaf-az-tucson-prince' },
{ name: 'Curaleaf Tucson Midvale', city: 'Tucson', state: 'AZ', address: '2936 N Midvale Park Rd', storeUrl: 'https://curaleaf.com/stores/curaleaf-az-tucson-midvale' },
{ name: 'Curaleaf Sedona', city: 'Sedona', state: 'AZ', address: '525 AZ-179', storeUrl: 'https://curaleaf.com/stores/curaleaf-az-sedona' },
{ name: 'Curaleaf Youngtown', city: 'Youngtown', state: 'AZ', address: '11125 W Grand Ave', storeUrl: 'https://curaleaf.com/stores/curaleaf-az-youngtown' },
];
}
// ============================================================
// MATCHING LOGIC
// ============================================================
interface Dispensary {
id: number;
name: string;
city: string | null;
state: string | null;
address: string | null;
menu_type: string | null;
menu_url: string | null;
website: string | null;
}
/**
* Match a directory store to an existing dispensary
*/
function matchStoreToDispensary(store: DirectoryStore, dispensaries: Dispensary[]): MatchResult {
const normalizedStoreName = normalizeForComparison(store.name);
const normalizedStoreCity = normalizeCity(store.city);
let bestMatch: Dispensary | null = null;
let bestScore = 0;
let matchReason = '';
for (const disp of dispensaries) {
const normalizedDispName = normalizeForComparison(disp.name);
const normalizedDispCity = normalizeCity(disp.city || '');
let score = 0;
const reasons: string[] = [];
// 1. Name similarity (max 50 points)
const nameSimilarity = stringSimilarity(normalizedStoreName, normalizedDispName);
score += nameSimilarity * 50;
if (nameSimilarity > 0.8) reasons.push(`name_match(${(nameSimilarity * 100).toFixed(0)}%)`);
// 2. City match (25 points for exact, 15 for partial)
if (normalizedStoreCity && normalizedDispCity) {
if (normalizedStoreCity === normalizedDispCity) {
score += 25;
reasons.push('city_exact');
} else if (
normalizedStoreCity.includes(normalizedDispCity) ||
normalizedDispCity.includes(normalizedStoreCity)
) {
score += 15;
reasons.push('city_partial');
}
}
// 3. Address contains street name (15 points)
if (store.address && disp.address) {
const storeStreet = store.address.toLowerCase().split(/\s+/).slice(1, 4).join(' ');
const dispStreet = disp.address.toLowerCase().split(/\s+/).slice(1, 4).join(' ');
if (storeStreet && dispStreet && stringSimilarity(storeStreet, dispStreet) > 0.7) {
score += 15;
reasons.push('address_match');
}
}
// 4. Brand name in dispensary name (10 points)
const brandName = store.name.split(' ')[0].toLowerCase(); // e.g., "Curaleaf", "Sol"
if (disp.name.toLowerCase().includes(brandName)) {
score += 10;
reasons.push('brand_match');
}
if (score > bestScore) {
bestScore = score;
bestMatch = disp;
matchReason = reasons.join(', ');
}
}
// Determine confidence level
let confidence: 'high' | 'medium' | 'low' | 'none';
if (bestScore >= 70) {
confidence = 'high';
} else if (bestScore >= 50) {
confidence = 'medium';
} else if (bestScore >= 30) {
confidence = 'low';
} else {
confidence = 'none';
}
return {
directoryStore: store,
dispensaryId: bestMatch?.id || null,
dispensaryName: bestMatch?.name || null,
confidence,
matchReason: matchReason || 'no_match',
};
}
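// Illustrative sketch, not part of the original module: the point system above adds up to
// at most 100 (50 name + 25 city + 15 address + 10 brand) and is bucketed at 70/50/30.
// `ScoreParts`, `totalScore` and `toConfidence` are hypothetical names used to work
// through a few cases.

```typescript
type Confidence = 'high' | 'medium' | 'low' | 'none';

interface ScoreParts {
  nameSimilarity: number; // 0..1
  cityMatch: 'exact' | 'partial' | 'none';
  addressMatch: boolean;
  brandMatch: boolean;
}

function totalScore(parts: ScoreParts): number {
  const cityPoints = parts.cityMatch === 'exact' ? 25 : parts.cityMatch === 'partial' ? 15 : 0;
  return (
    parts.nameSimilarity * 50 +
    cityPoints +
    (parts.addressMatch ? 15 : 0) +
    (parts.brandMatch ? 10 : 0)
  );
}

function toConfidence(score: number): Confidence {
  if (score >= 70) return 'high';
  if (score >= 50) return 'medium';
  if (score >= 30) return 'low';
  return 'none';
}

// Strong name match, same city, brand present, no address match: 45 + 25 + 10 = 80
const strong = totalScore({ nameSimilarity: 0.9, cityMatch: 'exact', addressMatch: false, brandMatch: true });
console.log(strong, toConfidence(strong));
```

// Note that a same-city store with only a middling name match can still reach 'medium',
// which is one reason only 'high' matches are applied automatically.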
// ============================================================
// MAIN FUNCTIONS
// ============================================================
/**
* Run directory matching for a provider and update database
* Only applies high-confidence matches automatically
*/
export async function matchDirectoryToDispensaries(
provider: 'curaleaf' | 'sol',
dryRun: boolean = true
): Promise<DirectoryMatchReport> {
console.log(`[DirectoryMatcher] Running ${provider} directory matching (dryRun=${dryRun})...`);
// Get directory stores
let directoryStores: DirectoryStore[];
if (provider === 'curaleaf') {
directoryStores = await scrapeCuraleafDirectory();
} else if (provider === 'sol') {
directoryStores = await scrapeSolDirectory();
} else {
throw new Error(`Unknown provider: ${provider}`);
}
// Get all AZ dispensaries from database
const { rows: dispensaries } = await query<Dispensary>(
`SELECT id, name, city, state, address, menu_type, menu_url, website
FROM dispensaries
WHERE state = 'AZ'`
);
console.log(`[DirectoryMatcher] Matching ${directoryStores.length} directory stores against ${dispensaries.length} dispensaries`);
// Match each directory store
const results: MatchResult[] = [];
for (const store of directoryStores) {
const match = matchStoreToDispensary(store, dispensaries);
results.push(match);
// Only apply high-confidence matches if not dry run
if (!dryRun && match.confidence === 'high' && match.dispensaryId) {
await applyDirectoryMatch(match.dispensaryId, provider, store);
}
}
// Count results
const report: DirectoryMatchReport = {
provider,
totalDirectoryStores: directoryStores.length,
highConfidenceMatches: results.filter((r) => r.confidence === 'high').length,
mediumConfidenceMatches: results.filter((r) => r.confidence === 'medium').length,
lowConfidenceMatches: results.filter((r) => r.confidence === 'low').length,
unmatched: results.filter((r) => r.confidence === 'none').length,
results,
};
console.log(`[DirectoryMatcher] ${provider} matching complete:`);
console.log(` - High confidence: ${report.highConfidenceMatches}`);
console.log(` - Medium confidence: ${report.mediumConfidenceMatches}`);
console.log(` - Low confidence: ${report.lowConfidenceMatches}`);
console.log(` - Unmatched: ${report.unmatched}`);
return report;
}
/**
* Apply a directory match to a dispensary
*/
async function applyDirectoryMatch(
dispensaryId: number,
provider: string,
store: DirectoryStore
): Promise<void> {
console.log(`[DirectoryMatcher] Applying match: dispensary ${dispensaryId} -> ${store.storeUrl}`);
await query(
`
UPDATE dispensaries SET
menu_type = $1,
menu_url = $2,
platform_dispensary_id = NULL,
provider_detection_data = COALESCE(provider_detection_data, '{}'::jsonb) ||
jsonb_build_object(
'detected_provider', $1::text,
'detection_method', 'directory_match'::text,
'detected_at', NOW(),
'directory_store_name', $3::text,
'directory_store_url', $2::text,
'directory_store_city', $4::text,
'directory_store_address', $5::text,
'not_crawlable', true,
'not_crawlable_reason', $6::text
),
updated_at = NOW()
WHERE id = $7
`,
[
provider,
store.storeUrl,
store.name,
store.city,
store.address,
`${provider} proprietary menu - no crawler available`,
dispensaryId,
]
);
}
/**
* Preview matches without applying them
*/
export async function previewDirectoryMatches(
provider: 'curaleaf' | 'sol'
): Promise<DirectoryMatchReport> {
return matchDirectoryToDispensaries(provider, true);
}
/**
* Apply high-confidence matches
*/
export async function applyHighConfidenceMatches(
provider: 'curaleaf' | 'sol'
): Promise<DirectoryMatchReport> {
return matchDirectoryToDispensaries(provider, false);
}

@@ -1,592 +0,0 @@
/**
* Dutchie AZ Discovery Service
*
* Discovers and manages dispensaries from Dutchie for Arizona.
*/
import { query, getClient } from '../db/connection';
import { discoverArizonaDispensaries, resolveDispensaryId, resolveDispensaryIdWithDetails, ResolveDispensaryResult } from './graphql-client';
import { Dispensary } from '../types';
/**
* Upsert a dispensary record
*/
async function upsertDispensary(dispensary: Partial<Dispensary>): Promise<number> {
const result = await query<{ id: number }>(
`
INSERT INTO dispensaries (
platform, name, slug, city, state, postal_code, address,
latitude, longitude, platform_dispensary_id,
is_delivery, is_pickup, raw_metadata, updated_at
) VALUES (
$1, $2, $3, $4, $5, $6, $7,
$8, $9, $10,
$11, $12, $13, NOW()
)
ON CONFLICT (platform, slug, city, state) DO UPDATE SET
name = EXCLUDED.name,
postal_code = EXCLUDED.postal_code,
address = EXCLUDED.address,
latitude = EXCLUDED.latitude,
longitude = EXCLUDED.longitude,
platform_dispensary_id = COALESCE(EXCLUDED.platform_dispensary_id, dispensaries.platform_dispensary_id),
is_delivery = EXCLUDED.is_delivery,
is_pickup = EXCLUDED.is_pickup,
raw_metadata = EXCLUDED.raw_metadata,
updated_at = NOW()
RETURNING id
`,
[
dispensary.platform || 'dutchie',
dispensary.name,
dispensary.slug,
dispensary.city,
dispensary.state || 'AZ',
dispensary.postalCode,
dispensary.address,
dispensary.latitude,
dispensary.longitude,
dispensary.platformDispensaryId,
dispensary.isDelivery || false,
dispensary.isPickup ?? true, // default to true only when unset; `|| true` would override an explicit false
dispensary.rawMetadata ? JSON.stringify(dispensary.rawMetadata) : null,
]
);
return result.rows[0].id;
}
/**
* Normalize a raw discovery result to Dispensary
*/
function normalizeDispensary(raw: any): Partial<Dispensary> {
return {
platform: 'dutchie',
name: raw.name || raw.Name || '',
slug: raw.slug || raw.cName || raw.id || '',
city: raw.city || raw.address?.city || '',
state: 'AZ',
postalCode: raw.postalCode || raw.address?.postalCode || raw.address?.zip,
address: raw.streetAddress || raw.address?.streetAddress,
latitude: raw.latitude || raw.location?.lat,
longitude: raw.longitude || raw.location?.lng,
platformDispensaryId: raw.dispensaryId || raw.id || null,
isDelivery: raw.isDelivery || raw.delivery || false,
isPickup: raw.isPickup ?? raw.pickup ?? true, // `|| true` would always evaluate to true
rawMetadata: raw,
};
}
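// Illustrative sketch, not part of the original module: for boolean fields such as
// isPickup, a `|| true` fallback can never produce false, whereas `?? true` applies the
// default only when the value is null or undefined. `RawFlags`, `pickupWithOr` and
// `pickupWithNullish` are hypothetical names.

```typescript
interface RawFlags {
  isPickup?: boolean;
  pickup?: boolean;
}

// Logical OR: an explicit `false` is falsy, so the chain always ends at `true`.
const pickupWithOr = (raw: RawFlags): boolean => raw.isPickup || raw.pickup || true;

// Nullish coalescing: an explicit `false` is kept; only missing values fall through.
const pickupWithNullish = (raw: RawFlags): boolean => raw.isPickup ?? raw.pickup ?? true;

console.log(pickupWithOr({ isPickup: false })); // true
console.log(pickupWithNullish({ isPickup: false })); // false
console.log(pickupWithNullish({})); // true
```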
/**
* Seed a small hardcoded list of known Dutchie dispensaries.
* Despite the name, this does not read the existing dispensaries table (AZDHS data);
* it upserts the entries below so their Dutchie IDs can be resolved later.
*/
export async function importFromExistingDispensaries(): Promise<{ imported: number }> {
console.log('[Discovery] Importing from existing dispensaries table...');
// This is a workaround - we'll use the dispensaries we already know about
// and try to resolve their Dutchie IDs
const knownDispensaries = [
{ name: 'Deeply Rooted', slug: 'AZ-Deeply-Rooted', city: 'Phoenix', state: 'AZ' },
{ name: 'Curaleaf Gilbert', slug: 'curaleaf-gilbert', city: 'Gilbert', state: 'AZ' },
{ name: 'Zen Leaf Prescott', slug: 'AZ-zen-leaf-prescott', city: 'Prescott', state: 'AZ' },
// Add more known Dutchie stores here
];
let imported = 0;
for (const disp of knownDispensaries) {
try {
const id = await upsertDispensary({
platform: 'dutchie',
name: disp.name,
slug: disp.slug,
city: disp.city,
state: disp.state,
});
imported++;
console.log(`[Discovery] Imported: ${disp.name} (id=${id})`);
} catch (error: any) {
console.error(`[Discovery] Failed to import ${disp.name}:`, error.message);
}
}
return { imported };
}
/**
* Discover all Arizona Dutchie dispensaries via GraphQL
*/
export async function discoverDispensaries(): Promise<{ discovered: number; errors: string[] }> {
console.log('[Discovery] Starting Arizona dispensary discovery...');
const errors: string[] = [];
let discovered = 0;
try {
const rawDispensaries = await discoverArizonaDispensaries();
console.log(`[Discovery] Found ${rawDispensaries.length} dispensaries from GraphQL`);
for (const raw of rawDispensaries) {
try {
const normalized = normalizeDispensary(raw);
if (normalized.name && normalized.slug && normalized.city) {
await upsertDispensary(normalized);
discovered++;
}
} catch (error: any) {
errors.push(`${raw.name || raw.slug}: ${error.message}`);
}
}
} catch (error: any) {
errors.push(`Discovery failed: ${error.message}`);
}
console.log(`[Discovery] Completed: ${discovered} dispensaries, ${errors.length} errors`);
return { discovered, errors };
}
/**
* Check if a string looks like a MongoDB ObjectId (24 hex chars)
*/
export function isObjectId(value: string): boolean {
return /^[a-f0-9]{24}$/i.test(value);
}
/**
* Extract cName (slug) or platform_dispensary_id from a Dutchie menu_url
*
* Supports formats:
* - https://dutchie.com/embedded-menu/<cName> -> returns { type: 'cName', value: '<cName>' }
* - https://dutchie.com/dispensary/<cName> -> returns { type: 'cName', value: '<cName>' }
* - https://dutchie.com/api/v2/embedded-menu/<id>.js -> returns { type: 'platformId', value: '<id>' }
*
* For backward compatibility, extractCNameFromMenuUrl still returns just the string value.
*/
export interface MenuUrlExtraction {
type: 'cName' | 'platformId';
value: string;
}
export function extractFromMenuUrl(menuUrl: string | null | undefined): MenuUrlExtraction | null {
if (!menuUrl) return null;
try {
const url = new URL(menuUrl);
const pathname = url.pathname;
// Match /api/v2/embedded-menu/<id>.js - this contains the platform_dispensary_id directly
const apiMatch = pathname.match(/^\/api\/v2\/embedded-menu\/([a-f0-9]{24})\.js$/i);
if (apiMatch) {
return { type: 'platformId', value: apiMatch[1] };
}
// Match /embedded-menu/<cName> or /dispensary/<cName>
const embeddedMatch = pathname.match(/^\/embedded-menu\/([^/?]+)/);
if (embeddedMatch) {
const value = embeddedMatch[1];
// Check if it's actually an ObjectId (some URLs use ID directly)
if (isObjectId(value)) {
return { type: 'platformId', value };
}
return { type: 'cName', value };
}
const dispensaryMatch = pathname.match(/^\/dispensary\/([^/?]+)/);
if (dispensaryMatch) {
const value = dispensaryMatch[1];
if (isObjectId(value)) {
return { type: 'platformId', value };
}
return { type: 'cName', value };
}
return null;
} catch {
return null;
}
}
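// Illustrative sketch, not part of the original module: a condensed, standalone version of
// the URL classification with sample inputs. `classifyMenuUrl` and `Extraction` are
// hypothetical names, and the sample URLs (including the 24-character hex ID) are made up
// for the example.

```typescript
type Extraction = { type: 'cName' | 'platformId'; value: string };

const OBJECT_ID = /^[a-f0-9]{24}$/i;

function classifyMenuUrl(menuUrl: string): Extraction | null {
  let pathname: string;
  try {
    pathname = new URL(menuUrl).pathname;
  } catch {
    return null; // not a parseable absolute URL
  }
  // /api/v2/embedded-menu/<id>.js carries the platform ID directly
  const api = pathname.match(/^\/api\/v2\/embedded-menu\/([a-f0-9]{24})\.js$/i);
  if (api) return { type: 'platformId', value: api[1] };
  // /embedded-menu/<x> and /dispensary/<x> carry a cName, or sometimes an ID
  const page = pathname.match(/^\/(?:embedded-menu|dispensary)\/([^/?]+)/);
  if (!page) return null;
  return { type: OBJECT_ID.test(page[1]) ? 'platformId' : 'cName', value: page[1] };
}

console.log(classifyMenuUrl('https://dutchie.com/embedded-menu/AZ-Deeply-Rooted'));
console.log(classifyMenuUrl('https://dutchie.com/api/v2/embedded-menu/0123456789abcdef01234567.js'));
console.log(classifyMenuUrl('not a url'));
```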
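/**
* Extract the identifier from a Dutchie menu_url as a plain string.
* Backward compatible: the result may be a platform ID rather than a cName
* when the URL embeds a 24-character ObjectId; use extractFromMenuUrl to tell them apart.
*/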
export function extractCNameFromMenuUrl(menuUrl: string | null | undefined): string | null {
const extraction = extractFromMenuUrl(menuUrl);
return extraction?.value || null;
}
/**
* Resolve platform dispensary IDs for all dispensaries that don't have one
* CRITICAL: Uses cName extracted from menu_url, NOT the slug column!
*
* Uses the new resolveDispensaryIdWithDetails which:
* 1. Extracts dispensaryId from window.reactEnv in the embedded menu page (preferred)
* 2. Falls back to GraphQL if reactEnv extraction fails
* 3. Returns HTTP status so we can mark 403/404 stores as not_crawlable
*/
export async function resolvePlatformDispensaryIds(): Promise<{ resolved: number; failed: number; skipped: number; notCrawlable: number }> {
console.log('[Discovery] Resolving platform dispensary IDs...');
const { rows: dispensaries } = await query<any>(
`
SELECT id, name, slug, menu_url, menu_type, platform_dispensary_id, crawl_status
FROM dispensaries
WHERE menu_type = 'dutchie'
AND platform_dispensary_id IS NULL
AND menu_url IS NOT NULL
AND (crawl_status IS NULL OR crawl_status != 'not_crawlable')
ORDER BY id
`
);
let resolved = 0;
let failed = 0;
let skipped = 0;
let notCrawlable = 0;
for (const dispensary of dispensaries) {
try {
// Extract cName from menu_url - this is the CORRECT way to get the Dutchie slug
const cName = extractCNameFromMenuUrl(dispensary.menu_url);
if (!cName) {
console.log(`[Discovery] Skipping ${dispensary.name}: Could not extract cName from menu_url: ${dispensary.menu_url}`);
skipped++;
continue;
}
console.log(`[Discovery] Resolving ID for: ${dispensary.name} (cName=${cName}, menu_url=${dispensary.menu_url})`);
// Use the new detailed resolver that extracts from reactEnv first
const result = await resolveDispensaryIdWithDetails(cName);
if (result.dispensaryId) {
// SUCCESS: Store resolved
await query(
`
UPDATE dispensaries
SET platform_dispensary_id = $1,
platform_dispensary_id_resolved_at = NOW(),
crawl_status = 'ready',
crawl_status_reason = $2,
crawl_status_updated_at = NOW(),
last_tested_menu_url = $3,
last_http_status = $4,
updated_at = NOW()
WHERE id = $5
`,
[
result.dispensaryId,
`Resolved from ${result.source || 'page'}`,
dispensary.menu_url,
result.httpStatus,
dispensary.id,
]
);
resolved++;
console.log(`[Discovery] Resolved: ${cName} -> ${result.dispensaryId} (source: ${result.source})`);
} else if (result.httpStatus === 403 || result.httpStatus === 404) {
// NOT CRAWLABLE: Store removed or not accessible
await query(
`
UPDATE dispensaries
SET platform_dispensary_id = NULL,
crawl_status = 'not_crawlable',
crawl_status_reason = $1,
crawl_status_updated_at = NOW(),
last_tested_menu_url = $2,
last_http_status = $3,
updated_at = NOW()
WHERE id = $4
`,
[
result.error || `HTTP ${result.httpStatus}: Removed from Dutchie`,
dispensary.menu_url,
result.httpStatus,
dispensary.id,
]
);
notCrawlable++;
console.log(`[Discovery] Marked not crawlable: ${cName} (HTTP ${result.httpStatus})`);
} else {
// FAILED: Could not resolve but page loaded
await query(
`
UPDATE dispensaries
SET crawl_status = 'not_ready',
crawl_status_reason = $1,
crawl_status_updated_at = NOW(),
last_tested_menu_url = $2,
last_http_status = $3,
updated_at = NOW()
WHERE id = $4
`,
[
result.error || 'Could not extract dispensaryId from page',
dispensary.menu_url,
result.httpStatus,
dispensary.id,
]
);
failed++;
console.log(`[Discovery] Could not resolve: ${cName} - ${result.error}`);
}
// Delay between requests
await new Promise((r) => setTimeout(r, 2000));
} catch (error: any) {
failed++;
console.error(`[Discovery] Error resolving ${dispensary.name}:`, error.message);
}
}
console.log(`[Discovery] Completed: ${resolved} resolved, ${failed} failed, ${skipped} skipped, ${notCrawlable} not crawlable`);
return { resolved, failed, skipped, notCrawlable };
}
// Use shared dispensary columns (handles optional columns like provider_detection_data)
import { DISPENSARY_COLUMNS } from '../db/dispensary-columns';
/**
* Get all dispensaries
*/
export async function getAllDispensaries(): Promise<Dispensary[]> {
const { rows } = await query(
`SELECT ${DISPENSARY_COLUMNS} FROM dispensaries WHERE menu_type = 'dutchie' ORDER BY name`
);
return rows.map(mapDbRowToDispensary);
}
/**
* Map snake_case DB row to camelCase Dispensary object
* CRITICAL: DB returns snake_case (platform_dispensary_id) but TypeScript expects camelCase (platformDispensaryId)
* This function is exported for use in other modules that query dispensaries directly.
*
* NOTE: The consolidated dispensaries table column mappings:
* - zip → postalCode
* - menu_type → menuType (keep platform as 'dutchie')
* - last_crawl_at → lastCrawledAt
* - platform_dispensary_id → platformDispensaryId
*/
export function mapDbRowToDispensary(row: any): Dispensary {
// Extract website from raw_metadata if available (field may not exist in all environments)
let rawMetadata = undefined;
if (row.raw_metadata !== undefined) {
rawMetadata = typeof row.raw_metadata === 'string'
? JSON.parse(row.raw_metadata)
: row.raw_metadata;
}
const website = row.website || rawMetadata?.website || undefined;
return {
id: row.id,
platform: row.platform || 'dutchie', // keep platform as-is, default to 'dutchie'
name: row.name,
dbaName: row.dbaName || row.dba_name || undefined, // dba_name column is optional
slug: row.slug,
city: row.city,
state: row.state,
postalCode: row.postalCode || row.zip || row.postal_code,
latitude: row.latitude ? parseFloat(row.latitude) : undefined,
longitude: row.longitude ? parseFloat(row.longitude) : undefined,
address: row.address,
platformDispensaryId: row.platformDispensaryId || row.platform_dispensary_id, // CRITICAL mapping!
isDelivery: row.is_delivery,
isPickup: row.is_pickup,
rawMetadata: rawMetadata,
lastCrawledAt: row.lastCrawledAt || row.last_crawl_at, // use last_crawl_at
productCount: row.product_count,
createdAt: row.created_at,
updatedAt: row.updated_at,
menuType: row.menuType || row.menu_type,
menuUrl: row.menuUrl || row.menu_url,
scrapeEnabled: row.scrapeEnabled ?? row.scrape_enabled,
providerDetectionData: row.provider_detection_data,
platformDispensaryIdResolvedAt: row.platform_dispensary_id_resolved_at,
website,
};
}
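The per-field fallbacks in mapDbRowToDispensary (for example `row.platformDispensaryId || row.platform_dispensary_id`) exist because rows may arrive in either naming style. As an illustration only, and not what this module does, a generic key converter could handle the mechanical part of that mapping. The helper names `snakeToCamel` and `camelizeRow` are hypothetical.

```typescript
// Hypothetical helpers, not part of the module shown above.
// Converts a single snake_case key to camelCase; keys that are already camelCase pass through unchanged.
function snakeToCamel(key: string): string {
  return key.replace(/_([a-z0-9])/g, (_match: string, ch: string) => ch.toUpperCase());
}

// Returns a shallow copy of the row with every key converted to camelCase.
function camelizeRow(row: Record<string, unknown>): Record<string, unknown> {
  const out: Record<string, unknown> = {};
  for (const key of Object.keys(row)) {
    out[snakeToCamel(key)] = row[key];
  }
  return out;
}

const sample = camelizeRow({ platform_dispensary_id: 'abc123', menu_url: 'https://dutchie.com/embedded-menu/x', zip: '85001' });
console.log(Object.keys(sample).join(','));
```

A converter like this only covers mechanical renames. Columns whose target name differs in substance (zip → postalCode, last_crawl_at → lastCrawledAt) would still need the explicit mapping listed in the comment above.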
/**
* Get dispensary by ID
* NOTE: Uses SQL aliases to map snake_case → camelCase directly
*/
export async function getDispensaryById(id: number): Promise<Dispensary | null> {
const { rows } = await query(
`
SELECT
id,
name,
slug,
city,
state,
zip AS "postalCode",
address,
latitude,
longitude,
menu_type AS "menuType",
menu_url AS "menuUrl",
platform_dispensary_id AS "platformDispensaryId",
website,
provider_detection_data AS "providerDetectionData",
created_at,
updated_at
FROM dispensaries
WHERE id = $1
`,
[id]
);
if (!rows[0]) return null;
return mapDbRowToDispensary(rows[0]);
}
/**
* Get dispensaries with platform IDs (ready for crawling)
*/
export async function getDispensariesWithPlatformIds(): Promise<Dispensary[]> {
const { rows } = await query(
`
SELECT ${DISPENSARY_COLUMNS} FROM dispensaries
WHERE menu_type = 'dutchie' AND platform_dispensary_id IS NOT NULL
ORDER BY name
`
);
return rows.map(mapDbRowToDispensary);
}
/**
* Re-resolve a single dispensary's platform ID
* Re-resolves from the menu_url cName; the existing ID is cleared only if resolution fails
*/
export async function reResolveDispensaryPlatformId(dispensaryId: number): Promise<{
success: boolean;
platformId: string | null;
cName: string | null;
error?: string;
}> {
console.log(`[Discovery] Re-resolving platform ID for dispensary ${dispensaryId}...`);
const dispensary = await getDispensaryById(dispensaryId);
if (!dispensary) {
return { success: false, platformId: null, cName: null, error: 'Dispensary not found' };
}
const cName = extractCNameFromMenuUrl(dispensary.menuUrl);
if (!cName) {
console.log(`[Discovery] Could not extract cName from menu_url: ${dispensary.menuUrl}`);
return {
success: false,
platformId: null,
cName: null,
error: `Could not extract cName from menu_url: ${dispensary.menuUrl}`,
};
}
console.log(`[Discovery] Extracted cName: ${cName} from menu_url: ${dispensary.menuUrl}`);
try {
const platformId = await resolveDispensaryId(cName);
if (platformId) {
await query(
`
UPDATE dispensaries
SET platform_dispensary_id = $1,
platform_dispensary_id_resolved_at = NOW(),
updated_at = NOW()
WHERE id = $2
`,
[platformId, dispensaryId]
);
console.log(`[Discovery] Resolved: ${cName} -> ${platformId}`);
return { success: true, platformId, cName };
} else {
// Clear the invalid platform ID and mark as not crawlable
await query(
`
UPDATE dispensaries
SET platform_dispensary_id = NULL,
provider_detection_data = COALESCE(provider_detection_data, '{}'::jsonb) ||
'{"resolution_error": "cName no longer exists on Dutchie", "not_crawlable": true}'::jsonb,
updated_at = NOW()
WHERE id = $1
`,
[dispensaryId]
);
console.log(`[Discovery] Could not resolve: ${cName} - marked as not crawlable`);
return {
success: false,
platformId: null,
cName,
error: `cName "${cName}" no longer exists on Dutchie`,
};
}
} catch (error: any) {
console.error(`[Discovery] Error resolving ${cName}:`, error.message);
return { success: false, platformId: null, cName, error: error.message };
}
}
/**
* Update menu_url for a dispensary and re-resolve platform ID
*/
export async function updateMenuUrlAndResolve(dispensaryId: number, newMenuUrl: string): Promise<{
success: boolean;
platformId: string | null;
cName: string | null;
error?: string;
}> {
console.log(`[Discovery] Updating menu_url for dispensary ${dispensaryId} to: ${newMenuUrl}`);
const cName = extractCNameFromMenuUrl(newMenuUrl);
if (!cName) {
return {
success: false,
platformId: null,
cName: null,
error: `Could not extract cName from new menu_url: ${newMenuUrl}`,
};
}
// Update the menu_url first
await query(
`
UPDATE dispensaries
SET menu_url = $1,
menu_type = 'dutchie',
platform_dispensary_id = NULL,
updated_at = NOW()
WHERE id = $2
`,
[newMenuUrl, dispensaryId]
);
// Now resolve the platform ID with the new cName
return await reResolveDispensaryPlatformId(dispensaryId);
}
/**
* Mark a dispensary as not crawlable (when resolution fails permanently)
*/
export async function markDispensaryNotCrawlable(dispensaryId: number, reason: string): Promise<void> {
await query(
`
UPDATE dispensaries
SET platform_dispensary_id = NULL,
provider_detection_data = COALESCE(provider_detection_data, '{}'::jsonb) ||
jsonb_build_object('not_crawlable', true, 'not_crawlable_reason', $1::text, 'not_crawlable_at', NOW()::text),
updated_at = NOW()
WHERE id = $2
`,
[reason, dispensaryId]
);
console.log(`[Discovery] Marked dispensary ${dispensaryId} as not crawlable: ${reason}`);
}
/**
* Get the cName for a dispensary (extracted from menu_url)
*/
export function getDispensaryCName(dispensary: Dispensary): string | null {
return extractCNameFromMenuUrl(dispensary.menuUrl);
}
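Every function in this file depends on extractCNameFromMenuUrl, whose body is not part of this diff. The following is a minimal sketch of what such an extractor might look like, assuming menu URLs follow the `https://dutchie.com/embedded-menu/<cName>` form used by the GraphQL client. The `dispensary` path prefix and the name `sketchExtractCName` are assumptions made for illustration.

```typescript
// Hypothetical sketch; the real extractCNameFromMenuUrl is not shown in this diff.
// Path prefixes that are assumed to be followed directly by the cName.
const CNAME_PREFIXES: string[] = ['embedded-menu', 'dispensary'];

function sketchExtractCName(menuUrl: string | null | undefined): string | null {
  if (!menuUrl) return null;
  try {
    // Split the path into non-empty segments, e.g. ['embedded-menu', 'AZ-Deeply-Rooted'].
    const segments = new URL(menuUrl).pathname.split('/').filter((s) => s.length > 0);
    for (let i = 0; i < segments.length - 1; i++) {
      if (CNAME_PREFIXES.indexOf(segments[i]) !== -1) {
        return decodeURIComponent(segments[i + 1]);
      }
    }
    return null; // no known prefix in the path
  } catch {
    return null; // not an absolute URL, or malformed percent-encoding
  }
}

console.log(sketchExtractCName('https://dutchie.com/embedded-menu/AZ-Deeply-Rooted'));
console.log(sketchExtractCName('https://example.com/'));
```

Returning null rather than throwing matches how the callers above treat a missing cName: they skip the dispensary or report an error instead of aborting the batch.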

@@ -1,491 +0,0 @@
/**
* Error Taxonomy Module
*
* Standardized error codes and classification for crawler reliability.
* All crawl results must use these codes for consistent error handling.
*
* Phase 1: Crawler Reliability & Stabilization
*/
// ============================================================
// ERROR CODES
// ============================================================
/**
* Standardized error codes for all crawl operations.
* These codes are stored in the database for analytics and debugging.
*/
export const CrawlErrorCode = {
// Success states
SUCCESS: 'SUCCESS',
// Rate limiting
RATE_LIMITED: 'RATE_LIMITED', // 429 responses
// Proxy issues
BLOCKED_PROXY: 'BLOCKED_PROXY', // 407 or proxy-related blocks
PROXY_TIMEOUT: 'PROXY_TIMEOUT', // Proxy connection timeout
// Content issues
HTML_CHANGED: 'HTML_CHANGED', // Page structure changed
NO_PRODUCTS: 'NO_PRODUCTS', // Empty response (valid but no data)
PARSE_ERROR: 'PARSE_ERROR', // Failed to parse response
// Network issues
TIMEOUT: 'TIMEOUT', // Request timeout
NETWORK_ERROR: 'NETWORK_ERROR', // Connection failed
DNS_ERROR: 'DNS_ERROR', // DNS resolution failed
// Authentication
AUTH_FAILED: 'AUTH_FAILED', // Authentication/session issues
// Server errors
SERVER_ERROR: 'SERVER_ERROR', // 5xx responses
SERVICE_UNAVAILABLE: 'SERVICE_UNAVAILABLE', // 503
// Configuration issues
INVALID_CONFIG: 'INVALID_CONFIG', // Bad store configuration
MISSING_PLATFORM_ID: 'MISSING_PLATFORM_ID', // No platform_dispensary_id
// Unknown
UNKNOWN_ERROR: 'UNKNOWN_ERROR', // Catch-all for unclassified errors
} as const;
export type CrawlErrorCodeType = typeof CrawlErrorCode[keyof typeof CrawlErrorCode];
// ============================================================
// ERROR CLASSIFICATION
// ============================================================
/**
* Error metadata for each error code
*/
interface ErrorMetadata {
code: CrawlErrorCodeType;
retryable: boolean;
rotateProxy: boolean;
rotateUserAgent: boolean;
backoffMultiplier: number;
severity: 'low' | 'medium' | 'high' | 'critical';
description: string;
}
/**
* Metadata for each error code - defines retry behavior
*/
export const ERROR_METADATA: Record<CrawlErrorCodeType, ErrorMetadata> = {
[CrawlErrorCode.SUCCESS]: {
code: CrawlErrorCode.SUCCESS,
retryable: false,
rotateProxy: false,
rotateUserAgent: false,
backoffMultiplier: 0,
severity: 'low',
description: 'Crawl completed successfully',
},
[CrawlErrorCode.RATE_LIMITED]: {
code: CrawlErrorCode.RATE_LIMITED,
retryable: true,
rotateProxy: true,
rotateUserAgent: true,
backoffMultiplier: 2.0,
severity: 'medium',
description: 'Rate limited by target (429)',
},
[CrawlErrorCode.BLOCKED_PROXY]: {
code: CrawlErrorCode.BLOCKED_PROXY,
retryable: true,
rotateProxy: true,
rotateUserAgent: true,
backoffMultiplier: 1.5,
severity: 'medium',
description: 'Proxy blocked or rejected (407)',
},
[CrawlErrorCode.PROXY_TIMEOUT]: {
code: CrawlErrorCode.PROXY_TIMEOUT,
retryable: true,
rotateProxy: true,
rotateUserAgent: false,
backoffMultiplier: 1.0,
severity: 'low',
description: 'Proxy connection timed out',
},
[CrawlErrorCode.HTML_CHANGED]: {
code: CrawlErrorCode.HTML_CHANGED,
retryable: false,
rotateProxy: false,
rotateUserAgent: false,
backoffMultiplier: 1.0,
severity: 'high',
description: 'Page structure changed - needs selector update',
},
[CrawlErrorCode.NO_PRODUCTS]: {
code: CrawlErrorCode.NO_PRODUCTS,
retryable: true,
rotateProxy: false,
rotateUserAgent: false,
backoffMultiplier: 1.0,
severity: 'low',
description: 'No products returned (may be temporary)',
},
[CrawlErrorCode.PARSE_ERROR]: {
code: CrawlErrorCode.PARSE_ERROR,
retryable: true,
rotateProxy: false,
rotateUserAgent: false,
backoffMultiplier: 1.0,
severity: 'medium',
description: 'Failed to parse response data',
},
[CrawlErrorCode.TIMEOUT]: {
code: CrawlErrorCode.TIMEOUT,
retryable: true,
rotateProxy: true,
rotateUserAgent: false,
backoffMultiplier: 1.5,
severity: 'medium',
description: 'Request timed out',
},
[CrawlErrorCode.NETWORK_ERROR]: {
code: CrawlErrorCode.NETWORK_ERROR,
retryable: true,
rotateProxy: true,
rotateUserAgent: false,
backoffMultiplier: 1.0,
severity: 'medium',
description: 'Network connection failed',
},
[CrawlErrorCode.DNS_ERROR]: {
code: CrawlErrorCode.DNS_ERROR,
retryable: true,
rotateProxy: true,
rotateUserAgent: false,
backoffMultiplier: 1.0,
severity: 'medium',
description: 'DNS resolution failed',
},
[CrawlErrorCode.AUTH_FAILED]: {
code: CrawlErrorCode.AUTH_FAILED,
retryable: true,
rotateProxy: false,
rotateUserAgent: true,
backoffMultiplier: 2.0,
severity: 'high',
description: 'Authentication or session failed',
},
[CrawlErrorCode.SERVER_ERROR]: {
code: CrawlErrorCode.SERVER_ERROR,
retryable: true,
rotateProxy: false,
rotateUserAgent: false,
backoffMultiplier: 1.5,
severity: 'medium',
description: 'Server error (5xx)',
},
[CrawlErrorCode.SERVICE_UNAVAILABLE]: {
code: CrawlErrorCode.SERVICE_UNAVAILABLE,
retryable: true,
rotateProxy: false,
rotateUserAgent: false,
backoffMultiplier: 2.0,
severity: 'high',
description: 'Service temporarily unavailable (503)',
},
[CrawlErrorCode.INVALID_CONFIG]: {
code: CrawlErrorCode.INVALID_CONFIG,
retryable: false,
rotateProxy: false,
rotateUserAgent: false,
backoffMultiplier: 0,
severity: 'critical',
description: 'Invalid store configuration',
},
[CrawlErrorCode.MISSING_PLATFORM_ID]: {
code: CrawlErrorCode.MISSING_PLATFORM_ID,
retryable: false,
rotateProxy: false,
rotateUserAgent: false,
backoffMultiplier: 0,
severity: 'critical',
description: 'Missing platform_dispensary_id',
},
[CrawlErrorCode.UNKNOWN_ERROR]: {
code: CrawlErrorCode.UNKNOWN_ERROR,
retryable: true,
rotateProxy: false,
rotateUserAgent: false,
backoffMultiplier: 1.0,
severity: 'high',
description: 'Unknown/unclassified error',
},
};
// ============================================================
// ERROR CLASSIFICATION FUNCTIONS
// ============================================================
/**
* Classify an error into a standardized error code.
*
* @param error - The error to classify (Error object, string, or null)
* @param httpStatus - Optional HTTP status code
* @returns Standardized error code
*/
export function classifyError(
error: Error | string | null,
httpStatus?: number
): CrawlErrorCodeType {
// Check HTTP status first
if (httpStatus) {
if (httpStatus === 429) return CrawlErrorCode.RATE_LIMITED;
if (httpStatus === 407) return CrawlErrorCode.BLOCKED_PROXY;
if (httpStatus === 401 || httpStatus === 403) return CrawlErrorCode.AUTH_FAILED;
if (httpStatus === 503) return CrawlErrorCode.SERVICE_UNAVAILABLE;
if (httpStatus >= 500) return CrawlErrorCode.SERVER_ERROR;
}
if (!error) return CrawlErrorCode.UNKNOWN_ERROR;
const message = typeof error === 'string' ? error.toLowerCase() : error.message.toLowerCase();
// Rate limiting patterns
if (message.includes('rate limit') || message.includes('too many requests') || message.includes('429')) {
return CrawlErrorCode.RATE_LIMITED;
}
// Proxy patterns
if (message.includes('proxy') && (message.includes('block') || message.includes('reject') || message.includes('407'))) {
return CrawlErrorCode.BLOCKED_PROXY;
}
// Timeout patterns
if (message.includes('timeout') || message.includes('timed out') || message.includes('etimedout')) {
if (message.includes('proxy')) {
return CrawlErrorCode.PROXY_TIMEOUT;
}
return CrawlErrorCode.TIMEOUT;
}
// Network patterns
if (message.includes('econnrefused') || message.includes('econnreset') || message.includes('network')) {
return CrawlErrorCode.NETWORK_ERROR;
}
// DNS patterns
if (message.includes('enotfound') || message.includes('dns') || message.includes('getaddrinfo')) {
return CrawlErrorCode.DNS_ERROR;
}
// Auth patterns
if (message.includes('auth') || message.includes('unauthorized') || message.includes('forbidden') || message.includes('401') || message.includes('403')) {
return CrawlErrorCode.AUTH_FAILED;
}
// HTML change patterns
if (message.includes('selector') || message.includes('element not found') || message.includes('structure changed')) {
return CrawlErrorCode.HTML_CHANGED;
}
// Parse patterns
if (message.includes('parse') || message.includes('json') || message.includes('syntax')) {
return CrawlErrorCode.PARSE_ERROR;
}
// No products patterns
if (message.includes('no products') || message.includes('empty') || message.includes('0 products')) {
return CrawlErrorCode.NO_PRODUCTS;
}
// Server error patterns
if (message.includes('500') || message.includes('502') || message.includes('503') || message.includes('504')) {
return CrawlErrorCode.SERVER_ERROR;
}
// Config patterns
if (message.includes('config') || message.includes('invalid') || message.includes('missing')) {
if (message.includes('platform') || message.includes('dispensary_id')) {
return CrawlErrorCode.MISSING_PLATFORM_ID;
}
return CrawlErrorCode.INVALID_CONFIG;
}
return CrawlErrorCode.UNKNOWN_ERROR;
}
/**
* Get metadata for an error code
*/
export function getErrorMetadata(code: CrawlErrorCodeType): ErrorMetadata {
return ERROR_METADATA[code] || ERROR_METADATA[CrawlErrorCode.UNKNOWN_ERROR];
}
/**
* Check if an error is retryable
*/
export function isRetryable(code: CrawlErrorCodeType): boolean {
return getErrorMetadata(code).retryable;
}
/**
* Check if proxy should be rotated for this error
*/
export function shouldRotateProxy(code: CrawlErrorCodeType): boolean {
return getErrorMetadata(code).rotateProxy;
}
/**
* Check if user agent should be rotated for this error
*/
export function shouldRotateUserAgent(code: CrawlErrorCodeType): boolean {
return getErrorMetadata(code).rotateUserAgent;
}
/**
* Get backoff multiplier for this error
*/
export function getBackoffMultiplier(code: CrawlErrorCodeType): number {
return getErrorMetadata(code).backoffMultiplier;
}
// ============================================================
// CRAWL RESULT TYPE
// ============================================================
/**
* Standardized crawl result with error taxonomy
*/
export interface CrawlResult {
success: boolean;
dispensaryId: number;
// Error info
errorCode: CrawlErrorCodeType;
errorMessage?: string;
httpStatus?: number;
// Timing
startedAt: Date;
finishedAt: Date;
durationMs: number;
// Context
attemptNumber: number;
proxyUsed?: string;
userAgentUsed?: string;
// Metrics (on success)
productsFound?: number;
productsUpserted?: number;
snapshotsCreated?: number;
imagesDownloaded?: number;
// Metadata
metadata?: Record<string, any>;
}
/**
* Create a success result
*/
export function createSuccessResult(
dispensaryId: number,
startedAt: Date,
metrics: {
productsFound: number;
productsUpserted: number;
snapshotsCreated: number;
imagesDownloaded?: number;
},
context?: {
attemptNumber?: number;
proxyUsed?: string;
userAgentUsed?: string;
}
): CrawlResult {
const finishedAt = new Date();
return {
success: true,
dispensaryId,
errorCode: CrawlErrorCode.SUCCESS,
startedAt,
finishedAt,
durationMs: finishedAt.getTime() - startedAt.getTime(),
attemptNumber: context?.attemptNumber || 1,
proxyUsed: context?.proxyUsed,
userAgentUsed: context?.userAgentUsed,
...metrics,
};
}
/**
* Create a failure result
*/
export function createFailureResult(
dispensaryId: number,
startedAt: Date,
error: Error | string,
httpStatus?: number,
context?: {
attemptNumber?: number;
proxyUsed?: string;
userAgentUsed?: string;
}
): CrawlResult {
const finishedAt = new Date();
const errorCode = classifyError(error, httpStatus);
const errorMessage = typeof error === 'string' ? error : error.message;
return {
success: false,
dispensaryId,
errorCode,
errorMessage,
httpStatus,
startedAt,
finishedAt,
durationMs: finishedAt.getTime() - startedAt.getTime(),
attemptNumber: context?.attemptNumber || 1,
proxyUsed: context?.proxyUsed,
userAgentUsed: context?.userAgentUsed,
};
}
// ============================================================
// LOGGING HELPERS
// ============================================================
/**
* Format error code for logging
*/
export function formatErrorForLog(result: CrawlResult): string {
const metadata = getErrorMetadata(result.errorCode);
const retryInfo = metadata.retryable ? '(retryable)' : '(non-retryable)';
const proxyInfo = result.proxyUsed ? ` via ${result.proxyUsed}` : '';
if (result.success) {
return `[${result.errorCode}] Crawl successful: ${result.productsFound} products${proxyInfo}`;
}
return `[${result.errorCode}] ${result.errorMessage}${proxyInfo} ${retryInfo}`;
}
/**
* Get user-friendly error description
*/
export function getErrorDescription(code: CrawlErrorCodeType): string {
return getErrorMetadata(code).description;
}
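The taxonomy above defines `retryable`, `rotateProxy`, and `backoffMultiplier` per error code but does not show how a caller turns them into a retry decision. A minimal sketch of one possible consumer follows. The exponential formula, the base delay, the cap, the attempt limit, and the names `planRetry`, `RetryPolicy`, and `RetryDecision` are all assumptions, not the crawler's actual policy.

```typescript
// Sketch only: field names mirror ErrorMetadata; everything else is assumed.
interface RetryPolicy {
  retryable: boolean;
  rotateProxy: boolean;
  backoffMultiplier: number;
}

interface RetryDecision {
  retry: boolean;
  rotateProxy: boolean;
  delayMs: number;
}

function planRetry(
  policy: RetryPolicy,
  attemptNumber: number, // 1-based, like CrawlResult.attemptNumber
  maxAttempts: number = 3,
  baseDelayMs: number = 1000,
  maxDelayMs: number = 60000
): RetryDecision {
  if (!policy.retryable || attemptNumber >= maxAttempts) {
    return { retry: false, rotateProxy: false, delayMs: 0 };
  }
  // Assumed formula: base * multiplier^(attempt - 1), capped at maxDelayMs.
  const raw = baseDelayMs * Math.pow(policy.backoffMultiplier, attemptNumber - 1);
  return {
    retry: true,
    rotateProxy: policy.rotateProxy,
    delayMs: Math.min(Math.round(raw), maxDelayMs),
  };
}

// Values copied from ERROR_METADATA for RATE_LIMITED and INVALID_CONFIG.
const rateLimited: RetryPolicy = { retryable: true, rotateProxy: true, backoffMultiplier: 2.0 };
const invalidConfig: RetryPolicy = { retryable: false, rotateProxy: false, backoffMultiplier: 0 };

console.log(planRetry(rateLimited, 1));
console.log(planRetry(rateLimited, 2));
console.log(planRetry(invalidConfig, 1));
```

Under this formula a multiplier of 1.0 gives a constant delay, which is one reading of why codes such as PROXY_TIMEOUT and NETWORK_ERROR use 1.0 while RATE_LIMITED uses 2.0.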

@@ -1,712 +0,0 @@
/**
* Dutchie GraphQL Client
*
* Uses Puppeteer to establish a session (get CF cookies), then makes
* SERVER-SIDE fetch calls to api-gw.dutchie.com with those cookies.
*
* DUTCHIE FETCH RULES:
* 1. Server-side only - use axios (never browser fetch with CORS)
* 2. For ID resolution use dispensaryFilter.cNameOrID; for FilteredProducts use productsFilter.dispensaryId (see buildFilterVariables)
* 3. Headers must mimic Chrome: User-Agent, Origin, Referer
* 4. If 403, extract CF cookies from Puppeteer session and include them
* 5. Log status codes, error bodies, and product counts
*/
import axios, { AxiosError } from 'axios';
import puppeteer from 'puppeteer-extra';
import type { Browser, Page, Protocol } from 'puppeteer';
import StealthPlugin from 'puppeteer-extra-plugin-stealth';
import {
DutchieRawProduct,
DutchiePOSChild,
CrawlMode,
} from '../types';
import { dutchieConfig, GRAPHQL_HASHES, ARIZONA_CENTERPOINTS } from '../config/dutchie';
puppeteer.use(StealthPlugin());
// Re-export for backward compatibility
export { GRAPHQL_HASHES, ARIZONA_CENTERPOINTS };
// ============================================================
// SESSION MANAGEMENT - Get CF cookies via Puppeteer
// ============================================================
interface SessionCredentials {
cookies: string; // Cookie header string
userAgent: string;
browser: Browser;
page: Page; // Keep page reference for extracting dispensaryId
dispensaryId?: string; // Extracted from window.reactEnv if available
httpStatus?: number; // HTTP status code from navigation
}
/**
* Create a session by navigating to the embedded menu page
* and extracting CF clearance cookies for server-side requests.
* Also extracts dispensaryId from window.reactEnv if available.
*/
async function createSession(cName: string): Promise<SessionCredentials> {
const browser = await puppeteer.launch({
headless: 'new',
args: dutchieConfig.browserArgs,
});
const page = await browser.newPage();
const userAgent = dutchieConfig.userAgent;
await page.setUserAgent(userAgent);
await page.setViewport({ width: 1920, height: 1080 });
await page.evaluateOnNewDocument(() => {
Object.defineProperty(navigator, 'webdriver', { get: () => false });
(window as any).chrome = { runtime: {} };
});
// Navigate to the embedded menu page for this dispensary
const embeddedMenuUrl = `https://dutchie.com/embedded-menu/${cName}`;
console.log(`[GraphQL Client] Loading ${embeddedMenuUrl} to get CF cookies...`);
let httpStatus: number | undefined;
let dispensaryId: string | undefined;
try {
const response = await page.goto(embeddedMenuUrl, {
waitUntil: 'networkidle2',
timeout: dutchieConfig.navigationTimeout,
});
httpStatus = response?.status();
await new Promise((r) => setTimeout(r, dutchieConfig.pageLoadDelay));
// Try to extract dispensaryId from window.reactEnv
try {
dispensaryId = await page.evaluate(() => {
return (window as any).reactEnv?.dispensaryId || null;
});
if (dispensaryId) {
console.log(`[GraphQL Client] Extracted dispensaryId from reactEnv: ${dispensaryId}`);
}
} catch (evalError: any) {
console.log(`[GraphQL Client] Could not extract dispensaryId from reactEnv: ${evalError.message}`);
}
} catch (error: any) {
console.warn(`[GraphQL Client] Navigation warning: ${error.message}`);
// Continue anyway - we may have gotten cookies
}
// Extract cookies
const cookies = await page.cookies();
const cookieString = cookies.map((c: Protocol.Network.Cookie) => `${c.name}=${c.value}`).join('; ');
console.log(`[GraphQL Client] Got ${cookies.length} cookies, HTTP status: ${httpStatus}`);
if (cookies.length > 0) {
console.log(`[GraphQL Client] Cookie names: ${cookies.map(c => c.name).join(', ')}`);
}
return { cookies: cookieString, userAgent, browser, page, dispensaryId, httpStatus };
}
/**
* Close session (browser)
*/
async function closeSession(session: SessionCredentials): Promise<void> {
await session.browser.close();
}
// ============================================================
// SERVER-SIDE GRAPHQL FETCH USING AXIOS
// ============================================================
/**
* Build headers that mimic a real browser request
*/
function buildHeaders(session: SessionCredentials, cName: string): Record<string, string> {
const embeddedMenuUrl = `https://dutchie.com/embedded-menu/${cName}`;
return {
'accept': 'application/json, text/plain, */*',
'accept-language': 'en-US,en;q=0.9',
'accept-encoding': 'gzip, deflate, br',
'content-type': 'application/json',
'origin': 'https://dutchie.com',
'referer': embeddedMenuUrl,
'user-agent': session.userAgent,
'apollographql-client-name': 'Marketplace (production)',
'sec-ch-ua': '"Chromium";v="120", "Google Chrome";v="120", "Not-A.Brand";v="99"',
'sec-ch-ua-mobile': '?0',
'sec-ch-ua-platform': '"Windows"',
'sec-fetch-dest': 'empty',
'sec-fetch-mode': 'cors',
'sec-fetch-site': 'same-site',
...(session.cookies ? { 'cookie': session.cookies } : {}),
};
}
/**
* Execute GraphQL query server-side using axios
* Uses cookies from the browser session to bypass CF
*/
async function executeGraphQL(
session: SessionCredentials,
operationName: string,
variables: any,
hash: string,
cName: string
): Promise<any> {
const endpoint = dutchieConfig.graphqlEndpoint;
const headers = buildHeaders(session, cName);
// Build request body for POST
const body = {
operationName,
variables,
extensions: {
persistedQuery: { version: 1, sha256Hash: hash },
},
};
console.log(`[GraphQL Client] POST: ${operationName} -> ${endpoint}`);
console.log(`[GraphQL Client] Variables: ${JSON.stringify(variables).slice(0, 300)}...`);
try {
const response = await axios.post(endpoint, body, {
headers,
timeout: 30000,
validateStatus: () => true, // Don't throw on non-2xx
});
// Log response details
console.log(`[GraphQL Client] Response status: ${response.status}`);
if (response.status !== 200) {
const bodyPreview = typeof response.data === 'string'
? response.data.slice(0, 500)
: JSON.stringify(response.data).slice(0, 500);
console.error(`[GraphQL Client] HTTP ${response.status}: ${bodyPreview}`);
throw new Error(`HTTP ${response.status}`);
}
// Check for GraphQL errors
if (response.data?.errors && response.data.errors.length > 0) {
console.error(`[GraphQL Client] GraphQL errors: ${JSON.stringify(response.data.errors[0])}`);
}
return response.data;
} catch (error: any) {
if (axios.isAxiosError(error)) {
const axiosError = error as AxiosError;
console.error(`[GraphQL Client] Axios error: ${axiosError.message}`);
if (axiosError.response) {
console.error(`[GraphQL Client] Response status: ${axiosError.response.status}`);
console.error(`[GraphQL Client] Response data: ${JSON.stringify(axiosError.response.data).slice(0, 500)}`);
}
if (axiosError.code) {
console.error(`[GraphQL Client] Error code: ${axiosError.code}`);
}
} else {
console.error(`[GraphQL Client] Error: ${error.message}`);
}
throw error;
}
}
// ============================================================
// DISPENSARY ID RESOLUTION
// ============================================================
/**
* Resolution result with HTTP status for error handling
*/
export interface ResolveDispensaryResult {
dispensaryId: string | null;
httpStatus?: number;
error?: string;
source?: 'reactEnv' | 'graphql';
}
/**
* Resolve a dispensary slug to its internal platform ID.
*
* STRATEGY:
* 1. Navigate to embedded menu page and extract window.reactEnv.dispensaryId (preferred)
* 2. Fall back to GraphQL GetAddressBasedDispensaryData query if reactEnv fails
*
* Returns the dispensaryId (platform_dispensary_id) or null if not found.
* Does not throw on 403/404: those also yield null. Use resolveDispensaryIdWithDetails
* when the caller needs the HTTP status to mark a store as not_crawlable.
*/
export async function resolveDispensaryId(slug: string): Promise<string | null> {
const result = await resolveDispensaryIdWithDetails(slug);
return result.dispensaryId;
}
/**
* Resolve a dispensary slug with full details (HTTP status, source, error).
* Use this when you need to know WHY resolution failed.
*/
export async function resolveDispensaryIdWithDetails(slug: string): Promise<ResolveDispensaryResult> {
console.log(`[GraphQL Client] Resolving dispensary ID for slug: ${slug}`);
const session = await createSession(slug);
try {
// Check HTTP status first - if 403/404, the store is not crawlable
if (session.httpStatus && (session.httpStatus === 403 || session.httpStatus === 404)) {
console.log(`[GraphQL Client] Page returned HTTP ${session.httpStatus} for ${slug} - not crawlable`);
return {
dispensaryId: null,
httpStatus: session.httpStatus,
error: `HTTP ${session.httpStatus}: Store removed or not accessible`,
source: 'reactEnv',
};
}
// PREFERRED: Use dispensaryId from window.reactEnv (extracted during createSession)
if (session.dispensaryId) {
console.log(`[GraphQL Client] Resolved ${slug} -> ${session.dispensaryId} (from reactEnv)`);
return {
dispensaryId: session.dispensaryId,
httpStatus: session.httpStatus,
source: 'reactEnv',
};
}
// FALLBACK: Try GraphQL query
console.log(`[GraphQL Client] reactEnv.dispensaryId not found for ${slug}, trying GraphQL...`);
const variables = {
dispensaryFilter: {
cNameOrID: slug,
},
};
const result = await executeGraphQL(
session,
'GetAddressBasedDispensaryData',
variables,
GRAPHQL_HASHES.GetAddressBasedDispensaryData,
slug
);
const dispensaryId = result?.data?.dispensaryBySlug?.id ||
result?.data?.dispensary?.id ||
result?.data?.getAddressBasedDispensaryData?.dispensary?.id;
if (dispensaryId) {
console.log(`[GraphQL Client] Resolved ${slug} -> ${dispensaryId} (from GraphQL)`);
return {
dispensaryId,
httpStatus: session.httpStatus,
source: 'graphql',
};
}
console.log(`[GraphQL Client] Could not resolve ${slug}, GraphQL response:`, JSON.stringify(result).slice(0, 300));
return {
dispensaryId: null,
httpStatus: session.httpStatus,
error: 'Could not extract dispensaryId from reactEnv or GraphQL',
};
} finally {
await closeSession(session);
}
}
/**
* Discover Arizona dispensaries via geo-based query
*/
export async function discoverArizonaDispensaries(): Promise<any[]> {
console.log('[GraphQL Client] Discovering Arizona dispensaries...');
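// Bootstrap the session from a known AZ store page to obtain CF cookies;
// the geo scan itself iterates over ARIZONA_CENTERPOINTS below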
const session = await createSession('AZ-Deeply-Rooted');
const allDispensaries: any[] = [];
const seenIds = new Set<string>();
try {
for (const centerpoint of ARIZONA_CENTERPOINTS) {
console.log(`[GraphQL Client] Scanning ${centerpoint.name}...`);
const variables = {
dispensariesFilter: {
latitude: centerpoint.lat,
longitude: centerpoint.lng,
distance: 100,
state: 'AZ',
},
};
try {
const result = await executeGraphQL(
session,
'ConsumerDispensaries',
variables,
GRAPHQL_HASHES.ConsumerDispensaries,
'AZ-Deeply-Rooted'
);
const dispensaries = result?.data?.consumerDispensaries || [];
for (const d of dispensaries) {
const id = d.id || d.dispensaryId;
if (id && !seenIds.has(id)) {
seenIds.add(id);
allDispensaries.push(d);
}
}
console.log(`[GraphQL Client] Found ${dispensaries.length} in ${centerpoint.name} (${allDispensaries.length} total unique)`);
} catch (error: any) {
console.warn(`[GraphQL Client] Error scanning ${centerpoint.name}: ${error.message}`);
}
// Delay between requests
await new Promise((r) => setTimeout(r, 1000));
}
} finally {
await closeSession(session);
}
console.log(`[GraphQL Client] Discovery complete: ${allDispensaries.length} dispensaries`);
return allDispensaries;
}
// ============================================================
// PRODUCT FILTERING VARIABLES
// ============================================================
/**
* Build filter variables for FilteredProducts query
*
* CRITICAL: Uses dispensaryId directly (the MongoDB ObjectId, e.g. "6405ef617056e8014d79101b")
* NOT dispensaryFilter.cNameOrID!
*
* The actual browser request structure is:
* {
* "productsFilter": {
* "dispensaryId": "6405ef617056e8014d79101b",
* "pricingType": "rec",
* "Status": "Active", // Mode A only
* "strainTypes": [],
* "subcategories": [],
* "types": [],
* "useCache": true,
* ...
* },
* "page": 0,
* "perPage": 100
* }
*
* Mode A = UI parity (Status: "Active")
* Mode B = MAX COVERAGE (no Status filter)
*/
function buildFilterVariables(
platformDispensaryId: string,
pricingType: 'rec' | 'med',
crawlMode: CrawlMode,
page: number,
perPage: number
): any {
const isModeA = crawlMode === 'mode_a';
// Per CLAUDE.md Rule #11: Use simple productsFilter with dispensaryId directly
// Do NOT use dispensaryFilter.cNameOrID - that's outdated
const productsFilter: Record<string, any> = {
dispensaryId: platformDispensaryId,
pricingType: pricingType,
};
// Mode A: Only active products (UI parity) - Status: "Active"
// Mode B: MAX COVERAGE (OOS/inactive) - omit Status or set to null
if (isModeA) {
productsFilter.Status = 'Active';
}
// Mode B: No Status filter = returns all products including OOS/inactive
return {
productsFilter,
page,
perPage,
};
}
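As a rough illustration of the two modes, the sketch below rebuilds the same variable shape in isolation. `buildVariablesSketch`, `SketchCrawlMode` and `SketchVariables` are stand-ins introduced here (the real `CrawlMode` type is defined elsewhere), and the dispensary ID is the example ObjectId from the comment above.

```typescript
// Hypothetical stand-in for the CrawlMode type used by the real code.
type SketchCrawlMode = 'mode_a' | 'mode_b';

interface SketchVariables {
  productsFilter: Record<string, unknown>;
  page: number;
  perPage: number;
}

// Mirrors the logic of buildFilterVariables: Status is set only in Mode A.
function buildVariablesSketch(
  dispensaryId: string,
  pricingType: 'rec' | 'med',
  mode: SketchCrawlMode,
  page: number,
  perPage: number
): SketchVariables {
  const productsFilter: Record<string, unknown> = { dispensaryId, pricingType };
  if (mode === 'mode_a') {
    productsFilter['Status'] = 'Active'; // UI parity: active products only
  }
  // Mode B leaves Status out entirely, so OOS/inactive products are returned too.
  return { productsFilter, page, perPage };
}

const modeAVars = buildVariablesSketch('6405ef617056e8014d79101b', 'rec', 'mode_a', 0, 100);
const modeBVars = buildVariablesSketch('6405ef617056e8014d79101b', 'rec', 'mode_b', 0, 100);
```

The only difference between the two payloads is the presence of the `Status` key; everything else, including paging, is identical.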
// ============================================================
// PRODUCT FETCHING WITH PAGINATION
// ============================================================
/**
* Fetch products for a single mode with pagination
*/
async function fetchProductsForMode(
session: SessionCredentials,
platformDispensaryId: string,
cName: string,
pricingType: 'rec' | 'med',
crawlMode: CrawlMode
): Promise<{ products: DutchieRawProduct[]; totalCount: number; crawlMode: CrawlMode }> {
const perPage = dutchieConfig.perPage;
const maxPages = dutchieConfig.maxPages;
const maxRetries = dutchieConfig.maxRetries;
const pageDelayMs = dutchieConfig.pageDelayMs;
const allProducts: DutchieRawProduct[] = [];
let pageNum = 0;
let totalCount = 0;
let consecutiveEmptyPages = 0;
console.log(`[GraphQL Client] Fetching products for ${cName} (platformId: ${platformDispensaryId}, ${pricingType}, ${crawlMode})...`);
while (pageNum < maxPages) {
const variables = buildFilterVariables(platformDispensaryId, pricingType, crawlMode, pageNum, perPage);
let result: any = null;
let lastError: Error | null = null;
// Retry logic
for (let attempt = 0; attempt <= maxRetries; attempt++) {
try {
result = await executeGraphQL(
session,
'FilteredProducts',
variables,
GRAPHQL_HASHES.FilteredProducts,
cName
);
lastError = null;
break;
} catch (error: any) {
lastError = error;
console.warn(`[GraphQL Client] Page ${pageNum} attempt ${attempt + 1} failed: ${error.message}`);
if (attempt < maxRetries) {
await new Promise((r) => setTimeout(r, 1000 * (attempt + 1)));
}
}
}
if (lastError) {
console.error(`[GraphQL Client] Page ${pageNum} failed after ${maxRetries + 1} attempts`);
break;
}
if (result?.errors) {
console.error('[GraphQL Client] GraphQL errors:', JSON.stringify(result.errors));
break;
}
// Log response shape on first page
if (pageNum === 0) {
console.log(`[GraphQL Client] Response keys: ${Object.keys(result || {}).join(', ')}`);
if (result?.data) {
console.log(`[GraphQL Client] data keys: ${Object.keys(result.data || {}).join(', ')}`);
}
if (!result?.data?.filteredProducts) {
console.log(`[GraphQL Client] WARNING: No filteredProducts in response!`);
console.log(`[GraphQL Client] Full response: ${JSON.stringify(result).slice(0, 1000)}`);
}
}
const products = result?.data?.filteredProducts?.products || [];
const queryInfo = result?.data?.filteredProducts?.queryInfo;
if (queryInfo?.totalCount) {
totalCount = queryInfo.totalCount;
}
console.log(
`[GraphQL Client] Page ${pageNum}: ${products.length} products (total so far: ${allProducts.length + products.length}/${totalCount})`
);
if (products.length === 0) {
consecutiveEmptyPages++;
if (consecutiveEmptyPages >= 2) {
console.log('[GraphQL Client] Multiple empty pages, stopping pagination');
break;
}
} else {
consecutiveEmptyPages = 0;
allProducts.push(...products);
}
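// Stop if incomplete page (last page).
// Note: an empty page is also an incomplete page whenever perPage > 0, so the loop
// stops on the first empty page and the consecutive-empty check above never reaches 2.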
if (products.length < perPage) {
console.log(`[GraphQL Client] Incomplete page (${products.length} < ${perPage}), stopping`);
break;
}
pageNum++;
await new Promise((r) => setTimeout(r, pageDelayMs));
}
console.log(`[GraphQL Client] Fetched ${allProducts.length} total products (${crawlMode})`);
return { products: allProducts, totalCount: totalCount || allProducts.length, crawlMode };
}
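The stop conditions of the pagination loop can be sketched without any network access. This is a simplified, synchronous stand-in: `collectPages` and `fakeCatalog` are hypothetical names, and retries, delays and logging are left out on purpose.

```typescript
// Collect pages until a short (incomplete) page is seen or maxPages is reached.
function collectPages<T>(
  fetchPage: (page: number) => T[],
  perPage: number,
  maxPages: number
): T[] {
  const all: T[] = [];
  for (let page = 0; page < maxPages; page++) {
    const items = fetchPage(page);
    all.push(...items);
    // A page shorter than perPage (including an empty one) is treated as the last page.
    if (items.length < perPage) break;
  }
  return all;
}

// In-memory catalog of 250 items standing in for the remote product list.
const fakeCatalog = Array.from({ length: 250 }, (_, i) => `product-${i}`);
const pageOf = (page: number): string[] => fakeCatalog.slice(page * 100, (page + 1) * 100);

const fetchedAll = collectPages(pageOf, 100, 10);   // stops on the 50-item third page
const fetchedCapped = collectPages(pageOf, 100, 2); // stops at the maxPages cap
```

With a page size of 100, the first call reads three pages and the second is cut off after two, which is why a `maxPages` cap that is too low can silently truncate a large menu.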
// ============================================================
// LEGACY SINGLE-MODE INTERFACE
// ============================================================
/**
* Fetch all products for a dispensary (single mode)
*/
export async function fetchAllProducts(
platformDispensaryId: string,
pricingType: 'rec' | 'med' = 'rec',
options: {
perPage?: number;
maxPages?: number;
menuUrl?: string;
crawlMode?: CrawlMode;
cName?: string;
} = {}
): Promise<{ products: DutchieRawProduct[]; totalCount: number; crawlMode: CrawlMode }> {
const { crawlMode = 'mode_a' } = options;
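// cName is REQUIRED: there is no default fallback, to avoid using the wrong store's session.
// Note: options.perPage, options.maxPages and options.menuUrl are accepted but not used here;
// fetchProductsForMode reads page size and limits from dutchieConfig.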
const cName = options.cName;
if (!cName) {
throw new Error('[GraphQL Client] cName is required for fetchAllProducts - cannot use another store\'s session');
}
const session = await createSession(cName);
try {
return await fetchProductsForMode(session, platformDispensaryId, cName, pricingType, crawlMode);
} finally {
await closeSession(session);
}
}
// ============================================================
// MODE A+B MERGING
// ============================================================
/**
* Merge POSMetaData.children arrays from Mode A and Mode B products
*/
function mergeProductOptions(
modeAProduct: DutchieRawProduct,
modeBProduct: DutchieRawProduct
): DutchiePOSChild[] {
const modeAChildren = modeAProduct.POSMetaData?.children || [];
const modeBChildren = modeBProduct.POSMetaData?.children || [];
const getOptionKey = (child: DutchiePOSChild): string => {
return child.canonicalID || child.canonicalSKU || child.canonicalPackageId || child.option || '';
};
const mergedMap = new Map<string, DutchiePOSChild>();
for (const child of modeAChildren) {
const key = getOptionKey(child);
if (key) mergedMap.set(key, child);
}
for (const child of modeBChildren) {
const key = getOptionKey(child);
if (key && !mergedMap.has(key)) {
mergedMap.set(key, child);
}
}
return Array.from(mergedMap.values());
}
/**
* Merge a Mode A product with a Mode B product
*/
function mergeProducts(
modeAProduct: DutchieRawProduct,
modeBProduct: DutchieRawProduct | undefined
): DutchieRawProduct {
if (!modeBProduct) {
return modeAProduct;
}
const mergedChildren = mergeProductOptions(modeAProduct, modeBProduct);
return {
...modeAProduct,
POSMetaData: {
...modeAProduct.POSMetaData,
children: mergedChildren,
},
};
}
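A small worked example may make the option-merging rule easier to follow: Mode A entries win on key collisions, and children with no usable key are dropped. `SketchChild`, `optionKey` and `mergeChildrenSketch` are illustrative names only; the field names are those used by `getOptionKey` above.

```typescript
// Minimal stand-in for DutchiePOSChild with only the fields used as keys.
interface SketchChild {
  canonicalID?: string;
  canonicalSKU?: string;
  canonicalPackageId?: string;
  option?: string;
}

// First non-empty identifier wins, in the same order as getOptionKey.
function optionKey(child: SketchChild): string {
  return child.canonicalID || child.canonicalSKU || child.canonicalPackageId || child.option || '';
}

function mergeChildrenSketch(modeA: SketchChild[], modeB: SketchChild[]): SketchChild[] {
  const merged = new Map<string, SketchChild>();
  for (const child of modeA) {
    const key = optionKey(child);
    if (key) merged.set(key, child);
  }
  for (const child of modeB) {
    const key = optionKey(child);
    // Mode B only fills gaps; it never overwrites a Mode A entry.
    if (key && !merged.has(key)) merged.set(key, child);
  }
  return Array.from(merged.values());
}

const mergedChildrenSketch = mergeChildrenSketch(
  [{ canonicalID: 'a1', option: '1g' }, { option: '3.5g' }],
  [{ canonicalID: 'a1', option: '1g (stale)' }, { canonicalSKU: 'sku-7' }, {}]
);
```

Here the result holds three children: `a1` from Mode A, the `3.5g` option keyed by its label, and `sku-7` from Mode B. The empty object has no key and is discarded.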
// ============================================================
// MAIN EXPORT: TWO-MODE CRAWL
// ============================================================
/**
* Fetch products using BOTH crawl modes with SINGLE session
* Runs Mode A then Mode B, merges results
*/
export async function fetchAllProductsBothModes(
platformDispensaryId: string,
pricingType: 'rec' | 'med' = 'rec',
options: {
perPage?: number;
maxPages?: number;
menuUrl?: string;
cName?: string;
} = {}
): Promise<{
modeA: { products: DutchieRawProduct[]; totalCount: number };
modeB: { products: DutchieRawProduct[]; totalCount: number };
merged: { products: DutchieRawProduct[]; totalCount: number };
}> {
// cName is now REQUIRED - no default fallback to avoid using wrong store's session
const cName = options.cName;
if (!cName) {
throw new Error('[GraphQL Client] cName is required for fetchAllProductsBothModes - cannot use another store\'s session');
}
console.log(`[GraphQL Client] Running two-mode crawl for ${cName} (${pricingType})...`);
console.log(`[GraphQL Client] Platform ID: ${platformDispensaryId}, cName: ${cName}`);
const session = await createSession(cName);
try {
// Mode A (UI parity)
const modeAResult = await fetchProductsForMode(session, platformDispensaryId, cName, pricingType, 'mode_a');
// Delay between modes
await new Promise((r) => setTimeout(r, dutchieConfig.modeDelayMs));
// Mode B (MAX COVERAGE)
const modeBResult = await fetchProductsForMode(session, platformDispensaryId, cName, pricingType, 'mode_b');
// Merge results
const modeBMap = new Map<string, DutchieRawProduct>();
for (const product of modeBResult.products) {
modeBMap.set(product._id, product);
}
const productMap = new Map<string, DutchieRawProduct>();
// Add Mode A products, merging with Mode B if exists
for (const product of modeAResult.products) {
const modeBProduct = modeBMap.get(product._id);
const mergedProduct = mergeProducts(product, modeBProduct);
productMap.set(product._id, mergedProduct);
}
// Add Mode B products not in Mode A
for (const product of modeBResult.products) {
if (!productMap.has(product._id)) {
productMap.set(product._id, product);
}
}
const mergedProducts = Array.from(productMap.values());
console.log(`[GraphQL Client] Merged: ${mergedProducts.length} unique products`);
console.log(`[GraphQL Client] Mode A: ${modeAResult.products.length}, Mode B: ${modeBResult.products.length}`);
return {
modeA: { products: modeAResult.products, totalCount: modeAResult.totalCount },
modeB: { products: modeBResult.products, totalCount: modeBResult.totalCount },
merged: { products: mergedProducts, totalCount: mergedProducts.length },
};
} finally {
await closeSession(session);
}
}

@@ -1,665 +0,0 @@
/**
* Job Queue Service
*
* DB-backed job queue with claiming/locking for distributed workers.
* Ensures only one worker processes a given store at a time.
*/
import { query, getClient } from '../db/connection';
import { v4 as uuidv4 } from 'uuid';
import * as os from 'os';
import { DEFAULT_CONFIG } from './store-validator';
// Minimum gap between crawls for the same dispensary (in minutes)
const MIN_CRAWL_GAP_MINUTES = DEFAULT_CONFIG.minCrawlGapMinutes; // 2 minutes
// ============================================================
// TYPES
// ============================================================
export interface QueuedJob {
id: number;
jobType: string;
dispensaryId: number | null;
status: 'pending' | 'running' | 'completed' | 'failed';
priority: number;
retryCount: number;
maxRetries: number;
claimedBy: string | null;
claimedAt: Date | null;
workerHostname: string | null;
startedAt: Date | null;
completedAt: Date | null;
errorMessage: string | null;
productsFound: number;
productsUpserted: number;
snapshotsCreated: number;
currentPage: number;
totalPages: number | null;
lastHeartbeatAt: Date | null;
metadata: Record<string, any> | null;
createdAt: Date;
}
export interface EnqueueJobOptions {
jobType: string;
dispensaryId?: number;
priority?: number;
metadata?: Record<string, any>;
maxRetries?: number;
}
export interface ClaimJobOptions {
workerId: string;
jobTypes?: string[];
lockDurationMinutes?: number;
}
export interface JobProgress {
productsFound?: number;
productsUpserted?: number;
snapshotsCreated?: number;
currentPage?: number;
totalPages?: number;
}
// ============================================================
// WORKER IDENTITY
// ============================================================
let _workerId: string | null = null;
/**
* Get or create a unique worker ID for this process
* In Kubernetes, uses POD_NAME for clarity; otherwise generates a unique ID
*/
export function getWorkerId(): string {
if (!_workerId) {
// Prefer POD_NAME in K8s (set via fieldRef)
const podName = process.env.POD_NAME;
if (podName) {
_workerId = podName;
} else {
const hostname = os.hostname();
const pid = process.pid;
const uuid = uuidv4().slice(0, 8);
_workerId = `${hostname}-${pid}-${uuid}`;
}
}
return _workerId;
}
/**
* Get hostname for worker tracking
* In Kubernetes, uses POD_NAME; otherwise uses os.hostname()
*/
export function getWorkerHostname(): string {
return process.env.POD_NAME || os.hostname();
}
// ============================================================
// JOB ENQUEUEING
// ============================================================
export interface EnqueueResult {
jobId: number | null;
skipped: boolean;
reason?: 'already_queued' | 'too_soon' | 'error';
message?: string;
}
/**
* Enqueue a new job for processing
* Returns null if a pending/running job already exists for this dispensary
* or if a job was completed/failed within the minimum gap period
*/
export async function enqueueJob(options: EnqueueJobOptions): Promise<number | null> {
const result = await enqueueJobWithReason(options);
return result.jobId;
}
/**
* Enqueue a new job with detailed result info
* Enforces:
* 1. No duplicate pending/running jobs for same dispensary
* 2. Minimum 2-minute gap between crawls for same dispensary
*/
export async function enqueueJobWithReason(options: EnqueueJobOptions): Promise<EnqueueResult> {
const {
jobType,
dispensaryId,
priority = 0,
metadata,
maxRetries = 3,
} = options;
// Check if there's already a pending/running job for this dispensary
if (dispensaryId) {
const { rows: existing } = await query<any>(
`SELECT id FROM dispensary_crawl_jobs
WHERE dispensary_id = $1 AND status IN ('pending', 'running')
LIMIT 1`,
[dispensaryId]
);
if (existing.length > 0) {
console.log(`[JobQueue] Skipping enqueue - job already exists for dispensary ${dispensaryId}`);
return {
jobId: null,
skipped: true,
reason: 'already_queued',
message: `Job already pending/running for dispensary ${dispensaryId}`,
};
}
// Check minimum gap since last job (2 minutes)
const { rows: recent } = await query<any>(
`SELECT id, created_at, status
FROM dispensary_crawl_jobs
WHERE dispensary_id = $1
ORDER BY created_at DESC
LIMIT 1`,
[dispensaryId]
);
if (recent.length > 0) {
const lastJobTime = new Date(recent[0].created_at);
const minGapMs = MIN_CRAWL_GAP_MINUTES * 60 * 1000;
const timeSinceLastJob = Date.now() - lastJobTime.getTime();
if (timeSinceLastJob < minGapMs) {
const waitSeconds = Math.ceil((minGapMs - timeSinceLastJob) / 1000);
console.log(`[JobQueue] Skipping enqueue - minimum ${MIN_CRAWL_GAP_MINUTES}min gap not met for dispensary ${dispensaryId}. Wait ${waitSeconds}s`);
return {
jobId: null,
skipped: true,
reason: 'too_soon',
message: `Minimum ${MIN_CRAWL_GAP_MINUTES}-minute gap required. Try again in ${waitSeconds} seconds.`,
};
}
}
}
try {
const { rows } = await query<any>(
`INSERT INTO dispensary_crawl_jobs (job_type, dispensary_id, status, priority, max_retries, metadata, created_at)
VALUES ($1, $2, 'pending', $3, $4, $5, NOW())
RETURNING id`,
[jobType, dispensaryId || null, priority, maxRetries, metadata ? JSON.stringify(metadata) : null]
);
const jobId = rows[0].id;
console.log(`[JobQueue] Enqueued job ${jobId} (type=${jobType}, dispensary=${dispensaryId})`);
return { jobId, skipped: false };
} catch (error: any) {
// Handle database trigger rejection for minimum gap
if (error.message?.includes('Minimum') && error.message?.includes('gap')) {
console.log(`[JobQueue] DB rejected - minimum gap not met for dispensary ${dispensaryId}`);
return {
jobId: null,
skipped: true,
reason: 'too_soon',
message: error.message,
};
}
throw error;
}
}
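The minimum-gap check boils down to a little date arithmetic. The sketch below isolates it; `secondsUntilAllowed` is a hypothetical helper, and the 2-minute gap is passed in explicitly rather than read from `DEFAULT_CONFIG`.

```typescript
// Returns how many seconds the caller must still wait, or 0 if the gap has elapsed.
function secondsUntilAllowed(lastJobAt: Date, now: Date, minGapMinutes: number): number {
  const minGapMs = minGapMinutes * 60 * 1000;
  const elapsedMs = now.getTime() - lastJobAt.getTime();
  return elapsedMs < minGapMs ? Math.ceil((minGapMs - elapsedMs) / 1000) : 0;
}

const lastJobAt = new Date('2025-12-08T10:00:00Z');
// 45 seconds after the last job: 75 seconds of the 2-minute gap remain.
const waitEarly = secondsUntilAllowed(lastJobAt, new Date('2025-12-08T10:00:45Z'), 2);
// Exactly 2 minutes later: the gap is met, so no wait.
const waitLate = secondsUntilAllowed(lastJobAt, new Date('2025-12-08T10:02:00Z'), 2);
```

Note that the real function measures the gap from the most recent job's `created_at`, not from when that job finished.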
export interface BulkEnqueueResult {
enqueued: number;
skipped: number;
skippedReasons: {
alreadyQueued: number;
tooSoon: number;
};
}
/**
* Bulk enqueue jobs for multiple dispensaries
* Skips dispensaries that already have pending/running jobs
* or have jobs within the minimum gap period
*/
export async function bulkEnqueueJobs(
jobType: string,
dispensaryIds: number[],
options: { priority?: number; metadata?: Record<string, any> } = {}
): Promise<BulkEnqueueResult> {
const { priority = 0, metadata } = options;
// Get dispensaries that already have pending/running jobs
const { rows: existing } = await query<any>(
`SELECT DISTINCT dispensary_id FROM dispensary_crawl_jobs
WHERE dispensary_id = ANY($1) AND status IN ('pending', 'running')`,
[dispensaryIds]
);
const existingSet = new Set(existing.map((r: any) => r.dispensary_id));
// Get dispensaries that have recent jobs within minimum gap
const { rows: recent } = await query<any>(
`SELECT DISTINCT dispensary_id FROM dispensary_crawl_jobs
WHERE dispensary_id = ANY($1)
AND created_at > NOW() - ($2 || ' minutes')::INTERVAL
AND dispensary_id NOT IN (
SELECT dispensary_id FROM dispensary_crawl_jobs
WHERE dispensary_id = ANY($1) AND status IN ('pending', 'running')
)`,
[dispensaryIds, MIN_CRAWL_GAP_MINUTES]
);
const recentSet = new Set(recent.map((r: any) => r.dispensary_id));
// Filter out dispensaries with existing or recent jobs
const toEnqueue = dispensaryIds.filter(id => !existingSet.has(id) && !recentSet.has(id));
if (toEnqueue.length === 0) {
return {
enqueued: 0,
skipped: dispensaryIds.length,
skippedReasons: {
alreadyQueued: existingSet.size,
tooSoon: recentSet.size,
},
};
}
// Bulk insert - each row needs 4 params: job_type, dispensary_id, priority, metadata
const metadataJson = metadata ? JSON.stringify(metadata) : null;
const values = toEnqueue.map((_, i) => {
const offset = i * 4;
return `($${offset + 1}, $${offset + 2}, 'pending', $${offset + 3}, 3, $${offset + 4}, NOW())`;
}).join(', ');
const params: any[] = [];
toEnqueue.forEach(dispensaryId => {
params.push(jobType, dispensaryId, priority, metadataJson);
});
await query(
`INSERT INTO dispensary_crawl_jobs (job_type, dispensary_id, status, priority, max_retries, metadata, created_at)
VALUES ${values}`,
params
);
console.log(`[JobQueue] Bulk enqueued ${toEnqueue.length} jobs, skipped ${existingSet.size} (queued) + ${recentSet.size} (recent)`);
return {
enqueued: toEnqueue.length,
skipped: existingSet.size + recentSet.size,
skippedReasons: {
alreadyQueued: existingSet.size,
tooSoon: recentSet.size,
},
};
}
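The placeholder arithmetic in the bulk insert is easy to get wrong, so here is a standalone sketch of it. `buildBulkInsert` is a hypothetical helper and `'crawl'` is a made-up job type; the row template is the one used above, with four bound parameters per row.

```typescript
// Builds the VALUES list and the flat parameter array for a multi-row insert.
function buildBulkInsert(
  jobType: string,
  dispensaryIds: number[],
  priority: number,
  metadataJson: string | null
): { values: string; params: Array<string | number | null> } {
  const paramsPerRow = 4; // job_type, dispensary_id, priority, metadata
  const values = dispensaryIds
    .map((_, i) => {
      const offset = i * paramsPerRow;
      return `($${offset + 1}, $${offset + 2}, 'pending', $${offset + 3}, 3, $${offset + 4}, NOW())`;
    })
    .join(', ');
  const params: Array<string | number | null> = [];
  for (const id of dispensaryIds) {
    params.push(jobType, id, priority, metadataJson);
  }
  return { values, params };
}

const bulkSketch = buildBulkInsert('crawl', [11, 12], 0, null);
```

For two dispensaries this yields placeholders `$1` to `$8` and eight parameters, so the offsets and the parameter array stay in step.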
// ============================================================
// JOB CLAIMING (with locking)
// ============================================================
/**
* Claim the next available job from the queue
* Uses SELECT FOR UPDATE SKIP LOCKED to prevent double-claims
*/
export async function claimNextJob(options: ClaimJobOptions): Promise<QueuedJob | null> {
const { workerId, jobTypes, lockDurationMinutes = 30 } = options;
const hostname = getWorkerHostname();
const client = await getClient();
try {
await client.query('BEGIN');
// Build job type filter
let typeFilter = '';
const params: any[] = [workerId, hostname, lockDurationMinutes];
let paramIndex = 4;
if (jobTypes && jobTypes.length > 0) {
typeFilter = `AND job_type = ANY($${paramIndex})`;
params.push(jobTypes);
paramIndex++;
}
// Claim the next pending job using FOR UPDATE SKIP LOCKED
// This atomically selects and locks a row, skipping any already locked by other workers
const { rows } = await client.query(
`UPDATE dispensary_crawl_jobs
SET
status = 'running',
claimed_by = $1,
claimed_at = NOW(),
worker_id = $1,
worker_hostname = $2,
started_at = NOW(),
locked_until = NOW() + ($3 || ' minutes')::INTERVAL,
last_heartbeat_at = NOW(),
updated_at = NOW()
WHERE id = (
SELECT id FROM dispensary_crawl_jobs
WHERE status = 'pending'
${typeFilter}
ORDER BY priority DESC, created_at ASC
FOR UPDATE SKIP LOCKED
LIMIT 1
)
RETURNING *`,
params
);
await client.query('COMMIT');
if (rows.length === 0) {
return null;
}
const job = mapDbRowToJob(rows[0]);
console.log(`[JobQueue] Worker ${workerId} claimed job ${job.id} (type=${job.jobType}, dispensary=${job.dispensaryId})`);
return job;
} catch (error) {
await client.query('ROLLBACK');
throw error;
} finally {
client.release();
}
}
// ============================================================
// JOB PROGRESS & COMPLETION
// ============================================================
/**
* Update job progress (for live monitoring)
*/
export async function updateJobProgress(jobId: number, progress: JobProgress): Promise<void> {
const updates: string[] = ['last_heartbeat_at = NOW()', 'updated_at = NOW()'];
const params: any[] = [];
let paramIndex = 1;
if (progress.productsFound !== undefined) {
updates.push(`products_found = $${paramIndex++}`);
params.push(progress.productsFound);
}
if (progress.productsUpserted !== undefined) {
updates.push(`products_upserted = $${paramIndex++}`);
params.push(progress.productsUpserted);
}
if (progress.snapshotsCreated !== undefined) {
updates.push(`snapshots_created = $${paramIndex++}`);
params.push(progress.snapshotsCreated);
}
if (progress.currentPage !== undefined) {
updates.push(`current_page = $${paramIndex++}`);
params.push(progress.currentPage);
}
if (progress.totalPages !== undefined) {
updates.push(`total_pages = $${paramIndex++}`);
params.push(progress.totalPages);
}
params.push(jobId);
await query(
`UPDATE dispensary_crawl_jobs SET ${updates.join(', ')} WHERE id = $${paramIndex}`,
params
);
}
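The dynamic `SET` clause above can be sketched on its own to show how the placeholder index advances only for fields that are present. `ProgressSketch` and `buildProgressUpdate` are illustrative names, and only two of the progress fields are modelled.

```typescript
interface ProgressSketch {
  productsFound?: number;
  currentPage?: number;
}

// Builds the UPDATE text and parameters; the job id always takes the last placeholder.
function buildProgressUpdate(
  jobId: number,
  progress: ProgressSketch
): { sql: string; params: number[] } {
  const updates: string[] = ['last_heartbeat_at = NOW()'];
  const params: number[] = [];
  let paramIndex = 1;
  if (progress.productsFound !== undefined) {
    updates.push(`products_found = $${paramIndex++}`);
    params.push(progress.productsFound);
  }
  if (progress.currentPage !== undefined) {
    updates.push(`current_page = $${paramIndex++}`);
    params.push(progress.currentPage);
  }
  params.push(jobId);
  return {
    sql: `UPDATE dispensary_crawl_jobs SET ${updates.join(', ')} WHERE id = $${paramIndex}`,
    params,
  };
}

const progressUpdate = buildProgressUpdate(42, { currentPage: 3 });
```

With only `currentPage` supplied, the statement binds `$1` to the page and `$2` to the job id. An empty progress object still produces a valid statement that just refreshes the heartbeat.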
/**
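* Send heartbeat to keep job alive (prevents timeout)
* Extends the lock by a fixed 30 minutes, regardless of the lockDurationMinutes used when the job was claimed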
*/
export async function heartbeat(jobId: number): Promise<void> {
await query(
`UPDATE dispensary_crawl_jobs
SET last_heartbeat_at = NOW(), locked_until = NOW() + INTERVAL '30 minutes'
WHERE id = $1 AND status = 'running'`,
[jobId]
);
}
/**
* Mark job as completed
*
* Stores visibility tracking stats (visibilityLostCount, visibilityRestoredCount)
* in the metadata JSONB column for dashboard analytics.
*/
export async function completeJob(
jobId: number,
result: {
productsFound?: number;
productsUpserted?: number;
snapshotsCreated?: number;
visibilityLostCount?: number;
visibilityRestoredCount?: number;
}
): Promise<void> {
// Build metadata with visibility stats if provided
const metadata: Record<string, any> = {};
if (result.visibilityLostCount !== undefined) {
metadata.visibilityLostCount = result.visibilityLostCount;
}
if (result.visibilityRestoredCount !== undefined) {
metadata.visibilityRestoredCount = result.visibilityRestoredCount;
}
if (result.snapshotsCreated !== undefined) {
metadata.snapshotsCreated = result.snapshotsCreated;
}
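// NOTE: this statement writes products_updated, while updateJobProgress() and mapDbRowToJob()
// use products_upserted; snapshotsCreated is stored only in metadata here.
await query(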
`UPDATE dispensary_crawl_jobs
SET
status = 'completed',
completed_at = NOW(),
products_found = COALESCE($2, products_found),
products_updated = COALESCE($3, products_updated),
metadata = COALESCE(metadata, '{}'::jsonb) || $4::jsonb,
updated_at = NOW()
WHERE id = $1`,
[
jobId,
result.productsFound,
result.productsUpserted,
JSON.stringify(metadata),
]
);
console.log(`[JobQueue] Job ${jobId} completed`);
}
/**
* Mark job as failed
*/
export async function failJob(jobId: number, errorMessage: string): Promise<boolean> {
// Check if we should retry
const { rows } = await query<any>(
`SELECT retry_count, max_retries FROM dispensary_crawl_jobs WHERE id = $1`,
[jobId]
);
if (rows.length === 0) return false;
const { retry_count, max_retries } = rows[0];
if (retry_count < max_retries) {
// Re-queue for retry
await query(
`UPDATE dispensary_crawl_jobs
SET
status = 'pending',
retry_count = retry_count + 1,
claimed_by = NULL,
claimed_at = NULL,
worker_id = NULL,
worker_hostname = NULL,
started_at = NULL,
locked_until = NULL,
last_heartbeat_at = NULL,
error_message = $2,
updated_at = NOW()
WHERE id = $1`,
[jobId, errorMessage]
);
console.log(`[JobQueue] Job ${jobId} failed, re-queued for retry (${retry_count + 1}/${max_retries})`);
return true; // Will retry
} else {
// Mark as failed permanently
await query(
`UPDATE dispensary_crawl_jobs
SET
status = 'failed',
completed_at = NOW(),
error_message = $2,
updated_at = NOW()
WHERE id = $1`,
[jobId, errorMessage]
);
console.log(`[JobQueue] Job ${jobId} failed permanently after ${retry_count} retries`);
return false; // No more retries
}
}
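The retry decision in `failJob` is a small state transition, sketched here without the database. `FailOutcome` and `nextStateAfterFailure` are hypothetical names used only for this illustration.

```typescript
interface FailOutcome {
  status: 'pending' | 'failed';
  retryCount: number;
  willRetry: boolean;
}

// Same rule as failJob: re-queue while retry_count < max_retries, otherwise fail for good.
function nextStateAfterFailure(retryCount: number, maxRetries: number): FailOutcome {
  if (retryCount < maxRetries) {
    return { status: 'pending', retryCount: retryCount + 1, willRetry: true };
  }
  return { status: 'failed', retryCount, willRetry: false };
}

// Walk a job with maxRetries = 3 through repeated failures.
const outcomes: FailOutcome[] = [];
let currentRetryCount = 0;
for (let failure = 0; failure < 4; failure++) {
  const outcome = nextStateAfterFailure(currentRetryCount, 3);
  outcomes.push(outcome);
  currentRetryCount = outcome.retryCount;
}
```

Under this rule a job with `max_retries = 3` is re-queued after its first three failures and marked `failed` on the fourth, so it runs up to four times in total.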
// ============================================================
// QUEUE MONITORING
// ============================================================
/**
* Get queue statistics
*/
export async function getQueueStats(): Promise<{
pending: number;
running: number;
completed1h: number;
failed1h: number;
activeWorkers: number;
avgDurationSeconds: number | null;
}> {
const { rows } = await query<any>(`SELECT * FROM v_queue_stats`);
const stats = rows[0] || {};
return {
pending: parseInt(stats.pending_jobs || '0', 10),
running: parseInt(stats.running_jobs || '0', 10),
completed1h: parseInt(stats.completed_1h || '0', 10),
failed1h: parseInt(stats.failed_1h || '0', 10),
activeWorkers: parseInt(stats.active_workers || '0', 10),
avgDurationSeconds: stats.avg_duration_seconds ? parseFloat(stats.avg_duration_seconds) : null,
};
}
/**
* Get active workers
*/
export async function getActiveWorkers(): Promise<Array<{
workerId: string;
hostname: string | null;
currentJobs: number;
totalProductsFound: number;
totalProductsUpserted: number;
totalSnapshots: number;
firstClaimedAt: Date;
lastHeartbeat: Date | null;
}>> {
const { rows } = await query<any>(`SELECT * FROM v_active_workers`);
return rows.map((row: any) => ({
workerId: row.worker_id,
hostname: row.worker_hostname,
currentJobs: parseInt(row.current_jobs || '0', 10),
totalProductsFound: parseInt(row.total_products_found || '0', 10),
totalProductsUpserted: parseInt(row.total_products_upserted || '0', 10),
totalSnapshots: parseInt(row.total_snapshots || '0', 10),
firstClaimedAt: new Date(row.first_claimed_at),
lastHeartbeat: row.last_heartbeat ? new Date(row.last_heartbeat) : null,
}));
}
/**
* Get running jobs with worker info
*/
export async function getRunningJobs(): Promise<QueuedJob[]> {
const { rows } = await query<any>(
`SELECT cj.*, d.name as dispensary_name, d.city
FROM dispensary_crawl_jobs cj
LEFT JOIN dispensaries d ON cj.dispensary_id = d.id
WHERE cj.status = 'running'
ORDER BY cj.started_at DESC`
);
return rows.map(mapDbRowToJob);
}
/**
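* Recover stale jobs (workers that died without completing)
* Only jobs with retries remaining are re-queued; stale jobs that have used all retries are left in 'running'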
*/
export async function recoverStaleJobs(staleMinutes: number = 15): Promise<number> {
const { rowCount } = await query(
`UPDATE dispensary_crawl_jobs
SET
status = 'pending',
claimed_by = NULL,
claimed_at = NULL,
worker_id = NULL,
worker_hostname = NULL,
started_at = NULL,
locked_until = NULL,
error_message = 'Recovered from stale worker',
retry_count = retry_count + 1,
updated_at = NOW()
WHERE status = 'running'
AND last_heartbeat_at < NOW() - ($1 || ' minutes')::INTERVAL
AND retry_count < max_retries`,
[staleMinutes]
);
if (rowCount && rowCount > 0) {
console.log(`[JobQueue] Recovered ${rowCount} stale jobs`);
}
return rowCount || 0;
}
/**
* Clean up old completed/failed jobs
*/
export async function cleanupOldJobs(olderThanDays: number = 7): Promise<number> {
const { rowCount } = await query(
`DELETE FROM dispensary_crawl_jobs
WHERE status IN ('completed', 'failed')
AND completed_at < NOW() - ($1 || ' days')::INTERVAL`,
[olderThanDays]
);
if (rowCount && rowCount > 0) {
console.log(`[JobQueue] Cleaned up ${rowCount} old jobs`);
}
return rowCount || 0;
}
// ============================================================
// HELPERS
// ============================================================
function mapDbRowToJob(row: any): QueuedJob {
return {
id: row.id,
jobType: row.job_type,
dispensaryId: row.dispensary_id,
status: row.status,
priority: row.priority || 0,
retryCount: row.retry_count || 0,
maxRetries: row.max_retries ?? 3, // ?? keeps an explicit 0 (no retries) instead of replacing it with 3
claimedBy: row.claimed_by,
claimedAt: row.claimed_at ? new Date(row.claimed_at) : null,
workerHostname: row.worker_hostname,
startedAt: row.started_at ? new Date(row.started_at) : null,
completedAt: row.completed_at ? new Date(row.completed_at) : null,
errorMessage: row.error_message,
productsFound: row.products_found || 0,
productsUpserted: row.products_upserted || 0,
snapshotsCreated: row.snapshots_created || 0,
currentPage: row.current_page || 0,
totalPages: row.total_pages,
lastHeartbeatAt: row.last_heartbeat_at ? new Date(row.last_heartbeat_at) : null,
metadata: row.metadata,
createdAt: new Date(row.created_at),
// Add extra fields from join if present
...(row.dispensary_name && { dispensaryName: row.dispensary_name }),
...(row.city && { city: row.city }),
};
}

File diff suppressed because it is too large

File diff suppressed because it is too large

@@ -1,435 +0,0 @@
/**
* Unified Retry Manager
*
* Handles retry logic with exponential backoff, jitter, and
* intelligent error-based decisions (rotate proxy, rotate UA, etc.)
*
* Phase 1: Crawler Reliability & Stabilization
*/
import {
CrawlErrorCodeType,
CrawlErrorCode,
classifyError,
getErrorMetadata,
isRetryable,
shouldRotateProxy,
shouldRotateUserAgent,
getBackoffMultiplier,
} from './error-taxonomy';
import { DEFAULT_CONFIG } from './store-validator';
// ============================================================
// RETRY CONFIGURATION
// ============================================================
export interface RetryConfig {
maxRetries: number;
baseBackoffMs: number;
maxBackoffMs: number;
backoffMultiplier: number;
jitterFactor: number; // 0.0 - 1.0 (fraction of the backoff to randomize, applied as +/-)
}
export const DEFAULT_RETRY_CONFIG: RetryConfig = {
maxRetries: DEFAULT_CONFIG.maxRetries,
baseBackoffMs: DEFAULT_CONFIG.baseBackoffMs,
maxBackoffMs: DEFAULT_CONFIG.maxBackoffMs,
backoffMultiplier: DEFAULT_CONFIG.backoffMultiplier,
jitterFactor: 0.25, // +/- 25% jitter
};
// ============================================================
// RETRY CONTEXT
// ============================================================
/**
* Context for tracking retry state across attempts
*/
export interface RetryContext {
attemptNumber: number;
maxAttempts: number;
lastErrorCode: CrawlErrorCodeType | null;
lastHttpStatus: number | null;
totalBackoffMs: number;
proxyRotated: boolean;
userAgentRotated: boolean;
startedAt: Date;
}
/**
* Decision about what to do after an error
*/
export interface RetryDecision {
shouldRetry: boolean;
reason: string;
backoffMs: number;
rotateProxy: boolean;
rotateUserAgent: boolean;
errorCode: CrawlErrorCodeType;
attemptNumber: number;
}
// ============================================================
// RETRY MANAGER CLASS
// ============================================================
export class RetryManager {
private config: RetryConfig;
private context: RetryContext;
constructor(config: Partial<RetryConfig> = {}) {
this.config = { ...DEFAULT_RETRY_CONFIG, ...config };
this.context = this.createInitialContext();
}
/**
* Create initial retry context
*/
private createInitialContext(): RetryContext {
return {
attemptNumber: 0,
maxAttempts: this.config.maxRetries + 1, // +1 for initial attempt
lastErrorCode: null,
lastHttpStatus: null,
totalBackoffMs: 0,
proxyRotated: false,
userAgentRotated: false,
startedAt: new Date(),
};
}
/**
* Reset retry state for a new operation
*/
reset(): void {
this.context = this.createInitialContext();
}
/**
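* Get the number of the upcoming attempt (1-based): recorded attempts + 1.
* Call before recordAttempt() to get the number of the attempt about to start.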
*/
getAttemptNumber(): number {
return this.context.attemptNumber + 1;
}
/**
* Check if we should attempt (call before each attempt)
*/
shouldAttempt(): boolean {
return this.context.attemptNumber < this.context.maxAttempts;
}
/**
* Record an attempt (call at start of each attempt)
*/
recordAttempt(): void {
this.context.attemptNumber++;
}
/**
* Evaluate an error and decide what to do
*/
evaluateError(
error: Error | string | null,
httpStatus?: number
): RetryDecision {
const errorCode = classifyError(error, httpStatus);
const metadata = getErrorMetadata(errorCode);
const attemptNumber = this.context.attemptNumber;
// Update context
this.context.lastErrorCode = errorCode;
this.context.lastHttpStatus = httpStatus || null;
// Check if error is retryable
if (!isRetryable(errorCode)) {
return {
shouldRetry: false,
reason: `Error ${errorCode} is not retryable: ${metadata.description}`,
backoffMs: 0,
rotateProxy: false,
rotateUserAgent: false,
errorCode,
attemptNumber,
};
}
// Check if we've exhausted retries
if (!this.shouldAttempt()) {
return {
shouldRetry: false,
reason: `Max retries (${this.config.maxRetries}) exhausted`,
backoffMs: 0,
rotateProxy: false,
rotateUserAgent: false,
errorCode,
attemptNumber,
};
}
// Calculate backoff with exponential increase and jitter
const baseBackoff = this.calculateBackoff(attemptNumber, errorCode);
const backoffWithJitter = this.addJitter(baseBackoff);
// Track total backoff
this.context.totalBackoffMs += backoffWithJitter;
// Determine rotation needs
const rotateProxy = shouldRotateProxy(errorCode);
const rotateUserAgent = shouldRotateUserAgent(errorCode);
if (rotateProxy) this.context.proxyRotated = true;
if (rotateUserAgent) this.context.userAgentRotated = true;
const rotationInfo: string[] = [];
if (rotateProxy) rotationInfo.push('rotate proxy');
if (rotateUserAgent) rotationInfo.push('rotate UA');
const rotationStr = rotationInfo.length > 0 ? ` (${rotationInfo.join(', ')})` : '';
return {
shouldRetry: true,
reason: `Retrying after ${errorCode}${rotationStr}, backoff ${backoffWithJitter}ms`,
backoffMs: backoffWithJitter,
rotateProxy,
rotateUserAgent,
errorCode,
attemptNumber,
};
}
/**
* Calculate exponential backoff for an attempt
*/
private calculateBackoff(attemptNumber: number, errorCode: CrawlErrorCodeType): number {
// Base exponential: baseBackoff * multiplier^(attempt-1)
const exponential = this.config.baseBackoffMs *
Math.pow(this.config.backoffMultiplier, attemptNumber - 1);
// Apply error-specific multiplier
const errorMultiplier = getBackoffMultiplier(errorCode);
const adjusted = exponential * errorMultiplier;
// Cap at max backoff
return Math.min(adjusted, this.config.maxBackoffMs);
}
/**
* Add jitter to backoff to prevent thundering herd
*/
private addJitter(backoffMs: number): number {
const jitterRange = backoffMs * this.config.jitterFactor;
// Random between -jitterRange and +jitterRange
const jitter = (Math.random() * 2 - 1) * jitterRange;
return Math.max(0, Math.round(backoffMs + jitter));
}
/**
* Get retry context summary
*/
getSummary(): RetryContextSummary {
const elapsedMs = Date.now() - this.context.startedAt.getTime();
return {
attemptsMade: this.context.attemptNumber,
maxAttempts: this.context.maxAttempts,
lastErrorCode: this.context.lastErrorCode,
lastHttpStatus: this.context.lastHttpStatus,
totalBackoffMs: this.context.totalBackoffMs,
totalElapsedMs: elapsedMs,
proxyWasRotated: this.context.proxyRotated,
userAgentWasRotated: this.context.userAgentRotated,
};
}
}
export interface RetryContextSummary {
attemptsMade: number;
maxAttempts: number;
lastErrorCode: CrawlErrorCodeType | null;
lastHttpStatus: number | null;
totalBackoffMs: number;
totalElapsedMs: number;
proxyWasRotated: boolean;
userAgentWasRotated: boolean;
}
// ============================================================
// CONVENIENCE FUNCTIONS
// ============================================================
/**
* Sleep for specified milliseconds
*/
export function sleep(ms: number): Promise<void> {
return new Promise(resolve => setTimeout(resolve, ms));
}
/**
* Execute a function with automatic retry logic
*/
export async function withRetry<T>(
fn: (attemptNumber: number) => Promise<T>,
config: Partial<RetryConfig> = {},
callbacks?: {
onRetry?: (decision: RetryDecision) => void | Promise<void>;
onRotateProxy?: () => void | Promise<void>;
onRotateUserAgent?: () => void | Promise<void>;
}
): Promise<{ result: T; summary: RetryContextSummary }> {
const manager = new RetryManager(config);
while (manager.shouldAttempt()) {
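// Read the 1-based attempt number before recording the attempt:
// getAttemptNumber() already adds 1 to the recorded count.
const attemptNumber = manager.getAttemptNumber();
manager.recordAttempt();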
try {
const result = await fn(attemptNumber);
return { result, summary: manager.getSummary() };
} catch (error) {
const err = error instanceof Error ? error : new Error(String(error));
const httpStatus = (error as any)?.status || (error as any)?.statusCode;
const decision = manager.evaluateError(err, httpStatus);
if (!decision.shouldRetry) {
// Re-throw with enhanced context
const enhancedError = new RetryExhaustedError(
`${err.message} (${decision.reason})`,
err,
manager.getSummary()
);
throw enhancedError;
}
// Notify callbacks
if (callbacks?.onRetry) {
await callbacks.onRetry(decision);
}
if (decision.rotateProxy && callbacks?.onRotateProxy) {
await callbacks.onRotateProxy();
}
if (decision.rotateUserAgent && callbacks?.onRotateUserAgent) {
await callbacks.onRotateUserAgent();
}
// Log retry decision
console.log(
`[RetryManager] Attempt ${attemptNumber} failed: ${decision.errorCode}. ` +
`${decision.reason}. Waiting ${decision.backoffMs}ms before retry.`
);
// Wait before retry
await sleep(decision.backoffMs);
}
}
// Should not reach here, but handle edge case
throw new RetryExhaustedError(
'Max retries exhausted',
null,
manager.getSummary()
);
}
// ============================================================
// CUSTOM ERROR CLASS
// ============================================================
export class RetryExhaustedError extends Error {
public readonly originalError: Error | null;
public readonly summary: RetryContextSummary;
public readonly errorCode: CrawlErrorCodeType;
constructor(
message: string,
originalError: Error | null,
summary: RetryContextSummary
) {
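super(message);
// Restore the prototype chain so `instanceof RetryExhaustedError` works when compiled to ES5
Object.setPrototypeOf(this, new.target.prototype);
this.name = 'RetryExhaustedError';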
this.originalError = originalError;
this.summary = summary;
this.errorCode = summary.lastErrorCode || CrawlErrorCode.UNKNOWN_ERROR;
}
}
// ============================================================
// BACKOFF CALCULATOR (for external use)
// ============================================================
/**
* Calculate next crawl time based on consecutive failures
*/
export function calculateNextCrawlDelay(
consecutiveFailures: number,
baseFrequencyMinutes: number,
maxBackoffMultiplier: number = 4.0
): number {
// Each failure doubles the delay, up to max multiplier
const multiplier = Math.min(
Math.pow(2, consecutiveFailures),
maxBackoffMultiplier
);
const delayMinutes = baseFrequencyMinutes * multiplier;
// Add jitter (0-10% of delay)
const jitterMinutes = delayMinutes * Math.random() * 0.1;
return Math.round(delayMinutes + jitterMinutes);
}
/**
* Calculate next crawl timestamp
*/
export function calculateNextCrawlAt(
consecutiveFailures: number,
baseFrequencyMinutes: number
): Date {
const delayMinutes = calculateNextCrawlDelay(consecutiveFailures, baseFrequencyMinutes);
return new Date(Date.now() + delayMinutes * 60 * 1000);
}
// ============================================================
// STATUS DETERMINATION
// ============================================================
/**
* Determine crawl status based on failure count
*/
export function determineCrawlStatus(
consecutiveFailures: number,
thresholds: { degraded: number; failed: number } = { degraded: 3, failed: 10 }
): 'active' | 'degraded' | 'failed' {
if (consecutiveFailures >= thresholds.failed) {
return 'failed';
}
if (consecutiveFailures >= thresholds.degraded) {
return 'degraded';
}
return 'active';
}
/**
* Determine if store should be auto-recovered
* (Called periodically to check if failed stores can be retried)
*/
export function shouldAttemptRecovery(
lastFailureAt: Date | null,
consecutiveFailures: number,
recoveryIntervalHours: number = 24
): boolean {
if (!lastFailureAt) return true;
// Wait longer for more failures
const waitHours = recoveryIntervalHours * Math.min(consecutiveFailures, 5);
const recoveryTime = new Date(lastFailureAt.getTime() + waitHours * 60 * 60 * 1000);
return new Date() >= recoveryTime;
}
// ============================================================
// SINGLETON INSTANCE
// ============================================================
export const retryManager = new RetryManager();
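
The backoff schedule computed by `calculateBackoff` and `addJitter` above can be illustrated with a minimal standalone sketch. It is not part of the module: `BackoffConfig` and `backoffForAttempt` are hypothetical names, the `jitterFactor` of 0.25 is an assumed value (the real default is not shown here), and the error-specific multiplier is left out for brevity. The random source is injectable so the jitter can be pinned to zero.

```typescript
// Hypothetical standalone sketch of capped exponential backoff with symmetric jitter.
interface BackoffConfig {
  baseBackoffMs: number;
  maxBackoffMs: number;
  backoffMultiplier: number;
  jitterFactor: number; // assumed value, not taken from the source
}

function backoffForAttempt(
  attemptNumber: number, // 1-based count of attempts already made
  cfg: BackoffConfig,
  random: () => number = Math.random
): number {
  // base * multiplier^(attempt - 1), capped at the maximum
  const exponential =
    cfg.baseBackoffMs * Math.pow(cfg.backoffMultiplier, attemptNumber - 1);
  const capped = Math.min(exponential, cfg.maxBackoffMs);
  // Jitter in the range [-capped * jitterFactor, +capped * jitterFactor]
  const jitter = (random() * 2 - 1) * capped * cfg.jitterFactor;
  return Math.max(0, Math.round(capped + jitter));
}

const cfg: BackoffConfig = {
  baseBackoffMs: 1000,
  maxBackoffMs: 60000,
  backoffMultiplier: 2.0,
  jitterFactor: 0.25,
};

// With the random source fixed at 0.5 the jitter term is exactly zero,
// so the schedule shows the pure exponential curve and the 60 s cap.
const schedule = [1, 2, 3, 4, 5, 6, 7].map((n) =>
  backoffForAttempt(n, cfg, () => 0.5)
);
console.log(schedule);
```

With jitter disabled the delays double from 1 s up to 32 s, and the seventh value is held at the 60 s cap instead of 64 s.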

File diff suppressed because it is too large


@@ -1,465 +0,0 @@
/**
* Store Configuration Validator
*
* Validates and sanitizes store configurations before crawling.
* Applies defaults for missing values and logs warnings.
*
* Phase 1: Crawler Reliability & Stabilization
*/
import { CrawlErrorCode, CrawlErrorCodeType } from './error-taxonomy';
// ============================================================
// DEFAULT CONFIGURATION
// ============================================================
/**
* Default crawl configuration values
*/
export const DEFAULT_CONFIG = {
// Scheduling
crawlFrequencyMinutes: 240, // 4 hours
minCrawlGapMinutes: 2, // Minimum 2 minutes between crawls
// Retries
maxRetries: 3,
baseBackoffMs: 1000, // 1 second
maxBackoffMs: 60000, // 1 minute
backoffMultiplier: 2.0, // Exponential backoff
// Timeouts
requestTimeoutMs: 30000, // 30 seconds
pageLoadTimeoutMs: 60000, // 60 seconds
// Limits
maxProductsPerPage: 100,
maxPages: 50,
// Proxy
proxyRotationEnabled: true,
proxyRotationOnFailure: true,
// User Agent
userAgentRotationEnabled: true,
userAgentRotationOnFailure: true,
} as const;
// ============================================================
// STORE CONFIG INTERFACE
// ============================================================
/**
* Raw store configuration from database
*/
export interface RawStoreConfig {
id: number;
name: string;
slug?: string;
platform?: string;
menuType?: string;
platformDispensaryId?: string;
menuUrl?: string;
website?: string;
// Crawl config
crawlFrequencyMinutes?: number;
maxRetries?: number;
currentProxyId?: number;
currentUserAgent?: string;
// Status
crawlStatus?: string;
consecutiveFailures?: number;
backoffMultiplier?: number;
lastCrawlAt?: Date;
lastSuccessAt?: Date;
lastFailureAt?: Date;
lastErrorCode?: string;
nextCrawlAt?: Date;
}
/**
* Validated and sanitized store configuration
*/
export interface ValidatedStoreConfig {
id: number;
name: string;
slug: string;
platform: string;
menuType: string;
platformDispensaryId: string;
menuUrl: string;
// Crawl config (with defaults applied)
crawlFrequencyMinutes: number;
maxRetries: number;
currentProxyId: number | null;
currentUserAgent: string | null;
// Status
crawlStatus: 'active' | 'degraded' | 'paused' | 'failed';
consecutiveFailures: number;
backoffMultiplier: number;
lastCrawlAt: Date | null;
lastSuccessAt: Date | null;
lastFailureAt: Date | null;
lastErrorCode: CrawlErrorCodeType | null;
nextCrawlAt: Date | null;
// Validation metadata
isValid: boolean;
validationErrors: ValidationError[];
validationWarnings: ValidationWarning[];
}
// ============================================================
// VALIDATION TYPES
// ============================================================
export interface ValidationError {
field: string;
message: string;
code: CrawlErrorCodeType;
}
export interface ValidationWarning {
field: string;
message: string;
appliedDefault?: any;
}
export interface ValidationResult {
isValid: boolean;
config: ValidatedStoreConfig | null;
errors: ValidationError[];
warnings: ValidationWarning[];
}
// ============================================================
// VALIDATOR CLASS
// ============================================================
export class StoreValidator {
private errors: ValidationError[] = [];
private warnings: ValidationWarning[] = [];
/**
* Validate and sanitize a store configuration
*/
validate(raw: RawStoreConfig): ValidationResult {
this.errors = [];
this.warnings = [];
// Required field validation
this.validateRequired(raw);
// If critical errors, return early
if (this.errors.length > 0) {
return {
isValid: false,
config: null,
errors: this.errors,
warnings: this.warnings,
};
}
// Build validated config with defaults
const config = this.buildValidatedConfig(raw);
return {
isValid: this.errors.length === 0,
config,
errors: this.errors,
warnings: this.warnings,
};
}
/**
* Validate required fields
*/
private validateRequired(raw: RawStoreConfig): void {
if (!raw.id) {
this.addError('id', 'Store ID is required', CrawlErrorCode.INVALID_CONFIG);
}
if (!raw.name) {
this.addError('name', 'Store name is required', CrawlErrorCode.INVALID_CONFIG);
}
if (!raw.platformDispensaryId) {
this.addError(
'platformDispensaryId',
'Platform dispensary ID is required for crawling',
CrawlErrorCode.MISSING_PLATFORM_ID
);
}
if (!raw.menuType || raw.menuType === 'unknown') {
this.addError(
'menuType',
'Menu type must be detected before crawling',
CrawlErrorCode.INVALID_CONFIG
);
}
}
/**
* Build validated config with defaults applied
*/
private buildValidatedConfig(raw: RawStoreConfig): ValidatedStoreConfig {
// Slug
const slug = raw.slug || this.generateSlug(raw.name);
if (!raw.slug) {
this.addWarning('slug', 'Slug was missing, generated from name', slug);
}
// Platform
const platform = raw.platform || 'dutchie';
if (!raw.platform) {
this.addWarning('platform', 'Platform was missing, defaulting to dutchie', platform);
}
// Menu URL
const menuUrl = raw.menuUrl || this.generateMenuUrl(raw.platformDispensaryId!, platform);
if (!raw.menuUrl) {
this.addWarning('menuUrl', 'Menu URL was missing, generated from platform ID', menuUrl);
}
// Crawl frequency
const crawlFrequencyMinutes = this.validateNumeric(
raw.crawlFrequencyMinutes,
'crawlFrequencyMinutes',
DEFAULT_CONFIG.crawlFrequencyMinutes,
60, // min: 1 hour
1440 // max: 24 hours
);
// Max retries
const maxRetries = this.validateNumeric(
raw.maxRetries,
'maxRetries',
DEFAULT_CONFIG.maxRetries,
1, // min
10 // max
);
// Backoff multiplier
const backoffMultiplier = this.validateNumeric(
raw.backoffMultiplier,
'backoffMultiplier',
1.0,
1.0, // min
10.0 // max
);
// Crawl status
const crawlStatus = this.validateCrawlStatus(raw.crawlStatus);
// Consecutive failures
const consecutiveFailures = Math.max(0, raw.consecutiveFailures || 0);
// Last error code
const lastErrorCode = this.validateErrorCode(raw.lastErrorCode);
return {
id: raw.id,
name: raw.name,
slug,
platform,
menuType: raw.menuType!,
platformDispensaryId: raw.platformDispensaryId!,
menuUrl,
crawlFrequencyMinutes,
maxRetries,
currentProxyId: raw.currentProxyId || null,
currentUserAgent: raw.currentUserAgent || null,
crawlStatus,
consecutiveFailures,
backoffMultiplier,
lastCrawlAt: raw.lastCrawlAt || null,
lastSuccessAt: raw.lastSuccessAt || null,
lastFailureAt: raw.lastFailureAt || null,
lastErrorCode,
nextCrawlAt: raw.nextCrawlAt || null,
isValid: true,
validationErrors: [],
validationWarnings: this.warnings,
};
}
/**
* Validate numeric value with bounds
*/
private validateNumeric(
value: number | undefined,
field: string,
defaultValue: number,
min: number,
max: number
): number {
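// Treat NaN like a missing value; it would otherwise pass both bounds checks below
if (value === undefined || value === null || Number.isNaN(value)) {
this.addWarning(field, `Missing or not a number, defaulting to ${defaultValue}`, defaultValue);
return defaultValue;
}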
if (value < min) {
this.addWarning(field, `Value ${value} below minimum ${min}, using minimum`, min);
return min;
}
if (value > max) {
this.addWarning(field, `Value ${value} above maximum ${max}, using maximum`, max);
return max;
}
return value;
}
/**
* Validate crawl status
*/
private validateCrawlStatus(status?: string): 'active' | 'degraded' | 'paused' | 'failed' {
const validStatuses = ['active', 'degraded', 'paused', 'failed'];
if (!status || !validStatuses.includes(status)) {
if (status) {
this.addWarning('crawlStatus', `Invalid status "${status}", defaulting to active`, 'active');
}
return 'active';
}
return status as 'active' | 'degraded' | 'paused' | 'failed';
}
/**
* Validate error code
*/
private validateErrorCode(code?: string): CrawlErrorCodeType | null {
if (!code) return null;
const validCodes = Object.values(CrawlErrorCode);
if (!validCodes.includes(code as CrawlErrorCodeType)) {
this.addWarning('lastErrorCode', `Invalid error code "${code}", defaulting to ${CrawlErrorCode.UNKNOWN_ERROR}`, CrawlErrorCode.UNKNOWN_ERROR);
return CrawlErrorCode.UNKNOWN_ERROR;
}
return code as CrawlErrorCodeType;
}
/**
* Generate slug from name
*/
private generateSlug(name: string): string {
return name
.toLowerCase()
.replace(/[^a-z0-9]+/g, '-')
.replace(/^-+|-+$/g, '')
.substring(0, 100);
}
/**
* Generate menu URL from platform ID
*/
private generateMenuUrl(platformId: string, platform: string): string {
if (platform === 'dutchie') {
return `https://dutchie.com/embedded-menu/${platformId}`;
}
return `https://${platform}.com/menu/${platformId}`;
}
/**
* Add validation error
*/
private addError(field: string, message: string, code: CrawlErrorCodeType): void {
this.errors.push({ field, message, code });
console.warn(`[StoreValidator] ERROR ${field}: ${message}`);
}
/**
* Add validation warning
*/
private addWarning(field: string, message: string, appliedDefault?: any): void {
this.warnings.push({ field, message, appliedDefault });
// Log at debug level - warnings are expected for incomplete configs
console.debug(`[StoreValidator] WARNING ${field}: ${message}`);
}
}
// ============================================================
// CONVENIENCE FUNCTIONS
// ============================================================
/**
* Validate a single store config
*/
export function validateStoreConfig(raw: RawStoreConfig): ValidationResult {
const validator = new StoreValidator();
return validator.validate(raw);
}
/**
* Validate multiple store configs
*/
export function validateStoreConfigs(raws: RawStoreConfig[]): {
valid: ValidatedStoreConfig[];
invalid: { raw: RawStoreConfig; errors: ValidationError[] }[];
warnings: { storeId: number; warnings: ValidationWarning[] }[];
} {
const valid: ValidatedStoreConfig[] = [];
const invalid: { raw: RawStoreConfig; errors: ValidationError[] }[] = [];
const warnings: { storeId: number; warnings: ValidationWarning[] }[] = [];
for (const raw of raws) {
const result = validateStoreConfig(raw);
if (result.isValid && result.config) {
valid.push(result.config);
if (result.warnings.length > 0) {
warnings.push({ storeId: raw.id, warnings: result.warnings });
}
} else {
invalid.push({ raw, errors: result.errors });
}
}
return { valid, invalid, warnings };
}
/**
* Quick check if a store is crawlable
*/
export function isCrawlable(raw: RawStoreConfig): boolean {
return !!(
raw.id &&
raw.name &&
raw.platformDispensaryId &&
raw.menuType &&
raw.menuType !== 'unknown' &&
raw.crawlStatus !== 'failed' &&
raw.crawlStatus !== 'paused'
);
}
/**
* Get reason why store is not crawlable
*/
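export function getNotCrawlableReason(raw: RawStoreConfig): string | null {
// Mirror the id/name checks in isCrawlable so the two functions agree
if (!raw.id) {
return 'Missing store ID';
}
if (!raw.name) {
return 'Missing store name';
}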
if (!raw.platformDispensaryId) {
return 'Missing platform_dispensary_id';
}
if (!raw.menuType || raw.menuType === 'unknown') {
return 'Menu type not detected';
}
if (raw.crawlStatus === 'failed') {
return 'Store is marked as failed';
}
if (raw.crawlStatus === 'paused') {
return 'Crawling is paused';
}
return null;
}
// ============================================================
// SINGLETON INSTANCE
// ============================================================
export const storeValidator = new StoreValidator();
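
The bounds handling in `validateNumeric` above (apply a default when the value is missing, clamp to the minimum or maximum otherwise, and record a warning each time) can be sketched in isolation. This is a minimal illustration, not the module's code: `clampWithWarning` and `ClampWarning` are hypothetical names, and the warnings are collected in a plain array passed by the caller rather than on a validator instance. The numbers are the `crawlFrequencyMinutes` default and bounds from the source (240, 60, 1440).

```typescript
// Hypothetical standalone sketch of the default-and-clamp pattern with warnings.
interface ClampWarning {
  field: string;
  message: string;
  appliedDefault: number;
}

function clampWithWarning(
  value: number | undefined,
  field: string,
  defaultValue: number,
  min: number,
  max: number,
  warnings: ClampWarning[]
): number {
  if (value === undefined) {
    warnings.push({ field, message: `Missing, defaulting to ${defaultValue}`, appliedDefault: defaultValue });
    return defaultValue;
  }
  if (value < min) {
    warnings.push({ field, message: `Value ${value} below minimum ${min}, using minimum`, appliedDefault: min });
    return min;
  }
  if (value > max) {
    warnings.push({ field, message: `Value ${value} above maximum ${max}, using maximum`, appliedDefault: max });
    return max;
  }
  // In range: returned unchanged, no warning recorded
  return value;
}

const warnings: ClampWarning[] = [];
const field = 'crawlFrequencyMinutes';
const results = [
  clampWithWarning(undefined, field, 240, 60, 1440, warnings), // missing
  clampWithWarning(15, field, 240, 60, 1440, warnings),        // too low
  clampWithWarning(5000, field, 240, 60, 1440, warnings),      // too high
  clampWithWarning(480, field, 240, 60, 1440, warnings),       // in range
];
console.log(results, warnings.map((w) => w.message));
```

Only the first three calls record a warning. A store configured to crawl every 15 minutes is raised to the 60-minute floor without being rejected.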


@@ -1,750 +0,0 @@
/**
* Worker Service
*
* Polls the job queue and processes crawl jobs.
* Each worker instance runs independently, claiming jobs atomically.
*
* Phase 1: Enhanced with self-healing logic, error taxonomy, and retry management.
*/
import {
claimNextJob,
completeJob,
failJob,
updateJobProgress,
heartbeat,
getWorkerId,
getWorkerHostname,
recoverStaleJobs,
QueuedJob,
} from './job-queue';
import { crawlDispensaryProducts } from './product-crawler';
import { mapDbRowToDispensary } from './discovery';
import { query } from '../db/connection';
// Phase 1: Error taxonomy and retry management
import {
CrawlErrorCode,
CrawlErrorCodeType,
classifyError,
isRetryable,
shouldRotateProxy,
shouldRotateUserAgent,
createSuccessResult,
createFailureResult,
CrawlResult,
} from './error-taxonomy';
import {
RetryManager,
RetryDecision,
calculateNextCrawlAt,
determineCrawlStatus,
shouldAttemptRecovery,
sleep,
} from './retry-manager';
import {
CrawlRotator,
userAgentRotator,
updateDispensaryRotation,
} from './proxy-rotator';
import { DEFAULT_CONFIG, validateStoreConfig, isCrawlable } from './store-validator';
// Use shared dispensary columns (handles optional columns like provider_detection_data)
// NOTE: Using WITH_FAILED variant for worker compatibility checks
import { DISPENSARY_COLUMNS_WITH_FAILED as DISPENSARY_COLUMNS } from '../db/dispensary-columns';
// ============================================================
// WORKER CONFIG
// ============================================================
const POLL_INTERVAL_MS = 5000; // Check for jobs every 5 seconds
const HEARTBEAT_INTERVAL_MS = 60000; // Send heartbeat every 60 seconds
const STALE_CHECK_INTERVAL_MS = 300000; // Check for stale jobs every 5 minutes
const SHUTDOWN_GRACE_PERIOD_MS = 30000; // Wait 30s for job to complete on shutdown
// ============================================================
// WORKER STATE
// ============================================================
let isRunning = false;
let currentJob: QueuedJob | null = null;
let pollTimer: NodeJS.Timeout | null = null;
let heartbeatTimer: NodeJS.Timeout | null = null;
let staleCheckTimer: NodeJS.Timeout | null = null;
let shutdownPromise: Promise<void> | null = null;
// ============================================================
// WORKER LIFECYCLE
// ============================================================
/**
* Start the worker
*/
export async function startWorker(): Promise<void> {
if (isRunning) {
console.log('[Worker] Already running');
return;
}
const workerId = getWorkerId();
const hostname = getWorkerHostname();
console.log(`[Worker] Starting worker ${workerId} on ${hostname}`);
isRunning = true;
// Set up graceful shutdown
setupShutdownHandlers();
// Start polling for jobs
pollTimer = setInterval(pollForJobs, POLL_INTERVAL_MS);
// Start stale job recovery (only one worker should do this, but it's idempotent)
staleCheckTimer = setInterval(async () => {
try {
await recoverStaleJobs(15);
} catch (error) {
console.error('[Worker] Error recovering stale jobs:', error);
}
}, STALE_CHECK_INTERVAL_MS);
// Immediately poll for a job
await pollForJobs();
console.log(`[Worker] Worker ${workerId} started, polling every ${POLL_INTERVAL_MS}ms`);
}
/**
* Stop the worker gracefully
*/
export async function stopWorker(): Promise<void> {
if (!isRunning) return;
console.log('[Worker] Stopping worker...');
isRunning = false;
// Clear timers
if (pollTimer) {
clearInterval(pollTimer);
pollTimer = null;
}
if (heartbeatTimer) {
clearInterval(heartbeatTimer);
heartbeatTimer = null;
}
if (staleCheckTimer) {
clearInterval(staleCheckTimer);
staleCheckTimer = null;
}
// Wait for current job to complete
if (currentJob) {
console.log(`[Worker] Waiting for job ${currentJob.id} to complete...`);
const startWait = Date.now();
while (currentJob && Date.now() - startWait < SHUTDOWN_GRACE_PERIOD_MS) {
await new Promise(r => setTimeout(r, 1000));
}
if (currentJob) {
console.log(`[Worker] Job ${currentJob.id} did not complete in time, marking for retry`);
await failJob(currentJob.id, 'Worker shutdown');
}
}
console.log('[Worker] Worker stopped');
}
/**
* Get worker status
*/
export function getWorkerStatus(): {
isRunning: boolean;
workerId: string;
hostname: string;
currentJob: QueuedJob | null;
} {
return {
isRunning,
workerId: getWorkerId(),
hostname: getWorkerHostname(),
currentJob,
};
}
// ============================================================
// JOB PROCESSING
// ============================================================
// Guards against overlapping polls: the interval timer can fire while a previous
// poll is still awaiting claimNextJob, before currentJob has been set.
let isPolling = false;
/**
* Poll for and process the next available job
*/
async function pollForJobs(): Promise<void> {
if (!isRunning || currentJob || isPolling) {
return; // Worker stopped, poll already in flight, or a job is being processed
}
isPolling = true;
try {
const workerId = getWorkerId();
// Try to claim a job
const job = await claimNextJob({
workerId,
jobTypes: ['dutchie_product_crawl', 'menu_detection', 'menu_detection_single'],
lockDurationMinutes: 30,
});
if (!job) {
return; // No jobs available
}
currentJob = job;
console.log(`[Worker] Processing job ${job.id} (type=${job.jobType}, dispensary=${job.dispensaryId})`);
// Start heartbeat for this job
heartbeatTimer = setInterval(async () => {
if (currentJob) {
try {
await heartbeat(currentJob.id);
} catch (error) {
console.error('[Worker] Heartbeat error:', error);
}
}
}, HEARTBEAT_INTERVAL_MS);
// Process the job
await processJob(job);
} catch (error: any) {
console.error('[Worker] Error polling for jobs:', error);
if (currentJob) {
try {
await failJob(currentJob.id, error.message);
} catch (failError) {
console.error('[Worker] Error failing job:', failError);
}
}
} finally {
// Clear heartbeat timer
if (heartbeatTimer) {
clearInterval(heartbeatTimer);
heartbeatTimer = null;
}
currentJob = null;
isPolling = false;
}
}
/**
* Process a single job
*/
async function processJob(job: QueuedJob): Promise<void> {
try {
switch (job.jobType) {
case 'dutchie_product_crawl':
await processProductCrawlJob(job);
break;
case 'menu_detection':
await processMenuDetectionJob(job);
break;
case 'menu_detection_single':
await processSingleDetectionJob(job);
break;
default:
throw new Error(`Unknown job type: ${job.jobType}`);
}
} catch (error: any) {
console.error(`[Worker] Job ${job.id} failed:`, error);
await failJob(job.id, error.message);
}
}
// Thresholds for crawl status transitions
const DEGRADED_THRESHOLD = 3; // Mark as degraded after 3 consecutive failures
const FAILED_THRESHOLD = 10; // Mark as failed after 10 consecutive failures
// For backwards compatibility
const MAX_CONSECUTIVE_FAILURES = FAILED_THRESHOLD;
/**
* Record a successful crawl - resets failure counter and restores active status
*/
async function recordCrawlSuccess(
dispensaryId: number,
result: CrawlResult
): Promise<void> {
// Calculate next crawl time (use store's frequency or default)
const { rows: storeRows } = await query<any>(
`SELECT crawl_frequency_minutes FROM dispensaries WHERE id = $1`,
[dispensaryId]
);
const frequencyMinutes = storeRows[0]?.crawl_frequency_minutes || DEFAULT_CONFIG.crawlFrequencyMinutes;
const nextCrawlAt = calculateNextCrawlAt(0, frequencyMinutes);
// Reset failure state and schedule next crawl
await query(
`UPDATE dispensaries
SET consecutive_failures = 0,
crawl_status = 'active',
backoff_multiplier = 1.0,
last_crawl_at = NOW(),
last_success_at = NOW(),
last_error_code = NULL,
next_crawl_at = $2,
total_attempts = COALESCE(total_attempts, 0) + 1,
total_successes = COALESCE(total_successes, 0) + 1,
updated_at = NOW()
WHERE id = $1`,
[dispensaryId, nextCrawlAt]
);
// Log to crawl_attempts table for analytics
await logCrawlAttempt(dispensaryId, result);
console.log(`[Worker] Dispensary ${dispensaryId} crawl success. Next crawl at ${nextCrawlAt.toISOString()}`);
}
/**
* Record a crawl failure with self-healing logic
* - Rotates proxy/UA based on error type
* - Transitions through: active -> degraded -> failed
* - Calculates backoff for next attempt
*/
async function recordCrawlFailure(
dispensaryId: number,
errorMessage: string,
errorCode?: CrawlErrorCodeType,
httpStatus?: number,
context?: {
proxyUsed?: string;
userAgentUsed?: string;
attemptNumber?: number;
}
): Promise<{ wasFlagged: boolean; newStatus: string; shouldRotateProxy: boolean; shouldRotateUA: boolean }> {
// Classify the error if not provided
const code = errorCode || classifyError(errorMessage, httpStatus);
// Get current state
const { rows: storeRows } = await query<any>(
`SELECT
consecutive_failures,
crawl_status,
backoff_multiplier,
crawl_frequency_minutes,
current_proxy_id,
current_user_agent
FROM dispensaries WHERE id = $1`,
[dispensaryId]
);
if (storeRows.length === 0) {
return { wasFlagged: false, newStatus: 'unknown', shouldRotateProxy: false, shouldRotateUA: false };
}
const store = storeRows[0];
const currentFailures = (store.consecutive_failures || 0) + 1;
const frequencyMinutes = store.crawl_frequency_minutes || DEFAULT_CONFIG.crawlFrequencyMinutes;
// Determine if we should rotate proxy/UA based on error type
const rotateProxy = shouldRotateProxy(code);
const rotateUA = shouldRotateUserAgent(code);
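// Get a new user agent if rotation is needed.
// Note: the proxy ID is written back unchanged here; the need for proxy rotation
// is only reported to the caller through the returned shouldRotateProxy flag.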
let newProxyId = store.current_proxy_id;
let newUserAgent = store.current_user_agent;
if (rotateUA) {
newUserAgent = userAgentRotator.getNext();
console.log(`[Worker] Rotating user agent for dispensary ${dispensaryId} after ${code}`);
}
// Determine new crawl status
const newStatus = determineCrawlStatus(currentFailures, {
degraded: DEGRADED_THRESHOLD,
failed: FAILED_THRESHOLD,
});
// Calculate backoff multiplier and next crawl time
const newBackoffMultiplier = Math.min(
(store.backoff_multiplier || 1.0) * 1.5,
4.0 // Max 4x backoff
);
const nextCrawlAt = calculateNextCrawlAt(currentFailures, frequencyMinutes);
// Update dispensary with new failure state
if (newStatus === 'failed') {
// Mark as failed - won't be crawled again until manual intervention
await query(
`UPDATE dispensaries
SET consecutive_failures = $2,
crawl_status = $3,
backoff_multiplier = $4,
last_failure_at = NOW(),
last_error_code = $5,
failed_at = NOW(),
failure_notes = $6,
next_crawl_at = NULL,
current_proxy_id = $7,
current_user_agent = $8,
total_attempts = COALESCE(total_attempts, 0) + 1,
updated_at = NOW()
WHERE id = $1`,
[
dispensaryId,
currentFailures,
newStatus,
newBackoffMultiplier,
code,
`Auto-flagged after ${currentFailures} consecutive failures. Last error: ${errorMessage}`,
newProxyId,
newUserAgent,
]
);
console.log(`[Worker] Dispensary ${dispensaryId} marked as FAILED after ${currentFailures} failures (${code})`);
} else {
// Update failure count but keep crawling (active or degraded)
await query(
`UPDATE dispensaries
SET consecutive_failures = $2,
crawl_status = $3,
backoff_multiplier = $4,
last_failure_at = NOW(),
last_error_code = $5,
next_crawl_at = $6,
current_proxy_id = $7,
current_user_agent = $8,
total_attempts = COALESCE(total_attempts, 0) + 1,
updated_at = NOW()
WHERE id = $1`,
[
dispensaryId,
currentFailures,
newStatus,
newBackoffMultiplier,
code,
nextCrawlAt,
newProxyId,
newUserAgent,
]
);
if (newStatus === 'degraded') {
console.log(`[Worker] Dispensary ${dispensaryId} marked as DEGRADED (${currentFailures}/${FAILED_THRESHOLD} failures). Next crawl: ${nextCrawlAt.toISOString()}`);
} else {
console.log(`[Worker] Dispensary ${dispensaryId} failure recorded (${currentFailures}/${DEGRADED_THRESHOLD}). Next crawl: ${nextCrawlAt.toISOString()}`);
}
}
// Log to crawl_attempts table
const result = createFailureResult(
dispensaryId,
new Date(),
errorMessage,
httpStatus,
context
);
await logCrawlAttempt(dispensaryId, result);
return {
wasFlagged: newStatus === 'failed',
newStatus,
shouldRotateProxy: rotateProxy,
shouldRotateUA: rotateUA,
};
}
/**
* Log a crawl attempt to the crawl_attempts table for analytics
*/
async function logCrawlAttempt(
dispensaryId: number,
result: CrawlResult
): Promise<void> {
try {
await query(
`INSERT INTO crawl_attempts (
dispensary_id, started_at, finished_at, duration_ms,
error_code, error_message, http_status,
attempt_number, proxy_used, user_agent_used,
products_found, products_upserted, snapshots_created,
created_at
) VALUES ($1, $2, $3, $4, $5, $6, $7, $8, $9, $10, $11, $12, $13, NOW())`,
[
dispensaryId,
result.startedAt,
result.finishedAt,
result.durationMs,
result.errorCode,
result.errorMessage || null,
result.httpStatus || null,
result.attemptNumber,
result.proxyUsed || null,
result.userAgentUsed || null,
result.productsFound || 0,
result.productsUpserted || 0,
result.snapshotsCreated || 0,
]
);
} catch (error) {
// Don't fail the job if logging fails
console.error(`[Worker] Failed to log crawl attempt for dispensary ${dispensaryId}:`, error);
}
}
/**
* Process a product crawl job for a single dispensary
*/
async function processProductCrawlJob(job: QueuedJob): Promise<void> {
const startedAt = new Date();
const userAgent = userAgentRotator.getCurrent();
if (!job.dispensaryId) {
throw new Error('Product crawl job requires dispensary_id');
}
// Get dispensary details
const { rows } = await query<any>(
`SELECT ${DISPENSARY_COLUMNS} FROM dispensaries WHERE id = $1`,
[job.dispensaryId]
);
if (rows.length === 0) {
throw new Error(`Dispensary ${job.dispensaryId} not found`);
}
const dispensary = mapDbRowToDispensary(rows[0]);
const rawDispensary = rows[0];
// Check if dispensary is already flagged as failed
if (rawDispensary.failed_at) {
console.log(`[Worker] Skipping dispensary ${job.dispensaryId} - already flagged as failed`);
await completeJob(job.id, { productsFound: 0, productsUpserted: 0 });
return;
}
// Check crawl status - skip if paused or failed
if (rawDispensary.crawl_status === 'paused' || rawDispensary.crawl_status === 'failed') {
console.log(`[Worker] Skipping dispensary ${job.dispensaryId} - crawl_status is ${rawDispensary.crawl_status}`);
await completeJob(job.id, { productsFound: 0, productsUpserted: 0 });
return;
}
if (!dispensary.platformDispensaryId) {
// Record failure with error taxonomy
const { wasFlagged } = await recordCrawlFailure(
job.dispensaryId,
'Missing platform_dispensary_id',
CrawlErrorCode.MISSING_PLATFORM_ID,
undefined,
{ userAgentUsed: userAgent, attemptNumber: job.retryCount + 1 }
);
if (wasFlagged) {
await completeJob(job.id, { productsFound: 0, productsUpserted: 0 });
return;
}
throw new Error(`Dispensary ${job.dispensaryId} has no platform_dispensary_id`);
}
// Get crawl options from job metadata
const pricingType = job.metadata?.pricingType || 'rec';
const useBothModes = job.metadata?.useBothModes !== false;
try {
// Crawl the dispensary
const result = await crawlDispensaryProducts(dispensary, pricingType, {
useBothModes,
onProgress: async (progress) => {
// Update progress for live monitoring
await updateJobProgress(job.id, {
productsFound: progress.productsFound,
productsUpserted: progress.productsUpserted,
snapshotsCreated: progress.snapshotsCreated,
currentPage: progress.currentPage,
totalPages: progress.totalPages,
});
},
});
if (result.success) {
// Success! Create result and record
const crawlResult = createSuccessResult(
job.dispensaryId,
startedAt,
{
productsFound: result.productsFetched,
productsUpserted: result.productsUpserted,
snapshotsCreated: result.snapshotsCreated,
},
{
attemptNumber: job.retryCount + 1,
userAgentUsed: userAgent,
}
);
await recordCrawlSuccess(job.dispensaryId, crawlResult);
await completeJob(job.id, {
productsFound: result.productsFetched,
productsUpserted: result.productsUpserted,
snapshotsCreated: result.snapshotsCreated,
// Visibility tracking stats for dashboard
visibilityLostCount: result.visibilityLostCount || 0,
visibilityRestoredCount: result.visibilityRestoredCount || 0,
});
} else {
// Crawl returned failure - classify error and record
const errorCode = classifyError(result.errorMessage || 'Crawl failed', result.httpStatus);
const { wasFlagged } = await recordCrawlFailure(
job.dispensaryId,
result.errorMessage || 'Crawl failed',
errorCode,
result.httpStatus,
{ userAgentUsed: userAgent, attemptNumber: job.retryCount + 1 }
);
if (wasFlagged) {
// Dispensary is now flagged - complete the job
await completeJob(job.id, { productsFound: 0, productsUpserted: 0 });
} else if (!isRetryable(errorCode)) {
// Non-retryable error - complete as failed
await completeJob(job.id, { productsFound: 0, productsUpserted: 0 });
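} else {
// Retryable error - already recorded above; mark it so the catch block rethrows without recording it twice
throw Object.assign(new Error(result.errorMessage || 'Crawl failed'), { failureRecorded: true });
}
}
} catch (error: any) {
// Failure was already recorded in the try block - let the job queue handle the retry
if (error.failureRecorded) throw error;
// Record the failure with error taxonomy
const errorCode = classifyError(error.message);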
const { wasFlagged } = await recordCrawlFailure(
job.dispensaryId,
error.message,
errorCode,
undefined,
{ userAgentUsed: userAgent, attemptNumber: job.retryCount + 1 }
);
if (wasFlagged) {
// Dispensary is now flagged - complete the job
await completeJob(job.id, { productsFound: 0, productsUpserted: 0 });
} else if (!isRetryable(errorCode)) {
// Non-retryable error - complete as failed
await completeJob(job.id, { productsFound: 0, productsUpserted: 0 });
} else {
throw error;
}
}
}
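
The failure handling in processProductCrawlJob comes down to a three-way decision: a flagged dispensary completes the job, a non-retryable error completes the job, and everything else is rethrown for the queue to retry. A minimal sketch of that decision as a pure function, assuming a hypothetical `resolveFailure` helper and `FailureAction` type that are not part of this worker:

```typescript
// Hypothetical helper for illustration only; it does not exist in the worker module.
type FailureAction = 'complete' | 'retry';

function resolveFailure(wasFlagged: boolean, retryable: boolean): FailureAction {
  // A flagged dispensary or a non-retryable error ends the job with zero counts.
  if (wasFlagged || !retryable) return 'complete';
  // Only a retryable error on an unflagged dispensary goes back to the queue.
  return 'retry';
}

console.log(resolveFailure(true, true));   // complete
console.log(resolveFailure(false, false)); // complete
console.log(resolveFailure(false, true));  // retry
```

Keeping the decision in one place would let both the `result.success === false` branch and the `catch` branch share it, but that is a design suggestion rather than what the code above does.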
/**
* Process a menu detection job (bulk)
*/
async function processMenuDetectionJob(job: QueuedJob): Promise<void> {
const { executeMenuDetectionJob } = await import('./menu-detection');
const config = job.metadata || {};
const result = await executeMenuDetectionJob(config);
if (result.status === 'error') {
throw new Error(result.errorMessage || 'Menu detection failed');
}
await completeJob(job.id, {
productsFound: result.itemsProcessed,
productsUpserted: result.itemsSucceeded,
});
}
/**
* Process a single dispensary menu detection job
* This is the parallelizable version - each worker can detect one dispensary at a time
*/
async function processSingleDetectionJob(job: QueuedJob): Promise<void> {
if (!job.dispensaryId) {
throw new Error('Single detection job requires dispensary_id');
}
const { detectAndResolveDispensary } = await import('./menu-detection');
// Get dispensary details
const { rows } = await query<any>(
`SELECT ${DISPENSARY_COLUMNS} FROM dispensaries WHERE id = $1`,
[job.dispensaryId]
);
if (rows.length === 0) {
throw new Error(`Dispensary ${job.dispensaryId} not found`);
}
const dispensary = rows[0];
// Skip if already detected or failed
if (dispensary.failed_at) {
console.log(`[Worker] Skipping dispensary ${job.dispensaryId} - already flagged as failed`);
await completeJob(job.id, { productsFound: 0, productsUpserted: 0 });
return;
}
if (dispensary.menu_type && dispensary.menu_type !== 'unknown') {
console.log(`[Worker] Skipping dispensary ${job.dispensaryId} - already detected as ${dispensary.menu_type}`);
await completeJob(job.id, { productsFound: 0, productsUpserted: 1 });
return;
}
console.log(`[Worker] Detecting menu for dispensary ${job.dispensaryId} (${dispensary.name})...`);
try {
const result = await detectAndResolveDispensary(job.dispensaryId);
if (result.success) {
console.log(`[Worker] Dispensary ${job.dispensaryId}: detected ${result.detectedProvider}, platformId=${result.platformDispensaryId || 'none'}`);
await completeJob(job.id, {
productsFound: 1,
productsUpserted: result.platformDispensaryId ? 1 : 0,
});
} else {
// Detection failed - throw so the catch block records the failure exactly once
throw new Error(result.error || 'Detection failed');
}
} catch (error: any) {
// Record the failure; recordCrawlFailure resolves to an object, so destructure wasFlagged
const { wasFlagged } = await recordCrawlFailure(job.dispensaryId, error.message);
if (wasFlagged) {
// Dispensary is now flagged - complete the job rather than fail it
await completeJob(job.id, { productsFound: 0, productsUpserted: 0 });
} else {
throw error;
}
}
}
// ============================================================
// SHUTDOWN HANDLING
// ============================================================
function setupShutdownHandlers(): void {
const shutdown = async (signal: string) => {
if (shutdownPromise) return shutdownPromise;
console.log(`\n[Worker] Received ${signal}, shutting down...`);
shutdownPromise = stopWorker();
await shutdownPromise;
process.exit(0);
};
process.on('SIGTERM', () => shutdown('SIGTERM'));
process.on('SIGINT', () => shutdown('SIGINT'));
}
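
The shutdown handler relies on a shared promise so that a second signal (for example SIGINT arriving after SIGTERM) does not start a second shutdown. A minimal sketch of that pattern in isolation, assuming a stand-in `stopWorker` and a hypothetical `shutdownOnce` wrapper; `stopCalls` exists only to make the behaviour observable:

```typescript
// Stand-ins for illustration; the real stopWorker lives elsewhere in this module.
let shutdownPromise: Promise<void> | null = null;
let stopCalls = 0;

async function stopWorker(): Promise<void> {
  stopCalls += 1; // counts how many times shutdown work actually started
}

function shutdownOnce(): Promise<void> {
  // Reuse the in-flight promise so repeated signals share one shutdown.
  if (!shutdownPromise) {
    shutdownPromise = stopWorker();
  }
  return shutdownPromise;
}

const first = shutdownOnce();
const second = shutdownOnce();
console.log(first === second, stopCalls); // true 1
```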
// ============================================================
// STANDALONE WORKER ENTRY POINT
// ============================================================
if (require.main === module) {
// Run as standalone worker
startWorker().catch((error) => {
console.error('[Worker] Fatal error:', error);
process.exit(1);
});
}


@@ -1,751 +0,0 @@
/**
* Dutchie AZ Data Types
*
* Complete TypeScript interfaces for the isolated Dutchie Arizona data pipeline.
* These types map directly to Dutchie's GraphQL FilteredProducts response.
*/
// ============================================================
// GRAPHQL RESPONSE TYPES (from Dutchie API)
// ============================================================
/**
* Raw Dutchie brand object from GraphQL
*/
export interface DutchieBrand {
id: string;
_id?: string;
name: string;
parentBrandId?: string;
imageUrl?: string;
description?: string;
__typename?: string;
}
/**
* Raw Dutchie image object from GraphQL
*/
export interface DutchieImage {
url: string;
description?: string;
active?: boolean;
__typename?: string;
}
/**
* POSMetaData.children - option-level inventory/pricing
*/
export interface DutchiePOSChild {
activeBatchTags?: any;
canonicalBrandId?: string;
canonicalBrandName?: string;
canonicalCategory?: string;
canonicalCategoryId?: string;
canonicalEffectivePotencyMg?: number;
canonicalID?: string;
canonicalPackageId?: string;
canonicalImgUrl?: string;
canonicalLabResultUrl?: string;
canonicalName?: string;
canonicalSKU?: string;
canonicalProductTags?: string[];
canonicalStrainId?: string;
canonicalVendorId?: string;
kioskQuantityAvailable?: number;
medPrice?: number;
option?: string;
packageQuantity?: number;
price?: number;
quantity?: number;
quantityAvailable?: number;
recEquivalent?: number;
recPrice?: number;
standardEquivalent?: number;
__typename?: string;
}
/**
* POSMetaData object from GraphQL
*/
export interface DutchiePOSMetaData {
activeBatchTags?: any;
canonicalBrandId?: string;
canonicalBrandName?: string;
canonicalCategory?: string;
canonicalCategoryId?: string;
canonicalID?: string;
canonicalPackageId?: string;
canonicalImgUrl?: string;
canonicalLabResultUrl?: string;
canonicalName?: string;
canonicalProductTags?: string[];
canonicalSKU?: string;
canonicalStrainId?: string;
canonicalVendorId?: string;
children?: DutchiePOSChild[];
integrationID?: string;
__typename?: string;
}
/**
* THC/CBD Content structure
*/
export interface DutchiePotencyContent {
unit?: string;
range?: number[];
}
/**
* CannabinoidV2 structure
*/
export interface DutchieCannabinoidV2 {
value: number;
unit: string;
cannabinoid: {
name: string;
};
}
/**
* Special data structure
*/
export interface DutchieSpecialData {
saleSpecials?: Array<{
specialId: string;
specialName: string;
discount: number;
percentDiscount: boolean;
dollarDiscount: boolean;
specialType: string;
}>;
bogoSpecials?: any;
}
/**
* Complete raw product from Dutchie GraphQL FilteredProducts
*/
export interface DutchieRawProduct {
_id: string;
id?: string;
AdditionalOptions?: any;
duplicatedProductId?: string;
libraryProductId?: string;
libraryProductScore?: number;
// Brand
brand?: DutchieBrand;
brandId?: string;
brandName?: string;
brandLogo?: string;
// Potency
CBD?: number;
CBDContent?: DutchiePotencyContent;
THC?: number;
THCContent?: DutchiePotencyContent;
cannabinoidsV2?: DutchieCannabinoidV2[];
// Flags
certificateOfAnalysisEnabled?: boolean;
collectionCardBadge?: string;
comingSoon?: boolean;
featured?: boolean;
medicalOnly?: boolean;
recOnly?: boolean;
nonArmsLength?: boolean;
vapeTaxApplicable?: boolean;
useBetterPotencyTaxes?: boolean;
// Timestamps
createdAt?: string;
updatedAt?: string;
// Dispensary
DispensaryID: string;
enterpriseProductId?: string;
// Images
Image?: string;
images?: DutchieImage[];
// Measurements
measurements?: {
netWeight?: {
unit: string;
values: number[];
};
volume?: any;
};
weight?: number | string;
// Product identity
Name: string;
cName: string;
pastCNames?: string[];
// Options
Options?: string[];
rawOptions?: string[];
limitsPerCustomer?: any;
manualInventory?: boolean;
// POS data
POSMetaData?: DutchiePOSMetaData;
// Pricing
Prices?: number[];
recPrices?: number[];
medicalPrices?: number[];
recSpecialPrices?: number[];
medicalSpecialPrices?: number[];
wholesalePrices?: number[];
pricingTierData?: any;
specialIdsPerOption?: any;
// Specials
special?: boolean;
specialData?: DutchieSpecialData;
// Classification
Status?: string;
strainType?: string;
subcategory?: string;
type?: string;
provider?: string;
effects?: Record<string, any>;
// Threshold flags
isBelowThreshold?: boolean;
isBelowKioskThreshold?: boolean;
optionsBelowThreshold?: boolean;
optionsBelowKioskThreshold?: boolean;
// Misc
bottleDepositTaxCents?: number;
__typename?: string;
}
// ============================================================
// DERIVED TYPES
// ============================================================
/**
* StockStatus - derived from POSMetaData.children quantityAvailable
* - 'in_stock': At least one option has quantityAvailable > 0
* - 'out_of_stock': All options have quantityAvailable === 0
* - 'unknown': No POSMetaData.children or quantityAvailable data
* - 'missing_from_feed': Product was not present in the latest crawl feed
*/
export type StockStatus = 'in_stock' | 'out_of_stock' | 'unknown' | 'missing_from_feed';
/**
* CrawlMode - defines how products are fetched from Dutchie
* - 'mode_a': UI parity - Status: 'Active', threshold removal ON
* - 'mode_b': MAX COVERAGE - No Status filter, bypass thresholds
*/
export type CrawlMode = 'mode_a' | 'mode_b';
/**
* Per-option stock status type
*/
export type OptionStockStatus = 'in_stock' | 'out_of_stock' | 'unknown';
/**
* Get available quantity for a single option
* Priority: quantityAvailable > kioskQuantityAvailable > quantity
*/
export function getOptionQuantity(child: DutchiePOSChild): number | null {
if (typeof child.quantityAvailable === 'number') return child.quantityAvailable;
if (typeof child.kioskQuantityAvailable === 'number') return child.kioskQuantityAvailable;
if (typeof child.quantity === 'number') return child.quantity;
return null; // No quantity data available
}
/**
* Derive stock status for a single option
* Returns: 'in_stock' if qty > 0, 'out_of_stock' if qty === 0, 'unknown' if no data
*/
export function deriveOptionStockStatus(child: DutchiePOSChild): OptionStockStatus {
const qty = getOptionQuantity(child);
if (qty === null) return 'unknown';
return qty > 0 ? 'in_stock' : 'out_of_stock';
}
/**
* Derive product-level stock status from POSMetaData.children
*
* Logic per spec:
* - If ANY child is "in_stock" → product is "in_stock"
* - Else if ALL children are "out_of_stock" → product is "out_of_stock"
* - Else → product is "unknown"
*
* IMPORTANT: Threshold flags (isBelowThreshold, etc.) do NOT override stock status.
* They only indicate "low stock" - if qty > 0, status stays "in_stock".
*/
export function deriveStockStatus(product: DutchieRawProduct): StockStatus {
const children = product.POSMetaData?.children;
// No children data - unknown
if (!children || children.length === 0) {
return 'unknown';
}
// Get stock status for each option
const optionStatuses = children.map(deriveOptionStockStatus);
// If ANY option is in_stock → product is in_stock
if (optionStatuses.some(status => status === 'in_stock')) {
return 'in_stock';
}
// If ALL options are out_of_stock → product is out_of_stock
if (optionStatuses.every(status => status === 'out_of_stock')) {
return 'out_of_stock';
}
// Otherwise (mix of out_of_stock and unknown) → unknown
return 'unknown';
}
/**
* Calculate total quantity available across all options
* Returns null if no children data (unknown inventory), 0 if children exist but all have 0 qty
*/
export function calculateTotalQuantity(product: DutchieRawProduct): number | null {
const children = product.POSMetaData?.children;
// No children = unknown inventory, return null (NOT 0)
if (!children || children.length === 0) return null;
// Check if any child has quantity data
const hasAnyQtyData = children.some(child => getOptionQuantity(child) !== null);
if (!hasAnyQtyData) return null; // All children lack qty data = unknown
return children.reduce((sum, child) => {
const qty = getOptionQuantity(child);
return sum + (qty ?? 0);
}, 0);
}
/**
* Calculate total kiosk quantity available across all options
*/
export function calculateTotalKioskQuantity(product: DutchieRawProduct): number | null {
const children = product.POSMetaData?.children;
if (!children || children.length === 0) return null;
const hasAnyKioskQty = children.some(child => typeof child.kioskQuantityAvailable === 'number');
if (!hasAnyKioskQty) return null;
return children.reduce((sum, child) => sum + (child.kioskQuantityAvailable ?? 0), 0);
}
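
To make the stock rules above concrete, here is a self-contained restatement with sample inputs. The `Child` and `Status` types and the `optionQty` and `productStatus` names are trimmed-down stand-ins for `DutchiePOSChild`, `getOptionQuantity`, and `deriveStockStatus`, and the sample quantities are made up for illustration:

```typescript
// Illustrative stand-ins; field names mirror DutchiePOSChild, sample data is invented.
interface Child {
  quantityAvailable?: number;
  kioskQuantityAvailable?: number;
  quantity?: number;
}
type Status = 'in_stock' | 'out_of_stock' | 'unknown';

// Priority: quantityAvailable > kioskQuantityAvailable > quantity
function optionQty(c: Child): number | null {
  if (typeof c.quantityAvailable === 'number') return c.quantityAvailable;
  if (typeof c.kioskQuantityAvailable === 'number') return c.kioskQuantityAvailable;
  if (typeof c.quantity === 'number') return c.quantity;
  return null;
}

function productStatus(children: Child[] | undefined): Status {
  if (!children || children.length === 0) return 'unknown';
  const statuses = children.map((c): Status => {
    const qty = optionQty(c);
    if (qty === null) return 'unknown';
    return qty > 0 ? 'in_stock' : 'out_of_stock';
  });
  if (statuses.some((s) => s === 'in_stock')) return 'in_stock';
  if (statuses.every((s) => s === 'out_of_stock')) return 'out_of_stock';
  return 'unknown'; // mix of out_of_stock and unknown
}

const anyInStock = productStatus([{ quantityAvailable: 0 }, { quantityAvailable: 3 }]);
const allOut = productStatus([{ quantityAvailable: 0 }, { quantity: 0 }]);
const mixed = productStatus([{ quantityAvailable: 0 }, {}]);
const kioskFallback = productStatus([{ kioskQuantityAvailable: 2 }]);
console.log(anyInStock, allOut, mixed, kioskFallback);
// in_stock out_of_stock unknown in_stock
```

The third case is the one that is easy to get wrong: one option at zero plus one option with no quantity data is reported as unknown, not out_of_stock.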
// ============================================================
// DATABASE ENTITY TYPES
// ============================================================
/**
* Dispensary - represents a Dutchie store in Arizona
*/
export interface Dispensary {
id: number;
platform: 'dutchie';
name: string;
dbaName?: string;
slug: string;
city: string;
state: string;
postalCode?: string;
latitude?: number;
longitude?: number;
address?: string;
platformDispensaryId?: string; // Resolved internal ID (e.g., "6405ef617056e8014d79101b")
isDelivery?: boolean;
isPickup?: boolean;
rawMetadata?: any; // Full discovery node
lastCrawledAt?: Date;
productCount?: number;
createdAt: Date;
updatedAt: Date;
menuType?: string;
menuUrl?: string;
scrapeEnabled?: boolean;
providerDetectionData?: any;
platformDispensaryIdResolvedAt?: Date;
website?: string; // The dispensary's own website (from raw_metadata or direct column)
}
/**
* DutchieProduct - canonical product identity per store
*/
export interface DutchieProduct {
id: number;
dispensaryId: number;
platform: 'dutchie';
externalProductId: string; // from _id or id
platformDispensaryId: string; // mirror of Dispensary.platformDispensaryId
cName?: string; // cName / slug
name: string; // Name
// Brand
brandName?: string;
brandId?: string;
brandLogoUrl?: string;
// Classification
type?: string;
subcategory?: string;
strainType?: string;
provider?: string;
// Potency
thc?: number;
thcContent?: number;
cbd?: number;
cbdContent?: number;
cannabinoidsV2?: DutchieCannabinoidV2[];
effects?: Record<string, any>;
// Status / flags
status?: string;
medicalOnly: boolean;
recOnly: boolean;
featured: boolean;
comingSoon: boolean;
certificateOfAnalysisEnabled: boolean;
isBelowThreshold: boolean;
isBelowKioskThreshold: boolean;
optionsBelowThreshold: boolean;
optionsBelowKioskThreshold: boolean;
// Derived stock status (from POSMetaData.children quantityAvailable)
stockStatus: StockStatus;
totalQuantityAvailable?: number | null; // null = unknown (no children), 0 = all OOS
// Images
primaryImageUrl?: string;
images?: DutchieImage[];
// Misc
measurements?: any;
weight?: string;
pastCNames?: string[];
createdAtDutchie?: Date;
updatedAtDutchie?: Date;
latestRawPayload?: any; // Full product node from last crawl
createdAt: Date;
updatedAt: Date;
}
/**
* DutchieProductOptionSnapshot - child-level option data from POSMetaData.children
*/
export interface DutchieProductOptionSnapshot {
optionId: string; // canonicalID or canonicalPackageId or canonicalSKU
canonicalId?: string;
canonicalPackageId?: string;
canonicalSKU?: string;
canonicalName?: string;
canonicalCategory?: string;
canonicalCategoryId?: string;
canonicalBrandId?: string;
canonicalBrandName?: string;
canonicalStrainId?: string;
canonicalVendorId?: string;
optionLabel?: string; // from option field
packageQuantity?: number;
recEquivalent?: number;
standardEquivalent?: number;
priceCents?: number; // price * 100
recPriceCents?: number; // recPrice * 100
medPriceCents?: number; // medPrice * 100
quantity?: number;
quantityAvailable?: number;
kioskQuantityAvailable?: number;
activeBatchTags?: any;
canonicalImgUrl?: string;
canonicalLabResultUrl?: string;
canonicalEffectivePotencyMg?: number;
rawChildPayload?: any; // Full POSMetaData.children node
}
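
The `price * 100` comments above describe a dollars-to-cents conversion. Multiplying a decimal price by 100 in floating point can land slightly off the intended integer, so rounding is the usual safeguard. A minimal sketch, assuming a hypothetical `toCents` helper; the pipeline's actual conversion is not shown in this diff:

```typescript
// Hypothetical helper illustrating the "price * 100" comments; not taken from the pipeline.
function toCents(price: number | undefined): number | undefined {
  // Treat missing or non-finite prices as "no price" rather than 0 cents.
  if (typeof price !== 'number' || !Number.isFinite(price)) return undefined;
  // Round to the nearest cent to absorb floating-point error in the product.
  return Math.round(price * 100);
}

console.log(toCents(19.99));     // 1999
console.log(toCents(45));        // 4500
console.log(toCents(undefined)); // undefined
```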
/**
* DutchieProductSnapshot - per crawl, includes options[]
*/
export interface DutchieProductSnapshot {
id: number;
dutchieProductId: number;
dispensaryId: number;
platformDispensaryId: string;
externalProductId: string;
pricingType: 'rec' | 'med' | 'unknown';
crawlMode: CrawlMode; // Which crawl mode captured this snapshot
status?: string;
featured: boolean;
special: boolean;
medicalOnly: boolean;
recOnly: boolean;
// Flag indicating if product was present in feed (false = missing_from_feed snapshot)
isPresentInFeed: boolean;
// Derived stock status for this snapshot
stockStatus: StockStatus;
// Price summary (aggregated from children, in cents)
recMinPriceCents?: number;
recMaxPriceCents?: number;
recMinSpecialPriceCents?: number;
medMinPriceCents?: number;
medMaxPriceCents?: number;
medMinSpecialPriceCents?: number;
wholesaleMinPriceCents?: number;
// Inventory summary (aggregated from POSMetaData.children)
totalQuantityAvailable?: number | null; // null = unknown (no children), 0 = all OOS
totalKioskQuantityAvailable?: number | null;
manualInventory: boolean;
isBelowThreshold: boolean;
isBelowKioskThreshold: boolean;
// Option-level data
options: DutchieProductOptionSnapshot[];
// Full raw product node at this crawl time
rawPayload: any;
crawledAt: Date;
createdAt: Date;
updatedAt: Date;
}
/**
* CrawlJob - tracks crawl execution status
*/
export interface CrawlJob {
id: number;
jobType: 'discovery' | 'product_crawl' | 'resolve_ids';
dispensaryId?: number;
status: 'pending' | 'running' | 'completed' | 'failed';
startedAt?: Date;
completedAt?: Date;
errorMessage?: string;
productsFound?: number;
snapshotsCreated?: number;
metadata?: any;
createdAt: Date;
updatedAt: Date;
}
/**
* JobSchedule - recurring job configuration with jitter support
* Times "wander" around the clock due to random jitter after each run
*/
export type JobStatus = 'success' | 'error' | 'partial' | 'running' | null;
export interface JobSchedule {
id: number;
jobName: string;
description?: string;
enabled: boolean;
// Timing configuration
baseIntervalMinutes: number; // e.g., 240 (4 hours)
jitterMinutes: number; // e.g., 30 (±30 minutes)
// Worker identity
workerName?: string; // e.g., "Alice", "Henry", "Bella", "Oscar"
workerRole?: string; // e.g., "Store Discovery Worker", "GraphQL Product Sync"
// Last run tracking
lastRunAt?: Date;
lastStatus?: JobStatus;
lastErrorMessage?: string;
lastDurationMs?: number;
// Next run (calculated with jitter)
nextRunAt?: Date;
// Job-specific config
jobConfig?: Record<string, any>;
createdAt: Date;
updatedAt: Date;
}
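
The schedule fields above imply that each next run is the base interval plus or minus a random jitter. A minimal sketch of one way to compute `nextRunAt`, assuming a uniform jitter and a hypothetical `computeNextRunAt` helper; the scheduler's real calculation is not part of this diff:

```typescript
// Hypothetical helper; the uniform distribution is an assumption, not the scheduler's documented behaviour.
function computeNextRunAt(
  from: Date,
  baseIntervalMinutes: number,
  jitterMinutes: number,
  random: () => number = Math.random, // injectable so the result can be made deterministic
): Date {
  // Uniform offset in [-jitterMinutes, +jitterMinutes)
  const offsetMinutes = (random() * 2 - 1) * jitterMinutes;
  const totalMinutes = baseIntervalMinutes + offsetMinutes;
  return new Date(from.getTime() + totalMinutes * 60 * 1000);
}

const start = new Date('2025-12-08T00:00:00.000Z');
// With random() fixed at 0.5 the jitter offset is zero, so the run lands exactly 240 minutes later.
console.log(computeNextRunAt(start, 240, 30, () => 0.5).toISOString());
// 2025-12-08T04:00:00.000Z
```

Because the next run is computed from the previous run rather than from a fixed clock time, the offsets accumulate, which matches the "wander around the clock" description.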
/**
* JobRunLog - history of job executions
*/
export interface JobRunLog {
id: number;
scheduleId: number;
jobName: string;
status: 'pending' | 'running' | 'success' | 'error' | 'partial';
startedAt?: Date;
completedAt?: Date;
durationMs?: number;
errorMessage?: string;
// Worker identity (propagated from schedule)
workerName?: string; // e.g., "Alice", "Henry", "Bella", "Oscar"
runRole?: string; // e.g., "Store Discovery Worker"
// Results summary
itemsProcessed?: number;
itemsSucceeded?: number;
itemsFailed?: number;
metadata?: any;
createdAt: Date;
}
// ============================================================
// GRAPHQL OPERATION TYPES
// ============================================================
export interface FilteredProductsVariables {
includeEnterpriseSpecials: boolean;
productsFilter: {
dispensaryId: string;
pricingType: 'rec' | 'med';
strainTypes?: string[];
subcategories?: string[];
Status?: string;
types?: string[];
useCache?: boolean;
isDefaultSort?: boolean;
sortBy?: string;
sortDirection?: number;
bypassOnlineThresholds?: boolean;
isKioskMenu?: boolean;
removeProductsBelowOptionThresholds?: boolean;
};
page: number;
perPage: number;
}
export interface GetAddressBasedDispensaryDataVariables {
input: {
dispensaryId: string; // The slug like "AZ-Deeply-Rooted"
};
}
export interface ConsumerDispensariesVariables {
filter: {
lat: number;
lng: number;
radius: number; // in meters or km
isDelivery?: boolean;
searchText?: string;
};
}
// ============================================================
// API RESPONSE TYPES
// ============================================================
export interface DashboardStats {
dispensaryCount: number;
productCount: number;
snapshotCount24h: number;
lastCrawlTime?: Date;
failedJobCount: number;
brandCount: number;
categoryCount: number;
}
export interface CategorySummary {
type: string;
subcategory: string;
productCount: number;
dispensaryCount: number;
avgPrice?: number;
}
export interface BrandSummary {
brandName: string;
brandId?: string;
brandLogoUrl?: string;
productCount: number;
dispensaryCount: number;
}
// ============================================================
// CRAWLER PROFILE TYPES
// ============================================================
/**
* DispensaryCrawlerProfile - per-store crawler configuration
*
* Allows each dispensary to have customized crawler settings without
* affecting shared crawler logic. A dispensary can have multiple profiles
* but only one is active at a time (via dispensaries.active_crawler_profile_id).
*/
export interface DispensaryCrawlerProfile {
id: number;
dispensaryId: number;
profileName: string;
crawlerType: string; // 'dutchie', 'treez', 'jane', 'sandbox', 'custom'
profileKey: string | null; // Optional key for per-store module mapping
config: Record<string, any>; // Crawler-specific configuration
timeoutMs: number | null;
downloadImages: boolean;
trackStock: boolean;
version: number;
enabled: boolean;
createdAt: Date;
updatedAt: Date;
}
/**
* DispensaryCrawlerProfileCreate - input type for creating a new profile
*/
export interface DispensaryCrawlerProfileCreate {
dispensaryId: number;
profileName: string;
crawlerType: string;
profileKey?: string | null;
config?: Record<string, any>;
timeoutMs?: number | null;
downloadImages?: boolean;
trackStock?: boolean;
version?: number;
enabled?: boolean;
}
/**
* DispensaryCrawlerProfileUpdate - input type for updating an existing profile
*/
export interface DispensaryCrawlerProfileUpdate {
profileName?: string;
crawlerType?: string;
profileKey?: string | null;
config?: Record<string, any>;
timeoutMs?: number | null;
downloadImages?: boolean;
trackStock?: boolean;
version?: number;
enabled?: boolean;
}
/**
* CrawlerProfileOptions - runtime options derived from a profile
* Used when invoking the actual crawler
*/
export interface CrawlerProfileOptions {
timeoutMs: number;
downloadImages: boolean;
trackStock: boolean;
config: Record<string, any>;
}


@@ -669,12 +669,4 @@ export async function syncRecentCrawls(
return { synced, errors };
}
// ============================================================
// EXPORTS
// ============================================================
export {
CrawlResult,
SyncOptions,
SyncResult,
};
// Types CrawlResult, SyncOptions, and SyncResult are already exported at their declarations


@@ -6,6 +6,7 @@ import { initializeMinio, isMinioEnabled } from './utils/minio';
import { initializeImageStorage } from './utils/image-storage';
import { logger } from './services/logger';
import { cleanupOrphanedJobs } from './services/proxyTestQueue';
import healthRoutes from './routes/health';
dotenv.config();
@@ -58,22 +59,15 @@ import scraperMonitorRoutes from './routes/scraper-monitor';
import apiTokensRoutes from './routes/api-tokens';
import apiPermissionsRoutes from './routes/api-permissions';
import parallelScrapeRoutes from './routes/parallel-scrape';
import scheduleRoutes from './routes/schedule';
import crawlerSandboxRoutes from './routes/crawler-sandbox';
import versionRoutes from './routes/version';
import publicApiRoutes from './routes/public-api';
import usersRoutes from './routes/users';
import staleProcessesRoutes from './routes/stale-processes';
import orchestratorAdminRoutes from './routes/orchestrator-admin';
import adminRoutes from './routes/admin';
import healthRoutes from './routes/health';
import workersRoutes from './routes/workers';
import { dutchieAZRouter, startScheduler as startDutchieAZScheduler, initializeDefaultSchedules } from './dutchie-az';
import { getPool } from './dutchie-az/db/connection';
import { createAnalyticsRouter } from './dutchie-az/routes/analytics';
import { createMultiStateRoutes } from './multi-state';
import { trackApiUsage, checkRateLimit } from './middleware/apiTokenTracker';
import { startCrawlScheduler } from './services/crawl-scheduler';
import { validateWordPressPermissions } from './middleware/wordpressPermissions';
import { markTrustedDomains } from './middleware/trustedDomains';
import { createSystemRouter, createPrometheusRouter } from './system/routes';
@@ -81,7 +75,7 @@ import { createPortalRoutes } from './portals';
import { createStatesRouter } from './routes/states';
import { createAnalyticsV2Router } from './routes/analytics-v2';
import { createDiscoveryRoutes } from './discovery';
import { createDutchieDiscoveryRoutes, promoteDiscoveryLocation } from './dutchie-az/discovery';
import { getPool } from './db/pool';
// Consumer API routes (findadispo.com, findagram.co)
import consumerAuthRoutes from './routes/consumer-auth';
@@ -132,41 +126,22 @@ app.use('/api/scraper-monitor', scraperMonitorRoutes);
app.use('/api/api-tokens', apiTokensRoutes);
app.use('/api/api-permissions', apiPermissionsRoutes);
app.use('/api/parallel-scrape', parallelScrapeRoutes);
app.use('/api/schedule', scheduleRoutes);
app.use('/api/crawler-sandbox', crawlerSandboxRoutes);
app.use('/api/version', versionRoutes);
app.use('/api/users', usersRoutes);
app.use('/api/stale-processes', staleProcessesRoutes);
// Admin routes - operator actions (crawl triggers, health checks)
app.use('/api/admin', adminRoutes);
// Admin routes - orchestrator actions
app.use('/api/admin/orchestrator', orchestratorAdminRoutes);
// SEO orchestrator routes
app.use('/api/seo', seoRoutes);
// Provider-agnostic worker management routes (replaces /api/dutchie-az/admin/schedules)
// Provider-agnostic worker management routes
app.use('/api/workers', workersRoutes);
// Monitor routes - aliased from workers for convenience
app.use('/api/monitor', workersRoutes);
console.log('[Workers] Routes registered at /api/workers and /api/monitor');
// Market data pipeline routes (provider-agnostic)
app.use('/api/markets', dutchieAZRouter);
// Legacy aliases (deprecated - remove after frontend migration)
app.use('/api/az', dutchieAZRouter);
app.use('/api/dutchie-az', dutchieAZRouter);
// Phase 3: Analytics Dashboards - price trends, penetration, category growth, etc.
try {
const analyticsRouter = createAnalyticsRouter(getPool());
app.use('/api/markets/analytics', analyticsRouter);
// Legacy alias for backwards compatibility
app.use('/api/az/analytics', analyticsRouter);
console.log('[Analytics] Routes registered at /api/markets/analytics');
} catch (error) {
console.warn('[Analytics] Failed to register routes:', error);
}
// Phase 3: Analytics V2 - Enhanced analytics with rec/med state segmentation
try {
const analyticsV2Router = createAnalyticsV2Router(getPool());
@@ -239,43 +214,7 @@ try {
}
// Platform-specific Discovery Routes
// Uses neutral slugs to avoid trademark issues in URLs:
// dt = Dutchie, jn = Jane, wm = Weedmaps, etc.
// Routes: /api/discovery/platforms/:platformSlug/*
try {
const dtDiscoveryRoutes = createDutchieDiscoveryRoutes(getPool());
app.use('/api/discovery/platforms/dt', dtDiscoveryRoutes);
console.log('[Discovery] Platform routes registered at /api/discovery/platforms/dt');
} catch (error) {
console.warn('[Discovery] Failed to register platform routes:', error);
}
// Orchestrator promotion endpoint (platform-agnostic)
// Route: /api/orchestrator/platforms/:platformSlug/promote/:id
app.post('/api/orchestrator/platforms/:platformSlug/promote/:id', async (req, res) => {
try {
const { platformSlug, id } = req.params;
// Validate platform slug
const validPlatforms = ['dt']; // dt = Dutchie
if (!validPlatforms.includes(platformSlug)) {
return res.status(400).json({
success: false,
error: `Invalid platform slug: ${platformSlug}. Valid slugs: ${validPlatforms.join(', ')}`
});
}
const result = await promoteDiscoveryLocation(getPool(), parseInt(id, 10));
if (result.success) {
res.json(result);
} else {
res.status(400).json(result);
}
} catch (error: any) {
console.error('[Orchestrator] Promotion error:', error);
res.status(500).json({ success: false, error: error.message });
}
});
// TODO: Rebuild with /platforms/dutchie/ module
async function startServer() {
try {
@@ -288,15 +227,6 @@ async function startServer() {
// Clean up any orphaned proxy test jobs from previous server runs
await cleanupOrphanedJobs();
// Start the crawl scheduler (checks every minute for jobs to run)
startCrawlScheduler();
logger.info('system', 'Crawl scheduler started');
// Start the Dutchie AZ scheduler (enqueues jobs for workers)
await initializeDefaultSchedules();
startDutchieAZScheduler();
logger.info('system', 'Dutchie AZ scheduler started');
app.listen(PORT, () => {
logger.info('system', `Server running on port ${PORT}`);
console.log(`🚀 Server running on port ${PORT}`);


@@ -0,0 +1,544 @@
/**
* ============================================================
* DUTCHIE PLATFORM CLIENT - LOCKED MODULE
* ============================================================
*
* DO NOT MODIFY THIS FILE WITHOUT EXPLICIT AUTHORIZATION.
*
* This is the canonical HTTP client for all Dutchie communication.
* All Dutchie workers (Alice, Bella, etc.) MUST use this client.
*
* IMPLEMENTATION:
* - Uses curl via child_process.execSync (bypasses TLS fingerprinting)
* - NO Puppeteer, NO axios, NO fetch
* - Fingerprint rotation on 403
* - Residential IP compatible
*
* USAGE:
* import { curlPost, curlGet, executeGraphQL } from '@dutchie/client';
*
* ============================================================
*/
import { execSync } from 'child_process';
// ============================================================
// TYPES
// ============================================================
export interface CurlResponse {
status: number;
data: any;
error?: string;
}
export interface Fingerprint {
userAgent: string;
acceptLanguage: string;
secChUa?: string;
secChUaPlatform?: string;
secChUaMobile?: string;
}
// ============================================================
// CONFIGURATION
// ============================================================
export const DUTCHIE_CONFIG = {
graphqlEndpoint: 'https://dutchie.com/api-3/graphql',
baseUrl: 'https://dutchie.com',
timeout: 30000,
maxRetries: 3,
perPage: 100,
maxPages: 200,
pageDelayMs: 500,
modeDelayMs: 2000,
};
// ============================================================
// PROXY SUPPORT
// ============================================================
// Integrates with the CrawlRotator system from services/crawl-rotator.ts
// On 403 errors:
// 1. Record failure on current proxy
// 2. Rotate to next proxy
// 3. Retry with new proxy
// ============================================================
import type { CrawlRotator, Proxy } from '../../services/crawl-rotator';
let currentProxy: string | null = null;
let crawlRotator: CrawlRotator | null = null;
/**
* Set proxy for all Dutchie requests
* Format: http://user:pass@host:port or socks5://host:port
*/
export function setProxy(proxy: string | null): void {
currentProxy = proxy;
if (proxy) {
console.log(`[Dutchie Client] Proxy set: ${proxy.replace(/:[^:@]+@/, ':***@')}`);
} else {
console.log('[Dutchie Client] Proxy disabled (direct connection)');
}
}
/**
* Get current proxy URL
*/
export function getProxy(): string | null {
return currentProxy;
}
/**
* Set CrawlRotator for proxy rotation on 403s
* This enables automatic proxy rotation when blocked
*/
export function setCrawlRotator(rotator: CrawlRotator | null): void {
crawlRotator = rotator;
if (rotator) {
console.log('[Dutchie Client] CrawlRotator attached - proxy rotation enabled');
// Set initial proxy from rotator
const proxy = rotator.proxy.getCurrent();
if (proxy) {
currentProxy = rotator.proxy.getProxyUrl(proxy);
console.log(`[Dutchie Client] Initial proxy: ${currentProxy.replace(/:[^:@]+@/, ':***@')}`);
}
}
}
/**
* Get attached CrawlRotator
*/
export function getCrawlRotator(): CrawlRotator | null {
return crawlRotator;
}
/**
* Rotate to next proxy (called on 403)
*/
async function rotateProxyOn403(error?: string): Promise<boolean> {
if (!crawlRotator) {
return false;
}
// Record failure on current proxy
await crawlRotator.recordFailure(error || '403 Forbidden');
// Rotate to next proxy
const nextProxy = crawlRotator.rotateProxy();
if (nextProxy) {
currentProxy = crawlRotator.proxy.getProxyUrl(nextProxy);
console.log(`[Dutchie Client] Rotated proxy: ${currentProxy.replace(/:[^:@]+@/, ':***@')}`);
return true;
}
console.warn('[Dutchie Client] No more proxies available');
return false;
}
/**
* Record success on current proxy
*/
async function recordProxySuccess(responseTimeMs?: number): Promise<void> {
if (crawlRotator) {
await crawlRotator.recordSuccess(responseTimeMs);
}
}
/**
* Build curl proxy argument
*/
function getProxyArg(): string {
if (!currentProxy) return '';
return `--proxy '${currentProxy}'`;
}
export const GRAPHQL_HASHES = {
FilteredProducts: 'ee29c060826dc41c527e470e9ae502c9b2c169720faa0a9f5d25e1b9a530a4a0',
GetAddressBasedDispensaryData: '13461f73abf7268770dfd05fe7e10c523084b2bb916a929c08efe3d87531977b',
ConsumerDispensaries: '0a5bfa6ca1d64ae47bcccb7c8077c87147cbc4e6982c17ceec97a2a4948b311b',
DispensaryInfo: '13461f73abf7268770dfd05fe7e10c523084b2bb916a929c08efe3d87531977b',
};
// ============================================================
// FINGERPRINTS - Browser profiles for anti-detect
// ============================================================
const FINGERPRINTS: Fingerprint[] = [
// Chrome Windows (latest) - typical residential user, use first
{
userAgent: 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36',
acceptLanguage: 'en-US,en;q=0.9',
secChUa: '"Google Chrome";v="131", "Chromium";v="131", "Not_A Brand";v="24"',
secChUaPlatform: '"Windows"',
secChUaMobile: '?0',
},
// Chrome Mac (latest)
{
userAgent: 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36',
acceptLanguage: 'en-US,en;q=0.9',
secChUa: '"Google Chrome";v="131", "Chromium";v="131", "Not_A Brand";v="24"',
secChUaPlatform: '"macOS"',
secChUaMobile: '?0',
},
// Chrome Windows (120)
{
userAgent: 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
acceptLanguage: 'en-US,en;q=0.9',
secChUa: '"Chromium";v="120", "Google Chrome";v="120", "Not-A.Brand";v="99"',
secChUaPlatform: '"Windows"',
secChUaMobile: '?0',
},
// Firefox Windows
{
userAgent: 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:133.0) Gecko/20100101 Firefox/133.0',
acceptLanguage: 'en-US,en;q=0.5',
},
// Safari Mac
{
userAgent: 'Mozilla/5.0 (Macintosh; Intel Mac OS X 14_2) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.2 Safari/605.1.15',
acceptLanguage: 'en-US,en;q=0.9',
},
// Edge Windows
{
userAgent: 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36 Edg/131.0.0.0',
acceptLanguage: 'en-US,en;q=0.9',
secChUa: '"Microsoft Edge";v="131", "Chromium";v="131", "Not_A Brand";v="24"',
secChUaPlatform: '"Windows"',
secChUaMobile: '?0',
},
];
let currentFingerprintIndex = 0;
export function getFingerprint(): Fingerprint {
return FINGERPRINTS[currentFingerprintIndex];
}
export function rotateFingerprint(): Fingerprint {
currentFingerprintIndex = (currentFingerprintIndex + 1) % FINGERPRINTS.length;
const fp = FINGERPRINTS[currentFingerprintIndex];
console.log(`[Dutchie Client] Rotated to fingerprint: ${fp.userAgent.slice(0, 50)}...`);
return fp;
}
export function resetFingerprint(): void {
currentFingerprintIndex = 0;
}
// ============================================================
// CURL HTTP CLIENT
// ============================================================
/**
* Build headers for Dutchie requests
*/
export function buildHeaders(refererPath: string, fingerprint?: Fingerprint): Record<string, string> {
const fp = fingerprint || getFingerprint();
const refererUrl = `https://dutchie.com${refererPath}`;
const headers: Record<string, string> = {
'accept': 'application/json, text/plain, */*',
'accept-language': fp.acceptLanguage,
'content-type': 'application/json',
'origin': 'https://dutchie.com',
'referer': refererUrl,
'user-agent': fp.userAgent,
'apollographql-client-name': 'Marketplace (production)',
};
if (fp.secChUa) {
headers['sec-ch-ua'] = fp.secChUa;
headers['sec-ch-ua-mobile'] = fp.secChUaMobile || '?0';
headers['sec-ch-ua-platform'] = fp.secChUaPlatform || '"Windows"';
headers['sec-fetch-dest'] = 'empty';
headers['sec-fetch-mode'] = 'cors';
headers['sec-fetch-site'] = 'same-site';
}
return headers;
}
/**
* Execute HTTP POST using curl (bypasses TLS fingerprinting)
*/
export function curlPost(url: string, body: any, headers: Record<string, string>, timeout = 30000): CurlResponse {
const filteredHeaders = Object.entries(headers)
.filter(([k]) => k.toLowerCase() !== 'accept-encoding')
.map(([k, v]) => `-H '${k}: ${v}'`)
.join(' ');
const bodyJson = JSON.stringify(body).replace(/'/g, "'\\''");
const timeoutSec = Math.ceil(timeout / 1000);
const separator = '___HTTP_STATUS___';
const proxyArg = getProxyArg();
const cmd = `curl -s --compressed ${proxyArg} -w '${separator}%{http_code}' --max-time ${timeoutSec} ${filteredHeaders} -d '${bodyJson}' '${url}'`;
try {
const output = execSync(cmd, {
encoding: 'utf-8',
maxBuffer: 10 * 1024 * 1024,
timeout: timeout + 5000
});
const separatorIndex = output.lastIndexOf(separator);
if (separatorIndex === -1) {
const lines = output.trim().split('\n');
const statusCode = parseInt(lines.pop() || '0', 10);
const responseBody = lines.join('\n');
try {
return { status: statusCode, data: JSON.parse(responseBody) };
} catch {
return { status: statusCode, data: responseBody };
}
}
const responseBody = output.slice(0, separatorIndex);
const statusCode = parseInt(output.slice(separatorIndex + separator.length).trim(), 10);
try {
return { status: statusCode, data: JSON.parse(responseBody) };
} catch {
return { status: statusCode, data: responseBody };
}
} catch (error: any) {
return {
status: 0,
data: null,
error: error.message || 'curl request failed'
};
}
}
/**
* Execute HTTP GET using curl (bypasses TLS fingerprinting)
* Returns parsed JSON when the body is valid JSON, otherwise the raw string (e.g. HTML)
*/
export function curlGet(url: string, headers: Record<string, string>, timeout = 30000): CurlResponse {
const filteredHeaders = Object.entries(headers)
.filter(([k]) => k.toLowerCase() !== 'accept-encoding')
.map(([k, v]) => `-H '${k}: ${v}'`)
.join(' ');
const timeoutSec = Math.ceil(timeout / 1000);
const separator = '___HTTP_STATUS___';
const proxyArg = getProxyArg();
const cmd = `curl -s --compressed ${proxyArg} -w '${separator}%{http_code}' --max-time ${timeoutSec} ${filteredHeaders} '${url}'`;
try {
const output = execSync(cmd, {
encoding: 'utf-8',
maxBuffer: 10 * 1024 * 1024,
timeout: timeout + 5000
});
const separatorIndex = output.lastIndexOf(separator);
if (separatorIndex === -1) {
const lines = output.trim().split('\n');
const statusCode = parseInt(lines.pop() || '0', 10);
const responseBody = lines.join('\n');
return { status: statusCode, data: responseBody };
}
const responseBody = output.slice(0, separatorIndex);
const statusCode = parseInt(output.slice(separatorIndex + separator.length).trim(), 10);
// Try to parse as JSON, otherwise return as string (HTML)
try {
return { status: statusCode, data: JSON.parse(responseBody) };
} catch {
return { status: statusCode, data: responseBody };
}
} catch (error: any) {
return {
status: 0,
data: null,
error: error.message || 'curl request failed'
};
}
}
// ============================================================
// GRAPHQL EXECUTION
// ============================================================
export interface ExecuteGraphQLOptions {
maxRetries?: number;
retryOn403?: boolean;
cName: string;
}
/**
* Execute GraphQL query with curl (bypasses TLS fingerprinting)
*/
export async function executeGraphQL(
operationName: string,
variables: any,
hash: string,
options: ExecuteGraphQLOptions
): Promise<any> {
const { maxRetries = 3, retryOn403 = true, cName } = options;
const body = {
operationName,
variables,
extensions: {
persistedQuery: { version: 1, sha256Hash: hash },
},
};
let lastError: Error | null = null;
let attempt = 0;
while (attempt <= maxRetries) {
const fingerprint = getFingerprint();
const headers = buildHeaders(`/embedded-menu/${cName}`, fingerprint);
console.log(`[Dutchie Client] curl POST ${operationName} (attempt ${attempt + 1}/${maxRetries + 1})`);
const response = curlPost(DUTCHIE_CONFIG.graphqlEndpoint, body, headers, DUTCHIE_CONFIG.timeout);
console.log(`[Dutchie Client] Response status: ${response.status}`);
if (response.error) {
console.error(`[Dutchie Client] curl error: ${response.error}`);
lastError = new Error(response.error);
attempt++;
if (attempt <= maxRetries) {
await sleep(1000 * attempt);
}
continue;
}
if (response.status === 200) {
if (response.data?.errors?.length > 0) {
console.warn(`[Dutchie Client] GraphQL errors: ${JSON.stringify(response.data.errors[0])}`);
}
return response.data;
}
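if (response.status === 403 && retryOn403) {
console.warn(`[Dutchie Client] 403 blocked - rotating proxy and fingerprint...`);
// Record the status so callers matching /HTTP (\d+)/ still see the 403 once retries run out
lastError = new Error('HTTP 403');
// Rotate the proxy as described under PROXY SUPPORT (returns false when no CrawlRotator is attached)
await rotateProxyOn403('403 Forbidden');
rotateFingerprint();
attempt++;
if (attempt <= maxRetries) {
await sleep(1000 * attempt);
}
continue;
}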
const bodyPreview = typeof response.data === 'string'
? response.data.slice(0, 200)
: JSON.stringify(response.data).slice(0, 200);
console.error(`[Dutchie Client] HTTP ${response.status}: ${bodyPreview}`);
lastError = new Error(`HTTP ${response.status}`);
attempt++;
if (attempt <= maxRetries) {
await sleep(1000 * attempt);
}
}
throw lastError || new Error('Max retries exceeded');
}
// ============================================================
// HTML PAGE FETCHING
// ============================================================
export interface FetchPageOptions {
maxRetries?: number;
retryOn403?: boolean;
}
/**
* Fetch HTML page from Dutchie (for city pages, dispensary pages, etc.)
* Returns the raw HTML together with the HTTP status, or null once all retries fail
*/
export async function fetchPage(
path: string,
options: FetchPageOptions = {}
): Promise<{ html: string; status: number } | null> {
const { maxRetries = 3, retryOn403 = true } = options;
const url = `${DUTCHIE_CONFIG.baseUrl}${path}`;
let attempt = 0;
while (attempt <= maxRetries) {
const fingerprint = getFingerprint();
const headers: Record<string, string> = {
'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8',
'accept-language': fingerprint.acceptLanguage,
'user-agent': fingerprint.userAgent,
};
if (fingerprint.secChUa) {
headers['sec-ch-ua'] = fingerprint.secChUa;
headers['sec-ch-ua-mobile'] = fingerprint.secChUaMobile || '?0';
headers['sec-ch-ua-platform'] = fingerprint.secChUaPlatform || '"Windows"';
headers['sec-fetch-dest'] = 'document';
headers['sec-fetch-mode'] = 'navigate';
headers['sec-fetch-site'] = 'none';
headers['sec-fetch-user'] = '?1';
headers['upgrade-insecure-requests'] = '1';
}
console.log(`[Dutchie Client] curl GET ${path} (attempt ${attempt + 1}/${maxRetries + 1})`);
const response = curlGet(url, headers, DUTCHIE_CONFIG.timeout);
console.log(`[Dutchie Client] Response status: ${response.status}`);
if (response.error) {
console.error(`[Dutchie Client] curl error: ${response.error}`);
attempt++;
if (attempt <= maxRetries) {
await sleep(1000 * attempt);
}
continue;
}
if (response.status === 200) {
return { html: response.data, status: response.status };
}
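if (response.status === 403 && retryOn403) {
console.warn(`[Dutchie Client] 403 blocked - rotating proxy and fingerprint...`);
// Rotate the proxy as well (returns false when no CrawlRotator is attached)
await rotateProxyOn403('403 Forbidden');
rotateFingerprint();
attempt++;
if (attempt <= maxRetries) {
await sleep(1000 * attempt);
}
continue;
}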
console.error(`[Dutchie Client] HTTP ${response.status}`);
attempt++;
if (attempt <= maxRetries) {
await sleep(1000 * attempt);
}
}
return null;
}
/**
* Extract __NEXT_DATA__ from HTML page
*/
export function extractNextData(html: string): any | null {
const match = html.match(/<script id="__NEXT_DATA__" type="application\/json">([^<]+)<\/script>/);
if (match && match[1]) {
try {
return JSON.parse(match[1]);
} catch (e) {
console.error('[Dutchie Client] Failed to parse __NEXT_DATA__:', e);
return null;
}
}
return null;
}
// ============================================================
// UTILITY
// ============================================================
function sleep(ms: number): Promise<void> {
return new Promise(resolve => setTimeout(resolve, ms));
}
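
Both `curlPost` and `curlGet` above recover the HTTP status by asking curl to append a sentinel plus `%{http_code}` to the response body and then splitting on the last occurrence of that sentinel. A minimal sketch of just that parsing step, pulled out as a pure function, might look like the following. The names `splitCurlOutput`, `parseBody`, and `ParsedCurlOutput` are hypothetical and are not part of the module in this diff.

```typescript
// Hypothetical helper for illustration only; not part of the client module above.
interface ParsedCurlOutput {
  status: number; // 0 when no status could be recovered
  body: string;   // everything before the sentinel
}

const STATUS_SEPARATOR = '___HTTP_STATUS___';

// Split raw curl output of the form "<body><separator><status>".
// lastIndexOf is used so a body that happens to contain the separator
// text does not confuse the split.
function splitCurlOutput(output: string, separator: string = STATUS_SEPARATOR): ParsedCurlOutput {
  const idx = output.lastIndexOf(separator);
  if (idx === -1) {
    return { status: 0, body: output };
  }
  const body = output.slice(0, idx);
  const status = parseInt(output.slice(idx + separator.length).trim(), 10);
  return { status: Number.isNaN(status) ? 0 : status, body };
}

// Try JSON first and fall back to the raw string (for example an HTML page).
function parseBody(body: string): unknown {
  try {
    return JSON.parse(body);
  } catch {
    return body;
  }
}
```

Keeping the split separate from the `execSync` call would make the status handling testable without spawning curl, and would let both request functions share one code path.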


@@ -0,0 +1,49 @@
/**
* Dutchie Platform Module
*
* Single export point for all Dutchie communication.
* All Dutchie workers MUST import from this module.
*/
export {
// HTTP Client
curlPost,
curlGet,
executeGraphQL,
fetchPage,
extractNextData,
// Headers & Fingerprints
buildHeaders,
getFingerprint,
rotateFingerprint,
resetFingerprint,
// Proxy
setProxy,
getProxy,
setCrawlRotator,
getCrawlRotator,
// Configuration
DUTCHIE_CONFIG,
GRAPHQL_HASHES,
// Types
type CurlResponse,
type Fingerprint,
type ExecuteGraphQLOptions,
type FetchPageOptions,
} from './client';
// Re-export CrawlRotator types from canonical location
export type { CrawlRotator, Proxy, ProxyStats } from '../../services/crawl-rotator';
// GraphQL Queries
export {
resolveDispensaryId,
resolveDispensaryIdWithDetails,
getDispensaryInfo,
type ResolveDispensaryResult,
type DispensaryInfo,
} from './queries';


@@ -0,0 +1,187 @@
/**
* Dutchie GraphQL Queries
*
* High-level GraphQL operations built on top of the client.
*/
import { executeGraphQL, GRAPHQL_HASHES, DUTCHIE_CONFIG } from './client';
// ============================================================
// TYPES
// ============================================================
export interface ResolveDispensaryResult {
dispensaryId: string | null;
httpStatus?: number;
error?: string;
source?: 'graphql' | 'html';
}
// ============================================================
// DISPENSARY ID RESOLUTION
// ============================================================
/**
* Resolve a dispensary slug to its internal platform ID via GraphQL
*/
export async function resolveDispensaryId(slug: string): Promise<string | null> {
const result = await resolveDispensaryIdWithDetails(slug);
return result.dispensaryId;
}
/**
* Resolve with full details for error handling
*/
export async function resolveDispensaryIdWithDetails(slug: string): Promise<ResolveDispensaryResult> {
console.log(`[Dutchie Queries] Resolving dispensary ID for slug: ${slug}`);
try {
const variables = {
dispensaryFilter: {
cNameOrID: slug,
},
};
const result = await executeGraphQL(
'GetAddressBasedDispensaryData',
variables,
GRAPHQL_HASHES.GetAddressBasedDispensaryData,
{ cName: slug, maxRetries: 3, retryOn403: true }
);
const dispensaryId = result?.data?.dispensaryBySlug?.id ||
result?.data?.dispensary?.id ||
result?.data?.getAddressBasedDispensaryData?.dispensary?.id;
if (dispensaryId) {
console.log(`[Dutchie Queries] Resolved ${slug} -> ${dispensaryId}`);
return { dispensaryId, source: 'graphql' };
}
console.log(`[Dutchie Queries] No dispensaryId in response for ${slug}`);
return {
dispensaryId: null,
error: 'Could not extract dispensaryId from GraphQL response',
};
} catch (error: any) {
const status = error.message?.match(/HTTP (\d+)/)?.[1];
if (status === '403' || status === '404') {
return {
dispensaryId: null,
httpStatus: parseInt(status, 10),
error: `HTTP ${status}: Store may be removed or blocked`,
};
}
return {
dispensaryId: null,
error: error.message,
};
}
}
// ============================================================
// DISPENSARY INFO
// ============================================================
export interface DispensaryInfo {
id: string;
name: string;
slug: string;
isOpen: boolean;
timezone: string;
address: string;
city: string;
state: string;
zip: string;
phone: string;
email: string;
hours: {
monday?: { open: string; close: string } | null;
tuesday?: { open: string; close: string } | null;
wednesday?: { open: string; close: string } | null;
thursday?: { open: string; close: string } | null;
friday?: { open: string; close: string } | null;
saturday?: { open: string; close: string } | null;
sunday?: { open: string; close: string } | null;
};
acceptsCredit: boolean;
offersCurbside: boolean;
offersDelivery: boolean;
offersPickup: boolean;
featureFlags: string[];
}
/**
* Get dispensary info including business hours
*/
export async function getDispensaryInfo(cNameOrSlug: string): Promise<DispensaryInfo | null> {
console.log(`[Dutchie Queries] Getting dispensary info for: ${cNameOrSlug}`);
try {
const variables = {
dispensaryFilter: {
cNameOrID: cNameOrSlug,
},
};
const result = await executeGraphQL(
'GetAddressBasedDispensaryData',
variables,
GRAPHQL_HASHES.GetAddressBasedDispensaryData,
{ cName: cNameOrSlug, maxRetries: 2, retryOn403: true }
);
const dispensary = result?.data?.dispensary ||
result?.data?.dispensaryBySlug ||
result?.data?.getAddressBasedDispensaryData?.dispensary;
if (!dispensary) {
console.log(`[Dutchie Queries] No dispensary data found for ${cNameOrSlug}`);
return null;
}
const hoursSettings = dispensary.hoursSettings || dispensary.operatingHours || {};
const parseHours = (dayHours: any) => {
if (!dayHours || dayHours.isClosed) return null;
return {
open: dayHours.openTime || dayHours.open || '',
close: dayHours.closeTime || dayHours.close || '',
};
};
return {
id: dispensary.id || dispensary._id || '',
name: dispensary.name || '',
slug: dispensary.cName || dispensary.slug || cNameOrSlug,
isOpen: dispensary.isOpen ?? dispensary.openNow ?? false,
timezone: dispensary.timezone || '',
address: dispensary.address || dispensary.location?.address || '',
city: dispensary.city || dispensary.location?.city || '',
state: dispensary.state || dispensary.location?.state || '',
zip: dispensary.zip || dispensary.zipcode || dispensary.location?.zip || '',
phone: dispensary.phone || dispensary.phoneNumber || '',
email: dispensary.email || '',
hours: {
monday: parseHours(hoursSettings.monday),
tuesday: parseHours(hoursSettings.tuesday),
wednesday: parseHours(hoursSettings.wednesday),
thursday: parseHours(hoursSettings.thursday),
friday: parseHours(hoursSettings.friday),
saturday: parseHours(hoursSettings.saturday),
sunday: parseHours(hoursSettings.sunday),
},
acceptsCredit: dispensary.acceptsCreditCards ?? dispensary.creditCardAccepted ?? false,
offersCurbside: dispensary.offersCurbside ?? dispensary.curbsidePickup ?? false,
offersDelivery: dispensary.offersDelivery ?? dispensary.delivery ?? false,
offersPickup: dispensary.offersPickup ?? dispensary.pickup ?? true,
featureFlags: dispensary.featureFlags || [],
};
} catch (error: any) {
console.error(`[Dutchie Queries] Error getting dispensary info: ${error.message}`);
return null;
}
}
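
The hours handling in `getDispensaryInfo` accepts two possible field spellings (`openTime`/`closeTime` or `open`/`close`) and maps closed or missing days to `null`. The sketch below isolates that normalization so it can be exercised on its own. The `RawDayHours` type and the sample payload are assumptions made for illustration; the real shape of the Dutchie response is not shown in this diff.

```typescript
// Illustrative types; the actual API payload shape is an assumption.
interface DayHours {
  open: string;
  close: string;
}

type RawDayHours =
  | { isClosed?: boolean; openTime?: string; closeTime?: string; open?: string; close?: string }
  | null
  | undefined;

// Normalize one day's hours: closed or missing days become null,
// and either field spelling is accepted.
function parseHours(day: RawDayHours): DayHours | null {
  if (!day || day.isClosed) return null;
  return {
    open: day.openTime || day.open || '',
    close: day.closeTime || day.close || '',
  };
}

// Made-up sample input covering the three cases.
const sampleHours: Record<string, RawDayHours> = {
  monday: { openTime: '08:00', closeTime: '22:00' },
  tuesday: { open: '09:00', close: '21:00' },
  sunday: { isClosed: true },
};

const normalized = {
  monday: parseHours(sampleHours.monday),
  tuesday: parseHours(sampleHours.tuesday),
  sunday: parseHours(sampleHours.sunday),
  wednesday: parseHours(sampleHours.wednesday), // key absent, so undefined is passed in
};
```

Note that a day with neither spelling present still yields `{ open: '', close: '' }` rather than `null`, which matches the behavior of the function in the diff.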


@@ -1,53 +0,0 @@
/**
* Admin Routes
*
* Top-level admin/operator actions (crawl triggers, health checks, etc.)
*
* Route semantics:
* /api/admin/... = Admin/operator actions
* /api/az/... = Arizona data slice (stores, products, metrics)
*/
import { Router, Request, Response } from 'express';
import { getDispensaryById, crawlSingleDispensary } from '../dutchie-az';
const router = Router();
// ============================================================
// CRAWL TRIGGER
// ============================================================
/**
* POST /api/admin/crawl/:dispensaryId
*
* Trigger a crawl for a specific dispensary.
* This is the CANONICAL endpoint for triggering crawls.
*
* Request body (optional):
* - pricingType: 'rec' | 'med' (default: 'rec')
* - useBothModes: boolean (default: true)
*
* Response:
* - On success: crawl result with product counts
* - On 404: dispensary not found
* - On 500: crawl error
*/
router.post('/crawl/:dispensaryId', async (req: Request, res: Response) => {
try {
const { dispensaryId } = req.params;
const { pricingType = 'rec', useBothModes = true } = req.body;
// Fetch the dispensary first
const dispensary = await getDispensaryById(parseInt(dispensaryId, 10));
if (!dispensary) {
return res.status(404).json({ error: 'Dispensary not found' });
}
const result = await crawlSingleDispensary(dispensary, pricingType, { useBothModes });
res.json(result);
} catch (error: any) {
res.status(500).json({ error: error.message });
}
});
export default router;


@@ -1,6 +1,6 @@
import { Router } from 'express';
import { authMiddleware } from '../auth/middleware';
import { query as azQuery } from '../dutchie-az/db/connection';
import { pool } from '../db/pool';
const router = Router();
router.use(authMiddleware);
@@ -10,7 +10,7 @@ router.use(authMiddleware);
router.get('/stats', async (req, res) => {
try {
// All stats in a single query using CTEs
const result = await azQuery(`
const result = await pool.query(`
WITH dispensary_stats AS (
SELECT
COUNT(*) as total,
@@ -93,7 +93,7 @@ router.get('/activity', async (req, res) => {
const { limit = 20 } = req.query;
// Recent crawls from dispensaries (with product counts from dutchie_products)
const scrapesResult = await azQuery(`
const scrapesResult = await pool.query(`
SELECT
d.name,
d.last_crawled_at as last_scraped_at,
@@ -105,7 +105,7 @@ router.get('/activity', async (req, res) => {
`, [limit]);
// Recent products from dutchie_products
const productsResult = await azQuery(`
const productsResult = await pool.query(`
SELECT
p.name,
0 as price,


@@ -11,12 +11,11 @@ const VALID_MENU_TYPES = ['dutchie', 'treez', 'jane', 'weedmaps', 'leafly', 'mea
// Get all dispensaries
router.get('/', async (req, res) => {
try {
const { menu_type } = req.query;
const { menu_type, city, state } = req.query;
let query = `
SELECT
id,
azdhs_id,
name,
company_name,
slug,
@@ -25,36 +24,46 @@ router.get('/', async (req, res) => {
state,
zip,
phone,
email,
website,
dba_name,
google_rating,
google_review_count,
status_line,
azdhs_url,
latitude,
longitude,
menu_url,
menu_type,
menu_provider,
menu_provider_confidence,
scraper_template,
last_menu_scrape,
menu_scrape_status,
platform,
platform_dispensary_id,
product_count,
last_crawl_at,
created_at,
updated_at
FROM dispensaries
`;
const params: any[] = [];
const conditions: string[] = [];
// Filter by menu_type if provided
if (menu_type) {
query += ` WHERE menu_type = $1`;
conditions.push(`menu_type = $${params.length + 1}`);
params.push(menu_type);
}
// Filter by city if provided
if (city) {
conditions.push(`city ILIKE $${params.length + 1}`);
params.push(city);
}
// Filter by state if provided
if (state) {
conditions.push(`state = $${params.length + 1}`);
params.push(state);
}
if (conditions.length > 0) {
query += ` WHERE ${conditions.join(' AND ')}`;
}
query += ` ORDER BY name`;
const result = await pool.query(query, params);
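
The filter handling in this hunk replaces a single hard-coded `WHERE menu_type = $1` with a list of conditions whose placeholder numbers follow the length of `params`. A standalone sketch of the same pattern is shown below. `buildDispensaryWhere` and `DispensaryFilters` are hypothetical names, and the function only builds the clause; it does not run a query.

```typescript
// Hypothetical helper mirroring the condition-building pattern in the route above.
interface DispensaryFilters {
  menu_type?: string;
  city?: string;
  state?: string;
}

function buildDispensaryWhere(filters: DispensaryFilters): { where: string; params: string[] } {
  const conditions: string[] = [];
  const params: string[] = [];

  // Push the value first, then use params.length as the 1-based placeholder index.
  if (filters.menu_type) {
    params.push(filters.menu_type);
    conditions.push(`menu_type = $${params.length}`);
  }
  if (filters.city) {
    params.push(filters.city);
    conditions.push(`city ILIKE $${params.length}`);
  }
  if (filters.state) {
    params.push(filters.state);
    conditions.push(`state = $${params.length}`);
  }

  const where = conditions.length > 0 ? ` WHERE ${conditions.join(' AND ')}` : '';
  return { where, params };
}
```

Because every value goes through a numbered placeholder, adding another filter does not require renumbering the existing ones. `ILIKE` without wildcards gives a case-insensitive exact match, although `%` or `_` in the supplied city would still be treated as pattern characters.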
@@ -82,15 +91,15 @@ router.get('/stats/menu-types', async (req, res) => {
}
});
// Get single dispensary by slug
router.get('/:slug', async (req, res) => {
// Get single dispensary by slug or ID
router.get('/:slugOrId', async (req, res) => {
try {
const { slug } = req.params;
const { slugOrId } = req.params;
const isNumeric = /^\d+$/.test(slugOrId);
const result = await pool.query(`
SELECT
id,
azdhs_id,
name,
company_name,
slug,
@@ -99,29 +108,22 @@ router.get('/:slug', async (req, res) => {
state,
zip,
phone,
email,
website,
dba_name,
google_rating,
google_review_count,
status_line,
azdhs_url,
latitude,
longitude,
menu_url,
menu_type,
menu_provider,
menu_provider_confidence,
scraper_template,
scraper_config,
last_menu_scrape,
menu_scrape_status,
platform,
platform_dispensary_id,
product_count,
last_crawl_at,
raw_metadata,
created_at,
updated_at
FROM dispensaries
WHERE slug = $1
`, [slug]);
WHERE ${isNumeric ? 'id = $1' : 'slug = $1'}
`, [isNumeric ? parseInt(slugOrId, 10) : slugOrId]);
if (result.rows.length === 0) {
return res.status(404).json({ error: 'Dispensary not found' });
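
The `/:slugOrId` route decides between an ID lookup and a slug lookup with a digits-only test. The sketch below expresses that decision as a small pure function; `resolveLookup` and `DispensaryLookup` are hypothetical names, and the slug in the usage is made up.

```typescript
// Hypothetical helper illustrating the slug-or-ID decision in the route above.
type DispensaryLookup =
  | { column: 'id'; value: number }
  | { column: 'slug'; value: string };

function resolveLookup(slugOrId: string): DispensaryLookup {
  // Only strings made up entirely of digits are treated as numeric IDs.
  if (/^\d+$/.test(slugOrId)) {
    return { column: 'id', value: parseInt(slugOrId, 10) };
  }
  return { column: 'slug', value: slugOrId };
}

const byId = resolveLookup('42');
const bySlug = resolveLookup('green-leaf-phoenix');
const mixed = resolveLookup('42-main-street');
```

One consequence of this rule is that a dispensary whose slug consists only of digits could not be fetched by slug through this route, since the value would always be read as an ID.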
@@ -139,17 +141,22 @@ router.put('/:id', async (req, res) => {
try {
const { id } = req.params;
const {
name,
dba_name,
company_name,
website,
phone,
email,
google_rating,
google_review_count,
address,
city,
state,
zip,
latitude,
longitude,
menu_url,
menu_type,
scraper_template,
scraper_config,
menu_scrape_status
platform,
platform_dispensary_id,
slug,
} = req.body;
// Validate menu_type if provided
@@ -162,32 +169,42 @@ router.put('/:id', async (req, res) => {
const result = await pool.query(`
UPDATE dispensaries
SET
dba_name = COALESCE($1, dba_name),
website = COALESCE($2, website),
phone = COALESCE($3, phone),
email = COALESCE($4, email),
google_rating = COALESCE($5, google_rating),
google_review_count = COALESCE($6, google_review_count),
menu_url = COALESCE($7, menu_url),
menu_type = COALESCE($8, menu_type),
scraper_template = COALESCE($9, scraper_template),
scraper_config = COALESCE($10, scraper_config),
menu_scrape_status = COALESCE($11, menu_scrape_status),
name = COALESCE($1, name),
dba_name = COALESCE($2, dba_name),
company_name = COALESCE($3, company_name),
website = COALESCE($4, website),
phone = COALESCE($5, phone),
address = COALESCE($6, address),
city = COALESCE($7, city),
state = COALESCE($8, state),
zip = COALESCE($9, zip),
latitude = COALESCE($10, latitude),
longitude = COALESCE($11, longitude),
menu_url = COALESCE($12, menu_url),
menu_type = COALESCE($13, menu_type),
platform = COALESCE($14, platform),
platform_dispensary_id = COALESCE($15, platform_dispensary_id),
slug = COALESCE($16, slug),
updated_at = CURRENT_TIMESTAMP
WHERE id = $12
WHERE id = $17
RETURNING *
`, [
name,
dba_name,
company_name,
website,
phone,
email,
google_rating,
google_review_count,
address,
city,
state,
zip,
latitude,
longitude,
menu_url,
menu_type,
scraper_template,
scraper_config,
menu_scrape_status,
platform,
platform_dispensary_id,
slug,
id
]);
@@ -468,6 +485,100 @@ router.patch('/:id/menu-type', async (req, res) => {
}
});
// Sync dispensary from discovery (upsert by platform_dispensary_id or slug)
// Used by Alice worker to sync discovered dispensaries to DB
router.post('/sync', async (req, res) => {
try {
const {
name,
slug,
city,
state,
address,
postalCode,
latitude,
longitude,
platformDispensaryId,
menuType,
menuUrl,
platform,
} = req.body;
if (!slug || !platformDispensaryId) {
return res.status(400).json({ error: 'slug and platformDispensaryId are required' });
}
// Try to find existing by platform_dispensary_id first, then by slug
const existingResult = await pool.query(`
SELECT id, name, slug, platform_dispensary_id, menu_type
FROM dispensaries
WHERE platform_dispensary_id = $1
OR (slug = $2 AND platform_dispensary_id IS NULL)
LIMIT 1
`, [platformDispensaryId, slug]);
if (existingResult.rows.length > 0) {
// Update existing
const existing = existingResult.rows[0];
const result = await pool.query(`
UPDATE dispensaries
SET
name = COALESCE($1, name),
slug = COALESCE($2, slug),
city = COALESCE($3, city),
state = COALESCE($4, state),
address = COALESCE($5, address),
zip = COALESCE($6, zip),
latitude = COALESCE($7, latitude),
longitude = COALESCE($8, longitude),
platform_dispensary_id = COALESCE($9, platform_dispensary_id),
menu_type = COALESCE($10, menu_type),
menu_url = COALESCE($11, menu_url),
platform = COALESCE($12, platform),
updated_at = CURRENT_TIMESTAMP
WHERE id = $13
RETURNING id, name, slug, platform_dispensary_id, menu_type
`, [
name, slug, city, state, address, postalCode,
latitude, longitude, platformDispensaryId, menuType, menuUrl, platform,
existing.id
]);
const updated = result.rows[0];
const changed = existing.platform_dispensary_id !== updated.platform_dispensary_id ||
existing.menu_type !== updated.menu_type ||
existing.slug !== updated.slug;
return res.json({
action: changed ? 'updated' : 'matched',
dispensary: updated,
});
}
// Insert new
const result = await pool.query(`
INSERT INTO dispensaries (
name, slug, city, state, address, zip,
latitude, longitude, platform_dispensary_id, menu_type, menu_url, platform,
created_at, updated_at
) VALUES ($1, $2, $3, $4, $5, $6, $7, $8, $9, $10, $11, $12, CURRENT_TIMESTAMP, CURRENT_TIMESTAMP)
RETURNING id, name, slug, platform_dispensary_id, menu_type
`, [
name, slug, city, state, address, postalCode,
latitude, longitude, platformDispensaryId, menuType, menuUrl, platform
]);
return res.json({
action: 'inserted',
dispensary: result.rows[0],
});
} catch (error) {
console.error('Error syncing dispensary:', error);
res.status(500).json({ error: 'Failed to sync dispensary' });
}
});
// Bulk update menu_type for multiple dispensaries
router.post('/bulk/menu-type', async (req, res) => {
try {
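
The `/sync` route above reports one of three actions: `inserted` when no row matched, `updated` when the platform ID, menu type, or slug changed, and `matched` otherwise. A minimal sketch of that classification, separated from the database calls, could look like this. `classifySync` and `SyncRow` are hypothetical names, and the sample rows are made up.

```typescript
// Hypothetical types and helper; only the three compared columns are modeled.
interface SyncRow {
  platform_dispensary_id: string | null;
  menu_type: string | null;
  slug: string;
}

type SyncAction = 'inserted' | 'updated' | 'matched';

// existing is the row found before the write (null when none matched);
// updated is the row returned by the UPDATE or INSERT.
function classifySync(existing: SyncRow | null, updated: SyncRow): SyncAction {
  if (!existing) return 'inserted';
  const changed =
    existing.platform_dispensary_id !== updated.platform_dispensary_id ||
    existing.menu_type !== updated.menu_type ||
    existing.slug !== updated.slug;
  return changed ? 'updated' : 'matched';
}

const before: SyncRow = { platform_dispensary_id: null, menu_type: 'dutchie', slug: 'example-store' };
const after: SyncRow = { platform_dispensary_id: 'abc123', menu_type: 'dutchie', slug: 'example-store' };
```

Changes to other columns such as `name` or `address` are still written by the UPDATE but are reported as `matched`, because only these three fields are compared.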


@@ -14,7 +14,7 @@
*/
import { Router, Request, Response } from 'express';
import { getPool, healthCheck as dbHealthCheck } from '../dutchie-az/db/connection';
import { pool } from '../db/pool';
import { getRedis } from '../lib/redis';
import * as fs from 'fs';
import * as path from 'path';
@@ -119,7 +119,7 @@ async function getApiHealth(): Promise<ApiHealth> {
async function getDbHealth(): Promise<DbHealth> {
const start = Date.now();
try {
const pool = getPool();
// pool imported from db/pool
await pool.query('SELECT 1');
return {
status: 'ok',
@@ -175,7 +175,7 @@ async function getRedisHealth(): Promise<RedisHealth> {
async function getWorkersHealth(): Promise<WorkersHealth> {
try {
const pool = getPool();
// pool imported from db/pool
// Get queue stats from v_queue_stats view or equivalent
const queueStatsResult = await pool.query(`
@@ -248,7 +248,7 @@ async function getWorkersHealth(): Promise<WorkersHealth> {
async function getCrawlsHealth(): Promise<CrawlsHealth> {
try {
const pool = getPool();
// pool imported from db/pool
// Get crawl statistics
const statsResult = await pool.query(`
@@ -299,7 +299,7 @@ async function getCrawlsHealth(): Promise<CrawlsHealth> {
async function getAnalyticsHealth(): Promise<AnalyticsHealth> {
try {
const pool = getPool();
// pool imported from db/pool
// Check analytics/aggregate job runs
const statsResult = await pool.query(`


@@ -9,7 +9,6 @@
import { Router, Request, Response, NextFunction } from 'express';
import { pool } from '../db/pool';
import { query as dutchieAzQuery } from '../dutchie-az/db/connection';
import ipaddr from 'ipaddr.js';
import {
ApiScope,
@@ -140,7 +139,7 @@ async function validatePublicApiKey(
try {
// Query WordPress permissions table with store info
const result = await pool.query<ApiKeyPermission>(`
const result = await pool.query(`
SELECT
p.id,
p.user_name,
@@ -198,7 +197,7 @@ async function validatePublicApiKey(
// Resolve the dutchie_az store for wordpress keys
if (permission.key_type === 'wordpress' && permission.store_name) {
const storeResult = await dutchieAzQuery<{ id: number }>(`
const storeResult = await pool.query(`
SELECT id FROM dispensaries
WHERE LOWER(TRIM(name)) = LOWER(TRIM($1))
OR LOWER(TRIM(name)) LIKE LOWER(TRIM($1)) || '%'
@@ -439,7 +438,7 @@ router.get('/products', async (req: PublicApiRequest, res: Response) => {
// Query products with latest snapshot data
// Note: Price filters use HAVING clause since they reference the snapshot subquery
const { rows: products } = await dutchieAzQuery(`
const { rows: products } = await pool.query(`
SELECT
p.id,
p.dispensary_id,
@@ -482,7 +481,7 @@ router.get('/products', async (req: PublicApiRequest, res: Response) => {
`, params);
// Get total count for pagination (include price filters if specified)
const { rows: countRows } = await dutchieAzQuery(`
const { rows: countRows } = await pool.query(`
SELECT COUNT(*) as total FROM dutchie_products p
LEFT JOIN LATERAL (
SELECT rec_min_price_cents, special FROM dutchie_product_snapshots
@@ -567,7 +566,7 @@ router.get('/products/:id', async (req: PublicApiRequest, res: Response) => {
const { id } = req.params;
// Get product (without dispensary filter to check access afterward)
const { rows: products } = await dutchieAzQuery(`
const { rows: products } = await pool.query(`
SELECT
p.*,
s.rec_min_price_cents,
@@ -677,7 +676,7 @@ router.get('/categories', async (req: PublicApiRequest, res: Response) => {
});
}
const { rows: categories } = await dutchieAzQuery(`
const { rows: categories } = await pool.query(`
SELECT
type as category,
subcategory,
@@ -733,7 +732,7 @@ router.get('/brands', async (req: PublicApiRequest, res: Response) => {
});
}
const { rows: brands } = await dutchieAzQuery(`
const { rows: brands } = await pool.query(`
SELECT
brand_name as brand,
COUNT(*) as product_count,
@@ -796,7 +795,7 @@ router.get('/specials', async (req: PublicApiRequest, res: Response) => {
params.push(limitNum, offsetNum);
const { rows: products } = await dutchieAzQuery(`
const { rows: products } = await pool.query(`
SELECT
p.id,
p.dispensary_id,
@@ -828,7 +827,7 @@ router.get('/specials', async (req: PublicApiRequest, res: Response) => {
// Get total count
const countParams = params.slice(0, -2);
const { rows: countRows } = await dutchieAzQuery(`
const { rows: countRows } = await pool.query(`
SELECT COUNT(*) as total
FROM dutchie_products p
INNER JOIN LATERAL (
@@ -906,7 +905,7 @@ router.get('/dispensaries', async (req: PublicApiRequest, res: Response) => {
}
// Get single dispensary for wordpress key
const { rows: dispensaries } = await dutchieAzQuery(`
const { rows: dispensaries } = await pool.query(`
SELECT
d.id,
d.name,
@@ -1013,7 +1012,7 @@ router.get('/dispensaries', async (req: PublicApiRequest, res: Response) => {
const limitNum = Math.min(parseInt(limit as string, 10) || 100, 500);
const offsetNum = parseInt(offset as string, 10) || 0;
const { rows: dispensaries } = await dutchieAzQuery(`
const { rows: dispensaries } = await pool.query(`
SELECT
d.id,
d.name,
@@ -1051,7 +1050,7 @@ router.get('/dispensaries', async (req: PublicApiRequest, res: Response) => {
LIMIT $${paramIndex} OFFSET $${paramIndex + 1}
`, [...params, limitNum, offsetNum]);
const { rows: countRows } = await dutchieAzQuery(`
const { rows: countRows } = await pool.query(`
SELECT COUNT(*) as total
FROM dispensaries d
LEFT JOIN LATERAL (
@@ -1178,7 +1177,7 @@ router.get('/search', async (req: PublicApiRequest, res: Response) => {
params.push(limitNum, offsetNum);
const { rows: products } = await dutchieAzQuery(`
const { rows: products } = await pool.query(`
SELECT
p.id,
p.dispensary_id,
@@ -1221,7 +1220,7 @@ router.get('/search', async (req: PublicApiRequest, res: Response) => {
// Count query (without relevance param)
const countParams = params.slice(0, paramIndex - 3); // Remove relevance, limit, offset
const { rows: countRows } = await dutchieAzQuery(`
const { rows: countRows } = await pool.query(`
SELECT COUNT(*) as total
FROM dutchie_products p
${whereClause}
@@ -1302,7 +1301,7 @@ router.get('/menu', async (req: PublicApiRequest, res: Response) => {
}
// Get counts by category
const { rows: categoryCounts } = await dutchieAzQuery(`
const { rows: categoryCounts } = await pool.query(`
SELECT
type as category,
COUNT(*) as total,
@@ -1314,7 +1313,7 @@ router.get('/menu', async (req: PublicApiRequest, res: Response) => {
`, params);
// Get overall stats
const { rows: stats } = await dutchieAzQuery(`
const { rows: stats } = await pool.query(`
SELECT
COUNT(*) as total_products,
COUNT(*) FILTER (WHERE stock_status = 'in_stock') as in_stock_count,
@@ -1326,7 +1325,7 @@ router.get('/menu', async (req: PublicApiRequest, res: Response) => {
`, params);
// Get specials count
const { rows: specialsCount } = await dutchieAzQuery(`
const { rows: specialsCount } = await pool.query(`
SELECT COUNT(*) as count
FROM dutchie_products p
INNER JOIN LATERAL (


@@ -1,986 +0,0 @@
import { Router, Request, Response } from 'express';
import { authMiddleware, requireRole } from '../auth/middleware';
import {
getGlobalSchedule,
updateGlobalSchedule,
getStoreScheduleStatuses,
getStoreSchedule,
updateStoreSchedule,
getAllRecentJobs,
getRecentJobs,
triggerManualCrawl,
triggerAllStoresCrawl,
cancelJob,
restartCrawlScheduler,
setSchedulerMode,
getSchedulerMode,
} from '../services/crawl-scheduler';
import {
runStoreCrawlOrchestrator,
runBatchOrchestrator,
getStoresDueForOrchestration,
} from '../services/store-crawl-orchestrator';
import {
runDispensaryOrchestrator,
runBatchDispensaryOrchestrator,
getDispensariesDueForOrchestration,
ensureAllDispensariesHaveSchedules,
} from '../services/dispensary-orchestrator';
import { pool } from '../db/pool';
import { resolveDispensaryId } from '../dutchie-az/services/graphql-client';
const router = Router();
router.use(authMiddleware);
// ============================================
// Global Schedule Endpoints
// ============================================
/**
* GET /api/schedule/global
* Get global schedule settings
*/
router.get('/global', async (req: Request, res: Response) => {
try {
const schedules = await getGlobalSchedule();
res.json({ schedules });
} catch (error: any) {
console.error('Error fetching global schedule:', error);
res.status(500).json({ error: 'Failed to fetch global schedule' });
}
});
/**
* PUT /api/schedule/global/:type
* Update global schedule setting
*/
router.put('/global/:type', requireRole('superadmin', 'admin'), async (req: Request, res: Response) => {
try {
const { type } = req.params;
const { enabled, interval_hours, run_time } = req.body;
if (type !== 'global_interval' && type !== 'daily_special') {
return res.status(400).json({ error: 'Invalid schedule type' });
}
const schedule = await updateGlobalSchedule(type, {
enabled,
interval_hours,
run_time
});
// Restart scheduler to apply changes
await restartCrawlScheduler();
res.json({ schedule, message: 'Schedule updated and scheduler restarted' });
} catch (error: any) {
console.error('Error updating global schedule:', error);
res.status(500).json({ error: 'Failed to update global schedule' });
}
});
// ============================================
// Store Schedule Endpoints
// ============================================
/**
* GET /api/schedule/stores
* Get all store schedule statuses
*/
router.get('/stores', async (req: Request, res: Response) => {
try {
const stores = await getStoreScheduleStatuses();
res.json({ stores });
} catch (error: any) {
console.error('Error fetching store schedules:', error);
res.status(500).json({ error: 'Failed to fetch store schedules' });
}
});
/**
* GET /api/schedule/stores/:storeId
* Get schedule for a specific store
*/
router.get('/stores/:storeId', async (req: Request, res: Response) => {
try {
const storeId = parseInt(req.params.storeId);
if (isNaN(storeId)) {
return res.status(400).json({ error: 'Invalid store ID' });
}
const schedule = await getStoreSchedule(storeId);
res.json({ schedule });
} catch (error: any) {
console.error('Error fetching store schedule:', error);
res.status(500).json({ error: 'Failed to fetch store schedule' });
}
});
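// Note on the ID checks in these handlers: `parseInt(req.params.storeId)` stops at the
// first non-digit, so a path segment such as "12abc" parses as 12 and passes the isNaN guard.
// A minimal sketch of a stricter parser follows; `parseId` is a hypothetical helper,
// not something defined in the removed file.

```typescript
// Hypothetical helper: accepts only strings made entirely of decimal digits.
function parseId(raw: string): number | null {
  if (!/^\d+$/.test(raw)) {
    return null; // rejects "", "12abc", "-3", "1.5"
  }
  const id = Number(raw);
  // Guard against digit strings too large to represent exactly.
  return Number.isSafeInteger(id) ? id : null;
}
```

// A handler could then return 400 whenever `parseId(req.params.storeId)` is null.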
/**
* PUT /api/schedule/stores/:storeId
* Update schedule for a specific store
*/
router.put('/stores/:storeId', requireRole('superadmin', 'admin'), async (req: Request, res: Response) => {
try {
const storeId = parseInt(req.params.storeId);
if (isNaN(storeId)) {
return res.status(400).json({ error: 'Invalid store ID' });
}
const {
enabled,
interval_hours,
daily_special_enabled,
daily_special_time,
priority
} = req.body;
const schedule = await updateStoreSchedule(storeId, {
enabled,
interval_hours,
daily_special_enabled,
daily_special_time,
priority
});
res.json({ schedule });
} catch (error: any) {
console.error('Error updating store schedule:', error);
res.status(500).json({ error: 'Failed to update store schedule' });
}
});
// ============================================
// Job Queue Endpoints
// ============================================
/**
* GET /api/schedule/jobs
* Get recent jobs
*/
router.get('/jobs', async (req: Request, res: Response) => {
try {
const limit = parseInt(req.query.limit as string) || 50;
const jobs = await getAllRecentJobs(Math.min(limit, 200));
res.json({ jobs });
} catch (error: any) {
console.error('Error fetching jobs:', error);
res.status(500).json({ error: 'Failed to fetch jobs' });
}
});
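// The `limit` handling above caps large values with Math.min, but a negative value such as
// `?limit=-5` is truthy and smaller than the cap, so it passes through unchanged.
// The sketch below shows one way to clamp both ends; `clampLimit` and its parameters are
// assumptions for illustration only.

```typescript
// Hypothetical helper: parse a query-string limit and keep it within [1, max].
function clampLimit(raw: unknown, fallback: number, max: number): number {
  const parsed = typeof raw === 'string' ? parseInt(raw, 10) : NaN;
  if (!Number.isFinite(parsed) || parsed < 1) {
    return fallback; // missing, non-numeric, zero, or negative input
  }
  return Math.min(parsed, max);
}
```

// Usage under these assumptions: `clampLimit(req.query.limit, 50, 200)`.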
/**
* GET /api/schedule/jobs/store/:storeId
* Get recent jobs for a specific store
*/
router.get('/jobs/store/:storeId', async (req: Request, res: Response) => {
try {
const storeId = parseInt(req.params.storeId);
if (isNaN(storeId)) {
return res.status(400).json({ error: 'Invalid store ID' });
}
const limit = parseInt(req.query.limit as string) || 10;
const jobs = await getRecentJobs(storeId, Math.min(limit, 100));
res.json({ jobs });
} catch (error: any) {
console.error('Error fetching store jobs:', error);
res.status(500).json({ error: 'Failed to fetch store jobs' });
}
});
/**
* POST /api/schedule/jobs/:jobId/cancel
* Cancel a pending job
*/
router.post('/jobs/:jobId/cancel', requireRole('superadmin', 'admin'), async (req: Request, res: Response) => {
try {
const jobId = parseInt(req.params.jobId);
if (isNaN(jobId)) {
return res.status(400).json({ error: 'Invalid job ID' });
}
const cancelled = await cancelJob(jobId);
if (cancelled) {
res.json({ success: true, message: 'Job cancelled' });
} else {
res.status(400).json({ error: 'Job could not be cancelled (may not be pending)' });
}
} catch (error: any) {
console.error('Error cancelling job:', error);
res.status(500).json({ error: 'Failed to cancel job' });
}
});
// ============================================
// Manual Trigger Endpoints
// ============================================
/**
* POST /api/schedule/trigger/store/:storeId
* Manually trigger orchestrated crawl for a specific store
* Uses the intelligent orchestrator which:
* - Checks provider detection status
* - Runs detection if needed
* - Queues appropriate crawl type (production/sandbox)
*/
router.post('/trigger/store/:storeId', requireRole('superadmin', 'admin'), async (req: Request, res: Response) => {
try {
const storeId = parseInt(req.params.storeId);
if (isNaN(storeId)) {
return res.status(400).json({ error: 'Invalid store ID' });
}
// Use the orchestrator instead of simple triggerManualCrawl
const result = await runStoreCrawlOrchestrator(storeId);
res.json({
result,
message: result.summary,
success: result.status === 'success' || result.status === 'sandbox_only',
});
} catch (error: any) {
console.error('Error triggering orchestrated crawl:', error);
res.status(500).json({ error: 'Failed to trigger crawl' });
}
});
/**
* POST /api/schedule/trigger/store/:storeId/legacy
* Legacy: Simple job queue trigger (no orchestration)
*/
router.post('/trigger/store/:storeId/legacy', requireRole('superadmin', 'admin'), async (req: Request, res: Response) => {
try {
const storeId = parseInt(req.params.storeId);
if (isNaN(storeId)) {
return res.status(400).json({ error: 'Invalid store ID' });
}
const job = await triggerManualCrawl(storeId);
res.json({ job, message: 'Crawl job created' });
} catch (error: any) {
console.error('Error triggering manual crawl:', error);
res.status(500).json({ error: 'Failed to trigger crawl' });
}
});
/**
* POST /api/schedule/trigger/all
* Manually trigger crawls for all stores
*/
router.post('/trigger/all', requireRole('superadmin', 'admin'), async (req: Request, res: Response) => {
try {
const jobsCreated = await triggerAllStoresCrawl();
res.json({ jobs_created: jobsCreated, message: `Created ${jobsCreated} crawl jobs` });
} catch (error: any) {
console.error('Error triggering all crawls:', error);
res.status(500).json({ error: 'Failed to trigger crawls' });
}
});
/**
* POST /api/schedule/restart
* Restart the scheduler
*/
router.post('/restart', requireRole('superadmin', 'admin'), async (req: Request, res: Response) => {
try {
await restartCrawlScheduler();
res.json({ message: 'Scheduler restarted', mode: getSchedulerMode() });
} catch (error: any) {
console.error('Error restarting scheduler:', error);
res.status(500).json({ error: 'Failed to restart scheduler' });
}
});
// ============================================
// Scheduler Mode Endpoints
// ============================================
/**
* GET /api/schedule/mode
* Get current scheduler mode
*/
router.get('/mode', async (req: Request, res: Response) => {
try {
const mode = getSchedulerMode();
res.json({ mode });
} catch (error: any) {
console.error('Error getting scheduler mode:', error);
res.status(500).json({ error: 'Failed to get scheduler mode' });
}
});
/**
* PUT /api/schedule/mode
* Set scheduler mode (legacy or orchestrator)
*/
router.put('/mode', requireRole('superadmin', 'admin'), async (req: Request, res: Response) => {
try {
const { mode } = req.body;
if (mode !== 'legacy' && mode !== 'orchestrator') {
return res.status(400).json({ error: 'Invalid mode. Must be "legacy" or "orchestrator"' });
}
setSchedulerMode(mode);
// Restart scheduler with new mode
await restartCrawlScheduler();
res.json({ mode, message: `Scheduler mode set to ${mode} and restarted` });
} catch (error: any) {
console.error('Error setting scheduler mode:', error);
res.status(500).json({ error: 'Failed to set scheduler mode' });
}
});
/**
* GET /api/schedule/due
* Get stores that are due for orchestration
*/
router.get('/due', async (req: Request, res: Response) => {
try {
const limit = parseInt(req.query.limit as string) || 10;
const storeIds = await getStoresDueForOrchestration(Math.min(limit, 50));
res.json({ stores_due: storeIds, count: storeIds.length });
} catch (error: any) {
console.error('Error getting stores due for orchestration:', error);
res.status(500).json({ error: 'Failed to get stores due' });
}
});
// ============================================
// Dispensary Schedule Endpoints (NEW - dispensary-centric)
// ============================================
/**
* GET /api/schedule/dispensaries
* Get all dispensary schedule statuses with optional filters
* Query params:
* - state: filter by state (e.g., 'AZ')
* - search: search by name or slug
*/
router.get('/dispensaries', async (req: Request, res: Response) => {
try {
const { state, search } = req.query;
// Build dynamic query with optional filters
const conditions: string[] = [];
const params: any[] = [];
let paramIndex = 1;
if (state) {
conditions.push(`d.state = $${paramIndex}`);
params.push(state);
paramIndex++;
}
if (search) {
conditions.push(`(d.name ILIKE $${paramIndex} OR d.slug ILIKE $${paramIndex})`);
params.push(`%${search}%`);
paramIndex++;
}
const whereClause = conditions.length > 0 ? `WHERE ${conditions.join(' AND ')}` : '';
const query = `
SELECT
d.id AS dispensary_id,
d.name AS dispensary_name,
d.slug AS dispensary_slug,
d.city,
d.state,
d.menu_url,
d.menu_type,
d.platform_dispensary_id,
d.scrape_enabled,
d.last_crawl_at,
d.crawl_status,
d.product_crawler_mode,
d.product_provider,
cs.interval_minutes,
cs.is_active,
cs.priority,
cs.last_run_at,
cs.next_run_at,
cs.last_status AS schedule_last_status,
cs.last_error AS schedule_last_error,
cs.consecutive_failures,
j.id AS latest_job_id,
j.status AS latest_job_status,
j.job_type AS latest_job_type,
j.started_at AS latest_job_started,
j.completed_at AS latest_job_completed,
j.products_found AS latest_products_found,
j.products_new AS latest_products_created,
j.products_updated AS latest_products_updated,
j.error_message AS latest_job_error,
CASE
WHEN d.menu_type = 'dutchie' AND d.platform_dispensary_id IS NOT NULL THEN true
ELSE false
END AS can_crawl,
CASE
WHEN d.menu_type IS NULL OR d.menu_type = 'unknown' THEN 'menu_type not detected'
WHEN d.menu_type != 'dutchie' THEN 'not dutchie platform'
WHEN d.platform_dispensary_id IS NULL THEN 'platform ID not resolved'
WHEN d.scrape_enabled = false THEN 'scraping disabled'
ELSE 'ready'
END AS schedule_status_reason
FROM public.dispensaries d
LEFT JOIN public.dispensary_crawl_schedule cs ON cs.dispensary_id = d.id
LEFT JOIN LATERAL (
SELECT *
FROM public.dispensary_crawl_jobs dj
WHERE dj.dispensary_id = d.id
ORDER BY dj.created_at DESC
LIMIT 1
) j ON true
${whereClause}
ORDER BY cs.priority DESC NULLS LAST, d.name
`;
const result = await pool.query(query, params);
res.json({ dispensaries: result.rows });
} catch (error: any) {
console.error('Error fetching dispensary schedules:', error);
res.status(500).json({ error: 'Failed to fetch dispensary schedules' });
}
});
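// The optional-filter logic in the handler above can be read in isolation as a small pure
// function. This is a sketch only: `buildDispensaryFilter` and its types are hypothetical,
// and it assumes both query values arrive as plain strings.

```typescript
interface DispensaryFilter {
  state?: string;
  search?: string;
}

interface BuiltFilter {
  whereClause: string;
  params: string[];
}

// Builds a parameterized WHERE clause; placeholders are numbered from params.length,
// so no separate paramIndex counter has to be kept in sync.
function buildDispensaryFilter(filter: DispensaryFilter): BuiltFilter {
  const conditions: string[] = [];
  const params: string[] = [];

  if (filter.state) {
    params.push(filter.state);
    conditions.push(`d.state = $${params.length}`);
  }
  if (filter.search) {
    params.push(`%${filter.search}%`);
    // The same placeholder is reused for both columns, as in the handler.
    conditions.push(`(d.name ILIKE $${params.length} OR d.slug ILIKE $${params.length})`);
  }

  const whereClause = conditions.length > 0 ? `WHERE ${conditions.join(' AND ')}` : '';
  return { whereClause, params };
}
```

// User input only ever reaches the params array, never the SQL text itself.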
/**
* GET /api/schedule/dispensaries/:id
* Get schedule for a specific dispensary
*/
router.get('/dispensaries/:id', async (req: Request, res: Response) => {
try {
const dispensaryId = parseInt(req.params.id);
if (isNaN(dispensaryId)) {
return res.status(400).json({ error: 'Invalid dispensary ID' });
}
const result = await pool.query(`
SELECT * FROM dispensary_crawl_status
WHERE dispensary_id = $1
`, [dispensaryId]);
if (result.rows.length === 0) {
return res.status(404).json({ error: 'Dispensary not found' });
}
res.json({ schedule: result.rows[0] });
} catch (error: any) {
console.error('Error fetching dispensary schedule:', error);
res.status(500).json({ error: 'Failed to fetch dispensary schedule' });
}
});
/**
* PUT /api/schedule/dispensaries/:id
* Update schedule for a specific dispensary
*/
router.put('/dispensaries/:id', requireRole('superadmin', 'admin'), async (req: Request, res: Response) => {
try {
const dispensaryId = parseInt(req.params.id);
if (isNaN(dispensaryId)) {
return res.status(400).json({ error: 'Invalid dispensary ID' });
}
const {
is_active,
interval_minutes,
priority
} = req.body;
// Upsert schedule
const result = await pool.query(`
INSERT INTO dispensary_crawl_schedule (dispensary_id, is_active, interval_minutes, priority)
VALUES ($1, COALESCE($2, TRUE), COALESCE($3, 240), COALESCE($4, 0))
ON CONFLICT (dispensary_id) DO UPDATE SET
is_active = COALESCE($2, dispensary_crawl_schedule.is_active),
interval_minutes = COALESCE($3, dispensary_crawl_schedule.interval_minutes),
priority = COALESCE($4, dispensary_crawl_schedule.priority),
updated_at = NOW()
RETURNING *
`, [dispensaryId, is_active, interval_minutes, priority]);
res.json({ schedule: result.rows[0] });
} catch (error: any) {
console.error('Error updating dispensary schedule:', error);
res.status(500).json({ error: 'Failed to update dispensary schedule' });
}
});
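// The COALESCE-based upsert above means "use the supplied value if present, otherwise keep
// the stored one (or a default on first insert)". The sketch below models that fallback
// in plain TypeScript; it is not the SQL itself, and `applySchedulePatch` plus the types
// are hypothetical names chosen for illustration.

```typescript
interface Schedule {
  is_active: boolean;
  interval_minutes: number;
  priority: number;
}

type SchedulePatch = { [K in keyof Schedule]?: Schedule[K] | null };

// Defaults taken from the VALUES clause of the query above.
const INSERT_DEFAULTS: Schedule = { is_active: true, interval_minutes: 240, priority: 0 };

// `??` falls back only on null/undefined, which mirrors COALESCE: an explicit
// `false` or `0` in the patch is kept rather than replaced.
function applySchedulePatch(existing: Schedule | undefined, patch: SchedulePatch): Schedule {
  const base = existing ?? INSERT_DEFAULTS;
  return {
    is_active: patch.is_active ?? base.is_active,
    interval_minutes: patch.interval_minutes ?? base.interval_minutes,
    priority: patch.priority ?? base.priority,
  };
}
```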
/**
* GET /api/schedule/dispensary-jobs
* Get recent dispensary crawl jobs
*/
router.get('/dispensary-jobs', async (req: Request, res: Response) => {
try {
const limit = parseInt(req.query.limit as string) || 50;
const result = await pool.query(`
SELECT dcj.*, d.name as dispensary_name
FROM dispensary_crawl_jobs dcj
JOIN dispensaries d ON d.id = dcj.dispensary_id
ORDER BY dcj.created_at DESC
LIMIT $1
`, [Math.min(limit, 200)]);
res.json({ jobs: result.rows });
} catch (error: any) {
console.error('Error fetching dispensary jobs:', error);
res.status(500).json({ error: 'Failed to fetch dispensary jobs' });
}
});
/**
* GET /api/schedule/dispensary-jobs/:dispensaryId
* Get recent jobs for a specific dispensary
*/
router.get('/dispensary-jobs/:dispensaryId', async (req: Request, res: Response) => {
try {
const dispensaryId = parseInt(req.params.dispensaryId);
if (isNaN(dispensaryId)) {
return res.status(400).json({ error: 'Invalid dispensary ID' });
}
const limit = parseInt(req.query.limit as string) || 10;
const result = await pool.query(`
SELECT dcj.*, d.name as dispensary_name
FROM dispensary_crawl_jobs dcj
JOIN dispensaries d ON d.id = dcj.dispensary_id
WHERE dcj.dispensary_id = $1
ORDER BY dcj.created_at DESC
LIMIT $2
`, [dispensaryId, Math.min(limit, 100)]);
res.json({ jobs: result.rows });
} catch (error: any) {
console.error('Error fetching dispensary jobs:', error);
res.status(500).json({ error: 'Failed to fetch dispensary jobs' });
}
});
/**
* POST /api/schedule/trigger/dispensary/:id
* Trigger orchestrator for a specific dispensary (Run Now button)
*/
router.post('/trigger/dispensary/:id', requireRole('superadmin', 'admin'), async (req: Request, res: Response) => {
try {
const dispensaryId = parseInt(req.params.id);
if (isNaN(dispensaryId)) {
return res.status(400).json({ error: 'Invalid dispensary ID' });
}
// Run the dispensary orchestrator
const result = await runDispensaryOrchestrator(dispensaryId);
res.json({
result,
message: result.summary,
success: result.status === 'success' || result.status === 'sandbox_only' || result.status === 'detection_only',
});
} catch (error: any) {
console.error('Error triggering dispensary orchestrator:', error);
res.status(500).json({ error: 'Failed to trigger orchestrator' });
}
});
/**
* POST /api/schedule/trigger/dispensaries/batch
* Trigger orchestrator for multiple dispensaries
*/
router.post('/trigger/dispensaries/batch', requireRole('superadmin', 'admin'), async (req: Request, res: Response) => {
try {
const { dispensary_ids, concurrency } = req.body;
if (!Array.isArray(dispensary_ids) || dispensary_ids.length === 0) {
return res.status(400).json({ error: 'dispensary_ids must be a non-empty array' });
}
const results = await runBatchDispensaryOrchestrator(
dispensary_ids,
concurrency || 3
);
const summary = {
total: results.length,
success: results.filter(r => r.status === 'success').length,
sandbox_only: results.filter(r => r.status === 'sandbox_only').length,
detection_only: results.filter(r => r.status === 'detection_only').length,
error: results.filter(r => r.status === 'error').length,
};
res.json({ results, summary });
} catch (error: any) {
console.error('Error triggering batch orchestrator:', error);
res.status(500).json({ error: 'Failed to trigger batch orchestrator' });
}
});
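// The batch summary above filters the results array once per status. A single-pass
// version is sketched below. The status union is inferred from the filters in the
// handler; the real orchestrator result type is not shown here, so treat
// `OrchestratorStatus` and `summarizeBatch` as assumptions.

```typescript
type OrchestratorStatus = 'success' | 'sandbox_only' | 'detection_only' | 'error';

interface BatchSummary {
  total: number;
  success: number;
  sandbox_only: number;
  detection_only: number;
  error: number;
}

// Counts each status in one pass over the results.
function summarizeBatch(results: ReadonlyArray<{ status: OrchestratorStatus }>): BatchSummary {
  const summary: BatchSummary = {
    total: results.length,
    success: 0,
    sandbox_only: 0,
    detection_only: 0,
    error: 0,
  };
  for (const { status } of results) {
    summary[status] += 1;
  }
  return summary;
}
```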
/**
* GET /api/schedule/dispensary-due
* Get dispensaries that are due for orchestration
*/
router.get('/dispensary-due', async (req: Request, res: Response) => {
try {
const limit = parseInt(req.query.limit as string) || 10;
const dispensaryIds = await getDispensariesDueForOrchestration(Math.min(limit, 50));
// Get details for the due dispensaries
if (dispensaryIds.length > 0) {
const details = await pool.query(`
SELECT d.id, d.name, d.product_provider, d.product_crawler_mode,
dcs.next_run_at, dcs.last_status, dcs.priority
FROM dispensaries d
LEFT JOIN dispensary_crawl_schedule dcs ON dcs.dispensary_id = d.id
WHERE d.id = ANY($1)
ORDER BY COALESCE(dcs.priority, 0) DESC, dcs.last_run_at ASC NULLS FIRST
`, [dispensaryIds]);
res.json({ dispensaries_due: details.rows, count: dispensaryIds.length });
} else {
res.json({ dispensaries_due: [], count: 0 });
}
} catch (error: any) {
console.error('Error getting dispensaries due for orchestration:', error);
res.status(500).json({ error: 'Failed to get dispensaries due' });
}
});
/**
* POST /api/schedule/dispensaries/bootstrap
* Ensure all dispensaries have schedule entries
*/
router.post('/dispensaries/bootstrap', requireRole('superadmin', 'admin'), async (req: Request, res: Response) => {
try {
const { interval_minutes } = req.body;
const result = await ensureAllDispensariesHaveSchedules(interval_minutes || 240);
res.json({
message: `Created ${result.created} new schedules, ${result.existing} already existed`,
created: result.created,
existing: result.existing,
});
} catch (error: any) {
console.error('Error bootstrapping dispensary schedules:', error);
res.status(500).json({ error: 'Failed to bootstrap schedules' });
}
});
// ============================================
// Platform ID & Menu Type Detection Endpoints
// ============================================
/**
* POST /api/schedule/dispensaries/:id/resolve-platform-id
* Resolve the Dutchie platform_dispensary_id from menu_url slug
*/
router.post('/dispensaries/:id/resolve-platform-id', requireRole('superadmin', 'admin'), async (req: Request, res: Response) => {
try {
const dispensaryId = parseInt(req.params.id);
if (isNaN(dispensaryId)) {
return res.status(400).json({ error: 'Invalid dispensary ID' });
}
// Get dispensary info
const dispensaryResult = await pool.query(`
SELECT id, name, slug, menu_url, menu_type, platform_dispensary_id
FROM dispensaries WHERE id = $1
`, [dispensaryId]);
if (dispensaryResult.rows.length === 0) {
return res.status(404).json({ error: 'Dispensary not found' });
}
const dispensary = dispensaryResult.rows[0];
// Check if already resolved
if (dispensary.platform_dispensary_id) {
return res.json({
success: true,
message: 'Platform ID already resolved',
platform_dispensary_id: dispensary.platform_dispensary_id,
already_resolved: true
});
}
// Extract slug from menu_url for Dutchie URLs
let slugToResolve = dispensary.slug;
if (dispensary.menu_url) {
// Match embedded-menu or dispensary URLs
const match = dispensary.menu_url.match(/(?:embedded-menu|dispensar(?:y|ies))\/([^\/\?#]+)/i);
if (match) {
slugToResolve = match[1];
}
}
if (!slugToResolve) {
return res.status(400).json({
error: 'No slug available to resolve platform ID',
menu_url: dispensary.menu_url
});
}
console.log(`[Schedule] Resolving platform ID for ${dispensary.name} using slug: ${slugToResolve}`);
// Resolve platform ID using GraphQL client
const platformId = await resolveDispensaryId(slugToResolve);
if (!platformId) {
return res.status(404).json({
error: 'Could not resolve platform ID',
slug_tried: slugToResolve,
message: 'The dispensary might not be on Dutchie or the slug is incorrect'
});
}
// Update the dispensary with resolved platform ID
await pool.query(`
UPDATE dispensaries
SET platform_dispensary_id = $1,
menu_type = COALESCE(menu_type, 'dutchie'),
updated_at = NOW()
WHERE id = $2
`, [platformId, dispensaryId]);
res.json({
success: true,
platform_dispensary_id: platformId,
slug_resolved: slugToResolve,
message: `Platform ID resolved: ${platformId}`
});
} catch (error: any) {
console.error('Error resolving platform ID:', error);
res.status(500).json({ error: 'Failed to resolve platform ID', details: error.message });
}
});
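// The slug-extraction step in the handler above, pulled out as a standalone sketch.
// The regular expression is copied from the handler; `extractDutchieSlug` and the
// example URLs in the usage note are hypothetical.

```typescript
// Returns the path segment after "embedded-menu/", "dispensary/" or "dispensaries/",
// falling back to the stored slug when the URL is missing or does not match.
function extractDutchieSlug(menuUrl: string | null, fallbackSlug: string | null): string | null {
  if (menuUrl) {
    const match = menuUrl.match(/(?:embedded-menu|dispensar(?:y|ies))\/([^\/\?#]+)/i);
    if (match) {
      return match[1];
    }
  }
  return fallbackSlug || null;
}
```

// For example, 'https://dutchie.com/embedded-menu/green-leaf?x=1' yields 'green-leaf',
// because the capture group stops at '/', '?' or '#'.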
/**
* POST /api/schedule/dispensaries/:id/detect-menu-type
* Detect menu type from menu_url
*/
router.post('/dispensaries/:id/detect-menu-type', requireRole('superadmin', 'admin'), async (req: Request, res: Response) => {
try {
const dispensaryId = parseInt(req.params.id);
if (isNaN(dispensaryId)) {
return res.status(400).json({ error: 'Invalid dispensary ID' });
}
// Get dispensary info
const dispensaryResult = await pool.query(`
SELECT id, name, menu_url, website FROM dispensaries WHERE id = $1
`, [dispensaryId]);
if (dispensaryResult.rows.length === 0) {
return res.status(404).json({ error: 'Dispensary not found' });
}
const dispensary = dispensaryResult.rows[0];
const urlToCheck = dispensary.menu_url || dispensary.website;
if (!urlToCheck) {
return res.status(400).json({ error: 'No menu_url or website to detect from' });
}
// Detect menu type from URL patterns
let detectedType: string = 'unknown';
if (urlToCheck.includes('dutchie.com') || urlToCheck.includes('embedded-menu')) {
detectedType = 'dutchie';
} else if (urlToCheck.includes('iheartjane.com') || urlToCheck.includes('jane.co')) {
detectedType = 'jane';
} else if (urlToCheck.includes('weedmaps.com')) {
detectedType = 'weedmaps';
} else if (urlToCheck.includes('leafly.com')) {
detectedType = 'leafly';
} else if (urlToCheck.includes('treez.io') || urlToCheck.includes('treez.co')) {
detectedType = 'treez';
} else if (urlToCheck.includes('meadow.com')) {
detectedType = 'meadow';
} else if (urlToCheck.includes('blaze.me') || urlToCheck.includes('blazepay')) {
detectedType = 'blaze';
} else if (urlToCheck.includes('flowhub.com')) {
detectedType = 'flowhub';
} else if (urlToCheck.includes('dispense.app')) {
detectedType = 'dispense';
} else if (urlToCheck.includes('covasoft.com')) {
detectedType = 'cova';
}
// Update menu_type
await pool.query(`
UPDATE dispensaries
SET menu_type = $1, updated_at = NOW()
WHERE id = $2
`, [detectedType, dispensaryId]);
res.json({
success: true,
menu_type: detectedType,
url_checked: urlToCheck,
message: `Menu type detected: ${detectedType}`
});
} catch (error: any) {
console.error('Error detecting menu type:', error);
res.status(500).json({ error: 'Failed to detect menu type' });
}
});
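// The if/else chain above is repeated verbatim in the refresh-detection handler.
// One way to share it is a table-driven lookup, sketched below. The substrings and their
// order are taken from the handler; `MENU_TYPE_PATTERNS` and `detectMenuType` are
// hypothetical names.

```typescript
// Ordered list: the first entry with a matching substring wins, as in the if/else chain.
const MENU_TYPE_PATTERNS: ReadonlyArray<readonly [string, readonly string[]]> = [
  ['dutchie', ['dutchie.com', 'embedded-menu']],
  ['jane', ['iheartjane.com', 'jane.co']],
  ['weedmaps', ['weedmaps.com']],
  ['leafly', ['leafly.com']],
  ['treez', ['treez.io', 'treez.co']],
  ['meadow', ['meadow.com']],
  ['blaze', ['blaze.me', 'blazepay']],
  ['flowhub', ['flowhub.com']],
  ['dispense', ['dispense.app']],
  ['cova', ['covasoft.com']],
];

// Plain substring matching on the URL; returns 'unknown' when nothing matches.
function detectMenuType(url: string): string {
  for (const [menuType, needles] of MENU_TYPE_PATTERNS) {
    if (needles.some((needle) => url.includes(needle))) {
      return menuType;
    }
  }
  return 'unknown';
}
```

// Because this is substring matching rather than hostname parsing, any URL that merely
// contains one of these strings (for example in a query parameter) is classified as well.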
/**
* POST /api/schedule/dispensaries/:id/refresh-detection
* Combined: detect menu_type AND resolve platform_dispensary_id if dutchie
*/
router.post('/dispensaries/:id/refresh-detection', requireRole('superadmin', 'admin'), async (req: Request, res: Response) => {
try {
const dispensaryId = parseInt(req.params.id);
if (isNaN(dispensaryId)) {
return res.status(400).json({ error: 'Invalid dispensary ID' });
}
// Get dispensary info
const dispensaryResult = await pool.query(`
SELECT id, name, slug, menu_url, website FROM dispensaries WHERE id = $1
`, [dispensaryId]);
if (dispensaryResult.rows.length === 0) {
return res.status(404).json({ error: 'Dispensary not found' });
}
const dispensary = dispensaryResult.rows[0];
const urlToCheck = dispensary.menu_url || dispensary.website;
if (!urlToCheck) {
return res.status(400).json({ error: 'No menu_url or website to detect from' });
}
// Detect menu type from URL patterns
let detectedType: string = 'unknown';
if (urlToCheck.includes('dutchie.com') || urlToCheck.includes('embedded-menu')) {
detectedType = 'dutchie';
} else if (urlToCheck.includes('iheartjane.com') || urlToCheck.includes('jane.co')) {
detectedType = 'jane';
} else if (urlToCheck.includes('weedmaps.com')) {
detectedType = 'weedmaps';
} else if (urlToCheck.includes('leafly.com')) {
detectedType = 'leafly';
} else if (urlToCheck.includes('treez.io') || urlToCheck.includes('treez.co')) {
detectedType = 'treez';
} else if (urlToCheck.includes('meadow.com')) {
detectedType = 'meadow';
} else if (urlToCheck.includes('blaze.me') || urlToCheck.includes('blazepay')) {
detectedType = 'blaze';
} else if (urlToCheck.includes('flowhub.com')) {
detectedType = 'flowhub';
} else if (urlToCheck.includes('dispense.app')) {
detectedType = 'dispense';
} else if (urlToCheck.includes('covasoft.com')) {
detectedType = 'cova';
}
// Update menu_type first
await pool.query(`
UPDATE dispensaries SET menu_type = $1, updated_at = NOW() WHERE id = $2
`, [detectedType, dispensaryId]);
let platformId: string | null = null;
// If dutchie, also try to resolve platform ID
if (detectedType === 'dutchie') {
let slugToResolve = dispensary.slug;
const match = urlToCheck.match(/(?:embedded-menu|dispensar(?:y|ies))\/([^\/\?#]+)/i);
if (match) {
slugToResolve = match[1];
}
if (slugToResolve) {
try {
console.log(`[Schedule] Resolving platform ID for ${dispensary.name} using slug: ${slugToResolve}`);
platformId = await resolveDispensaryId(slugToResolve);
if (platformId) {
await pool.query(`
UPDATE dispensaries SET platform_dispensary_id = $1, updated_at = NOW() WHERE id = $2
`, [platformId, dispensaryId]);
}
} catch (err: any) {
console.warn(`[Schedule] Failed to resolve platform ID: ${err.message}`);
}
}
}
res.json({
success: true,
menu_type: detectedType,
platform_dispensary_id: platformId,
url_checked: urlToCheck,
can_crawl: detectedType === 'dutchie' && !!platformId
});
} catch (error: any) {
console.error('Error refreshing detection:', error);
res.status(500).json({ error: 'Failed to refresh detection' });
}
});
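Review note: the if/else chain in this handler could also be expressed as a lookup table, which keeps the platform list in one place. The sketch below is only an illustration under that assumption; `MENU_TYPE_PATTERNS`, `detectMenuType`, and `extractDutchieSlug` are hypothetical names that do not exist in the commit. The URL fragments and the slug regex are copied from the handler above.

```typescript
// Hypothetical lookup table; entries mirror the if/else chain in the handler.
// Order matters: the first matching entry wins, as in the original chain.
const MENU_TYPE_PATTERNS: ReadonlyArray<readonly [string, readonly string[]]> = [
  ['dutchie', ['dutchie.com', 'embedded-menu']],
  ['jane', ['iheartjane.com', 'jane.co']],
  ['weedmaps', ['weedmaps.com']],
  ['leafly', ['leafly.com']],
  ['treez', ['treez.io', 'treez.co']],
  ['meadow', ['meadow.com']],
  ['blaze', ['blaze.me', 'blazepay']],
  ['flowhub', ['flowhub.com']],
  ['dispense', ['dispense.app']],
  ['cova', ['covasoft.com']],
];

// Returns the first menu type whose fragment occurs in the URL, else 'unknown'.
function detectMenuType(url: string): string {
  for (const [type, fragments] of MENU_TYPE_PATTERNS) {
    if (fragments.some((fragment) => url.includes(fragment))) {
      return type;
    }
  }
  return 'unknown';
}

// Pulls the slug that follows /embedded-menu/, /dispensary/ or /dispensaries/.
function extractDutchieSlug(url: string): string | null {
  const match = url.match(/(?:embedded-menu|dispensar(?:y|ies))\/([^\/\?#]+)/i);
  return match?.[1] ?? null;
}
```

Like the handler, this is plain substring matching, so a URL that merely contains one of the fragments in its path or query string is classified the same way as a URL on that host.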
/**
* PUT /api/schedule/dispensaries/:id/toggle-active
* Enable or disable schedule for a dispensary
*/
router.put('/dispensaries/:id/toggle-active', requireRole('superadmin', 'admin'), async (req: Request, res: Response) => {
try {
const dispensaryId = parseInt(req.params.id, 10);
if (isNaN(dispensaryId)) {
return res.status(400).json({ error: 'Invalid dispensary ID' });
}
const { is_active } = req.body;
// Upsert schedule with new is_active value
const result = await pool.query(`
INSERT INTO dispensary_crawl_schedule (dispensary_id, is_active, interval_minutes, priority)
VALUES ($1, $2, 240, 0)
ON CONFLICT (dispensary_id) DO UPDATE SET
is_active = $2,
updated_at = NOW()
RETURNING *
`, [dispensaryId, is_active]);
res.json({
success: true,
schedule: result.rows[0],
message: is_active ? 'Schedule enabled' : 'Schedule disabled'
});
} catch (error: any) {
console.error('Error toggling schedule active status:', error);
res.status(500).json({ error: 'Failed to toggle schedule' });
}
});
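Review note: the toggle handler passes `is_active` from the request body straight into the upsert, so a missing or non-boolean value reaches the database unchecked. A minimal sketch of a guard follows; `parseIsActive` is a hypothetical helper, not part of this commit, and how the route should respond to a rejected value (for example with a 400) is left as an assumption.

```typescript
// Hypothetical helper: returns the boolean is_active from a request body,
// or null when the field is absent or not a real boolean.
function parseIsActive(body: unknown): boolean | null {
  if (typeof body !== 'object' || body === null) {
    return null;
  }
  const value = (body as Record<string, unknown>).is_active;
  return typeof value === 'boolean' ? value : null;
}
```

A caller could return early with an error response when the helper yields `null`, and only run the upsert otherwise.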
/**
* DELETE /api/schedule/dispensaries/:id/schedule
* Delete schedule for a dispensary
*/
router.delete('/dispensaries/:id/schedule', requireRole('superadmin', 'admin'), async (req: Request, res: Response) => {
try {
const dispensaryId = parseInt(req.params.id, 10);
if (isNaN(dispensaryId)) {
return res.status(400).json({ error: 'Invalid dispensary ID' });
}
const result = await pool.query(`
DELETE FROM dispensary_crawl_schedule WHERE dispensary_id = $1 RETURNING id
`, [dispensaryId]);
const deleted = (result.rowCount ?? 0) > 0;
res.json({
success: true,
deleted,
message: deleted ? 'Schedule deleted' : 'No schedule to delete'
});
} catch (error: any) {
console.error('Error deleting schedule:', error);
res.status(500).json({ error: 'Failed to delete schedule' });
}
});
export default router;


@@ -1,53 +1,39 @@
/**
* Stores API Routes
*
* NOTE: "Store" and "Dispensary" are synonyms in CannaIQ.
* - This file handles `/api/stores` endpoints
* - The DB table is `dispensaries` (NOT `stores`)
* - Use these terms interchangeably
* - `/api/stores` and `/api/dispensaries` both work
*/
import { Router } from 'express';
import { authMiddleware, requireRole } from '../auth/middleware';
import { pool } from '../db/pool';
import { scrapeStore, scrapeCategory, discoverCategories } from '../scraper-v2';
const router = Router();
router.use(authMiddleware);
// Get all stores
router.get('/', async (req, res) => {
try {
const result = await pool.query(`
SELECT
s.*,
COUNT(DISTINCT p.id) as product_count,
COUNT(DISTINCT c.id) as category_count
FROM stores s
LEFT JOIN products p ON s.id = p.store_id
LEFT JOIN categories c ON s.id = c.store_id
GROUP BY s.id
ORDER BY s.name
`);
res.json({ stores: result.rows });
} catch (error) {
console.error('Error fetching stores:', error);
res.status(500).json({ error: 'Failed to fetch stores' });
}
});
// Freshness threshold in hours
const STALE_THRESHOLD_HOURS = 4;
function calculateFreshness(lastScrapedAt: Date | null): {
last_scraped_at: string | null;
function calculateFreshness(lastCrawlAt: Date | null): {
last_crawl_at: string | null;
is_stale: boolean;
freshness: string;
hours_since_scrape: number | null;
hours_since_crawl: number | null;
} {
if (!lastScrapedAt) {
if (!lastCrawlAt) {
return {
last_scraped_at: null,
last_crawl_at: null,
is_stale: true,
freshness: 'Never scraped',
hours_since_scrape: null
freshness: 'Never crawled',
hours_since_crawl: null
};
}
const now = new Date();
const diffMs = now.getTime() - lastScrapedAt.getTime();
const diffMs = now.getTime() - lastCrawlAt.getTime();
const diffHours = diffMs / (1000 * 60 * 60);
const isStale = diffHours > STALE_THRESHOLD_HOURS;
@@ -64,49 +50,123 @@ function calculateFreshness(lastScrapedAt: Date | null): {
}
return {
last_scraped_at: lastScrapedAt.toISOString(),
last_crawl_at: lastCrawlAt.toISOString(),
is_stale: isStale,
freshness: freshnessText,
hours_since_scrape: Math.round(diffHours * 10) / 10
hours_since_crawl: Math.round(diffHours * 10) / 10
};
}
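Review note: the staleness arithmetic in `calculateFreshness` can be checked in isolation. The sketch below reproduces only the parts visible in this diff (the 4 hour threshold, the strict `>` comparison, and rounding to one decimal); the `freshness` text is elided by the hunk and is not reconstructed here. `staleness` and its injected `now` parameter are hypothetical, added so the calculation is deterministic.

```typescript
const STALE_THRESHOLD_HOURS = 4;

// Hypothetical helper: same arithmetic as calculateFreshness, with `now`
// passed in so the result does not depend on the wall clock.
function staleness(
  lastCrawlAt: Date | null,
  now: Date
): { is_stale: boolean; hours_since_crawl: number | null } {
  if (!lastCrawlAt) {
    // Never crawled counts as stale, as in the original function.
    return { is_stale: true, hours_since_crawl: null };
  }
  const diffHours = (now.getTime() - lastCrawlAt.getTime()) / (1000 * 60 * 60);
  return {
    is_stale: diffHours > STALE_THRESHOLD_HOURS,
    hours_since_crawl: Math.round(diffHours * 10) / 10,
  };
}
```

Because the comparison is strict, a crawl exactly four hours old is still treated as fresh.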
function detectProvider(dutchieUrl: string | null): string {
if (!dutchieUrl) return 'unknown';
if (dutchieUrl.includes('dutchie.com')) return 'Dutchie';
if (dutchieUrl.includes('iheartjane.com') || dutchieUrl.includes('jane.co')) return 'Jane';
if (dutchieUrl.includes('treez.io')) return 'Treez';
if (dutchieUrl.includes('weedmaps.com')) return 'Weedmaps';
if (dutchieUrl.includes('leafly.com')) return 'Leafly';
function detectProvider(menuUrl: string | null): string {
if (!menuUrl) return 'unknown';
if (menuUrl.includes('dutchie.com')) return 'Dutchie';
if (menuUrl.includes('iheartjane.com') || menuUrl.includes('jane.co')) return 'Jane';
if (menuUrl.includes('treez.io')) return 'Treez';
if (menuUrl.includes('weedmaps.com')) return 'Weedmaps';
if (menuUrl.includes('leafly.com')) return 'Leafly';
return 'Custom';
}
// Get single store with full details
// Get all stores (from dispensaries table)
router.get('/', async (req, res) => {
try {
const { city, state, menu_type } = req.query;
let query = `
SELECT
id,
name,
slug,
city,
state,
address,
zip,
phone,
website,
latitude,
longitude,
menu_url,
menu_type,
platform,
platform_dispensary_id,
product_count,
last_crawl_at,
created_at,
updated_at
FROM dispensaries
`;
const params: any[] = [];
const conditions: string[] = [];
if (city) {
conditions.push(`city ILIKE $${params.length + 1}`);
params.push(city);
}
if (state) {
conditions.push(`state = $${params.length + 1}`);
params.push(state);
}
if (menu_type) {
conditions.push(`menu_type = $${params.length + 1}`);
params.push(menu_type);
}
if (conditions.length > 0) {
query += ` WHERE ${conditions.join(' AND ')}`;
}
query += ` ORDER BY name`;
const result = await pool.query(query, params);
// Add computed fields
const stores = result.rows.map(row => ({
...row,
provider: detectProvider(row.menu_url),
...calculateFreshness(row.last_crawl_at)
}));
res.json({ stores });
} catch (error) {
console.error('Error fetching stores:', error);
res.status(500).json({ error: 'Failed to fetch stores' });
}
});
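Review note: the list endpoint builds its WHERE clause by numbering placeholders from the current length of `params`. The same idea can be sketched as a small pure function, which makes the placeholder numbering easy to test. `StoreFilters` and `buildStoreQuery` are hypothetical names, and the column list is shortened for the example. The sketch assumes each filter is a single string, whereas Express query values can also arrive as arrays.

```typescript
// Hypothetical filter shape; assumes each query value is a single string.
interface StoreFilters {
  city?: string;
  state?: string;
  menu_type?: string;
}

// Builds a parameterized query; placeholders are numbered in the order the
// filters are applied, so values never get interpolated into the SQL text.
function buildStoreQuery(filters: StoreFilters): { text: string; params: string[] } {
  const params: string[] = [];
  const conditions: string[] = [];

  const add = (toSql: (placeholder: string) => string, value?: string): void => {
    if (!value) return;
    params.push(value);
    conditions.push(toSql(`$${params.length}`));
  };

  add((p) => `city ILIKE ${p}`, filters.city);
  add((p) => `state = ${p}`, filters.state);
  add((p) => `menu_type = ${p}`, filters.menu_type);

  let text = 'SELECT id, name FROM dispensaries';
  if (conditions.length > 0) {
    text += ` WHERE ${conditions.join(' AND ')}`;
  }
  text += ' ORDER BY name';
  return { text, params };
}
```

The returned `text` and `params` would be passed together to `pool.query(text, params)`.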
// Get single store by ID (from dispensaries table)
router.get('/:id', async (req, res) => {
try {
const { id } = req.params;
// Get store with counts and linked dispensary
const result = await pool.query(`
SELECT
s.*,
d.id as dispensary_id,
d.name as dispensary_name,
d.slug as dispensary_slug,
d.state as dispensary_state,
d.city as dispensary_city,
d.address as dispensary_address,
d.menu_provider as dispensary_menu_provider,
COUNT(DISTINCT p.id) as product_count,
COUNT(DISTINCT c.id) as category_count,
COUNT(DISTINCT p.id) FILTER (WHERE p.in_stock = true) as in_stock_count,
COUNT(DISTINCT p.id) FILTER (WHERE p.in_stock = false) as out_of_stock_count
FROM stores s
LEFT JOIN dispensaries d ON s.dispensary_id = d.id
LEFT JOIN products p ON s.id = p.store_id
LEFT JOIN categories c ON s.id = c.store_id
WHERE s.id = $1
GROUP BY s.id, d.id, d.name, d.slug, d.state, d.city, d.address, d.menu_provider
id,
name,
slug,
city,
state,
address,
zip,
phone,
website,
dba_name,
company_name,
latitude,
longitude,
menu_url,
menu_type,
platform,
platform_dispensary_id,
product_count,
last_crawl_at,
raw_metadata,
created_at,
updated_at
FROM dispensaries
WHERE id = $1
`, [id]);
if (result.rows.length === 0) {
@@ -115,62 +175,19 @@ router.get('/:id', async (req, res) => {
const store = result.rows[0];
// Get recent crawl jobs for this store
const jobsResult = await pool.query(`
SELECT
id, status, job_type, trigger_type,
started_at, completed_at,
products_found, products_new, products_updated,
in_stock_count, out_of_stock_count,
error_message
FROM crawl_jobs
WHERE store_id = $1
ORDER BY created_at DESC
LIMIT 10
`, [id]);
// Get schedule info if exists
const scheduleResult = await pool.query(`
SELECT
enabled, interval_hours, next_run_at, last_run_at
FROM store_crawl_schedule
WHERE store_id = $1
`, [id]);
// Calculate freshness
const freshness = calculateFreshness(store.last_scraped_at);
const freshness = calculateFreshness(store.last_crawl_at);
// Detect provider from URL
const provider = detectProvider(store.dutchie_url);
const provider = detectProvider(store.menu_url);
// Build response
const response = {
...store,
provider,
freshness: freshness.freshness,
is_stale: freshness.is_stale,
hours_since_scrape: freshness.hours_since_scrape,
linked_dispensary: store.dispensary_id ? {
id: store.dispensary_id,
name: store.dispensary_name,
slug: store.dispensary_slug,
state: store.dispensary_state,
city: store.dispensary_city,
address: store.dispensary_address,
menu_provider: store.dispensary_menu_provider
} : null,
schedule: scheduleResult.rows[0] || null,
recent_jobs: jobsResult.rows
...freshness,
};
// Remove redundant dispensary fields from root
delete response.dispensary_name;
delete response.dispensary_slug;
delete response.dispensary_state;
delete response.dispensary_city;
delete response.dispensary_address;
delete response.dispensary_menu_provider;
res.json(response);
} catch (error) {
console.error('Error fetching store:', error);
@@ -178,88 +195,101 @@ router.get('/:id', async (req, res) => {
}
});
// Get store brands
router.get('/:id/brands', async (req, res) => {
try {
const { id } = req.params;
const result = await pool.query(`
SELECT name
FROM brands
WHERE store_id = $1
ORDER BY name
`, [id]);
const brands = result.rows.map((row: any) => row.name);
res.json({ brands });
} catch (error) {
console.error('Error fetching store brands:', error);
res.status(500).json({ error: 'Failed to fetch store brands' });
}
});
// Get store specials
router.get('/:id/specials', async (req, res) => {
try {
const { id } = req.params;
const { date } = req.query;
// Use provided date or today's date
const queryDate = date || new Date().toISOString().split('T')[0];
const result = await pool.query(`
SELECT
s.*,
p.name as product_name,
p.image_url as product_image
FROM specials s
LEFT JOIN products p ON s.product_id = p.id
WHERE s.store_id = $1 AND s.valid_date = $2
ORDER BY s.name
`, [id, queryDate]);
res.json({ specials: result.rows, date: queryDate });
} catch (error) {
console.error('Error fetching store specials:', error);
res.status(500).json({ error: 'Failed to fetch store specials' });
}
});
// Create store
// Create store (into dispensaries table)
router.post('/', requireRole('superadmin', 'admin'), async (req, res) => {
try {
const { name, slug, dutchie_url, active, scrape_enabled } = req.body;
const {
name,
slug,
city,
state,
address,
zip,
phone,
website,
menu_url,
menu_type,
platform,
platform_dispensary_id,
latitude,
longitude
} = req.body;
if (!name || !slug || !city || !state) {
return res.status(400).json({ error: 'name, slug, city, and state are required' });
}
const result = await pool.query(`
INSERT INTO stores (name, slug, dutchie_url, active, scrape_enabled)
VALUES ($1, $2, $3, $4, $5)
INSERT INTO dispensaries (
name, slug, city, state, address, zip, phone, website,
menu_url, menu_type, platform, platform_dispensary_id,
latitude, longitude, created_at, updated_at
)
VALUES ($1, $2, $3, $4, $5, $6, $7, $8, $9, $10, $11, $12, $13, $14, CURRENT_TIMESTAMP, CURRENT_TIMESTAMP)
RETURNING *
`, [name, slug, dutchie_url, active ?? true, scrape_enabled ?? true]);
`, [
name, slug, city, state, address, zip, phone, website,
menu_url, menu_type, platform || 'dutchie', platform_dispensary_id,
latitude, longitude
]);
res.status(201).json(result.rows[0]);
} catch (error) {
} catch (error: any) {
console.error('Error creating store:', error);
if (error.code === '23505') { // unique violation
res.status(409).json({ error: 'Store with this slug already exists' });
} else {
res.status(500).json({ error: 'Failed to create store' });
}
}
});
// Update store
// Update store (in dispensaries table)
router.put('/:id', requireRole('superadmin', 'admin'), async (req, res) => {
try {
const { id } = req.params;
const { name, slug, dutchie_url, active, scrape_enabled } = req.body;
const {
name,
slug,
city,
state,
address,
zip,
phone,
website,
menu_url,
menu_type,
platform,
platform_dispensary_id,
latitude,
longitude
} = req.body;
const result = await pool.query(`
UPDATE stores
SET name = COALESCE($1, name),
UPDATE dispensaries
SET
name = COALESCE($1, name),
slug = COALESCE($2, slug),
dutchie_url = COALESCE($3, dutchie_url),
active = COALESCE($4, active),
scrape_enabled = COALESCE($5, scrape_enabled),
city = COALESCE($3, city),
state = COALESCE($4, state),
address = COALESCE($5, address),
zip = COALESCE($6, zip),
phone = COALESCE($7, phone),
website = COALESCE($8, website),
menu_url = COALESCE($9, menu_url),
menu_type = COALESCE($10, menu_type),
platform = COALESCE($11, platform),
platform_dispensary_id = COALESCE($12, platform_dispensary_id),
latitude = COALESCE($13, latitude),
longitude = COALESCE($14, longitude),
updated_at = CURRENT_TIMESTAMP
WHERE id = $6
WHERE id = $15
RETURNING *
`, [name, slug, dutchie_url, active, scrape_enabled, id]);
`, [
name, slug, city, state, address, zip, phone, website,
menu_url, menu_type, platform, platform_dispensary_id,
latitude, longitude, id
]);
if (result.rows.length === 0) {
return res.status(404).json({ error: 'Store not found' });
@@ -272,12 +302,12 @@ router.put('/:id', requireRole('superadmin', 'admin'), async (req, res) => {
}
});
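Review note: the new UPDATE wraps every column in `COALESCE($n, column)`, so a parameter that arrives as NULL leaves the stored value untouched. One consequence is that this endpoint cannot clear a field back to NULL. The sketch below models that behaviour in plain TypeScript; `Row` and `applyCoalescePatch` are hypothetical and only cover three of the columns. It assumes, as node-postgres is understood to do, that `undefined` parameters are sent as NULL.

```typescript
// Hypothetical three-column subset of a dispensaries row.
type Row = { name: string; phone: string | null; website: string | null };

// Models SQL COALESCE(incoming, existing): null or undefined input keeps
// the existing value, anything else overwrites it.
function applyCoalescePatch(existing: Row, patch: Partial<Row>): Row {
  return {
    name: patch.name ?? existing.name,
    phone: patch.phone ?? existing.phone,
    website: patch.website ?? existing.website,
  };
}
```

If clearing a field is ever needed, the route would have to distinguish "not provided" from "explicitly null", which COALESCE alone cannot express.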
// Delete store
// Delete store (from dispensaries table)
router.delete('/:id', requireRole('superadmin'), async (req, res) => {
try {
const { id } = req.params;
const result = await pool.query('DELETE FROM stores WHERE id = $1 RETURNING *', [id]);
const result = await pool.query('DELETE FROM dispensaries WHERE id = $1 RETURNING *', [id]);
if (result.rows.length === 0) {
return res.status(404).json({ error: 'Store not found' });
@@ -290,135 +320,55 @@ router.delete('/:id', requireRole('superadmin'), async (req, res) => {
}
});
// Trigger scrape for a store
router.post('/:id/scrape', requireRole('superadmin', 'admin'), async (req, res) => {
try {
const { id } = req.params;
const { parallel = 3, userAgent } = req.body; // Default to 3 parallel scrapers
const storeResult = await pool.query('SELECT id FROM stores WHERE id = $1', [id]);
if (storeResult.rows.length === 0) {
return res.status(404).json({ error: 'Store not found' });
}
scrapeStore(parseInt(id), parseInt(parallel), userAgent).catch(err => {
console.error('Background scrape error:', err);
});
res.json({
message: 'Scrape started',
parallel: parseInt(parallel),
userAgent: userAgent || 'random'
});
} catch (error) {
console.error('Error triggering scrape:', error);
res.status(500).json({ error: 'Failed to trigger scrape' });
}
});
// Download missing images for a store
router.post('/:id/download-images', requireRole('superadmin', 'admin'), async (req, res) => {
// Get products for a store (uses dutchie_products table)
router.get('/:id/products', async (req, res) => {
try {
const { id } = req.params;
const storeResult = await pool.query('SELECT id, name FROM stores WHERE id = $1', [id]);
if (storeResult.rows.length === 0) {
return res.status(404).json({ error: 'Store not found' });
}
const store = storeResult.rows[0];
const productsResult = await pool.query(`
SELECT id, name, image_url
FROM products
WHERE store_id = $1
AND image_url IS NOT NULL
AND local_image_path IS NULL
const result = await pool.query(`
SELECT
id,
name,
brand_name,
type,
subcategory,
stock_status,
thc_content,
cbd_content,
primary_image_url,
external_product_id,
created_at,
updated_at
FROM dutchie_products
WHERE dispensary_id = $1
ORDER BY name
`, [id]);
(async () => {
const { uploadImageFromUrl } = await import('../utils/minio');
let downloaded = 0;
for (const product of productsResult.rows) {
try {
console.log(`📸 Downloading image for: ${product.name}`);
const localPath = await uploadImageFromUrl(product.image_url, product.id);
await pool.query(`
UPDATE products
SET local_image_path = $1
WHERE id = $2
`, [localPath, product.id]);
downloaded++;
res.json({ products: result.rows });
} catch (error) {
console.error(`Failed to download image for ${product.name}:`, error);
}
}
console.log(`✅ Downloaded ${downloaded} of ${productsResult.rows.length} missing images for ${store.name}`);
})().catch(err => console.error('Background image download error:', err));
res.json({
message: 'Image download started',
total_missing: productsResult.rows.length
});
} catch (error) {
console.error('Error triggering image download:', error);
res.status(500).json({ error: 'Failed to trigger image download' });
console.error('Error fetching store products:', error);
res.status(500).json({ error: 'Failed to fetch products' });
}
});
// Discover categories for a store
router.post('/:id/discover-categories', requireRole('superadmin', 'admin'), async (req, res) => {
// Get brands for a store
router.get('/:id/brands', async (req, res) => {
try {
const { id } = req.params;
const storeResult = await pool.query('SELECT id FROM stores WHERE id = $1', [id]);
if (storeResult.rows.length === 0) {
return res.status(404).json({ error: 'Store not found' });
}
discoverCategories(parseInt(id)).catch(err => {
console.error('Background category discovery error:', err);
});
res.json({ message: 'Category discovery started' });
} catch (error) {
console.error('Error triggering category discovery:', error);
res.status(500).json({ error: 'Failed to trigger category discovery' });
}
});
// Debug scraper
router.post('/:id/debug-scrape', requireRole('superadmin', 'admin'), async (req, res) => {
try {
const { id } = req.params;
console.log('Debug scrape triggered for store:', id);
const categoryResult = await pool.query(`
SELECT c.dutchie_url, c.name
FROM categories c
WHERE c.store_id = $1 AND c.slug = 'edibles'
LIMIT 1
const result = await pool.query(`
SELECT brand_name as name, COUNT(*) as product_count
FROM dutchie_products
WHERE dispensary_id = $1 AND brand_name IS NOT NULL
GROUP BY brand_name
ORDER BY product_count DESC, brand_name
`, [id]);
if (categoryResult.rows.length === 0) {
return res.status(404).json({ error: 'Edibles category not found' });
}
console.log('Found category:', categoryResult.rows[0]);
const { debugDutchiePage } = await import('../services/scraper-debug');
debugDutchiePage(categoryResult.rows[0].dutchie_url).catch(err => {
console.error('Debug error:', err);
});
res.json({ message: 'Debug started, check logs', url: categoryResult.rows[0].dutchie_url });
const brands = result.rows.map((row: any) => row.name);
res.json({ brands, details: result.rows });
} catch (error) {
console.error('Debug endpoint error:', error);
res.status(500).json({ error: 'Failed to debug' });
console.error('Error fetching store brands:', error);
res.status(500).json({ error: 'Failed to fetch store brands' });
}
});


@@ -20,7 +20,7 @@
*/
import { Router, Request, Response } from 'express';
import { getPool } from '../dutchie-az/db/connection';
import { pool } from '../db/pool';
const router = Router();
@@ -112,7 +112,7 @@ function extractRunRole(jobName: string, jobConfig: any): string {
*/
router.get('/', async (_req: Request, res: Response) => {
try {
const pool = getPool();
// pool imported from db/pool
const { rows } = await pool.query(`
SELECT
id,
@@ -158,7 +158,7 @@ router.get('/', async (_req: Request, res: Response) => {
*/
router.get('/active', async (_req: Request, res: Response) => {
try {
const pool = getPool();
// pool imported from db/pool
const { rows } = await pool.query(`
SELECT DISTINCT ON (claimed_by)
claimed_by as worker_id,
@@ -193,7 +193,7 @@ router.get('/active', async (_req: Request, res: Response) => {
*/
router.get('/schedule', async (req: Request, res: Response) => {
// Delegate to main workers endpoint
const pool = getPool();
// pool imported from db/pool
try {
const { rows } = await pool.query(`
SELECT
@@ -223,7 +223,7 @@ router.get('/schedule', async (req: Request, res: Response) => {
router.get('/:workerIdOrName', async (req: Request, res: Response) => {
try {
const { workerIdOrName } = req.params;
const pool = getPool();
// pool imported from db/pool
// Try to find by ID or job_name
const { rows } = await pool.query(`
@@ -278,7 +278,7 @@ router.get('/:workerIdOrName', async (req: Request, res: Response) => {
router.get('/:workerIdOrName/scope', async (req: Request, res: Response) => {
try {
const { workerIdOrName } = req.params;
const pool = getPool();
// pool imported from db/pool
const { rows } = await pool.query(`
SELECT job_config
@@ -304,7 +304,7 @@ router.get('/:workerIdOrName/scope', async (req: Request, res: Response) => {
router.get('/:workerIdOrName/stats', async (req: Request, res: Response) => {
try {
const { workerIdOrName } = req.params;
const pool = getPool();
// pool imported from db/pool
// Get schedule info
const scheduleResult = await pool.query(`
@@ -357,7 +357,7 @@ router.get('/:workerIdOrName/logs', async (req: Request, res: Response) => {
try {
const { workerIdOrName } = req.params;
const limit = parseInt(req.query.limit as string) || 20;
const pool = getPool();
// pool imported from db/pool
// Get schedule info
const scheduleResult = await pool.query(`
@@ -419,7 +419,7 @@ router.get('/:workerIdOrName/logs', async (req: Request, res: Response) => {
router.post('/:workerIdOrName/trigger', async (req: Request, res: Response) => {
try {
const { workerIdOrName } = req.params;
const pool = getPool();
// pool imported from db/pool
// Get schedule info
const scheduleResult = await pool.query(`
@@ -458,7 +458,7 @@ router.get('/jobs', async (req: Request, res: Response) => {
try {
const limit = parseInt(req.query.limit as string) || 50;
const status = req.query.status as string | undefined;
const pool = getPool();
// pool imported from db/pool
let query = `
SELECT
@@ -518,7 +518,7 @@ router.get('/jobs', async (req: Request, res: Response) => {
*/
router.get('/active-jobs', async (req: Request, res: Response) => {
try {
const pool = getPool();
// pool imported from db/pool
const { rows } = await pool.query(`
SELECT
@@ -563,7 +563,7 @@ router.get('/active-jobs', async (req: Request, res: Response) => {
*/
router.get('/summary', async (req: Request, res: Response) => {
try {
const pool = getPool();
// pool imported from db/pool
// Get summary stats
const [scheduleStats, jobStats, activeJobs] = await Promise.all([


@@ -38,10 +38,10 @@ export class NavigationDiscovery {
logger.info('categories', `Starting category discovery for store ${storeId}`);
try {
// Get store info
// Get dispensary info (store = dispensary)
const storeResult = await pool.query(`
SELECT id, name, slug, dutchie_url
FROM stores
SELECT id, name, slug, menu_url as dutchie_url
FROM dispensaries
WHERE id = $1
`, [storeId]);


@@ -1,439 +0,0 @@
// ============================================================================
// DEPRECATED: This scraper writes to the LEGACY products table.
// DO NOT USE - All Dutchie crawling must use the new dutchie-az pipeline.
//
// New pipeline location: src/dutchie-az/services/product-crawler.ts
// - Uses fetch-based GraphQL (no Puppeteer needed)
// - Writes to isolated dutchie_az_* tables with snapshot model
// - Tracks stockStatus, isPresentInFeed, missing_from_feed
// ============================================================================
/**
* @deprecated DEPRECATED - Use src/dutchie-az/services/product-crawler.ts instead.
* This scraper writes to the legacy products table, not the new dutchie_az tables.
*
* Makes direct GraphQL requests from within the browser context to:
* 1. Bypass Cloudflare (using browser session)
* 2. Fetch ALL products including out-of-stock (Status: null)
* 3. Paginate through complete menu
*/
import puppeteer from 'puppeteer-extra';
import type { Browser, Page } from 'puppeteer';
import StealthPlugin from 'puppeteer-extra-plugin-stealth';
import { Pool } from 'pg';
import { DutchieProduct, NormalizedProduct, normalizeDutchieProduct } from './dutchie-graphql';
puppeteer.use(StealthPlugin());
// GraphQL persisted query hashes
const GRAPHQL_HASHES = {
FilteredProducts: 'ee29c060826dc41c527e470e9ae502c9b2c169720faa0a9f5d25e1b9a530a4a0',
GetAddressBasedDispensaryData: '13461f73abf7268770dfd05fe7e10c523084b2bb916a929c08efe3d87531977b',
};
interface FetchResult {
products: DutchieProduct[];
dispensaryId: string;
totalProducts: number;
activeCount: number;
inactiveCount: number;
}
/**
* Fetch all products via in-page GraphQL requests
* This includes both in-stock and out-of-stock items
*/
export async function fetchAllDutchieProducts(
menuUrl: string,
options: {
headless?: boolean | 'new';
timeout?: number;
perPage?: number;
includeOutOfStock?: boolean;
} = {}
): Promise<FetchResult> {
const {
headless = 'new',
timeout = 90000,
perPage = 100,
includeOutOfStock = true,
} = options;
let browser: Browser | undefined;
try {
browser = await puppeteer.launch({
headless,
args: [
'--no-sandbox',
'--disable-setuid-sandbox',
'--disable-dev-shm-usage',
'--disable-blink-features=AutomationControlled',
],
});
const page = await browser.newPage();
// Stealth configuration
await page.setUserAgent(
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36'
);
await page.setViewport({ width: 1920, height: 1080 });
await page.evaluateOnNewDocument(() => {
Object.defineProperty(navigator, 'webdriver', { get: () => false });
(window as any).chrome = { runtime: {} };
});
// Navigate to menu page to establish session
console.log('[DutchieGraphQL] Loading menu page to establish session...');
await page.goto(menuUrl, {
waitUntil: 'networkidle2',
timeout,
});
// Get dispensary ID from page
const dispensaryId = await page.evaluate(() => {
const env = (window as any).reactEnv;
return env?.dispensaryId || env?.retailerId || '';
});
if (!dispensaryId) {
throw new Error('Could not determine dispensaryId from page');
}
console.log(`[DutchieGraphQL] Dispensary ID: ${dispensaryId}`);
// Fetch all products via in-page GraphQL requests
const allProducts: DutchieProduct[] = [];
let page_num = 0;
let hasMore = true;
while (hasMore) {
console.log(`[DutchieGraphQL] Fetching page ${page_num} (perPage=${perPage})...`);
const result = await page.evaluate(
async (dispensaryId: string, page_num: number, perPage: number, includeOutOfStock: boolean, hash: string) => {
const variables = {
includeEnterpriseSpecials: false,
productsFilter: {
dispensaryId,
pricingType: 'rec',
Status: includeOutOfStock ? null : 'Active', // null = include out-of-stock
types: [],
useCache: false, // Don't cache to get fresh data
isDefaultSort: true,
sortBy: 'popularSortIdx',
sortDirection: 1,
bypassOnlineThresholds: true,
isKioskMenu: false,
removeProductsBelowOptionThresholds: false,
},
page: page_num,
perPage,
};
const qs = new URLSearchParams({
operationName: 'FilteredProducts',
variables: JSON.stringify(variables),
extensions: JSON.stringify({
persistedQuery: { version: 1, sha256Hash: hash },
}),
});
const response = await fetch(`https://dutchie.com/graphql?${qs.toString()}`, {
method: 'GET',
headers: {
'content-type': 'application/json',
'apollographql-client-name': 'Marketplace (production)',
},
credentials: 'include', // Include cookies/session
});
if (!response.ok) {
throw new Error(`HTTP ${response.status}`);
}
return response.json();
},
dispensaryId,
page_num,
perPage,
includeOutOfStock,
GRAPHQL_HASHES.FilteredProducts
);
if (result.errors) {
console.error('[DutchieGraphQL] GraphQL errors:', result.errors);
break;
}
const products = result?.data?.filteredProducts?.products || [];
console.log(`[DutchieGraphQL] Page ${page_num}: ${products.length} products`);
if (products.length === 0) {
hasMore = false;
} else {
allProducts.push(...products);
page_num++;
// Safety limit
if (page_num > 50) {
console.log('[DutchieGraphQL] Reached page limit, stopping');
hasMore = false;
}
}
}
// Count active vs inactive
const activeCount = allProducts.filter((p) => p.Status === 'Active').length;
const inactiveCount = allProducts.filter((p) => p.Status !== 'Active').length;
console.log(`[DutchieGraphQL] Total: ${allProducts.length} products (${activeCount} active, ${inactiveCount} inactive)`);
return {
products: allProducts,
dispensaryId,
totalProducts: allProducts.length,
activeCount,
inactiveCount,
};
} finally {
if (browser) {
await browser.close();
}
}
}
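Review note: the removed function combines two ideas that are easier to see apart from Puppeteer: a GET URL carrying a persisted-query hash, and a page loop that stops on an empty page or at a safety cap. The sketch below isolates both. `buildPersistedQueryUrl` and `fetchAllPages` are hypothetical names, the page fetcher is injected rather than tied to any real endpoint, and the cap here limits the loop to exactly `maxPages` requests, which differs slightly from the original `page_num > 50` check.

```typescript
// Builds a GET URL in the persisted-query style used by the removed code:
// operationName, JSON-encoded variables, and a sha256 hash in `extensions`.
function buildPersistedQueryUrl(
  endpoint: string,
  operationName: string,
  variables: Record<string, unknown>,
  sha256Hash: string
): string {
  const qs = new URLSearchParams({
    operationName,
    variables: JSON.stringify(variables),
    extensions: JSON.stringify({ persistedQuery: { version: 1, sha256Hash } }),
  });
  return `${endpoint}?${qs.toString()}`;
}

// Hypothetical page fetcher type: returns the items for a zero-based page.
type FetchPage<T> = (page: number) => Promise<T[]>;

// Collects pages until one comes back empty or maxPages requests were made.
async function fetchAllPages<T>(fetchPage: FetchPage<T>, maxPages = 50): Promise<T[]> {
  const all: T[] = [];
  for (let page = 0; page < maxPages; page++) {
    const items = await fetchPage(page);
    if (items.length === 0) {
      break; // An empty page is treated as the end of the menu.
    }
    all.push(...items);
  }
  return all;
}
```

Injecting the fetcher keeps the loop independent of the transport, so the same function works with a browser-context fetch or a plain HTTP client.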
/**
* Upsert products to database
*/
export async function upsertProductsDirect(
pool: Pool,
storeId: number,
products: NormalizedProduct[]
): Promise<{ inserted: number; updated: number }> {
const client = await pool.connect();
let inserted = 0;
let updated = 0;
try {
await client.query('BEGIN');
for (const product of products) {
const result = await client.query(
`
INSERT INTO products (
store_id, external_id, slug, name, enterprise_product_id,
brand, brand_external_id, brand_logo_url,
subcategory, strain_type, canonical_category,
price, rec_price, med_price, rec_special_price, med_special_price,
is_on_special, special_name, discount_percent, special_data,
sku, inventory_quantity, inventory_available, is_below_threshold, status,
thc_percentage, cbd_percentage, cannabinoids,
weight_mg, net_weight_value, net_weight_unit, options, raw_options,
image_url, additional_images,
is_featured, medical_only, rec_only,
source_created_at, source_updated_at,
description, raw_data,
dutchie_url, last_seen_at, updated_at
)
VALUES (
$1, $2, $3, $4, $5,
$6, $7, $8,
$9, $10, $11,
$12, $13, $14, $15, $16,
$17, $18, $19, $20,
$21, $22, $23, $24, $25,
$26, $27, $28,
$29, $30, $31, $32, $33,
$34, $35,
$36, $37, $38,
$39, $40,
$41, $42,
'', NOW(), NOW()
)
ON CONFLICT (store_id, slug) DO UPDATE SET
name = EXCLUDED.name,
enterprise_product_id = EXCLUDED.enterprise_product_id,
brand = EXCLUDED.brand,
brand_external_id = EXCLUDED.brand_external_id,
brand_logo_url = EXCLUDED.brand_logo_url,
subcategory = EXCLUDED.subcategory,
strain_type = EXCLUDED.strain_type,
canonical_category = EXCLUDED.canonical_category,
price = EXCLUDED.price,
rec_price = EXCLUDED.rec_price,
med_price = EXCLUDED.med_price,
rec_special_price = EXCLUDED.rec_special_price,
med_special_price = EXCLUDED.med_special_price,
is_on_special = EXCLUDED.is_on_special,
special_name = EXCLUDED.special_name,
discount_percent = EXCLUDED.discount_percent,
special_data = EXCLUDED.special_data,
sku = EXCLUDED.sku,
inventory_quantity = EXCLUDED.inventory_quantity,
inventory_available = EXCLUDED.inventory_available,
is_below_threshold = EXCLUDED.is_below_threshold,
status = EXCLUDED.status,
thc_percentage = EXCLUDED.thc_percentage,
cbd_percentage = EXCLUDED.cbd_percentage,
cannabinoids = EXCLUDED.cannabinoids,
weight_mg = EXCLUDED.weight_mg,
net_weight_value = EXCLUDED.net_weight_value,
net_weight_unit = EXCLUDED.net_weight_unit,
options = EXCLUDED.options,
raw_options = EXCLUDED.raw_options,
image_url = EXCLUDED.image_url,
additional_images = EXCLUDED.additional_images,
is_featured = EXCLUDED.is_featured,
medical_only = EXCLUDED.medical_only,
rec_only = EXCLUDED.rec_only,
source_created_at = EXCLUDED.source_created_at,
source_updated_at = EXCLUDED.source_updated_at,
description = EXCLUDED.description,
raw_data = EXCLUDED.raw_data,
last_seen_at = NOW(),
updated_at = NOW()
RETURNING (xmax = 0) AS was_inserted
`,
[
storeId,
product.external_id,
product.slug,
product.name,
product.enterprise_product_id,
product.brand,
product.brand_external_id,
product.brand_logo_url,
product.subcategory,
product.strain_type,
product.canonical_category,
product.price,
product.rec_price,
product.med_price,
product.rec_special_price,
product.med_special_price,
product.is_on_special,
product.special_name,
product.discount_percent,
product.special_data ? JSON.stringify(product.special_data) : null,
product.sku,
product.inventory_quantity,
product.inventory_available,
product.is_below_threshold,
product.status,
product.thc_percentage,
product.cbd_percentage,
product.cannabinoids ? JSON.stringify(product.cannabinoids) : null,
product.weight_mg,
product.net_weight_value,
product.net_weight_unit,
product.options,
product.raw_options,
product.image_url,
product.additional_images,
product.is_featured,
product.medical_only,
product.rec_only,
product.source_created_at,
product.source_updated_at,
product.description,
product.raw_data ? JSON.stringify(product.raw_data) : null,
]
);
if (result.rows[0]?.was_inserted) {
inserted++;
} else {
updated++;
}
}
await client.query('COMMIT');
return { inserted, updated };
} catch (error) {
await client.query('ROLLBACK');
throw error;
} finally {
client.release();
}
}
/**
* Former main entry point: scraped all products, including out-of-stock ones.
*
* @deprecated Use src/dutchie-az/services/product-crawler.ts instead.
* This function is disabled and throws an error if called.
*/
export async function scrapeAllDutchieProducts(
pool: Pool,
storeId: number,
menuUrl: string
): Promise<{
success: boolean;
totalProducts: number;
activeCount: number;
inactiveCount: number;
inserted: number;
updated: number;
error?: string;
}> {
// DEPRECATED: Throw error to prevent accidental use
throw new Error(
'DEPRECATED: scrapeAllDutchieProducts() is deprecated. ' +
'Use src/dutchie-az/services/product-crawler.ts instead. ' +
'This scraper writes to the legacy products table.'
);
// Original code below is unreachable but kept for reference
try {
console.log(`[DutchieGraphQL] Scraping ALL products (including out-of-stock): ${menuUrl}`);
// Fetch all products via direct GraphQL
const { products, totalProducts, activeCount, inactiveCount } = await fetchAllDutchieProducts(menuUrl, {
includeOutOfStock: true,
perPage: 100,
});
if (products.length === 0) {
return {
success: false,
totalProducts: 0,
activeCount: 0,
inactiveCount: 0,
inserted: 0,
updated: 0,
error: 'No products returned from GraphQL',
};
}
// Normalize products
const normalized = products.map(normalizeDutchieProduct);
// Upsert to database
const { inserted, updated } = await upsertProductsDirect(pool, storeId, normalized);
console.log(`[DutchieGraphQL] Complete: ${totalProducts} products (${activeCount} active, ${inactiveCount} inactive)`);
console.log(`[DutchieGraphQL] Database: ${inserted} inserted, ${updated} updated`);
return {
success: true,
totalProducts,
activeCount,
inactiveCount,
inserted,
updated,
};
} catch (error: any) {
console.error(`[DutchieGraphQL] Error:`, error.message);
return {
success: false,
totalProducts: 0,
activeCount: 0,
inactiveCount: 0,
inserted: 0,
updated: 0,
error: error.message,
};
}
}
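The guard above throws before a `try` block that can never execute. A minimal sketch of the same fail-fast idea without the dead code follows. The helper name `deprecated` and the stub function are hypothetical and not part of this repository; only the error wording is taken from the source.

```typescript
// Hypothetical helper: it always throws, so its return type is `never`
// and callers need no unreachable fallback code after it.
function deprecated(name: string, replacement: string): never {
  throw new Error(
    `DEPRECATED: ${name}() is deprecated. Use ${replacement} instead.`
  );
}

// Illustrative stand-in for a disabled entry point (signature simplified).
function scrapeAllDutchieProductsStub(_menuUrl: string): number {
  return deprecated(
    'scrapeAllDutchieProducts',
    'src/dutchie-az/services/product-crawler.ts'
  );
}
```

Because the helper returns `never`, the compiler accepts the stub without a dummy return value, and any caller fails at the first line.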


@@ -1,711 +0,0 @@
// ============================================================================
// DEPRECATED: This scraper writes to the LEGACY products table.
// DO NOT USE - All Dutchie crawling must use the new dutchie-az pipeline.
//
// New pipeline location: src/dutchie-az/services/product-crawler.ts
// - Uses fetch-based GraphQL (no Puppeteer needed)
// - Writes to isolated dutchie_az_* tables with snapshot model
// - Tracks stockStatus, isPresentInFeed, missing_from_feed
//
// The normalizer functions in this file (normalizeDutchieProduct) may still
// be imported for reference, but do NOT call scrapeDutchieMenu() or upsertProducts().
// ============================================================================
/**
* @deprecated DEPRECATED - Use src/dutchie-az/services/product-crawler.ts instead.
* This scraper writes to the legacy products table, not the new dutchie_az tables.
*
* Fetches product data via Puppeteer interception of Dutchie's GraphQL API.
* This bypasses Cloudflare by using a real browser to load the menu page.
*
* GraphQL Operations:
* - FilteredProducts: Returns paginated product list with full details
* - GetAddressBasedDispensaryData: Resolves dispensary cName to dispensaryId
*/
import puppeteer from 'puppeteer-extra';
import type { Browser, Page, HTTPResponse } from 'puppeteer';
import StealthPlugin from 'puppeteer-extra-plugin-stealth';
import { Pool } from 'pg';
puppeteer.use(StealthPlugin());
// =====================================================
// TYPE DEFINITIONS (from captured GraphQL schema)
// =====================================================
export interface DutchieProduct {
_id: string;
id: string;
Name: string;
cName: string; // URL slug
enterpriseProductId?: string;
DispensaryID: string;
// Brand
brand?: {
id: string;
name: string;
imageUrl?: string;
description?: string;
};
brandId?: string;
brandName?: string;
brandLogo?: string;
// Category
type?: string; // e.g., "Edible", "Flower"
subcategory?: string; // e.g., "gummies", "pre-rolls"
strainType?: string; // "Indica", "Sativa", "Hybrid", "N/A"
// Pricing (arrays - first element is primary)
Prices?: number[];
recPrices?: number[];
medicalPrices?: number[];
recSpecialPrices?: number[];
medicalSpecialPrices?: number[];
// Specials
special?: boolean;
specialData?: {
saleSpecials?: Array<{
specialId: string;
specialName: string;
discount: number;
percentDiscount: boolean;
dollarDiscount: boolean;
specialType: string;
}>;
bogoSpecials?: any;
};
// Inventory
POSMetaData?: {
canonicalSKU?: string;
canonicalCategory?: string;
canonicalName?: string;
canonicalLabResultUrl?: string;
children?: Array<{
option: string;
price: number;
quantity: number;
quantityAvailable: number;
recPrice?: number;
medPrice?: number;
}>;
};
Status?: string; // "Active" or "Inactive"
isBelowThreshold?: boolean;
// Potency
THCContent?: {
unit: string;
range: number[];
};
CBDContent?: {
unit: string;
range: number[];
};
cannabinoidsV2?: Array<{
value: number;
unit: string;
cannabinoid: {
name: string;
};
}>;
// Weight/Options
Options?: string[];
rawOptions?: string[];
weight?: number;
measurements?: {
netWeight?: {
unit: string;
values: number[];
};
volume?: any;
};
// Images
Image?: string;
images?: string[];
// Flags
featured?: boolean;
medicalOnly?: boolean;
recOnly?: boolean;
// Timestamps
createdAt?: string;
updatedAt?: string;
// Description
description?: string;
effects?: Record<string, any>;
terpenes?: any[];
}
// Database product row
export interface NormalizedProduct {
external_id: string;
slug: string;
name: string;
enterprise_product_id?: string;
// Brand
brand?: string;
brand_external_id?: string;
brand_logo_url?: string;
// Category
subcategory?: string;
strain_type?: string;
canonical_category?: string;
// Pricing
price?: number;
rec_price?: number;
med_price?: number;
rec_special_price?: number;
med_special_price?: number;
// Specials
is_on_special: boolean;
special_name?: string;
discount_percent?: number;
special_data?: any;
// Inventory
sku?: string;
inventory_quantity?: number;
inventory_available?: number;
is_below_threshold: boolean;
status?: string;
// Potency
thc_percentage?: number;
cbd_percentage?: number;
cannabinoids?: any;
// Weight/Options
weight_mg?: number;
net_weight_value?: number;
net_weight_unit?: string;
options?: string[];
raw_options?: string[];
// Images
image_url?: string;
additional_images?: string[];
// Flags
is_featured: boolean;
medical_only: boolean;
rec_only: boolean;
// Timestamps
source_created_at?: Date;
source_updated_at?: Date;
// Raw
description?: string;
raw_data?: any;
}
// =====================================================
// NORMALIZER: Dutchie GraphQL → DB Schema
// =====================================================
export function normalizeDutchieProduct(product: DutchieProduct): NormalizedProduct {
// Extract first special if exists
const saleSpecial = product.specialData?.saleSpecials?.[0];
// Calculate inventory from POSMetaData children
const children = product.POSMetaData?.children || [];
const totalQuantity = children.reduce((sum, c) => sum + (c.quantity || 0), 0);
const availableQuantity = children.reduce((sum, c) => sum + (c.quantityAvailable || 0), 0);
// Parse timestamps
let sourceCreatedAt: Date | undefined;
if (product.createdAt) {
// createdAt is a timestamp string like "1729044510543"
const ts = parseInt(product.createdAt, 10);
if (!isNaN(ts)) {
sourceCreatedAt = new Date(ts);
}
}
let sourceUpdatedAt: Date | undefined;
if (product.updatedAt) {
sourceUpdatedAt = new Date(product.updatedAt);
}
return {
// Identity
external_id: product._id || product.id,
slug: product.cName,
name: product.Name,
enterprise_product_id: product.enterpriseProductId,
// Brand
brand: product.brandName || product.brand?.name,
brand_external_id: product.brandId || product.brand?.id,
brand_logo_url: product.brandLogo || product.brand?.imageUrl,
// Category
subcategory: product.subcategory,
strain_type: product.strainType,
canonical_category: product.POSMetaData?.canonicalCategory,
// Pricing
price: product.Prices?.[0],
rec_price: product.recPrices?.[0],
med_price: product.medicalPrices?.[0],
rec_special_price: product.recSpecialPrices?.[0],
med_special_price: product.medicalSpecialPrices?.[0],
// Specials
is_on_special: product.special === true,
special_name: saleSpecial?.specialName,
discount_percent: saleSpecial?.percentDiscount ? saleSpecial.discount : undefined,
special_data: product.specialData,
// Inventory
sku: product.POSMetaData?.canonicalSKU,
inventory_quantity: totalQuantity || undefined,
inventory_available: availableQuantity || undefined,
is_below_threshold: product.isBelowThreshold === true,
status: product.Status,
// Potency
thc_percentage: product.THCContent?.range?.[0],
cbd_percentage: product.CBDContent?.range?.[0],
cannabinoids: product.cannabinoidsV2,
// Weight/Options
weight_mg: product.weight,
net_weight_value: product.measurements?.netWeight?.values?.[0],
net_weight_unit: product.measurements?.netWeight?.unit,
options: product.Options,
raw_options: product.rawOptions,
// Images
image_url: product.Image,
additional_images: product.images?.length ? product.images : undefined,
// Flags
is_featured: product.featured === true,
medical_only: product.medicalOnly === true,
rec_only: product.recOnly === true,
// Timestamps
source_created_at: sourceCreatedAt,
source_updated_at: sourceUpdatedAt,
// Description
description: typeof product.description === 'string' ? product.description : undefined,
// Raw
raw_data: product,
};
}
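The normalizer parses `createdAt` as an epoch-millisecond string but passes `updatedAt` straight to `new Date()`. If `updatedAt` can also arrive as an all-digit string (an assumption; the source documents that form only for `createdAt`), a shared helper could accept both forms. The name `parseSourceTimestamp` is hypothetical.

```typescript
// Hypothetical helper: accepts either an epoch-millisecond string such as
// "1729044510543" or a date string that the Date constructor understands.
// Returns undefined for missing or unparseable input.
function parseSourceTimestamp(value?: string): Date | undefined {
  if (!value) return undefined;
  // All-digit strings are treated as epoch milliseconds.
  const date = /^\d+$/.test(value) ? new Date(Number(value)) : new Date(value);
  return Number.isNaN(date.getTime()) ? undefined : date;
}
```

Returning `undefined` instead of an invalid `Date` keeps unparseable values out of the `source_created_at` and `source_updated_at` fields.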
// =====================================================
// PUPPETEER SCRAPER
// =====================================================
interface CapturedProducts {
products: DutchieProduct[];
dispensaryId: string;
menuUrl: string;
}
export async function fetchDutchieMenuViaPuppeteer(
menuUrl: string,
options: {
headless?: boolean | 'new';
timeout?: number;
maxScrolls?: number;
} = {}
): Promise<CapturedProducts> {
const {
headless = 'new',
timeout = 90000,
maxScrolls = 30, // Increased for full menu capture
} = options;
let browser: Browser | undefined;
const capturedProducts: DutchieProduct[] = [];
let dispensaryId = '';
try {
browser = await puppeteer.launch({
headless,
args: [
'--no-sandbox',
'--disable-setuid-sandbox',
'--disable-dev-shm-usage',
'--disable-blink-features=AutomationControlled',
],
});
const page = await browser.newPage();
// Stealth configuration
await page.setUserAgent(
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36'
);
await page.setViewport({ width: 1920, height: 1080 });
await page.evaluateOnNewDocument(() => {
Object.defineProperty(navigator, 'webdriver', { get: () => false });
(window as any).chrome = { runtime: {} };
});
// Track seen product IDs to avoid duplicates
const seenIds = new Set<string>();
// Intercept GraphQL responses
page.on('response', async (response: HTTPResponse) => {
const url = response.url();
if (!url.includes('graphql')) return;
try {
const contentType = response.headers()['content-type'] || '';
if (!contentType.includes('application/json')) return;
const data = await response.json();
// Capture dispensary ID
if (data?.data?.getAddressBasedDispensaryData?.dispensaryData?.dispensaryId) {
dispensaryId = data.data.getAddressBasedDispensaryData.dispensaryData.dispensaryId;
}
// Capture products from FilteredProducts
if (data?.data?.filteredProducts?.products) {
const products = data.data.filteredProducts.products as DutchieProduct[];
for (const product of products) {
if (!seenIds.has(product._id)) {
seenIds.add(product._id);
capturedProducts.push(product);
}
}
}
} catch {
// Ignore parse errors
}
});
// Navigate to menu
console.log('[DutchieGraphQL] Loading menu page...');
await page.goto(menuUrl, {
waitUntil: 'networkidle2',
timeout,
});
// Get dispensary ID from window.reactEnv if not captured
if (!dispensaryId) {
dispensaryId = await page.evaluate(() => {
const env = (window as any).reactEnv;
return env?.dispensaryId || env?.retailerId || '';
});
}
// Helper function to scroll through a page until no more products load
async function scrollToLoadAll(maxScrollAttempts: number = maxScrolls): Promise<void> {
let scrollCount = 0;
let previousCount = 0;
let noNewProductsCount = 0;
while (scrollCount < maxScrollAttempts && noNewProductsCount < 3) {
await page.evaluate(() => window.scrollTo(0, document.body.scrollHeight));
await new Promise((r) => setTimeout(r, 1500));
const currentCount = seenIds.size;
if (currentCount === previousCount) {
noNewProductsCount++;
} else {
noNewProductsCount = 0;
}
previousCount = currentCount;
scrollCount++;
}
}
// First, scroll through the main page (all products)
console.log('[DutchieGraphQL] Scrolling main page...');
await scrollToLoadAll();
console.log(`[DutchieGraphQL] After main page: ${seenIds.size} products`);
// Get category links from the navigation
const categoryLinks = await page.evaluate(() => {
const links: string[] = [];
// Look for category navigation links
const navLinks = document.querySelectorAll('a[href*="/products/"]');
navLinks.forEach((link) => {
const href = (link as HTMLAnchorElement).href;
if (href && !links.includes(href)) {
links.push(href);
}
});
return links;
});
console.log(`[DutchieGraphQL] Found ${categoryLinks.length} category links`);
// Visit each category page to capture all products
for (const categoryUrl of categoryLinks) {
try {
console.log(`[DutchieGraphQL] Visiting category: ${categoryUrl.split('/').pop()}`);
await page.goto(categoryUrl, {
waitUntil: 'networkidle2',
timeout: 30000,
});
await scrollToLoadAll(15); // Fewer scrolls per category
console.log(`[DutchieGraphQL] Total products: ${seenIds.size}`);
} catch (e: any) {
console.log(`[DutchieGraphQL] Category error: ${e.message}`);
}
}
// Wait for any final responses
await new Promise((r) => setTimeout(r, 2000));
return {
products: capturedProducts,
dispensaryId,
menuUrl,
};
} finally {
if (browser) {
await browser.close();
}
}
}
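The termination rule inside `scrollToLoadAll` (stop once the product count has not changed for three consecutive scrolls) can be isolated from the browser. A small sketch, with the hypothetical name `makeStallDetector`:

```typescript
// Returns a function that reports true once the observed count has stayed
// unchanged for `limit` consecutive observations. Like the original loop,
// the previous count starts at 0.
function makeStallDetector(limit: number): (count: number) => boolean {
  let previous = 0;
  let stalled = 0;
  return (count: number): boolean => {
    stalled = count === previous ? stalled + 1 : 0;
    previous = count;
    return stalled >= limit;
  };
}
```

Inside a scroll loop, the detector would be fed `seenIds.size` after each wait, and the loop would exit when it returns `true` or the scroll budget runs out.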
// =====================================================
// DATABASE OPERATIONS
// =====================================================
export async function upsertProducts(
pool: Pool,
storeId: number,
products: NormalizedProduct[]
): Promise<{ inserted: number; updated: number }> {
const client = await pool.connect();
let inserted = 0;
let updated = 0;
try {
await client.query('BEGIN');
for (const product of products) {
// Upsert product
const result = await client.query(
`
INSERT INTO products (
store_id, external_id, slug, name, enterprise_product_id,
brand, brand_external_id, brand_logo_url,
subcategory, strain_type, canonical_category,
price, rec_price, med_price, rec_special_price, med_special_price,
is_on_special, special_name, discount_percent, special_data,
sku, inventory_quantity, inventory_available, is_below_threshold, status,
thc_percentage, cbd_percentage, cannabinoids,
weight_mg, net_weight_value, net_weight_unit, options, raw_options,
image_url, additional_images,
is_featured, medical_only, rec_only,
source_created_at, source_updated_at,
description, raw_data,
dutchie_url, last_seen_at, updated_at
)
VALUES (
$1, $2, $3, $4, $5,
$6, $7, $8,
$9, $10, $11,
$12, $13, $14, $15, $16,
$17, $18, $19, $20,
$21, $22, $23, $24, $25,
$26, $27, $28,
$29, $30, $31, $32, $33,
$34, $35,
$36, $37, $38,
$39, $40,
$41, $42,
'', NOW(), NOW()
)
ON CONFLICT (store_id, slug) DO UPDATE SET
name = EXCLUDED.name,
enterprise_product_id = EXCLUDED.enterprise_product_id,
brand = EXCLUDED.brand,
brand_external_id = EXCLUDED.brand_external_id,
brand_logo_url = EXCLUDED.brand_logo_url,
subcategory = EXCLUDED.subcategory,
strain_type = EXCLUDED.strain_type,
canonical_category = EXCLUDED.canonical_category,
price = EXCLUDED.price,
rec_price = EXCLUDED.rec_price,
med_price = EXCLUDED.med_price,
rec_special_price = EXCLUDED.rec_special_price,
med_special_price = EXCLUDED.med_special_price,
is_on_special = EXCLUDED.is_on_special,
special_name = EXCLUDED.special_name,
discount_percent = EXCLUDED.discount_percent,
special_data = EXCLUDED.special_data,
sku = EXCLUDED.sku,
inventory_quantity = EXCLUDED.inventory_quantity,
inventory_available = EXCLUDED.inventory_available,
is_below_threshold = EXCLUDED.is_below_threshold,
status = EXCLUDED.status,
thc_percentage = EXCLUDED.thc_percentage,
cbd_percentage = EXCLUDED.cbd_percentage,
cannabinoids = EXCLUDED.cannabinoids,
weight_mg = EXCLUDED.weight_mg,
net_weight_value = EXCLUDED.net_weight_value,
net_weight_unit = EXCLUDED.net_weight_unit,
options = EXCLUDED.options,
raw_options = EXCLUDED.raw_options,
image_url = EXCLUDED.image_url,
additional_images = EXCLUDED.additional_images,
is_featured = EXCLUDED.is_featured,
medical_only = EXCLUDED.medical_only,
rec_only = EXCLUDED.rec_only,
source_created_at = EXCLUDED.source_created_at,
source_updated_at = EXCLUDED.source_updated_at,
description = EXCLUDED.description,
raw_data = EXCLUDED.raw_data,
last_seen_at = NOW(),
updated_at = NOW()
RETURNING (xmax = 0) AS was_inserted
`,
[
storeId,
product.external_id,
product.slug,
product.name,
product.enterprise_product_id,
product.brand,
product.brand_external_id,
product.brand_logo_url,
product.subcategory,
product.strain_type,
product.canonical_category,
product.price,
product.rec_price,
product.med_price,
product.rec_special_price,
product.med_special_price,
product.is_on_special,
product.special_name,
product.discount_percent,
product.special_data ? JSON.stringify(product.special_data) : null,
product.sku,
product.inventory_quantity,
product.inventory_available,
product.is_below_threshold,
product.status,
product.thc_percentage,
product.cbd_percentage,
product.cannabinoids ? JSON.stringify(product.cannabinoids) : null,
product.weight_mg,
product.net_weight_value,
product.net_weight_unit,
product.options,
product.raw_options,
product.image_url,
product.additional_images,
product.is_featured,
product.medical_only,
product.rec_only,
product.source_created_at,
product.source_updated_at,
product.description,
product.raw_data ? JSON.stringify(product.raw_data) : null,
]
);
if (result.rows[0]?.was_inserted) {
inserted++;
} else {
updated++;
}
}
await client.query('COMMIT');
return { inserted, updated };
} catch (error) {
await client.query('ROLLBACK');
throw error;
} finally {
client.release();
}
}
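The upsert relies on `RETURNING (xmax = 0) AS was_inserted` to tell inserts from updates. `xmax` is a PostgreSQL system column, and reading zero as "freshly inserted" is a widely used idiom rather than a documented contract, so treat the counts as best-effort. The counting step can be sketched on its own; `UpsertRow` and `tallyUpserts` are hypothetical names.

```typescript
// Shape of the single row returned by each upsert statement.
interface UpsertRow {
  was_inserted: boolean;
}

// Mirrors the loop above: a missing row or a false flag counts as an update.
function tallyUpserts(
  rows: Array<UpsertRow | undefined>
): { inserted: number; updated: number } {
  let inserted = 0;
  let updated = 0;
  for (const row of rows) {
    if (row?.was_inserted) {
      inserted++;
    } else {
      updated++;
    }
  }
  return { inserted, updated };
}
```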
// =====================================================
// MAIN ENTRY POINT
// =====================================================
/**
* @deprecated DEPRECATED - Use src/dutchie-az/services/product-crawler.ts instead.
* This function is disabled and will throw an error if called.
*/
export async function scrapeDutchieMenu(
pool: Pool,
storeId: number,
menuUrl: string
): Promise<{
success: boolean;
productsFound: number;
inserted: number;
updated: number;
error?: string;
}> {
// DEPRECATED: Throw error to prevent accidental use
throw new Error(
'DEPRECATED: scrapeDutchieMenu() is deprecated. ' +
'Use src/dutchie-az/services/product-crawler.ts instead. ' +
'This scraper writes to the legacy products table.'
);
// Original code below is unreachable but kept for reference
try {
console.log(`[DutchieGraphQL] Scraping: ${menuUrl}`);
// Fetch products via Puppeteer
const { products, dispensaryId } = await fetchDutchieMenuViaPuppeteer(menuUrl);
console.log(`[DutchieGraphQL] Captured ${products.length} products, dispensaryId: ${dispensaryId}`);
if (products.length === 0) {
return {
success: false,
productsFound: 0,
inserted: 0,
updated: 0,
error: 'No products captured from GraphQL responses',
};
}
// Normalize products
const normalized = products.map(normalizeDutchieProduct);
// Upsert to database
const { inserted, updated } = await upsertProducts(pool, storeId, normalized);
console.log(`[DutchieGraphQL] Upsert complete: ${inserted} inserted, ${updated} updated`);
return {
success: true,
productsFound: products.length,
inserted,
updated,
};
} catch (error: any) {
console.error(`[DutchieGraphQL] Error:`, error.message);
return {
success: false,
productsFound: 0,
inserted: 0,
updated: 0,
error: error.message,
};
}
}


@@ -1,102 +0,0 @@
// ============================================================================
// DEPRECATED: Dutchie now crawled via GraphQL only (see dutchie-az pipeline)
// DO NOT USE - This HTML scraper is unreliable and targets the legacy products table.
// All Dutchie crawling must go through: src/dutchie-az/services/product-crawler.ts
// ============================================================================
import { Page } from 'playwright';
import { logger } from '../../services/logger';
export interface ScraperTemplate {
name: string;
urlPattern: RegExp;
buildCategoryUrl: (baseUrl: string, category: string) => string;
extractProducts: (page: Page) => Promise<any[]>;
}
/**
* @deprecated DEPRECATED - Dutchie HTML scraping is no longer supported.
* Use the dutchie-az GraphQL pipeline instead: src/dutchie-az/services/product-crawler.ts
* This template relied on unstable DOM selectors and wrote to legacy tables.
*/
export const dutchieTemplate: ScraperTemplate = {
name: 'Dutchie Marketplace',
urlPattern: /dutchie\.com\/dispensary\//,
buildCategoryUrl: (baseUrl: string, category: string) => {
// Remove trailing slash
const base = baseUrl.replace(/\/$/, '');
// Convert category name to URL-friendly slug
const categorySlug = category.toLowerCase().replace(/\s+/g, '-');
return `${base}/products/${categorySlug}`;
},
extractProducts: async (page: Page) => {
const products: any[] = [];
try {
// Wait for product cards to load
await page.waitForSelector('a[data-testid="card-link"]', { timeout: 10000 }).catch(() => {
logger.warn('scraper', 'No product cards found with data-testid="card-link"');
});
// Get all product card links
const productCards = await page.locator('a[href*="/product/"][data-testid="card-link"]').all();
logger.info('scraper', `Found ${productCards.length} Dutchie product cards`);
for (const card of productCards) {
try {
// Extract all data at once using evaluate for speed
const cardData = await card.evaluate((el) => {
const href = el.getAttribute('href') || '';
const img = el.querySelector('img');
const imageUrl = img ? img.getAttribute('src') || '' : '';
// Collect the trimmed text of leaf elements, in document order
const textElements = Array.from(el.querySelectorAll('*'))
.filter(node => node.textContent && node.children.length === 0)
.map(node => (node.textContent || '').trim())
.filter(text => text.length > 0);
const name = textElements[0] || '';
const brand = textElements[1] || '';
// Look for price
const priceMatch = el.textContent?.match(/\$(\d+(?:\.\d{2})?)/);
const price = priceMatch ? parseFloat(priceMatch[1]) : undefined;
return { href, imageUrl, name, brand, price };
});
if (cardData.name && cardData.href) {
products.push({
name: cardData.name,
brand: cardData.brand || undefined,
product_url: cardData.href.startsWith('http') ? cardData.href : `https://dutchie.com${cardData.href}`,
image_url: cardData.imageUrl || undefined,
price: cardData.price,
in_stock: true,
});
}
} catch (err) {
logger.warn('scraper', `Error extracting Dutchie product card: ${err}`);
}
}
} catch (err) {
logger.error('scraper', `Error in Dutchie product extraction: ${err}`);
}
return products;
},
};
/**
* Get the appropriate scraper template based on URL
*/
export function getTemplateForUrl(url: string): ScraperTemplate | null {
if (dutchieTemplate.urlPattern.test(url)) {
return dutchieTemplate;
}
return null;
}
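The two pure pieces of the template, the category URL builder and the price pattern, can be exercised without a browser. The sketch below copies both expressions from the template; `extractPrice` is a hypothetical wrapper name, and the example URL is illustrative.

```typescript
// Same logic as dutchieTemplate.buildCategoryUrl.
function buildCategoryUrl(baseUrl: string, category: string): string {
  const base = baseUrl.replace(/\/$/, ''); // drop one trailing slash
  const slug = category.toLowerCase().replace(/\s+/g, '-');
  return `${base}/products/${slug}`;
}

// Same pattern as the card extractor: the first "$" amount in the text.
function extractPrice(text: string): number | undefined {
  const amount = text.match(/\$(\d+(?:\.\d{2})?)/)?.[1];
  return amount === undefined ? undefined : parseFloat(amount);
}
```

The pattern stops at the first non-digit, so a price written with a thousands separator is truncated at the comma.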


@@ -905,12 +905,13 @@ async function backfillProducts(
let crawlRunId = crawlRunCache.get(dayKey);
if (!crawlRunId && !options.dryRun) {
-crawlRunId = await getOrCreateBackfillCrawlRun(
+const newCrawlRunId = await getOrCreateBackfillCrawlRun(
pool,
product.dispensary_id,
capturedAt,
options.dryRun
);
+crawlRunId = newCrawlRunId ?? undefined;
if (crawlRunId) {
crawlRunCache.set(dayKey, crawlRunId);
stats.crawlRunsCreated++;
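This hunk appears to fix a type mismatch: a value that may be `null` was being assigned to a variable that only allows `undefined`. A self-contained sketch of the pattern follows. It assumes `getOrCreateBackfillCrawlRun` returns `number | null`, which the diff implies but does not show; `getOrCreateRun` and `resolveRunId` are hypothetical stand-ins.

```typescript
const crawlRunCache = new Map<string, number>();

// Hypothetical stand-in for getOrCreateBackfillCrawlRun: null means no run.
function getOrCreateRun(dryRun: boolean): number | null {
  return dryRun ? null : 42;
}

function resolveRunId(dayKey: string, dryRun: boolean): number | undefined {
  // Map.get returns `number | undefined`, so the variable cannot hold null.
  let crawlRunId: number | undefined = crawlRunCache.get(dayKey);
  if (!crawlRunId && !dryRun) {
    const newCrawlRunId = getOrCreateRun(dryRun);
    // `?? undefined` converts null to undefined to match the declared type.
    crawlRunId = newCrawlRunId ?? undefined;
    if (crawlRunId) {
      crawlRunCache.set(dayKey, crawlRunId);
    }
  }
  return crawlRunId;
}
```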


@@ -212,7 +212,7 @@ EXAMPLES:
try {
// Fetch all stores without a dispensary_id
-const storesResult = await pool.query<Store>(`
+const storesResult = await pool.query(`
SELECT id, name, slug, dispensary_id
FROM stores
WHERE dispensary_id IS NULL
@@ -221,7 +221,7 @@ EXAMPLES:
const unmappedStores = storesResult.rows;
// Fetch all already-mapped stores for context
-const mappedResult = await pool.query<Store>(`
+const mappedResult = await pool.query(`
SELECT id, name, slug, dispensary_id
FROM stores
WHERE dispensary_id IS NOT NULL
@@ -230,7 +230,7 @@ EXAMPLES:
const mappedStores = mappedResult.rows;
// Fetch all dispensaries
-const dispResult = await pool.query<Dispensary>(`
+const dispResult = await pool.query(`
SELECT id, name, company_name, city, address, slug
FROM dispensaries
ORDER BY name


@@ -1,388 +0,0 @@
#!/usr/bin/env npx tsx
/**
* Bootstrap Discovery Script
*
* One-time (but reusable) bootstrap command that:
* 1. Ensures every Dispensary has a dispensary_crawl_schedule entry (4h default)
* 2. Optionally runs RunDispensaryOrchestrator for each dispensary
*
* Usage:
* npx tsx src/scripts/bootstrap-discovery.ts # Create schedules only
* npx tsx src/scripts/bootstrap-discovery.ts --run # Create schedules + run orchestrator
* npx tsx src/scripts/bootstrap-discovery.ts --run --limit=10 # Run for first 10 dispensaries
* npx tsx src/scripts/bootstrap-discovery.ts --dry-run # Preview what would happen
* npx tsx src/scripts/bootstrap-discovery.ts --status # Show current status only
*/
import { pool } from '../db/pool';
import {
ensureAllDispensariesHaveSchedules,
runDispensaryOrchestrator,
runBatchDispensaryOrchestrator,
getDispensariesDueForOrchestration,
} from '../services/dispensary-orchestrator';
// Parse command line args
const args = process.argv.slice(2);
const flags = {
run: args.includes('--run'),
dryRun: args.includes('--dry-run'),
status: args.includes('--status'),
help: args.includes('--help') || args.includes('-h'),
limit: parseInt(args.find(a => a.startsWith('--limit='))?.split('=')[1] || '0', 10),
concurrency: parseInt(args.find(a => a.startsWith('--concurrency='))?.split('=')[1] || '3', 10),
interval: parseInt(args.find(a => a.startsWith('--interval='))?.split('=')[1] || '240', 10),
detectionOnly: args.includes('--detection-only'),
productionOnly: args.includes('--production-only'),
sandboxOnly: args.includes('--sandbox-only'),
};
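The numeric flags above fall back to a default only when the flag is absent or empty; a value such as `--interval=abc` parses to `NaN` and is passed along unchanged. A small sketch of a stricter reader, using the hypothetical name `readIntFlag`:

```typescript
// Reads `--name=N` from an argument list, returning `fallback` when the
// flag is missing or its value is not an integer prefix.
function readIntFlag(args: string[], name: string, fallback: number): number {
  const prefix = `--${name}=`;
  const raw = args.find((a) => a.startsWith(prefix))?.slice(prefix.length);
  const parsed = raw === undefined ? NaN : parseInt(raw, 10);
  return Number.isNaN(parsed) ? fallback : parsed;
}
```

With this helper, the `flags` object could read `limit: readIntFlag(args, 'limit', 0)` and similar for `concurrency` and `interval`.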
async function showHelp() {
console.log(`
Bootstrap Discovery - Initialize Dispensary Crawl System
USAGE:
npx tsx src/scripts/bootstrap-discovery.ts [OPTIONS]
OPTIONS:
--run After creating schedules, run the orchestrator for each dispensary
--dry-run Show what would happen without making changes
--status Show current status and exit
--limit=N Limit how many dispensaries to process (0 = all, default: 0)
--concurrency=N How many dispensaries to process in parallel (default: 3)
--interval=M Default interval in minutes for new schedules (default: 240 = 4 hours)
--detection-only Only run detection, don't crawl
--production-only Only run dispensaries in production mode
--sandbox-only Only run dispensaries in sandbox mode
--help, -h Show this help message
EXAMPLES:
# Create schedule entries for all dispensaries (no crawling)
npx tsx src/scripts/bootstrap-discovery.ts
# Create schedules and run orchestrator for all dispensaries
npx tsx src/scripts/bootstrap-discovery.ts --run
# Run orchestrator for first 10 dispensaries
npx tsx src/scripts/bootstrap-discovery.ts --run --limit=10
# Run with higher concurrency
npx tsx src/scripts/bootstrap-discovery.ts --run --concurrency=5
# Show current status
npx tsx src/scripts/bootstrap-discovery.ts --status
WHAT IT DOES:
1. Creates dispensary_crawl_schedule entries for all dispensaries that don't have one
2. If --run: For each dispensary, runs the orchestrator which:
a. Checks if provider detection is needed (null/unknown/stale/low confidence)
b. Runs detection if needed
c. If Dutchie + production mode: runs production crawl
d. Otherwise: runs sandbox crawl
3. Updates schedule status and job records
`);
}
async function showStatus() {
console.log('\n📊 Current Dispensary Crawl Status\n');
console.log('═'.repeat(70));
// Get dispensary counts by provider
const providerStats = await pool.query(`
SELECT
COALESCE(product_provider, 'undetected') as provider,
COUNT(*) as count,
COUNT(*) FILTER (WHERE product_crawler_mode = 'production') as production,
COUNT(*) FILTER (WHERE product_crawler_mode = 'sandbox') as sandbox,
COUNT(*) FILTER (WHERE product_crawler_mode IS NULL) as no_mode
FROM dispensaries
GROUP BY COALESCE(product_provider, 'undetected')
ORDER BY count DESC
`);
console.log('\nProvider Distribution:');
console.log('-'.repeat(60));
console.log(
'Provider'.padEnd(20) +
'Total'.padStart(8) +
'Production'.padStart(12) +
'Sandbox'.padStart(10) +
'No Mode'.padStart(10)
);
console.log('-'.repeat(60));
for (const row of providerStats.rows) {
console.log(
row.provider.padEnd(20) +
row.count.toString().padStart(8) +
row.production.toString().padStart(12) +
row.sandbox.toString().padStart(10) +
row.no_mode.toString().padStart(10)
);
}
// Get schedule stats
const scheduleStats = await pool.query(`
SELECT
COUNT(DISTINCT d.id) as total_dispensaries,
COUNT(DISTINCT dcs.id) as with_schedule,
COUNT(DISTINCT d.id) - COUNT(DISTINCT dcs.id) as without_schedule,
COUNT(*) FILTER (WHERE dcs.is_active = TRUE) as active_schedules,
COUNT(*) FILTER (WHERE dcs.last_status = 'success') as last_success,
COUNT(*) FILTER (WHERE dcs.last_status = 'error') as last_error,
COUNT(*) FILTER (WHERE dcs.last_status = 'sandbox_only') as last_sandbox,
COUNT(*) FILTER (WHERE dcs.last_status = 'detection_only') as last_detection,
COUNT(*) FILTER (WHERE dcs.next_run_at <= NOW()) as due_now,
AVG(dcs.interval_minutes)::INTEGER as avg_interval
FROM dispensaries d
LEFT JOIN dispensary_crawl_schedule dcs ON dcs.dispensary_id = d.id
`);
const s = scheduleStats.rows[0];
console.log('\n\nSchedule Status:');
console.log('-'.repeat(60));
console.log(` Total Dispensaries: ${s.total_dispensaries}`);
console.log(` With Schedule: ${s.with_schedule}`);
console.log(` Without Schedule: ${s.without_schedule}`);
console.log(` Active Schedules: ${s.active_schedules || 0}`);
console.log(` Average Interval: ${s.avg_interval || 240} minutes`);
console.log('\n Last Run Status:');
console.log(` - Success: ${s.last_success || 0}`);
console.log(` - Error: ${s.last_error || 0}`);
console.log(` - Sandbox Only: ${s.last_sandbox || 0}`);
console.log(` - Detection Only: ${s.last_detection || 0}`);
console.log(` - Due Now: ${s.due_now || 0}`);
// Get recent job stats
const jobStats = await pool.query(`
SELECT
COUNT(*) as total,
COUNT(*) FILTER (WHERE status = 'completed') as completed,
COUNT(*) FILTER (WHERE status = 'failed') as failed,
COUNT(*) FILTER (WHERE status = 'running') as running,
COUNT(*) FILTER (WHERE status = 'pending') as pending,
COUNT(*) FILTER (WHERE detection_ran = TRUE) as with_detection,
COUNT(*) FILTER (WHERE crawl_ran = TRUE) as with_crawl,
COUNT(*) FILTER (WHERE crawl_type = 'production') as production_crawls,
COUNT(*) FILTER (WHERE crawl_type = 'sandbox') as sandbox_crawls,
SUM(products_found) as total_products_found
FROM dispensary_crawl_jobs
WHERE created_at > NOW() - INTERVAL '24 hours'
`);
const j = jobStats.rows[0];
console.log('\n\nJobs (Last 24 Hours):');
console.log('-'.repeat(60));
console.log(` Total Jobs: ${j.total || 0}`);
console.log(` Completed: ${j.completed || 0}`);
console.log(` Failed: ${j.failed || 0}`);
console.log(` Running: ${j.running || 0}`);
console.log(` Pending: ${j.pending || 0}`);
console.log(` With Detection: ${j.with_detection || 0}`);
console.log(` With Crawl: ${j.with_crawl || 0}`);
console.log(` - Production: ${j.production_crawls || 0}`);
console.log(` - Sandbox: ${j.sandbox_crawls || 0}`);
console.log(` Products Found: ${j.total_products_found || 0}`);
console.log('\n' + '═'.repeat(70) + '\n');
}
async function createSchedules(): Promise<{ created: number; existing: number }> {
console.log('\n📅 Creating Dispensary Schedules...\n');
if (flags.dryRun) {
// Count how many would be created
const result = await pool.query(`
SELECT COUNT(*) as count
FROM dispensaries d
WHERE NOT EXISTS (
SELECT 1 FROM dispensary_crawl_schedule dcs WHERE dcs.dispensary_id = d.id
)
`);
const wouldCreate = parseInt(result.rows[0].count, 10);
console.log(` Would create ${wouldCreate} new schedule entries (${flags.interval} minute interval)`);
return { created: wouldCreate, existing: 0 };
}
const result = await ensureAllDispensariesHaveSchedules(flags.interval);
console.log(` ✓ Created ${result.created} new schedule entries`);
console.log(`${result.existing} dispensaries already had schedules`);
return result;
}
async function getDispensariesToProcess(): Promise<number[]> {
// Build query based on filters
let whereClause = 'TRUE';
if (flags.productionOnly) {
whereClause += ` AND d.product_crawler_mode = 'production'`;
} else if (flags.sandboxOnly) {
whereClause += ` AND d.product_crawler_mode = 'sandbox'`;
}
if (flags.detectionOnly) {
whereClause += ` AND (d.product_provider IS NULL OR d.product_provider = 'unknown' OR d.product_confidence < 50)`;
}
const limitClause = flags.limit > 0 ? `LIMIT ${flags.limit}` : '';
const query = `
SELECT d.id, d.name, d.product_provider, d.product_crawler_mode
FROM dispensaries d
LEFT JOIN dispensary_crawl_schedule dcs ON dcs.dispensary_id = d.id
WHERE ${whereClause}
ORDER BY
COALESCE(dcs.priority, 0) DESC,
dcs.last_run_at ASC NULLS FIRST,
d.id ASC
${limitClause}
`;
const result = await pool.query(query);
return result.rows.map(row => row.id);
}
async function runOrchestrator() {
console.log('\n🚀 Running Dispensary Orchestrator...\n');
const dispensaryIds = await getDispensariesToProcess();
if (dispensaryIds.length === 0) {
console.log(' No dispensaries to process.');
return;
}
console.log(` Found ${dispensaryIds.length} dispensaries to process`);
console.log(` Concurrency: ${flags.concurrency}`);
if (flags.dryRun) {
console.log('\n Would process these dispensaries:');
const details = await pool.query(
`SELECT id, name, product_provider, product_crawler_mode
FROM dispensaries WHERE id = ANY($1) ORDER BY id`,
[dispensaryIds]
);
for (const row of details.rows.slice(0, 20)) {
console.log(` - [${row.id}] ${row.name} (${row.product_provider || 'undetected'}, ${row.product_crawler_mode || 'no mode'})`);
}
if (details.rows.length > 20) {
console.log(` ... and ${details.rows.length - 20} more`);
}
return;
}
console.log('\n Starting batch processing...\n');
const results = await runBatchDispensaryOrchestrator(dispensaryIds, flags.concurrency);
// Summarize results
const summary = {
total: results.length,
success: results.filter(r => r.status === 'success').length,
sandboxOnly: results.filter(r => r.status === 'sandbox_only').length,
detectionOnly: results.filter(r => r.status === 'detection_only').length,
error: results.filter(r => r.status === 'error').length,
detectionsRan: results.filter(r => r.detectionRan).length,
crawlsRan: results.filter(r => r.crawlRan).length,
productionCrawls: results.filter(r => r.crawlType === 'production').length,
sandboxCrawls: results.filter(r => r.crawlType === 'sandbox').length,
totalProducts: results.reduce((sum, r) => sum + (r.productsFound || 0), 0),
totalDuration: results.reduce((sum, r) => sum + r.durationMs, 0),
};
console.log('\n' + '═'.repeat(70));
console.log(' Orchestrator Results');
console.log('═'.repeat(70));
console.log(`
Total Processed: ${summary.total}
Status:
- Success: ${summary.success}
- Sandbox Only: ${summary.sandboxOnly}
- Detection Only: ${summary.detectionOnly}
- Error: ${summary.error}
Operations:
- Detections Ran: ${summary.detectionsRan}
- Crawls Ran: ${summary.crawlsRan}
- Production: ${summary.productionCrawls}
- Sandbox: ${summary.sandboxCrawls}
Results:
- Products Found: ${summary.totalProducts}
- Total Duration: ${(summary.totalDuration / 1000).toFixed(1)}s
- Avg per Dispensary: ${(summary.total > 0 ? summary.totalDuration / summary.total / 1000 : 0).toFixed(1)}s
`);
console.log('═'.repeat(70) + '\n');
// Show errors if any
const errors = results.filter(r => r.status === 'error');
if (errors.length > 0) {
console.log('\n⚠ Errors encountered:');
for (const err of errors.slice(0, 10)) {
console.log(` - [${err.dispensaryId}] ${err.dispensaryName}: ${err.error}`);
}
if (errors.length > 10) {
console.log(` ... and ${errors.length - 10} more errors`);
}
}
}
async function main() {
if (flags.help) {
await showHelp();
process.exit(0);
}
console.log('\n' + '═'.repeat(70));
console.log(' Dispensary Crawl Bootstrap Discovery');
console.log('═'.repeat(70));
if (flags.dryRun) {
console.log('\n🔍 DRY RUN MODE - No changes will be made');
}
try {
// Always show status first
await showStatus();
if (flags.status) {
// Status-only mode, we're done
await pool.end();
process.exit(0);
}
// Step 1: Create schedule entries
await createSchedules();
// Step 2: Optionally run orchestrator
if (flags.run) {
await runOrchestrator();
} else {
console.log('\n💡 Tip: Use --run to also run the orchestrator for each dispensary');
}
// Show final status
if (!flags.dryRun) {
await showStatus();
}
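} catch (error: any) {
console.error('\n❌ Fatal error:', error.message);
console.error(error.stack);
// Set the exit code instead of calling process.exit() so the finally block can still close the pool
process.exitCode = 1;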
} finally {
await pool.end();
}
}
main();
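
The `getDispensariesToProcess` query above interpolates `flags.limit` straight into the SQL text. The following is a minimal sketch, not the script's actual code, of the same filter logic as a pure function that passes the limit as a bound parameter. The names `CrawlFlags`, `BuiltQuery`, and `buildDispensaryQuery` are hypothetical and exist only for illustration.

```typescript
// Hypothetical flag shape; mirrors only the fields the query builder reads.
interface CrawlFlags {
  productionOnly: boolean;
  sandboxOnly: boolean;
  detectionOnly: boolean;
  limit: number;
}

// Hypothetical result shape: SQL text plus the values for its $n placeholders.
interface BuiltQuery {
  text: string;
  values: number[];
}

function buildDispensaryQuery(flags: CrawlFlags): BuiltQuery {
  const conditions: string[] = ['TRUE'];
  const values: number[] = [];

  // Same mutually exclusive mode filters as the original script.
  if (flags.productionOnly) {
    conditions.push(`d.product_crawler_mode = 'production'`);
  } else if (flags.sandboxOnly) {
    conditions.push(`d.product_crawler_mode = 'sandbox'`);
  }
  if (flags.detectionOnly) {
    conditions.push(
      `(d.product_provider IS NULL OR d.product_provider = 'unknown' OR d.product_confidence < 50)`
    );
  }

  // Bind the limit as a parameter instead of interpolating it.
  let limitClause = '';
  if (Number.isInteger(flags.limit) && flags.limit > 0) {
    values.push(flags.limit);
    limitClause = `LIMIT $${values.length}`;
  }

  const text = [
    'SELECT d.id, d.name, d.product_provider, d.product_crawler_mode',
    'FROM dispensaries d',
    'LEFT JOIN dispensary_crawl_schedule dcs ON dcs.dispensary_id = d.id',
    `WHERE ${conditions.join(' AND ')}`,
    'ORDER BY COALESCE(dcs.priority, 0) DESC, dcs.last_run_at ASC NULLS FIRST, d.id ASC',
    limitClause,
  ].filter(Boolean).join('\n');

  return { text, values };
}
```

With node-postgres the result could be passed along as `pool.query(q.text, q.values)`. Keeping the builder free of I/O also makes the filter combinations easy to unit test.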


@@ -1,101 +0,0 @@
/**
* LOCAL-ONLY Admin Bootstrap Script
*
* Creates or resets a local admin user for development.
* This script is ONLY for local development - never use in production.
*
* Usage:
* cd backend
* npx tsx src/scripts/bootstrap-local-admin.ts
*
* Default credentials:
* Email: admin@local.test
* Password: admin123
*/
import bcrypt from 'bcrypt';
import { query, closePool } from '../dutchie-az/db/connection';
// Local admin credentials - deterministic for dev
const LOCAL_ADMIN_EMAIL = 'admin@local.test';
const LOCAL_ADMIN_PASSWORD = 'admin123';
const LOCAL_ADMIN_ROLE = 'admin'; // Match existing schema (admin, not superadmin)
async function bootstrapLocalAdmin(): Promise<void> {
console.log('='.repeat(60));
console.log('LOCAL ADMIN BOOTSTRAP');
console.log('='.repeat(60));
console.log('');
console.log('This script creates/resets a local admin user for development.');
console.log('');
try {
// Hash the password with bcrypt (10 rounds, matching existing code)
const passwordHash = await bcrypt.hash(LOCAL_ADMIN_PASSWORD, 10);
// Check if user exists
const existing = await query<{ id: number; email: string }>(
'SELECT id, email FROM users WHERE email = $1',
[LOCAL_ADMIN_EMAIL]
);
if (existing.rows.length > 0) {
// User exists - update password and role
console.log(`User "${LOCAL_ADMIN_EMAIL}" already exists (id=${existing.rows[0].id})`);
console.log('Resetting password and ensuring admin role...');
await query(
`UPDATE users
SET password_hash = $1,
role = $2,
updated_at = NOW()
WHERE email = $3`,
[passwordHash, LOCAL_ADMIN_ROLE, LOCAL_ADMIN_EMAIL]
);
console.log('User updated successfully.');
} else {
// User doesn't exist - create new
console.log(`Creating new admin user: ${LOCAL_ADMIN_EMAIL}`);
const result = await query<{ id: number }>(
`INSERT INTO users (email, password_hash, role, created_at, updated_at)
VALUES ($1, $2, $3, NOW(), NOW())
RETURNING id`,
[LOCAL_ADMIN_EMAIL, passwordHash, LOCAL_ADMIN_ROLE]
);
console.log(`User created successfully (id=${result.rows[0].id})`);
}
console.log('');
console.log('='.repeat(60));
console.log('LOCAL ADMIN READY');
console.log('='.repeat(60));
console.log('');
console.log('Login credentials:');
console.log(` Email: ${LOCAL_ADMIN_EMAIL}`);
console.log(` Password: ${LOCAL_ADMIN_PASSWORD}`);
console.log('');
console.log('Admin UI: http://localhost:8080/admin');
console.log('');
} catch (error: any) {
console.error('');
console.error('ERROR: Failed to bootstrap local admin');
console.error(error.message);
if (error.message.includes('relation "users" does not exist')) {
console.error('');
console.error('The "users" table does not exist.');
console.error('Run migrations first: npm run migrate');
}
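// Set the exit code instead of calling process.exit() so the finally block can still close the pool
process.exitCode = 1;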
} finally {
await closePool();
}
}
// Run the bootstrap
bootstrapLocalAdmin();
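
The script above checks whether the admin row exists and then issues either an UPDATE or an INSERT. As a sketch of an alternative, and assuming the `users` table has a UNIQUE constraint on `email` (the diff does not confirm this), the two branches could collapse into one PostgreSQL upsert. `buildAdminUpsert` is a hypothetical helper name.

```typescript
// Hypothetical result shape: SQL text plus the values for its $n placeholders.
interface BuiltQuery {
  text: string;
  values: string[];
}

// ASSUMPTION: users.email carries a UNIQUE constraint.
// Without one, PostgreSQL rejects ON CONFLICT (email).
function buildAdminUpsert(email: string, passwordHash: string, role: string): BuiltQuery {
  const text = `
    INSERT INTO users (email, password_hash, role, created_at, updated_at)
    VALUES ($1, $2, $3, NOW(), NOW())
    ON CONFLICT (email) DO UPDATE
      SET password_hash = EXCLUDED.password_hash,
          role = EXCLUDED.role,
          updated_at = NOW()
    RETURNING id`;
  return { text, values: [email, passwordHash, role] };
}
```

The single statement avoids the window between the SELECT and the write, at the cost of losing the separate "already exists" log message.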


@@ -1,66 +0,0 @@
/**
* Seed crawl: trigger dutchie crawls for all dispensaries with menu_type='dutchie'
* and a resolved platform_dispensary_id. This uses the AZ orchestrator endpoint logic.
*
* Usage (local):
* node dist/scripts/crawl-all-dutchie.js
*
* Requires:
* - DATABASE_URL/CRAWLSY_DATABASE_URL pointing to the consolidated DB
* - Dispensaries table populated with menu_type and platform_dispensary_id
*/
import { query } from '../dutchie-az/db/connection';
import { runDispensaryOrchestrator } from '../services/dispensary-orchestrator';
async function main() {
const { rows } = await query<{
id: number;
name: string;
slug: string;
platform_dispensary_id: string | null;
}>(`
SELECT id, name, slug, platform_dispensary_id
FROM dispensaries
WHERE menu_type = 'dutchie'
AND platform_dispensary_id IS NOT NULL
ORDER BY id
`);
if (!rows.length) {
console.log('No dutchie dispensaries with resolved platform_dispensary_id found.');
process.exit(0);
}
console.log(`Found ${rows.length} dutchie dispensaries with resolved IDs. Triggering crawls...`);
let success = 0;
let failed = 0;
for (const row of rows) {
try {
console.log(`Crawling ${row.id} (${row.name})...`);
const result = await runDispensaryOrchestrator(row.id);
const ok =
result.status === 'success' ||
result.status === 'sandbox_only' ||
result.status === 'detection_only';
if (ok) {
success++;
} else {
failed++;
console.warn(`Crawl returned status ${result.status} for ${row.id} (${row.name})`);
}
} catch (err: any) {
failed++;
console.error(`Failed crawl for ${row.id} (${row.name}): ${err.message}`);
}
}
console.log(`Completed. Success: ${success}, Failed: ${failed}`);
}
main().catch((err) => {
console.error('Fatal:', err);
process.exit(1);
});
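
The loop above counts `success`, `sandbox_only`, and `detection_only` as successful runs and everything else as a failure, but the process exits with code 0 either way. A small sketch of the same classification as pure helpers, with an exit code derived from the tally. `tallyStatuses` and `exitCodeFor` are hypothetical names, not part of the original script.

```typescript
// Statuses the script above treats as a successful run.
const OK_STATUSES: ReadonlySet<string> = new Set([
  'success',
  'sandbox_only',
  'detection_only',
]);

interface CrawlTally {
  success: number;
  failed: number;
}

// Hypothetical helper: folds a list of orchestrator statuses into counts.
function tallyStatuses(statuses: readonly string[]): CrawlTally {
  const tally: CrawlTally = { success: 0, failed: 0 };
  for (const status of statuses) {
    if (OK_STATUSES.has(status)) {
      tally.success++;
    } else {
      tally.failed++;
    }
  }
  return tally;
}

// Hypothetical helper: a non-zero code lets cron or CI notice failed crawls.
function exitCodeFor(tally: CrawlTally): number {
  return tally.failed > 0 ? 1 : 0;
}
```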


@@ -1,50 +0,0 @@
import { runDispensaryOrchestrator } from '../services/dispensary-orchestrator';
// All 57 dutchie stores with platform_dispensary_id (as of 2024-12)
const ALL_DISPENSARY_IDS = [
72, 74, 75, 76, 77, 78, 81, 82, 85, 87, 91, 92, 97, 101, 106, 108, 110, 112,
115, 120, 123, 125, 128, 131, 135, 139, 140, 143, 144, 145, 152, 153, 161,
168, 176, 177, 180, 181, 189, 195, 196, 199, 200, 201, 205, 206, 207, 213,
214, 224, 225, 227, 232, 235, 248, 252, 281
];
const BATCH_SIZE = 5;
async function run() {
const totalBatches = Math.ceil(ALL_DISPENSARY_IDS.length / BATCH_SIZE);
console.log(`Starting crawl of ${ALL_DISPENSARY_IDS.length} stores in ${totalBatches} batches of ${BATCH_SIZE}...`);
let successCount = 0;
let errorCount = 0;
for (let i = 0; i < ALL_DISPENSARY_IDS.length; i += BATCH_SIZE) {
const batch = ALL_DISPENSARY_IDS.slice(i, i + BATCH_SIZE);
const batchNum = Math.floor(i / BATCH_SIZE) + 1;
console.log(`\n========== BATCH ${batchNum}/${totalBatches} (IDs: ${batch.join(', ')}) ==========`);
for (const id of batch) {
console.log(`\n--- Crawling dispensary ${id} ---`);
try {
const result = await runDispensaryOrchestrator(id);
console.log(` Status: ${result.status}`);
console.log(` Summary: ${result.summary}`);
if (result.productsFound) {
console.log(` Products: ${result.productsFound} found, ${result.productsNew} new, ${result.productsUpdated} updated`);
}
successCount++;
} catch (e: any) {
console.log(` ERROR: ${e.message}`);
errorCount++;
}
}
console.log(`\n--- Batch ${batchNum} complete. Progress: ${Math.min(i + BATCH_SIZE, ALL_DISPENSARY_IDS.length)}/${ALL_DISPENSARY_IDS.length} ---`);
}
console.log('\n========================================');
console.log(`=== ALL CRAWLS COMPLETE ===`);
console.log(`Success: ${successCount}, Errors: ${errorCount}`);
console.log('========================================');
}
run().catch(e => {
console.error('Fatal:', e.message);
process.exit(1);
});
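
The batching in the script above is done with index arithmetic inside the main loop. As an illustration only, the same slicing can be pulled into a generic helper, with the crawl call injected as a worker so the loop can be exercised without the orchestrator. `chunk` and `runInBatches` are hypothetical names.

```typescript
// Splits a list into consecutive batches of at most `size` items.
function chunk<T>(items: readonly T[], size: number): T[][] {
  if (!Number.isInteger(size) || size <= 0) {
    throw new RangeError('size must be a positive integer');
  }
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

interface BatchTally {
  success: number;
  failed: number;
}

// Runs the worker on each item sequentially, batch by batch, as the original
// loop does, counting a thrown error as a failure instead of aborting the run.
async function runInBatches<T>(
  items: readonly T[],
  size: number,
  worker: (item: T) => Promise<void>
): Promise<BatchTally> {
  const tally: BatchTally = { success: 0, failed: 0 };
  for (const batch of chunk(items, size)) {
    for (const item of batch) {
      try {
        await worker(item);
        tally.success++;
      } catch {
        tally.failed++;
      }
    }
  }
  return tally;
}
```

With the 57 IDs above and a batch size of 5, `chunk` would yield 12 batches, matching the `Math.ceil` calculation in the script.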


@@ -0,0 +1,114 @@
/**
* Debug Dutchie city page to see what data is available
*/
import puppeteer from 'puppeteer-extra';
import StealthPlugin from 'puppeteer-extra-plugin-stealth';
puppeteer.use(StealthPlugin());
async function main() {
const cityUrl = process.argv[2] || 'https://dutchie.com/us/dispensaries/wa-bellevue';
console.log(`Debugging page: ${cityUrl}`);
const browser = await puppeteer.launch({
headless: 'new',
args: ['--no-sandbox', '--disable-setuid-sandbox', '--disable-dev-shm-usage'],
});
try {
const page = await browser.newPage();
await page.setUserAgent(
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36'
);
console.log('Navigating...');
await page.goto(cityUrl, {
waitUntil: 'networkidle2',
timeout: 60000,
});
await new Promise((r) => setTimeout(r, 5000));
// Get page title
const title = await page.title();
console.log(`\nPage title: ${title}`);
// Check for Cloudflare challenge
const isCFChallenge = await page.evaluate(() => {
return document.title.includes('Just a moment') ||
document.body.textContent?.includes('Enable JavaScript');
});
if (isCFChallenge) {
console.log('\n⚠ CLOUDFLARE CHALLENGE DETECTED - waiting longer...');
await new Promise((r) => setTimeout(r, 10000));
}
// Check for __NEXT_DATA__
const nextData = await page.evaluate(() => {
const script = document.querySelector('script#__NEXT_DATA__');
if (script) {
try {
return JSON.parse(script.textContent || '{}');
} catch {
return { error: 'Failed to parse __NEXT_DATA__' };
}
}
return null;
});
if (nextData) {
console.log('\n✅ __NEXT_DATA__ found!');
console.log('Keys:', Object.keys(nextData));
if (nextData.props?.pageProps) {
console.log('pageProps keys:', Object.keys(nextData.props.pageProps));
if (nextData.props.pageProps.dispensaries) {
console.log('Dispensaries count:', nextData.props.pageProps.dispensaries.length);
// Show first dispensary structure
const first = nextData.props.pageProps.dispensaries[0];
if (first) {
console.log('\nFirst dispensary keys:', Object.keys(first));
console.log('First dispensary sample:', JSON.stringify(first, null, 2).slice(0, 1000));
}
}
}
} else {
console.log('\n❌ No __NEXT_DATA__ found');
// Check what scripts are on the page
const scripts = await page.evaluate(() => {
return Array.from(document.querySelectorAll('script[id]')).map(s => ({
id: s.id,
src: (s as HTMLScriptElement).src?.slice(0, 100),
}));
});
console.log('Scripts with IDs:', scripts);
// Try to find dispensary data in window object
const windowData = await page.evaluate(() => {
const w = window as any;
const keys = ['__NEXT_DATA__', '__PRELOADED_STATE__', '__INITIAL_STATE__',
'dispensaries', '__data', 'pageData', '__remixContext'];
const found: Record<string, any> = {};
for (const key of keys) {
if (w[key]) {
found[key] = typeof w[key] === 'object' ? Object.keys(w[key]) : typeof w[key];
}
}
return found;
});
console.log('Window data:', windowData);
// Get some page content
const bodyText = await page.evaluate(() => document.body.innerText.slice(0, 500));
console.log('\nPage text preview:', bodyText);
}
} finally {
await browser.close();
}
}
main().catch(console.error);
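
The debug script above reads `__NEXT_DATA__` inside the browser via `page.evaluate`. When the raw HTML is already available as a string, the same check can be sketched without Puppeteer. The helper names below are hypothetical, and the regex assumes the script tag uses a double-quoted `id` attribute, which is not guaranteed for every page.

```typescript
// Hypothetical helper: pulls the JSON payload out of the __NEXT_DATA__ script tag.
// Returns null when the tag is missing or its content is not valid JSON.
function parseNextData(html: string): Record<string, any> | null {
  const match = html.match(
    /<script[^>]*id="__NEXT_DATA__"[^>]*>([\s\S]*?)<\/script>/i
  );
  if (!match) {
    return null;
  }
  try {
    return JSON.parse(match[1]);
  } catch {
    return null;
  }
}

// Hypothetical helper: same heuristic the debug script uses for a Cloudflare interstitial.
function looksLikeCloudflareChallenge(title: string, bodyText: string): boolean {
  return title.includes('Just a moment') || bodyText.includes('Enable JavaScript');
}
```

Unlike the in-page version, this sketch returns `null` on a parse failure instead of an `{ error }` object, so callers only have one "no data" case to handle.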


@@ -0,0 +1,258 @@
/**
* Discover and Import Store Script
*
* Discovers a store from Dutchie by city+state and imports it into the dispensaries table.
* Uses the local API endpoints - does NOT make direct GraphQL calls.
*
* Usage:
* npx tsx src/scripts/discover-and-import-store.ts --city "Adelanto" --state "CA"
* npx tsx src/scripts/discover-and-import-store.ts --city "Phoenix" --state "AZ" --dry-run
* npx tsx src/scripts/discover-and-import-store.ts --city "Los Angeles" --state "CA" --all
*/
const API_BASE = process.env.API_BASE || 'http://localhost:3010';
interface DiscoveryResult {
cityId: string;
citySlug: string;
locationsFound: number;
locationsUpserted: number;
locationsNew: number;
locationsUpdated: number;
errors: string[];
durationMs: number;
}
interface DiscoveryLocation {
id: number;
name: string;
city: string;
stateCode: string;
platformSlug: string;
platformLocationId: string;
platformMenuUrl: string;
status: string;
}
interface Store {
id: number;
name: string;
slug: string;
city: string;
state: string;
menu_url: string;
platform_dispensary_id: string;
}
async function discoverCity(city: string, state: string): Promise<DiscoveryResult | null> {
const citySlug = city.toLowerCase().replace(/\s+/g, '-');
console.log(`\n[1/3] Discovering stores in ${city}, ${state}...`);
const response = await fetch(`${API_BASE}/api/discovery/admin/discover-city`, {
method: 'POST',
headers: { 'Content-Type': 'application/json' },
body: JSON.stringify({
citySlug,
stateCode: state,
countryCode: 'US'
})
});
if (!response.ok) {
const error = await response.text();
console.error(`Discovery failed: ${error}`);
return null;
}
const data = await response.json();
if (!data.success) {
console.error(`Discovery failed: ${JSON.stringify(data)}`);
return null;
}
console.log(` Found ${data.result.locationsFound} location(s)`);
console.log(` New: ${data.result.locationsNew}, Updated: ${data.result.locationsUpdated}`);
return data.result;
}
async function getDiscoveredLocations(state: string, city?: string): Promise<DiscoveryLocation[]> {
console.log(`\n[2/3] Fetching discovered locations for ${city || 'all cities'}, ${state}...`);
// The /api/discovery/locations endpoint currently has a bug, so this function is a stub:
// it returns an empty list and the caller falls back to a direct DB query.
// TODO: Fix the /api/discovery/locations endpoint
return [];
}
async function createStore(location: {
name: string;
slug: string;
city: string;
state: string;
menuUrl: string;
platformId: string;
}): Promise<Store | null> {
console.log(`\n[3/3] Creating store: ${location.name}...`);
const response = await fetch(`${API_BASE}/api/stores`, {
method: 'POST',
headers: { 'Content-Type': 'application/json' },
body: JSON.stringify({
name: location.name,
slug: location.slug,
city: location.city,
state: location.state,
menu_url: location.menuUrl,
menu_type: 'dutchie',
platform: 'dutchie',
platform_dispensary_id: location.platformId
})
});
if (!response.ok) {
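// The error body may not be JSON (e.g. a proxy error page), so fall back to an empty object
const error = await response.json().catch(() => ({}));
if (typeof error.error === 'string' && error.error.includes('already exists')) {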
console.log(` Store already exists (slug: ${location.slug})`);
return null;
}
console.error(` Failed to create store: ${JSON.stringify(error)}`);
return null;
}
const store = await response.json();
console.log(` Created store ID: ${store.id}`);
return store;
}
async function verifyStoreExists(city: string, state: string): Promise<Store[]> {
const response = await fetch(`${API_BASE}/api/stores?city=${encodeURIComponent(city)}&state=${encodeURIComponent(state)}`);
if (!response.ok) {
return [];
}
const data = await response.json();
return data.stores || [];
}
async function main() {
const args = process.argv.slice(2);
// Parse arguments
let city = '';
let state = '';
let dryRun = false;
let importAll = false;
for (let i = 0; i < args.length; i++) {
if (args[i] === '--city' && args[i + 1]) {
city = args[i + 1];
i++;
} else if (args[i] === '--state' && args[i + 1]) {
state = args[i + 1].toUpperCase();
i++;
} else if (args[i] === '--dry-run') {
dryRun = true;
} else if (args[i] === '--all') {
importAll = true;
}
}
if (!city || !state) {
console.log(`
Usage: npx tsx src/scripts/discover-and-import-store.ts --city "City Name" --state "ST"
Options:
--city City name (required)
--state State code, e.g., CA, AZ (required)
--dry-run Discover only, don't import
--all Import all discovered locations (default: first one only)
Examples:
npx tsx src/scripts/discover-and-import-store.ts --city "Adelanto" --state "CA"
npx tsx src/scripts/discover-and-import-store.ts --city "Phoenix" --state "AZ" --all
`);
process.exit(1);
}
console.log('='.repeat(60));
console.log(`STORE DISCOVERY & IMPORT`);
console.log(`City: ${city}, State: ${state}`);
console.log(`Mode: ${dryRun ? 'DRY RUN' : 'IMPORT'}`);
console.log('='.repeat(60));
// Step 1: Check if stores already exist
const existingStores = await verifyStoreExists(city, state);
if (existingStores.length > 0) {
console.log(`\nFound ${existingStores.length} existing store(s) in ${city}, ${state}:`);
existingStores.forEach(s => console.log(` - ${s.name} (ID: ${s.id})`));
if (!importAll) {
console.log('\nUse --all to discover and import additional stores.');
}
}
// Step 2: Discover from Dutchie
const discovery = await discoverCity(city, state);
if (!discovery) {
console.error('\nDiscovery failed. Exiting.');
process.exit(1);
}
if (discovery.locationsFound === 0) {
console.log('\nNo stores found in this city on Dutchie.');
process.exit(0);
}
if (dryRun) {
console.log('\n[DRY RUN] Would import stores. Run without --dry-run to import.');
process.exit(0);
}
// Step 3: The discovery endpoint already saved to dutchie_discovery_locations
// Now we need to query that table and create dispensary records
// Since the API has bugs, we'll provide instructions for manual import
console.log(`
Next steps to complete import:
1. Query the discovery location:
psql -c "SELECT id, name, platform_slug, platform_location_id, platform_menu_url
FROM dutchie_discovery_locations
WHERE name ILIKE '%${city}%'
ORDER BY id DESC LIMIT 5;"
2. Create the store via API:
curl -X POST ${API_BASE}/api/stores \\
-H "Content-Type: application/json" \\
-d '{
"name": "<NAME>",
"slug": "<PLATFORM_SLUG>",
"city": "${city}",
"state": "${state}",
"menu_url": "<PLATFORM_MENU_URL>",
"menu_type": "dutchie",
"platform": "dutchie",
"platform_dispensary_id": "<PLATFORM_LOCATION_ID>"
}'
3. Verify:
curl "${API_BASE}/api/stores?city=${encodeURIComponent(city)}&state=${state}"
`);
// Final verification
const finalStores = await verifyStoreExists(city, state);
console.log('\n' + '='.repeat(60));
console.log(`RESULT: ${finalStores.length} store(s) now in ${city}, ${state}`);
finalStores.forEach(s => console.log(` - ${s.name} (ID: ${s.id})`));
console.log('='.repeat(60));
}
main().catch(console.error);
export {};
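
The script above derives the city slug and the verification URL inline. A short sketch of both as standalone helpers, assuming the slug rule stays "lowercase, whitespace to hyphens" as in `discoverCity`. `toCitySlug` and `buildStoresUrl` are hypothetical names, and the added `trim()` is an extra the original does not perform.

```typescript
// Hypothetical helper: "Los Angeles" -> "los-angeles".
// The trim() is an addition; the original only lowercases and replaces whitespace.
function toCitySlug(city: string): string {
  return city.trim().toLowerCase().replace(/\s+/g, '-');
}

// Hypothetical helper: builds the /api/stores lookup URL with both
// query parameters encoded. URLSearchParams encodes spaces as "+".
function buildStoresUrl(apiBase: string, city: string, state: string): string {
  const params = new URLSearchParams({ city, state });
  return `${apiBase}/api/stores?${params.toString()}`;
}
```

Letting `URLSearchParams` do the encoding covers every parameter uniformly, so a new filter cannot be added without being escaped.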


@@ -0,0 +1,88 @@
/**
* Discover all Arizona dispensaries from Dutchie
* Uses the state/city HTML pages which contain __NEXT_DATA__ with full dispensary list
*/
import { fetchPage, extractNextData } from '../platforms/dutchie/client';
interface DutchieDispensary {
platform_dispensary_id: string;
name: string;
slug: string;
city: string;
state: string;
address: string;
zip: string;
}
async function discoverAZDispensaries() {
console.log('Discovering Arizona dispensaries from Dutchie...\n');
const allDispensaries: Map<string, DutchieDispensary> = new Map();
// Fetch the Arizona state page
console.log('Fetching /dispensaries/arizona...');
const stateResult = await fetchPage('/dispensaries/arizona');
if (!stateResult) {
console.error('Failed to fetch Arizona page');
return;
}
console.log(`Got ${stateResult.status} response, ${stateResult.html.length} characters`);
const nextData = extractNextData(stateResult.html);
if (!nextData) {
console.error('Failed to extract __NEXT_DATA__');
// Try to find dispensary links in HTML
const links = stateResult.html.match(/\/dispensary\/([a-z0-9-]+)/gi) || [];
console.log(`Found ${links.length} dispensary links in HTML`);
const uniqueSlugs = [...new Set(links.map(l => l.replace('/dispensary/', '')))];
console.log('Unique slugs:', uniqueSlugs.slice(0, 20));
return;
}
console.log('Extracted __NEXT_DATA__');
console.log('Keys:', Object.keys(nextData));
// The dispensary data is usually in props.pageProps
const pageProps = nextData?.props?.pageProps;
if (pageProps) {
console.log('pageProps keys:', Object.keys(pageProps));
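// Try various possible locations and take the first non-empty array
// (an empty array is truthy, so a plain || chain would stop at it)
const dispensaries = [
pageProps.dispensaries,
pageProps.nearbyDispensaries,
pageProps.filteredDispensaries,
pageProps.allDispensaries,
].find((list) => Array.isArray(list) && list.length > 0) || [];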
console.log(`Found ${dispensaries.length} dispensaries in pageProps`);
if (dispensaries.length > 0) {
console.log('Sample:', JSON.stringify(dispensaries[0], null, 2));
}
}
// Also look for dehydratedState (Apollo cache)
const dehydratedState = nextData?.props?.pageProps?.__APOLLO_STATE__;
if (dehydratedState) {
console.log('Found Apollo state');
const dispensaryKeys = Object.keys(dehydratedState).filter(k =>
k.startsWith('Dispensary:') || k.includes('dispensary')
);
console.log(`Found ${dispensaryKeys.length} dispensary entries`);
if (dispensaryKeys.length > 0) {
console.log('Sample key:', dispensaryKeys[0]);
console.log('Sample value:', JSON.stringify(dehydratedState[dispensaryKeys[0]], null, 2).slice(0, 500));
}
}
// Output the raw pageProps for analysis
if (pageProps) {
const fs = await import('fs');
fs.writeFileSync('/tmp/az-pageprops.json', JSON.stringify(pageProps, null, 2));
console.log('\nWrote pageProps to /tmp/az-pageprops.json');
}
}
discoverAZDispensaries().catch(console.error);
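
The script above declares an `allDispensaries` map but only logs what it finds. The sketch below shows one way the candidate `pageProps` lists could be merged into such a map with de-duplication. It assumes each entry exposes string `id` and `name` fields, which the diff does not confirm, and `collectDispensaries` is a hypothetical name. Unlike the script, it merges every candidate list rather than picking one.

```typescript
// ASSUMPTION: each dispensary entry has a string `id` and an optional `name`.
interface DispensaryLike {
  id: string;
  name: string;
}

// The pageProps keys the discovery script probes.
const CANDIDATE_KEYS = [
  'dispensaries',
  'nearbyDispensaries',
  'filteredDispensaries',
  'allDispensaries',
] as const;

// Hypothetical helper: merges all candidate lists, keeping the first entry seen per id.
function collectDispensaries(
  pageProps: Record<string, unknown>
): Map<string, DispensaryLike> {
  const byId = new Map<string, DispensaryLike>();
  for (const key of CANDIDATE_KEYS) {
    const list = pageProps[key];
    if (!Array.isArray(list)) {
      continue;
    }
    for (const entry of list) {
      if (entry && typeof entry.id === 'string' && !byId.has(entry.id)) {
        byId.set(entry.id, { id: entry.id, name: String(entry.name ?? '') });
      }
    }
  }
  return byId;
}
```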


@@ -1,86 +0,0 @@
#!/usr/bin/env npx tsx
/**
* Dutchie City Discovery CLI Runner
*
* Discovers cities from Dutchie's /cities page and upserts to dutchie_discovery_cities.
*
* Usage:
* npm run discovery:dutchie:cities
* npx tsx src/scripts/discovery-dutchie-cities.ts
*
* Environment:
* DATABASE_URL - PostgreSQL connection string (required)
*/
import { Pool } from 'pg';
import { DutchieCityDiscovery } from '../dutchie-az/discovery/DutchieCityDiscovery';
async function main() {
console.log('='.repeat(60));
console.log('DUTCHIE CITY DISCOVERY');
console.log('='.repeat(60));
// Get database URL from environment
const connectionString = process.env.DATABASE_URL;
if (!connectionString) {
console.error('ERROR: DATABASE_URL environment variable is required');
console.error('');
console.error('Usage:');
console.error(' DATABASE_URL="postgresql://..." npm run discovery:dutchie:cities');
process.exit(1);
}
// Create pool
const pool = new Pool({ connectionString });
try {
// Test connection
await pool.query('SELECT 1');
console.log('[CLI] Database connection established');
// Run discovery
const discovery = new DutchieCityDiscovery(pool);
const result = await discovery.run();
// Print summary
console.log('');
console.log('='.repeat(60));
console.log('DISCOVERY COMPLETE');
console.log('='.repeat(60));
console.log(`Cities found: ${result.citiesFound}`);
console.log(`Cities inserted: ${result.citiesInserted}`);
console.log(`Cities updated: ${result.citiesUpdated}`);
console.log(`Errors: ${result.errors.length}`);
console.log(`Duration: ${(result.durationMs / 1000).toFixed(1)}s`);
if (result.errors.length > 0) {
console.log('');
console.log('Errors:');
result.errors.forEach((e) => console.log(` - ${e}`));
}
// Show stats
console.log('');
console.log('Current Statistics:');
const stats = await discovery.getStats();
console.log(` Total cities: ${stats.total}`);
console.log(` Crawl enabled: ${stats.crawlEnabled}`);
console.log(` Never crawled: ${stats.neverCrawled}`);
console.log('');
console.log('By Country:');
stats.byCountry.forEach((c) => console.log(` ${c.countryCode}: ${c.count}`));
console.log('');
console.log('By State (top 10):');
stats.byState.slice(0, 10).forEach((s) => console.log(` ${s.stateCode} (${s.countryCode}): ${s.count}`));
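// Set the exit code instead of calling process.exit() so the finally block can still close the pool
process.exitCode = result.errors.length > 0 ? 1 : 0;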
} catch (error: any) {
console.error('FATAL ERROR:', error.message);
console.error(error.stack);
process.exit(1);
} finally {
await pool.end();
}
}
main();


@@ -1,189 +0,0 @@
#!/usr/bin/env npx tsx
/**
* Dutchie Location Discovery CLI Runner
*
* Discovers store locations for cities and upserts to dutchie_discovery_locations.
*
* Usage:
* npm run discovery:dutchie:locations -- --all-enabled
* npm run discovery:dutchie:locations -- --city-slug=phoenix
* npm run discovery:dutchie:locations -- --all-enabled --limit=10
*
* npx tsx src/scripts/discovery-dutchie-locations.ts --all-enabled
* npx tsx src/scripts/discovery-dutchie-locations.ts --city-slug=phoenix
*
* Options:
* --city-slug=<slug> Run for a single city by its slug
* --all-enabled Run for all cities where crawl_enabled = TRUE
* --limit=<n> Limit the number of cities to process
* --delay=<ms> Delay between cities in ms (default: 2000)
*
* Environment:
* DATABASE_URL - PostgreSQL connection string (required)
*/
import { Pool } from 'pg';
import { DutchieLocationDiscovery } from '../dutchie-az/discovery/DutchieLocationDiscovery';
// Parse command line arguments
function parseArgs(): {
citySlug: string | null;
allEnabled: boolean;
limit: number | undefined;
delay: number;
} {
const args = process.argv.slice(2);
let citySlug: string | null = null;
let allEnabled = false;
let limit: number | undefined = undefined;
let delay = 2000;
for (const arg of args) {
if (arg.startsWith('--city-slug=')) {
citySlug = arg.split('=')[1];
} else if (arg === '--all-enabled') {
allEnabled = true;
} else if (arg.startsWith('--limit=')) {
limit = parseInt(arg.split('=')[1], 10);
} else if (arg.startsWith('--delay=')) {
delay = parseInt(arg.split('=')[1], 10);
}
}
return { citySlug, allEnabled, limit, delay };
}
function printUsage() {
console.log(`
Dutchie Location Discovery CLI
Usage:
npx tsx src/scripts/discovery-dutchie-locations.ts [options]
Options:
--city-slug=<slug> Run for a single city by its slug
--all-enabled Run for all cities where crawl_enabled = TRUE
--limit=<n> Limit the number of cities to process
--delay=<ms> Delay between cities in ms (default: 2000)
Examples:
npx tsx src/scripts/discovery-dutchie-locations.ts --all-enabled
npx tsx src/scripts/discovery-dutchie-locations.ts --city-slug=phoenix
npx tsx src/scripts/discovery-dutchie-locations.ts --all-enabled --limit=5
Environment:
DATABASE_URL - PostgreSQL connection string (required)
`);
}
async function main() {
const { citySlug, allEnabled, limit, delay } = parseArgs();
if (!citySlug && !allEnabled) {
console.error('ERROR: Must specify either --city-slug=<slug> or --all-enabled');
printUsage();
process.exit(1);
}
console.log('='.repeat(60));
console.log('DUTCHIE LOCATION DISCOVERY');
console.log('='.repeat(60));
if (citySlug) {
console.log(`Mode: Single city (${citySlug})`);
} else {
console.log(`Mode: All enabled cities${limit ? ` (limit: ${limit})` : ''}`);
}
console.log(`Delay between cities: ${delay}ms`);
console.log('');
// Get database URL from environment
const connectionString = process.env.DATABASE_URL;
if (!connectionString) {
console.error('ERROR: DATABASE_URL environment variable is required');
console.error('');
console.error('Usage:');
console.error(' DATABASE_URL="postgresql://..." npx tsx src/scripts/discovery-dutchie-locations.ts --all-enabled');
process.exit(1);
}
// Create pool
const pool = new Pool({ connectionString });
try {
// Test connection
await pool.query('SELECT 1');
console.log('[CLI] Database connection established');
const discovery = new DutchieLocationDiscovery(pool);
if (citySlug) {
// Single city mode
const city = await discovery.getCityBySlug(citySlug);
if (!city) {
console.error(`ERROR: City not found: ${citySlug}`);
console.error('');
console.error('Make sure you have run city discovery first:');
console.error(' npm run discovery:dutchie:cities');
process.exit(1);
}
const result = await discovery.discoverForCity(city);
console.log('');
console.log('='.repeat(60));
console.log('DISCOVERY COMPLETE');
console.log('='.repeat(60));
console.log(`City: ${city.cityName}, ${city.stateCode}`);
console.log(`Locations found: ${result.locationsFound}`);
console.log(`Inserted: ${result.locationsInserted}`);
console.log(`Updated: ${result.locationsUpdated}`);
console.log(`Skipped (protected): ${result.locationsSkipped}`);
console.log(`Errors: ${result.errors.length}`);
console.log(`Duration: ${(result.durationMs / 1000).toFixed(1)}s`);
if (result.errors.length > 0) {
console.log('');
console.log('Errors:');
result.errors.forEach((e) => console.log(` - ${e}`));
}
process.exit(result.errors.length > 0 ? 1 : 0);
} else {
// All enabled cities mode
const result = await discovery.discoverAllEnabled({ limit, delayMs: delay });
console.log('');
console.log('='.repeat(60));
console.log('DISCOVERY COMPLETE');
console.log('='.repeat(60));
console.log(`Total cities processed: ${result.totalCities}`);
console.log(`Total locations found: ${result.totalLocationsFound}`);
console.log(`Total inserted: ${result.totalInserted}`);
console.log(`Total updated: ${result.totalUpdated}`);
console.log(`Total skipped: ${result.totalSkipped}`);
console.log(`Total errors: ${result.errors.length}`);
console.log(`Duration: ${(result.durationMs / 1000).toFixed(1)}s`);
if (result.errors.length > 0 && result.errors.length <= 20) {
console.log('');
console.log('Errors:');
result.errors.forEach((e) => console.log(` - ${e}`));
} else if (result.errors.length > 20) {
console.log('');
console.log(`First 20 of ${result.errors.length} errors:`);
result.errors.slice(0, 20).forEach((e) => console.log(` - ${e}`));
}
// Use exitCode rather than process.exit() so the finally block can still close the pool
process.exitCode = result.errors.length > 0 ? 1 : 0;
}
} catch (error: any) {
console.error('FATAL ERROR:', error.message);
console.error(error.stack);
// Use exitCode rather than process.exit() so the finally block can still close the pool
process.exitCode = 1;
} finally {
await pool.end();
}
}
main();
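
A note on the exit handling in CLIs shaped like the one above: `process.exit()` terminates the process immediately, so a `finally` block that awaits `pool.end()` does not run when exit is called from inside the `try`. The following is a minimal sketch of one alternative, assuming a hypothetical helper named `runWithCleanup` (not part of this repository) that returns an exit code and leaves the actual exit to the caller.

```typescript
// Hypothetical helper (illustrative only): runs a task, always awaits cleanup,
// and returns an exit code instead of calling process.exit() inside the try block.
type Task = () => Promise<number>; // resolves to the number of errors encountered

async function runWithCleanup(
  task: Task,
  cleanup: () => Promise<void>
): Promise<number> {
  try {
    const errorCount = await task();
    return errorCount > 0 ? 1 : 0;
  } catch (error) {
    const message = error instanceof Error ? error.message : String(error);
    console.error('FATAL ERROR:', message);
    return 1;
  } finally {
    // Runs on every path above, because nothing has terminated the process yet.
    await cleanup();
  }
}

// Possible usage in a CLI entry point (pool and discovery are assumed to exist):
//   runWithCleanup(
//     async () => (await discovery.discoverAllEnabled({ limit, delayMs: delay })).errors.length,
//     () => pool.end()
//   ).then((code) => { process.exitCode = code; });
```

Setting `process.exitCode` lets Node exit on its own once the event loop drains, which is what allows the pool to close cleanly.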

@@ -1,749 +0,0 @@
/**
* Legacy Data Import ETL Script
*
* DEPRECATED: This script assumed a two-database architecture.
*
* CURRENT ARCHITECTURE (Single Database):
* - All data lives in ONE database: cannaiq (cannaiq-postgres container)
* - Legacy tables exist INSIDE this same database with namespaced prefixes (e.g., legacy_*)
*
* If you need to import legacy data:
* 1. Import into namespaced tables (legacy_dispensaries, legacy_products, etc.)
* inside the main cannaiq database
* 2. Use the canonical connection from src/dutchie-az/db/connection.ts
*
* SAFETY RULES:
* - INSERT-ONLY: No UPDATE, no DELETE, no TRUNCATE
* - ON CONFLICT DO NOTHING: Skip duplicates, never overwrite
* - Batch Processing: 500-1000 rows per batch
* - Manual Invocation Only: Requires explicit user execution
*/
import { Pool, PoolClient } from 'pg';
// ============================================================
// CONFIGURATION
// ============================================================
const BATCH_SIZE = 500;
interface ETLConfig {
dryRun: boolean;
tables: string[];
}
interface ETLStats {
table: string;
read: number;
inserted: number;
skipped: number;
errors: number;
durationMs: number;
}
// Parse command line arguments
function parseArgs(): ETLConfig {
const args = process.argv.slice(2);
const config: ETLConfig = {
dryRun: false,
tables: ['dispensaries', 'products', 'dutchie_products', 'dutchie_product_snapshots'],
};
for (const arg of args) {
if (arg === '--dry-run') {
config.dryRun = true;
} else if (arg.startsWith('--tables=')) {
config.tables = arg.replace('--tables=', '').split(',');
}
}
return config;
}
// ============================================================
// DATABASE CONNECTIONS
// ============================================================
// DEPRECATED: Both pools point to the same database (cannaiq)
// Legacy tables exist inside the main database with namespaced prefixes
function createLegacyPool(): Pool {
return new Pool({
host: process.env.CANNAIQ_DB_HOST || 'localhost',
port: parseInt(process.env.CANNAIQ_DB_PORT || '54320'),
user: process.env.CANNAIQ_DB_USER || 'dutchie',
password: process.env.CANNAIQ_DB_PASS || 'dutchie_local_pass',
database: process.env.CANNAIQ_DB_NAME || 'cannaiq',
max: 5,
});
}
function createCannaiqPool(): Pool {
return new Pool({
host: process.env.CANNAIQ_DB_HOST || 'localhost',
port: parseInt(process.env.CANNAIQ_DB_PORT || '54320'),
user: process.env.CANNAIQ_DB_USER || 'dutchie',
password: process.env.CANNAIQ_DB_PASS || 'dutchie_local_pass',
database: process.env.CANNAIQ_DB_NAME || 'cannaiq',
max: 5,
});
}
// ============================================================
// STAGING TABLE CREATION
// ============================================================
const STAGING_TABLES_SQL = `
-- Staging table for legacy dispensaries
CREATE TABLE IF NOT EXISTS dispensaries_from_legacy (
id SERIAL PRIMARY KEY,
legacy_id INTEGER NOT NULL,
name VARCHAR(255) NOT NULL,
slug VARCHAR(255) NOT NULL,
city VARCHAR(100) NOT NULL,
state VARCHAR(10) NOT NULL,
postal_code VARCHAR(20),
address TEXT,
latitude DECIMAL(10,7),
longitude DECIMAL(10,7),
menu_url TEXT,
website TEXT,
legacy_metadata JSONB,
imported_at TIMESTAMPTZ DEFAULT NOW(),
UNIQUE(legacy_id)
);
-- Staging table for legacy products
CREATE TABLE IF NOT EXISTS products_from_legacy (
id SERIAL PRIMARY KEY,
legacy_product_id INTEGER NOT NULL,
legacy_dispensary_id INTEGER,
external_product_id VARCHAR(255),
name VARCHAR(500) NOT NULL,
brand_name VARCHAR(255),
type VARCHAR(100),
subcategory VARCHAR(100),
strain_type VARCHAR(50),
thc DECIMAL(10,4),
cbd DECIMAL(10,4),
price_cents INTEGER,
original_price_cents INTEGER,
stock_status VARCHAR(20),
weight VARCHAR(100),
primary_image_url TEXT,
first_seen_at TIMESTAMPTZ,
last_seen_at TIMESTAMPTZ,
legacy_raw_payload JSONB,
imported_at TIMESTAMPTZ DEFAULT NOW(),
UNIQUE(legacy_product_id)
);
-- Staging table for legacy price history
CREATE TABLE IF NOT EXISTS price_history_legacy (
id SERIAL PRIMARY KEY,
legacy_product_id INTEGER NOT NULL,
price_cents INTEGER,
recorded_at TIMESTAMPTZ,
imported_at TIMESTAMPTZ DEFAULT NOW()
);
-- Index for efficient lookups
CREATE INDEX IF NOT EXISTS idx_disp_legacy_slug ON dispensaries_from_legacy(slug, city, state);
CREATE INDEX IF NOT EXISTS idx_prod_legacy_ext_id ON products_from_legacy(external_product_id);
`;
async function createStagingTables(cannaiqPool: Pool, dryRun: boolean): Promise<void> {
console.log('[ETL] Creating staging tables...');
if (dryRun) {
console.log('[ETL] DRY RUN: Would create staging tables');
return;
}
const client = await cannaiqPool.connect();
try {
await client.query(STAGING_TABLES_SQL);
console.log('[ETL] Staging tables created successfully');
} finally {
client.release();
}
}
// ============================================================
// ETL FUNCTIONS
// ============================================================
async function importDispensaries(
legacyPool: Pool,
cannaiqPool: Pool,
dryRun: boolean
): Promise<ETLStats> {
const startTime = Date.now();
const stats: ETLStats = {
table: 'dispensaries',
read: 0,
inserted: 0,
skipped: 0,
errors: 0,
durationMs: 0,
};
console.log('[ETL] Importing dispensaries...');
const legacyClient = await legacyPool.connect();
const cannaiqClient = await cannaiqPool.connect();
try {
// Count total rows
const countResult = await legacyClient.query('SELECT COUNT(*) FROM dispensaries');
const totalRows = parseInt(countResult.rows[0].count);
console.log(`[ETL] Found ${totalRows} dispensaries in legacy database`);
// Process in batches
let offset = 0;
while (offset < totalRows) {
const batchResult = await legacyClient.query(`
SELECT
id, name, slug, city, state, zip, address,
latitude, longitude, menu_url, website, dba_name,
menu_provider, product_provider, provider_detection_data
FROM dispensaries
ORDER BY id
LIMIT $1 OFFSET $2
`, [BATCH_SIZE, offset]);
stats.read += batchResult.rows.length;
if (dryRun) {
console.log(`[ETL] DRY RUN: Would insert batch of ${batchResult.rows.length} dispensaries`);
stats.inserted += batchResult.rows.length;
} else {
for (const row of batchResult.rows) {
try {
const legacyMetadata = {
dba_name: row.dba_name,
menu_provider: row.menu_provider,
product_provider: row.product_provider,
provider_detection_data: row.provider_detection_data,
};
const insertResult = await cannaiqClient.query(`
INSERT INTO dispensaries_from_legacy
(legacy_id, name, slug, city, state, postal_code, address,
latitude, longitude, menu_url, website, legacy_metadata)
VALUES ($1, $2, $3, $4, $5, $6, $7, $8, $9, $10, $11, $12)
ON CONFLICT (legacy_id) DO NOTHING
RETURNING id
`, [
row.id,
row.name,
row.slug,
row.city,
row.state,
row.zip,
row.address,
row.latitude,
row.longitude,
row.menu_url,
row.website,
JSON.stringify(legacyMetadata),
]);
// rowCount can be null in recent pg typings, so default it to 0
if ((insertResult.rowCount ?? 0) > 0) {
stats.inserted++;
} else {
stats.skipped++;
}
} catch (err: any) {
stats.errors++;
console.error(`[ETL] Error inserting dispensary ${row.id}:`, err.message);
}
}
}
offset += BATCH_SIZE;
console.log(`[ETL] Processed ${Math.min(offset, totalRows)}/${totalRows} dispensaries`);
}
} finally {
legacyClient.release();
cannaiqClient.release();
}
stats.durationMs = Date.now() - startTime;
return stats;
}
async function importProducts(
legacyPool: Pool,
cannaiqPool: Pool,
dryRun: boolean
): Promise<ETLStats> {
const startTime = Date.now();
const stats: ETLStats = {
table: 'products',
read: 0,
inserted: 0,
skipped: 0,
errors: 0,
durationMs: 0,
};
console.log('[ETL] Importing legacy products...');
const legacyClient = await legacyPool.connect();
const cannaiqClient = await cannaiqPool.connect();
try {
const countResult = await legacyClient.query('SELECT COUNT(*) FROM products');
const totalRows = parseInt(countResult.rows[0].count);
console.log(`[ETL] Found ${totalRows} products in legacy database`);
let offset = 0;
while (offset < totalRows) {
const batchResult = await legacyClient.query(`
SELECT
id, dispensary_id, dutchie_product_id, name, brand,
subcategory, strain_type, thc_percentage, cbd_percentage,
price, original_price, in_stock, weight, image_url,
first_seen_at, last_seen_at, raw_data
FROM products
ORDER BY id
LIMIT $1 OFFSET $2
`, [BATCH_SIZE, offset]);
stats.read += batchResult.rows.length;
if (dryRun) {
console.log(`[ETL] DRY RUN: Would insert batch of ${batchResult.rows.length} products`);
stats.inserted += batchResult.rows.length;
} else {
for (const row of batchResult.rows) {
try {
const stockStatus = row.in_stock === true ? 'in_stock' :
row.in_stock === false ? 'out_of_stock' : 'unknown';
// Check for null explicitly so a legitimate price of 0 is not stored as NULL
const priceCents = row.price != null ? Math.round(parseFloat(row.price) * 100) : null;
const originalPriceCents = row.original_price != null ? Math.round(parseFloat(row.original_price) * 100) : null;
const insertResult = await cannaiqClient.query(`
INSERT INTO products_from_legacy
(legacy_product_id, legacy_dispensary_id, external_product_id,
name, brand_name, subcategory, strain_type, thc, cbd,
price_cents, original_price_cents, stock_status, weight,
primary_image_url, first_seen_at, last_seen_at, legacy_raw_payload)
VALUES ($1, $2, $3, $4, $5, $6, $7, $8, $9, $10, $11, $12, $13, $14, $15, $16, $17)
ON CONFLICT (legacy_product_id) DO NOTHING
RETURNING id
`, [
row.id,
row.dispensary_id,
row.dutchie_product_id,
row.name,
row.brand,
row.subcategory,
row.strain_type,
row.thc_percentage,
row.cbd_percentage,
priceCents,
originalPriceCents,
stockStatus,
row.weight,
row.image_url,
row.first_seen_at,
row.last_seen_at,
row.raw_data ? JSON.stringify(row.raw_data) : null,
]);
// rowCount can be null in recent pg typings, so default it to 0
if ((insertResult.rowCount ?? 0) > 0) {
stats.inserted++;
} else {
stats.skipped++;
}
} catch (err: any) {
stats.errors++;
console.error(`[ETL] Error inserting product ${row.id}:`, err.message);
}
}
}
offset += BATCH_SIZE;
console.log(`[ETL] Processed ${Math.min(offset, totalRows)}/${totalRows} products`);
}
} finally {
legacyClient.release();
cannaiqClient.release();
}
stats.durationMs = Date.now() - startTime;
return stats;
}
async function importDutchieProducts(
legacyPool: Pool,
cannaiqPool: Pool,
dryRun: boolean
): Promise<ETLStats> {
const startTime = Date.now();
const stats: ETLStats = {
table: 'dutchie_products',
read: 0,
inserted: 0,
skipped: 0,
errors: 0,
durationMs: 0,
};
console.log('[ETL] Importing dutchie_products...');
const legacyClient = await legacyPool.connect();
const cannaiqClient = await cannaiqPool.connect();
try {
const countResult = await legacyClient.query('SELECT COUNT(*) FROM dutchie_products');
const totalRows = parseInt(countResult.rows[0].count);
console.log(`[ETL] Found ${totalRows} dutchie_products in legacy database`);
// Note: For dutchie_products, we need to map dispensary_id to the canonical dispensary
// This requires the dispensaries to be imported first
// For now, we'll insert directly since the schema is nearly identical
let offset = 0;
while (offset < totalRows) {
const batchResult = await legacyClient.query(`
SELECT *
FROM dutchie_products
ORDER BY id
LIMIT $1 OFFSET $2
`, [BATCH_SIZE, offset]);
stats.read += batchResult.rows.length;
if (dryRun) {
console.log(`[ETL] DRY RUN: Would insert batch of ${batchResult.rows.length} dutchie_products`);
stats.inserted += batchResult.rows.length;
} else {
// For each row, attempt insert with ON CONFLICT DO NOTHING
for (const row of batchResult.rows) {
try {
// Check if dispensary exists in canonical table
const dispCheck = await cannaiqClient.query(`
SELECT id FROM dispensaries WHERE id = $1
`, [row.dispensary_id]);
if (dispCheck.rows.length === 0) {
stats.skipped++;
continue; // Skip products for dispensaries not yet imported
}
const insertResult = await cannaiqClient.query(`
INSERT INTO dutchie_products
(dispensary_id, platform, external_product_id, platform_dispensary_id,
c_name, name, brand_name, brand_id, brand_logo_url,
type, subcategory, strain_type, provider,
thc, thc_content, cbd, cbd_content, cannabinoids_v2, effects,
status, medical_only, rec_only, featured, coming_soon,
certificate_of_analysis_enabled,
is_below_threshold, is_below_kiosk_threshold,
options_below_threshold, options_below_kiosk_threshold,
stock_status, total_quantity_available,
primary_image_url, images, measurements, weight, past_c_names,
created_at_dutchie, updated_at_dutchie, latest_raw_payload)
VALUES ($1, $2, $3, $4, $5, $6, $7, $8, $9, $10, $11, $12, $13, $14, $15, $16, $17, $18, $19, $20, $21, $22, $23, $24, $25, $26, $27, $28, $29, $30, $31, $32, $33, $34, $35, $36, $37, $38, $39)
ON CONFLICT (dispensary_id, external_product_id) DO NOTHING
RETURNING id
`, [
row.dispensary_id,
row.platform || 'dutchie',
row.external_product_id,
row.platform_dispensary_id,
row.c_name,
row.name,
row.brand_name,
row.brand_id,
row.brand_logo_url,
row.type,
row.subcategory,
row.strain_type,
row.provider,
row.thc,
row.thc_content,
row.cbd,
row.cbd_content,
row.cannabinoids_v2,
row.effects,
row.status,
row.medical_only,
row.rec_only,
row.featured,
row.coming_soon,
row.certificate_of_analysis_enabled,
row.is_below_threshold,
row.is_below_kiosk_threshold,
row.options_below_threshold,
row.options_below_kiosk_threshold,
row.stock_status,
row.total_quantity_available,
row.primary_image_url,
row.images,
row.measurements,
row.weight,
row.past_c_names,
row.created_at_dutchie,
row.updated_at_dutchie,
row.latest_raw_payload,
]);
// rowCount can be null in recent pg typings, so default it to 0
if ((insertResult.rowCount ?? 0) > 0) {
stats.inserted++;
} else {
stats.skipped++;
}
} catch (err: any) {
stats.errors++;
if (stats.errors <= 5) {
console.error(`[ETL] Error inserting dutchie_product ${row.id}:`, err.message);
}
}
}
}
offset += BATCH_SIZE;
console.log(`[ETL] Processed ${Math.min(offset, totalRows)}/${totalRows} dutchie_products`);
}
} finally {
legacyClient.release();
cannaiqClient.release();
}
stats.durationMs = Date.now() - startTime;
return stats;
}
async function importDutchieSnapshots(
legacyPool: Pool,
cannaiqPool: Pool,
dryRun: boolean
): Promise<ETLStats> {
const startTime = Date.now();
const stats: ETLStats = {
table: 'dutchie_product_snapshots',
read: 0,
inserted: 0,
skipped: 0,
errors: 0,
durationMs: 0,
};
console.log('[ETL] Importing dutchie_product_snapshots...');
const legacyClient = await legacyPool.connect();
const cannaiqClient = await cannaiqPool.connect();
try {
const countResult = await legacyClient.query('SELECT COUNT(*) FROM dutchie_product_snapshots');
const totalRows = parseInt(countResult.rows[0].count);
console.log(`[ETL] Found ${totalRows} dutchie_product_snapshots in legacy database`);
// Build a lookup from (dispensary_id, external_product_id) to the canonical product ID
console.log('[ETL] Building product ID mapping...');
const mappingResult = await cannaiqClient.query(`
SELECT id, external_product_id, dispensary_id FROM dutchie_products
`);
// Create a key from dispensary_id + external_product_id
const productByKey = new Map<string, number>();
for (const row of mappingResult.rows) {
const key = `${row.dispensary_id}:${row.external_product_id}`;
productByKey.set(key, row.id);
}
let offset = 0;
while (offset < totalRows) {
const batchResult = await legacyClient.query(`
SELECT *
FROM dutchie_product_snapshots
ORDER BY id
LIMIT $1 OFFSET $2
`, [BATCH_SIZE, offset]);
stats.read += batchResult.rows.length;
if (dryRun) {
console.log(`[ETL] DRY RUN: Would insert batch of ${batchResult.rows.length} snapshots`);
stats.inserted += batchResult.rows.length;
} else {
for (const row of batchResult.rows) {
try {
// Map legacy product ID to canonical product ID
const key = `${row.dispensary_id}:${row.external_product_id}`;
const canonicalProductId = productByKey.get(key);
if (!canonicalProductId) {
stats.skipped++;
continue; // Skip snapshots for products not yet imported
}
// Insert snapshot (no conflict handling - all snapshots are historical)
await cannaiqClient.query(`
INSERT INTO dutchie_product_snapshots
(dutchie_product_id, dispensary_id, platform_dispensary_id,
external_product_id, pricing_type, crawl_mode,
status, featured, special, medical_only, rec_only,
is_present_in_feed, stock_status,
rec_min_price_cents, rec_max_price_cents, rec_min_special_price_cents,
med_min_price_cents, med_max_price_cents, med_min_special_price_cents,
wholesale_min_price_cents,
total_quantity_available, total_kiosk_quantity_available,
manual_inventory, is_below_threshold, is_below_kiosk_threshold,
options, raw_payload, crawled_at)
VALUES ($1, $2, $3, $4, $5, $6, $7, $8, $9, $10, $11, $12, $13, $14, $15, $16, $17, $18, $19, $20, $21, $22, $23, $24, $25, $26, $27, $28)
`, [
canonicalProductId,
row.dispensary_id,
row.platform_dispensary_id,
row.external_product_id,
row.pricing_type,
row.crawl_mode,
row.status,
row.featured,
row.special,
row.medical_only,
row.rec_only,
row.is_present_in_feed,
row.stock_status,
row.rec_min_price_cents,
row.rec_max_price_cents,
row.rec_min_special_price_cents,
row.med_min_price_cents,
row.med_max_price_cents,
row.med_min_special_price_cents,
row.wholesale_min_price_cents,
row.total_quantity_available,
row.total_kiosk_quantity_available,
row.manual_inventory,
row.is_below_threshold,
row.is_below_kiosk_threshold,
row.options,
row.raw_payload,
row.crawled_at,
]);
stats.inserted++;
} catch (err: any) {
stats.errors++;
if (stats.errors <= 5) {
console.error(`[ETL] Error inserting snapshot ${row.id}:`, err.message);
}
}
}
}
offset += BATCH_SIZE;
console.log(`[ETL] Processed ${Math.min(offset, totalRows)}/${totalRows} snapshots`);
}
} finally {
legacyClient.release();
cannaiqClient.release();
}
stats.durationMs = Date.now() - startTime;
return stats;
}
// ============================================================
// MAIN
// ============================================================
async function main(): Promise<void> {
console.log('='.repeat(60));
console.log('LEGACY DATA IMPORT ETL');
console.log('='.repeat(60));
const config = parseArgs();
console.log(`Mode: ${config.dryRun ? 'DRY RUN' : 'LIVE'}`);
console.log(`Tables: ${config.tables.join(', ')}`);
console.log('');
// Create connection pools
const legacyPool = createLegacyPool();
const cannaiqPool = createCannaiqPool();
try {
// Test connections
console.log('[ETL] Testing database connections...');
await legacyPool.query('SELECT 1');
console.log('[ETL] Legacy database connected');
await cannaiqPool.query('SELECT 1');
console.log('[ETL] CannaiQ database connected');
console.log('');
// Create staging tables
await createStagingTables(cannaiqPool, config.dryRun);
console.log('');
// Run imports
const allStats: ETLStats[] = [];
if (config.tables.includes('dispensaries')) {
const stats = await importDispensaries(legacyPool, cannaiqPool, config.dryRun);
allStats.push(stats);
console.log('');
}
if (config.tables.includes('products')) {
const stats = await importProducts(legacyPool, cannaiqPool, config.dryRun);
allStats.push(stats);
console.log('');
}
if (config.tables.includes('dutchie_products')) {
const stats = await importDutchieProducts(legacyPool, cannaiqPool, config.dryRun);
allStats.push(stats);
console.log('');
}
if (config.tables.includes('dutchie_product_snapshots')) {
const stats = await importDutchieSnapshots(legacyPool, cannaiqPool, config.dryRun);
allStats.push(stats);
console.log('');
}
// Print summary
console.log('='.repeat(60));
console.log('IMPORT SUMMARY');
console.log('='.repeat(60));
console.log('');
console.log('| Table | Read | Inserted | Skipped | Errors | Duration |');
console.log('|----------------------------|----------|----------|----------|----------|----------|');
for (const s of allStats) {
console.log(`| ${s.table.padEnd(26)} | ${String(s.read).padStart(8)} | ${String(s.inserted).padStart(8)} | ${String(s.skipped).padStart(8)} | ${String(s.errors).padStart(8)} | ${(s.durationMs / 1000).toFixed(1).padStart(7)}s |`);
}
console.log('');
const totalInserted = allStats.reduce((sum, s) => sum + s.inserted, 0);
const totalErrors = allStats.reduce((sum, s) => sum + s.errors, 0);
console.log(`Total inserted: ${totalInserted}`);
console.log(`Total errors: ${totalErrors}`);
if (config.dryRun) {
console.log('');
console.log('DRY RUN COMPLETE - No data was written');
console.log('Run without --dry-run to perform actual import');
}
} catch (error: any) {
console.error('[ETL] Fatal error:', error.message);
// Set the exit code and return so the finally block can still close both pools
process.exitCode = 1;
return;
} finally {
await legacyPool.end();
await cannaiqPool.end();
}
console.log('');
console.log('ETL complete');
}
main().catch((err) => {
console.error('Unhandled error:', err);
process.exit(1);
});
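
The product import above converts decimal prices to integer cents and maps a nullable boolean to a stock status string. Below is a minimal sketch of those two conversions as pure functions; the names `toCents` and `toStockStatus` are hypothetical and do not appear in the original script. One detail worth noting: a plain truthiness check treats a price of 0 as missing, so the sketch tests for null explicitly. It also accepts strings, since drivers such as `pg` typically return NUMERIC columns as strings.

```typescript
// Hypothetical helpers (names are illustrative, not from the original script).
type StockStatus = 'in_stock' | 'out_of_stock' | 'unknown';

// Converts a decimal price (string or number) to integer cents.
// Returns null for missing or unparseable input.
function toCents(price: string | number | null | undefined): number | null {
  if (price === null || price === undefined) return null;
  const value = typeof price === 'number' ? price : parseFloat(price);
  if (!Number.isFinite(value)) return null;
  return Math.round(value * 100);
}

// Maps a nullable in_stock flag to the three-state status used by the staging table.
function toStockStatus(inStock: boolean | null | undefined): StockStatus {
  if (inStock === true) return 'in_stock';
  if (inStock === false) return 'out_of_stock';
  return 'unknown';
}
```

Keeping these as pure functions makes the conversions testable without a database connection.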

@@ -0,0 +1,397 @@
/**
* Harmonize AZ Dispensaries with Dutchie Source of Truth
*
* This script:
* 1. Queries Dutchie ConsumerDispensaries API for all AZ cities
* 2. Matches our dispensaries by platform_dispensary_id
* 3. Updates existing records with full Dutchie data
* 4. Creates new records for dispensaries in Dutchie but not in our DB
* 5. Disables dispensaries not found in Dutchie
*
* Usage:
* npx tsx src/scripts/harmonize-az-dispensaries.ts
* npx tsx src/scripts/harmonize-az-dispensaries.ts --dry-run
* npx tsx src/scripts/harmonize-az-dispensaries.ts --state CA (note: only AZ has a city list configured; other states fetch nothing from Dutchie)
*/
import { Pool } from 'pg';
import { executeGraphQL, GRAPHQL_HASHES } from '../platforms/dutchie/client';
const pool = new Pool({
host: process.env.CANNAIQ_DB_HOST || 'localhost',
port: parseInt(process.env.CANNAIQ_DB_PORT || '54320'),
database: process.env.CANNAIQ_DB_NAME || 'dutchie_menus',
user: process.env.CANNAIQ_DB_USER || 'dutchie',
password: process.env.CANNAIQ_DB_PASS || 'dutchie_local_pass',
});
interface Dispensary {
id: number;
name: string;
slug: string;
city: string;
state: string;
platform_dispensary_id: string | null;
dutchie_verified: boolean;
crawl_enabled: boolean;
}
interface DutchieDispensary {
id: string; // Platform ID like "deHiuKKmBHGJKXzuj"
cName: string; // Slug like "the-downtown-dispensary"
name: string;
phone: string | null;
address: string;
description: string | null;
status: string;
chain: string | null;
timezone: string;
location: {
ln1: string;
ln2: string;
city: string;
state: string;
country: string;
zipcode: string;
geometry: {
coordinates: [number, number];
};
};
deliveryHours: any;
pickupHours: any;
offerDelivery: boolean;
offerPickup: boolean;
offerCurbsidePickup: boolean;
isMedical: boolean;
isRecreational: boolean;
}
interface HarmonizationResult {
updated: number;
created: number;
disabled: number;
skipped: number;
errors: string[];
}
// Cities to query for AZ (from statesWithDispensaries)
const AZ_CITIES = [
'Apache Junction', 'Bisbee', 'Bullhead City', 'Casa Grande', 'Chandler',
'Cottonwood', 'El Mirage', 'Flagstaff', 'Florence', 'Gilbert', 'Glendale',
'Globe', 'Goodyear', 'Kingman', 'Lake Havasu City', 'Maricopa', 'Mesa',
'Peoria', 'Phoenix', 'Prescott', 'Prescott Valley', 'Queen Creek',
'Scottsdale', 'Show Low', 'Sierra Vista', 'Snowflake', 'Sun City',
'Surprise', 'Tempe', 'Tolleson', 'Tucson', 'Yuma'
];
async function getDispensaries(state: string): Promise<Dispensary[]> {
const result = await pool.query<Dispensary>(
`SELECT id, name, slug, city, state, platform_dispensary_id,
COALESCE(dutchie_verified, false) as dutchie_verified,
COALESCE(crawl_enabled, true) as crawl_enabled
FROM dispensaries
WHERE state = $1
ORDER BY id`,
[state]
);
return result.rows;
}
async function fetchDutchieDispensariesByCity(
city: string,
state: string
): Promise<DutchieDispensary[]> {
const allDispensaries: DutchieDispensary[] = [];
let page = 0;
const perPage = 100;
while (true) {
const variables = {
dispensaryFilter: {
activeOnly: true,
city,
state,
},
page,
perPage,
};
const result = await executeGraphQL(
'ConsumerDispensaries',
variables,
GRAPHQL_HASHES.ConsumerDispensaries,
{ cName: `${city.toLowerCase().replace(/\s+/g, '-')}-${state.toLowerCase()}`, maxRetries: 2, retryOn403: true }
);
const dispensaries = result?.data?.filteredDispensaries || [];
allDispensaries.push(...dispensaries);
if (dispensaries.length < perPage) break;
page++;
// Rate limit
await new Promise(resolve => setTimeout(resolve, 200));
}
return allDispensaries;
}
async function fetchAllDutchieDispensaries(state: string): Promise<Map<string, DutchieDispensary>> {
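const cities = state === 'AZ' ? AZ_CITIES : [];
if (cities.length === 0) {
// Without a city list the Dutchie result would be empty, and harmonization
// would then disable every dispensary in the state. Fail fast instead.
throw new Error(`No city list configured for state ${state}`);
}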
const dispensaryMap = new Map<string, DutchieDispensary>();
console.log(`Fetching dispensaries from ${cities.length} cities...`);
for (const city of cities) {
const dispensaries = await fetchDutchieDispensariesByCity(city, state);
console.log(` ${city}: ${dispensaries.length} dispensaries`);
for (const d of dispensaries) {
// Index by platform ID
if (d.id && !dispensaryMap.has(d.id)) {
dispensaryMap.set(d.id, d);
}
}
// Rate limit between cities
await new Promise(resolve => setTimeout(resolve, 300));
}
console.log(`Total unique dispensaries from Dutchie: ${dispensaryMap.size}\n`);
return dispensaryMap;
}
async function updateDispensary(
dispensaryId: number,
dutchie: DutchieDispensary,
dryRun: boolean
): Promise<void> {
if (dryRun) return;
const menuUrl = `https://dutchie.com/dispensary/${dutchie.cName}`;
await pool.query(
`UPDATE dispensaries
SET name = $2,
slug = $3,
address = $4,
city = $5,
postal_code = $6,
phone = $7,
latitude = $8,
longitude = $9,
menu_url = $10,
menu_type = 'dutchie',
platform = 'dutchie',
is_delivery = $11,
is_pickup = $12,
dutchie_verified = true,
dutchie_verified_at = NOW(),
crawl_enabled = true,
updated_at = NOW()
WHERE id = $1`,
[
dispensaryId,
dutchie.name.trim(),
dutchie.cName,
dutchie.location?.ln1 || dutchie.address,
dutchie.location?.city || '',
dutchie.location?.zipcode || '',
dutchie.phone,
dutchie.location?.geometry?.coordinates?.[1] ?? null,
dutchie.location?.geometry?.coordinates?.[0] ?? null,
menuUrl,
dutchie.offerDelivery ?? false,
dutchie.offerPickup ?? true,
]
);
}
async function createDispensary(
dutchie: DutchieDispensary,
state: string,
dryRun: boolean
): Promise<number | null> {
if (dryRun) return null;
const menuUrl = `https://dutchie.com/dispensary/${dutchie.cName}`;
const result = await pool.query<{ id: number }>(
`INSERT INTO dispensaries (
name, slug, city, state, platform, platform_dispensary_id,
menu_url, menu_type, address, postal_code, latitude, longitude,
is_delivery, is_pickup, phone,
dutchie_verified, dutchie_verified_at,
crawl_enabled,
created_at, updated_at
) VALUES (
$1, $2, $3, $4, 'dutchie', $5,
$6, 'dutchie', $7, $8, $9, $10,
$11, $12, $13,
true, NOW(),
true,
NOW(), NOW()
)
ON CONFLICT (slug) DO UPDATE SET
platform_dispensary_id = EXCLUDED.platform_dispensary_id,
name = EXCLUDED.name,
menu_url = EXCLUDED.menu_url,
address = EXCLUDED.address,
postal_code = EXCLUDED.postal_code,
latitude = EXCLUDED.latitude,
longitude = EXCLUDED.longitude,
is_delivery = EXCLUDED.is_delivery,
is_pickup = EXCLUDED.is_pickup,
phone = EXCLUDED.phone,
dutchie_verified = true,
dutchie_verified_at = NOW(),
crawl_enabled = true,
updated_at = NOW()
RETURNING id`,
[
dutchie.name.trim(),
dutchie.cName,
dutchie.location?.city || '',
state,
dutchie.id,
menuUrl,
dutchie.location?.ln1 || dutchie.address,
dutchie.location?.zipcode || '',
dutchie.location?.geometry?.coordinates?.[1] ?? null,
dutchie.location?.geometry?.coordinates?.[0] ?? null,
dutchie.offerDelivery ?? false,
dutchie.offerPickup ?? true,
dutchie.phone,
]
);
return result.rows[0]?.id || null;
}
async function disableDispensary(dispensaryId: number, reason: string, dryRun: boolean): Promise<void> {
if (dryRun) return;
await pool.query(
`UPDATE dispensaries
SET crawl_enabled = false,
failure_notes = $2,
updated_at = NOW()
WHERE id = $1`,
[dispensaryId, reason]
);
}
async function harmonizeDispensaries(
state: string,
dryRun: boolean = false
): Promise<HarmonizationResult> {
console.log(`\n${'='.repeat(60)}`);
console.log(`HARMONIZING ${state} DISPENSARIES${dryRun ? ' (DRY RUN)' : ''}`);
console.log(`${'='.repeat(60)}\n`);
const result: HarmonizationResult = {
updated: 0,
created: 0,
disabled: 0,
skipped: 0,
errors: [],
};
// Fetch all dispensaries from Dutchie (source of truth)
const dutchieMap = await fetchAllDutchieDispensaries(state);
// Get our current dispensaries
const dispensaries = await getDispensaries(state);
console.log(`Found ${dispensaries.length} dispensaries in our DB\n`);
// Track which Dutchie dispensaries we've matched
const matchedDutchieIds = new Set<string>();
// Step 1: Match our dispensaries to Dutchie by platform_dispensary_id
console.log('[Step 1/2] Matching existing dispensaries to Dutchie...');
for (const disp of dispensaries) {
if (disp.platform_dispensary_id && dutchieMap.has(disp.platform_dispensary_id)) {
// Found match - update with Dutchie data
const dutchie = dutchieMap.get(disp.platform_dispensary_id)!;
// Mark as matched before updating so a failed update is not retried as a create in Step 2
matchedDutchieIds.add(disp.platform_dispensary_id);
try {
await updateDispensary(disp.id, dutchie, dryRun);
console.log(` [UPDATED] ${disp.name} -> ${dutchie.name} (${dutchie.cName})`);
result.updated++;
} catch (error: any) {
console.error(` [ERROR] ${disp.name}: ${error.message}`);
result.errors.push(`Update ${disp.name}: ${error.message}`);
}
} else if (disp.platform_dispensary_id) {
// Has platform ID but not found in Dutchie - maybe closed?
console.log(` [NOT FOUND] ${disp.name} (${disp.platform_dispensary_id}) - not in Dutchie`);
await disableDispensary(disp.id, 'Platform ID not found in Dutchie - may be closed', dryRun);
result.disabled++;
} else {
// No platform ID - disable
console.log(` [NO ID] ${disp.name} - no platform_dispensary_id`);
await disableDispensary(disp.id, 'No platform_dispensary_id', dryRun);
result.disabled++;
}
}
// Step 2: Create new dispensaries for Dutchie records we don't have
console.log(`\n[Step 2/2] Creating new dispensaries from Dutchie...`);
for (const [platformId, dutchie] of dutchieMap) {
if (matchedDutchieIds.has(platformId)) {
continue; // Already matched
}
try {
const newId = await createDispensary(dutchie, state, dryRun);
console.log(` [CREATED] ${dutchie.name} (${dutchie.cName}) -> ID ${newId || '(dry-run)'}`);
result.created++;
} catch (error: any) {
console.error(` [ERROR] ${dutchie.name}: ${error.message}`);
result.errors.push(`Create ${dutchie.name}: ${error.message}`);
}
}
// Summary
console.log(`\n${'='.repeat(60)}`);
console.log('HARMONIZATION SUMMARY');
console.log(`${'='.repeat(60)}`);
console.log(` Updated (matched to Dutchie): ${result.updated}`);
console.log(` Created (new from Dutchie): ${result.created}`);
console.log(` Disabled (not in Dutchie): ${result.disabled}`);
console.log(` Errors: ${result.errors.length}`);
if (result.errors.length > 0) {
console.log(`\nErrors:`);
result.errors.slice(0, 20).forEach(e => console.log(` - ${e}`));
if (result.errors.length > 20) {
console.log(` ... and ${result.errors.length - 20} more`);
}
}
return result;
}
async function main() {
const args = process.argv.slice(2);
let state = 'AZ';
let dryRun = false;
for (let i = 0; i < args.length; i++) {
if (args[i] === '--state' && args[i + 1]) {
state = args[i + 1].toUpperCase();
i++;
} else if (args[i] === '--dry-run') {
dryRun = true;
}
}
try {
await harmonizeDispensaries(state, dryRun);
} finally {
await pool.end();
}
}
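main().catch((err) => {
console.error(err);
// Report failure to the caller instead of exiting with code 0
process.exitCode = 1;
});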
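
The matching rules in the harmonization script (update on a platform ID match, disable when the ID is missing or unknown upstream, create for unmatched upstream records) can be sketched as a pure planning function. This is a simplified illustration, not the script's actual structure: the types and the name `planHarmonization` are hypothetical, and the guard against an empty upstream result is an added assumption intended to avoid disabling every local record after a failed fetch.

```typescript
// Hypothetical, simplified types (illustrative only).
interface LocalRecord { id: number; name: string; platformId: string | null; }
interface RemoteRecord { id: string; name: string; }
interface HarmonizationPlan {
  update: Array<{ localId: number; remote: RemoteRecord }>;
  create: RemoteRecord[];
  disable: Array<{ localId: number; reason: string }>;
}

function planHarmonization(
  local: LocalRecord[],
  remoteById: Map<string, RemoteRecord>
): HarmonizationPlan {
  // Assumption: an empty upstream result signals a fetch problem, not mass closures.
  if (remoteById.size === 0) {
    throw new Error('Upstream result is empty; refusing to plan');
  }
  const plan: HarmonizationPlan = { update: [], create: [], disable: [] };
  const matched = new Set<string>();

  for (const rec of local) {
    const remote = rec.platformId ? remoteById.get(rec.platformId) : undefined;
    if (remote) {
      plan.update.push({ localId: rec.id, remote });
      matched.add(remote.id);
    } else {
      const reason = rec.platformId ? 'Platform ID not found upstream' : 'No platform ID';
      plan.disable.push({ localId: rec.id, reason });
    }
  }

  // Anything upstream that no local record matched becomes a create.
  remoteById.forEach((remote, id) => {
    if (!matched.has(id)) plan.create.push(remote);
  });
  return plan;
}
```

Separating the plan from the database writes also gives `--dry-run` a natural shape: compute the plan, print it, and skip the writes.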

@@ -1,583 +0,0 @@
#!/usr/bin/env npx tsx
/**
* Queue Intelligence Script
*
* Orchestrates the multi-category intelligence crawler system:
* 1. Queue dispensaries that need provider detection (all 4 categories)
* 2. Queue per-category production crawls (Dutchie products only for now)
* 3. Queue per-category sandbox crawls (all providers)
*
* Each category (product, specials, brand, metadata) is handled independently.
* A failure in one category does NOT affect other categories.
*
* Usage:
* npx tsx src/scripts/queue-intelligence.ts [--detection] [--production] [--sandbox] [--all]
* npx tsx src/scripts/queue-intelligence.ts --category=product --sandbox
* npx tsx src/scripts/queue-intelligence.ts --process --category=product
* npx tsx src/scripts/queue-intelligence.ts --dry-run
*/
import { pool } from '../db/pool';
import { logger } from '../services/logger';
import {
detectMultiCategoryProviders,
updateAllCategoryProviders,
IntelligenceCategory,
} from '../services/intelligence-detector';
import {
runCrawlProductsJob,
runCrawlSpecialsJob,
runCrawlBrandIntelligenceJob,
runCrawlMetadataJob,
runSandboxProductsJob,
runSandboxSpecialsJob,
runSandboxBrandJob,
runSandboxMetadataJob,
runAllCategoryProductionCrawls,
runAllCategorySandboxCrawls,
processCategorySandboxJobs,
} from '../services/category-crawler-jobs';
// Parse command line args
const args = process.argv.slice(2);
const flags = {
detection: args.includes('--detection') || args.includes('--all'),
production: args.includes('--production') || args.includes('--all'),
sandbox: args.includes('--sandbox') || args.includes('--all'),
dryRun: args.includes('--dry-run'),
process: args.includes('--process'),
help: args.includes('--help') || args.includes('-h'),
limit: parseInt(args.find(a => a.startsWith('--limit='))?.split('=')[1] || '10', 10) || 10, // fall back to 10 on non-numeric input
category: args.find(a => a.startsWith('--category='))?.split('=')[1] as IntelligenceCategory | undefined,
dispensary: parseInt(args.find(a => a.startsWith('--dispensary='))?.split('=')[1] || '0', 10) || 0, // 0 means "no dispensary filter"
};
// If no specific flags, default to all
if (!flags.detection && !flags.production && !flags.sandbox && !flags.process) {
flags.detection = true;
flags.production = true;
flags.sandbox = true;
}
const CATEGORIES: IntelligenceCategory[] = ['product', 'specials', 'brand', 'metadata'];
async function showHelp() {
console.log(`
Queue Intelligence - Multi-Category Crawler Orchestration
USAGE:
npx tsx src/scripts/queue-intelligence.ts [OPTIONS]
OPTIONS:
--detection Queue dispensaries that need multi-category detection
--production Queue per-category production crawls
--sandbox Queue per-category sandbox crawls
--all Queue all job types (default if no specific flag)
--process Process queued jobs instead of just queuing
--category=CATEGORY Filter to specific category (product|specials|brand|metadata)
--dispensary=ID Process only a specific dispensary
--dry-run Show what would be queued without making changes
--limit=N Maximum dispensaries to queue per type (default: 10)
--help, -h Show this help message
CATEGORIES:
product - Product/menu data (Dutchie=production, others=sandbox)
specials - Deals and specials (all sandbox for now)
brand - Brand intelligence (all sandbox for now)
metadata - Categories/taxonomy (all sandbox for now)
EXAMPLES:
# Queue all dispensaries for appropriate jobs
npx tsx src/scripts/queue-intelligence.ts
# Only queue product detection jobs
npx tsx src/scripts/queue-intelligence.ts --detection --category=product
# Process sandbox jobs for specials category
npx tsx src/scripts/queue-intelligence.ts --process --category=specials --limit=5
# Run full detection for a specific dispensary
npx tsx src/scripts/queue-intelligence.ts --process --detection --dispensary=123
# Dry run to see what would be queued
npx tsx src/scripts/queue-intelligence.ts --dry-run
`);
}
async function queueMultiCategoryDetection(): Promise<number> {
console.log('\n📡 Queueing Multi-Category Detection Jobs...');
// Find dispensaries that need provider detection for any category:
// - the dispensary has a website or menu URL, AND
// - any *_provider is NULL OR any *_confidence < 70
const query = `
SELECT id, name, website, menu_url,
product_provider, product_confidence, product_crawler_mode,
specials_provider, specials_confidence, specials_crawler_mode,
brand_provider, brand_confidence, brand_crawler_mode,
metadata_provider, metadata_confidence, metadata_crawler_mode
FROM dispensaries
WHERE (website IS NOT NULL OR menu_url IS NOT NULL)
AND (
product_provider IS NULL OR product_confidence < 70 OR
specials_provider IS NULL OR specials_confidence < 70 OR
brand_provider IS NULL OR brand_confidence < 70 OR
metadata_provider IS NULL OR metadata_confidence < 70
)
ORDER BY
CASE WHEN product_provider IS NULL THEN 0 ELSE 1 END,
product_confidence ASC
LIMIT $1
`;
const result = await pool.query(query, [flags.limit]);
if (flags.dryRun) {
console.log(` Would queue ${result.rows.length} dispensaries for multi-category detection:`);
for (const row of result.rows) {
const needsDetection: string[] = [];
if (!row.product_provider || row.product_confidence < 70) needsDetection.push('product');
if (!row.specials_provider || row.specials_confidence < 70) needsDetection.push('specials');
if (!row.brand_provider || row.brand_confidence < 70) needsDetection.push('brand');
if (!row.metadata_provider || row.metadata_confidence < 70) needsDetection.push('metadata');
console.log(` - [${row.id}] ${row.name} (needs: ${needsDetection.join(', ')})`);
}
return result.rows.length;
}
let queued = 0;
for (const dispensary of result.rows) {
try {
// Create detection jobs for each category that needs it
for (const category of CATEGORIES) {
const provider = dispensary[`${category}_provider`];
const confidence = dispensary[`${category}_confidence`];
if (!provider || confidence < 70) {
await pool.query(
`INSERT INTO sandbox_crawl_jobs (dispensary_id, category, job_type, status, priority)
VALUES ($1, $2, 'detection', 'pending', 10)
ON CONFLICT DO NOTHING`,
[dispensary.id, category]
);
}
}
console.log(` ✓ Queued detection: [${dispensary.id}] ${dispensary.name}`);
queued++;
} catch (error: any) {
console.error(` ✗ Failed to queue [${dispensary.id}]: ${error.message}`);
}
}
return queued;
}
async function queueCategoryProductionCrawls(category?: IntelligenceCategory): Promise<number> {
const categories = category ? [category] : CATEGORIES;
let totalQueued = 0;
for (const cat of categories) {
console.log(`\n🏭 Queueing Production ${cat.toUpperCase()} Crawls...`);
// For now, only products have production-ready crawlers (Dutchie only)
if (cat !== 'product') {
console.log(` ⏭️ No production crawler for ${cat} yet - skipping`);
continue;
}
// Find dispensaries ready for production crawl
const query = `
SELECT id, name, ${cat}_provider as provider, last_${cat}_scan_at as last_scan
FROM dispensaries
WHERE ${cat}_provider = 'dutchie'
AND ${cat}_crawler_mode = 'production'
AND ${cat}_confidence >= 70
AND (last_${cat}_scan_at IS NULL OR last_${cat}_scan_at < NOW() - INTERVAL '4 hours')
ORDER BY
CASE WHEN last_${cat}_scan_at IS NULL THEN 0 ELSE 1 END,
last_${cat}_scan_at ASC
LIMIT $1
`;
const result = await pool.query(query, [flags.limit]);
if (flags.dryRun) {
console.log(` Would queue ${result.rows.length} dispensaries for ${cat} production crawl:`);
for (const row of result.rows) {
const lastScan = row.last_scan ? new Date(row.last_scan).toISOString() : 'never';
console.log(` - [${row.id}] ${row.name} (provider: ${row.provider}, last: ${lastScan})`);
}
totalQueued += result.rows.length;
continue;
}
for (const dispensary of result.rows) {
try {
// For products, use the existing crawl_jobs table for production
await pool.query(
`INSERT INTO crawl_jobs (store_id, job_type, trigger_type, status, priority, metadata)
SELECT s.id, 'full_crawl', 'scheduled', 'pending', 50,
jsonb_build_object('dispensary_id', $1, 'category', $2, 'source', 'queue-intelligence')
FROM stores s
JOIN dispensaries d ON (d.menu_url = s.dutchie_url OR d.name ILIKE '%' || s.name || '%')
WHERE d.id = $1
LIMIT 1`,
[dispensary.id, cat]
);
console.log(` ✓ Queued ${cat} production: [${dispensary.id}] ${dispensary.name}`);
totalQueued++;
} catch (error: any) {
console.error(` ✗ Failed to queue [${dispensary.id}]: ${error.message}`);
}
}
}
return totalQueued;
}
async function queueCategorySandboxCrawls(category?: IntelligenceCategory): Promise<number> {
const categories = category ? [category] : CATEGORIES;
let totalQueued = 0;
for (const cat of categories) {
console.log(`\n🧪 Queueing Sandbox ${cat.toUpperCase()} Crawls...`);
// Find dispensaries in sandbox mode for this category
const query = `
SELECT d.id, d.name, d.${cat}_provider as provider, d.${cat}_confidence as confidence,
d.website, d.menu_url
FROM dispensaries d
WHERE d.${cat}_crawler_mode = 'sandbox'
AND d.${cat}_provider IS NOT NULL
AND (d.website IS NOT NULL OR d.menu_url IS NOT NULL)
AND NOT EXISTS (
SELECT 1 FROM sandbox_crawl_jobs sj
WHERE sj.dispensary_id = d.id
AND sj.category = $1
AND sj.status IN ('pending', 'running')
)
ORDER BY d.${cat}_confidence DESC, d.updated_at ASC
LIMIT $2
`;
const result = await pool.query(query, [cat, flags.limit]);
if (flags.dryRun) {
console.log(` Would queue ${result.rows.length} dispensaries for ${cat} sandbox crawl:`);
for (const row of result.rows) {
console.log(` - [${row.id}] ${row.name} (provider: ${row.provider}, confidence: ${row.confidence}%)`);
}
totalQueued += result.rows.length;
continue;
}
for (const dispensary of result.rows) {
try {
// Create sandbox entry if needed
const sandboxResult = await pool.query(
`INSERT INTO crawler_sandboxes (dispensary_id, category, suspected_menu_provider, mode, status)
VALUES ($1, $2, $3, 'template_learning', 'pending')
ON CONFLICT (dispensary_id, category) WHERE status NOT IN ('moved_to_production', 'failed')
DO UPDATE SET updated_at = NOW()
RETURNING id`,
[dispensary.id, cat, dispensary.provider]
);
const sandboxId = sandboxResult.rows[0]?.id;
// Create sandbox job
await pool.query(
`INSERT INTO sandbox_crawl_jobs (dispensary_id, sandbox_id, category, job_type, status, priority)
VALUES ($1, $2, $3, 'crawl', 'pending', 5)`,
[dispensary.id, sandboxId, cat]
);
console.log(` ✓ Queued ${cat} sandbox: [${dispensary.id}] ${dispensary.name} (${dispensary.provider})`);
totalQueued++;
} catch (error: any) {
console.error(` ✗ Failed to queue [${dispensary.id}]: ${error.message}`);
}
}
}
return totalQueued;
}
async function processDetectionJobs(): Promise<void> {
console.log('\n🔍 Processing Detection Jobs...');
// Get pending detection jobs
const jobs = await pool.query(
`SELECT DISTINCT dispensary_id
FROM sandbox_crawl_jobs
WHERE job_type = 'detection' AND status = 'pending'
${flags.category ? `AND category = $2` : ''}
${flags.dispensary ? `AND dispensary_id = $${flags.category ? '3' : '2'}` : ''}
LIMIT $1`,
flags.category
? (flags.dispensary ? [flags.limit, flags.category, flags.dispensary] : [flags.limit, flags.category])
: (flags.dispensary ? [flags.limit, flags.dispensary] : [flags.limit])
);
for (const job of jobs.rows) {
console.log(`\nProcessing detection for dispensary ${job.dispensary_id}...`);
try {
// Get dispensary info
const dispResult = await pool.query(
'SELECT id, name, website, menu_url FROM dispensaries WHERE id = $1',
[job.dispensary_id]
);
const dispensary = dispResult.rows[0];
if (!dispensary) {
console.log(` ✗ Dispensary not found`);
continue;
}
const websiteUrl = dispensary.website || dispensary.menu_url;
if (!websiteUrl) {
console.log(` ✗ No website URL`);
continue;
}
// Mark jobs as running
await pool.query(
`UPDATE sandbox_crawl_jobs SET status = 'running', started_at = NOW()
WHERE dispensary_id = $1 AND job_type = 'detection' AND status = 'pending'`,
[job.dispensary_id]
);
// Run multi-category detection
console.log(` Detecting providers for ${dispensary.name}...`);
const detection = await detectMultiCategoryProviders(websiteUrl, { timeout: 45000 });
// Update all categories
await updateAllCategoryProviders(job.dispensary_id, detection);
// Mark jobs as completed
await pool.query(
`UPDATE sandbox_crawl_jobs SET status = 'completed', completed_at = NOW(),
result_summary = $1
WHERE dispensary_id = $2 AND job_type = 'detection' AND status = 'running'`,
[JSON.stringify({
product: { provider: detection.product.provider, confidence: detection.product.confidence },
specials: { provider: detection.specials.provider, confidence: detection.specials.confidence },
brand: { provider: detection.brand.provider, confidence: detection.brand.confidence },
metadata: { provider: detection.metadata.provider, confidence: detection.metadata.confidence },
}), job.dispensary_id]
);
console.log(` ✓ Detection complete:`);
console.log(` Product: ${detection.product.provider} (${detection.product.confidence}%) -> ${detection.product.mode}`);
console.log(` Specials: ${detection.specials.provider} (${detection.specials.confidence}%) -> ${detection.specials.mode}`);
console.log(` Brand: ${detection.brand.provider} (${detection.brand.confidence}%) -> ${detection.brand.mode}`);
console.log(` Metadata: ${detection.metadata.provider} (${detection.metadata.confidence}%) -> ${detection.metadata.mode}`);
} catch (error: any) {
console.log(` ✗ Error: ${error.message}`);
await pool.query(
`UPDATE sandbox_crawl_jobs SET status = 'failed', error_message = $1
WHERE dispensary_id = $2 AND job_type = 'detection' AND status = 'running'`,
[error.message, job.dispensary_id]
);
}
}
}
async function processCrawlJobs(): Promise<void> {
const categories = flags.category ? [flags.category] : CATEGORIES;
for (const cat of categories) {
console.log(`\n⚙ Processing ${cat.toUpperCase()} Crawl Jobs...\n`);
// Process sandbox jobs for this category
if (flags.sandbox || !flags.production) {
await processCategorySandboxJobs(cat, flags.limit);
}
// Process production jobs for this category
if (flags.production && cat === 'product') {
// Get pending production crawls
const prodJobs = await pool.query(
`SELECT d.id
FROM dispensaries d
WHERE d.product_provider = 'dutchie'
AND d.product_crawler_mode = 'production'
AND d.product_confidence >= 70
${flags.dispensary ? 'AND d.id = $2' : ''}
LIMIT $1`,
flags.dispensary ? [flags.limit, flags.dispensary] : [flags.limit]
);
for (const job of prodJobs.rows) {
console.log(`Processing production ${cat} crawl for dispensary ${job.id}...`);
const result = await runCrawlProductsJob(job.id);
console.log(` ${result.success ? '✓' : '✗'} ${result.message}`);
}
}
}
}
async function processSpecificDispensary(): Promise<void> {
if (!flags.dispensary) return;
console.log(`\n🎯 Processing Dispensary ${flags.dispensary}...\n`);
const dispResult = await pool.query(
'SELECT * FROM dispensaries WHERE id = $1',
[flags.dispensary]
);
if (dispResult.rows.length === 0) {
console.log('Dispensary not found');
return;
}
const dispensary = dispResult.rows[0];
console.log(`Name: ${dispensary.name}`);
console.log(`Website: ${dispensary.website || dispensary.menu_url || 'none'}`);
console.log('');
if (flags.detection) {
console.log('Running multi-category detection...');
const websiteUrl = dispensary.website || dispensary.menu_url;
if (websiteUrl) {
const detection = await detectMultiCategoryProviders(websiteUrl);
await updateAllCategoryProviders(flags.dispensary, detection);
console.log('Detection results:');
console.log(` Product: ${detection.product.provider} (${detection.product.confidence}%) -> ${detection.product.mode}`);
console.log(` Specials: ${detection.specials.provider} (${detection.specials.confidence}%) -> ${detection.specials.mode}`);
console.log(` Brand: ${detection.brand.provider} (${detection.brand.confidence}%) -> ${detection.brand.mode}`);
console.log(` Metadata: ${detection.metadata.provider} (${detection.metadata.confidence}%) -> ${detection.metadata.mode}`);
}
}
if (flags.production) {
console.log('\nRunning production crawls...');
const results = await runAllCategoryProductionCrawls(flags.dispensary);
console.log(` ${results.summary}`);
}
if (flags.sandbox) {
console.log('\nRunning sandbox crawls...');
const results = await runAllCategorySandboxCrawls(flags.dispensary);
console.log(` ${results.summary}`);
}
}
async function showStats(): Promise<void> {
console.log('\n📊 Multi-Category Intelligence Stats:');
// Per-category stats
for (const cat of CATEGORIES) {
const stats = await pool.query(`
SELECT
COUNT(*) as total,
COUNT(*) FILTER (WHERE ${cat}_provider IS NULL) as no_provider,
COUNT(*) FILTER (WHERE ${cat}_provider = 'dutchie') as dutchie,
COUNT(*) FILTER (WHERE ${cat}_provider = 'treez') as treez,
COUNT(*) FILTER (WHERE ${cat}_provider NOT IN ('dutchie', 'treez', 'unknown') AND ${cat}_provider IS NOT NULL) as other,
COUNT(*) FILTER (WHERE ${cat}_provider = 'unknown') as unknown,
COUNT(*) FILTER (WHERE ${cat}_crawler_mode = 'production') as production,
COUNT(*) FILTER (WHERE ${cat}_crawler_mode = 'sandbox') as sandbox,
AVG(${cat}_confidence) as avg_confidence
FROM dispensaries
`);
const s = stats.rows[0];
console.log(`
${cat.toUpperCase()}:
Providers: Dutchie=${s.dutchie}, Treez=${s.treez}, Other=${s.other}, Unknown=${s.unknown}, None=${s.no_provider}
Modes: Production=${s.production}, Sandbox=${s.sandbox}
Avg Confidence: ${Math.round(s.avg_confidence || 0)}%`);
}
// Job stats per category
console.log('\n Sandbox Jobs by Category:');
const jobStats = await pool.query(`
SELECT
category,
COUNT(*) FILTER (WHERE status = 'pending') as pending,
COUNT(*) FILTER (WHERE status = 'running') as running,
COUNT(*) FILTER (WHERE status = 'completed') as completed,
COUNT(*) FILTER (WHERE status = 'failed') as failed
FROM sandbox_crawl_jobs
GROUP BY category
ORDER BY category
`);
for (const row of jobStats.rows) {
console.log(` ${row.category}: pending=${row.pending}, running=${row.running}, completed=${row.completed}, failed=${row.failed}`);
}
}
async function main() {
if (flags.help) {
await showHelp();
process.exit(0);
}
console.log('═══════════════════════════════════════════════════════');
console.log(' Multi-Category Intelligence Queue Manager');
console.log('═══════════════════════════════════════════════════════');
if (flags.dryRun) {
console.log('\n🔍 DRY RUN MODE - No changes will be made\n');
}
if (flags.category) {
console.log(`\n📌 Filtering to category: ${flags.category}\n`);
}
try {
// Show current stats first
await showStats();
// If specific dispensary specified, process it directly
if (flags.dispensary && flags.process) {
await processSpecificDispensary();
} else if (flags.process) {
// Process mode - run jobs
if (flags.detection) {
await processDetectionJobs();
}
await processCrawlJobs();
} else {
// Queuing mode
let totalQueued = 0;
if (flags.detection) {
totalQueued += await queueMultiCategoryDetection();
}
if (flags.production) {
totalQueued += await queueCategoryProductionCrawls(flags.category);
}
if (flags.sandbox) {
totalQueued += await queueCategorySandboxCrawls(flags.category);
}
console.log('\n═══════════════════════════════════════════════════════');
console.log(` Total queued: ${totalQueued}`);
console.log('═══════════════════════════════════════════════════════\n');
}
// Show updated stats
if (!flags.dryRun) {
await showStats();
}
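} catch (error) {
console.error('Fatal error:', error);
// process.exit(1) here would terminate before the finally block runs,
// so set the exit code and let the pool close normally.
process.exitCode = 1;
} finally {
await pool.end();
}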
}
main();
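`processDetectionJobs()` above picks placeholder numbers (`$2`, `$3`) with nested ternaries depending on which filters are set. One alternative is to push each value as its condition is added, so the placeholder index always equals the current length of the values array. The sketch below assumes that approach; `buildDetectionJobQuery` and `DetectionJobFilters` are hypothetical names, and the table and column names are taken from the query above.

```typescript
// Hypothetical helper: builds the pending-detection query with placeholders
// numbered in the order their values are pushed.
interface DetectionJobFilters {
  limit: number;
  category?: string;
  dispensaryId?: number;
}

function buildDetectionJobQuery(
  filters: DetectionJobFilters,
): { text: string; values: unknown[] } {
  const values: unknown[] = [];
  const conditions = [`job_type = 'detection'`, `status = 'pending'`];

  if (filters.category) {
    values.push(filters.category);
    conditions.push(`category = $${values.length}`);
  }
  if (filters.dispensaryId) {
    values.push(filters.dispensaryId);
    conditions.push(`dispensary_id = $${values.length}`);
  }

  // LIMIT goes last, so its placeholder is simply the next index
  values.push(filters.limit);
  const text = [
    'SELECT DISTINCT dispensary_id',
    'FROM sandbox_crawl_jobs',
    `WHERE ${conditions.join(' AND ')}`,
    `LIMIT $${values.length}`,
  ].join('\n');

  return { text, values };
}
```

The returned object has the `{ text, values }` shape that node-postgres accepts as a query config, so it could be passed straight to `pool.query(...)`. Note that the placeholder order differs from the original (`LIMIT` is last rather than `$1`); the resulting query is equivalent.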

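The rule "provider missing or confidence below 70" appears three times in the script above: in the SQL `WHERE` clause, in the dry-run listing, and in the queueing loop. A small helper could keep the JavaScript side in one place. This is a sketch under assumptions: `categoriesNeedingDetection` and `MIN_CONFIDENCE` are hypothetical names, and a missing confidence is treated as 0, which matches how `null < 70` evaluates in the original checks.

```typescript
// Hypothetical helper: lists the categories of one dispensary row that
// still need provider detection.
const DETECTION_CATEGORIES = ['product', 'specials', 'brand', 'metadata'] as const;
type DetectionCategory = (typeof DETECTION_CATEGORIES)[number];

const MIN_CONFIDENCE = 70; // threshold used by the queries above

function categoriesNeedingDetection(
  row: Record<string, unknown>,
): DetectionCategory[] {
  return DETECTION_CATEGORIES.filter((category) => {
    const provider = row[`${category}_provider`];
    // Assumption: a null/undefined confidence counts as 0
    const confidence = Number(row[`${category}_confidence`] ?? 0);
    return !provider || confidence < MIN_CONFIDENCE;
  });
}
```

With this in place, the dry-run branch could print `categoriesNeedingDetection(row).join(', ')`, and the queueing loop could iterate over the same list.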
@@ -1,173 +0,0 @@
#!/usr/bin/env npx tsx
/**
* Dutchie Platform ID Resolver
*
* Standalone script to resolve a Dutchie dispensary slug to its platform ID.
*
* USAGE:
* npx tsx src/scripts/resolve-dutchie-id.ts <slug>
* npx tsx src/scripts/resolve-dutchie-id.ts hydroman-dispensary
* npx tsx src/scripts/resolve-dutchie-id.ts AZ-Deeply-Rooted
*
* RESOLUTION STRATEGY:
* 1. Navigate to https://dutchie.com/embedded-menu/{slug} via Puppeteer
* 2. Extract window.reactEnv.dispensaryId (preferred - fastest)
* 3. If reactEnv fails, call GraphQL GetAddressBasedDispensaryData as fallback
*
* OUTPUT:
* - dispensaryId: The MongoDB ObjectId (e.g., "6405ef617056e8014d79101b")
* - source: "reactEnv" or "graphql"
* - httpStatus: HTTP status from embedded menu page
* - error: Error message if resolution failed
*/
import { resolveDispensaryIdWithDetails, ResolveDispensaryResult } from '../dutchie-az/services/graphql-client';
async function main() {
const args = process.argv.slice(2);
if (args.length === 0 || args.includes('--help') || args.includes('-h')) {
console.log(`
Dutchie Platform ID Resolver
Usage:
npx tsx src/scripts/resolve-dutchie-id.ts <slug>
Examples:
npx tsx src/scripts/resolve-dutchie-id.ts hydroman-dispensary
npx tsx src/scripts/resolve-dutchie-id.ts AZ-Deeply-Rooted
npx tsx src/scripts/resolve-dutchie-id.ts mint-cannabis
Resolution Strategy:
1. Puppeteer navigates to https://dutchie.com/embedded-menu/{slug}
2. Extracts window.reactEnv.dispensaryId (preferred)
3. Falls back to GraphQL GetAddressBasedDispensaryData if needed
Output Fields:
- dispensaryId: MongoDB ObjectId (e.g., "6405ef617056e8014d79101b")
- source: "reactEnv" (from page) or "graphql" (from API)
- httpStatus: HTTP status code from page load
- error: Error message if resolution failed
`);
process.exit(0);
}
const slug = args[0];
console.log('='.repeat(60));
console.log('DUTCHIE PLATFORM ID RESOLVER');
console.log('='.repeat(60));
console.log(`Slug: ${slug}`);
console.log(`Embedded Menu URL: https://dutchie.com/embedded-menu/${slug}`);
console.log('');
console.log('Resolving...');
console.log('');
const startTime = Date.now();
try {
const result: ResolveDispensaryResult = await resolveDispensaryIdWithDetails(slug);
const duration = Date.now() - startTime;
console.log('='.repeat(60));
console.log('RESOLUTION RESULT');
console.log('='.repeat(60));
if (result.dispensaryId) {
console.log(`✓ SUCCESS`);
console.log('');
console.log(` Dispensary ID: ${result.dispensaryId}`);
console.log(` Source: ${result.source}`);
console.log(` HTTP Status: ${result.httpStatus || 'N/A'}`);
console.log(` Duration: ${duration}ms`);
console.log('');
// Show how to use this ID
console.log('='.repeat(60));
console.log('USAGE');
console.log('='.repeat(60));
console.log('');
console.log('Use this ID in GraphQL FilteredProducts query:');
console.log('');
console.log(' POST https://dutchie.com/api-3/graphql');
console.log('');
console.log(' Body:');
console.log(` {
"operationName": "FilteredProducts",
"variables": {
"productsFilter": {
"dispensaryId": "${result.dispensaryId}",
"pricingType": "rec",
"Status": "Active"
},
"page": 0,
"perPage": 100
},
"extensions": {
"persistedQuery": {
"version": 1,
"sha256Hash": "ee29c060826dc41c527e470e9ae502c9b2c169720faa0a9f5d25e1b9a530a4a0"
}
}
}`);
console.log('');
// Output for piping/scripting
console.log('='.repeat(60));
console.log('JSON OUTPUT');
console.log('='.repeat(60));
console.log(JSON.stringify({
success: true,
slug,
dispensaryId: result.dispensaryId,
source: result.source,
httpStatus: result.httpStatus,
durationMs: duration,
}, null, 2));
} else {
console.log(`✗ FAILED`);
console.log('');
console.log(` Error: ${result.error || 'Unknown error'}`);
console.log(` HTTP Status: ${result.httpStatus || 'N/A'}`);
console.log(` Duration: ${duration}ms`);
console.log('');
if (result.httpStatus === 403 || result.httpStatus === 404) {
console.log('NOTE: This store may be removed or not accessible on Dutchie.');
console.log(' Mark dispensary as not_crawlable in the database.');
}
console.log('');
console.log('JSON OUTPUT:');
console.log(JSON.stringify({
success: false,
slug,
error: result.error,
httpStatus: result.httpStatus,
durationMs: duration,
}, null, 2));
process.exit(1);
}
} catch (error: any) {
const duration = Date.now() - startTime;
console.error('='.repeat(60));
console.error('ERROR');
console.error('='.repeat(60));
console.error(`Message: ${error.message}`);
console.error(`Duration: ${duration}ms`);
console.error('');
if (error.message.includes('net::ERR_NAME_NOT_RESOLVED')) {
console.error('NOTE: DNS resolution failed. This typically happens when running');
console.error(' locally due to network restrictions. Try running from the');
console.error(' Kubernetes pod or a cloud environment.');
}
process.exit(1);
}
}
main();
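The header comment describes `dispensaryId` as a MongoDB ObjectId, which is written as 24 hexadecimal characters. A caller that stores the resolved ID might want a cheap shape check first. The sketch below assumes that use; `looksLikeObjectId` is a hypothetical helper and is not part of `graphql-client`.

```typescript
// Hypothetical helper: checks that a resolved ID has the textual shape of a
// MongoDB ObjectId (exactly 24 hex characters). It does not confirm that the
// ID exists on Dutchie.
function looksLikeObjectId(value: string): boolean {
  return /^[0-9a-f]{24}$/i.test(value);
}
```

For example, the sample ID from the comment above passes the check, while a slug such as `AZ-Deeply-Rooted` does not, which guards against writing a slug into a platform ID column by mistake.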

@@ -1,151 +0,0 @@
/**
* LEGACY SCRIPT - Run Dutchie GraphQL Scrape
*
* DEPRECATED: This script creates its own database pool.
* Future implementations should use the CannaIQ API endpoints instead.
*
* This script demonstrates the full pipeline:
* 1. Puppeteer navigates to Dutchie menu
* 2. GraphQL responses are intercepted
* 3. Products are normalized to our schema
* 4. Products are upserted to database
* 5. Derived views (brands, categories, specials) are automatically updated
*
* DO NOT:
* - Add this to package.json scripts
* - Run this in automated jobs
* - Use DATABASE_URL directly
*/
import { Pool } from 'pg';
import { scrapeDutchieMenu } from '../scrapers/dutchie-graphql';
console.warn('\n⚠ LEGACY SCRIPT: This script should be replaced with CannaIQ API calls.\n');
// Single database connection (cannaiq in cannaiq-postgres container)
const DATABASE_URL = process.env.CANNAIQ_DB_URL ||
`postgresql://${process.env.CANNAIQ_DB_USER || 'dutchie'}:${process.env.CANNAIQ_DB_PASS || 'dutchie_local_pass'}@${process.env.CANNAIQ_DB_HOST || 'localhost'}:${process.env.CANNAIQ_DB_PORT || '54320'}/${process.env.CANNAIQ_DB_NAME || 'cannaiq'}`;
async function main() {
const pool = new Pool({ connectionString: DATABASE_URL });
try {
console.log('='.repeat(80));
console.log('DUTCHIE GRAPHQL SCRAPER - FULL PIPELINE TEST');
console.log('='.repeat(80));
console.log(`Database: ${DATABASE_URL.replace(/:[^:@]+@/, ':***@')}`);
// Configuration
const storeId = 1; // Deeply Rooted
const menuUrl = 'https://dutchie.com/embedded-menu/AZ-Deeply-Rooted';
console.log(`\nStore ID: ${storeId}`);
console.log(`Menu URL: ${menuUrl}`);
console.log('\n' + '-'.repeat(80));
// Run the scrape
console.log('\n🚀 Starting scrape...\n');
const result = await scrapeDutchieMenu(pool, storeId, menuUrl);
console.log('\n' + '-'.repeat(80));
console.log('📊 SCRAPE RESULTS:');
console.log('-'.repeat(80));
console.log(` Success: ${result.success}`);
console.log(` Products Found: ${result.productsFound}`);
console.log(` Inserted: ${result.inserted}`);
console.log(` Updated: ${result.updated}`);
if (result.error) {
console.log(` Error: ${result.error}`);
}
// Query derived views to show the result
if (result.success) {
console.log('\n' + '-'.repeat(80));
console.log('📈 DERIVED DATA (from products table):');
console.log('-'.repeat(80));
// Brands
const brandsResult = await pool.query(`
SELECT brand_name, product_count, min_price, max_price
FROM derived_brands
WHERE store_id = $1
ORDER BY product_count DESC
LIMIT 5
`, [storeId]);
console.log('\nTop 5 Brands:');
brandsResult.rows.forEach(row => {
console.log(` - ${row.brand_name}: ${row.product_count} products ($${row.min_price} - $${row.max_price})`);
});
// Specials
const specialsResult = await pool.query(`
SELECT name, brand, rec_price, rec_special_price, discount_percent
FROM current_specials
WHERE store_id = $1
LIMIT 5
`, [storeId]);
console.log('\nTop 5 Specials:');
if (specialsResult.rows.length === 0) {
console.log(' (No specials found - is_on_special may not be populated yet)');
} else {
specialsResult.rows.forEach(row => {
console.log(` - ${row.name} (${row.brand}): $${row.rec_price} -> $${row.rec_special_price} (${row.discount_percent}% off)`);
});
}
// Categories
const categoriesResult = await pool.query(`
SELECT category_name, product_count
FROM derived_categories
WHERE store_id = $1
ORDER BY product_count DESC
LIMIT 5
`, [storeId]);
console.log('\nTop 5 Categories:');
if (categoriesResult.rows.length === 0) {
console.log(' (No categories found - subcategory may not be populated yet)');
} else {
categoriesResult.rows.forEach(row => {
console.log(` - ${row.category_name}: ${row.product_count} products`);
});
}
// Sample product
const sampleResult = await pool.query(`
SELECT name, brand, subcategory, rec_price, rec_special_price, is_on_special, thc_percentage, status
FROM products
WHERE store_id = $1 AND subcategory IS NOT NULL
ORDER BY updated_at DESC
LIMIT 1
`, [storeId]);
if (sampleResult.rows.length > 0) {
const sample = sampleResult.rows[0];
console.log('\nSample Product (with new fields):');
console.log(` Name: ${sample.name}`);
console.log(` Brand: ${sample.brand}`);
console.log(` Category: ${sample.subcategory}`);
console.log(` Price: $${sample.rec_price}`);
console.log(` Sale Price: ${sample.rec_special_price ? `$${sample.rec_special_price}` : 'N/A'}`);
console.log(` On Special: ${sample.is_on_special}`);
console.log(` THC: ${sample.thc_percentage}%`);
console.log(` Status: ${sample.status}`);
}
}
console.log('\n' + '='.repeat(80));
console.log('✅ SCRAPE COMPLETE');
console.log('='.repeat(80));
} catch (error: any) {
console.error('\n❌ Error:', error.message);
throw error;
} finally {
await pool.end();
}
}
main().catch(() => {
// main() has already logged the error message before rethrowing
process.exitCode = 1;
});
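The script above hides the database password before logging by running a regex over the connection string. An alternative sketch using the standard `URL` class is shown here; `maskDbUrl` is a hypothetical name, and the fallback string for unparseable input is an illustrative choice.

```typescript
// Hypothetical helper: returns the connection string with the password
// replaced by "***", or a placeholder if the string cannot be parsed.
function maskDbUrl(connectionString: string): string {
  try {
    const url = new URL(connectionString);
    if (url.password) {
      url.password = '***';
    }
    return url.toString();
  } catch {
    // Never echo an unparseable string: it may still contain credentials
    return '(unparseable connection string)';
  }
}
```

Parsing first means the password is identified by position in the URL rather than by pattern, so a password containing `:` is still masked in full.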

@@ -1,225 +0,0 @@
/**
* Sandbox Crawl Script for Dispensary 101 (Trulieve Scottsdale)
*
* Runs a full crawl and captures trace data for observability.
* NO automatic promotion or status changes.
*/
import { Pool } from 'pg';
import { crawlDispensaryProducts } from '../dutchie-az/services/product-crawler';
import { Dispensary } from '../dutchie-az/types';
const pool = new Pool({ connectionString: process.env.DATABASE_URL });
async function main() {
console.log('=== SANDBOX CRAWL: Dispensary 101 (Trulieve Scottsdale) ===\n');
const startTime = Date.now();
// Load dispensary from database (only columns that exist in local schema)
const dispResult = await pool.query(`
SELECT id, name, city, state, menu_type, platform_dispensary_id, menu_url
FROM dispensaries
WHERE id = 101
`);
if (!dispResult.rows[0]) {
console.log('ERROR: Dispensary 101 not found');
await pool.end();
return;
}
const row = dispResult.rows[0];
// Map to Dispensary interface (snake_case -> camelCase)
const dispensary: Dispensary = {
id: row.id,
platform: 'dutchie',
name: row.name,
slug: row.name.toLowerCase().replace(/\s+/g, '-'),
city: row.city,
state: row.state,
platformDispensaryId: row.platform_dispensary_id,
menuType: row.menu_type,
menuUrl: row.menu_url,
createdAt: new Date(),
updatedAt: new Date(),
};
console.log('=== DISPENSARY INFO ===');
console.log(`Name: ${dispensary.name}`);
console.log(`Location: ${dispensary.city}, ${dispensary.state}`);
console.log(`Menu Type: ${dispensary.menuType}`);
console.log(`Platform ID: ${dispensary.platformDispensaryId}`);
console.log(`Menu URL: ${dispensary.menuUrl}`);
console.log('');
// Get profile info
const profileResult = await pool.query(`
SELECT id, profile_key, status, config FROM dispensary_crawler_profiles
WHERE dispensary_id = 101
`);
const profile = profileResult.rows[0];
if (profile) {
console.log('=== PROFILE ===');
console.log(`Profile Key: ${profile.profile_key}`);
console.log(`Profile Status: ${profile.status}`);
console.log(`Config: ${JSON.stringify(profile.config, null, 2)}`);
console.log('');
} else {
console.log('=== PROFILE ===');
console.log('No profile found - will use defaults');
console.log('');
}
// Run the crawl
console.log('=== STARTING CRAWL ===');
console.log('Options: useBothModes=true, downloadImages=false (sandbox)');
console.log('');
try {
const result = await crawlDispensaryProducts(dispensary, 'rec', {
useBothModes: true,
downloadImages: false, // Skip images in sandbox mode for speed
});
console.log('');
console.log('=== CRAWL RESULT ===');
console.log(`Success: ${result.success}`);
console.log(`Products Found: ${result.productsFound}`);
console.log(`Products Fetched: ${result.productsFetched}`);
console.log(`Products Upserted: ${result.productsUpserted}`);
console.log(`Snapshots Created: ${result.snapshotsCreated}`);
if (result.errorMessage) {
console.log(`Error: ${result.errorMessage}`);
}
console.log(`Duration: ${result.durationMs}ms`);
console.log('');
// Show sample products from database
if (result.productsUpserted > 0) {
const sampleProducts = await pool.query(`
SELECT
id, name, brand_name, type, subcategory, strain_type,
price_rec, price_rec_original, stock_status, external_product_id
FROM dutchie_products
WHERE dispensary_id = 101
ORDER BY updated_at DESC
LIMIT 10
`);
console.log('=== SAMPLE PRODUCTS (10) ===');
sampleProducts.rows.forEach((p: any, i: number) => {
console.log(`${i + 1}. ${p.name}`);
console.log(` Brand: ${p.brand_name || 'N/A'}`);
console.log(` Type: ${p.type} / ${p.subcategory || 'N/A'}`);
console.log(` Strain: ${p.strain_type || 'N/A'}`);
console.log(` Price: $${p.price_rec || 'N/A'} (orig: $${p.price_rec_original || 'N/A'})`);
console.log(` Stock: ${p.stock_status}`);
console.log(` External ID: ${p.external_product_id}`);
console.log('');
});
// Show field coverage stats
const fieldStats = await pool.query(`
SELECT
COUNT(*) as total,
COUNT(brand_name) as with_brand,
COUNT(type) as with_type,
COUNT(strain_type) as with_strain,
COUNT(price_rec) as with_price,
COUNT(image_url) as with_image,
COUNT(description) as with_description,
COUNT(thc_content) as with_thc,
COUNT(cbd_content) as with_cbd
FROM dutchie_products
WHERE dispensary_id = 101
`);
const stats = fieldStats.rows[0];
console.log('=== FIELD COVERAGE ===');
console.log(`Total products: ${stats.total}`);
console.log(`With brand: ${stats.with_brand} (${Math.round(stats.with_brand / stats.total * 100)}%)`);
console.log(`With type: ${stats.with_type} (${Math.round(stats.with_type / stats.total * 100)}%)`);
console.log(`With strain_type: ${stats.with_strain} (${Math.round(stats.with_strain / stats.total * 100)}%)`);
console.log(`With price_rec: ${stats.with_price} (${Math.round(stats.with_price / stats.total * 100)}%)`);
console.log(`With image_url: ${stats.with_image} (${Math.round(stats.with_image / stats.total * 100)}%)`);
console.log(`With description: ${stats.with_description} (${Math.round(stats.with_description / stats.total * 100)}%)`);
console.log(`With THC: ${stats.with_thc} (${Math.round(stats.with_thc / stats.total * 100)}%)`);
console.log(`With CBD: ${stats.with_cbd} (${Math.round(stats.with_cbd / stats.total * 100)}%)`);
console.log('');
}
// Insert trace record for observability
const traceData = {
crawlResult: result,
dispensaryInfo: {
id: dispensary.id,
name: dispensary.name,
platformDispensaryId: dispensary.platformDispensaryId,
menuUrl: dispensary.menuUrl,
},
profile: profile || null,
timestamp: new Date().toISOString(),
};
await pool.query(`
INSERT INTO crawl_orchestration_traces
(dispensary_id, profile_id, profile_key, crawler_module, mode,
state_at_start, state_at_end, trace, success, products_found,
duration_ms, started_at, completed_at)
VALUES ($1, $2, $3, $4, $5, $6, $7, $8, $9, $10, $11, $12, NOW())
`, [
101,
profile?.id || null,
profile?.profile_key || null,
'product-crawler',
'sandbox',
profile?.status || 'no_profile',
profile?.status || 'no_profile', // No status change in sandbox
JSON.stringify(traceData),
result.success,
result.productsFound,
result.durationMs,
new Date(startTime),
]);
console.log('=== TRACE RECORDED ===');
console.log('Trace saved to crawl_orchestration_traces table');
} catch (error: any) {
console.error('=== CRAWL ERROR ===');
console.error('Error:', error.message);
console.error('Stack:', error.stack);
// Record error trace
await pool.query(`
INSERT INTO crawl_orchestration_traces
(dispensary_id, profile_id, profile_key, crawler_module, mode,
state_at_start, state_at_end, trace, success, error_message,
duration_ms, started_at, completed_at)
VALUES ($1, $2, $3, $4, $5, $6, $7, $8, $9, $10, $11, $12, NOW())
`, [
101,
profile?.id || null,
profile?.profile_key || null,
'product-crawler',
'sandbox',
profile?.status || 'no_profile',
profile?.status || 'no_profile',
JSON.stringify({ error: error.message, stack: error.stack }),
false,
error.message,
Date.now() - startTime,
new Date(startTime),
]);
}
await pool.end();
console.log('=== SANDBOX CRAWL COMPLETE ===');
}
main().catch(e => {
console.error('Fatal error:', e.message);
process.exit(1);
});
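
The field-coverage report in the script above divides raw `COUNT(...)` values by the total inline. As a hedged sketch only (the helpers `coveragePercent` and `formatCoverage` are hypothetical and not part of this script), the same arithmetic could be isolated so that it tolerates the string values node-postgres returns for `COUNT` (a bigint) by default, as well as an empty table:

```typescript
// Hypothetical helpers, not part of the original script.
// Accepts numbers or numeric strings, since COUNT(*) values may arrive as strings.
function coveragePercent(count: number | string, total: number | string): number {
  const c = Number(count);
  const t = Number(total);
  // Avoid NaN/Infinity when the table is empty or an input is malformed.
  if (!isFinite(c) || !isFinite(t) || t <= 0) {
    return 0;
  }
  return Math.round((c / t) * 100);
}

// Produces one report line in the same shape as the console output above.
function formatCoverage(label: string, count: number | string, total: number | string): string {
  return `${label}: ${count} (${coveragePercent(count, total)}%)`;
}

console.log(formatCoverage('With brand', '42', '50')); // prints "With brand: 42 (84%)"
```

Returning 0 for an empty table is a design choice made for this sketch; a caller might prefer to print `N/A` instead.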


@@ -1,181 +0,0 @@
/**
* LEGACY SCRIPT - Sandbox Crawl Test
*
* DEPRECATED: This script uses direct database connections.
* Future implementations should use the CannaiQ API endpoints instead.
*
* This script runs a sandbox crawl for a dispensary and captures the full trace.
* It is kept for historical reference and manual testing only.
*
* DO NOT:
* - Add this to package.json scripts
* - Run this in automated jobs
* - Use DATABASE_URL directly
*
* Usage (manual only):
* STORAGE_DRIVER=local npx tsx src/scripts/sandbox-test.ts <dispensary_id>
*
* LOCAL MODE REQUIREMENTS:
* - STORAGE_DRIVER=local
* - STORAGE_BASE_PATH=./storage
* - Local cannaiq-postgres on port 54320
* - NO MinIO, NO Kubernetes
*/
import { query, getClient, closePool } from '../dutchie-az/db/connection';
import { runDispensaryOrchestrator } from '../services/dispensary-orchestrator';
// Verify local mode
function verifyLocalMode(): void {
const storageDriver = process.env.STORAGE_DRIVER || 'local';
const minioEndpoint = process.env.MINIO_ENDPOINT;
console.log('=== LOCAL MODE VERIFICATION ===');
console.log(`STORAGE_DRIVER: ${storageDriver}`);
console.log(`MINIO_ENDPOINT: ${minioEndpoint || 'NOT SET (good)'}`);
console.log(`STORAGE_BASE_PATH: ${process.env.STORAGE_BASE_PATH || './storage'}`);
console.log('DB Connection: Using canonical CannaiQ pool');
if (storageDriver !== 'local') {
console.error('ERROR: STORAGE_DRIVER must be "local"');
process.exit(1);
}
if (minioEndpoint) {
console.error('ERROR: MINIO_ENDPOINT should NOT be set in local mode');
process.exit(1);
}
console.log('✅ Local mode verified\n');
}
async function getDispensaryInfo(dispensaryId: number) {
const result = await query(`
SELECT d.id, d.name, d.city, d.menu_type, d.platform_dispensary_id, d.menu_url,
p.profile_key, p.status as profile_status, p.config
FROM dispensaries d
LEFT JOIN dispensary_crawler_profiles p ON p.dispensary_id = d.id
WHERE d.id = $1
`, [dispensaryId]);
return result.rows[0];
}
async function getLatestTrace(dispensaryId: number) {
const result = await query(`
SELECT *
FROM crawl_orchestration_traces
WHERE dispensary_id = $1
ORDER BY created_at DESC
LIMIT 1
`, [dispensaryId]);
return result.rows[0];
}
async function main() {
console.warn('\n⚠ LEGACY SCRIPT: This script should be replaced with CannaiQ API calls.\n');
const dispensaryId = parseInt(process.argv[2], 10);
if (!dispensaryId || isNaN(dispensaryId)) {
console.error('Usage: npx tsx src/scripts/sandbox-test.ts <dispensary_id>');
console.error('Example: npx tsx src/scripts/sandbox-test.ts 101');
process.exit(1);
}
// Verify local mode first
verifyLocalMode();
try {
// Get dispensary info
console.log(`=== DISPENSARY INFO (ID: ${dispensaryId}) ===`);
const dispensary = await getDispensaryInfo(dispensaryId);
if (!dispensary) {
console.error(`Dispensary ${dispensaryId} not found`);
process.exit(1);
}
console.log(`Name: ${dispensary.name}`);
console.log(`City: ${dispensary.city}`);
console.log(`Menu Type: ${dispensary.menu_type}`);
console.log(`Platform Dispensary ID: ${dispensary.platform_dispensary_id || 'NULL'}`);
console.log(`Menu URL: ${dispensary.menu_url || 'NULL'}`);
console.log(`Profile Key: ${dispensary.profile_key || 'NONE'}`);
console.log(`Profile Status: ${dispensary.profile_status || 'N/A'}`);
console.log(`Profile Config: ${JSON.stringify(dispensary.config, null, 2)}`);
console.log('');
// Run sandbox crawl
console.log('=== RUNNING SANDBOX CRAWL ===');
console.log(`Starting sandbox crawl for ${dispensary.name}...`);
const startTime = Date.now();
const result = await runDispensaryOrchestrator(dispensaryId);
const duration = Date.now() - startTime;
console.log('\n=== CRAWL RESULT ===');
console.log(`Status: ${result.status}`);
console.log(`Summary: ${result.summary}`);
console.log(`Run ID: ${result.runId}`);
console.log(`Duration: ${duration}ms`);
console.log(`Detection Ran: ${result.detectionRan}`);
console.log(`Crawl Ran: ${result.crawlRan}`);
console.log(`Crawl Type: ${result.crawlType || 'N/A'}`);
console.log(`Products Found: ${result.productsFound || 0}`);
console.log(`Products New: ${result.productsNew || 0}`);
console.log(`Products Updated: ${result.productsUpdated || 0}`);
if (result.error) {
console.log(`Error: ${result.error}`);
}
// Get the trace
console.log('\n=== ORCHESTRATOR TRACE ===');
const trace = await getLatestTrace(dispensaryId);
if (trace) {
console.log(`Trace ID: ${trace.id}`);
console.log(`Profile Key: ${trace.profile_key || 'N/A'}`);
console.log(`Mode: ${trace.mode}`);
console.log(`Status: ${trace.status}`);
console.log(`Started At: ${trace.started_at}`);
console.log(`Completed At: ${trace.completed_at || 'In Progress'}`);
if (trace.steps && Array.isArray(trace.steps)) {
console.log(`\nSteps (${trace.steps.length} total):`);
trace.steps.forEach((step: any, i: number) => {
const status = step.status === 'completed' ? '✅' : step.status === 'failed' ? '❌' : '⏳';
console.log(` ${i + 1}. ${status} ${step.action}: ${step.description}`);
if (step.output && Object.keys(step.output).length > 0) {
console.log(` Output: ${JSON.stringify(step.output)}`);
}
if (step.error) {
console.log(` Error: ${step.error}`);
}
});
}
if (trace.result) {
console.log(`\nResult: ${JSON.stringify(trace.result, null, 2)}`);
}
if (trace.error_message) {
console.log(`\nError Message: ${trace.error_message}`);
}
} else {
console.log('No trace found for this dispensary');
}
} catch (error: any) {
console.error('Error running sandbox test:', error.message);
console.error(error.stack);
process.exit(1);
} finally {
await closePool();
}
}
main();
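
The `verifyLocalMode` check above reads `process.env` and exits on the first violation, which makes it awkward to exercise in isolation. A minimal sketch of a pure variant follows, assuming the same two rules (`STORAGE_DRIVER` must be `local`, `MINIO_ENDPOINT` must be unset). The names `EnvLike` and `localModeErrors` are hypothetical, and unlike the original this version collects every violation instead of stopping at the first:

```typescript
// Hypothetical pure variant of the local-mode check; not part of the original script.
type EnvLike = { [key: string]: string | undefined };

function localModeErrors(env: EnvLike): string[] {
  const errors: string[] = [];
  // Same default as the script: an unset driver is treated as "local".
  const storageDriver = env.STORAGE_DRIVER || 'local';
  if (storageDriver !== 'local') {
    errors.push('STORAGE_DRIVER must be "local"');
  }
  if (env.MINIO_ENDPOINT) {
    errors.push('MINIO_ENDPOINT should NOT be set in local mode');
  }
  return errors;
}

// Example with placeholder values: both rules are violated here.
console.log(localModeErrors({ STORAGE_DRIVER: 's3', MINIO_ENDPOINT: 'http://minio:9000' }));
```

A caller could pass `process.env` and call `process.exit(1)` only when the returned list is non-empty.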


@@ -1,332 +0,0 @@
/**
* LEGACY SCRIPT - Scrape All Active Products
*
* DEPRECATED: This script creates its own database pool.
* Future implementations should use the CannaiQ API endpoints instead.
*
* Scrapes ALL active products via direct GraphQL pagination.
* This is more reliable than category navigation.
*
* DO NOT:
* - Add this to package.json scripts
* - Run this in automated jobs
* - Use DATABASE_URL directly
*/
import puppeteer from 'puppeteer-extra';
import StealthPlugin from 'puppeteer-extra-plugin-stealth';
import { Pool } from 'pg';
import { normalizeDutchieProduct, DutchieProduct } from '../scrapers/dutchie-graphql';
puppeteer.use(StealthPlugin());
console.warn('\n⚠ LEGACY SCRIPT: This script should be replaced with CannaiQ API calls.\n');
// Single database connection (cannaiq in cannaiq-postgres container)
const DATABASE_URL = process.env.CANNAIQ_DB_URL ||
`postgresql://${process.env.CANNAIQ_DB_USER || 'dutchie'}:${process.env.CANNAIQ_DB_PASS || 'dutchie_local_pass'}@${process.env.CANNAIQ_DB_HOST || 'localhost'}:${process.env.CANNAIQ_DB_PORT || '54320'}/${process.env.CANNAIQ_DB_NAME || 'cannaiq'}`;
const GRAPHQL_HASH = 'ee29c060826dc41c527e470e9ae502c9b2c169720faa0a9f5d25e1b9a530a4a0';
async function scrapeAllProducts(menuUrl: string, storeId: number) {
const pool = new Pool({ connectionString: DATABASE_URL });
const browser = await puppeteer.launch({
headless: 'new',
args: ['--no-sandbox', '--disable-setuid-sandbox'],
});
try {
const page = await browser.newPage();
await page.setUserAgent(
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 Chrome/120.0.0.0 Safari/537.36'
);
console.log('Loading menu to establish session...');
await page.goto(menuUrl, {
waitUntil: 'networkidle2',
timeout: 60000,
});
await new Promise((r) => setTimeout(r, 3000));
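const dispensaryId = await page.evaluate(() => (window as any).reactEnv?.dispensaryId);
// Fail fast: without a dispensary ID every GraphQL page would come back empty
if (!dispensaryId) {
throw new Error('Could not read dispensaryId from window.reactEnv');
}
console.log('Dispensary ID:', dispensaryId);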
// Paginate through all products
const allProducts: DutchieProduct[] = [];
let pageNum = 0;
const perPage = 100;
console.log('\nFetching all products via paginated GraphQL...');
while (true) {
const result = await page.evaluate(
async (dispId: string, hash: string, page: number, perPage: number) => {
const variables = {
includeEnterpriseSpecials: false,
productsFilter: {
dispensaryId: dispId,
pricingType: 'rec',
Status: 'Active',
types: [],
useCache: false,
isDefaultSort: true,
sortBy: 'popularSortIdx',
sortDirection: 1,
bypassOnlineThresholds: true,
isKioskMenu: false,
removeProductsBelowOptionThresholds: false,
},
page,
perPage,
};
const qs = new URLSearchParams({
operationName: 'FilteredProducts',
variables: JSON.stringify(variables),
extensions: JSON.stringify({ persistedQuery: { version: 1, sha256Hash: hash } }),
});
const resp = await fetch(`https://dutchie.com/graphql?${qs.toString()}`, {
method: 'GET',
headers: {
'content-type': 'application/json',
'apollographql-client-name': 'Marketplace (production)',
},
credentials: 'include',
});
const json = await resp.json();
return {
products: json?.data?.filteredProducts?.products || [],
totalCount: json?.data?.filteredProducts?.queryInfo?.totalCount,
};
},
dispensaryId,
GRAPHQL_HASH,
pageNum,
perPage
);
if (result.products.length === 0) {
break;
}
allProducts.push(...result.products);
console.log(
`Page ${pageNum}: ${result.products.length} products (total so far: ${allProducts.length}/${result.totalCount})`
);
pageNum++;
// Safety limit
if (pageNum > 50) {
console.log('Reached page limit');
break;
}
}
console.log(`\nTotal products fetched: ${allProducts.length}`);
// Normalize and upsert
console.log('\nNormalizing and upserting to database...');
const normalized = allProducts.map(normalizeDutchieProduct);
const client = await pool.connect();
let inserted = 0;
let updated = 0;
try {
await client.query('BEGIN');
for (const product of normalized) {
const result = await client.query(
`
INSERT INTO products (
store_id, external_id, slug, name, enterprise_product_id,
brand, brand_external_id, brand_logo_url,
subcategory, strain_type, canonical_category,
price, rec_price, med_price, rec_special_price, med_special_price,
is_on_special, special_name, discount_percent, special_data,
sku, inventory_quantity, inventory_available, is_below_threshold, status,
thc_percentage, cbd_percentage, cannabinoids,
weight_mg, net_weight_value, net_weight_unit, options, raw_options,
image_url, additional_images,
is_featured, medical_only, rec_only,
source_created_at, source_updated_at,
description, raw_data,
dutchie_url, last_seen_at, updated_at
)
VALUES (
$1, $2, $3, $4, $5,
$6, $7, $8,
$9, $10, $11,
$12, $13, $14, $15, $16,
$17, $18, $19, $20,
$21, $22, $23, $24, $25,
$26, $27, $28,
$29, $30, $31, $32, $33,
$34, $35,
$36, $37, $38,
$39, $40,
$41, $42,
'', NOW(), NOW()
)
ON CONFLICT (store_id, slug) DO UPDATE SET
name = EXCLUDED.name,
enterprise_product_id = EXCLUDED.enterprise_product_id,
brand = EXCLUDED.brand,
brand_external_id = EXCLUDED.brand_external_id,
brand_logo_url = EXCLUDED.brand_logo_url,
subcategory = EXCLUDED.subcategory,
strain_type = EXCLUDED.strain_type,
canonical_category = EXCLUDED.canonical_category,
price = EXCLUDED.price,
rec_price = EXCLUDED.rec_price,
med_price = EXCLUDED.med_price,
rec_special_price = EXCLUDED.rec_special_price,
med_special_price = EXCLUDED.med_special_price,
is_on_special = EXCLUDED.is_on_special,
special_name = EXCLUDED.special_name,
discount_percent = EXCLUDED.discount_percent,
special_data = EXCLUDED.special_data,
sku = EXCLUDED.sku,
inventory_quantity = EXCLUDED.inventory_quantity,
inventory_available = EXCLUDED.inventory_available,
is_below_threshold = EXCLUDED.is_below_threshold,
status = EXCLUDED.status,
thc_percentage = EXCLUDED.thc_percentage,
cbd_percentage = EXCLUDED.cbd_percentage,
cannabinoids = EXCLUDED.cannabinoids,
weight_mg = EXCLUDED.weight_mg,
net_weight_value = EXCLUDED.net_weight_value,
net_weight_unit = EXCLUDED.net_weight_unit,
options = EXCLUDED.options,
raw_options = EXCLUDED.raw_options,
image_url = EXCLUDED.image_url,
additional_images = EXCLUDED.additional_images,
is_featured = EXCLUDED.is_featured,
medical_only = EXCLUDED.medical_only,
rec_only = EXCLUDED.rec_only,
source_created_at = EXCLUDED.source_created_at,
source_updated_at = EXCLUDED.source_updated_at,
description = EXCLUDED.description,
raw_data = EXCLUDED.raw_data,
last_seen_at = NOW(),
updated_at = NOW()
RETURNING (xmax = 0) AS was_inserted
`,
[
storeId,
product.external_id,
product.slug,
product.name,
product.enterprise_product_id,
product.brand,
product.brand_external_id,
product.brand_logo_url,
product.subcategory,
product.strain_type,
product.canonical_category,
product.price,
product.rec_price,
product.med_price,
product.rec_special_price,
product.med_special_price,
product.is_on_special,
product.special_name,
product.discount_percent,
product.special_data ? JSON.stringify(product.special_data) : null,
product.sku,
product.inventory_quantity,
product.inventory_available,
product.is_below_threshold,
product.status,
product.thc_percentage,
product.cbd_percentage,
product.cannabinoids ? JSON.stringify(product.cannabinoids) : null,
product.weight_mg,
product.net_weight_value,
product.net_weight_unit,
product.options,
product.raw_options,
product.image_url,
product.additional_images,
product.is_featured,
product.medical_only,
product.rec_only,
product.source_created_at,
product.source_updated_at,
product.description,
product.raw_data ? JSON.stringify(product.raw_data) : null,
]
);
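// In PostgreSQL, xmax = 0 holds for a freshly inserted row; a row updated via ON CONFLICT has a non-zero xmax
if (result.rows[0]?.was_inserted) {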
inserted++;
} else {
updated++;
}
}
await client.query('COMMIT');
} catch (error) {
await client.query('ROLLBACK');
throw error;
} finally {
client.release();
}
console.log(`\nDatabase: ${inserted} inserted, ${updated} updated`);
// Show summary stats
const stats = await pool.query(
`
SELECT
COUNT(*) as total,
COUNT(*) FILTER (WHERE is_on_special) as specials,
COUNT(DISTINCT brand) as brands,
COUNT(DISTINCT subcategory) as categories
FROM products WHERE store_id = $1
`,
[storeId]
);
console.log('\nStore summary:');
console.log(` Total products: ${stats.rows[0].total}`);
console.log(` On special: ${stats.rows[0].specials}`);
console.log(` Unique brands: ${stats.rows[0].brands}`);
console.log(` Categories: ${stats.rows[0].categories}`);
return {
success: true,
totalProducts: allProducts.length,
inserted,
updated,
};
} finally {
await browser.close();
await pool.end();
}
}
// Run
const menuUrl = process.argv[2] || 'https://dutchie.com/embedded-menu/AZ-Deeply-Rooted';
const storeId = parseInt(process.argv[3] || '1', 10);
console.log('='.repeat(60));
console.log('DUTCHIE GRAPHQL FULL SCRAPE');
console.log('='.repeat(60));
console.log(`Menu URL: ${menuUrl}`);
console.log(`Store ID: ${storeId}`);
console.log('');
scrapeAllProducts(menuUrl, storeId)
.then((result) => {
console.log('\n' + '='.repeat(60));
console.log('COMPLETE');
console.log(JSON.stringify(result, null, 2));
})
.catch((error) => {
console.error('Error:', error.message);
process.exit(1);
});
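
The paginated fetch above sends a persisted-query GET request whose query string carries `operationName`, `variables`, and `extensions`. As a rough sketch of just that URL-building step, assuming nothing beyond what the script shows: the helper name `buildPersistedQueryUrl` is hypothetical, the endpoint and hash below are placeholders, and the sketch uses `encodeURIComponent` rather than the script's `URLSearchParams`.

```typescript
// Hypothetical helper; the endpoint, operation name, and hash are supplied by the caller.
function buildPersistedQueryUrl(
  endpoint: string,
  operationName: string,
  variables: object,
  sha256Hash: string
): string {
  const params: Array<[string, string]> = [
    ['operationName', operationName],
    ['variables', JSON.stringify(variables)],
    // Same envelope the script sends: a version plus the query's SHA-256 hash.
    ['extensions', JSON.stringify({ persistedQuery: { version: 1, sha256Hash } })],
  ];
  const qs = params
    .map(([key, value]) => `${encodeURIComponent(key)}=${encodeURIComponent(value)}`)
    .join('&');
  return `${endpoint}?${qs}`;
}

// Placeholder endpoint and hash, for illustration only.
const exampleUrl = buildPersistedQueryUrl(
  'https://example.invalid/graphql',
  'FilteredProducts',
  { page: 0, perPage: 100 },
  'abc123'
);
console.log(exampleUrl);
```

Keeping the URL construction separate from the `page.evaluate` call would let the pagination loop vary only `page` between requests.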


@@ -1,156 +0,0 @@
/**
* Test script: End-to-end Dutchie GraphQL → DB → Dashboard flow
*
* This demonstrates the complete data pipeline:
* 1. Load one sample product captured from Dutchie GraphQL (read from /tmp/dutchie-schema-capture.json)
* 2. Normalize it to our schema
* 3. Show the mapping
*/
import { normalizeDutchieProduct, DutchieProduct, NormalizedProduct } from '../scrapers/dutchie-graphql';
import * as fs from 'fs';
// Load the captured sample product from schema capture
const capturedData = JSON.parse(
fs.readFileSync('/tmp/dutchie-schema-capture.json', 'utf-8')
);
const sampleProduct: DutchieProduct = capturedData.sampleProduct;
console.log('='.repeat(80));
console.log('DUTCHIE GRAPHQL → DATABASE MAPPING DEMONSTRATION');
console.log('='.repeat(80));
console.log('\n📥 RAW DUTCHIE GRAPHQL PRODUCT:');
console.log('-'.repeat(80));
// Show key fields from raw product
const keyRawFields = {
'_id': sampleProduct._id,
'Name': sampleProduct.Name,
'cName': sampleProduct.cName,
'brandName': sampleProduct.brandName,
'brand.id': sampleProduct.brand?.id,
'type': sampleProduct.type,
'subcategory': sampleProduct.subcategory,
'strainType': sampleProduct.strainType,
'Prices': sampleProduct.Prices,
'recPrices': sampleProduct.recPrices,
'recSpecialPrices': sampleProduct.recSpecialPrices,
'special': sampleProduct.special,
'specialData.saleSpecials[0].specialName': sampleProduct.specialData?.saleSpecials?.[0]?.specialName,
'specialData.saleSpecials[0].discount': sampleProduct.specialData?.saleSpecials?.[0]?.discount,
'THCContent.range[0]': sampleProduct.THCContent?.range?.[0],
'CBDContent.range[0]': sampleProduct.CBDContent?.range?.[0],
'Status': sampleProduct.Status,
'Image': sampleProduct.Image,
'POSMetaData.canonicalSKU': sampleProduct.POSMetaData?.canonicalSKU,
'POSMetaData.children[0].quantity': sampleProduct.POSMetaData?.children?.[0]?.quantity,
'POSMetaData.children[0].quantityAvailable': sampleProduct.POSMetaData?.children?.[0]?.quantityAvailable,
};
Object.entries(keyRawFields).forEach(([key, value]) => {
console.log(` ${key}: ${JSON.stringify(value)}`);
});
console.log('\n📤 NORMALIZED DATABASE ROW:');
console.log('-'.repeat(80));
// Normalize the product
const normalized: NormalizedProduct = normalizeDutchieProduct(sampleProduct);
// Show the normalized result (excluding raw_data for readability)
const { raw_data, cannabinoids, special_data, ...displayFields } = normalized;
Object.entries(displayFields).forEach(([key, value]) => {
if (value !== undefined && value !== null) {
console.log(` ${key}: ${JSON.stringify(value)}`);
}
});
console.log('\n🔗 FIELD MAPPING:');
console.log('-'.repeat(80));
const fieldMappings = [
['_id / id', 'external_id', sampleProduct._id, normalized.external_id],
['Name', 'name', sampleProduct.Name, normalized.name],
['cName', 'slug', sampleProduct.cName, normalized.slug],
['brandName', 'brand', sampleProduct.brandName, normalized.brand],
['brand.id', 'brand_external_id', sampleProduct.brand?.id, normalized.brand_external_id],
['subcategory', 'subcategory', sampleProduct.subcategory, normalized.subcategory],
['strainType', 'strain_type', sampleProduct.strainType, normalized.strain_type],
['recPrices[0]', 'rec_price', sampleProduct.recPrices?.[0], normalized.rec_price],
['recSpecialPrices[0]', 'rec_special_price', sampleProduct.recSpecialPrices?.[0], normalized.rec_special_price],
['special', 'is_on_special', sampleProduct.special, normalized.is_on_special],
['specialData...specialName', 'special_name', sampleProduct.specialData?.saleSpecials?.[0]?.specialName?.substring(0, 40) + '...', normalized.special_name?.substring(0, 40) + '...'],
['THCContent.range[0]', 'thc_percentage', sampleProduct.THCContent?.range?.[0], normalized.thc_percentage],
['CBDContent.range[0]', 'cbd_percentage', sampleProduct.CBDContent?.range?.[0], normalized.cbd_percentage],
['Status', 'status', sampleProduct.Status, normalized.status],
['Image', 'image_url', sampleProduct.Image?.substring(0, 50) + '...', normalized.image_url?.substring(0, 50) + '...'],
['POSMetaData.canonicalSKU', 'sku', sampleProduct.POSMetaData?.canonicalSKU, normalized.sku],
];
console.log(' GraphQL Field → DB Column | Value');
console.log(' ' + '-'.repeat(75));
fieldMappings.forEach(([gqlField, dbCol, gqlVal, dbVal]) => {
const gqlStr = String(gqlField).padEnd(30);
const dbStr = String(dbCol).padEnd(20);
console.log(` ${gqlStr}${dbStr} | ${JSON.stringify(dbVal)}`);
});
console.log('\n📊 SQL INSERT STATEMENT:');
console.log('-'.repeat(80));
// Generate example SQL
const sqlExample = `
INSERT INTO products (
store_id, external_id, slug, name,
brand, brand_external_id,
subcategory, strain_type,
rec_price, rec_special_price,
is_on_special, special_name, discount_percent,
thc_percentage, cbd_percentage,
status, image_url, sku
) VALUES (
1, -- store_id (Deeply Rooted)
'${normalized.external_id}', -- external_id
'${normalized.slug}', -- slug
'${normalized.name}', -- name
'${normalized.brand}', -- brand
'${normalized.brand_external_id}', -- brand_external_id
'${normalized.subcategory}', -- subcategory
'${normalized.strain_type}', -- strain_type
${normalized.rec_price}, -- rec_price
${normalized.rec_special_price}, -- rec_special_price
${normalized.is_on_special}, -- is_on_special
'${normalized.special_name?.substring(0, 50)}...', -- special_name
${normalized.discount_percent || 'NULL'}, -- discount_percent
${normalized.thc_percentage}, -- thc_percentage
${normalized.cbd_percentage}, -- cbd_percentage
'${normalized.status}', -- status
'${normalized.image_url}', -- image_url
'${normalized.sku}' -- sku
)
ON CONFLICT (store_id, slug) DO UPDATE SET ...;
`;
console.log(sqlExample);
console.log('\n✅ SUMMARY:');
console.log('-'.repeat(80));
console.log(` Product: ${normalized.name}`);
console.log(` Brand: ${normalized.brand}`);
console.log(` Category: ${normalized.subcategory}`);
console.log(` Price: $${normalized.rec_price} → $${normalized.rec_special_price} (${normalized.discount_percent}% off)`);
console.log(` THC: ${normalized.thc_percentage}%`);
console.log(` Status: ${normalized.status}`);
console.log(` On Special: ${normalized.is_on_special}`);
console.log(` SKU: ${normalized.sku}`);
console.log('\n🎯 DERIVED VIEWS (computed from products table):');
console.log('-'.repeat(80));
console.log(' - current_specials: Products where is_on_special = true');
console.log(' - derived_brands: Aggregated by brand name with counts/prices');
console.log(' - derived_categories: Aggregated by subcategory');
console.log('\nAll views are computed from the single products table - no separate tables needed!');
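
The example SQL printed by the script above interpolates values straight into the statement text, which is fine for display but would break on a product name containing a single quote. A hedged sketch of a parameterized alternative follows. The helper `buildInsert` and the type `SqlValue` are hypothetical, the sample row is invented for illustration, and table and column names are assumed to be trusted, hard-coded identifiers:

```typescript
// Hypothetical helper: builds a parameterized INSERT from a plain object.
// Table and column names are assumed to be trusted identifiers, never user input.
type SqlValue = string | number | boolean | null;

function buildInsert(
  table: string,
  row: { [column: string]: SqlValue }
): { text: string; values: SqlValue[] } {
  const columns = Object.keys(row);
  // $1, $2, ... placeholders in column order.
  const placeholders = columns.map((_, i) => `$${i + 1}`);
  const values = columns.map((column) => row[column]);
  const text = `INSERT INTO ${table} (${columns.join(', ')}) VALUES (${placeholders.join(', ')})`;
  return { text, values };
}

// Invented sample row, for illustration only.
const exampleInsert = buildInsert('products', {
  store_id: 1,
  slug: 'sample-product',
  name: "O'Brien's Gummies", // the quote would break naive string interpolation
  rec_price: 25,
  discount_percent: null,
});
console.log(exampleInsert.text);
console.log(exampleInsert.values);
```

The returned `{ text, values }` pair has the shape that node-postgres accepts as a query config object, so values travel separately from the SQL text and `null` stays a real NULL rather than the string `'null'`.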

Some files were not shown because too many files have changed in this diff.