diff --git a/CLAUDE.md b/CLAUDE.md index 90a4a716..b88ec83e 100644 --- a/CLAUDE.md +++ b/CLAUDE.md @@ -193,6 +193,44 @@ CannaiQ has **TWO databases** with distinct purposes: | `dutchie_menus` | **Canonical CannaiQ database** - All schema, migrations, and application data | READ/WRITE | | `dutchie_legacy` | **Legacy read-only archive** - Historical data from old system | READ-ONLY | +### Store vs Dispensary Terminology + +**"Store" and "Dispensary" are SYNONYMS in CannaiQ.** + +| Term | Usage | DB Table | +|------|-------|----------| +| Store | API routes (`/api/stores`) | `dispensaries` | +| Dispensary | DB table, internal code | `dispensaries` | + +- `/api/stores` and `/api/dispensaries` both query the `dispensaries` table +- There is NO `stores` table in use - it's a legacy empty table +- Use these terms interchangeably in code and documentation + +### Canonical vs Legacy Tables + +**CANONICAL TABLES (USE THESE):** + +| Table | Purpose | Row Count | +|-------|---------|-----------| +| `dispensaries` | Store/dispensary records | ~188+ rows | +| `dutchie_products` | Product catalog | ~37,000+ rows | +| `dutchie_product_snapshots` | Price/stock history | ~millions | +| `store_products` | Canonical product schema | ~37,000+ rows | +| `store_product_snapshots` | Canonical snapshot schema | growing | + +**LEGACY TABLES (EMPTY - DO NOT USE):** + +| Table | Status | Action | +|-------|--------|--------| +| `stores` | EMPTY (0 rows) | Use `dispensaries` instead | +| `products` | EMPTY (0 rows) | Use `dutchie_products` or `store_products` | +| `categories` | EMPTY (0 rows) | Categories stored in product records | + +**Code must NEVER:** +- Query the `stores` table (use `dispensaries`) +- Query the `products` table (use `dutchie_products` or `store_products`) +- Query the `categories` table (categories are in product records) + **CRITICAL RULES:** - **Migrations ONLY run on `dutchie_menus`** - NEVER on `dutchie_legacy` - **Application code connects ONLY to `dutchie_menus`** 
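The "Code must NEVER" rules in the hunk above could be backed by a small guard. The following is a minimal sketch, not part of this change: `findLegacyTables` and `LEGACY_TABLE_REPLACEMENTS` are hypothetical names, and the check only looks at `FROM`/`JOIN` clauses in a raw SQL string, so it would miss queries built in other ways.

```typescript
// Hypothetical helper (not in the repository): flags queries that read from
// the legacy tables listed in CLAUDE.md.
const LEGACY_TABLE_REPLACEMENTS: Record<string, string> = {
  stores: 'dispensaries',
  products: 'dutchie_products or store_products',
  categories: 'the category fields on product records',
};

// Returns the legacy table names referenced after FROM or JOIN, without duplicates.
function findLegacyTables(sql: string): string[] {
  // Word boundaries keep canonical names such as store_products or
  // dutchie_products from matching.
  const pattern = /\b(?:from|join)\s+(?:public\.)?"?(stores|products|categories)"?\b/gi;
  const found: string[] = [];
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(sql)) !== null) {
    const table = match[1].toLowerCase();
    if (found.indexOf(table) === -1) {
      found.push(table);
    }
  }
  return found;
}

// Throws with a hint naming the canonical replacement for each legacy table.
function assertNoLegacyTables(sql: string): void {
  const legacy = findLegacyTables(sql);
  if (legacy.length > 0) {
    const hints = legacy.map((t) => `${t} (use ${LEGACY_TABLE_REPLACEMENTS[t]})`);
    throw new Error(`Query references legacy tables: ${hints.join(', ')}`);
  }
}
```

A check like this would most plausibly run in tests or a lint step rather than on every query at runtime.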
@@ -615,15 +653,28 @@ export default defineConfig({ ### Detailed Rules -1) **Dispensary vs Store** - - Dutchie pipeline uses `dispensaries` (not legacy `stores`). For dutchie crawls, always work with dispensary ID. +1) **Dispensary = Store (SAME THING)** + - "Dispensary" and "store" are synonyms in CannaiQ. Use interchangeably. + - **API endpoint**: `/api/stores` (NOT `/api/dispensaries`) + - **DB table**: `dispensaries` + - When you need to create/query stores via API, use `/api/stores` - Use the record's `menu_url` and `platform_dispensary_id`. -2) **Menu detection and platform IDs** +2) **API Authentication** + - **Trusted Origins (no auth needed)**: + - IPs: `127.0.0.1`, `::1`, `::ffff:127.0.0.1` + - Origins: `https://cannaiq.co`, `https://findadispo.com`, `https://findagram.co` + - Also: `http://localhost:3010`, `http://localhost:8080`, `http://localhost:5173` + - Requests from trusted IPs/origins get automatic admin access (`role: 'internal'`) + - **Remote (non-trusted)**: Use Bearer token (JWT or API token). NO username/password auth. + - Never try to login with username/password via API - use tokens only. + - See `src/auth/middleware.ts` for `TRUSTED_ORIGINS` and `TRUSTED_IPS` lists. + +3) **Menu detection and platform IDs** - Set `menu_type` from `menu_url` detection; resolve `platform_dispensary_id` for `menu_type='dutchie'`. - Admin should have "refresh detection" and "resolve ID" actions; schedule/crawl only when `menu_type='dutchie'` AND `platform_dispensary_id` is set. -3) **Queries and mapping** +4) **Queries and mapping** - The DB returns snake_case; code expects camelCase. Always alias/map: - `platform_dispensary_id AS "platformDispensaryId"` - Map via `mapDbRowToDispensary` when loading dispensaries (scheduler, crawler, admin crawl). 
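The trusted-origin rule in item 2 above depends on how a header value is compared against the allow-list. The following is a minimal sketch, assuming the check receives the raw `Origin` or `Referer` header. `isTrustedOrigin` is a hypothetical helper, and the short list here is illustrative only; the real list lives in `src/auth/middleware.ts`. Parsing the value and comparing the exact origin avoids prefix collisions such as `https://cannaiq.co.evil.example`.

```typescript
// Illustrative allow-list only; not the list from src/auth/middleware.ts.
const EXAMPLE_TRUSTED_ORIGINS: readonly string[] = [
  'https://cannaiq.co',
  'http://localhost:5173',
];

// Returns true only when the header value parses as an absolute URL whose
// origin (scheme + host + port) is exactly one of the trusted origins.
function isTrustedOrigin(
  headerValue: string | undefined,
  trusted: readonly string[] = EXAMPLE_TRUSTED_ORIGINS
): boolean {
  if (!headerValue) {
    return false;
  }
  try {
    const origin = new URL(headerValue).origin;
    return trusted.indexOf(origin) !== -1;
  } catch {
    // Not a parseable absolute URL, so it cannot be a trusted origin.
    return false;
  }
}
```

`Origin` and `Referer` are set by the client, and a non-browser client can send any value, so a check like this narrows accidental matches but does not authenticate the caller on its own.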
diff --git a/backend/.env b/backend/.env index 74a2486e..574f46e1 100644 --- a/backend/.env +++ b/backend/.env @@ -1,30 +1,52 @@ +# CannaiQ Backend Environment Configuration +# Copy this file to .env and fill in the values + +# Server PORT=3010 NODE_ENV=development # ============================================================================= -# CannaiQ Database (dutchie_menus) - PRIMARY DATABASE +# CANNAIQ DATABASE (dutchie_menus) - PRIMARY DATABASE # ============================================================================= -# This is where all schema migrations run and where canonical tables live. -# All CANNAIQ_DB_* variables are REQUIRED - connection will fail if missing. +# This is where ALL schema migrations run and where canonical tables live. +# All CANNAIQ_DB_* variables are REQUIRED - no defaults. +# The application will fail to start if any are missing. + CANNAIQ_DB_HOST=localhost CANNAIQ_DB_PORT=54320 -CANNAIQ_DB_NAME=dutchie_menus +CANNAIQ_DB_NAME=dutchie_menus # MUST be dutchie_menus - NOT dutchie_legacy CANNAIQ_DB_USER=dutchie CANNAIQ_DB_PASS=dutchie_local_pass +# Alternative: Use a full connection URL instead of individual vars +# If set, this takes priority over individual vars above +# CANNAIQ_DB_URL=postgresql://user:pass@host:port/dutchie_menus + # ============================================================================= -# Legacy Database (dutchie_legacy) - READ-ONLY SOURCE +# LEGACY DATABASE (dutchie_legacy) - READ-ONLY FOR ETL # ============================================================================= # Used ONLY by ETL scripts to read historical data. # NEVER run migrations against this database. 
+# These are only needed when running 042_legacy_import.ts + LEGACY_DB_HOST=localhost LEGACY_DB_PORT=54320 -LEGACY_DB_NAME=dutchie_legacy +LEGACY_DB_NAME=dutchie_legacy # READ-ONLY - never migrated LEGACY_DB_USER=dutchie -LEGACY_DB_PASS=dutchie_local_pass +LEGACY_DB_PASS= -# Local image storage (no MinIO per CLAUDE.md) +# Alternative: Use a full connection URL instead of individual vars +# LEGACY_DB_URL=postgresql://user:pass@host:port/dutchie_legacy + +# ============================================================================= +# LOCAL STORAGE +# ============================================================================= +# Local image storage path (no MinIO) LOCAL_IMAGES_PATH=./public/images -# JWT +# ============================================================================= +# AUTHENTICATION +# ============================================================================= JWT_SECRET=your-secret-key-change-in-production +ANTHROPIC_API_KEY=<redacted> +OPENAI_API_KEY=<redacted> diff --git a/backend/migrations/057_crawl_enabled_and_dutchie_verified.sql b/backend/migrations/057_crawl_enabled_and_dutchie_verified.sql new file mode 100644 index 00000000..576e7ffe --- /dev/null +++ b/backend/migrations/057_crawl_enabled_and_dutchie_verified.sql @@ -0,0 +1,42 @@ +-- Migration 057: Add crawl_enabled and dutchie_verified fields to dispensaries +-- +-- Purpose: +-- 1. Add crawl_enabled to control which dispensaries get crawled +-- 2. Add dutchie_verified to track Dutchie source-of-truth verification +-- 3. 
Default existing records to crawl_enabled = TRUE to preserve behavior +-- +-- After this migration, run the harmonization script to: +-- - Match dispensaries to Dutchie discoveries +-- - Update platform_dispensary_id from Dutchie +-- - Set dutchie_verified = TRUE for matches +-- - Set crawl_enabled = FALSE for unverified records + +-- Add crawl_enabled column (defaults to true to not break existing crawls) +ALTER TABLE dispensaries +ADD COLUMN IF NOT EXISTS crawl_enabled BOOLEAN DEFAULT TRUE; + +-- Add dutchie_verified column to track if record is verified against Dutchie +ALTER TABLE dispensaries +ADD COLUMN IF NOT EXISTS dutchie_verified BOOLEAN DEFAULT FALSE; + +-- Add dutchie_verified_at timestamp +ALTER TABLE dispensaries +ADD COLUMN IF NOT EXISTS dutchie_verified_at TIMESTAMP WITH TIME ZONE; + +-- Add dutchie_discovery_id to link back to the discovery record +ALTER TABLE dispensaries +ADD COLUMN IF NOT EXISTS dutchie_discovery_id BIGINT REFERENCES dutchie_discovery_locations(id); + +-- Create index for crawl queries (only crawl enabled dispensaries) +CREATE INDEX IF NOT EXISTS idx_dispensaries_crawl_enabled +ON dispensaries(crawl_enabled, state) +WHERE crawl_enabled = TRUE; + +-- Create index for dutchie verification status +CREATE INDEX IF NOT EXISTS idx_dispensaries_dutchie_verified +ON dispensaries(dutchie_verified, state); + +COMMENT ON COLUMN dispensaries.crawl_enabled IS 'Whether this dispensary should be included in crawl jobs. 
Set to FALSE for unverified or problematic records.'; +COMMENT ON COLUMN dispensaries.dutchie_verified IS 'Whether this dispensary has been verified against Dutchie source of truth (matched by slug or manually linked).'; +COMMENT ON COLUMN dispensaries.dutchie_verified_at IS 'Timestamp when Dutchie verification was completed.'; +COMMENT ON COLUMN dispensaries.dutchie_discovery_id IS 'Link to the dutchie_discovery_locations record this was matched/verified against.'; diff --git a/backend/migrations/065_slug_verification_tracking.sql b/backend/migrations/065_slug_verification_tracking.sql new file mode 100644 index 00000000..61c9de01 --- /dev/null +++ b/backend/migrations/065_slug_verification_tracking.sql @@ -0,0 +1,56 @@ +-- Migration 065: Slug verification and data source tracking +-- Adds columns to track when slug/menu data was verified and from what source + +-- Add slug verification columns to dispensaries +ALTER TABLE dispensaries +ADD COLUMN IF NOT EXISTS slug_source VARCHAR(50), +ADD COLUMN IF NOT EXISTS slug_verified_at TIMESTAMPTZ, +ADD COLUMN IF NOT EXISTS slug_status VARCHAR(20) DEFAULT 'unverified', +ADD COLUMN IF NOT EXISTS menu_url_source VARCHAR(50), +ADD COLUMN IF NOT EXISTS menu_url_verified_at TIMESTAMPTZ, +ADD COLUMN IF NOT EXISTS platform_id_source VARCHAR(50), +ADD COLUMN IF NOT EXISTS platform_id_verified_at TIMESTAMPTZ, +ADD COLUMN IF NOT EXISTS country VARCHAR(2) DEFAULT 'US'; + +-- Add index for finding unverified stores +CREATE INDEX IF NOT EXISTS idx_dispensaries_slug_status +ON dispensaries(slug_status) +WHERE slug_status != 'verified'; + +-- Add index for country +CREATE INDEX IF NOT EXISTS idx_dispensaries_country +ON dispensaries(country); + +-- Comment on columns +COMMENT ON COLUMN dispensaries.slug_source IS 'Source of slug data: dutchie_api, manual, azdhs, discovery, etc.'; +COMMENT ON COLUMN dispensaries.slug_verified_at IS 'When the slug was last verified against the source'; +COMMENT ON COLUMN dispensaries.slug_status IS 
'Status: unverified, verified, invalid, changed'; +COMMENT ON COLUMN dispensaries.menu_url_source IS 'Source of menu_url: dutchie_api, website_scrape, manual, etc.'; +COMMENT ON COLUMN dispensaries.menu_url_verified_at IS 'When the menu_url was last verified'; +COMMENT ON COLUMN dispensaries.platform_id_source IS 'Source of platform_dispensary_id: dutchie_api, graphql_resolution, etc.'; +COMMENT ON COLUMN dispensaries.platform_id_verified_at IS 'When the platform_dispensary_id was last verified'; +COMMENT ON COLUMN dispensaries.country IS 'ISO 2-letter country code: US, CA, etc.'; + +-- Update Green Pharms Mesa with verified Dutchie data +UPDATE dispensaries +SET + slug = 'green-pharms-mesa', + menu_url = 'https://dutchie.com/embedded-menu/green-pharms-mesa', + menu_type = 'dutchie', + platform_dispensary_id = '68dc47a2af90f2e653f8df30', + slug_source = 'dutchie_api', + slug_verified_at = NOW(), + slug_status = 'verified', + menu_url_source = 'dutchie_api', + menu_url_verified_at = NOW(), + platform_id_source = 'dutchie_api', + platform_id_verified_at = NOW(), + updated_at = NOW() +WHERE id = 232; + +-- Mark all other AZ dispensaries as needing verification +UPDATE dispensaries +SET slug_status = 'unverified' +WHERE state = 'AZ' + AND id != 232 + AND (slug_status IS NULL OR slug_status = 'unverified'); diff --git a/backend/src/auth/middleware.ts b/backend/src/auth/middleware.ts index 5c4b3166..4df7ff64 100755 --- a/backend/src/auth/middleware.ts +++ b/backend/src/auth/middleware.ts @@ -1,3 +1,14 @@ +/** + * CannaiQ Authentication Middleware + * + * AUTH METHODS (in order of priority): + * 1. IP-based: Localhost/trusted IPs get 'internal' role (full access, no token needed) + * 2. Token-based: Bearer token (JWT or API token) + * + * NO username/password auth in API. Use tokens only. + * + * Localhost bypass: curl from 127.0.0.1 gets automatic admin access. 
+ */ import { Request, Response, NextFunction } from 'express'; import jwt from 'jsonwebtoken'; import bcrypt from 'bcrypt'; @@ -5,6 +16,61 @@ import { pool } from '../db/pool'; const JWT_SECRET = process.env.JWT_SECRET || 'change_this_in_production'; +// Trusted origins that bypass auth for internal/same-origin requests +const TRUSTED_ORIGINS = [ + 'https://cannaiq.co', + 'https://www.cannaiq.co', + 'https://findadispo.com', + 'https://www.findadispo.com', + 'https://findagram.co', + 'https://www.findagram.co', + 'http://localhost:3010', + 'http://localhost:8080', + 'http://localhost:5173', +]; + +// Trusted IPs for internal pod-to-pod communication +const TRUSTED_IPS = [ + '127.0.0.1', + '::1', + '::ffff:127.0.0.1', +]; + +/** + * Check if request is from a trusted origin/IP + */ +function isTrustedRequest(req: Request): boolean { + // Check origin header + const origin = req.headers.origin; + if (origin && TRUSTED_ORIGINS.includes(origin)) { + return true; + } + + // Check referer header (for same-origin requests without CORS) + const referer = req.headers.referer; + if (referer) { + for (const trusted of TRUSTED_ORIGINS) { + // Match the origin exactly or as a path prefix, so look-alike hosts (e.g. cannaiq.co.evil.example) do not pass + if (referer === trusted || referer.startsWith(trusted + '/')) { + return true; + } + } + } + + // Check IP for internal requests (pod-to-pod, localhost) + const clientIp = req.ip || req.socket.remoteAddress || ''; + if (TRUSTED_IPS.includes(clientIp)) { + return true; + } + + // Check for Kubernetes internal header (set by ingress/service mesh); only honored when the secret is configured, otherwise undefined === undefined would trust every request + const internalHeader = req.headers['x-internal-request']; + if (process.env.INTERNAL_REQUEST_SECRET && internalHeader === process.env.INTERNAL_REQUEST_SECRET) { + return true; + } + + return false; +} + export interface AuthUser { id: number; email: string; @@ -61,6 +127,16 @@ export async function authenticateUser(email: string, password: string): Promise } export async function authMiddleware(req: AuthRequest, res: Response, next: NextFunction) { + // Allow trusted origins/IPs to bypass auth (internal services, same-origin) + if (isTrustedRequest(req)) { + 
req.user = { + id: 0, + email: 'internal@system', + role: 'internal' + }; + return next(); + } + const authHeader = req.headers.authorization; if (!authHeader || !authHeader.startsWith('Bearer ')) { @@ -135,12 +211,23 @@ export async function authMiddleware(req: AuthRequest, res: Response, next: Next } } +/** + * Require specific role(s) to access endpoint. + * + * NOTE: 'internal' role (localhost/trusted IPs) bypasses all role checks. + * This allows local development and internal services full access. + */ export function requireRole(...roles: string[]) { return (req: AuthRequest, res: Response, next: NextFunction) => { if (!req.user) { return res.status(401).json({ error: 'Not authenticated' }); } + // Internal role (localhost) bypasses role checks + if (req.user.role === 'internal') { + return next(); + } + if (!roles.includes(req.user.role)) { return res.status(403).json({ error: 'Insufficient permissions' }); } diff --git a/backend/src/canonical-hydration/hydration-service.ts b/backend/src/canonical-hydration/hydration-service.ts index 9921818f..ecd6b445 100644 --- a/backend/src/canonical-hydration/hydration-service.ts +++ b/backend/src/canonical-hydration/hydration-service.ts @@ -472,7 +472,8 @@ export class CanonicalHydrationService { } // Step 3: Create initial snapshots from current product state - const snapshotsWritten = await this.createInitialSnapshots(dispensaryId, crawlRunId); + // crawlRunId is guaranteed to be set at this point (either from existing run or insert) + const snapshotsWritten = await this.createInitialSnapshots(dispensaryId, crawlRunId!); result.snapshotsWritten += snapshotsWritten; // Update crawl run with snapshot count diff --git a/backend/src/cli.ts b/backend/src/cli.ts index b6e77ffe..2b19ed6c 100644 --- a/backend/src/cli.ts +++ b/backend/src/cli.ts @@ -50,15 +50,9 @@ async function main() { showHelp(); } - if (args.includes('--worker')) { - console.log('[CLI] Starting worker process...'); - const { startWorker } = await 
import('./dutchie-az/services/worker'); - await startWorker(); - } else { - // Default: start API server - console.log('[CLI] Starting API server...'); - await import('./index'); - } + // Default: start API server + console.log('[CLI] Starting API server...'); + await import('./index'); } main().catch((error) => { diff --git a/backend/src/crawlers/base/base-dutchie.ts b/backend/src/crawlers/base/base-dutchie.ts deleted file mode 100644 index f612fae7..00000000 --- a/backend/src/crawlers/base/base-dutchie.ts +++ /dev/null @@ -1,657 +0,0 @@ -/** - * Base Dutchie Crawler Template - * - * This is the base template for all Dutchie store crawlers. - * Per-store crawlers extend this by overriding specific methods. - * - * Exports: - * - crawlProducts(dispensary, options) - Main crawl entry point - * - detectStructure(page) - Detect page structure for sandbox mode - * - extractProducts(document) - Extract product data - * - extractImages(document) - Extract product images - * - extractStock(document) - Extract stock status - * - extractPagination(document) - Extract pagination info - */ - -import { - crawlDispensaryProducts as baseCrawlDispensaryProducts, - CrawlResult, -} from '../../dutchie-az/services/product-crawler'; -import { Dispensary, CrawlerProfileOptions } from '../../dutchie-az/types'; - -// Re-export CrawlResult for convenience -export { CrawlResult }; - -// ============================================================ -// TYPES -// ============================================================ - -/** - * Options passed to the per-store crawler - */ -export interface StoreCrawlOptions { - pricingType?: 'rec' | 'med'; - useBothModes?: boolean; - downloadImages?: boolean; - trackStock?: boolean; - timeoutMs?: number; - config?: Record; -} - -/** - * Progress callback for reporting crawl progress - */ -export interface CrawlProgressCallback { - phase: 'fetching' | 'processing' | 'saving' | 'images' | 'complete'; - current: number; - total: number; - message?: 
string; -} - -/** - * Structure detection result for sandbox mode - */ -export interface StructureDetectionResult { - success: boolean; - menuType: 'dutchie' | 'treez' | 'jane' | 'unknown'; - iframeUrl?: string; - graphqlEndpoint?: string; - dispensaryId?: string; - selectors: { - productContainer?: string; - productName?: string; - productPrice?: string; - productImage?: string; - productCategory?: string; - pagination?: string; - loadMore?: string; - }; - pagination: { - type: 'scroll' | 'click' | 'graphql' | 'none'; - hasMore?: boolean; - pageSize?: number; - }; - errors: string[]; - metadata: Record; -} - -/** - * Product extraction result - */ -export interface ExtractedProduct { - externalId: string; - name: string; - brand?: string; - category?: string; - subcategory?: string; - price?: number; - priceRec?: number; - priceMed?: number; - weight?: string; - thcContent?: string; - cbdContent?: string; - description?: string; - imageUrl?: string; - stockStatus?: 'in_stock' | 'out_of_stock' | 'low_stock' | 'unknown'; - quantity?: number; - raw?: Record; -} - -/** - * Image extraction result - */ -export interface ExtractedImage { - productId: string; - imageUrl: string; - isPrimary: boolean; - position: number; -} - -/** - * Stock extraction result - */ -export interface ExtractedStock { - productId: string; - status: 'in_stock' | 'out_of_stock' | 'low_stock' | 'unknown'; - quantity?: number; - lastChecked: Date; -} - -/** - * Pagination extraction result - */ -export interface ExtractedPagination { - hasNextPage: boolean; - currentPage?: number; - totalPages?: number; - totalProducts?: number; - nextCursor?: string; - loadMoreSelector?: string; -} - -/** - * Hook points that per-store crawlers can override - */ -export interface DutchieCrawlerHooks { - /** - * Called before fetching products - * Can be used to set up custom headers, cookies, etc. 
- */ - beforeFetch?: (dispensary: Dispensary) => Promise; - - /** - * Called after fetching products, before processing - * Can be used to filter or transform raw products - */ - afterFetch?: (products: any[], dispensary: Dispensary) => Promise; - - /** - * Called after all processing is complete - * Can be used for cleanup or post-processing - */ - afterComplete?: (result: CrawlResult, dispensary: Dispensary) => Promise; - - /** - * Custom selector resolver for iframe detection - */ - resolveIframe?: (page: any) => Promise; - - /** - * Custom product container selector - */ - getProductContainerSelector?: () => string; - - /** - * Custom product extraction from container element - */ - extractProductFromElement?: (element: any) => Promise; -} - -/** - * Selectors configuration for per-store overrides - */ -export interface DutchieSelectors { - iframe?: string; - productContainer?: string; - productName?: string; - productPrice?: string; - productPriceRec?: string; - productPriceMed?: string; - productImage?: string; - productCategory?: string; - productBrand?: string; - productWeight?: string; - productThc?: string; - productCbd?: string; - productDescription?: string; - productStock?: string; - loadMore?: string; - pagination?: string; -} - -// ============================================================ -// DEFAULT SELECTORS -// ============================================================ - -export const DEFAULT_DUTCHIE_SELECTORS: DutchieSelectors = { - iframe: 'iframe[src*="dutchie.com"]', - productContainer: '[data-testid="product-card"], .product-card, [class*="ProductCard"]', - productName: '[data-testid="product-title"], .product-title, [class*="ProductTitle"]', - productPrice: '[data-testid="product-price"], .product-price, [class*="ProductPrice"]', - productImage: 'img[src*="dutchie"], img[src*="product"], .product-image img', - productCategory: '[data-testid="category-name"], .category-name', - productBrand: '[data-testid="brand-name"], .brand-name, 
[class*="BrandName"]', - loadMore: 'button[data-testid="load-more"], .load-more-button', - pagination: '.pagination, [class*="Pagination"]', -}; - -// ============================================================ -// BASE CRAWLER CLASS -// ============================================================ - -/** - * BaseDutchieCrawler - Base class for all Dutchie store crawlers - * - * Per-store crawlers extend this class and override methods as needed. - * The default implementation delegates to the existing shared Dutchie logic. - */ -export class BaseDutchieCrawler { - protected dispensary: Dispensary; - protected options: StoreCrawlOptions; - protected hooks: DutchieCrawlerHooks; - protected selectors: DutchieSelectors; - - constructor( - dispensary: Dispensary, - options: StoreCrawlOptions = {}, - hooks: DutchieCrawlerHooks = {}, - selectors: DutchieSelectors = {} - ) { - this.dispensary = dispensary; - this.options = { - pricingType: 'rec', - useBothModes: true, - downloadImages: true, - trackStock: true, - timeoutMs: 30000, - ...options, - }; - this.hooks = hooks; - this.selectors = { ...DEFAULT_DUTCHIE_SELECTORS, ...selectors }; - } - - /** - * Main entry point - crawl products for this dispensary - * Override this in per-store crawlers to customize behavior - */ - async crawlProducts(): Promise { - // Call beforeFetch hook if defined - if (this.hooks.beforeFetch) { - await this.hooks.beforeFetch(this.dispensary); - } - - // Use the existing shared Dutchie crawl logic - const result = await baseCrawlDispensaryProducts( - this.dispensary, - this.options.pricingType || 'rec', - { - useBothModes: this.options.useBothModes, - downloadImages: this.options.downloadImages, - } - ); - - // Call afterComplete hook if defined - if (this.hooks.afterComplete) { - await this.hooks.afterComplete(result, this.dispensary); - } - - return result; - } - - /** - * Detect page structure for sandbox discovery mode - * Override in per-store crawlers if needed - * - * @param page - 
Puppeteer page object or HTML string - * @returns Structure detection result - */ - async detectStructure(page: any): Promise { - const result: StructureDetectionResult = { - success: false, - menuType: 'unknown', - selectors: {}, - pagination: { type: 'none' }, - errors: [], - metadata: {}, - }; - - try { - // Default implementation: check for Dutchie iframe - if (typeof page === 'string') { - // HTML string mode - if (page.includes('dutchie.com')) { - result.menuType = 'dutchie'; - result.success = true; - } - } else if (page && typeof page.evaluate === 'function') { - // Puppeteer page mode - const detection = await page.evaluate((selectorConfig: DutchieSelectors) => { - const iframe = document.querySelector(selectorConfig.iframe || '') as HTMLIFrameElement; - const iframeUrl = iframe?.src || null; - - // Check for product containers - const containers = document.querySelectorAll(selectorConfig.productContainer || ''); - - return { - hasIframe: !!iframe, - iframeUrl, - productCount: containers.length, - isDutchie: !!iframeUrl?.includes('dutchie.com'), - }; - }, this.selectors); - - if (detection.isDutchie) { - result.menuType = 'dutchie'; - result.iframeUrl = detection.iframeUrl; - result.success = true; - } - - result.metadata = detection; - } - - // Set default selectors for Dutchie - if (result.menuType === 'dutchie') { - result.selectors = { - productContainer: this.selectors.productContainer, - productName: this.selectors.productName, - productPrice: this.selectors.productPrice, - productImage: this.selectors.productImage, - productCategory: this.selectors.productCategory, - }; - result.pagination = { type: 'graphql' }; - } - } catch (error: any) { - result.errors.push(`Detection error: ${error.message}`); - } - - return result; - } - - /** - * Extract products from page/document - * Override in per-store crawlers for custom extraction - * - * @param document - DOM document, Puppeteer page, or raw products array - * @returns Array of extracted products - */ 
- async extractProducts(document: any): Promise { - // Default implementation: assume document is already an array of products - // from the GraphQL response - if (Array.isArray(document)) { - return document.map((product) => this.mapRawProduct(product)); - } - - // If document is a Puppeteer page, extract from DOM - if (document && typeof document.evaluate === 'function') { - return this.extractProductsFromPage(document); - } - - return []; - } - - /** - * Extract products from Puppeteer page - * Override for custom DOM extraction - */ - protected async extractProductsFromPage(page: any): Promise { - const products = await page.evaluate((selectors: DutchieSelectors) => { - const containers = document.querySelectorAll(selectors.productContainer || ''); - return Array.from(containers).map((container) => { - const nameEl = container.querySelector(selectors.productName || ''); - const priceEl = container.querySelector(selectors.productPrice || ''); - const imageEl = container.querySelector(selectors.productImage || '') as HTMLImageElement; - const brandEl = container.querySelector(selectors.productBrand || ''); - - return { - name: nameEl?.textContent?.trim() || '', - price: priceEl?.textContent?.trim() || '', - imageUrl: imageEl?.src || '', - brand: brandEl?.textContent?.trim() || '', - }; - }); - }, this.selectors); - - return products.map((p: any, i: number) => ({ - externalId: `dom-product-${i}`, - name: p.name, - brand: p.brand, - price: this.parsePrice(p.price), - imageUrl: p.imageUrl, - stockStatus: 'unknown' as const, - })); - } - - /** - * Map raw product from GraphQL to ExtractedProduct - * Override for custom mapping - */ - protected mapRawProduct(raw: any): ExtractedProduct { - return { - externalId: raw.id || raw._id || raw.externalId, - name: raw.name || raw.Name, - brand: raw.brand?.name || raw.brandName || raw.brand, - category: raw.type || raw.category || raw.Category, - subcategory: raw.subcategory || raw.Subcategory, - price: raw.recPrice || 
raw.price || raw.Price, - priceRec: raw.recPrice || raw.Prices?.rec, - priceMed: raw.medPrice || raw.Prices?.med, - weight: raw.weight || raw.Weight, - thcContent: raw.potencyThc?.formatted || raw.THCContent?.formatted, - cbdContent: raw.potencyCbd?.formatted || raw.CBDContent?.formatted, - description: raw.description || raw.Description, - imageUrl: raw.image || raw.Image, - stockStatus: this.mapStockStatus(raw), - quantity: raw.quantity || raw.Quantity, - raw, - }; - } - - /** - * Map raw stock status to standardized value - */ - protected mapStockStatus(raw: any): 'in_stock' | 'out_of_stock' | 'low_stock' | 'unknown' { - const status = raw.Status || raw.status || raw.stockStatus; - if (status === 'Active' || status === 'active' || status === 'in_stock') { - return 'in_stock'; - } - if (status === 'Inactive' || status === 'inactive' || status === 'out_of_stock') { - return 'out_of_stock'; - } - if (status === 'low_stock') { - return 'low_stock'; - } - return 'unknown'; - } - - /** - * Parse price string to number - */ - protected parsePrice(priceStr: string): number | undefined { - if (!priceStr) return undefined; - const cleaned = priceStr.replace(/[^0-9.]/g, ''); - const num = parseFloat(cleaned); - return isNaN(num) ? 
undefined : num; - } - - /** - * Extract images from document - * Override for custom image extraction - * - * @param document - DOM document, Puppeteer page, or products array - * @returns Array of extracted images - */ - async extractImages(document: any): Promise { - if (Array.isArray(document)) { - return document - .filter((p) => p.image || p.Image || p.imageUrl) - .map((p, i) => ({ - productId: p.id || p._id || `product-${i}`, - imageUrl: p.image || p.Image || p.imageUrl, - isPrimary: true, - position: 0, - })); - } - - // Puppeteer page extraction - if (document && typeof document.evaluate === 'function') { - return this.extractImagesFromPage(document); - } - - return []; - } - - /** - * Extract images from Puppeteer page - */ - protected async extractImagesFromPage(page: any): Promise { - const images = await page.evaluate((selector: string) => { - const imgs = document.querySelectorAll(selector); - return Array.from(imgs).map((img, i) => ({ - src: (img as HTMLImageElement).src, - position: i, - })); - }, this.selectors.productImage || 'img'); - - return images.map((img: any, i: number) => ({ - productId: `dom-product-${i}`, - imageUrl: img.src, - isPrimary: i === 0, - position: img.position, - })); - } - - /** - * Extract stock information from document - * Override for custom stock extraction - * - * @param document - DOM document, Puppeteer page, or products array - * @returns Array of extracted stock statuses - */ - async extractStock(document: any): Promise { - if (Array.isArray(document)) { - return document.map((p) => ({ - productId: p.id || p._id || p.externalId, - status: this.mapStockStatus(p), - quantity: p.quantity || p.Quantity, - lastChecked: new Date(), - })); - } - - return []; - } - - /** - * Extract pagination information from document - * Override for custom pagination handling - * - * @param document - DOM document, Puppeteer page, or GraphQL response - * @returns Pagination info - */ - async extractPagination(document: any): Promise { - 
// Default: check for page info in GraphQL response - if (document && document.pageInfo) { - return { - hasNextPage: document.pageInfo.hasNextPage || false, - currentPage: document.pageInfo.currentPage, - totalPages: document.pageInfo.totalPages, - totalProducts: document.pageInfo.totalCount || document.totalCount, - nextCursor: document.pageInfo.endCursor, - }; - } - - // Default: no pagination - return { - hasNextPage: false, - }; - } - - /** - * Get the cName (Dutchie slug) for this dispensary - * Override to customize cName extraction - */ - getCName(): string { - if (this.dispensary.menuUrl) { - try { - const url = new URL(this.dispensary.menuUrl); - const segments = url.pathname.split('/').filter(Boolean); - if (segments.length >= 2) { - return segments[segments.length - 1]; - } - } catch { - // Fall through to default - } - } - return this.dispensary.slug || ''; - } - - /** - * Get custom headers for API requests - * Override for store-specific headers - */ - getCustomHeaders(): Record { - const cName = this.getCName(); - return { - 'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36', - Origin: 'https://dutchie.com', - Referer: `https://dutchie.com/embedded-menu/${cName}`, - }; - } -} - -// ============================================================ -// FACTORY FUNCTION -// ============================================================ - -/** - * Create a base Dutchie crawler instance - * This is the default export used when no per-store override exists - */ -export function createCrawler( - dispensary: Dispensary, - options: StoreCrawlOptions = {}, - hooks: DutchieCrawlerHooks = {}, - selectors: DutchieSelectors = {} -): BaseDutchieCrawler { - return new BaseDutchieCrawler(dispensary, options, hooks, selectors); -} - -// ============================================================ -// STANDALONE FUNCTIONS (required exports for orchestrator) -// 
============================================================ - -/** - * Crawl products using the base Dutchie logic - * Per-store files can call this or override it completely - */ -export async function crawlProducts( - dispensary: Dispensary, - options: StoreCrawlOptions = {} -): Promise { - const crawler = createCrawler(dispensary, options); - return crawler.crawlProducts(); -} - -/** - * Detect structure using the base Dutchie logic - */ -export async function detectStructure( - page: any, - dispensary?: Dispensary -): Promise { - const crawler = createCrawler(dispensary || ({} as Dispensary)); - return crawler.detectStructure(page); -} - -/** - * Extract products using the base Dutchie logic - */ -export async function extractProducts( - document: any, - dispensary?: Dispensary -): Promise { - const crawler = createCrawler(dispensary || ({} as Dispensary)); - return crawler.extractProducts(document); -} - -/** - * Extract images using the base Dutchie logic - */ -export async function extractImages( - document: any, - dispensary?: Dispensary -): Promise { - const crawler = createCrawler(dispensary || ({} as Dispensary)); - return crawler.extractImages(document); -} - -/** - * Extract stock using the base Dutchie logic - */ -export async function extractStock( - document: any, - dispensary?: Dispensary -): Promise { - const crawler = createCrawler(dispensary || ({} as Dispensary)); - return crawler.extractStock(document); -} - -/** - * Extract pagination using the base Dutchie logic - */ -export async function extractPagination( - document: any, - dispensary?: Dispensary -): Promise { - const crawler = createCrawler(dispensary || ({} as Dispensary)); - return crawler.extractPagination(document); -} diff --git a/backend/src/crawlers/base/base-jane.ts b/backend/src/crawlers/base/base-jane.ts deleted file mode 100644 index ffc46d8b..00000000 --- a/backend/src/crawlers/base/base-jane.ts +++ /dev/null @@ -1,330 +0,0 @@ -/** - * Base Jane Crawler Template 
(PLACEHOLDER) - * - * This is the base template for all Jane (iheartjane) store crawlers. - * Per-store crawlers extend this by overriding specific methods. - * - * TODO: Implement Jane-specific crawling logic (Algolia-based) - */ - -import { Dispensary } from '../../dutchie-az/types'; -import { - StoreCrawlOptions, - CrawlResult, - StructureDetectionResult, - ExtractedProduct, - ExtractedImage, - ExtractedStock, - ExtractedPagination, -} from './base-dutchie'; - -// Re-export types -export { - StoreCrawlOptions, - CrawlResult, - StructureDetectionResult, - ExtractedProduct, - ExtractedImage, - ExtractedStock, - ExtractedPagination, -}; - -// ============================================================ -// JANE-SPECIFIC TYPES -// ============================================================ - -export interface JaneConfig { - algoliaAppId?: string; - algoliaApiKey?: string; - algoliaIndex?: string; - storeId?: string; -} - -export interface JaneSelectors { - productContainer?: string; - productName?: string; - productPrice?: string; - productImage?: string; - productCategory?: string; - productBrand?: string; - pagination?: string; - loadMore?: string; -} - -export const DEFAULT_JANE_SELECTORS: JaneSelectors = { - productContainer: '[data-testid="product-card"], .product-card', - productName: '[data-testid="product-name"], .product-name', - productPrice: '[data-testid="product-price"], .product-price', - productImage: '.product-image img, [data-testid="product-image"] img', - productCategory: '.product-category', - productBrand: '.product-brand, [data-testid="brand-name"]', - loadMore: '[data-testid="load-more"], .load-more-btn', -}; - -// ============================================================ -// BASE JANE CRAWLER CLASS -// ============================================================ - -export class BaseJaneCrawler { - protected dispensary: Dispensary; - protected options: StoreCrawlOptions; - protected selectors: JaneSelectors; - protected janeConfig: 
JaneConfig; - - constructor( - dispensary: Dispensary, - options: StoreCrawlOptions = {}, - selectors: JaneSelectors = {}, - janeConfig: JaneConfig = {} - ) { - this.dispensary = dispensary; - this.options = { - pricingType: 'rec', - useBothModes: false, - downloadImages: true, - trackStock: true, - timeoutMs: 30000, - ...options, - }; - this.selectors = { ...DEFAULT_JANE_SELECTORS, ...selectors }; - this.janeConfig = janeConfig; - } - - /** - * Main entry point - crawl products for this dispensary - * TODO: Implement Jane/Algolia-specific crawling - */ - async crawlProducts(): Promise { - const startTime = Date.now(); - console.warn(`[BaseJaneCrawler] Jane crawling not yet implemented for ${this.dispensary.name}`); - return { - success: false, - dispensaryId: this.dispensary.id || 0, - productsFound: 0, - productsFetched: 0, - productsUpserted: 0, - snapshotsCreated: 0, - imagesDownloaded: 0, - errorMessage: 'Jane crawler not yet implemented', - durationMs: Date.now() - startTime, - }; - } - - /** - * Detect page structure for sandbox discovery mode - * Jane uses Algolia, so we look for Algolia config - */ - async detectStructure(page: any): Promise { - const result: StructureDetectionResult = { - success: false, - menuType: 'unknown', - selectors: {}, - pagination: { type: 'none' }, - errors: [], - metadata: {}, - }; - - try { - if (page && typeof page.evaluate === 'function') { - // Look for Jane/Algolia indicators - const detection = await page.evaluate(() => { - // Check for iheartjane in page - const hasJane = document.documentElement.innerHTML.includes('iheartjane') || - document.documentElement.innerHTML.includes('jane-menu'); - - // Look for Algolia config - const scripts = Array.from(document.querySelectorAll('script')); - let algoliaConfig: any = null; - - for (const script of scripts) { - const content = script.textContent || ''; - if (content.includes('algolia') || content.includes('ALGOLIA')) { - // Try to extract config - const appIdMatch = 
content.match(/applicationId['":\s]+['"]([^'"]+)['"]/); - const apiKeyMatch = content.match(/apiKey['":\s]+['"]([^'"]+)['"]/); - if (appIdMatch && apiKeyMatch) { - algoliaConfig = { - appId: appIdMatch[1], - apiKey: apiKeyMatch[1], - }; - } - } - } - - return { - hasJane, - algoliaConfig, - }; - }); - - if (detection.hasJane) { - result.menuType = 'jane'; - result.success = true; - result.metadata = detection; - - if (detection.algoliaConfig) { - result.metadata.algoliaAppId = detection.algoliaConfig.appId; - result.metadata.algoliaApiKey = detection.algoliaConfig.apiKey; - } - } - } - } catch (error: any) { - result.errors.push(`Detection error: ${error.message}`); - } - - return result; - } - - /** - * Extract products from Algolia response or page - */ - async extractProducts(document: any): Promise { - // If document is Algolia hits array - if (Array.isArray(document)) { - return document.map((hit) => this.mapAlgoliaHit(hit)); - } - - console.warn('[BaseJaneCrawler] extractProducts not yet fully implemented'); - return []; - } - - /** - * Map Algolia hit to ExtractedProduct - */ - protected mapAlgoliaHit(hit: any): ExtractedProduct { - return { - externalId: hit.objectID || hit.id || hit.product_id, - name: hit.name || hit.product_name, - brand: hit.brand || hit.brand_name, - category: hit.category || hit.kind, - subcategory: hit.subcategory, - price: hit.price || hit.bucket_price, - priceRec: hit.prices?.rec || hit.price_rec, - priceMed: hit.prices?.med || hit.price_med, - weight: hit.weight || hit.amount, - thcContent: hit.percent_thc ? `${hit.percent_thc}%` : undefined, - cbdContent: hit.percent_cbd ? `${hit.percent_cbd}%` : undefined, - description: hit.description, - imageUrl: hit.image_url || hit.product_image_url, - stockStatus: hit.available ? 
'in_stock' : 'out_of_stock', - quantity: hit.quantity_available, - raw: hit, - }; - } - - /** - * Extract images from document - */ - async extractImages(document: any): Promise<ExtractedImage[]> { - if (Array.isArray(document)) { - return document - .filter((hit) => hit.image_url || hit.product_image_url) - .map((hit, i) => ({ - productId: hit.objectID || hit.id || `jane-product-${i}`, - imageUrl: hit.image_url || hit.product_image_url, - isPrimary: true, - position: 0, - })); - } - - return []; - } - - /** - * Extract stock information from document - */ - async extractStock(document: any): Promise<ExtractedStock[]> { - if (Array.isArray(document)) { - return document.map((hit) => ({ - productId: hit.objectID || hit.id, - status: hit.available ? 'in_stock' as const : 'out_of_stock' as const, - quantity: hit.quantity_available, - lastChecked: new Date(), - })); - } - - return []; - } - - /** - * Extract pagination information - * Algolia search responses are page-based (page, nbPages, nbHits) - */ - async extractPagination(document: any): Promise<ExtractedPagination> { - if (document && typeof document === 'object' && !Array.isArray(document)) { - return { - hasNextPage: document.page < document.nbPages - 1, - currentPage: document.page, - totalPages: document.nbPages, - totalProducts: document.nbHits, - }; - } - - return { hasNextPage: false }; - } -} - -// ============================================================ -// FACTORY FUNCTION -// ============================================================ - -export function createCrawler( - dispensary: Dispensary, - options: StoreCrawlOptions = {}, - selectors: JaneSelectors = {}, - janeConfig: JaneConfig = {} -): BaseJaneCrawler { - return new BaseJaneCrawler(dispensary, options, selectors, janeConfig); -} - -// ============================================================ -// STANDALONE FUNCTIONS -// ============================================================ - -export async function crawlProducts( - dispensary: Dispensary, - options: StoreCrawlOptions = {} -): Promise<CrawlResult> { - const crawler = 
createCrawler(dispensary, options); - return crawler.crawlProducts(); -} - -export async function detectStructure( - page: any, - dispensary?: Dispensary -): Promise { - const crawler = createCrawler(dispensary || ({} as Dispensary)); - return crawler.detectStructure(page); -} - -export async function extractProducts( - document: any, - dispensary?: Dispensary -): Promise { - const crawler = createCrawler(dispensary || ({} as Dispensary)); - return crawler.extractProducts(document); -} - -export async function extractImages( - document: any, - dispensary?: Dispensary -): Promise { - const crawler = createCrawler(dispensary || ({} as Dispensary)); - return crawler.extractImages(document); -} - -export async function extractStock( - document: any, - dispensary?: Dispensary -): Promise { - const crawler = createCrawler(dispensary || ({} as Dispensary)); - return crawler.extractStock(document); -} - -export async function extractPagination( - document: any, - dispensary?: Dispensary -): Promise { - const crawler = createCrawler(dispensary || ({} as Dispensary)); - return crawler.extractPagination(document); -} diff --git a/backend/src/crawlers/base/base-treez.ts b/backend/src/crawlers/base/base-treez.ts deleted file mode 100644 index b930f903..00000000 --- a/backend/src/crawlers/base/base-treez.ts +++ /dev/null @@ -1,212 +0,0 @@ -/** - * Base Treez Crawler Template (PLACEHOLDER) - * - * This is the base template for all Treez store crawlers. - * Per-store crawlers extend this by overriding specific methods. 
- * - * TODO: Implement Treez-specific crawling logic - */ - -import { Dispensary } from '../../dutchie-az/types'; -import { - StoreCrawlOptions, - CrawlResult, - StructureDetectionResult, - ExtractedProduct, - ExtractedImage, - ExtractedStock, - ExtractedPagination, -} from './base-dutchie'; - -// Re-export types -export { - StoreCrawlOptions, - CrawlResult, - StructureDetectionResult, - ExtractedProduct, - ExtractedImage, - ExtractedStock, - ExtractedPagination, -}; - -// ============================================================ -// TREEZ-SPECIFIC TYPES -// ============================================================ - -export interface TreezSelectors { - productContainer?: string; - productName?: string; - productPrice?: string; - productImage?: string; - productCategory?: string; - productBrand?: string; - addToCart?: string; - pagination?: string; -} - -export const DEFAULT_TREEZ_SELECTORS: TreezSelectors = { - productContainer: '.product-tile, [class*="ProductCard"]', - productName: '.product-name, [class*="ProductName"]', - productPrice: '.product-price, [class*="ProductPrice"]', - productImage: '.product-image img', - productCategory: '.product-category', - productBrand: '.product-brand', - addToCart: '.add-to-cart-btn', - pagination: '.pagination', -}; - -// ============================================================ -// BASE TREEZ CRAWLER CLASS -// ============================================================ - -export class BaseTreezCrawler { - protected dispensary: Dispensary; - protected options: StoreCrawlOptions; - protected selectors: TreezSelectors; - - constructor( - dispensary: Dispensary, - options: StoreCrawlOptions = {}, - selectors: TreezSelectors = {} - ) { - this.dispensary = dispensary; - this.options = { - pricingType: 'rec', - useBothModes: false, - downloadImages: true, - trackStock: true, - timeoutMs: 30000, - ...options, - }; - this.selectors = { ...DEFAULT_TREEZ_SELECTORS, ...selectors }; - } - - /** - * Main entry point - crawl 
products for this dispensary - * TODO: Implement Treez-specific crawling - */ - async crawlProducts(): Promise { - const startTime = Date.now(); - console.warn(`[BaseTreezCrawler] Treez crawling not yet implemented for ${this.dispensary.name}`); - return { - success: false, - dispensaryId: this.dispensary.id || 0, - productsFound: 0, - productsFetched: 0, - productsUpserted: 0, - snapshotsCreated: 0, - imagesDownloaded: 0, - errorMessage: 'Treez crawler not yet implemented', - durationMs: Date.now() - startTime, - }; - } - - /** - * Detect page structure for sandbox discovery mode - */ - async detectStructure(page: any): Promise { - return { - success: false, - menuType: 'unknown', - selectors: {}, - pagination: { type: 'none' }, - errors: ['Treez structure detection not yet implemented'], - metadata: {}, - }; - } - - /** - * Extract products from page/document - */ - async extractProducts(document: any): Promise { - console.warn('[BaseTreezCrawler] extractProducts not yet implemented'); - return []; - } - - /** - * Extract images from document - */ - async extractImages(document: any): Promise { - console.warn('[BaseTreezCrawler] extractImages not yet implemented'); - return []; - } - - /** - * Extract stock information from document - */ - async extractStock(document: any): Promise { - console.warn('[BaseTreezCrawler] extractStock not yet implemented'); - return []; - } - - /** - * Extract pagination information from document - */ - async extractPagination(document: any): Promise { - return { hasNextPage: false }; - } -} - -// ============================================================ -// FACTORY FUNCTION -// ============================================================ - -export function createCrawler( - dispensary: Dispensary, - options: StoreCrawlOptions = {}, - selectors: TreezSelectors = {} -): BaseTreezCrawler { - return new BaseTreezCrawler(dispensary, options, selectors); -} - -// ============================================================ -// 
STANDALONE FUNCTIONS -// ============================================================ - -export async function crawlProducts( - dispensary: Dispensary, - options: StoreCrawlOptions = {} -): Promise { - const crawler = createCrawler(dispensary, options); - return crawler.crawlProducts(); -} - -export async function detectStructure( - page: any, - dispensary?: Dispensary -): Promise { - const crawler = createCrawler(dispensary || ({} as Dispensary)); - return crawler.detectStructure(page); -} - -export async function extractProducts( - document: any, - dispensary?: Dispensary -): Promise { - const crawler = createCrawler(dispensary || ({} as Dispensary)); - return crawler.extractProducts(document); -} - -export async function extractImages( - document: any, - dispensary?: Dispensary -): Promise { - const crawler = createCrawler(dispensary || ({} as Dispensary)); - return crawler.extractImages(document); -} - -export async function extractStock( - document: any, - dispensary?: Dispensary -): Promise { - const crawler = createCrawler(dispensary || ({} as Dispensary)); - return crawler.extractStock(document); -} - -export async function extractPagination( - document: any, - dispensary?: Dispensary -): Promise { - const crawler = createCrawler(dispensary || ({} as Dispensary)); - return crawler.extractPagination(document); -} diff --git a/backend/src/crawlers/base/index.ts b/backend/src/crawlers/base/index.ts deleted file mode 100644 index 19142cfe..00000000 --- a/backend/src/crawlers/base/index.ts +++ /dev/null @@ -1,27 +0,0 @@ -/** - * Base Crawler Templates Index - * - * Exports all base crawler templates for easy importing. 
- */ - -// Dutchie base (primary implementation) -export * from './base-dutchie'; - -// Treez base (placeholder) -export * as Treez from './base-treez'; - -// Jane base (placeholder) -export * as Jane from './base-jane'; - -// Re-export common types from dutchie for convenience -export type { - StoreCrawlOptions, - CrawlResult, - StructureDetectionResult, - ExtractedProduct, - ExtractedImage, - ExtractedStock, - ExtractedPagination, - DutchieCrawlerHooks, - DutchieSelectors, -} from './base-dutchie'; diff --git a/backend/src/crawlers/dutchie/base-dutchie.ts b/backend/src/crawlers/dutchie/base-dutchie.ts deleted file mode 100644 index 01dd3323..00000000 --- a/backend/src/crawlers/dutchie/base-dutchie.ts +++ /dev/null @@ -1,9 +0,0 @@ -/** - * Base Dutchie Crawler Template (Re-export for backward compatibility) - * - * DEPRECATED: Import from '../base/base-dutchie' instead. - * This file re-exports everything from the new location for existing code. - */ - -// Re-export everything from the new base location -export * from '../base/base-dutchie'; diff --git a/backend/src/crawlers/dutchie/stores/trulieve-scottsdale.ts b/backend/src/crawlers/dutchie/stores/trulieve-scottsdale.ts deleted file mode 100644 index 142cc242..00000000 --- a/backend/src/crawlers/dutchie/stores/trulieve-scottsdale.ts +++ /dev/null @@ -1,118 +0,0 @@ -/** - * Trulieve Scottsdale - Per-Store Dutchie Crawler - * - * Store ID: 101 - * Profile Key: trulieve-scottsdale - * Platform Dispensary ID: 5eaf489fa8a61801212577cc - * - * Phase 1: Identity implementation - no overrides, just uses base Dutchie logic. - * Future: Add store-specific selectors, timing, or custom logic as needed. 
- */ - -import { - BaseDutchieCrawler, - StoreCrawlOptions, - CrawlResult, - DutchieSelectors, - crawlProducts as baseCrawlProducts, -} from '../../base/base-dutchie'; -import { Dispensary } from '../../../dutchie-az/types'; - -// Re-export CrawlResult for the orchestrator -export { CrawlResult }; - -// ============================================================ -// STORE CONFIGURATION -// ============================================================ - -/** - * Store-specific configuration - * These can be used to customize crawler behavior for this store - */ -export const STORE_CONFIG = { - storeId: 101, - profileKey: 'trulieve-scottsdale', - name: 'Trulieve of Scottsdale Dispensary', - platformDispensaryId: '5eaf489fa8a61801212577cc', - - // Store-specific overrides (none for Phase 1) - customOptions: { - // Example future overrides: - // pricingType: 'rec', - // useBothModes: true, - // customHeaders: {}, - // maxRetries: 3, - }, -}; - -// ============================================================ -// STORE CRAWLER CLASS -// ============================================================ - -/** - * TrulieveScottsdaleCrawler - Per-store crawler for Trulieve Scottsdale - * - * Phase 1: Identity implementation - extends BaseDutchieCrawler with no overrides. - * Future phases can override methods like: - * - getCName() for custom slug handling - * - crawlProducts() for completely custom logic - * - Add hooks for pre/post processing - */ -export class TrulieveScottsdaleCrawler extends BaseDutchieCrawler { - constructor(dispensary: Dispensary, options: StoreCrawlOptions = {}) { - // Merge store-specific options with provided options - const mergedOptions: StoreCrawlOptions = { - ...STORE_CONFIG.customOptions, - ...options, - }; - - super(dispensary, mergedOptions); - } - - // Phase 1: No overrides - use base implementation - // Future phases can add overrides here: - // - // async crawlProducts(): Promise { - // // Custom pre-processing - // // ... 
- // const result = await super.crawlProducts(); - // // Custom post-processing - // // ... - // return result; - // } -} - -// ============================================================ -// EXPORTED CRAWL FUNCTION -// ============================================================ - -/** - * Main entry point for the orchestrator - * - * The orchestrator calls: mod.crawlProducts(dispensary, options) - * This function creates a TrulieveScottsdaleCrawler and runs it. - */ -export async function crawlProducts( - dispensary: Dispensary, - options: StoreCrawlOptions = {} -): Promise { - console.log(`[TrulieveScottsdale] Using per-store crawler for ${dispensary.name}`); - - const crawler = new TrulieveScottsdaleCrawler(dispensary, options); - return crawler.crawlProducts(); -} - -// ============================================================ -// FACTORY FUNCTION (alternative API) -// ============================================================ - -/** - * Create a crawler instance without running it - * Useful for testing or when you need to configure before running - */ -export function createCrawler( - dispensary: Dispensary, - options: StoreCrawlOptions = {} -): TrulieveScottsdaleCrawler { - return new TrulieveScottsdaleCrawler(dispensary, options); -} diff --git a/backend/src/db/pool.ts b/backend/src/db/pool.ts index cdc64472..5df916c7 100644 --- a/backend/src/db/pool.ts +++ b/backend/src/db/pool.ts @@ -77,7 +77,9 @@ export function getPool(): Pool { * This is a getter that lazily initializes on first access. 
*/ export const pool = { - query: (...args: Parameters) => getPool().query(...args), + query: (queryTextOrConfig: string | import('pg').QueryConfig, values?: any[]): Promise> => { + return getPool().query(queryTextOrConfig as any, values); + }, connect: () => getPool().connect(), end: () => getPool().end(), on: (event: 'error' | 'connect' | 'acquire' | 'remove' | 'release', listener: (...args: any[]) => void) => getPool().on(event as any, listener), diff --git a/backend/src/discovery/location-discovery.ts b/backend/src/discovery/location-discovery.ts index 1e927b4a..8b69a5f2 100644 --- a/backend/src/discovery/location-discovery.ts +++ b/backend/src/discovery/location-discovery.ts @@ -26,13 +26,377 @@ import { mapLocationRowToLocation, } from './types'; import { DiscoveryCity } from './types'; +import { + executeGraphQL, + fetchPage, + extractNextData, + GRAPHQL_HASHES, + setProxy, +} from '../platforms/dutchie/client'; +import { getStateProxy, getRandomProxy } from '../utils/proxyManager'; puppeteer.use(StealthPlugin()); +// ============================================================ +// PROXY INITIALIZATION +// ============================================================ +// Call initDiscoveryProxy() before any discovery operations to +// set up proxy if USE_PROXY=true environment variable is set. +// This is opt-in and does NOT break existing behavior. 
+// ============================================================ + +let proxyInitialized = false; + +/** + * Initialize proxy for discovery operations + * Only runs if USE_PROXY=true is set in environment + * Safe to call multiple times - only initializes once + * + * @param stateCode - Optional state code for state-specific proxy (e.g., 'AZ', 'CA') + * @returns true if proxy was set, false if skipped or failed + */ +export async function initDiscoveryProxy(stateCode?: string): Promise<boolean> { + // Skip if already initialized + if (proxyInitialized) { + return true; + } + + // Skip if USE_PROXY is not enabled + if (process.env.USE_PROXY !== 'true') { + console.log('[LocationDiscovery] Proxy disabled (USE_PROXY != true)'); + return false; + } + + try { + // Get proxy - prefer state-specific if state code provided + const proxyConfig = stateCode + ? await getStateProxy(stateCode) + : await getRandomProxy(); + + if (!proxyConfig) { + console.warn('[LocationDiscovery] No proxy available, proceeding without proxy'); + return false; + } + + // Build proxy URL with auth if needed + let proxyUrl = proxyConfig.server; + if (proxyConfig.username && proxyConfig.password) { + const url = new URL(proxyConfig.server); + url.username = proxyConfig.username; + url.password = proxyConfig.password; + proxyUrl = url.toString(); + } + + // Set proxy on the Dutchie client + setProxy(proxyUrl); + proxyInitialized = true; + + console.log(`[LocationDiscovery] Proxy initialized for ${stateCode || 'general'} discovery`); + return true; + } catch (error: any) { + console.error(`[LocationDiscovery] Failed to initialize proxy: ${error.message}`); + return false; + } +} + +/** + * Reset proxy initialization flag (for testing or re-initialization) + */ +export function resetProxyInit(): void { + proxyInitialized = false; + setProxy(null); +} + const PLATFORM = 'dutchie'; // ============================================================ -// GRAPHQL / API FETCHING +// CITY-BASED DISCOVERY (CANONICAL 
SOURCE OF TRUTH) +// ============================================================ +// GraphQL with city+state filter is the SOURCE OF TRUTH for database data. +// +// Method: +// 1. Get city list from statesWithDispensaries (in __NEXT_DATA__) +// 2. Query stores per city using city + state GraphQL filter +// 3. This gives us complete, accurate dispensary data +// +// Geo-coordinate queries (nearLat/nearLng) are ONLY for showing search +// results to users (e.g., "stores within 20 miles of me"). +// They are NOT a source of truth for establishing database records. +// ============================================================ + +/** + * State with dispensary cities from Dutchie's statesWithDispensaries data + */ +export interface StateWithCities { + name: string; // State code (e.g., "CA", "AZ") + country: string; // Country code (e.g., "US") + cities: string[]; // Array of city names +} + +/** + * Fetch all states with their cities from Dutchie's __NEXT_DATA__ + * + * This fetches a city page and extracts the statesWithDispensaries data + * which contains all states and their cities where Dutchie has dispensaries. 
+ */ +export async function fetchStatesWithDispensaries( + options: { verbose?: boolean } = {} +): Promise<StateWithCities[]> { + const { verbose = false } = options; + + // Initialize proxy if USE_PROXY=true + await initDiscoveryProxy(); + + console.log('[LocationDiscovery] Fetching statesWithDispensaries from Dutchie...'); + + // Fetch any city page to get the __NEXT_DATA__ with statesWithDispensaries + // Using a known city that's likely to exist + const result = await fetchPage('/dispensaries/az/phoenix', { maxRetries: 3 }); + + if (!result || result.status !== 200) { + console.error('[LocationDiscovery] Failed to fetch city page'); + return []; + } + + const nextData = extractNextData(result.html); + if (!nextData) { + console.error('[LocationDiscovery] No __NEXT_DATA__ found'); + return []; + } + + // Extract statesWithDispensaries from Apollo state + const apolloState = nextData.props?.pageProps?.initialApolloState; + if (!apolloState) { + console.error('[LocationDiscovery] No initialApolloState found'); + return []; + } + + // Find ROOT_QUERY.statesWithDispensaries + const rootQuery = apolloState['ROOT_QUERY']; + if (!rootQuery) { + console.error('[LocationDiscovery] No ROOT_QUERY found'); + return []; + } + + // The statesWithDispensaries is at ROOT_QUERY.statesWithDispensaries + const statesRefs = rootQuery.statesWithDispensaries; + if (!Array.isArray(statesRefs)) { + console.error('[LocationDiscovery] statesWithDispensaries not found or not an array'); + return []; + } + + // Resolve the references to actual state data + const states: StateWithCities[] = []; + for (const ref of statesRefs) { + // ref might be { __ref: "StateWithDispensaries:0" } or direct object + let stateData: any; + + if (ref && ref.__ref) { + stateData = apolloState[ref.__ref]; + } else { + stateData = ref; + } + + if (stateData && stateData.name) { + // Parse cities JSON array if it's a string + let cities = stateData.cities; + if (typeof cities === 'string') { + try { + cities = JSON.parse(cities); + 
} catch { + cities = []; + } + } + + states.push({ + name: stateData.name, + country: stateData.country || 'US', + cities: Array.isArray(cities) ? cities : [], + }); + } + } + + if (verbose) { + console.log(`[LocationDiscovery] Found ${states.length} states`); + for (const state of states) { + console.log(` ${state.name}: ${state.cities.length} cities`); + } + } + + console.log(`[LocationDiscovery] Loaded ${states.length} states with cities`); + return states; +} + +/** + * Get cities for a specific state + */ +export async function getCitiesForState( + stateCode: string, + options: { verbose?: boolean } = {} +): Promise<string[]> { + const states = await fetchStatesWithDispensaries(options); + const state = states.find(s => s.name.toUpperCase() === stateCode.toUpperCase()); + + if (!state) { + console.warn(`[LocationDiscovery] No cities found for state: ${stateCode}`); + return []; + } + + console.log(`[LocationDiscovery] Found ${state.cities.length} cities for ${stateCode}`); + return state.cities; +} + +/** + * Fetch dispensaries for a specific city+state using GraphQL + * + * This is the CORRECT method for establishing database data: + * Uses city + state filter, NOT geo-coordinates. 
+ */ +export async function fetchDispensariesByCityState( + city: string, + stateCode: string, + options: { verbose?: boolean; perPage?: number; maxPages?: number } = {} +): Promise<DutchieLocationResponse[]> { + const { verbose = false, perPage = 200, maxPages = 10 } = options; + + // Initialize proxy if USE_PROXY=true (state-specific proxy preferred) + await initDiscoveryProxy(stateCode); + + console.log(`[LocationDiscovery] Fetching dispensaries for ${city}, ${stateCode}...`); + + const allDispensaries: any[] = []; + let page = 0; + let hasMore = true; + + while (hasMore && page < maxPages) { + const variables = { + dispensaryFilter: { + activeOnly: true, + city: city, + state: stateCode, + }, + page, + perPage, + }; + + try { + const result = await executeGraphQL( + 'ConsumerDispensaries', + variables, + GRAPHQL_HASHES.ConsumerDispensaries, + { cName: `${city.toLowerCase().replace(/\s+/g, '-')}-${stateCode.toLowerCase()}`, maxRetries: 2, retryOn403: true } + ); + + const dispensaries = result?.data?.filteredDispensaries || []; + + if (verbose) { + console.log(`[LocationDiscovery] Page ${page}: ${dispensaries.length} dispensaries`); + } + + if (dispensaries.length === 0) { + hasMore = false; + } else { + // Filter to ensure we only get dispensaries in the correct state + const stateFiltered = dispensaries.filter((d: any) => + d.location?.state?.toUpperCase() === stateCode.toUpperCase() + ); + allDispensaries.push(...stateFiltered); + + if (dispensaries.length < perPage) { + hasMore = false; + } else { + page++; + } + } + } catch (error: any) { + console.error(`[LocationDiscovery] Error fetching page ${page}: ${error.message}`); + hasMore = false; + } + } + + // Dedupe by ID + const uniqueMap = new Map(); + for (const d of allDispensaries) { + const id = d.id || d._id; + if (id && !uniqueMap.has(id)) { + uniqueMap.set(id, d); + } + } + + const unique = Array.from(uniqueMap.values()); + console.log(`[LocationDiscovery] Found ${unique.length} unique dispensaries in ${city}, ${stateCode}`); + 
+ return unique.map(d => normalizeLocationResponse(d)); +} + +/** + * Fetch ALL dispensaries for a state by querying each city + * + * This is the canonical method for establishing state data: + * 1. Get city list from statesWithDispensaries + * 2. Query each city using city+state filter + * 3. Dedupe and return all dispensaries + */ +export async function fetchAllDispensariesForState( + stateCode: string, + options: { verbose?: boolean; progressCallback?: (city: string, count: number, total: number) => void } = {} +): Promise<{ dispensaries: DutchieLocationResponse[]; citiesQueried: number; citiesWithResults: number }> { + const { verbose = false, progressCallback } = options; + + console.log(`[LocationDiscovery] Fetching all dispensaries for ${stateCode}...`); + + // Step 1: Get city list + const cities = await getCitiesForState(stateCode, { verbose }); + if (cities.length === 0) { + console.warn(`[LocationDiscovery] No cities found for ${stateCode}`); + return { dispensaries: [], citiesQueried: 0, citiesWithResults: 0 }; + } + + console.log(`[LocationDiscovery] Will query ${cities.length} cities for ${stateCode}`); + + // Step 2: Query each city + const allDispensaries = new Map(); + let citiesWithResults = 0; + + for (let i = 0; i < cities.length; i++) { + const city = cities[i]; + + if (progressCallback) { + progressCallback(city, i + 1, cities.length); + } + + try { + const dispensaries = await fetchDispensariesByCityState(city, stateCode, { verbose }); + + if (dispensaries.length > 0) { + citiesWithResults++; + for (const d of dispensaries) { + const id = d.id || d.slug; + if (id && !allDispensaries.has(id)) { + allDispensaries.set(id, d); + } + } + } + + // Small delay between cities to avoid rate limiting + await new Promise(r => setTimeout(r, 300)); + } catch (error: any) { + console.error(`[LocationDiscovery] Error querying ${city}: ${error.message}`); + } + } + + const result = Array.from(allDispensaries.values()); + console.log(`[LocationDiscovery] 
Total: ${result.length} unique dispensaries across ${citiesWithResults}/${cities.length} cities`); + + return { + dispensaries: result, + citiesQueried: cities.length, + citiesWithResults, + }; +} + +// ============================================================ +// GRAPHQL / API FETCHING (LEGACY - PUPPETEER-BASED) // ============================================================ interface SessionCredentials { @@ -91,57 +455,77 @@ async function closeSession(session: SessionCredentials): Promise { } /** - * Fetch locations for a city using Dutchie's internal search API. + * Fetch locations for a city. + * + * PRIMARY METHOD: Uses city+state GraphQL filter (source of truth) + * FALLBACK: Legacy Puppeteer-based methods for edge cases */ export async function fetchLocationsForCity( city: DiscoveryCity, options: { session?: SessionCredentials; verbose?: boolean; + useLegacyMethods?: boolean; } = {} ): Promise { - const { verbose = false } = options; - let session = options.session; - let shouldCloseSession = false; + const { verbose = false, useLegacyMethods = false } = options; - if (!session) { - session = await createSession(city.citySlug); - shouldCloseSession = true; - } + console.log(`[LocationDiscovery] Fetching locations for ${city.cityName}, ${city.stateCode}...`); - try { - console.log(`[LocationDiscovery] Fetching locations for ${city.cityName}, ${city.stateCode}...`); - - // Try multiple approaches to get location data - - // Approach 1: Extract from page __NEXT_DATA__ or similar - const locations = await extractLocationsFromPage(session.page, verbose); - if (locations.length > 0) { - console.log(`[LocationDiscovery] Found ${locations.length} locations from page data`); - return locations; - } - - // Approach 2: Try the geo-based GraphQL query - const geoLocations = await fetchLocationsViaGraphQL(session, city, verbose); - if (geoLocations.length > 0) { - console.log(`[LocationDiscovery] Found ${geoLocations.length} locations from GraphQL`); - return 
geoLocations; - } - - // Approach 3: Scrape visible location cards - const scrapedLocations = await scrapeLocationCards(session.page, verbose); - if (scrapedLocations.length > 0) { - console.log(`[LocationDiscovery] Found ${scrapedLocations.length} locations from scraping`); - return scrapedLocations; - } - - console.log(`[LocationDiscovery] No locations found for ${city.cityName}`); - return []; - } finally { - if (shouldCloseSession) { - await closeSession(session); + // PRIMARY METHOD: City+State GraphQL query (SOURCE OF TRUTH) + if (city.cityName && city.stateCode) { + try { + const locations = await fetchDispensariesByCityState(city.cityName, city.stateCode, { verbose }); + if (locations.length > 0) { + console.log(`[LocationDiscovery] Found ${locations.length} locations via GraphQL city+state`); + return locations; + } + } catch (error: any) { + console.warn(`[LocationDiscovery] GraphQL city+state failed: ${error.message}`); } } + + // FALLBACK: Legacy Puppeteer-based methods (only if explicitly enabled) + if (useLegacyMethods) { + let session = options.session; + let shouldCloseSession = false; + + if (!session) { + session = await createSession(city.citySlug); + shouldCloseSession = true; + } + + try { + // Legacy Approach 1: Extract from page __NEXT_DATA__ + const locations = await extractLocationsFromPage(session.page, verbose); + if (locations.length > 0) { + console.log(`[LocationDiscovery] Found ${locations.length} locations from page data (legacy)`); + return locations; + } + + // Legacy Approach 2: Try the geo-based GraphQL query + // NOTE: Geo queries are for SEARCH RESULTS only, not source of truth + const geoLocations = await fetchLocationsViaGraphQL(session, city, verbose); + if (geoLocations.length > 0) { + console.log(`[LocationDiscovery] Found ${geoLocations.length} locations from geo GraphQL (legacy)`); + return geoLocations; + } + + // Legacy Approach 3: Scrape visible location cards + const scrapedLocations = await 
scrapeLocationCards(session.page, verbose); + if (scrapedLocations.length > 0) { + console.log(`[LocationDiscovery] Found ${scrapedLocations.length} locations from scraping (legacy)`); + return scrapedLocations; + } + } finally { + if (shouldCloseSession) { + await closeSession(session); + } + } + } + + console.log(`[LocationDiscovery] No locations found for ${city.cityName}`); + return []; } /** @@ -202,33 +586,52 @@ async function extractLocationsFromPage( /** * Fetch locations via GraphQL geo-based query. + * + * Uses ConsumerDispensaries with geo filtering: + * - dispensaryFilter.nearLat/nearLng for center point + * - dispensaryFilter.distance for radius in miles + * - Response at data.filteredDispensaries */ async function fetchLocationsViaGraphQL( session: SessionCredentials, city: DiscoveryCity, verbose: boolean ): Promise { - // Use a known center point for the city or default to a central US location - const CITY_COORDS: Record = { - 'phoenix': { lat: 33.4484, lng: -112.074 }, - 'tucson': { lat: 32.2226, lng: -110.9747 }, - 'scottsdale': { lat: 33.4942, lng: -111.9261 }, - 'mesa': { lat: 33.4152, lng: -111.8315 }, - 'tempe': { lat: 33.4255, lng: -111.94 }, - 'flagstaff': { lat: 35.1983, lng: -111.6513 }, - // Add more as needed + // City center coordinates with appropriate radius + const CITY_COORDS: Record = { + 'phoenix': { lat: 33.4484, lng: -112.074, radius: 50 }, + 'tucson': { lat: 32.2226, lng: -110.9747, radius: 50 }, + 'scottsdale': { lat: 33.4942, lng: -111.9261, radius: 30 }, + 'mesa': { lat: 33.4152, lng: -111.8315, radius: 30 }, + 'tempe': { lat: 33.4255, lng: -111.94, radius: 30 }, + 'flagstaff': { lat: 35.1983, lng: -111.6513, radius: 50 }, }; - const coords = CITY_COORDS[city.citySlug] || { lat: 33.4484, lng: -112.074 }; + // State-wide coordinates for full coverage + const STATE_COORDS: Record = { + 'AZ': { lat: 33.4484, lng: -112.074, radius: 200 }, + 'CA': { lat: 36.7783, lng: -119.4179, radius: 400 }, + 'CO': { lat: 39.5501, lng: 
-105.7821, radius: 200 }, + 'FL': { lat: 27.6648, lng: -81.5158, radius: 400 }, + 'MI': { lat: 44.3148, lng: -85.6024, radius: 250 }, + 'NV': { lat: 36.1699, lng: -115.1398, radius: 200 }, + }; + // Try city-specific coords first, then state-wide, then default + const coords = CITY_COORDS[city.citySlug] + || (city.stateCode && STATE_COORDS[city.stateCode]) + || { lat: 33.4484, lng: -112.074, radius: 200 }; + + // Correct GraphQL variables for ConsumerDispensaries const variables = { - dispensariesFilter: { - latitude: coords.lat, - longitude: coords.lng, - distance: 50, // miles - state: city.stateCode, - city: city.cityName, + dispensaryFilter: { + activeOnly: true, + nearLat: coords.lat, + nearLng: coords.lng, + distance: coords.radius, }, + page: 0, + perPage: 200, }; const hash = '0a5bfa6ca1d64ae47bcccb7c8077c87147cbc4e6982c17ceec97a2a4948b311b'; @@ -263,8 +666,19 @@ async function fetchLocationsViaGraphQL( return []; } - const dispensaries = response.data?.data?.consumerDispensaries || []; - return dispensaries.map((d: any) => normalizeLocationResponse(d)); + // Response is at data.filteredDispensaries + const dispensaries = response.data?.data?.filteredDispensaries || []; + + // Filter to specific state if needed (radius may include neighboring states) + const filtered = city.stateCode + ? dispensaries.filter((d: any) => d.location?.state === city.stateCode) + : dispensaries; + + if (verbose) { + console.log(`[LocationDiscovery] GraphQL returned ${dispensaries.length} total, ${filtered.length} in ${city.stateCode || 'all states'}`); + } + + return filtered.map((d: any) => normalizeLocationResponse(d)); } catch (error: any) { if (verbose) { console.log(`[LocationDiscovery] GraphQL error: ${error.message}`); @@ -373,13 +787,20 @@ function normalizeLocationResponse(raw: any): DutchieLocationResponse { /** * Upsert a location into dutchie_discovery_locations. + * REQUIRES a valid platform ID (MongoDB ObjectId) - will skip records without one. 
*/ export async function upsertLocation( pool: Pool, location: DutchieLocationResponse, cityId: number | null -): Promise<{ id: number; isNew: boolean }> { - const platformLocationId = location.id || location.slug; +): Promise<{ id: number; isNew: boolean } | null> { + // REQUIRE actual platform ID - NO fallback to slug + const platformLocationId = location.id; + if (!platformLocationId) { + console.warn(`[LocationDiscovery] Skipping location without platform ID: ${location.name} (${location.slug})`); + return null; + } + const menuUrl = location.menuUrl || `https://dutchie.com/dispensary/${location.slug}`; const result = await pool.query( @@ -642,6 +1063,12 @@ export async function discoverLocationsForCity( const result = await upsertLocation(pool, location, city.id); + // Skip locations without valid platform ID + if (!result) { + errors.push(`Location ${location.slug}: No valid platform ID - skipped`); + continue; + } + if (result.isNew) { newCount++; } else { diff --git a/backend/src/dutchie-az/README_DUTCHIE_AZ.md b/backend/src/dutchie-az/README_DUTCHIE_AZ.md deleted file mode 100644 index ca9b79bf..00000000 --- a/backend/src/dutchie-az/README_DUTCHIE_AZ.md +++ /dev/null @@ -1,199 +0,0 @@ -# Dutchie AZ Pipeline - -## Overview - -The Dutchie AZ pipeline is the **only** authorized way to crawl Dutchie dispensary menus. It uses Dutchie's GraphQL API directly (no DOM scraping) and writes to an isolated database with a proper snapshot model. - -## Key Principles - -1. **GraphQL Only** - All Dutchie data is fetched via their FilteredProducts GraphQL API -2. **Isolated Database** - Data lives in `dutchie_az_*` tables, NOT the legacy `products` table -3. **Append-Only Snapshots** - Every crawl creates snapshots, never overwrites historical data -4. **Stock Status Tracking** - Derived from `POSMetaData.children` inventory data -5. 
**Missing Product Detection** - Products not in feed are marked with `isPresentInFeed=false` - -## Directory Structure - -``` -src/dutchie-az/ -├── db/ -│ ├── connection.ts # Database connection pool -│ └── schema.ts # Table definitions and migrations -├── routes/ -│ └── index.ts # REST API endpoints -├── services/ -│ ├── graphql-client.ts # Direct GraphQL fetch (Mode A + Mode B) -│ ├── product-crawler.ts # Main crawler orchestration -│ └── scheduler.ts # Jittered scheduling with wandering intervals -└── types/ - └── index.ts # TypeScript interfaces -``` - -## Data Model - -### Tables - -- **dispensaries** - Arizona Dutchie stores with `platform_dispensary_id` -- **dutchie_products** - Canonical product identity (one row per product per store) -- **dutchie_product_snapshots** - Historical state per crawl (append-only) -- **job_schedules** - Scheduler configuration with jitter support -- **job_run_logs** - Execution history - -### Stock Status - -The `stock_status` field is derived from `POSMetaData.children`: - -```typescript -function deriveStockStatus(children?: POSChild[]): StockStatus { - if (!children || children.length === 0) return 'unknown'; - const totalAvailable = children.reduce((sum, c) => - sum + (c.quantityAvailable || 0), 0); - return totalAvailable > 0 ? 'in_stock' : 'out_of_stock'; -} -``` - -### Two-Mode Crawling - -Mode A (UI Parity): -- `Status: null` - Returns what the UI shows -- Best for "current inventory" snapshot - -Mode B (Max Coverage): -- `Status: 'Active'` - Returns all active products -- Catches items with `isBelowThreshold: true` - -Both modes are merged to get maximum product coverage. 
- -## API Endpoints - -All endpoints are mounted at `/api/dutchie-az/`: - -``` -GET /api/dutchie-az/dispensaries - List all dispensaries -GET /api/dutchie-az/dispensaries/:id - Get dispensary details -GET /api/dutchie-az/products - List products (with filters) -GET /api/dutchie-az/products/:id - Get product with snapshots -GET /api/dutchie-az/products/:id/snapshots - Get product snapshot history -POST /api/dutchie-az/crawl/:dispensaryId - Trigger manual crawl -GET /api/dutchie-az/schedule - Get scheduler status -POST /api/dutchie-az/schedule/run - Manually run scheduled jobs -GET /api/dutchie-az/stats - Dashboard statistics -``` - -## Scheduler - -The scheduler uses **jitter** to avoid detection patterns: - -```typescript -// Each job has independent "wandering" timing -interface JobSchedule { - base_interval_minutes: number; // e.g., 240 (4 hours) - jitter_minutes: number; // e.g., 30 (±30 min) - next_run_at: Date; // Calculated with jitter after each run -} -``` - -Jobs run when `next_run_at <= NOW()`. After completion, the next run is calculated: -``` -next_run_at = NOW() + base_interval + random(-jitter, +jitter) -``` - -This prevents crawls from clustering at predictable times. - -## Manual Testing - -### Run a single dispensary crawl: - -```bash -DATABASE_URL="..." 
npx tsx -e " -const { crawlDispensaryProducts } = require('./src/dutchie-az/services/product-crawler'); -const { query } = require('./src/dutchie-az/db/connection'); - -async function test() { - const { rows } = await query('SELECT * FROM dispensaries LIMIT 1'); - if (!rows[0]) return console.log('No dispensaries found'); - - const result = await crawlDispensaryProducts(rows[0], 'rec', { useBothModes: true }); - console.log(JSON.stringify(result, null, 2)); -} -test(); -" -``` - -### Check stock status distribution: - -```sql -SELECT stock_status, COUNT(*) -FROM dutchie_products -GROUP BY stock_status; -``` - -### View recent snapshots: - -```sql -SELECT - p.name, - s.stock_status, - s.is_present_in_feed, - s.crawled_at -FROM dutchie_product_snapshots s -JOIN dutchie_products p ON p.id = s.dutchie_product_id -ORDER BY s.crawled_at DESC -LIMIT 20; -``` - -## Deprecated Code - -The following files are **DEPRECATED** and will throw errors if called: - -- `src/scrapers/dutchie-graphql.ts` - Wrote to legacy `products` table -- `src/scrapers/dutchie-graphql-direct.ts` - Wrote to legacy `products` table -- `src/scrapers/templates/dutchie.ts` - HTML/DOM scraper (unreliable) -- `src/scraper-v2/engine.ts` DutchieSpider - DOM-based extraction - -If `store-crawl-orchestrator.ts` detects `provider='dutchie'` with `mode='production'`, it now routes to this dutchie-az pipeline automatically. - -## Integration with Legacy System - -The `store-crawl-orchestrator.ts` bridges the legacy stores system with dutchie-az: - -1. When a store has `product_provider='dutchie'` and `product_crawler_mode='production'` -2. The orchestrator looks up the corresponding dispensary in `dutchie_az.dispensaries` -3. It calls `crawlDispensaryProducts()` from the dutchie-az pipeline -4. 
Results are logged but data stays in the dutchie_az tables - -To use the dutchie-az pipeline independently: -- Navigate to `/dutchie-az-schedule` in the UI -- Use the REST API endpoints directly -- Run the scheduler service - -## Environment Variables - -```bash -# Database connection for dutchie-az (same DB, separate tables) -DATABASE_URL=postgresql://user:pass@host:port/database -``` - -## Troubleshooting - -### "Dispensary not found in dutchie-az database" - -The dispensary must exist in `dutchie_az.dispensaries` before crawling. Either: -1. Run discovery to populate dispensaries -2. Manually insert the dispensary with `platform_dispensary_id` - -### GraphQL returns empty products - -1. Check `platform_dispensary_id` is correct (the internal Dutchie ID, not slug) -2. Verify the dispensary is online and has menu data -3. Try both `rec` and `med` pricing types - -### Snapshots show `stock_status='unknown'` - -The product likely has no `POSMetaData.children` array. This happens for: -- Products without inventory tracking -- Manually managed inventory - ---- - -Last updated: December 2025 diff --git a/backend/src/dutchie-az/config/dutchie.ts b/backend/src/dutchie-az/config/dutchie.ts deleted file mode 100644 index 9f409684..00000000 --- a/backend/src/dutchie-az/config/dutchie.ts +++ /dev/null @@ -1,129 +0,0 @@ -/** - * Dutchie Configuration - * - * Centralized configuration for Dutchie GraphQL API interaction. - * Update hashes here when Dutchie changes their persisted query system. 
- */ - -export const dutchieConfig = { - // ============================================================ - // GRAPHQL ENDPOINT - // ============================================================ - - /** GraphQL endpoint - must be the api-3 graphql endpoint (NOT api-gw.dutchie.com which no longer exists) */ - graphqlEndpoint: 'https://dutchie.com/api-3/graphql', - - // ============================================================ - // GRAPHQL PERSISTED QUERY HASHES - // ============================================================ - // - // These hashes identify specific GraphQL operations. - // If Dutchie changes their schema, you may need to capture - // new hashes from live browser traffic (Network tab → graphql requests). - - /** FilteredProducts - main product listing query */ - filteredProductsHash: 'ee29c060826dc41c527e470e9ae502c9b2c169720faa0a9f5d25e1b9a530a4a0', - - /** GetAddressBasedDispensaryData - resolve slug to internal ID */ - getDispensaryDataHash: '13461f73abf7268770dfd05fe7e10c523084b2bb916a929c08efe3d87531977b', - - /** - * ConsumerDispensaries - geo-based discovery - * NOTE: This is a placeholder guess. If discovery fails, either: - * 1. Capture the real hash from live traffic - * 2. Rely on known AZDHS slugs instead (set useDiscovery: false) - */ - consumerDispensariesHash: '0a5bfa6ca1d64ae47bcccb7c8077c87147cbc4e6982c17ceec97a2a4948b311b', - - // ============================================================ - // BEHAVIOR FLAGS - // ============================================================ - - /** Enable geo-based discovery (false = use known AZDHS slugs only) */ - useDiscovery: true, - - /** Prefer GET requests (true) or POST (false). GET is default. */ - preferGet: true, - - /** - * Enable POST fallback when GET fails with 405 or blocked. - * If true, will retry failed GETs as POSTs. 
- */ - enablePostFallback: true, - - // ============================================================ - // PAGINATION & RETRY - // ============================================================ - - /** Products per page for pagination */ - perPage: 100, - - /** Maximum pages to fetch (safety limit) */ - maxPages: 200, - - /** Number of retries for failed page fetches */ - maxRetries: 1, - - /** Delay between pages in ms */ - pageDelayMs: 500, - - /** Delay between modes in ms */ - modeDelayMs: 2000, - - // ============================================================ - // HTTP HEADERS - // ============================================================ - - /** Default headers to mimic browser requests */ - defaultHeaders: { - 'accept': 'application/json, text/plain, */*', - 'accept-language': 'en-US,en;q=0.9', - 'apollographql-client-name': 'Marketplace (production)', - } as Record, - - /** User agent string */ - userAgent: - 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36', - - // ============================================================ - // BROWSER LAUNCH OPTIONS - // ============================================================ - - browserArgs: [ - '--no-sandbox', - '--disable-setuid-sandbox', - '--disable-dev-shm-usage', - '--disable-blink-features=AutomationControlled', - ], - - /** Navigation timeout in ms */ - navigationTimeout: 60000, - - /** Initial page load delay in ms */ - pageLoadDelay: 2000, -}; - -/** - * Get GraphQL hashes object for backward compatibility - */ -export const GRAPHQL_HASHES = { - FilteredProducts: dutchieConfig.filteredProductsHash, - GetAddressBasedDispensaryData: dutchieConfig.getDispensaryDataHash, - ConsumerDispensaries: dutchieConfig.consumerDispensariesHash, -}; - -/** - * Arizona geo centerpoints for discovery scans - */ -export const ARIZONA_CENTERPOINTS = [ - { name: 'Phoenix', lat: 33.4484, lng: -112.074 }, - { name: 'Tucson', lat: 32.2226, lng: -110.9747 }, - { 
name: 'Flagstaff', lat: 35.1983, lng: -111.6513 }, - { name: 'Mesa', lat: 33.4152, lng: -111.8315 }, - { name: 'Scottsdale', lat: 33.4942, lng: -111.9261 }, - { name: 'Tempe', lat: 33.4255, lng: -111.94 }, - { name: 'Yuma', lat: 32.6927, lng: -114.6277 }, - { name: 'Prescott', lat: 34.54, lng: -112.4685 }, - { name: 'Lake Havasu', lat: 34.4839, lng: -114.3224 }, - { name: 'Sierra Vista', lat: 31.5455, lng: -110.2773 }, -]; diff --git a/backend/src/dutchie-az/db/connection.ts b/backend/src/dutchie-az/db/connection.ts deleted file mode 100644 index 5f470889..00000000 --- a/backend/src/dutchie-az/db/connection.ts +++ /dev/null @@ -1,131 +0,0 @@ -/** - * CannaiQ Database Connection - * - * All database access for the CannaiQ platform goes through this module. - * - * SINGLE DATABASE ARCHITECTURE: - * - All services (auth, orchestrator, crawlers, admin) use this ONE database - * - States are modeled via states table + state_id on dispensaries (not separate DBs) - * - * CONFIGURATION (in priority order): - * 1. CANNAIQ_DB_URL - Full connection string (preferred) - * 2. Individual vars: CANNAIQ_DB_HOST, CANNAIQ_DB_PORT, CANNAIQ_DB_NAME, CANNAIQ_DB_USER, CANNAIQ_DB_PASS - * 3. DATABASE_URL - Legacy fallback for K8s compatibility - * - * IMPORTANT: - * - Do NOT create separate pools elsewhere - * - All services should import from this module - */ - -import { Pool, PoolClient } from 'pg'; - -/** - * Get the database connection string from environment variables. - * Supports multiple configuration methods with fallback for legacy compatibility. 
- */ -function getConnectionString(): string { - // Priority 1: Full CANNAIQ connection URL - if (process.env.CANNAIQ_DB_URL) { - return process.env.CANNAIQ_DB_URL; - } - - // Priority 2: Build from individual CANNAIQ env vars - const host = process.env.CANNAIQ_DB_HOST; - const port = process.env.CANNAIQ_DB_PORT; - const name = process.env.CANNAIQ_DB_NAME; - const user = process.env.CANNAIQ_DB_USER; - const pass = process.env.CANNAIQ_DB_PASS; - - if (host && port && name && user && pass) { - return `postgresql://${user}:${pass}@${host}:${port}/${name}`; - } - - // Priority 3: Fallback to DATABASE_URL for legacy/K8s compatibility - if (process.env.DATABASE_URL) { - return process.env.DATABASE_URL; - } - - // Report what's missing - const required = ['CANNAIQ_DB_HOST', 'CANNAIQ_DB_PORT', 'CANNAIQ_DB_NAME', 'CANNAIQ_DB_USER', 'CANNAIQ_DB_PASS']; - const missing = required.filter((key) => !process.env[key]); - - throw new Error( - `[CannaiQ DB] Missing database configuration.\n` + - `Set CANNAIQ_DB_URL, DATABASE_URL, or all of: ${missing.join(', ')}` - ); -} - -let pool: Pool | null = null; - -/** - * Get the CannaiQ database pool (singleton) - * - * This is the canonical pool for all CannaiQ services. - * Do NOT create separate pools elsewhere. - */ -export function getPool(): Pool { - if (!pool) { - pool = new Pool({ - connectionString: getConnectionString(), - max: 10, - idleTimeoutMillis: 30000, - connectionTimeoutMillis: 5000, - }); - - pool.on('error', (err) => { - console.error('[CannaiQ DB] Unexpected error on idle client:', err); - }); - - console.log('[CannaiQ DB] Pool initialized'); - } - return pool; -} - -/** - * @deprecated Use getPool() instead - */ -export function getDutchieAZPool(): Pool { - console.warn('[CannaiQ DB] getDutchieAZPool() is deprecated. 
Use getPool() instead.'); - return getPool(); -} - -/** - * Execute a query on the CannaiQ database - */ -export async function query<T = any>(text: string, params?: any[]): Promise<{ rows: T[]; rowCount: number }> { - const p = getPool(); - const result = await p.query(text, params); - return { rows: result.rows as T[], rowCount: result.rowCount || 0 }; -} - -/** - * Get a client from the pool for transaction use - */ -export async function getClient(): Promise<PoolClient> { - const p = getPool(); - return p.connect(); -} - -/** - * Close the pool connection - */ -export async function closePool(): Promise<void> { - if (pool) { - await pool.end(); - pool = null; - console.log('[CannaiQ DB] Pool closed'); - } -} - -/** - * Check if the database is accessible - */ -export async function healthCheck(): Promise<boolean> { - try { - const result = await query('SELECT 1 as ok'); - return result.rows.length > 0 && result.rows[0].ok === 1; - } catch (error) { - console.error('[CannaiQ DB] Health check failed:', error); - return false; - } -} diff --git a/backend/src/dutchie-az/db/dispensary-columns.ts b/backend/src/dutchie-az/db/dispensary-columns.ts deleted file mode 100644 index e6caada1..00000000 --- a/backend/src/dutchie-az/db/dispensary-columns.ts +++ /dev/null @@ -1,137 +0,0 @@ -/** - * Dispensary Column Definitions - * - * Centralized column list for dispensaries table queries. - * Handles optional columns that may not exist in all environments. - * - * USAGE: - * import { DISPENSARY_COLUMNS, DISPENSARY_COLUMNS_WITH_FAILED } from '../db/dispensary-columns'; - * const result = await query(`SELECT ${DISPENSARY_COLUMNS} FROM dispensaries WHERE ...`); - */ - -/** - * Core dispensary columns that always exist. - * These are guaranteed to be present in all environments. - */ -const CORE_COLUMNS = ` - id, name, slug, city, state, zip, address, latitude, longitude, - menu_type, menu_url, platform_dispensary_id, website, - created_at, updated_at -`; - -/** - * Optional columns with NULL fallback. 
- * - * provider_detection_data: Added in migration 044 - * active_crawler_profile_id: Added in migration 041 - * - * Using COALESCE ensures the query works whether or not the column exists: - * - If column exists: returns the actual value - * - If column doesn't exist: query fails (but migration should be run) - * - * For pre-migration compatibility, we select NULL::jsonb which always works. - * After migration 044 is applied, this can be changed to the real column. - */ - -// TEMPORARY: Use NULL fallback until migration 044 is applied -// After running 044, change this to: provider_detection_data -const PROVIDER_DETECTION_COLUMN = `NULL::jsonb AS provider_detection_data`; - -// After migration 044 is applied, uncomment this line and remove the above: -// const PROVIDER_DETECTION_COLUMN = `provider_detection_data`; - -/** - * Standard dispensary columns for most queries. - * Includes provider_detection_data with NULL fallback for pre-migration compatibility. - */ -export const DISPENSARY_COLUMNS = `${CORE_COLUMNS.trim()}, - ${PROVIDER_DETECTION_COLUMN}`; - -/** - * Dispensary columns including active_crawler_profile_id. - * Used by routes that need profile information. - */ -export const DISPENSARY_COLUMNS_WITH_PROFILE = `${CORE_COLUMNS.trim()}, - ${PROVIDER_DETECTION_COLUMN}, - active_crawler_profile_id`; - -/** - * Dispensary columns including failed_at. - * Used by worker for compatibility checks. - */ -export const DISPENSARY_COLUMNS_WITH_FAILED = `${CORE_COLUMNS.trim()}, - ${PROVIDER_DETECTION_COLUMN}, - failed_at`; - -/** - * NOTE: After migration 044 is applied, update PROVIDER_DETECTION_COLUMN above - * to use the real column instead of NULL fallback. 
- * - * To verify migration status: - * SELECT column_name FROM information_schema.columns - * WHERE table_name = 'dispensaries' AND column_name = 'provider_detection_data'; - */ - -// Cache for column existence check -let _providerDetectionColumnExists: boolean | null = null; - -/** - * Check if provider_detection_data column exists in dispensaries table. - * Result is cached after first check. - */ -export async function hasProviderDetectionColumn(pool: { query: (sql: string) => Promise<{ rows: any[] }> }): Promise { - if (_providerDetectionColumnExists !== null) { - return _providerDetectionColumnExists; - } - - try { - const result = await pool.query(` - SELECT 1 FROM information_schema.columns - WHERE table_name = 'dispensaries' AND column_name = 'provider_detection_data' - `); - _providerDetectionColumnExists = result.rows.length > 0; - } catch { - _providerDetectionColumnExists = false; - } - - return _providerDetectionColumnExists; -} - -/** - * Safely update provider_detection_data column. - * If column doesn't exist, logs a warning but doesn't crash. - * - * @param pool - Database pool with query method - * @param dispensaryId - ID of dispensary to update - * @param data - JSONB data to merge into provider_detection_data - * @returns true if update succeeded, false if column doesn't exist - */ -export async function safeUpdateProviderDetectionData( - pool: { query: (sql: string, params?: any[]) => Promise }, - dispensaryId: number, - data: Record -): Promise { - const hasColumn = await hasProviderDetectionColumn(pool); - - if (!hasColumn) { - console.warn(`[DispensaryColumns] provider_detection_data column not found. 
Run migration 044 to add it.`); - return false; - } - - try { - await pool.query( - `UPDATE dispensaries - SET provider_detection_data = COALESCE(provider_detection_data, '{}'::jsonb) || $1::jsonb, - updated_at = NOW() - WHERE id = $2`, - [JSON.stringify(data), dispensaryId] - ); - return true; - } catch (error: any) { - if (error.message?.includes('provider_detection_data')) { - console.warn(`[DispensaryColumns] Failed to update provider_detection_data: ${error.message}`); - return false; - } - throw error; - } -} diff --git a/backend/src/dutchie-az/db/migrate.ts b/backend/src/dutchie-az/db/migrate.ts deleted file mode 100644 index 2055463c..00000000 --- a/backend/src/dutchie-az/db/migrate.ts +++ /dev/null @@ -1,29 +0,0 @@ -/** - * Dutchie AZ Schema Bootstrap - * - * Run this to create/update the dutchie_az tables (dutchie_products, dutchie_product_snapshots, etc.) - * in the AZ pipeline database. This is separate from the legacy schema. - * - * Usage: - * TS_NODE_TRANSPILE_ONLY=1 npx ts-node src/dutchie-az/db/migrate.ts - * or (after build) - * node dist/dutchie-az/db/migrate.js - */ - -import { createSchema } from './schema'; -import { closePool } from './connection'; - -async function main() { - try { - console.log('[DutchieAZ] Running schema migration...'); - await createSchema(); - console.log('[DutchieAZ] Schema migration complete.'); - } catch (err: any) { - console.error('[DutchieAZ] Schema migration failed:', err.message); - process.exitCode = 1; - } finally { - await closePool(); - } -} - -main(); diff --git a/backend/src/dutchie-az/db/schema.ts b/backend/src/dutchie-az/db/schema.ts deleted file mode 100644 index ad6e2036..00000000 --- a/backend/src/dutchie-az/db/schema.ts +++ /dev/null @@ -1,408 +0,0 @@ -/** - * Dutchie AZ Database Schema - * - * Creates all tables for the isolated Dutchie Arizona data pipeline. - * Run this to initialize the dutchie_az database. 
- */ - -import { query, getClient } from './connection'; - -/** - * SQL statements to create all tables - */ -const SCHEMA_SQL = ` --- ============================================================ --- DISPENSARIES TABLE --- Stores discovered Dutchie dispensaries in Arizona --- ============================================================ -CREATE TABLE IF NOT EXISTS dispensaries ( - id SERIAL PRIMARY KEY, - platform VARCHAR(20) NOT NULL DEFAULT 'dutchie', - name VARCHAR(255) NOT NULL, - slug VARCHAR(255) NOT NULL, - city VARCHAR(100) NOT NULL, - state VARCHAR(10) NOT NULL DEFAULT 'AZ', - postal_code VARCHAR(20), - address TEXT, - latitude DECIMAL(10, 7), - longitude DECIMAL(10, 7), - platform_dispensary_id VARCHAR(100), - is_delivery BOOLEAN DEFAULT false, - is_pickup BOOLEAN DEFAULT true, - raw_metadata JSONB, - last_crawled_at TIMESTAMPTZ, - product_count INTEGER DEFAULT 0, - created_at TIMESTAMPTZ DEFAULT NOW(), - updated_at TIMESTAMPTZ DEFAULT NOW(), - - CONSTRAINT uk_dispensaries_platform_slug UNIQUE (platform, slug, city, state) -); - -CREATE INDEX IF NOT EXISTS idx_dispensaries_platform ON dispensaries(platform); -CREATE INDEX IF NOT EXISTS idx_dispensaries_platform_id ON dispensaries(platform_dispensary_id); -CREATE INDEX IF NOT EXISTS idx_dispensaries_state ON dispensaries(state); -CREATE INDEX IF NOT EXISTS idx_dispensaries_city ON dispensaries(city); - --- ============================================================ --- DUTCHIE_PRODUCTS TABLE --- Canonical product identity per store --- ============================================================ -CREATE TABLE IF NOT EXISTS dutchie_products ( - id SERIAL PRIMARY KEY, - dispensary_id INTEGER NOT NULL REFERENCES dispensaries(id) ON DELETE CASCADE, - platform VARCHAR(20) NOT NULL DEFAULT 'dutchie', - - external_product_id VARCHAR(100) NOT NULL, - platform_dispensary_id VARCHAR(100) NOT NULL, - c_name VARCHAR(500), - name VARCHAR(500) NOT NULL, - - -- Brand - brand_name VARCHAR(255), - brand_id VARCHAR(100), - 
brand_logo_url TEXT, - - -- Classification - type VARCHAR(100), - subcategory VARCHAR(100), - strain_type VARCHAR(50), - provider VARCHAR(100), - - -- Potency - thc DECIMAL(10, 4), - thc_content DECIMAL(10, 4), - cbd DECIMAL(10, 4), - cbd_content DECIMAL(10, 4), - cannabinoids_v2 JSONB, - effects JSONB, - - -- Status / flags - status VARCHAR(50), - medical_only BOOLEAN DEFAULT false, - rec_only BOOLEAN DEFAULT false, - featured BOOLEAN DEFAULT false, - coming_soon BOOLEAN DEFAULT false, - certificate_of_analysis_enabled BOOLEAN DEFAULT false, - - is_below_threshold BOOLEAN DEFAULT false, - is_below_kiosk_threshold BOOLEAN DEFAULT false, - options_below_threshold BOOLEAN DEFAULT false, - options_below_kiosk_threshold BOOLEAN DEFAULT false, - - -- Derived stock status: 'in_stock', 'out_of_stock', 'unknown' - stock_status VARCHAR(20) DEFAULT 'unknown', - total_quantity_available INTEGER DEFAULT 0, - - -- Images - primary_image_url TEXT, - images JSONB, - - -- Misc - measurements JSONB, - weight VARCHAR(50), - past_c_names TEXT[], - - created_at_dutchie TIMESTAMPTZ, - updated_at_dutchie TIMESTAMPTZ, - - latest_raw_payload JSONB, - - created_at TIMESTAMPTZ DEFAULT NOW(), - updated_at TIMESTAMPTZ DEFAULT NOW(), - - CONSTRAINT uk_dutchie_products UNIQUE (dispensary_id, external_product_id) -); - -CREATE INDEX IF NOT EXISTS idx_dutchie_products_dispensary ON dutchie_products(dispensary_id); -CREATE INDEX IF NOT EXISTS idx_dutchie_products_external_id ON dutchie_products(external_product_id); -CREATE INDEX IF NOT EXISTS idx_dutchie_products_platform_disp ON dutchie_products(platform_dispensary_id); -CREATE INDEX IF NOT EXISTS idx_dutchie_products_brand ON dutchie_products(brand_name); -CREATE INDEX IF NOT EXISTS idx_dutchie_products_type ON dutchie_products(type); -CREATE INDEX IF NOT EXISTS idx_dutchie_products_subcategory ON dutchie_products(subcategory); -CREATE INDEX IF NOT EXISTS idx_dutchie_products_status ON dutchie_products(status); -CREATE INDEX IF NOT EXISTS 
idx_dutchie_products_strain ON dutchie_products(strain_type); -CREATE INDEX IF NOT EXISTS idx_dutchie_products_stock_status ON dutchie_products(stock_status); - --- ============================================================ --- DUTCHIE_PRODUCT_SNAPSHOTS TABLE --- Historical state per crawl, includes options[] --- ============================================================ -CREATE TABLE IF NOT EXISTS dutchie_product_snapshots ( - id SERIAL PRIMARY KEY, - dutchie_product_id INTEGER NOT NULL REFERENCES dutchie_products(id) ON DELETE CASCADE, - dispensary_id INTEGER NOT NULL REFERENCES dispensaries(id) ON DELETE CASCADE, - platform_dispensary_id VARCHAR(100) NOT NULL, - external_product_id VARCHAR(100) NOT NULL, - pricing_type VARCHAR(20) DEFAULT 'unknown', - crawl_mode VARCHAR(20) DEFAULT 'mode_a', -- 'mode_a' (UI parity) or 'mode_b' (max coverage) - - status VARCHAR(50), - featured BOOLEAN DEFAULT false, - special BOOLEAN DEFAULT false, - medical_only BOOLEAN DEFAULT false, - rec_only BOOLEAN DEFAULT false, - - -- Flag indicating if product was present in feed (false = missing_from_feed snapshot) - is_present_in_feed BOOLEAN DEFAULT true, - - -- Derived stock status - stock_status VARCHAR(20) DEFAULT 'unknown', - - -- Price summary (in cents) - rec_min_price_cents INTEGER, - rec_max_price_cents INTEGER, - rec_min_special_price_cents INTEGER, - med_min_price_cents INTEGER, - med_max_price_cents INTEGER, - med_min_special_price_cents INTEGER, - wholesale_min_price_cents INTEGER, - - -- Inventory summary - total_quantity_available INTEGER, - total_kiosk_quantity_available INTEGER, - manual_inventory BOOLEAN DEFAULT false, - is_below_threshold BOOLEAN DEFAULT false, - is_below_kiosk_threshold BOOLEAN DEFAULT false, - - -- Option-level data (from POSMetaData.children) - options JSONB, - - -- Full raw product node - raw_payload JSONB NOT NULL, - - crawled_at TIMESTAMPTZ NOT NULL, - created_at TIMESTAMPTZ DEFAULT NOW(), - updated_at TIMESTAMPTZ DEFAULT NOW() -); - 
-CREATE INDEX IF NOT EXISTS idx_snapshots_product ON dutchie_product_snapshots(dutchie_product_id); -CREATE INDEX IF NOT EXISTS idx_snapshots_dispensary ON dutchie_product_snapshots(dispensary_id); -CREATE INDEX IF NOT EXISTS idx_snapshots_crawled_at ON dutchie_product_snapshots(crawled_at); -CREATE INDEX IF NOT EXISTS idx_snapshots_platform_disp ON dutchie_product_snapshots(platform_dispensary_id); -CREATE INDEX IF NOT EXISTS idx_snapshots_external_id ON dutchie_product_snapshots(external_product_id); -CREATE INDEX IF NOT EXISTS idx_snapshots_special ON dutchie_product_snapshots(special) WHERE special = true; -CREATE INDEX IF NOT EXISTS idx_snapshots_stock_status ON dutchie_product_snapshots(stock_status); -CREATE INDEX IF NOT EXISTS idx_snapshots_crawl_mode ON dutchie_product_snapshots(crawl_mode); - --- ============================================================ --- CRAWL_JOBS TABLE --- Tracks crawl execution status --- ============================================================ -CREATE TABLE IF NOT EXISTS crawl_jobs ( - id SERIAL PRIMARY KEY, - job_type VARCHAR(50) NOT NULL, - dispensary_id INTEGER REFERENCES dispensaries(id) ON DELETE SET NULL, - status VARCHAR(20) NOT NULL DEFAULT 'pending', - started_at TIMESTAMPTZ, - completed_at TIMESTAMPTZ, - error_message TEXT, - products_found INTEGER, - snapshots_created INTEGER, - metadata JSONB, - created_at TIMESTAMPTZ DEFAULT NOW(), - updated_at TIMESTAMPTZ DEFAULT NOW() -); - -CREATE INDEX IF NOT EXISTS idx_crawl_jobs_type ON crawl_jobs(job_type); -CREATE INDEX IF NOT EXISTS idx_crawl_jobs_status ON crawl_jobs(status); -CREATE INDEX IF NOT EXISTS idx_crawl_jobs_dispensary ON crawl_jobs(dispensary_id); -CREATE INDEX IF NOT EXISTS idx_crawl_jobs_created ON crawl_jobs(created_at); - --- ============================================================ --- JOB_SCHEDULES TABLE --- Stores schedule configuration for recurring jobs with jitter support --- Each job has independent timing that "wanders" over time --- 
============================================================ -CREATE TABLE IF NOT EXISTS job_schedules ( - id SERIAL PRIMARY KEY, - job_name VARCHAR(100) NOT NULL UNIQUE, - description TEXT, - enabled BOOLEAN DEFAULT true, - - -- Timing configuration (jitter makes times "wander") - base_interval_minutes INTEGER NOT NULL DEFAULT 240, -- e.g., 4 hours - jitter_minutes INTEGER NOT NULL DEFAULT 30, -- e.g., ±30 min - - -- Last run tracking - last_run_at TIMESTAMPTZ, - last_status VARCHAR(20), -- 'success', 'error', 'partial', 'running' - last_error_message TEXT, - last_duration_ms INTEGER, - - -- Next run (calculated with jitter after each run) - next_run_at TIMESTAMPTZ, - - -- Additional config - job_config JSONB, -- e.g., { pricingType: 'rec', useBothModes: true } - - created_at TIMESTAMPTZ DEFAULT NOW(), - updated_at TIMESTAMPTZ DEFAULT NOW() -); - -CREATE INDEX IF NOT EXISTS idx_job_schedules_enabled ON job_schedules(enabled); -CREATE INDEX IF NOT EXISTS idx_job_schedules_next_run ON job_schedules(next_run_at); - --- ============================================================ --- JOB_RUN_LOGS TABLE --- Stores history of job runs for monitoring --- ============================================================ -CREATE TABLE IF NOT EXISTS job_run_logs ( - id SERIAL PRIMARY KEY, - schedule_id INTEGER NOT NULL REFERENCES job_schedules(id) ON DELETE CASCADE, - job_name VARCHAR(100) NOT NULL, - status VARCHAR(20) NOT NULL, -- 'pending', 'running', 'success', 'error', 'partial' - started_at TIMESTAMPTZ, - completed_at TIMESTAMPTZ, - duration_ms INTEGER, - error_message TEXT, - - -- Results summary - items_processed INTEGER, - items_succeeded INTEGER, - items_failed INTEGER, - - metadata JSONB, -- Additional run details - - created_at TIMESTAMPTZ DEFAULT NOW() -); - -CREATE INDEX IF NOT EXISTS idx_job_run_logs_schedule ON job_run_logs(schedule_id); -CREATE INDEX IF NOT EXISTS idx_job_run_logs_job_name ON job_run_logs(job_name); -CREATE INDEX IF NOT EXISTS 
idx_job_run_logs_status ON job_run_logs(status); -CREATE INDEX IF NOT EXISTS idx_job_run_logs_created ON job_run_logs(created_at); - --- ============================================================ --- VIEWS FOR EASY QUERYING --- ============================================================ - --- Categories derived from products -CREATE OR REPLACE VIEW v_categories AS -SELECT - type, - subcategory, - COUNT(DISTINCT id) as product_count, - COUNT(DISTINCT dispensary_id) as dispensary_count, - AVG(thc) as avg_thc, - MIN(thc) as min_thc, - MAX(thc) as max_thc -FROM dutchie_products -WHERE type IS NOT NULL -GROUP BY type, subcategory -ORDER BY type, subcategory; - --- Brands derived from products -CREATE OR REPLACE VIEW v_brands AS -SELECT - brand_name, - brand_id, - MAX(brand_logo_url) as brand_logo_url, - COUNT(DISTINCT id) as product_count, - COUNT(DISTINCT dispensary_id) as dispensary_count, - ARRAY_AGG(DISTINCT type) FILTER (WHERE type IS NOT NULL) as product_types -FROM dutchie_products -WHERE brand_name IS NOT NULL -GROUP BY brand_name, brand_id -ORDER BY product_count DESC; - --- Latest snapshot per product (most recent crawl data) -CREATE OR REPLACE VIEW v_latest_snapshots AS -SELECT DISTINCT ON (dutchie_product_id) - s.* -FROM dutchie_product_snapshots s -ORDER BY dutchie_product_id, crawled_at DESC; - --- Dashboard stats -CREATE OR REPLACE VIEW v_dashboard_stats AS -SELECT - (SELECT COUNT(*) FROM dispensaries WHERE state = 'AZ') as dispensary_count, - (SELECT COUNT(*) FROM dutchie_products) as product_count, - (SELECT COUNT(*) FROM dutchie_product_snapshots WHERE crawled_at > NOW() - INTERVAL '24 hours') as snapshots_24h, - (SELECT MAX(crawled_at) FROM dutchie_product_snapshots) as last_crawl_time, - (SELECT COUNT(*) FROM crawl_jobs WHERE status = 'failed' AND created_at > NOW() - INTERVAL '24 hours') as failed_jobs_24h, - (SELECT COUNT(DISTINCT brand_name) FROM dutchie_products WHERE brand_name IS NOT NULL) as brand_count, - (SELECT COUNT(DISTINCT (type, 
subcategory)) FROM dutchie_products WHERE type IS NOT NULL) as category_count; -`; - -/** - * Run the schema migration - */ -export async function createSchema(): Promise<void> { - console.log('[DutchieAZ Schema] Creating database schema...'); - - const client = await getClient(); - - try { - await client.query('BEGIN'); - - // Split into individual statements and execute - const statements = SCHEMA_SQL - .split(';') - .map(s => s.trim()) - .filter(s => s.length > 0 && !s.startsWith('--')); - - for (const statement of statements) { - if (statement.trim()) { - await client.query(statement + ';'); - } - } - - await client.query('COMMIT'); - console.log('[DutchieAZ Schema] Schema created successfully'); - } catch (error) { - await client.query('ROLLBACK'); - console.error('[DutchieAZ Schema] Failed to create schema:', error); - throw error; - } finally { - client.release(); - } -} - -/** - * Drop all tables (for development/testing) - */ -export async function dropSchema(): Promise<void> { - console.log('[DutchieAZ Schema] Dropping all tables...'); - - await query(` - DROP VIEW IF EXISTS v_dashboard_stats CASCADE; - DROP VIEW IF EXISTS v_latest_snapshots CASCADE; - DROP VIEW IF EXISTS v_brands CASCADE; - DROP VIEW IF EXISTS v_categories CASCADE; - DROP TABLE IF EXISTS crawl_schedule CASCADE; - DROP TABLE IF EXISTS crawl_jobs CASCADE; - DROP TABLE IF EXISTS dutchie_product_snapshots CASCADE; - DROP TABLE IF EXISTS dutchie_products CASCADE; - DROP TABLE IF EXISTS dispensaries CASCADE; - `); - - console.log('[DutchieAZ Schema] All tables dropped'); -} - -/** - * Check if schema exists - */ -export async function schemaExists(): Promise<boolean> { - try { - const result = await query(` - SELECT EXISTS ( - SELECT FROM information_schema.tables - WHERE table_name = 'dispensaries' - ) as exists - `); - return result.rows[0]?.exists === true; - } catch (error) { - return false; - } -} - -/** - * Initialize schema if it doesn't exist - */ -export async function ensureSchema(): Promise<void> { - const 
exists = await schemaExists(); - if (!exists) { - await createSchema(); - } else { - console.log('[DutchieAZ Schema] Schema already exists'); - } -} diff --git a/backend/src/dutchie-az/discovery/DtCityDiscoveryService.ts b/backend/src/dutchie-az/discovery/DtCityDiscoveryService.ts deleted file mode 100644 index 99d38d7a..00000000 --- a/backend/src/dutchie-az/discovery/DtCityDiscoveryService.ts +++ /dev/null @@ -1,403 +0,0 @@ -/** - * DtCityDiscoveryService - * - * Core service for Dutchie city discovery. - * Contains shared logic used by multiple entrypoints. - * - * Responsibilities: - * - Browser/API-based city fetching - * - Manual city seeding - * - City upsert operations - */ - -import { Pool } from 'pg'; -import axios from 'axios'; -import puppeteer from 'puppeteer-extra'; -import StealthPlugin from 'puppeteer-extra-plugin-stealth'; - -puppeteer.use(StealthPlugin()); - -// ============================================================ -// TYPES -// ============================================================ - -export interface DutchieCity { - name: string; - slug: string; - stateCode: string | null; - countryCode: string; - url?: string; -} - -export interface CityDiscoveryResult { - citiesFound: number; - citiesInserted: number; - citiesUpdated: number; - errors: string[]; - durationMs: number; -} - -export interface ManualSeedResult { - city: DutchieCity; - id: number; - wasInserted: boolean; -} - -// ============================================================ -// US STATE CODE MAPPING -// ============================================================ - -export const US_STATE_MAP: Record<string, string> = { - 'alabama': 'AL', 'alaska': 'AK', 'arizona': 'AZ', 'arkansas': 'AR', - 'california': 'CA', 'colorado': 'CO', 'connecticut': 'CT', 'delaware': 'DE', - 'florida': 'FL', 'georgia': 'GA', 'hawaii': 'HI', 'idaho': 'ID', - 'illinois': 'IL', 'indiana': 'IN', 'iowa': 'IA', 'kansas': 'KS', - 'kentucky': 'KY', 'louisiana': 'LA', 'maine': 'ME', 'maryland': 'MD', - 'massachusetts': 
'MA', 'michigan': 'MI', 'minnesota': 'MN', 'mississippi': 'MS', - 'missouri': 'MO', 'montana': 'MT', 'nebraska': 'NE', 'nevada': 'NV', - 'new-hampshire': 'NH', 'new-jersey': 'NJ', 'new-mexico': 'NM', 'new-york': 'NY', - 'north-carolina': 'NC', 'north-dakota': 'ND', 'ohio': 'OH', 'oklahoma': 'OK', - 'oregon': 'OR', 'pennsylvania': 'PA', 'rhode-island': 'RI', 'south-carolina': 'SC', - 'south-dakota': 'SD', 'tennessee': 'TN', 'texas': 'TX', 'utah': 'UT', - 'vermont': 'VT', 'virginia': 'VA', 'washington': 'WA', 'west-virginia': 'WV', - 'wisconsin': 'WI', 'wyoming': 'WY', 'district-of-columbia': 'DC', -}; - -// Canadian province mapping -export const CA_PROVINCE_MAP: Record<string, string> = { - 'alberta': 'AB', 'british-columbia': 'BC', 'manitoba': 'MB', - 'new-brunswick': 'NB', 'newfoundland-and-labrador': 'NL', - 'northwest-territories': 'NT', 'nova-scotia': 'NS', 'nunavut': 'NU', - 'ontario': 'ON', 'prince-edward-island': 'PE', 'quebec': 'QC', - 'saskatchewan': 'SK', 'yukon': 'YT', -}; - -// ============================================================ -// CITY FETCHING (AUTO DISCOVERY) -// ============================================================ - -/** - * Fetch cities from Dutchie's /cities page using Puppeteer. 
- */ -export async function fetchCitiesFromBrowser(): Promise<DutchieCity[]> { - console.log('[DtCityDiscoveryService] Launching browser to fetch cities...'); - - const browser = await puppeteer.launch({ - headless: 'new', - args: ['--no-sandbox', '--disable-setuid-sandbox', '--disable-dev-shm-usage'], - }); - - try { - const page = await browser.newPage(); - await page.setUserAgent( - 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36' - ); - - console.log('[DtCityDiscoveryService] Navigating to https://dutchie.com/cities...'); - await page.goto('https://dutchie.com/cities', { - waitUntil: 'networkidle2', - timeout: 60000, - }); - - await new Promise((r) => setTimeout(r, 3000)); - - const cities = await page.evaluate(() => { - const cityLinks: Array<{ - name: string; - slug: string; - url: string; - stateSlug: string | null; - }> = []; - - const links = document.querySelectorAll('a[href*="/city/"]'); - links.forEach((link) => { - const href = (link as HTMLAnchorElement).href; - const text = (link as HTMLElement).innerText?.trim(); - - const match = href.match(/\/city\/([^/]+)\/([^/?]+)/); - if (match && text) { - cityLinks.push({ - name: text, - slug: match[2], - url: href, - stateSlug: match[1], - }); - } - }); - - return cityLinks; - }); - - console.log(`[DtCityDiscoveryService] Extracted ${cities.length} city links from page`); - - return cities.map((city) => { - let countryCode = 'US'; - let stateCode: string | null = null; - - if (city.stateSlug) { - if (US_STATE_MAP[city.stateSlug]) { - stateCode = US_STATE_MAP[city.stateSlug]; - countryCode = 'US'; - } else if (CA_PROVINCE_MAP[city.stateSlug]) { - stateCode = CA_PROVINCE_MAP[city.stateSlug]; - countryCode = 'CA'; - } else if (city.stateSlug.length === 2) { - stateCode = city.stateSlug.toUpperCase(); - if (Object.values(CA_PROVINCE_MAP).includes(stateCode)) { - countryCode = 'CA'; - } - } - } - - return { - name: city.name, - slug: city.slug, - stateCode, - 
countryCode, - url: city.url, - }; - }); - } finally { - await browser.close(); - } -} - -/** - * Fetch cities via API endpoints (fallback). - */ -export async function fetchCitiesFromAPI(): Promise<DutchieCity[]> { - console.log('[DtCityDiscoveryService] Attempting API-based city discovery...'); - - const apiEndpoints = [ - 'https://dutchie.com/api/cities', - 'https://api.dutchie.com/v1/cities', - ]; - - for (const endpoint of apiEndpoints) { - try { - const response = await axios.get(endpoint, { - headers: { - 'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 Chrome/120.0.0.0', - Accept: 'application/json', - }, - timeout: 15000, - }); - - if (response.data && Array.isArray(response.data)) { - console.log(`[DtCityDiscoveryService] API returned ${response.data.length} cities`); - return response.data.map((c: any) => ({ - name: c.name || c.city, - slug: c.slug || c.citySlug, - stateCode: c.stateCode || c.state, - countryCode: c.countryCode || c.country || 'US', - })); - } - } catch (error: any) { - console.log(`[DtCityDiscoveryService] API ${endpoint} failed: ${error.message}`); - } - } - - return []; -} - -// ============================================================ -// DATABASE OPERATIONS -// ============================================================ - -/** - * Upsert a city into dutchie_discovery_cities - */ -export async function upsertCity( - pool: Pool, - city: DutchieCity -): Promise<{ id: number; inserted: boolean; updated: boolean }> { - const result = await pool.query( - ` - INSERT INTO dutchie_discovery_cities ( - platform, - city_name, - city_slug, - state_code, - country_code, - crawl_enabled, - created_at, - updated_at - ) VALUES ( - 'dutchie', - $1, - $2, - $3, - $4, - TRUE, - NOW(), - NOW() - ) - ON CONFLICT (platform, country_code, state_code, city_slug) - DO UPDATE SET - city_name = EXCLUDED.city_name, - crawl_enabled = TRUE, - updated_at = NOW() - RETURNING id, (xmax = 0) AS inserted - `, - [city.name, city.slug, city.stateCode, 
city.countryCode] - ); - - const inserted = result.rows[0]?.inserted === true; - return { - id: result.rows[0]?.id, - inserted, - updated: !inserted, - }; -} - -// ============================================================ -// MAIN SERVICE CLASS -// ============================================================ - -export class DtCityDiscoveryService { - constructor(private pool: Pool) {} - - /** - * Run auto-discovery (browser + API fallback) - */ - async runAutoDiscovery(): Promise<CityDiscoveryResult> { - const startTime = Date.now(); - const errors: string[] = []; - let citiesFound = 0; - let citiesInserted = 0; - let citiesUpdated = 0; - - console.log('[DtCityDiscoveryService] Starting auto city discovery...'); - - try { - let cities = await fetchCitiesFromBrowser(); - - if (cities.length === 0) { - console.log('[DtCityDiscoveryService] Browser returned 0 cities, trying API...'); - cities = await fetchCitiesFromAPI(); - } - - citiesFound = cities.length; - console.log(`[DtCityDiscoveryService] Found ${citiesFound} cities`); - - for (const city of cities) { - try { - const result = await upsertCity(this.pool, city); - if (result.inserted) citiesInserted++; - else if (result.updated) citiesUpdated++; - } catch (error: any) { - const msg = `Failed to upsert city ${city.slug}: ${error.message}`; - console.error(`[DtCityDiscoveryService] ${msg}`); - errors.push(msg); - } - } - } catch (error: any) { - const msg = `Auto discovery failed: ${error.message}`; - console.error(`[DtCityDiscoveryService] ${msg}`); - errors.push(msg); - } - - const durationMs = Date.now() - startTime; - - return { - citiesFound, - citiesInserted, - citiesUpdated, - errors, - durationMs, - }; - } - - /** - * Seed a single city manually - */ - async seedCity(city: DutchieCity): Promise<ManualSeedResult> { - console.log(`[DtCityDiscoveryService] Seeding city: ${city.name} (${city.slug}), ${city.stateCode}, ${city.countryCode}`); - - const result = await upsertCity(this.pool, city); - - return { - city, - id: result.id, - 
wasInserted: result.inserted, - }; - } - - /** - * Seed multiple cities from a list - */ - async seedCities(cities: DutchieCity[]): Promise<{ - results: ManualSeedResult[]; - errors: string[]; - }> { - const results: ManualSeedResult[] = []; - const errors: string[] = []; - - for (const city of cities) { - try { - const result = await this.seedCity(city); - results.push(result); - } catch (error: any) { - errors.push(`${city.slug}: ${error.message}`); - } - } - - return { results, errors }; - } - - /** - * Get statistics about discovered cities - */ - async getStats(): Promise<{ - total: number; - byCountry: Array<{ countryCode: string; count: number }>; - byState: Array<{ stateCode: string; countryCode: string; count: number }>; - crawlEnabled: number; - neverCrawled: number; - }> { - const [totalRes, byCountryRes, byStateRes, enabledRes, neverRes] = await Promise.all([ - this.pool.query('SELECT COUNT(*) as cnt FROM dutchie_discovery_cities WHERE platform = \'dutchie\''), - this.pool.query(` - SELECT country_code, COUNT(*) as cnt - FROM dutchie_discovery_cities - WHERE platform = 'dutchie' - GROUP BY country_code - ORDER BY cnt DESC - `), - this.pool.query(` - SELECT state_code, country_code, COUNT(*) as cnt - FROM dutchie_discovery_cities - WHERE platform = 'dutchie' AND state_code IS NOT NULL - GROUP BY state_code, country_code - ORDER BY cnt DESC - `), - this.pool.query(` - SELECT COUNT(*) as cnt - FROM dutchie_discovery_cities - WHERE platform = 'dutchie' AND crawl_enabled = TRUE - `), - this.pool.query(` - SELECT COUNT(*) as cnt - FROM dutchie_discovery_cities - WHERE platform = 'dutchie' AND last_crawled_at IS NULL - `), - ]); - - return { - total: parseInt(totalRes.rows[0]?.cnt || '0', 10), - byCountry: byCountryRes.rows.map((r) => ({ - countryCode: r.country_code, - count: parseInt(r.cnt, 10), - })), - byState: byStateRes.rows.map((r) => ({ - stateCode: r.state_code, - countryCode: r.country_code, - count: parseInt(r.cnt, 10), - })), - crawlEnabled: 
parseInt(enabledRes.rows[0]?.cnt || '0', 10), - neverCrawled: parseInt(neverRes.rows[0]?.cnt || '0', 10), - }; - } -} - -export default DtCityDiscoveryService; diff --git a/backend/src/dutchie-az/discovery/DtLocationDiscoveryService.ts b/backend/src/dutchie-az/discovery/DtLocationDiscoveryService.ts deleted file mode 100644 index d6661531..00000000 --- a/backend/src/dutchie-az/discovery/DtLocationDiscoveryService.ts +++ /dev/null @@ -1,1249 +0,0 @@ -/** - * DtLocationDiscoveryService - * - * Core service for Dutchie location discovery. - * Contains shared logic used by multiple entrypoints. - * - * Responsibilities: - * - Fetch locations from city pages - * - Extract geo coordinates when available - * - Upsert locations to dutchie_discovery_locations - * - DO NOT overwrite protected statuses or existing lat/lng - */ - -import { Pool } from 'pg'; -import puppeteer from 'puppeteer-extra'; -import StealthPlugin from 'puppeteer-extra-plugin-stealth'; - -puppeteer.use(StealthPlugin()); - -// ============================================================ -// TYPES -// ============================================================ - -export interface DiscoveryCity { - id: number; - platform: string; - cityName: string; - citySlug: string; - stateCode: string | null; - countryCode: string; - crawlEnabled: boolean; -} - -export interface DutchieLocation { - platformLocationId: string; - platformSlug: string; - platformMenuUrl: string; - name: string; - rawAddress: string | null; - addressLine1: string | null; - addressLine2: string | null; - city: string | null; - stateCode: string | null; - postalCode: string | null; - countryCode: string | null; - latitude: number | null; - longitude: number | null; - timezone: string | null; - offersDelivery: boolean | null; - offersPickup: boolean | null; - isRecreational: boolean | null; - isMedical: boolean | null; - metadata: Record; -} - -export interface LocationDiscoveryResult { - cityId: number; - citySlug: string; - locationsFound: 
number; - locationsInserted: number; - locationsUpdated: number; - locationsSkipped: number; - reportedStoreCount: number | null; - errors: string[]; - durationMs: number; -} - -interface FetchResult { - locations: DutchieLocation[]; - reportedStoreCount: number | null; -} - -export interface BatchDiscoveryResult { - totalCities: number; - totalLocationsFound: number; - totalInserted: number; - totalUpdated: number; - totalSkipped: number; - errors: string[]; - durationMs: number; -} - -// ============================================================ -// COORDINATE EXTRACTION HELPERS -// ============================================================ - -/** - * Extract latitude from various payload formats - */ -function extractLatitude(data: any): number | null { - // Direct lat/latitude fields - if (typeof data.lat === 'number') return data.lat; - if (typeof data.latitude === 'number') return data.latitude; - - // Nested in location object - if (data.location) { - if (typeof data.location.lat === 'number') return data.location.lat; - if (typeof data.location.latitude === 'number') return data.location.latitude; - } - - // Nested in coordinates object - if (data.coordinates) { - if (typeof data.coordinates.lat === 'number') return data.coordinates.lat; - if (typeof data.coordinates.latitude === 'number') return data.coordinates.latitude; - // GeoJSON format [lng, lat] - if (Array.isArray(data.coordinates) && data.coordinates.length >= 2) { - return data.coordinates[1]; - } - } - - // Geometry object (GeoJSON) - if (data.geometry?.coordinates && Array.isArray(data.geometry.coordinates)) { - return data.geometry.coordinates[1]; - } - - // Nested in address - if (data.address) { - if (typeof data.address.lat === 'number') return data.address.lat; - if (typeof data.address.latitude === 'number') return data.address.latitude; - } - - // geo object - if (data.geo) { - if (typeof data.geo.lat === 'number') return data.geo.lat; - if (typeof data.geo.latitude === 'number') 
return data.geo.latitude; - } - - return null; -} - -/** - * Extract longitude from various payload formats - */ -function extractLongitude(data: any): number | null { - // Direct lng/longitude fields - if (typeof data.lng === 'number') return data.lng; - if (typeof data.lon === 'number') return data.lon; - if (typeof data.longitude === 'number') return data.longitude; - - // Nested in location object - if (data.location) { - if (typeof data.location.lng === 'number') return data.location.lng; - if (typeof data.location.lon === 'number') return data.location.lon; - if (typeof data.location.longitude === 'number') return data.location.longitude; - } - - // Nested in coordinates object - if (data.coordinates) { - if (typeof data.coordinates.lng === 'number') return data.coordinates.lng; - if (typeof data.coordinates.lon === 'number') return data.coordinates.lon; - if (typeof data.coordinates.longitude === 'number') return data.coordinates.longitude; - // GeoJSON format [lng, lat] - if (Array.isArray(data.coordinates) && data.coordinates.length >= 2) { - return data.coordinates[0]; - } - } - - // Geometry object (GeoJSON) - if (data.geometry?.coordinates && Array.isArray(data.geometry.coordinates)) { - return data.geometry.coordinates[0]; - } - - // Nested in address - if (data.address) { - if (typeof data.address.lng === 'number') return data.address.lng; - if (typeof data.address.lon === 'number') return data.address.lon; - if (typeof data.address.longitude === 'number') return data.address.longitude; - } - - // geo object - if (data.geo) { - if (typeof data.geo.lng === 'number') return data.geo.lng; - if (typeof data.geo.lon === 'number') return data.geo.lon; - if (typeof data.geo.longitude === 'number') return data.geo.longitude; - } - - return null; -} - -// ============================================================ -// LOCATION FETCHING -// ============================================================ - -/** - * Parse dispensary data from Dutchie's API/JSON 
response with coordinate extraction - */ -function parseDispensaryData(d: any, city: DiscoveryCity): DutchieLocation { - const id = d.id || d._id || d.dispensaryId || ''; - const slug = d.slug || d.cName || d.name?.toLowerCase().replace(/\s+/g, '-') || ''; - - // Build menu URL - let menuUrl = `https://dutchie.com/dispensary/${slug}`; - if (d.menuUrl) { - menuUrl = d.menuUrl; - } else if (d.embeddedMenuUrl) { - menuUrl = d.embeddedMenuUrl; - } - - // Parse address - const address = d.address || d.location?.address || {}; - const rawAddress = [ - address.line1 || address.street1 || d.address1, - address.line2 || address.street2 || d.address2, - [ - address.city || d.city, - address.state || address.stateCode || d.state, - address.zip || address.zipCode || address.postalCode || d.zip, - ] - .filter(Boolean) - .join(' '), - ] - .filter(Boolean) - .join(', '); - - // Extract coordinates from various possible locations in the payload - const latitude = extractLatitude(d); - const longitude = extractLongitude(d); - - if (latitude !== null && longitude !== null) { - console.log(`[DtLocationDiscoveryService] Extracted coordinates for ${slug}: ${latitude}, ${longitude}`); - } - - return { - platformLocationId: id, - platformSlug: slug, - platformMenuUrl: menuUrl, - name: d.name || d.dispensaryName || '', - rawAddress: rawAddress || null, - addressLine1: address.line1 || address.street1 || d.address1 || null, - addressLine2: address.line2 || address.street2 || d.address2 || null, - city: address.city || d.city || city.cityName, - stateCode: address.state || address.stateCode || d.state || city.stateCode, - postalCode: address.zip || address.zipCode || address.postalCode || d.zip || null, - countryCode: address.country || address.countryCode || d.country || city.countryCode, - latitude, - longitude, - timezone: d.timezone || d.timeZone || null, - offersDelivery: d.offerDelivery ?? d.offersDelivery ?? d.delivery ?? null, - offersPickup: d.offerPickup ?? d.offersPickup ?? 
d.pickup ?? null, - isRecreational: d.isRecreational ?? d.recreational ?? (d.retailType === 'recreational' || d.retailType === 'both'), - isMedical: d.isMedical ?? d.medical ?? (d.retailType === 'medical' || d.retailType === 'both'), - metadata: { - source: 'next_data', - retailType: d.retailType, - brand: d.brand, - logo: d.logo || d.logoUrl, - raw: d, - }, - }; -} - -/** - * Fetch locations for a city using Puppeteer - * Returns both locations and Dutchie's reported store count from page header - */ -async function fetchLocationsForCity(city: DiscoveryCity): Promise<FetchResult> { - console.log(`[DtLocationDiscoveryService] Fetching locations for ${city.cityName}, ${city.stateCode}...`); - - const browser = await puppeteer.launch({ - headless: 'new', - args: ['--no-sandbox', '--disable-setuid-sandbox', '--disable-dev-shm-usage'], - }); - - try { - const page = await browser.newPage(); - await page.setUserAgent( - 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36' - ); - - // Use the /us/dispensaries/{city_slug} pattern (NOT /city/{state}/{slug}) - const cityUrl = `https://dutchie.com/us/dispensaries/${city.citySlug}`; - console.log(`[DtLocationDiscoveryService] Navigating to ${cityUrl}...`); - - await page.goto(cityUrl, { - waitUntil: 'networkidle2', - timeout: 60000, - }); - - await new Promise((r) => setTimeout(r, 3000)); - - // Extract reported store count from page header (e.g., "18 dispensaries") - const reportedStoreCount = await page.evaluate(() => { - // Look for patterns like "18 dispensaries", "18 stores", "18 results" - const headerSelectors = [ - 'h1', 'h2', '[data-testid="city-header"]', '[data-testid="results-count"]', - '.results-header', '.city-header', '.page-header' - ]; - - for (const selector of headerSelectors) { - const elements = Array.from(document.querySelectorAll(selector)); - for (const el of elements) { - const text = el.textContent || ''; - // Match patterns like "18 dispensaries", "18 
stores", "18 results", or just "18" followed by word - const match = text.match(/(\d+)\s*(?:dispensar(?:y|ies)|stores?|results?|locations?)/i); - if (match) { - return parseInt(match[1], 10); - } - } - } - - // Also check for count in any element containing "dispensaries" or "stores" - const allText = document.body.innerText; - const globalMatch = allText.match(/(\d+)\s+dispensar(?:y|ies)/i); - if (globalMatch) { - return parseInt(globalMatch[1], 10); - } - - return null; - }); - - if (reportedStoreCount !== null) { - console.log(`[DtLocationDiscoveryService] Dutchie reports ${reportedStoreCount} stores for ${city.citySlug}`); - } - - // Try to extract __NEXT_DATA__ - const nextData = await page.evaluate(() => { - const script = document.querySelector('script#__NEXT_DATA__'); - if (script) { - try { - return JSON.parse(script.textContent || '{}'); - } catch { - return null; - } - } - return null; - }); - - let locations: DutchieLocation[] = []; - - if (nextData?.props?.pageProps?.dispensaries) { - const dispensaries = nextData.props.pageProps.dispensaries; - console.log(`[DtLocationDiscoveryService] Found ${dispensaries.length} dispensaries in __NEXT_DATA__`); - locations = dispensaries.map((d: any) => parseDispensaryData(d, city)); - } else { - // Fall back to DOM scraping - console.log('[DtLocationDiscoveryService] No __NEXT_DATA__, trying DOM scraping...'); - - const scrapedData = await page.evaluate(() => { - const stores: Array<{ - name: string; - href: string; - address: string | null; - }> = []; - - const cards = document.querySelectorAll('[data-testid="dispensary-card"], .dispensary-card, a[href*="/dispensary/"]'); - cards.forEach((card) => { - const link = card.querySelector('a[href*="/dispensary/"]') || (card as HTMLAnchorElement); - const href = (link as HTMLAnchorElement).href || ''; - const name = - card.querySelector('[data-testid="dispensary-name"]')?.textContent || - card.querySelector('h2, h3, .name')?.textContent || - link.textContent || - ''; - 
const address = card.querySelector('[data-testid="dispensary-address"], .address')?.textContent || null; - - if (href && name) { - stores.push({ - name: name.trim(), - href, - address: address?.trim() || null, - }); - } - }); - - return stores; - }); - - console.log(`[DtLocationDiscoveryService] DOM scraping found ${scrapedData.length} raw store cards`); - - locations = scrapedData.map((s) => { - const match = s.href.match(/\/dispensary\/([^/?]+)/); - const slug = match ? match[1] : s.name.toLowerCase().replace(/\s+/g, '-'); - - return { - platformLocationId: slug, - platformSlug: slug, - platformMenuUrl: `https://dutchie.com/dispensary/${slug}`, - name: s.name, - rawAddress: s.address, - addressLine1: null, - addressLine2: null, - city: city.cityName, - stateCode: city.stateCode, - postalCode: null, - countryCode: city.countryCode, - latitude: null, // Not available from DOM scraping - longitude: null, - timezone: null, - offersDelivery: null, - offersPickup: null, - isRecreational: null, - isMedical: null, - metadata: { source: 'dom_scrape', originalUrl: s.href }, - }; - }); - } - - // ========================================================================= - // FILTERING AND DEDUPLICATION - // ========================================================================= - - const beforeFilterCount = locations.length; - - // 1. 
Filter out ghost entries and marketing links - locations = locations.filter((loc) => { - // Filter out slug matching city slug (e.g., /dispensary/ak-anchorage) - if (loc.platformSlug === city.citySlug) { - console.log(`[DtLocationDiscoveryService] Filtering ghost entry: /dispensary/${loc.platformSlug} (matches city slug)`); - return false; - } - - // Filter out marketing/referral links (e.g., try.dutchie.com/dispensary/referral/) - if (!loc.platformMenuUrl.startsWith('https://dutchie.com/dispensary/')) { - console.log(`[DtLocationDiscoveryService] Filtering non-store URL: ${loc.platformMenuUrl}`); - return false; - } - - // Filter out generic marketing slugs - const marketingSlugs = ['referral', 'refer-a-dispensary', 'sign-up', 'signup']; - if (marketingSlugs.includes(loc.platformSlug.toLowerCase())) { - console.log(`[DtLocationDiscoveryService] Filtering marketing slug: ${loc.platformSlug}`); - return false; - } - - return true; - }); - - // 2. Deduplicate by platformMenuUrl (unique store URL) - const seenUrls = new Set(); - locations = locations.filter((loc) => { - if (seenUrls.has(loc.platformMenuUrl)) { - return false; - } - seenUrls.add(loc.platformMenuUrl); - return true; - }); - - const afterFilterCount = locations.length; - if (beforeFilterCount !== afterFilterCount) { - console.log(`[DtLocationDiscoveryService] Filtered: ${beforeFilterCount} -> ${afterFilterCount} (removed ${beforeFilterCount - afterFilterCount} ghost/duplicate entries)`); - } - - // Log comparison for QA - console.log(`[DtLocationDiscoveryService] [${city.citySlug}] reported_store_count=${reportedStoreCount ?? 
'N/A'}, scraped_store_count=${afterFilterCount}`); - if (reportedStoreCount !== null && reportedStoreCount !== afterFilterCount) { - console.log(`[DtLocationDiscoveryService] [${city.citySlug}] MISMATCH: Dutchie reports ${reportedStoreCount}, we scraped ${afterFilterCount}`); - } - - return { locations, reportedStoreCount }; - } finally { - await browser.close(); - } -} - -// ============================================================ -// DATABASE OPERATIONS -// ============================================================ - -/** - * Upsert a location into dutchie_discovery_locations - * - Does NOT overwrite status if already verified/merged/rejected - * - Does NOT overwrite dispensary_id if already set - * - Does NOT overwrite existing lat/lng (only fills nulls) - */ -async function upsertLocation( - pool: Pool, - location: DutchieLocation, - cityId: number -): Promise<{ inserted: boolean; updated: boolean; skipped: boolean }> { - // First check if this location exists and has a protected status - const existing = await pool.query( - ` - SELECT id, status, dispensary_id, latitude, longitude - FROM dutchie_discovery_locations - WHERE platform = 'dutchie' AND platform_location_id = $1 - `, - [location.platformLocationId] - ); - - if (existing.rows.length > 0) { - const row = existing.rows[0]; - const protectedStatuses = ['verified', 'merged', 'rejected']; - - if (protectedStatuses.includes(row.status)) { - // Only update last_seen_at for protected statuses - // But still update coordinates if they were null and we now have them - await pool.query( - ` - UPDATE dutchie_discovery_locations - SET - last_seen_at = NOW(), - updated_at = NOW(), - latitude = CASE WHEN latitude IS NULL THEN $2 ELSE latitude END, - longitude = CASE WHEN longitude IS NULL THEN $3 ELSE longitude END - WHERE id = $1 - `, - [row.id, location.latitude, location.longitude] - ); - return { inserted: false, updated: false, skipped: true }; - } - - // Update existing discovered location - // Preserve 
existing lat/lng if already set (only fill nulls) - await pool.query( - ` - UPDATE dutchie_discovery_locations - SET - platform_slug = $2, - platform_menu_url = $3, - name = $4, - raw_address = COALESCE($5, raw_address), - address_line1 = COALESCE($6, address_line1), - address_line2 = COALESCE($7, address_line2), - city = COALESCE($8, city), - state_code = COALESCE($9, state_code), - postal_code = COALESCE($10, postal_code), - country_code = COALESCE($11, country_code), - latitude = CASE WHEN latitude IS NULL THEN $12 ELSE latitude END, - longitude = CASE WHEN longitude IS NULL THEN $13 ELSE longitude END, - timezone = COALESCE($14, timezone), - offers_delivery = COALESCE($15, offers_delivery), - offers_pickup = COALESCE($16, offers_pickup), - is_recreational = COALESCE($17, is_recreational), - is_medical = COALESCE($18, is_medical), - metadata = COALESCE($19, metadata), - discovery_city_id = $20, - last_seen_at = NOW(), - updated_at = NOW() - WHERE id = $1 - `, - [ - row.id, - location.platformSlug, - location.platformMenuUrl, - location.name, - location.rawAddress, - location.addressLine1, - location.addressLine2, - location.city, - location.stateCode, - location.postalCode, - location.countryCode, - location.latitude, - location.longitude, - location.timezone, - location.offersDelivery, - location.offersPickup, - location.isRecreational, - location.isMedical, - JSON.stringify(location.metadata), - cityId, - ] - ); - return { inserted: false, updated: true, skipped: false }; - } - - // Insert new location - await pool.query( - ` - INSERT INTO dutchie_discovery_locations ( - platform, - platform_location_id, - platform_slug, - platform_menu_url, - name, - raw_address, - address_line1, - address_line2, - city, - state_code, - postal_code, - country_code, - latitude, - longitude, - timezone, - status, - offers_delivery, - offers_pickup, - is_recreational, - is_medical, - metadata, - discovery_city_id, - first_seen_at, - last_seen_at, - active, - created_at, - 
updated_at - ) VALUES ( - 'dutchie', - $1, $2, $3, $4, $5, $6, $7, $8, $9, $10, $11, $12, $13, $14, - 'discovered', - $15, $16, $17, $18, $19, $20, - NOW(), NOW(), TRUE, NOW(), NOW() - ) - `, - [ - location.platformLocationId, - location.platformSlug, - location.platformMenuUrl, - location.name, - location.rawAddress, - location.addressLine1, - location.addressLine2, - location.city, - location.stateCode, - location.postalCode, - location.countryCode, - location.latitude, - location.longitude, - location.timezone, - location.offersDelivery, - location.offersPickup, - location.isRecreational, - location.isMedical, - JSON.stringify(location.metadata), - cityId, - ] - ); - - return { inserted: true, updated: false, skipped: false }; -} - -// ============================================================ -// MAIN SERVICE CLASS -// ============================================================ - -export class DtLocationDiscoveryService { - constructor(private pool: Pool) {} - - /** - * Get a city by slug - */ - async getCityBySlug(citySlug: string): Promise { - const { rows } = await this.pool.query( - ` - SELECT id, platform, city_name, city_slug, state_code, country_code, crawl_enabled - FROM dutchie_discovery_cities - WHERE platform = 'dutchie' AND city_slug = $1 - LIMIT 1 - `, - [citySlug] - ); - - if (rows.length === 0) return null; - - const r = rows[0]; - return { - id: r.id, - platform: r.platform, - cityName: r.city_name, - citySlug: r.city_slug, - stateCode: r.state_code, - countryCode: r.country_code, - crawlEnabled: r.crawl_enabled, - }; - } - - /** - * Get all crawl-enabled cities - */ - async getEnabledCities(limit?: number): Promise { - const { rows } = await this.pool.query( - ` - SELECT id, platform, city_name, city_slug, state_code, country_code, crawl_enabled - FROM dutchie_discovery_cities - WHERE platform = 'dutchie' AND crawl_enabled = TRUE - ORDER BY last_crawled_at ASC NULLS FIRST, city_name ASC - ${limit ? 
`LIMIT ${limit}` : ''} - ` - ); - - return rows.map((r) => ({ - id: r.id, - platform: r.platform, - cityName: r.city_name, - citySlug: r.city_slug, - stateCode: r.state_code, - countryCode: r.country_code, - crawlEnabled: r.crawl_enabled, - })); - } - - /** - * Discover locations for a single city - */ - async discoverForCity(city: DiscoveryCity): Promise { - const startTime = Date.now(); - const errors: string[] = []; - let locationsFound = 0; - let locationsInserted = 0; - let locationsUpdated = 0; - let locationsSkipped = 0; - let reportedStoreCount: number | null = null; - - console.log(`[DtLocationDiscoveryService] Discovering locations for ${city.cityName}, ${city.stateCode}...`); - - try { - const fetchResult = await fetchLocationsForCity(city); - const locations = fetchResult.locations; - reportedStoreCount = fetchResult.reportedStoreCount; - - locationsFound = locations.length; - console.log(`[DtLocationDiscoveryService] Found ${locationsFound} locations`); - - // Count how many have coordinates - const withCoords = locations.filter(l => l.latitude !== null && l.longitude !== null).length; - if (withCoords > 0) { - console.log(`[DtLocationDiscoveryService] ${withCoords}/${locationsFound} locations have coordinates`); - } - - for (const location of locations) { - try { - const result = await upsertLocation(this.pool, location, city.id); - if (result.inserted) locationsInserted++; - else if (result.updated) locationsUpdated++; - else if (result.skipped) locationsSkipped++; - } catch (error: any) { - const msg = `Failed to upsert location ${location.platformSlug}: ${error.message}`; - console.error(`[DtLocationDiscoveryService] ${msg}`); - errors.push(msg); - } - } - - // Update city's last_crawled_at, location_count, and reported_store_count in metadata - await this.pool.query( - ` - UPDATE dutchie_discovery_cities - SET last_crawled_at = NOW(), - location_count = $1, - metadata = COALESCE(metadata, '{}')::jsonb || jsonb_build_object( - 
'reported_store_count', $3::int, - 'scraped_store_count', $1::int, - 'last_discovery_at', NOW()::text - ), - updated_at = NOW() - WHERE id = $2 - `, - [locationsFound, city.id, reportedStoreCount] - ); - } catch (error: any) { - const msg = `Location discovery failed for ${city.citySlug}: ${error.message}`; - console.error(`[DtLocationDiscoveryService] ${msg}`); - errors.push(msg); - } - - const durationMs = Date.now() - startTime; - - console.log(`[DtLocationDiscoveryService] City ${city.citySlug} complete:`); - console.log(` Reported count: ${reportedStoreCount ?? 'N/A'}`); - console.log(` Locations found: ${locationsFound}`); - console.log(` Inserted: ${locationsInserted}`); - console.log(` Updated: ${locationsUpdated}`); - console.log(` Skipped (protected): ${locationsSkipped}`); - console.log(` Errors: ${errors.length}`); - console.log(` Duration: ${(durationMs / 1000).toFixed(1)}s`); - - return { - cityId: city.id, - citySlug: city.citySlug, - locationsFound, - locationsInserted, - locationsUpdated, - locationsSkipped, - reportedStoreCount, - errors, - durationMs, - }; - } - - /** - * Discover locations for all enabled cities - */ - async discoverAllEnabled(options: { - limit?: number; - delayMs?: number; - } = {}): Promise { - const { limit, delayMs = 2000 } = options; - const startTime = Date.now(); - let totalLocationsFound = 0; - let totalInserted = 0; - let totalUpdated = 0; - let totalSkipped = 0; - const allErrors: string[] = []; - - const cities = await this.getEnabledCities(limit); - console.log(`[DtLocationDiscoveryService] Discovering locations for ${cities.length} cities...`); - - for (let i = 0; i < cities.length; i++) { - const city = cities[i]; - console.log(`\n[DtLocationDiscoveryService] City ${i + 1}/${cities.length}: ${city.cityName}, ${city.stateCode}`); - - try { - const result = await this.discoverForCity(city); - totalLocationsFound += result.locationsFound; - totalInserted += result.locationsInserted; - totalUpdated += 
result.locationsUpdated; - totalSkipped += result.locationsSkipped; - allErrors.push(...result.errors); - } catch (error: any) { - allErrors.push(`City ${city.citySlug} failed: ${error.message}`); - } - - if (i < cities.length - 1 && delayMs > 0) { - await new Promise((r) => setTimeout(r, delayMs)); - } - } - - const durationMs = Date.now() - startTime; - - return { - totalCities: cities.length, - totalLocationsFound, - totalInserted, - totalUpdated, - totalSkipped, - errors: allErrors, - durationMs, - }; - } - - /** - * Get location statistics - */ - async getStats(): Promise<{ - total: number; - withCoordinates: number; - byStatus: Array<{ status: string; count: number }>; - byState: Array<{ stateCode: string; count: number }>; - }> { - const [totalRes, coordsRes, byStatusRes, byStateRes] = await Promise.all([ - this.pool.query(` - SELECT COUNT(*) as cnt FROM dutchie_discovery_locations - WHERE platform = 'dutchie' AND active = TRUE - `), - this.pool.query(` - SELECT COUNT(*) as cnt FROM dutchie_discovery_locations - WHERE platform = 'dutchie' AND active = TRUE - AND latitude IS NOT NULL AND longitude IS NOT NULL - `), - this.pool.query(` - SELECT status, COUNT(*) as cnt - FROM dutchie_discovery_locations - WHERE platform = 'dutchie' AND active = TRUE - GROUP BY status - ORDER BY cnt DESC - `), - this.pool.query(` - SELECT state_code, COUNT(*) as cnt - FROM dutchie_discovery_locations - WHERE platform = 'dutchie' AND active = TRUE AND state_code IS NOT NULL - GROUP BY state_code - ORDER BY cnt DESC - LIMIT 20 - `), - ]); - - return { - total: parseInt(totalRes.rows[0]?.cnt || '0', 10), - withCoordinates: parseInt(coordsRes.rows[0]?.cnt || '0', 10), - byStatus: byStatusRes.rows.map((r) => ({ - status: r.status, - count: parseInt(r.cnt, 10), - })), - byState: byStateRes.rows.map((r) => ({ - stateCode: r.state_code, - count: parseInt(r.cnt, 10), - })), - }; - } - - // ============================================================ - // ALICE - FULL DISCOVERY FROM 
/CITIES PAGE - // ============================================================ - - /** - * Fetch all states and cities from https://dutchie.com/cities - * Returns the complete hierarchy of states -> cities - */ - async fetchCitiesFromMasterPage(): Promise<{ - states: Array<{ - stateCode: string; - stateName: string; - cities: Array<{ cityName: string; citySlug: string; storeCount?: number }>; - }>; - errors: string[]; - }> { - console.log('[Alice] Fetching master cities page from https://dutchie.com/cities...'); - - const browser = await puppeteer.launch({ - headless: 'new', - args: ['--no-sandbox', '--disable-setuid-sandbox', '--disable-dev-shm-usage'], - }); - - try { - const page = await browser.newPage(); - await page.setUserAgent( - 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36' - ); - - await page.goto('https://dutchie.com/cities', { - waitUntil: 'networkidle2', - timeout: 60000, - }); - - await new Promise((r) => setTimeout(r, 3000)); - - // Try to extract from __NEXT_DATA__ - const citiesData = await page.evaluate(() => { - const script = document.querySelector('script#__NEXT_DATA__'); - if (script) { - try { - const data = JSON.parse(script.textContent || '{}'); - return data?.props?.pageProps || null; - } catch { - return null; - } - } - return null; - }); - - const states: Array<{ - stateCode: string; - stateName: string; - cities: Array<{ cityName: string; citySlug: string; storeCount?: number }>; - }> = []; - const errors: string[] = []; - - if (citiesData?.states || citiesData?.regions) { - // Parse from structured data - const statesList = citiesData.states || citiesData.regions || []; - for (const state of statesList) { - const stateCode = state.code || state.stateCode || state.abbreviation || ''; - const stateName = state.name || state.stateName || ''; - const cities = (state.cities || []).map((c: any) => ({ - cityName: c.name || c.cityName || '', - citySlug: c.slug || c.citySlug || 
c.name?.toLowerCase().replace(/\s+/g, '-') || '', - storeCount: c.dispensaryCount || c.storeCount || undefined, - })); - if (stateCode && cities.length > 0) { - states.push({ stateCode, stateName, cities }); - } - } - } else { - // Fallback: DOM scraping - console.log('[Alice] No __NEXT_DATA__, attempting DOM scrape...'); - const scrapedStates = await page.evaluate(() => { - const result: Array<{ - stateCode: string; - stateName: string; - cities: Array<{ cityName: string; citySlug: string }>; - }> = []; - - // Look for state sections - const stateHeaders = document.querySelectorAll('h2, h3, [data-testid*="state"]'); - stateHeaders.forEach((header) => { - const stateName = header.textContent?.trim() || ''; - // Try to extract state code from data attributes or guess from name - const stateCode = (header as HTMLElement).dataset?.stateCode || - stateName.substring(0, 2).toUpperCase(); - - // Find city links following this header - const container = header.closest('section') || header.parentElement; - const cityLinks = container?.querySelectorAll('a[href*="/dispensaries/"]') || []; - const cities: Array<{ cityName: string; citySlug: string }> = []; - - cityLinks.forEach((link) => { - const href = (link as HTMLAnchorElement).href || ''; - const match = href.match(/\/dispensaries\/([^/?]+)/); - if (match) { - cities.push({ - cityName: link.textContent?.trim() || '', - citySlug: match[1], - }); - } - }); - - if (stateName && cities.length > 0) { - result.push({ stateCode, stateName, cities }); - } - }); - - return result; - }); - - states.push(...scrapedStates); - - if (states.length === 0) { - errors.push('Could not parse cities from master page'); - } - } - - console.log(`[Alice] Found ${states.length} states with cities from master page`); - return { states, errors }; - - } finally { - await browser.close(); - } - } - - /** - * Upsert cities from master page discovery - */ - async upsertCitiesFromMaster(states: Array<{ - stateCode: string; - stateName: string; - 
cities: Array<{ cityName: string; citySlug: string; storeCount?: number }>; - }>): Promise<{ inserted: number; updated: number }> { - let inserted = 0; - let updated = 0; - - for (const state of states) { - for (const city of state.cities) { - const existing = await this.pool.query( - `SELECT id FROM dutchie_discovery_cities - WHERE platform = 'dutchie' AND city_slug = $1`, - [city.citySlug] - ); - - if (existing.rows.length === 0) { - // Insert new city - await this.pool.query( - `INSERT INTO dutchie_discovery_cities ( - platform, city_name, city_slug, state_code, state_name, - country_code, crawl_enabled, discovered_at, last_verified_at, - store_count_reported, created_at, updated_at - ) VALUES ($1, $2, $3, $4, $5, $6, $7, NOW(), NOW(), $8, NOW(), NOW())`, - [ - 'dutchie', - city.cityName, - city.citySlug, - state.stateCode, - state.stateName, - 'US', - true, - city.storeCount || null, - ] - ); - inserted++; - } else { - // Update existing city - await this.pool.query( - `UPDATE dutchie_discovery_cities SET - city_name = COALESCE($2, city_name), - state_code = COALESCE($3, state_code), - state_name = COALESCE($4, state_name), - last_verified_at = NOW(), - store_count_reported = COALESCE($5, store_count_reported), - updated_at = NOW() - WHERE id = $1`, - [existing.rows[0].id, city.cityName, state.stateCode, state.stateName, city.storeCount] - ); - updated++; - } - } - } - - return { inserted, updated }; - } - - /** - * Detect stores that have been removed from source - * Mark them as retired instead of deleting - */ - async detectAndMarkRemovedStores( - currentLocationIds: Set - ): Promise<{ retiredCount: number; retiredIds: string[] }> { - // Get all active locations we know about - const { rows: existingLocations } = await this.pool.query<{ - id: number; - platform_location_id: string; - name: string; - }>(` - SELECT id, platform_location_id, name - FROM dutchie_discovery_locations - WHERE platform = 'dutchie' - AND active = TRUE - AND retired_at IS NULL - `); - 
- const retiredIds: string[] = []; - - for (const loc of existingLocations) { - if (!currentLocationIds.has(loc.platform_location_id)) { - // This store no longer appears in source - mark as retired - await this.pool.query( - `UPDATE dutchie_discovery_locations SET - active = FALSE, - retired_at = NOW(), - retirement_reason = 'removed_from_source', - updated_at = NOW() - WHERE id = $1`, - [loc.id] - ); - retiredIds.push(loc.platform_location_id); - console.log(`[Alice] Marked store as retired: ${loc.name} (${loc.platform_location_id})`); - } - } - - return { retiredCount: retiredIds.length, retiredIds }; - } - - /** - * Detect and track slug changes - */ - async detectSlugChanges( - locationId: string, - newSlug: string - ): Promise<{ changed: boolean; previousSlug?: string }> { - const { rows } = await this.pool.query<{ platform_slug: string }>( - `SELECT platform_slug FROM dutchie_discovery_locations - WHERE platform = 'dutchie' AND platform_location_id = $1`, - [locationId] - ); - - if (rows.length === 0) return { changed: false }; - - const currentSlug = rows[0].platform_slug; - if (currentSlug && currentSlug !== newSlug) { - // Slug changed - update with tracking - await this.pool.query( - `UPDATE dutchie_discovery_locations SET - platform_slug = $1, - previous_slug = $2, - slug_changed_at = NOW(), - updated_at = NOW() - WHERE platform = 'dutchie' AND platform_location_id = $3`, - [newSlug, currentSlug, locationId] - ); - console.log(`[Alice] Slug change detected: ${currentSlug} -> ${newSlug}`); - return { changed: true, previousSlug: currentSlug }; - } - - return { changed: false }; - } - - /** - * Full discovery run with change detection (Alice's main job) - * Fetches from /cities, discovers all stores, detects changes - */ - async runFullDiscoveryWithChangeDetection(options: { - scope?: { states?: string[]; storeIds?: number[] }; - delayMs?: number; - } = {}): Promise<{ - statesDiscovered: number; - citiesDiscovered: number; - newStoreCount: number; - 
removedStoreCount: number; - updatedStoreCount: number; - slugChangedCount: number; - totalLocationsFound: number; - errors: string[]; - durationMs: number; - }> { - const startTime = Date.now(); - const { scope, delayMs = 2000 } = options; - const errors: string[] = []; - let slugChangedCount = 0; - - console.log('[Alice] Starting full discovery with change detection...'); - if (scope?.states) { - console.log(`[Alice] Scope limited to states: ${scope.states.join(', ')}`); - } - - // Step 1: Fetch master cities page - const { states: masterStates, errors: fetchErrors } = await this.fetchCitiesFromMasterPage(); - errors.push(...fetchErrors); - - // Filter by scope if provided - const statesToProcess = scope?.states - ? masterStates.filter(s => scope.states!.includes(s.stateCode)) - : masterStates; - - // Step 2: Upsert cities - const citiesResult = await this.upsertCitiesFromMaster(statesToProcess); - console.log(`[Alice] Cities: ${citiesResult.inserted} new, ${citiesResult.updated} updated`); - - // Step 3: Discover locations for each city - const allLocationIds = new Set(); - let totalLocationsFound = 0; - let totalInserted = 0; - let totalUpdated = 0; - - const cities = await this.getEnabledCities(); - const citiesToProcess = scope?.states - ? 
cities.filter(c => c.stateCode && scope.states!.includes(c.stateCode)) - : cities; - - for (let i = 0; i < citiesToProcess.length; i++) { - const city = citiesToProcess[i]; - console.log(`[Alice] City ${i + 1}/${citiesToProcess.length}: ${city.cityName}, ${city.stateCode}`); - - try { - const result = await this.discoverForCity(city); - totalLocationsFound += result.locationsFound; - totalInserted += result.locationsInserted; - totalUpdated += result.locationsUpdated; - errors.push(...result.errors); - - // Track all discovered location IDs for removal detection - // (This requires modifying discoverForCity to return IDs, or query them after) - - } catch (error: any) { - errors.push(`City ${city.citySlug}: ${error.message}`); - } - - if (i < citiesToProcess.length - 1 && delayMs > 0) { - await new Promise((r) => setTimeout(r, delayMs)); - } - } - - // Step 4: Get all current active location IDs for removal detection - const { rows: currentLocations } = await this.pool.query<{ platform_location_id: string }>( - `SELECT platform_location_id FROM dutchie_discovery_locations - WHERE platform = 'dutchie' AND active = TRUE AND last_seen_at > NOW() - INTERVAL '1 day'` - ); - currentLocations.forEach(loc => allLocationIds.add(loc.platform_location_id)); - - // Step 5: Detect removed stores (only if we had a successful discovery) - let removedResult = { retiredCount: 0, retiredIds: [] as string[] }; - if (totalLocationsFound > 0 && !scope) { - // Only detect removals on full (unscoped) runs - removedResult = await this.detectAndMarkRemovedStores(allLocationIds); - } - - const durationMs = Date.now() - startTime; - - console.log('[Alice] Full discovery complete:'); - console.log(` States: ${statesToProcess.length}`); - console.log(` Cities: ${citiesToProcess.length}`); - console.log(` Locations found: ${totalLocationsFound}`); - console.log(` New: ${totalInserted}, Updated: ${totalUpdated}`); - console.log(` Removed: ${removedResult.retiredCount}`); - console.log(` Duration: 
${(durationMs / 1000).toFixed(1)}s`); - - return { - statesDiscovered: statesToProcess.length, - citiesDiscovered: citiesToProcess.length, - newStoreCount: totalInserted, - removedStoreCount: removedResult.retiredCount, - updatedStoreCount: totalUpdated, - slugChangedCount, - totalLocationsFound, - errors, - durationMs, - }; - } -} - -export default DtLocationDiscoveryService; diff --git a/backend/src/dutchie-az/discovery/DutchieCityDiscovery.ts b/backend/src/dutchie-az/discovery/DutchieCityDiscovery.ts deleted file mode 100644 index cbde7dd7..00000000 --- a/backend/src/dutchie-az/discovery/DutchieCityDiscovery.ts +++ /dev/null @@ -1,390 +0,0 @@ -/** - * DutchieCityDiscovery - * - * Discovers cities from Dutchie's /cities page and upserts to dutchie_discovery_cities. - * - * Responsibilities: - * - Fetch all cities available on Dutchie - * - For each city derive: city_name, city_slug, state_code, country_code - * - Upsert into dutchie_discovery_cities - */ - -import { Pool } from 'pg'; -import axios from 'axios'; -import puppeteer from 'puppeteer-extra'; -import StealthPlugin from 'puppeteer-extra-plugin-stealth'; -import type { Browser, Page } from 'puppeteer'; - -puppeteer.use(StealthPlugin()); - -// ============================================================ -// TYPES -// ============================================================ - -export interface DutchieCity { - name: string; - slug: string; - stateCode: string | null; - countryCode: string; - url?: string; -} - -export interface CityDiscoveryResult { - citiesFound: number; - citiesInserted: number; - citiesUpdated: number; - errors: string[]; - durationMs: number; -} - -// ============================================================ -// US STATE CODE MAPPING -// ============================================================ - -const US_STATE_MAP: Record<string, string> = { - 'alabama': 'AL', 'alaska': 'AK', 'arizona': 'AZ', 'arkansas': 'AR', - 'california': 'CA', 'colorado': 'CO', 'connecticut': 'CT', 'delaware': 'DE', - 
'florida': 'FL', 'georgia': 'GA', 'hawaii': 'HI', 'idaho': 'ID', - 'illinois': 'IL', 'indiana': 'IN', 'iowa': 'IA', 'kansas': 'KS', - 'kentucky': 'KY', 'louisiana': 'LA', 'maine': 'ME', 'maryland': 'MD', - 'massachusetts': 'MA', 'michigan': 'MI', 'minnesota': 'MN', 'mississippi': 'MS', - 'missouri': 'MO', 'montana': 'MT', 'nebraska': 'NE', 'nevada': 'NV', - 'new-hampshire': 'NH', 'new-jersey': 'NJ', 'new-mexico': 'NM', 'new-york': 'NY', - 'north-carolina': 'NC', 'north-dakota': 'ND', 'ohio': 'OH', 'oklahoma': 'OK', - 'oregon': 'OR', 'pennsylvania': 'PA', 'rhode-island': 'RI', 'south-carolina': 'SC', - 'south-dakota': 'SD', 'tennessee': 'TN', 'texas': 'TX', 'utah': 'UT', - 'vermont': 'VT', 'virginia': 'VA', 'washington': 'WA', 'west-virginia': 'WV', - 'wisconsin': 'WI', 'wyoming': 'WY', 'district-of-columbia': 'DC', -}; - -// Canadian province mapping -const CA_PROVINCE_MAP: Record<string, string> = { - 'alberta': 'AB', 'british-columbia': 'BC', 'manitoba': 'MB', - 'new-brunswick': 'NB', 'newfoundland-and-labrador': 'NL', - 'northwest-territories': 'NT', 'nova-scotia': 'NS', 'nunavut': 'NU', - 'ontario': 'ON', 'prince-edward-island': 'PE', 'quebec': 'QC', - 'saskatchewan': 'SK', 'yukon': 'YT', -}; - -// ============================================================ -// CITY FETCHING -// ============================================================ - -/** - * Fetch cities from Dutchie's /cities page using Puppeteer to extract data. 
- */ -async function fetchCitiesFromDutchie(): Promise<DutchieCity[]> { - console.log('[DutchieCityDiscovery] Launching browser to fetch cities...'); - - const browser = await puppeteer.launch({ - headless: 'new', - args: ['--no-sandbox', '--disable-setuid-sandbox', '--disable-dev-shm-usage'], - }); - - try { - const page = await browser.newPage(); - await page.setUserAgent( - 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36' - ); - - // Navigate to cities page - console.log('[DutchieCityDiscovery] Navigating to https://dutchie.com/cities...'); - await page.goto('https://dutchie.com/cities', { - waitUntil: 'networkidle2', - timeout: 60000, - }); - - // Wait for content to load - await new Promise((r) => setTimeout(r, 3000)); - - // Extract city links from the page - const cities = await page.evaluate(() => { - const cityLinks: Array<{ - name: string; - slug: string; - url: string; - stateSlug: string | null; - }> = []; - - // Find all city links - they typically follow pattern /city/{state}/{city} - const links = document.querySelectorAll('a[href*="/city/"]'); - links.forEach((link) => { - const href = (link as HTMLAnchorElement).href; - const text = (link as HTMLElement).innerText?.trim(); - - // Parse URL: https://dutchie.com/city/{state}/{city} - const match = href.match(/\/city\/([^/]+)\/([^/?]+)/); - if (match && text) { - cityLinks.push({ - name: text, - slug: match[2], - url: href, - stateSlug: match[1], - }); - } - }); - - return cityLinks; - }); - - console.log(`[DutchieCityDiscovery] Extracted ${cities.length} city links from page`); - - // Convert to DutchieCity format - const result: DutchieCity[] = []; - - for (const city of cities) { - // Determine country and state code - let countryCode = 'US'; - let stateCode: string | null = null; - - if (city.stateSlug) { - // Check if it's a US state - if (US_STATE_MAP[city.stateSlug]) { - stateCode = US_STATE_MAP[city.stateSlug]; - countryCode = 'US'; - } - // 
Check if it's a Canadian province - else if (CA_PROVINCE_MAP[city.stateSlug]) { - stateCode = CA_PROVINCE_MAP[city.stateSlug]; - countryCode = 'CA'; - } - // Check if it's already a 2-letter code - else if (city.stateSlug.length === 2) { - stateCode = city.stateSlug.toUpperCase(); - // Determine country based on state code - if (Object.values(CA_PROVINCE_MAP).includes(stateCode)) { - countryCode = 'CA'; - } - } - } - - result.push({ - name: city.name, - slug: city.slug, - stateCode, - countryCode, - url: city.url, - }); - } - - return result; - } finally { - await browser.close(); - } -} - -/** - * Alternative: Fetch cities by making API/GraphQL requests. - * Falls back to this if scraping fails. - */ -async function fetchCitiesFromAPI(): Promise { - console.log('[DutchieCityDiscovery] Attempting API-based city discovery...'); - - // Dutchie may have an API endpoint for cities - // Try common patterns - const apiEndpoints = [ - 'https://dutchie.com/api/cities', - 'https://api.dutchie.com/v1/cities', - ]; - - for (const endpoint of apiEndpoints) { - try { - const response = await axios.get(endpoint, { - headers: { - 'User-Agent': - 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 Chrome/120.0.0.0', - Accept: 'application/json', - }, - timeout: 15000, - }); - - if (response.data && Array.isArray(response.data)) { - console.log(`[DutchieCityDiscovery] API returned ${response.data.length} cities`); - return response.data.map((c: any) => ({ - name: c.name || c.city, - slug: c.slug || c.citySlug, - stateCode: c.stateCode || c.state, - countryCode: c.countryCode || c.country || 'US', - })); - } - } catch (error: any) { - console.log(`[DutchieCityDiscovery] API ${endpoint} failed: ${error.message}`); - } - } - - return []; -} - -// ============================================================ -// DATABASE OPERATIONS -// ============================================================ - -/** - * Upsert a city into dutchie_discovery_cities - */ -async function 
upsertCity( - pool: Pool, - city: DutchieCity -): Promise<{ inserted: boolean; updated: boolean }> { - const result = await pool.query( - ` - INSERT INTO dutchie_discovery_cities ( - platform, - city_name, - city_slug, - state_code, - country_code, - last_crawled_at, - updated_at - ) VALUES ( - 'dutchie', - $1, - $2, - $3, - $4, - NOW(), - NOW() - ) - ON CONFLICT (platform, country_code, state_code, city_slug) - DO UPDATE SET - city_name = EXCLUDED.city_name, - last_crawled_at = NOW(), - updated_at = NOW() - RETURNING (xmax = 0) AS inserted - `, - [city.name, city.slug, city.stateCode, city.countryCode] - ); - - const inserted = result.rows[0]?.inserted === true; - return { inserted, updated: !inserted }; -} - -// ============================================================ -// MAIN DISCOVERY FUNCTION -// ============================================================ - -export class DutchieCityDiscovery { - private pool: Pool; - - constructor(pool: Pool) { - this.pool = pool; - } - - /** - * Run the city discovery process - */ - async run(): Promise { - const startTime = Date.now(); - const errors: string[] = []; - let citiesFound = 0; - let citiesInserted = 0; - let citiesUpdated = 0; - - console.log('[DutchieCityDiscovery] Starting city discovery...'); - - try { - // Try scraping first, fall back to API - let cities = await fetchCitiesFromDutchie(); - - if (cities.length === 0) { - console.log('[DutchieCityDiscovery] Scraping returned 0 cities, trying API...'); - cities = await fetchCitiesFromAPI(); - } - - citiesFound = cities.length; - console.log(`[DutchieCityDiscovery] Found ${citiesFound} cities`); - - // Upsert each city - for (const city of cities) { - try { - const result = await upsertCity(this.pool, city); - if (result.inserted) { - citiesInserted++; - } else if (result.updated) { - citiesUpdated++; - } - } catch (error: any) { - const msg = `Failed to upsert city ${city.slug}: ${error.message}`; - console.error(`[DutchieCityDiscovery] ${msg}`); - 
errors.push(msg); - } - } - } catch (error: any) { - const msg = `City discovery failed: ${error.message}`; - console.error(`[DutchieCityDiscovery] ${msg}`); - errors.push(msg); - } - - const durationMs = Date.now() - startTime; - - console.log('[DutchieCityDiscovery] Discovery complete:'); - console.log(` Cities found: ${citiesFound}`); - console.log(` Inserted: ${citiesInserted}`); - console.log(` Updated: ${citiesUpdated}`); - console.log(` Errors: ${errors.length}`); - console.log(` Duration: ${(durationMs / 1000).toFixed(1)}s`); - - return { - citiesFound, - citiesInserted, - citiesUpdated, - errors, - durationMs, - }; - } - - /** - * Get statistics about discovered cities - */ - async getStats(): Promise<{ - total: number; - byCountry: Array<{ countryCode: string; count: number }>; - byState: Array<{ stateCode: string; countryCode: string; count: number }>; - crawlEnabled: number; - neverCrawled: number; - }> { - const [totalRes, byCountryRes, byStateRes, enabledRes, neverRes] = await Promise.all([ - this.pool.query('SELECT COUNT(*) as cnt FROM dutchie_discovery_cities'), - this.pool.query(` - SELECT country_code, COUNT(*) as cnt - FROM dutchie_discovery_cities - GROUP BY country_code - ORDER BY cnt DESC - `), - this.pool.query(` - SELECT state_code, country_code, COUNT(*) as cnt - FROM dutchie_discovery_cities - WHERE state_code IS NOT NULL - GROUP BY state_code, country_code - ORDER BY cnt DESC - `), - this.pool.query(` - SELECT COUNT(*) as cnt - FROM dutchie_discovery_cities - WHERE crawl_enabled = TRUE - `), - this.pool.query(` - SELECT COUNT(*) as cnt - FROM dutchie_discovery_cities - WHERE last_crawled_at IS NULL - `), - ]); - - return { - total: parseInt(totalRes.rows[0]?.cnt || '0', 10), - byCountry: byCountryRes.rows.map((r) => ({ - countryCode: r.country_code, - count: parseInt(r.cnt, 10), - })), - byState: byStateRes.rows.map((r) => ({ - stateCode: r.state_code, - countryCode: r.country_code, - count: parseInt(r.cnt, 10), - })), - crawlEnabled: 
parseInt(enabledRes.rows[0]?.cnt || '0', 10), - neverCrawled: parseInt(neverRes.rows[0]?.cnt || '0', 10), - }; - } -} - -export default DutchieCityDiscovery; diff --git a/backend/src/dutchie-az/discovery/DutchieLocationDiscovery.ts b/backend/src/dutchie-az/discovery/DutchieLocationDiscovery.ts deleted file mode 100644 index 2bb27e17..00000000 --- a/backend/src/dutchie-az/discovery/DutchieLocationDiscovery.ts +++ /dev/null @@ -1,639 +0,0 @@ -/** - * DutchieLocationDiscovery - * - * Discovers store locations for each city from Dutchie and upserts to dutchie_discovery_locations. - * - * Responsibilities: - * - Given a dutchie_discovery_cities row, call Dutchie's location/search endpoint - * - For each store: extract platform_location_id, platform_slug, platform_menu_url, name, address, coords - * - Upsert into dutchie_discovery_locations - * - DO NOT overwrite status if already verified/merged/rejected - * - DO NOT overwrite dispensary_id if already set - */ - -import { Pool } from 'pg'; -import axios from 'axios'; -import puppeteer from 'puppeteer-extra'; -import StealthPlugin from 'puppeteer-extra-plugin-stealth'; - -puppeteer.use(StealthPlugin()); - -// ============================================================ -// TYPES -// ============================================================ - -export interface DiscoveryCity { - id: number; - platform: string; - cityName: string; - citySlug: string; - stateCode: string | null; - countryCode: string; - crawlEnabled: boolean; -} - -export interface DutchieLocation { - platformLocationId: string; - platformSlug: string; - platformMenuUrl: string; - name: string; - rawAddress: string | null; - addressLine1: string | null; - addressLine2: string | null; - city: string | null; - stateCode: string | null; - postalCode: string | null; - countryCode: string | null; - latitude: number | null; - longitude: number | null; - timezone: string | null; - offersDelivery: boolean | null; - offersPickup: boolean | null; - isRecreational: 
boolean | null; - isMedical: boolean | null; - metadata: Record; -} - -export interface LocationDiscoveryResult { - cityId: number; - citySlug: string; - locationsFound: number; - locationsInserted: number; - locationsUpdated: number; - locationsSkipped: number; - errors: string[]; - durationMs: number; -} - -// ============================================================ -// LOCATION FETCHING -// ============================================================ - -/** - * Fetch locations for a city using Puppeteer to scrape the city page - */ -async function fetchLocationsForCity(city: DiscoveryCity): Promise { - console.log(`[DutchieLocationDiscovery] Fetching locations for ${city.cityName}, ${city.stateCode}...`); - - const browser = await puppeteer.launch({ - headless: 'new', - args: ['--no-sandbox', '--disable-setuid-sandbox', '--disable-dev-shm-usage'], - }); - - try { - const page = await browser.newPage(); - await page.setUserAgent( - 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36' - ); - - // Navigate to city page - use /us/dispensaries/{city_slug} pattern - const cityUrl = `https://dutchie.com/us/dispensaries/${city.citySlug}`; - console.log(`[DutchieLocationDiscovery] Navigating to ${cityUrl}...`); - - await page.goto(cityUrl, { - waitUntil: 'networkidle2', - timeout: 60000, - }); - - // Wait for content - await new Promise((r) => setTimeout(r, 3000)); - - // Try to extract __NEXT_DATA__ which often contains store data - const nextData = await page.evaluate(() => { - const script = document.querySelector('script#__NEXT_DATA__'); - if (script) { - try { - return JSON.parse(script.textContent || '{}'); - } catch { - return null; - } - } - return null; - }); - - let locations: DutchieLocation[] = []; - - if (nextData?.props?.pageProps?.dispensaries) { - // Extract from Next.js data - const dispensaries = nextData.props.pageProps.dispensaries; - console.log(`[DutchieLocationDiscovery] Found 
${dispensaries.length} dispensaries in __NEXT_DATA__`); - - locations = dispensaries.map((d: any) => parseDispensaryData(d, city)); - } else { - // Fall back to DOM scraping - console.log('[DutchieLocationDiscovery] No __NEXT_DATA__, trying DOM scraping...'); - - const scrapedData = await page.evaluate(() => { - const stores: Array<{ - name: string; - href: string; - address: string | null; - }> = []; - - // Look for dispensary cards/links - const cards = document.querySelectorAll('[data-testid="dispensary-card"], .dispensary-card, a[href*="/dispensary/"]'); - cards.forEach((card) => { - const link = card.querySelector('a[href*="/dispensary/"]') || (card as HTMLAnchorElement); - const href = (link as HTMLAnchorElement).href || ''; - const name = - card.querySelector('[data-testid="dispensary-name"]')?.textContent || - card.querySelector('h2, h3, .name')?.textContent || - link.textContent || - ''; - const address = card.querySelector('[data-testid="dispensary-address"], .address')?.textContent || null; - - if (href && name) { - stores.push({ - name: name.trim(), - href, - address: address?.trim() || null, - }); - } - }); - - return stores; - }); - - console.log(`[DutchieLocationDiscovery] DOM scraping found ${scrapedData.length} stores`); - - locations = scrapedData.map((s) => { - // Parse slug from URL - const match = s.href.match(/\/dispensary\/([^/?]+)/); - const slug = match ? 
match[1] : s.name.toLowerCase().replace(/\s+/g, '-'); - - return { - platformLocationId: slug, // Will be resolved later - platformSlug: slug, - platformMenuUrl: `https://dutchie.com/dispensary/${slug}`, - name: s.name, - rawAddress: s.address, - addressLine1: null, - addressLine2: null, - city: city.cityName, - stateCode: city.stateCode, - postalCode: null, - countryCode: city.countryCode, - latitude: null, - longitude: null, - timezone: null, - offersDelivery: null, - offersPickup: null, - isRecreational: null, - isMedical: null, - metadata: { source: 'dom_scrape', originalUrl: s.href }, - }; - }); - } - - return locations; - } finally { - await browser.close(); - } -} - -/** - * Parse dispensary data from Dutchie's API/JSON response - */ -function parseDispensaryData(d: any, city: DiscoveryCity): DutchieLocation { - const id = d.id || d._id || d.dispensaryId || ''; - const slug = d.slug || d.cName || d.name?.toLowerCase().replace(/\s+/g, '-') || ''; - - // Build menu URL - let menuUrl = `https://dutchie.com/dispensary/${slug}`; - if (d.menuUrl) { - menuUrl = d.menuUrl; - } else if (d.embeddedMenuUrl) { - menuUrl = d.embeddedMenuUrl; - } - - // Parse address - const address = d.address || d.location?.address || {}; - const rawAddress = [ - address.line1 || address.street1 || d.address1, - address.line2 || address.street2 || d.address2, - [ - address.city || d.city, - address.state || address.stateCode || d.state, - address.zip || address.zipCode || address.postalCode || d.zip, - ] - .filter(Boolean) - .join(' '), - ] - .filter(Boolean) - .join(', '); - - return { - platformLocationId: id, - platformSlug: slug, - platformMenuUrl: menuUrl, - name: d.name || d.dispensaryName || '', - rawAddress: rawAddress || null, - addressLine1: address.line1 || address.street1 || d.address1 || null, - addressLine2: address.line2 || address.street2 || d.address2 || null, - city: address.city || d.city || city.cityName, - stateCode: address.state || address.stateCode || d.state || 
city.stateCode, - postalCode: address.zip || address.zipCode || address.postalCode || d.zip || null, - countryCode: address.country || address.countryCode || d.country || city.countryCode, - latitude: d.latitude ?? d.location?.latitude ?? d.location?.lat ?? null, - longitude: d.longitude ?? d.location?.longitude ?? d.location?.lng ?? null, - timezone: d.timezone || d.timeZone || null, - offersDelivery: d.offerDelivery ?? d.offersDelivery ?? d.delivery ?? null, - offersPickup: d.offerPickup ?? d.offersPickup ?? d.pickup ?? null, - isRecreational: d.isRecreational ?? d.recreational ?? (d.retailType === 'recreational' || d.retailType === 'both'), - isMedical: d.isMedical ?? d.medical ?? (d.retailType === 'medical' || d.retailType === 'both'), - metadata: { - source: 'next_data', - retailType: d.retailType, - brand: d.brand, - logo: d.logo || d.logoUrl, - raw: d, - }, - }; -} - -/** - * Alternative: Use GraphQL to discover locations - */ -async function fetchLocationsViaGraphQL(city: DiscoveryCity): Promise { - console.log(`[DutchieLocationDiscovery] Trying GraphQL for ${city.cityName}...`); - - // Try geo-based search - // This would require knowing the city's coordinates - // For now, return empty and rely on page scraping - return []; -} - -// ============================================================ -// DATABASE OPERATIONS -// ============================================================ - -/** - * Upsert a location into dutchie_discovery_locations - * Does NOT overwrite status if already verified/merged/rejected - * Does NOT overwrite dispensary_id if already set - */ -async function upsertLocation( - pool: Pool, - location: DutchieLocation, - cityId: number -): Promise<{ inserted: boolean; updated: boolean; skipped: boolean }> { - // First check if this location exists and has a protected status - const existing = await pool.query( - ` - SELECT id, status, dispensary_id - FROM dutchie_discovery_locations - WHERE platform = 'dutchie' AND platform_location_id = 
$1 - `, - [location.platformLocationId] - ); - - if (existing.rows.length > 0) { - const row = existing.rows[0]; - const protectedStatuses = ['verified', 'merged', 'rejected']; - - if (protectedStatuses.includes(row.status)) { - // Only update last_seen_at for protected statuses - await pool.query( - ` - UPDATE dutchie_discovery_locations - SET last_seen_at = NOW(), updated_at = NOW() - WHERE id = $1 - `, - [row.id] - ); - return { inserted: false, updated: false, skipped: true }; - } - - // Update existing discovered location (but preserve dispensary_id if set) - await pool.query( - ` - UPDATE dutchie_discovery_locations - SET - platform_slug = $2, - platform_menu_url = $3, - name = $4, - raw_address = COALESCE($5, raw_address), - address_line1 = COALESCE($6, address_line1), - address_line2 = COALESCE($7, address_line2), - city = COALESCE($8, city), - state_code = COALESCE($9, state_code), - postal_code = COALESCE($10, postal_code), - country_code = COALESCE($11, country_code), - latitude = COALESCE($12, latitude), - longitude = COALESCE($13, longitude), - timezone = COALESCE($14, timezone), - offers_delivery = COALESCE($15, offers_delivery), - offers_pickup = COALESCE($16, offers_pickup), - is_recreational = COALESCE($17, is_recreational), - is_medical = COALESCE($18, is_medical), - metadata = COALESCE($19, metadata), - discovery_city_id = $20, - last_seen_at = NOW(), - updated_at = NOW() - WHERE id = $1 - `, - [ - row.id, - location.platformSlug, - location.platformMenuUrl, - location.name, - location.rawAddress, - location.addressLine1, - location.addressLine2, - location.city, - location.stateCode, - location.postalCode, - location.countryCode, - location.latitude, - location.longitude, - location.timezone, - location.offersDelivery, - location.offersPickup, - location.isRecreational, - location.isMedical, - JSON.stringify(location.metadata), - cityId, - ] - ); - return { inserted: false, updated: true, skipped: false }; - } - - // Insert new location - await 
pool.query( - ` - INSERT INTO dutchie_discovery_locations ( - platform, - platform_location_id, - platform_slug, - platform_menu_url, - name, - raw_address, - address_line1, - address_line2, - city, - state_code, - postal_code, - country_code, - latitude, - longitude, - timezone, - status, - offers_delivery, - offers_pickup, - is_recreational, - is_medical, - metadata, - discovery_city_id, - first_seen_at, - last_seen_at, - active, - created_at, - updated_at - ) VALUES ( - 'dutchie', - $1, $2, $3, $4, $5, $6, $7, $8, $9, $10, $11, $12, $13, $14, - 'discovered', - $15, $16, $17, $18, $19, $20, - NOW(), NOW(), TRUE, NOW(), NOW() - ) - `, - [ - location.platformLocationId, - location.platformSlug, - location.platformMenuUrl, - location.name, - location.rawAddress, - location.addressLine1, - location.addressLine2, - location.city, - location.stateCode, - location.postalCode, - location.countryCode, - location.latitude, - location.longitude, - location.timezone, - location.offersDelivery, - location.offersPickup, - location.isRecreational, - location.isMedical, - JSON.stringify(location.metadata), - cityId, - ] - ); - - return { inserted: true, updated: false, skipped: false }; -} - -// ============================================================ -// MAIN DISCOVERY CLASS -// ============================================================ - -export class DutchieLocationDiscovery { - private pool: Pool; - - constructor(pool: Pool) { - this.pool = pool; - } - - /** - * Get a city by slug - */ - async getCityBySlug(citySlug: string): Promise { - const { rows } = await this.pool.query( - ` - SELECT id, platform, city_name, city_slug, state_code, country_code, crawl_enabled - FROM dutchie_discovery_cities - WHERE platform = 'dutchie' AND city_slug = $1 - LIMIT 1 - `, - [citySlug] - ); - - if (rows.length === 0) return null; - - const r = rows[0]; - return { - id: r.id, - platform: r.platform, - cityName: r.city_name, - citySlug: r.city_slug, - stateCode: r.state_code, - 
countryCode: r.country_code, - crawlEnabled: r.crawl_enabled, - }; - } - - /** - * Get all crawl-enabled cities - */ - async getEnabledCities(limit?: number): Promise { - const { rows } = await this.pool.query( - ` - SELECT id, platform, city_name, city_slug, state_code, country_code, crawl_enabled - FROM dutchie_discovery_cities - WHERE platform = 'dutchie' AND crawl_enabled = TRUE - ORDER BY last_crawled_at ASC NULLS FIRST, city_name ASC - ${limit ? `LIMIT ${limit}` : ''} - ` - ); - - return rows.map((r) => ({ - id: r.id, - platform: r.platform, - cityName: r.city_name, - citySlug: r.city_slug, - stateCode: r.state_code, - countryCode: r.country_code, - crawlEnabled: r.crawl_enabled, - })); - } - - /** - * Discover locations for a single city - */ - async discoverForCity(city: DiscoveryCity): Promise { - const startTime = Date.now(); - const errors: string[] = []; - let locationsFound = 0; - let locationsInserted = 0; - let locationsUpdated = 0; - let locationsSkipped = 0; - - console.log(`[DutchieLocationDiscovery] Discovering locations for ${city.cityName}, ${city.stateCode}...`); - - try { - // Fetch locations - let locations = await fetchLocationsForCity(city); - - // If scraping fails, try GraphQL - if (locations.length === 0) { - locations = await fetchLocationsViaGraphQL(city); - } - - locationsFound = locations.length; - console.log(`[DutchieLocationDiscovery] Found ${locationsFound} locations`); - - // Upsert each location - for (const location of locations) { - try { - const result = await upsertLocation(this.pool, location, city.id); - if (result.inserted) locationsInserted++; - else if (result.updated) locationsUpdated++; - else if (result.skipped) locationsSkipped++; - } catch (error: any) { - const msg = `Failed to upsert location ${location.platformSlug}: ${error.message}`; - console.error(`[DutchieLocationDiscovery] ${msg}`); - errors.push(msg); - } - } - - // Update city's last_crawled_at and location_count - await this.pool.query( - ` - UPDATE 
dutchie_discovery_cities - SET last_crawled_at = NOW(), - location_count = $1, - updated_at = NOW() - WHERE id = $2 - `, - [locationsFound, city.id] - ); - } catch (error: any) { - const msg = `Location discovery failed for ${city.citySlug}: ${error.message}`; - console.error(`[DutchieLocationDiscovery] ${msg}`); - errors.push(msg); - } - - const durationMs = Date.now() - startTime; - - console.log(`[DutchieLocationDiscovery] City ${city.citySlug} complete:`); - console.log(` Locations found: ${locationsFound}`); - console.log(` Inserted: ${locationsInserted}`); - console.log(` Updated: ${locationsUpdated}`); - console.log(` Skipped (protected): ${locationsSkipped}`); - console.log(` Errors: ${errors.length}`); - console.log(` Duration: ${(durationMs / 1000).toFixed(1)}s`); - - return { - cityId: city.id, - citySlug: city.citySlug, - locationsFound, - locationsInserted, - locationsUpdated, - locationsSkipped, - errors, - durationMs, - }; - } - - /** - * Discover locations for all enabled cities - */ - async discoverAllEnabled(options: { - limit?: number; - delayMs?: number; - } = {}): Promise<{ - totalCities: number; - totalLocationsFound: number; - totalInserted: number; - totalUpdated: number; - totalSkipped: number; - errors: string[]; - durationMs: number; - }> { - const { limit, delayMs = 2000 } = options; - const startTime = Date.now(); - let totalLocationsFound = 0; - let totalInserted = 0; - let totalUpdated = 0; - let totalSkipped = 0; - const allErrors: string[] = []; - - const cities = await this.getEnabledCities(limit); - console.log(`[DutchieLocationDiscovery] Discovering locations for ${cities.length} cities...`); - - for (let i = 0; i < cities.length; i++) { - const city = cities[i]; - console.log(`\n[DutchieLocationDiscovery] City ${i + 1}/${cities.length}: ${city.cityName}, ${city.stateCode}`); - - try { - const result = await this.discoverForCity(city); - totalLocationsFound += result.locationsFound; - totalInserted += result.locationsInserted; - 
totalUpdated += result.locationsUpdated; - totalSkipped += result.locationsSkipped; - allErrors.push(...result.errors); - } catch (error: any) { - allErrors.push(`City ${city.citySlug} failed: ${error.message}`); - } - - // Delay between cities - if (i < cities.length - 1 && delayMs > 0) { - await new Promise((r) => setTimeout(r, delayMs)); - } - } - - const durationMs = Date.now() - startTime; - - console.log('\n[DutchieLocationDiscovery] All cities complete:'); - console.log(` Total cities: ${cities.length}`); - console.log(` Total locations found: ${totalLocationsFound}`); - console.log(` Total inserted: ${totalInserted}`); - console.log(` Total updated: ${totalUpdated}`); - console.log(` Total skipped: ${totalSkipped}`); - console.log(` Total errors: ${allErrors.length}`); - console.log(` Duration: ${(durationMs / 1000).toFixed(1)}s`); - - return { - totalCities: cities.length, - totalLocationsFound, - totalInserted, - totalUpdated, - totalSkipped, - errors: allErrors, - durationMs, - }; - } -} - -export default DutchieLocationDiscovery; diff --git a/backend/src/dutchie-az/discovery/discovery-dt-cities-auto.ts b/backend/src/dutchie-az/discovery/discovery-dt-cities-auto.ts deleted file mode 100644 index 7f0b9e48..00000000 --- a/backend/src/dutchie-az/discovery/discovery-dt-cities-auto.ts +++ /dev/null @@ -1,73 +0,0 @@ -#!/usr/bin/env npx tsx -/** - * Discovery Entrypoint: Dutchie Cities (Auto) - * - * Attempts browser/API-based /cities discovery. - * Even if currently blocked (403), this runner preserves the auto-discovery path. - * - * Usage: - * npm run discovery:dt:cities:auto - * DATABASE_URL="..." 
npx tsx src/dutchie-az/discovery/discovery-dt-cities-auto.ts - */ - -import { Pool } from 'pg'; -import { DtCityDiscoveryService } from './DtCityDiscoveryService'; - -const DB_URL = process.env.DATABASE_URL || process.env.CANNAIQ_DB_URL || - 'postgresql://dutchie:dutchie_local_pass@localhost:54320/dutchie_menus'; - -async function main() { - console.log('╔══════════════════════════════════════════════════╗'); - console.log('║ Dutchie City Discovery (AUTO) ║'); - console.log('║ Browser + API fallback ║'); - console.log('╚══════════════════════════════════════════════════╝'); - console.log(`\nDatabase: ${DB_URL.replace(/:[^:@]+@/, ':****@')}`); - - const pool = new Pool({ connectionString: DB_URL }); - - try { - const { rows } = await pool.query('SELECT NOW() as time'); - console.log(`Connected at: ${rows[0].time}\n`); - - const service = new DtCityDiscoveryService(pool); - const result = await service.runAutoDiscovery(); - - console.log('\n' + '═'.repeat(50)); - console.log('SUMMARY'); - console.log('═'.repeat(50)); - console.log(`Cities found: ${result.citiesFound}`); - console.log(`Cities inserted: ${result.citiesInserted}`); - console.log(`Cities updated: ${result.citiesUpdated}`); - console.log(`Errors: ${result.errors.length}`); - console.log(`Duration: ${(result.durationMs / 1000).toFixed(1)}s`); - - if (result.errors.length > 0) { - console.log('\nErrors:'); - result.errors.forEach((e, i) => console.log(` ${i + 1}. 
${e}`)); - } - - const stats = await service.getStats(); - console.log('\nCurrent Database Stats:'); - console.log(` Total cities: ${stats.total}`); - console.log(` Crawl enabled: ${stats.crawlEnabled}`); - console.log(` Never crawled: ${stats.neverCrawled}`); - - if (result.citiesFound === 0) { - console.log('\n⚠️ No cities found via auto-discovery.'); - console.log(' This may be due to Dutchie blocking scraping/API access.'); - console.log(' Use manual seeding instead:'); - console.log(' npm run discovery:dt:cities:manual -- --city-slug=ny-hudson --city-name=Hudson --state-code=NY'); - process.exit(1); - } - - console.log('\n✅ Auto city discovery completed'); - process.exit(0); - } catch (error: any) { - console.error('\n❌ Auto city discovery failed:', error.message); - process.exit(1); - } finally { - await pool.end(); - } -} - -main(); diff --git a/backend/src/dutchie-az/discovery/discovery-dt-cities-manual-seed.ts b/backend/src/dutchie-az/discovery/discovery-dt-cities-manual-seed.ts deleted file mode 100644 index b9c422f6..00000000 --- a/backend/src/dutchie-az/discovery/discovery-dt-cities-manual-seed.ts +++ /dev/null @@ -1,137 +0,0 @@ -#!/usr/bin/env npx tsx -/** - * Discovery Entrypoint: Dutchie Cities (Manual Seed) - * - * Manually seeds cities into dutchie_discovery_cities via CLI args. - * Use this when auto-discovery is blocked (403). - * - * Usage: - * npm run discovery:dt:cities:manual -- --city-slug=ny-hudson --city-name=Hudson --state-code=NY - * npm run discovery:dt:cities:manual -- --city-slug=ma-boston --city-name=Boston --state-code=MA --country-code=US - * - * Options: - * --city-slug Required. URL slug (e.g., "ny-hudson") - * --city-name Required. Display name (e.g., "Hudson") - * --state-code Required. State/province code (e.g., "NY", "CA", "ON") - * --country-code Optional. 
Country code (default: "US") - * - * After seeding, run location discovery: - * npm run discovery:dt:locations - */ - -import { Pool } from 'pg'; -import { DtCityDiscoveryService, DutchieCity } from './DtCityDiscoveryService'; - -const DB_URL = process.env.DATABASE_URL || process.env.CANNAIQ_DB_URL || - 'postgresql://dutchie:dutchie_local_pass@localhost:54320/dutchie_menus'; - -interface Args { - citySlug?: string; - cityName?: string; - stateCode?: string; - countryCode: string; -} - -function parseArgs(): Args { - const args: Args = { countryCode: 'US' }; - - for (const arg of process.argv.slice(2)) { - const citySlugMatch = arg.match(/--city-slug=(.+)/); - if (citySlugMatch) args.citySlug = citySlugMatch[1]; - - const cityNameMatch = arg.match(/--city-name=(.+)/); - if (cityNameMatch) args.cityName = cityNameMatch[1]; - - const stateCodeMatch = arg.match(/--state-code=(.+)/); - if (stateCodeMatch) args.stateCode = stateCodeMatch[1].toUpperCase(); - - const countryCodeMatch = arg.match(/--country-code=(.+)/); - if (countryCodeMatch) args.countryCode = countryCodeMatch[1].toUpperCase(); - } - - return args; -} - -function printUsage() { - console.log(` -Usage: - npm run discovery:dt:cities:manual -- --city-slug= --city-name= --state-code= - -Required arguments: - --city-slug URL slug for the city (e.g., "ny-hudson", "ma-boston") - --city-name Display name (e.g., "Hudson", "Boston") - --state-code State/province code (e.g., "NY", "CA", "ON") - -Optional arguments: - --country-code Country code (default: "US") - -Examples: - npm run discovery:dt:cities:manual -- --city-slug=ny-hudson --city-name=Hudson --state-code=NY - npm run discovery:dt:cities:manual -- --city-slug=ca-los-angeles --city-name="Los Angeles" --state-code=CA - npm run discovery:dt:cities:manual -- --city-slug=on-toronto --city-name=Toronto --state-code=ON --country-code=CA - -After seeding, run location discovery: - npm run discovery:dt:locations -`); -} - -async function main() { - const args = 
parseArgs(); - - console.log('╔══════════════════════════════════════════════════╗'); - console.log('║ Dutchie City Discovery (MANUAL SEED) ║'); - console.log('╚══════════════════════════════════════════════════╝'); - - if (!args.citySlug || !args.cityName || !args.stateCode) { - console.error('\n❌ Error: Missing required arguments\n'); - printUsage(); - process.exit(1); - } - - console.log(`\nCity Slug: ${args.citySlug}`); - console.log(`City Name: ${args.cityName}`); - console.log(`State Code: ${args.stateCode}`); - console.log(`Country Code: ${args.countryCode}`); - console.log(`Database: ${DB_URL.replace(/:[^:@]+@/, ':****@')}`); - - const pool = new Pool({ connectionString: DB_URL }); - - try { - const { rows } = await pool.query('SELECT NOW() as time'); - console.log(`\nConnected at: ${rows[0].time}`); - - const service = new DtCityDiscoveryService(pool); - - const city: DutchieCity = { - slug: args.citySlug, - name: args.cityName, - stateCode: args.stateCode, - countryCode: args.countryCode, - }; - - const result = await service.seedCity(city); - - const action = result.wasInserted ? 
'INSERTED' : 'UPDATED'; - console.log(`\n✅ City ${action}:`); - console.log(` ID: ${result.id}`); - console.log(` City Slug: ${result.city.slug}`); - console.log(` City Name: ${result.city.name}`); - console.log(` State Code: ${result.city.stateCode}`); - console.log(` Country Code: ${result.city.countryCode}`); - - const stats = await service.getStats(); - console.log(`\nTotal Dutchie cities: ${stats.total} (${stats.crawlEnabled} enabled)`); - - console.log('\n📍 Next step: Run location discovery'); - console.log(' npm run discovery:dt:locations'); - - process.exit(0); - } catch (error: any) { - console.error('\n❌ Failed to seed city:', error.message); - process.exit(1); - } finally { - await pool.end(); - } -} - -main(); diff --git a/backend/src/dutchie-az/discovery/discovery-dt-cities.ts b/backend/src/dutchie-az/discovery/discovery-dt-cities.ts deleted file mode 100644 index 3c875274..00000000 --- a/backend/src/dutchie-az/discovery/discovery-dt-cities.ts +++ /dev/null @@ -1,73 +0,0 @@ -#!/usr/bin/env npx tsx -/** - * Discovery Runner: Dutchie Cities - * - * Discovers cities from Dutchie's /cities page and upserts to dutchie_discovery_cities. - * - * Usage: - * npm run discovery:platforms:dt:cities - * DATABASE_URL="..." 
npx tsx src/dutchie-az/discovery/discovery-dt-cities.ts - */ - -import { Pool } from 'pg'; -import { DutchieCityDiscovery } from './DutchieCityDiscovery'; - -const DB_URL = process.env.DATABASE_URL || process.env.CANNAIQ_DB_URL || - 'postgresql://dutchie:dutchie_local_pass@localhost:54320/dutchie_menus'; - -async function main() { - console.log('╔══════════════════════════════════════════════════╗'); - console.log('║ Dutchie City Discovery Runner ║'); - console.log('╚══════════════════════════════════════════════════╝'); - console.log(`\nDatabase: ${DB_URL.replace(/:[^:@]+@/, ':****@')}`); - - const pool = new Pool({ connectionString: DB_URL }); - - try { - // Test DB connection - const { rows } = await pool.query('SELECT NOW() as time'); - console.log(`Connected at: ${rows[0].time}\n`); - - // Run city discovery - const discovery = new DutchieCityDiscovery(pool); - const result = await discovery.run(); - - // Print summary - console.log('\n' + '═'.repeat(50)); - console.log('SUMMARY'); - console.log('═'.repeat(50)); - console.log(`Cities found: ${result.citiesFound}`); - console.log(`Cities inserted: ${result.citiesInserted}`); - console.log(`Cities updated: ${result.citiesUpdated}`); - console.log(`Errors: ${result.errors.length}`); - console.log(`Duration: ${(result.durationMs / 1000).toFixed(1)}s`); - - if (result.errors.length > 0) { - console.log('\nErrors:'); - result.errors.forEach((e, i) => console.log(` ${i + 1}.
${e}`)); - } - - // Get final stats - const stats = await discovery.getStats(); - console.log('\nCurrent Database Stats:'); - console.log(` Total cities: ${stats.total}`); - console.log(` Crawl enabled: ${stats.crawlEnabled}`); - console.log(` Never crawled: ${stats.neverCrawled}`); - console.log(` By country: ${stats.byCountry.map(c => `${c.countryCode}=${c.count}`).join(', ')}`); - - if (result.errors.length > 0) { - console.log('\n⚠️ Completed with errors'); - process.exit(1); - } - - console.log('\n✅ City discovery completed successfully'); - process.exit(0); - } catch (error: any) { - console.error('\n❌ City discovery failed:', error.message); - process.exit(1); - } finally { - await pool.end(); - } -} - -main(); diff --git a/backend/src/dutchie-az/discovery/discovery-dt-locations-from-cities.ts b/backend/src/dutchie-az/discovery/discovery-dt-locations-from-cities.ts deleted file mode 100644 index 61d122d7..00000000 --- a/backend/src/dutchie-az/discovery/discovery-dt-locations-from-cities.ts +++ /dev/null @@ -1,113 +0,0 @@ -#!/usr/bin/env npx tsx -/** - * Discovery Entrypoint: Dutchie Locations (From Cities) - * - * Reads from dutchie_discovery_cities (crawl_enabled = true) - * and discovers store locations for each city. - * - * Geo coordinates are captured when available from Dutchie's payloads. - * - * Usage: - * npm run discovery:dt:locations - * npm run discovery:dt:locations -- --limit=10 - * npm run discovery:dt:locations -- --delay=3000 - * DATABASE_URL="..."
npx tsx src/dutchie-az/discovery/discovery-dt-locations-from-cities.ts - * - * Options: - * --limit=N Only process N cities (default: all) - * --delay=N Delay between cities in ms (default: 2000) - */ - -import { Pool } from 'pg'; -import { DtLocationDiscoveryService } from './DtLocationDiscoveryService'; - -const DB_URL = process.env.DATABASE_URL || process.env.CANNAIQ_DB_URL || - 'postgresql://dutchie:dutchie_local_pass@localhost:54320/dutchie_menus'; - -function parseArgs(): { limit?: number; delay?: number } { - const args: { limit?: number; delay?: number } = {}; - - for (const arg of process.argv.slice(2)) { - const limitMatch = arg.match(/--limit=(\d+)/); - if (limitMatch) args.limit = parseInt(limitMatch[1], 10); - - const delayMatch = arg.match(/--delay=(\d+)/); - if (delayMatch) args.delay = parseInt(delayMatch[1], 10); - } - - return args; -} - -async function main() { - const args = parseArgs(); - - console.log('╔══════════════════════════════════════════════════╗'); - console.log('║ Dutchie Location Discovery (From Cities) ║'); - console.log('║ Reads crawl_enabled cities, discovers stores ║'); - console.log('╚══════════════════════════════════════════════════╝'); - console.log(`\nDatabase: ${DB_URL.replace(/:[^:@]+@/, ':****@')}`); - if (args.limit) console.log(`City limit: ${args.limit}`); - if (args.delay) console.log(`Delay: ${args.delay}ms`); - - const pool = new Pool({ connectionString: DB_URL }); - - try { - const { rows } = await pool.query('SELECT NOW() as time'); - console.log(`Connected at: ${rows[0].time}\n`); - - const service = new DtLocationDiscoveryService(pool); - const result = await service.discoverAllEnabled({ - limit: args.limit, - delayMs: args.delay ??
2000, - }); - - console.log('\n' + '═'.repeat(50)); - console.log('SUMMARY'); - console.log('═'.repeat(50)); - console.log(`Cities processed: ${result.totalCities}`); - console.log(`Locations found: ${result.totalLocationsFound}`); - console.log(`Locations inserted: ${result.totalInserted}`); - console.log(`Locations updated: ${result.totalUpdated}`); - console.log(`Locations skipped: ${result.totalSkipped} (protected status)`); - console.log(`Errors: ${result.errors.length}`); - console.log(`Duration: ${(result.durationMs / 1000).toFixed(1)}s`); - - if (result.errors.length > 0) { - console.log('\nErrors (first 10):'); - result.errors.slice(0, 10).forEach((e, i) => console.log(` ${i + 1}. ${e}`)); - if (result.errors.length > 10) { - console.log(` ... and ${result.errors.length - 10} more`); - } - } - - // Get location stats including coordinates - const stats = await service.getStats(); - console.log('\nCurrent Database Stats:'); - console.log(` Total locations: ${stats.total}`); - console.log(` With coordinates: ${stats.withCoordinates}`); - console.log(` By status:`); - stats.byStatus.forEach(s => console.log(` ${s.status}: ${s.count}`)); - - if (result.totalCities === 0) { - console.log('\n⚠️ No crawl-enabled cities found.'); - console.log(' Seed cities first:'); - console.log(' npm run discovery:dt:cities:manual -- --city-slug=ny-hudson --city-name=Hudson --state-code=NY'); - process.exit(1); - } - - if (result.errors.length > 0) { - console.log('\n⚠️ Completed with errors'); - process.exit(1); - } - - console.log('\n✅ Location discovery completed successfully'); - process.exit(0); - } catch (error: any) { - console.error('\n❌ Location discovery failed:', error.message); - process.exit(1); - } finally { - await pool.end(); - } -} - -main(); diff --git a/backend/src/dutchie-az/discovery/discovery-dt-locations.ts b/backend/src/dutchie-az/discovery/discovery-dt-locations.ts deleted file mode 100644 index cb7af618..00000000 ---
a/backend/src/dutchie-az/discovery/discovery-dt-locations.ts +++ /dev/null @@ -1,117 +0,0 @@ -#!/usr/bin/env npx tsx -/** - * Discovery Runner: Dutchie Locations - * - * Discovers store locations for all crawl-enabled cities and upserts to dutchie_discovery_locations. - * - * Usage: - * npm run discovery:platforms:dt:locations - * npm run discovery:platforms:dt:locations -- --limit=10 - * DATABASE_URL="..." npx tsx src/dutchie-az/discovery/discovery-dt-locations.ts - * - * Options (via args): - * --limit=N Only process N cities (default: all) - * --delay=N Delay between cities in ms (default: 2000) - */ - -import { Pool } from 'pg'; -import { DutchieLocationDiscovery } from './DutchieLocationDiscovery'; - -const DB_URL = process.env.DATABASE_URL || process.env.CANNAIQ_DB_URL || - 'postgresql://dutchie:dutchie_local_pass@localhost:54320/dutchie_menus'; - -// Parse CLI args -function parseArgs(): { limit?: number; delay?: number } { - const args: { limit?: number; delay?: number } = {}; - - for (const arg of process.argv.slice(2)) { - const limitMatch = arg.match(/--limit=(\d+)/); - if (limitMatch) args.limit = parseInt(limitMatch[1], 10); - - const delayMatch = arg.match(/--delay=(\d+)/); - if (delayMatch) args.delay = parseInt(delayMatch[1], 10); - } - - return args; -} - -async function main() { - const args = parseArgs(); - - console.log('╔══════════════════════════════════════════════════╗'); - console.log('║ Dutchie Location Discovery Runner ║'); - console.log('╚══════════════════════════════════════════════════╝'); - console.log(`\nDatabase: ${DB_URL.replace(/:[^:@]+@/, ':****@')}`); - if (args.limit) console.log(`City limit: ${args.limit}`); - if (args.delay) console.log(`Delay: ${args.delay}ms`); - - const pool = new Pool({ connectionString: DB_URL }); - - try { - // Test DB connection - const { rows } = await pool.query('SELECT NOW() as time'); -
console.log(`Connected at: ${rows[0].time}\n`); - - // Run location discovery - const discovery = new DutchieLocationDiscovery(pool); - const result = await discovery.discoverAllEnabled({ - limit: args.limit, - delayMs: args.delay ?? 2000, - }); - - // Print summary - console.log('\n' + '═'.repeat(50)); - console.log('SUMMARY'); - console.log('═'.repeat(50)); - console.log(`Cities processed: ${result.totalCities}`); - console.log(`Locations found: ${result.totalLocationsFound}`); - console.log(`Locations inserted: ${result.totalInserted}`); - console.log(`Locations updated: ${result.totalUpdated}`); - console.log(`Locations skipped: ${result.totalSkipped} (protected status)`); - console.log(`Errors: ${result.errors.length}`); - console.log(`Duration: ${(result.durationMs / 1000).toFixed(1)}s`); - - if (result.errors.length > 0) { - console.log('\nErrors (first 10):'); - result.errors.slice(0, 10).forEach((e, i) => console.log(` ${i + 1}. ${e}`)); - if (result.errors.length > 10) { - console.log(` ... 
and ${result.errors.length - 10} more`); - } - } - - // Get DB counts - const { rows: countRows } = await pool.query(` - SELECT - COUNT(*) as total, - COUNT(*) FILTER (WHERE status = 'discovered') as discovered, - COUNT(*) FILTER (WHERE status = 'verified') as verified, - COUNT(*) FILTER (WHERE status = 'merged') as merged, - COUNT(*) FILTER (WHERE status = 'rejected') as rejected - FROM dutchie_discovery_locations - WHERE platform = 'dutchie' AND active = TRUE - `); - - const counts = countRows[0]; - console.log('\nCurrent Database Stats:'); - console.log(` Total locations: ${counts.total}`); - console.log(` Status discovered: ${counts.discovered}`); - console.log(` Status verified: ${counts.verified}`); - console.log(` Status merged: ${counts.merged}`); - console.log(` Status rejected: ${counts.rejected}`); - - if (result.errors.length > 0) { - console.log('\n⚠️ Completed with errors'); - process.exit(1); - } - - console.log('\n✅ Location discovery completed successfully'); - process.exit(0); - } catch (error: any) { - console.error('\n❌ Location discovery failed:', error.message); - process.exit(1); - } finally { - await pool.end(); - } -} - -main(); diff --git a/backend/src/dutchie-az/discovery/index.ts b/backend/src/dutchie-az/discovery/index.ts deleted file mode 100644 index 5b10d0b2..00000000 --- a/backend/src/dutchie-az/discovery/index.ts +++ /dev/null @@ -1,10 +0,0 @@ -/** - * Dutchie Discovery Module - * - * Store discovery pipeline for Dutchie platform.
- */ - -export { DutchieCityDiscovery } from './DutchieCityDiscovery'; -export { DutchieLocationDiscovery } from './DutchieLocationDiscovery'; -export { createDutchieDiscoveryRoutes } from './routes'; -export { promoteDiscoveryLocation } from './promoteDiscoveryLocation'; diff --git a/backend/src/dutchie-az/discovery/promoteDiscoveryLocation.ts b/backend/src/dutchie-az/discovery/promoteDiscoveryLocation.ts deleted file mode 100644 index 3311f8e2..00000000 --- a/backend/src/dutchie-az/discovery/promoteDiscoveryLocation.ts +++ /dev/null @@ -1,248 +0,0 @@ -/** - * Promote Discovery Location to Crawlable Dispensary - * - * When a discovery location is verified or merged: - * 1. Ensure a crawl profile exists for the dispensary - * 2. Seed/update crawl schedule - * 3. Create initial crawl job - */ - -import { Pool } from 'pg'; - -export interface PromotionResult { - success: boolean; - discoveryId: number; - dispensaryId: number; - crawlProfileId?: number; - scheduleUpdated?: boolean; - crawlJobCreated?: boolean; - error?: string; -} - -/** - * Promote a verified/merged discovery location to a crawlable dispensary. - * - * This function: - * 1. Verifies the discovery location is verified/merged and has a dispensary_id - * 2. Ensures the dispensary has platform info (menu_type, platform_dispensary_id) - * 3. Creates/updates a crawler profile if the profile table exists - * 4. 
Queues an initial crawl job - */ -export async function promoteDiscoveryLocation( - pool: Pool, - discoveryLocationId: number -): Promise<PromotionResult> { - console.log(`[Promote] Starting promotion for discovery location ${discoveryLocationId}...`); - - // Get the discovery location - const { rows: locRows } = await pool.query( - ` - SELECT - dl.*, - d.id as disp_id, - d.name as disp_name, - d.menu_type as disp_menu_type, - d.platform_dispensary_id as disp_platform_id - FROM dutchie_discovery_locations dl - JOIN dispensaries d ON dl.dispensary_id = d.id - WHERE dl.id = $1 - `, - [discoveryLocationId] - ); - - if (locRows.length === 0) { - return { - success: false, - discoveryId: discoveryLocationId, - dispensaryId: 0, - error: 'Discovery location not found or not linked to a dispensary', - }; - } - - const location = locRows[0]; - - // Verify status - if (!['verified', 'merged'].includes(location.status)) { - return { - success: false, - discoveryId: discoveryLocationId, - dispensaryId: location.dispensary_id || 0, - error: `Cannot promote: location status is '${location.status}', must be 'verified' or 'merged'`, - }; - } - - const dispensaryId = location.dispensary_id; - console.log(`[Promote] Location ${discoveryLocationId} -> Dispensary ${dispensaryId} (${location.disp_name})`); - - // Ensure dispensary has platform info - if (!location.disp_platform_id) { - console.log(`[Promote] Updating dispensary with platform info...`); - await pool.query( - ` - UPDATE dispensaries - SET platform_dispensary_id = COALESCE(platform_dispensary_id, $1), - menu_url = COALESCE(menu_url, $2), - menu_type = COALESCE(menu_type, 'dutchie'), - updated_at = NOW() - WHERE id = $3 - `, - [location.platform_location_id, location.platform_menu_url, dispensaryId] - ); - } - - let crawlProfileId: number | undefined; - let scheduleUpdated = false; - let crawlJobCreated = false; - - // Check if dispensary_crawler_profiles table exists - const { rows: tableCheck } = await pool.query(` - SELECT EXISTS ( -
SELECT FROM information_schema.tables - WHERE table_name = 'dispensary_crawler_profiles' - ) as exists - `); - - if (tableCheck[0]?.exists) { - // Create or get crawler profile - console.log(`[Promote] Checking crawler profile...`); - - const { rows: profileRows } = await pool.query( - ` - SELECT id FROM dispensary_crawler_profiles - WHERE dispensary_id = $1 AND platform = 'dutchie' - `, - [dispensaryId] - ); - - if (profileRows.length > 0) { - crawlProfileId = profileRows[0].id; - console.log(`[Promote] Using existing profile ${crawlProfileId}`); - } else { - // Create new profile - const profileKey = `dutchie-${location.platform_slug}`; - const { rows: newProfile } = await pool.query( - ` - INSERT INTO dispensary_crawler_profiles ( - dispensary_id, - profile_key, - profile_name, - platform, - config, - status, - enabled, - created_at, - updated_at - ) VALUES ( - $1, $2, $3, 'dutchie', $4, 'sandbox', TRUE, NOW(), NOW() - ) - ON CONFLICT (dispensary_id, platform) DO UPDATE SET - enabled = TRUE, - updated_at = NOW() - RETURNING id - `, - [ - dispensaryId, - profileKey, - `${location.name} (Dutchie)`, - JSON.stringify({ - platformDispensaryId: location.platform_location_id, - platformSlug: location.platform_slug, - menuUrl: location.platform_menu_url, - pricingType: 'rec', - useBothModes: true, - }), - ] - ); - - crawlProfileId = newProfile[0]?.id; - console.log(`[Promote] Created new profile ${crawlProfileId}`); - } - - // Link profile to dispensary if not already linked - await pool.query( - ` - UPDATE dispensaries - SET active_crawler_profile_id = COALESCE(active_crawler_profile_id, $1), - updated_at = NOW() - WHERE id = $2 - `, - [crawlProfileId, dispensaryId] - ); - } - - // Check if crawl_jobs table exists and create initial job - const { rows: jobsTableCheck } = await pool.query(` - SELECT EXISTS ( - SELECT FROM information_schema.tables - WHERE table_name = 'crawl_jobs' - ) as exists - `); - - if (jobsTableCheck[0]?.exists) { - // Check if there's already a 
pending job - const { rows: existingJobs } = await pool.query( - ` - SELECT id FROM crawl_jobs - WHERE dispensary_id = $1 AND status IN ('pending', 'running') - LIMIT 1 - `, - [dispensaryId] - ); - - if (existingJobs.length === 0) { - // Create initial crawl job - console.log(`[Promote] Creating initial crawl job...`); - await pool.query( - ` - INSERT INTO crawl_jobs ( - dispensary_id, - job_type, - status, - priority, - config, - created_at, - updated_at - ) VALUES ( - $1, 'dutchie_product_crawl', 'pending', 1, $2, NOW(), NOW() - ) - `, - [ - dispensaryId, - JSON.stringify({ - source: 'discovery_promotion', - discoveryLocationId, - pricingType: 'rec', - useBothModes: true, - }), - ] - ); - crawlJobCreated = true; - } else { - console.log(`[Promote] Crawl job already exists for dispensary`); - } - } - - // Update discovery location notes - await pool.query( - ` - UPDATE dutchie_discovery_locations - SET notes = COALESCE(notes || E'\n', '') || $1, - updated_at = NOW() - WHERE id = $2 - `, - [`Promoted to crawlable at ${new Date().toISOString()}`, discoveryLocationId] - ); - - console.log(`[Promote] Promotion complete for discovery location ${discoveryLocationId}`); - - return { - success: true, - discoveryId: discoveryLocationId, - dispensaryId, - crawlProfileId, - scheduleUpdated, - crawlJobCreated, - }; -} - -export default promoteDiscoveryLocation; diff --git a/backend/src/dutchie-az/discovery/routes.ts b/backend/src/dutchie-az/discovery/routes.ts deleted file mode 100644 index 34b6b276..00000000 --- a/backend/src/dutchie-az/discovery/routes.ts +++ /dev/null @@ -1,973 +0,0 @@ -/** - * Platform Discovery API Routes (DT = Dutchie) - * - * Routes for the platform-specific store discovery pipeline. 
- * Mount at /api/discovery/platforms/dt - * - * Platform Slug Mapping (for trademark-safe URLs): - * dt = Dutchie - * jn = Jane (future) - * wm = Weedmaps (future) - * lf = Leafly (future) - * tz = Treez (future) - * - * Note: The actual platform value stored in the DB remains 'dutchie'. - * Only the URL paths use neutral slugs. - */ - -import { Router, Request, Response } from 'express'; -import { Pool } from 'pg'; -import { DutchieCityDiscovery } from './DutchieCityDiscovery'; -import { DutchieLocationDiscovery } from './DutchieLocationDiscovery'; -import { DiscoveryGeoService } from '../../services/DiscoveryGeoService'; -import { GeoValidationService } from '../../services/GeoValidationService'; - -export function createDutchieDiscoveryRoutes(pool: Pool): Router { - const router = Router(); - - // ============================================================ - // LOCATIONS - // ============================================================ - - /** - * GET /api/discovery/platforms/dt/locations - * - * List discovered locations with filtering. 
- * - * Query params: - * - status: 'discovered' | 'verified' | 'rejected' | 'merged' - * - state_code: e.g., 'AZ', 'CA' - * - country_code: 'US' | 'CA' - * - unlinked_only: 'true' to show only locations without dispensary_id - * - search: search by name - * - limit: number (default 50) - * - offset: number (default 0) - */ - router.get('/locations', async (req: Request, res: Response) => { - try { - const { - status, - state_code, - country_code, - unlinked_only, - search, - limit = '50', - offset = '0', - } = req.query; - - let whereClause = "WHERE platform = 'dutchie' AND active = TRUE"; - const params: any[] = []; - let paramIndex = 1; - - if (status) { - whereClause += ` AND status = $${paramIndex}`; - params.push(status); - paramIndex++; - } - - if (state_code) { - whereClause += ` AND state_code = $${paramIndex}`; - params.push(state_code); - paramIndex++; - } - - if (country_code) { - whereClause += ` AND country_code = $${paramIndex}`; - params.push(country_code); - paramIndex++; - } - - if (unlinked_only === 'true') { - whereClause += ' AND dispensary_id IS NULL'; - } - - if (search) { - whereClause += ` AND (name ILIKE $${paramIndex} OR platform_slug ILIKE $${paramIndex})`; - params.push(`%${search}%`); - paramIndex++; - } - - const limitVal = parseInt(limit as string, 10); - const offsetVal = parseInt(offset as string, 10); - params.push(limitVal, offsetVal); - - const { rows } = await pool.query( - ` - SELECT - dl.id, - dl.platform, - dl.platform_location_id, - dl.platform_slug, - dl.platform_menu_url, - dl.name, - dl.raw_address, - dl.address_line1, - dl.city, - dl.state_code, - dl.postal_code, - dl.country_code, - dl.latitude, - dl.longitude, - dl.status, - dl.dispensary_id, - dl.offers_delivery, - dl.offers_pickup, - dl.is_recreational, - dl.is_medical, - dl.first_seen_at, - dl.last_seen_at, - dl.verified_at, - dl.verified_by, - dl.notes, - d.name as dispensary_name - FROM dutchie_discovery_locations dl - LEFT JOIN dispensaries d ON dl.dispensary_id 
= d.id - ${whereClause} - ORDER BY dl.first_seen_at DESC - LIMIT $${paramIndex} OFFSET $${paramIndex + 1} - `, - params - ); - - // Get total count - const countParams = params.slice(0, -2); - const { rows: countRows } = await pool.query( - `SELECT COUNT(*) as total FROM dutchie_discovery_locations dl ${whereClause}`, - countParams - ); - - res.json({ - success: true, - locations: rows.map((r) => ({ - id: r.id, - platform: r.platform, - platformLocationId: r.platform_location_id, - platformSlug: r.platform_slug, - platformMenuUrl: r.platform_menu_url, - name: r.name, - rawAddress: r.raw_address, - addressLine1: r.address_line1, - city: r.city, - stateCode: r.state_code, - postalCode: r.postal_code, - countryCode: r.country_code, - latitude: r.latitude, - longitude: r.longitude, - status: r.status, - dispensaryId: r.dispensary_id, - dispensaryName: r.dispensary_name, - offersDelivery: r.offers_delivery, - offersPickup: r.offers_pickup, - isRecreational: r.is_recreational, - isMedical: r.is_medical, - firstSeenAt: r.first_seen_at, - lastSeenAt: r.last_seen_at, - verifiedAt: r.verified_at, - verifiedBy: r.verified_by, - notes: r.notes, - })), - total: parseInt(countRows[0]?.total || '0', 10), - limit: limitVal, - offset: offsetVal, - }); - } catch (error: any) { - console.error('[Discovery Routes] Error fetching locations:', error); - res.status(500).json({ success: false, error: error.message }); - } - }); - - /** - * GET /api/discovery/platforms/dt/locations/:id - * - * Get a single location by ID. 
- */ - router.get('/locations/:id', async (req: Request, res: Response) => { - try { - const { id } = req.params; - - const { rows } = await pool.query( - ` - SELECT - dl.*, - d.name as dispensary_name, - d.menu_url as dispensary_menu_url - FROM dutchie_discovery_locations dl - LEFT JOIN dispensaries d ON dl.dispensary_id = d.id - WHERE dl.id = $1 - `, - [parseInt(id, 10)] - ); - - if (rows.length === 0) { - return res.status(404).json({ success: false, error: 'Location not found' }); - } - - const r = rows[0]; - res.json({ - success: true, - location: { - id: r.id, - platform: r.platform, - platformLocationId: r.platform_location_id, - platformSlug: r.platform_slug, - platformMenuUrl: r.platform_menu_url, - name: r.name, - rawAddress: r.raw_address, - addressLine1: r.address_line1, - addressLine2: r.address_line2, - city: r.city, - stateCode: r.state_code, - postalCode: r.postal_code, - countryCode: r.country_code, - latitude: r.latitude, - longitude: r.longitude, - timezone: r.timezone, - status: r.status, - dispensaryId: r.dispensary_id, - dispensaryName: r.dispensary_name, - dispensaryMenuUrl: r.dispensary_menu_url, - offersDelivery: r.offers_delivery, - offersPickup: r.offers_pickup, - isRecreational: r.is_recreational, - isMedical: r.is_medical, - firstSeenAt: r.first_seen_at, - lastSeenAt: r.last_seen_at, - verifiedAt: r.verified_at, - verifiedBy: r.verified_by, - notes: r.notes, - metadata: r.metadata, - }, - }); - } catch (error: any) { - console.error('[Discovery Routes] Error fetching location:', error); - res.status(500).json({ success: false, error: error.message }); - } - }); - - // ============================================================ - // VERIFICATION ACTIONS - // ============================================================ - - /** - * POST /api/discovery/platforms/dt/locations/:id/verify-create - * - * Verify a discovered location and create a new canonical dispensary. 
- */ - router.post('/locations/:id/verify-create', async (req: Request, res: Response) => { - const client = await pool.connect(); - try { - const { id } = req.params; - const { verifiedBy = 'admin' } = req.body; - - await client.query('BEGIN'); - - // Get the discovery location - const { rows: locRows } = await client.query( - `SELECT * FROM dutchie_discovery_locations WHERE id = $1 FOR UPDATE`, - [parseInt(id, 10)] - ); - - if (locRows.length === 0) { - await client.query('ROLLBACK'); - return res.status(404).json({ success: false, error: 'Location not found' }); - } - - const location = locRows[0]; - - if (location.status !== 'discovered') { - await client.query('ROLLBACK'); - return res.status(400).json({ - success: false, - error: `Cannot verify: location status is '${location.status}'`, - }); - } - - // Look up state_id if we have a state_code - let stateId: number | null = null; - if (location.state_code) { - const { rows: stateRows } = await client.query( - `SELECT id FROM states WHERE code = $1`, - [location.state_code] - ); - if (stateRows.length > 0) { - stateId = stateRows[0].id; - } - } - - // Create the canonical dispensary - const { rows: dispRows } = await client.query( - ` - INSERT INTO dispensaries ( - name, - slug, - address, - city, - state, - zip, - latitude, - longitude, - timezone, - menu_type, - menu_url, - platform_dispensary_id, - state_id, - active, - created_at, - updated_at - ) VALUES ( - $1, $2, $3, $4, $5, $6, $7, $8, $9, $10, $11, $12, $13, TRUE, NOW(), NOW() - ) - RETURNING id - `, - [ - location.name, - location.platform_slug, - location.address_line1, - location.city, - location.state_code, - location.postal_code, - location.latitude, - location.longitude, - location.timezone, - 'dutchie', - location.platform_menu_url, - location.platform_location_id, - stateId, - ] - ); - - const dispensaryId = dispRows[0].id; - - // Update the discovery location - await client.query( - ` - UPDATE dutchie_discovery_locations - SET status = 
'verified', - dispensary_id = $1, - verified_at = NOW(), - verified_by = $2, - updated_at = NOW() - WHERE id = $3 - `, - [dispensaryId, verifiedBy, id] - ); - - await client.query('COMMIT'); - - res.json({ - success: true, - action: 'created', - discoveryId: parseInt(id, 10), - dispensaryId, - message: `Created new dispensary (ID: ${dispensaryId})`, - }); - } catch (error: any) { - await client.query('ROLLBACK'); - console.error('[Discovery Routes] Error in verify-create:', error); - res.status(500).json({ success: false, error: error.message }); - } finally { - client.release(); - } - }); - - /** - * POST /api/discovery/platforms/dt/locations/:id/verify-link - * - * Link a discovered location to an existing dispensary. - * - * Body: - * - dispensaryId: number (required) - * - verifiedBy: string (optional) - */ - router.post('/locations/:id/verify-link', async (req: Request, res: Response) => { - const client = await pool.connect(); - try { - const { id } = req.params; - const { dispensaryId, verifiedBy = 'admin' } = req.body; - - if (!dispensaryId) { - return res.status(400).json({ success: false, error: 'dispensaryId is required' }); - } - - await client.query('BEGIN'); - - // Verify dispensary exists - const { rows: dispRows } = await client.query( - `SELECT id, name FROM dispensaries WHERE id = $1`, - [dispensaryId] - ); - - if (dispRows.length === 0) { - await client.query('ROLLBACK'); - return res.status(404).json({ success: false, error: 'Dispensary not found' }); - } - - // Get the discovery location - const { rows: locRows } = await client.query( - `SELECT * FROM dutchie_discovery_locations WHERE id = $1 FOR UPDATE`, - [parseInt(id, 10)] - ); - - if (locRows.length === 0) { - await client.query('ROLLBACK'); - return res.status(404).json({ success: false, error: 'Location not found' }); - } - - const location = locRows[0]; - - if (location.status !== 'discovered') { - await client.query('ROLLBACK'); - return res.status(400).json({ - success: false, - error: 
`Cannot link: location status is '${location.status}'`, - }); - } - - // Update dispensary with platform info if missing - await client.query( - ` - UPDATE dispensaries - SET platform_dispensary_id = COALESCE(platform_dispensary_id, $1), - menu_url = COALESCE(menu_url, $2), - menu_type = COALESCE(menu_type, 'dutchie'), - updated_at = NOW() - WHERE id = $3 - `, - [location.platform_location_id, location.platform_menu_url, dispensaryId] - ); - - // Update the discovery location - await client.query( - ` - UPDATE dutchie_discovery_locations - SET status = 'merged', - dispensary_id = $1, - verified_at = NOW(), - verified_by = $2, - updated_at = NOW() - WHERE id = $3 - `, - [dispensaryId, verifiedBy, id] - ); - - await client.query('COMMIT'); - - res.json({ - success: true, - action: 'linked', - discoveryId: parseInt(id, 10), - dispensaryId, - dispensaryName: dispRows[0].name, - message: `Linked to existing dispensary: ${dispRows[0].name}`, - }); - } catch (error: any) { - await client.query('ROLLBACK'); - console.error('[Discovery Routes] Error in verify-link:', error); - res.status(500).json({ success: false, error: error.message }); - } finally { - client.release(); - } - }); - - /** - * POST /api/discovery/platforms/dt/locations/:id/reject - * - * Reject a discovered location. 
- * - * Body: - * - reason: string (optional) - * - verifiedBy: string (optional) - */ - router.post('/locations/:id/reject', async (req: Request, res: Response) => { - try { - const { id } = req.params; - const { reason, verifiedBy = 'admin' } = req.body; - - // Get current status - const { rows } = await pool.query( - `SELECT status FROM dutchie_discovery_locations WHERE id = $1`, - [parseInt(id, 10)] - ); - - if (rows.length === 0) { - return res.status(404).json({ success: false, error: 'Location not found' }); - } - - if (rows[0].status !== 'discovered') { - return res.status(400).json({ - success: false, - error: `Cannot reject: location status is '${rows[0].status}'`, - }); - } - - await pool.query( - ` - UPDATE dutchie_discovery_locations - SET status = 'rejected', - verified_at = NOW(), - verified_by = $1, - notes = COALESCE($2, notes), - updated_at = NOW() - WHERE id = $3 - `, - [verifiedBy, reason, id] - ); - - res.json({ - success: true, - action: 'rejected', - discoveryId: parseInt(id, 10), - message: 'Location rejected', - }); - } catch (error: any) { - console.error('[Discovery Routes] Error in reject:', error); - res.status(500).json({ success: false, error: error.message }); - } - }); - - /** - * POST /api/discovery/platforms/dt/locations/:id/unreject - * - * Restore a rejected location to discovered status. 
- */ - router.post('/locations/:id/unreject', async (req: Request, res: Response) => { - try { - const { id } = req.params; - - // Get current status - const { rows } = await pool.query( - `SELECT status FROM dutchie_discovery_locations WHERE id = $1`, - [parseInt(id, 10)] - ); - - if (rows.length === 0) { - return res.status(404).json({ success: false, error: 'Location not found' }); - } - - if (rows[0].status !== 'rejected') { - return res.status(400).json({ - success: false, - error: `Cannot unreject: location status is '${rows[0].status}'`, - }); - } - - await pool.query( - ` - UPDATE dutchie_discovery_locations - SET status = 'discovered', - verified_at = NULL, - verified_by = NULL, - updated_at = NOW() - WHERE id = $1 - `, - [id] - ); - - res.json({ - success: true, - action: 'unrejected', - discoveryId: parseInt(id, 10), - message: 'Location restored to discovered status', - }); - } catch (error: any) { - console.error('[Discovery Routes] Error in unreject:', error); - res.status(500).json({ success: false, error: error.message }); - } - }); - - // ============================================================ - // SUMMARY / REPORTING - // ============================================================ - - /** - * GET /api/discovery/platforms/dt/summary - * - * Get discovery summary statistics. 
- */ - router.get('/summary', async (_req: Request, res: Response) => { - try { - // Total counts by status - const { rows: statusRows } = await pool.query(` - SELECT status, COUNT(*) as cnt - FROM dutchie_discovery_locations - WHERE platform = 'dutchie' AND active = TRUE - GROUP BY status - `); - - const statusCounts: Record<string, number> = {}; - let totalLocations = 0; - for (const row of statusRows) { - statusCounts[row.status] = parseInt(row.cnt, 10); - totalLocations += parseInt(row.cnt, 10); - } - - // By state - const { rows: stateRows } = await pool.query(` - SELECT - state_code, - COUNT(*) as total, - COUNT(*) FILTER (WHERE status = 'verified') as verified, - COUNT(*) FILTER (WHERE dispensary_id IS NULL AND status = 'discovered') as unlinked - FROM dutchie_discovery_locations - WHERE platform = 'dutchie' AND active = TRUE AND state_code IS NOT NULL - GROUP BY state_code - ORDER BY total DESC - `); - - res.json({ - success: true, - summary: { - total_locations: totalLocations, - discovered: statusCounts['discovered'] || 0, - verified: statusCounts['verified'] || 0, - merged: statusCounts['merged'] || 0, - rejected: statusCounts['rejected'] || 0, - }, - by_state: stateRows.map((r) => ({ - state_code: r.state_code, - total: parseInt(r.total, 10), - verified: parseInt(r.verified, 10), - unlinked: parseInt(r.unlinked, 10), - })), - }); - } catch (error: any) { - console.error('[Discovery Routes] Error in summary:', error); - res.status(500).json({ success: false, error: error.message }); - } - }); - - // ============================================================ - // CITIES - // ============================================================ - - /** - * GET /api/discovery/platforms/dt/cities - * - * List discovery cities. 
- */ - router.get('/cities', async (req: Request, res: Response) => { - try { - const { state_code, country_code, crawl_enabled, limit = '100', offset = '0' } = req.query; - - let whereClause = "WHERE platform = 'dutchie'"; - const params: any[] = []; - let paramIndex = 1; - - if (state_code) { - whereClause += ` AND state_code = $${paramIndex}`; - params.push(state_code); - paramIndex++; - } - - if (country_code) { - whereClause += ` AND country_code = $${paramIndex}`; - params.push(country_code); - paramIndex++; - } - - if (crawl_enabled === 'true') { - whereClause += ' AND crawl_enabled = TRUE'; - } else if (crawl_enabled === 'false') { - whereClause += ' AND crawl_enabled = FALSE'; - } - - params.push(parseInt(limit as string, 10), parseInt(offset as string, 10)); - - const { rows } = await pool.query( - ` - SELECT - id, - platform, - city_name, - city_slug, - state_code, - country_code, - last_crawled_at, - crawl_enabled, - location_count - FROM dutchie_discovery_cities - ${whereClause} - ORDER BY country_code, state_code, city_name - LIMIT $${paramIndex} OFFSET $${paramIndex + 1} - `, - params - ); - - const { rows: countRows } = await pool.query( - `SELECT COUNT(*) as total FROM dutchie_discovery_cities ${whereClause}`, - params.slice(0, -2) - ); - - res.json({ - success: true, - cities: rows.map((r) => ({ - id: r.id, - platform: r.platform, - cityName: r.city_name, - citySlug: r.city_slug, - stateCode: r.state_code, - countryCode: r.country_code, - lastCrawledAt: r.last_crawled_at, - crawlEnabled: r.crawl_enabled, - locationCount: r.location_count, - })), - total: parseInt(countRows[0]?.total || '0', 10), - }); - } catch (error: any) { - console.error('[Discovery Routes] Error fetching cities:', error); - res.status(500).json({ success: false, error: error.message }); - } - }); - - // ============================================================ - // MATCH CANDIDATES - // ============================================================ - - /** - * GET 
/api/discovery/platforms/dt/locations/:id/match-candidates - * - * Find potential dispensary matches for a discovery location. - */ - router.get('/locations/:id/match-candidates', async (req: Request, res: Response) => { - try { - const { id } = req.params; - - // Get the discovery location - const { rows: locRows } = await pool.query( - `SELECT * FROM dutchie_discovery_locations WHERE id = $1`, - [parseInt(id, 10)] - ); - - if (locRows.length === 0) { - return res.status(404).json({ success: false, error: 'Location not found' }); - } - - const location = locRows[0]; - - // Find potential matches - const { rows: candidates } = await pool.query( - ` - SELECT - d.id, - d.name, - d.city, - d.state, - d.address, - d.menu_type, - d.platform_dispensary_id, - d.menu_url, - d.latitude, - d.longitude, - CASE - WHEN d.name ILIKE $1 THEN 'exact_name' - WHEN d.name ILIKE $2 THEN 'partial_name' - WHEN d.city ILIKE $3 AND d.state = $4 THEN 'same_city' - ELSE 'location_match' - END as match_type, - CASE - WHEN d.latitude IS NOT NULL AND d.longitude IS NOT NULL - AND $5::float IS NOT NULL AND $6::float IS NOT NULL - THEN (3959 * acos( - LEAST(1.0, GREATEST(-1.0, - cos(radians($5::float)) * cos(radians(d.latitude)) * - cos(radians(d.longitude) - radians($6::float)) + - sin(radians($5::float)) * sin(radians(d.latitude)) - )) - )) - ELSE NULL - END as distance_miles - FROM dispensaries d - WHERE d.state = $4 - AND ( - d.name ILIKE $1 - OR d.name ILIKE $2 - OR d.city ILIKE $3 - OR ( - d.latitude IS NOT NULL - AND d.longitude IS NOT NULL - AND $5::float IS NOT NULL - AND $6::float IS NOT NULL - ) - ) - ORDER BY - CASE - WHEN d.name ILIKE $1 THEN 1 - WHEN d.name ILIKE $2 THEN 2 - ELSE 3 - END, - distance_miles NULLS LAST - LIMIT 10 - `, - [ - location.name, - `%${location.name.split(' ')[0]}%`, - location.city, - location.state_code, - location.latitude, - location.longitude, - ] - ); - - res.json({ - success: true, - location: { - id: location.id, - name: location.name, - city: 
location.city, - stateCode: location.state_code, - }, - candidates: candidates.map((c) => ({ - id: c.id, - name: c.name, - city: c.city, - state: c.state, - address: c.address, - menuType: c.menu_type, - platformDispensaryId: c.platform_dispensary_id, - menuUrl: c.menu_url, - matchType: c.match_type, - distanceMiles: c.distance_miles ? Math.round(c.distance_miles * 10) / 10 : null, - })), - }); - } catch (error: any) { - console.error('[Discovery Routes] Error fetching match candidates:', error); - res.status(500).json({ success: false, error: error.message }); - } - }); - - // ============================================================ - // GEO / NEARBY (Admin/Debug Only) - // ============================================================ - - /** - * GET /api/discovery/platforms/dt/nearby - * - * Find discovery locations near a given coordinate. - * This is an internal/debug endpoint for admin use. - * - * Query params: - * - lat: number (required) - * - lon: number (required) - * - radiusKm: number (optional, default 50) - * - limit: number (optional, default 20) - * - status: string (optional, filter by status) - */ - router.get('/nearby', async (req: Request, res: Response) => { - try { - const { lat, lon, radiusKm = '50', limit = '20', status } = req.query; - - // Validate required params - if (!lat || !lon) { - return res.status(400).json({ - success: false, - error: 'lat and lon are required query parameters', - }); - } - - const latNum = parseFloat(lat as string); - const lonNum = parseFloat(lon as string); - const radiusNum = parseFloat(radiusKm as string); - const limitNum = parseInt(limit as string, 10); - - if (isNaN(latNum) || isNaN(lonNum)) { - return res.status(400).json({ - success: false, - error: 'lat and lon must be valid numbers', - }); - } - - const geoService = new DiscoveryGeoService(pool); - - const locations = await geoService.findNearbyDiscoveryLocations(latNum, lonNum, { - radiusKm: radiusNum, - limit: limitNum, - platform: 'dutchie', - 
status: status as string | undefined, - }); - - res.json({ - success: true, - center: { lat: latNum, lon: lonNum }, - radiusKm: radiusNum, - count: locations.length, - locations, - }); - } catch (error: any) { - console.error('[Discovery Routes] Error in nearby:', error); - res.status(500).json({ success: false, error: error.message }); - } - }); - - /** - * GET /api/discovery/platforms/dt/geo-stats - * - * Get coordinate coverage statistics for discovery locations. - * This is an internal/debug endpoint for admin use. - */ - router.get('/geo-stats', async (_req: Request, res: Response) => { - try { - const geoService = new DiscoveryGeoService(pool); - const stats = await geoService.getCoordinateCoverageStats(); - - res.json({ - success: true, - stats, - }); - } catch (error: any) { - console.error('[Discovery Routes] Error in geo-stats:', error); - res.status(500).json({ success: false, error: error.message }); - } - }); - - /** - * GET /api/discovery/platforms/dt/locations/:id/validate-geo - * - * Validate the geographic data for a discovery location. - * This is an internal/debug endpoint for admin use. 
- */ - router.get('/locations/:id/validate-geo', async (req: Request, res: Response) => { - try { - const { id } = req.params; - - // Get the location - const { rows } = await pool.query( - `SELECT latitude, longitude, state_code, country_code, name - FROM dutchie_discovery_locations WHERE id = $1`, - [parseInt(id, 10)] - ); - - if (rows.length === 0) { - return res.status(404).json({ success: false, error: 'Location not found' }); - } - - const location = rows[0]; - const geoValidation = new GeoValidationService(); - const result = geoValidation.validateLocationState({ - latitude: location.latitude, - longitude: location.longitude, - state_code: location.state_code, - country_code: location.country_code, - }); - - res.json({ - success: true, - location: { - id: parseInt(id, 10), - name: location.name, - latitude: location.latitude, - longitude: location.longitude, - stateCode: location.state_code, - countryCode: location.country_code, - }, - validation: result, - }); - } catch (error: any) { - console.error('[Discovery Routes] Error in validate-geo:', error); - res.status(500).json({ success: false, error: error.message }); - } - }); - - return router; -} - -export default createDutchieDiscoveryRoutes; diff --git a/backend/src/dutchie-az/index.ts b/backend/src/dutchie-az/index.ts deleted file mode 100644 index 939483ce..00000000 --- a/backend/src/dutchie-az/index.ts +++ /dev/null @@ -1,92 +0,0 @@ -/** - * Dutchie AZ Data Pipeline - * - * Isolated data pipeline for crawling and storing Dutchie Arizona dispensary data. - * This module is completely separate from the main application database. 
- * - * Features: - * - Two-mode crawling (Mode A: UI parity, Mode B: MAX COVERAGE) - * - Derived stockStatus field (in_stock, out_of_stock, unknown) - * - Full raw payload storage for 100% data preservation - * - AZDHS dispensary list as canonical source - */ - -// Types -export * from './types'; - -// Database -export { - getDutchieAZPool, - query, - getClient, - closePool, - healthCheck, -} from './db/connection'; - -export { - createSchema, - dropSchema, - schemaExists, - ensureSchema, -} from './db/schema'; - -// Services - GraphQL Client -export { - GRAPHQL_HASHES, - ARIZONA_CENTERPOINTS, - resolveDispensaryId, - fetchAllProducts, - fetchAllProductsBothModes, - discoverArizonaDispensaries, - // Alias for backward compatibility - discoverArizonaDispensaries as discoverDispensaries, -} from './services/graphql-client'; - -// Services - Discovery -export { - importFromExistingDispensaries, - discoverDispensaries as discoverAndSaveDispensaries, - resolvePlatformDispensaryIds, - getAllDispensaries, - getDispensaryById, - getDispensariesWithPlatformIds, -} from './services/discovery'; - -// Services - Product Crawler -export { - normalizeProduct, - normalizeSnapshot, - crawlDispensaryProducts, - crawlAllArizonaDispensaries, -} from './services/product-crawler'; - -export type { CrawlResult } from './services/product-crawler'; - -// Services - Scheduler -export { - startScheduler, - stopScheduler, - triggerImmediateCrawl, - getSchedulerStatus, - crawlSingleDispensary, - // Schedule config CRUD - getAllSchedules, - getScheduleById, - createSchedule, - updateSchedule, - deleteSchedule, - triggerScheduleNow, - initializeDefaultSchedules, - // Run logs - getRunLogs, -} from './services/scheduler'; - -// Services - AZDHS Import -export { - importAZDHSDispensaries, - importFromJSON, - getImportStats, -} from './services/azdhs-import'; - -// Routes -export { default as dutchieAZRouter } from './routes'; diff --git a/backend/src/dutchie-az/routes/analytics.ts 
b/backend/src/dutchie-az/routes/analytics.ts deleted file mode 100644 index 549e919a..00000000 --- a/backend/src/dutchie-az/routes/analytics.ts +++ /dev/null @@ -1,682 +0,0 @@ -/** - * Analytics API Routes - * - * Provides REST API endpoints for all analytics services. - * All routes are prefixed with /api/analytics - * - * Phase 3: Analytics Dashboards - */ - -import { Router, Request, Response } from 'express'; -import { Pool } from 'pg'; -import { - AnalyticsCache, - PriceTrendService, - PenetrationService, - CategoryAnalyticsService, - StoreChangeService, - BrandOpportunityService, -} from '../services/analytics'; - -export function createAnalyticsRouter(pool: Pool): Router { - const router = Router(); - - // Initialize services - const cache = new AnalyticsCache(pool, { defaultTtlMinutes: 15 }); - const priceService = new PriceTrendService(pool, cache); - const penetrationService = new PenetrationService(pool, cache); - const categoryService = new CategoryAnalyticsService(pool, cache); - const storeService = new StoreChangeService(pool, cache); - const brandOpportunityService = new BrandOpportunityService(pool, cache); - - // ============================================================ - // PRICE ANALYTICS - // ============================================================ - - /** - * GET /api/analytics/price/product/:id - * Get price trend for a specific product - */ - router.get('/price/product/:id', async (req: Request, res: Response) => { - try { - const productId = parseInt(req.params.id); - const storeId = req.query.storeId ? parseInt(req.query.storeId as string) : undefined; - const days = req.query.days ? 
parseInt(req.query.days as string) : 30; - - const result = await priceService.getProductPriceTrend(productId, storeId, days); - res.json(result); - } catch (error) { - console.error('[Analytics] Price product error:', error); - res.status(500).json({ error: 'Failed to fetch product price trend' }); - } - }); - - /** - * GET /api/analytics/price/brand/:name - * Get price trend for a brand - */ - router.get('/price/brand/:name', async (req: Request, res: Response) => { - try { - const brandName = decodeURIComponent(req.params.name); - const filters = { - storeId: req.query.storeId ? parseInt(req.query.storeId as string) : undefined, - category: req.query.category as string | undefined, - state: req.query.state as string | undefined, - days: req.query.days ? parseInt(req.query.days as string) : 30, - }; - - const result = await priceService.getBrandPriceTrend(brandName, filters); - res.json(result); - } catch (error) { - console.error('[Analytics] Price brand error:', error); - res.status(500).json({ error: 'Failed to fetch brand price trend' }); - } - }); - - /** - * GET /api/analytics/price/category/:name - * Get price trend for a category - */ - router.get('/price/category/:name', async (req: Request, res: Response) => { - try { - const category = decodeURIComponent(req.params.name); - const filters = { - storeId: req.query.storeId ? parseInt(req.query.storeId as string) : undefined, - brandName: req.query.brand as string | undefined, - state: req.query.state as string | undefined, - days: req.query.days ? 
parseInt(req.query.days as string) : 30, - }; - - const result = await priceService.getCategoryPriceTrend(category, filters); - res.json(result); - } catch (error) { - console.error('[Analytics] Price category error:', error); - res.status(500).json({ error: 'Failed to fetch category price trend' }); - } - }); - - /** - * GET /api/analytics/price/summary - * Get price summary statistics - */ - router.get('/price/summary', async (req: Request, res: Response) => { - try { - const filters = { - storeId: req.query.storeId ? parseInt(req.query.storeId as string) : undefined, - brandName: req.query.brand as string | undefined, - category: req.query.category as string | undefined, - state: req.query.state as string | undefined, - }; - - const result = await priceService.getPriceSummary(filters); - res.json(result); - } catch (error) { - console.error('[Analytics] Price summary error:', error); - res.status(500).json({ error: 'Failed to fetch price summary' }); - } - }); - - /** - * GET /api/analytics/price/compression/:category - * Get price compression analysis for a category - */ - router.get('/price/compression/:category', async (req: Request, res: Response) => { - try { - const category = decodeURIComponent(req.params.category); - const state = req.query.state as string | undefined; - - const result = await priceService.detectPriceCompression(category, state); - res.json(result); - } catch (error) { - console.error('[Analytics] Price compression error:', error); - res.status(500).json({ error: 'Failed to analyze price compression' }); - } - }); - - /** - * GET /api/analytics/price/global - * Get global price statistics - */ - router.get('/price/global', async (_req: Request, res: Response) => { - try { - const result = await priceService.getGlobalPriceStats(); - res.json(result); - } catch (error) { - console.error('[Analytics] Global price error:', error); - res.status(500).json({ error: 'Failed to fetch global price stats' }); - } - }); - - // 
============================================================ - // PENETRATION ANALYTICS - // ============================================================ - - /** - * GET /api/analytics/penetration/brand/:name - * Get penetration data for a brand - */ - router.get('/penetration/brand/:name', async (req: Request, res: Response) => { - try { - const brandName = decodeURIComponent(req.params.name); - const filters = { - state: req.query.state as string | undefined, - category: req.query.category as string | undefined, - }; - - const result = await penetrationService.getBrandPenetration(brandName, filters); - res.json(result); - } catch (error) { - console.error('[Analytics] Brand penetration error:', error); - res.status(500).json({ error: 'Failed to fetch brand penetration' }); - } - }); - - /** - * GET /api/analytics/penetration/top - * Get top brands by penetration - */ - router.get('/penetration/top', async (req: Request, res: Response) => { - try { - const limit = req.query.limit ? parseInt(req.query.limit as string) : 20; - const filters = { - state: req.query.state as string | undefined, - category: req.query.category as string | undefined, - minStores: req.query.minStores ? parseInt(req.query.minStores as string) : 2, - minSkus: req.query.minSkus ? parseInt(req.query.minSkus as string) : 5, - }; - - const result = await penetrationService.getTopBrandsByPenetration(limit, filters); - res.json(result); - } catch (error) { - console.error('[Analytics] Top penetration error:', error); - res.status(500).json({ error: 'Failed to fetch top brands' }); - } - }); - - /** - * GET /api/analytics/penetration/trend/:brand - * Get penetration trend for a brand - */ - router.get('/penetration/trend/:brand', async (req: Request, res: Response) => { - try { - const brandName = decodeURIComponent(req.params.brand); - const days = req.query.days ? 
parseInt(req.query.days as string) : 30; - - const result = await penetrationService.getPenetrationTrend(brandName, days); - res.json(result); - } catch (error) { - console.error('[Analytics] Penetration trend error:', error); - res.status(500).json({ error: 'Failed to fetch penetration trend' }); - } - }); - - /** - * GET /api/analytics/penetration/shelf-share/:brand - * Get shelf share by category for a brand - */ - router.get('/penetration/shelf-share/:brand', async (req: Request, res: Response) => { - try { - const brandName = decodeURIComponent(req.params.brand); - const result = await penetrationService.getShelfShareByCategory(brandName); - res.json(result); - } catch (error) { - console.error('[Analytics] Shelf share error:', error); - res.status(500).json({ error: 'Failed to fetch shelf share' }); - } - }); - - /** - * GET /api/analytics/penetration/by-state/:brand - * Get brand presence by state - */ - router.get('/penetration/by-state/:brand', async (req: Request, res: Response) => { - try { - const brandName = decodeURIComponent(req.params.brand); - const result = await penetrationService.getBrandPresenceByState(brandName); - res.json(result); - } catch (error) { - console.error('[Analytics] Brand by state error:', error); - res.status(500).json({ error: 'Failed to fetch brand presence by state' }); - } - }); - - /** - * GET /api/analytics/penetration/stores/:brand - * Get stores carrying a brand - */ - router.get('/penetration/stores/:brand', async (req: Request, res: Response) => { - try { - const brandName = decodeURIComponent(req.params.brand); - const result = await penetrationService.getStoresCarryingBrand(brandName); - res.json(result); - } catch (error) { - console.error('[Analytics] Stores carrying brand error:', error); - res.status(500).json({ error: 'Failed to fetch stores' }); - } - }); - - /** - * GET /api/analytics/penetration/heatmap - * Get penetration heatmap data - */ - router.get('/penetration/heatmap', async (req: Request, res: 
Response) => { - try { - const brandName = req.query.brand as string | undefined; - const result = await penetrationService.getPenetrationHeatmap(brandName); - res.json(result); - } catch (error) { - console.error('[Analytics] Heatmap error:', error); - res.status(500).json({ error: 'Failed to fetch heatmap data' }); - } - }); - - // ============================================================ - // CATEGORY ANALYTICS - // ============================================================ - - /** - * GET /api/analytics/category/summary - * Get category summary - */ - router.get('/category/summary', async (req: Request, res: Response) => { - try { - const category = req.query.category as string | undefined; - const filters = { - state: req.query.state as string | undefined, - storeId: req.query.storeId ? parseInt(req.query.storeId as string) : undefined, - }; - - const result = await categoryService.getCategorySummary(category, filters); - res.json(result); - } catch (error) { - console.error('[Analytics] Category summary error:', error); - res.status(500).json({ error: 'Failed to fetch category summary' }); - } - }); - - /** - * GET /api/analytics/category/growth - * Get category growth data - */ - router.get('/category/growth', async (req: Request, res: Response) => { - try { - const days = req.query.days ? parseInt(req.query.days as string) : 7; - const filters = { - state: req.query.state as string | undefined, - storeId: req.query.storeId ? parseInt(req.query.storeId as string) : undefined, - minSkus: req.query.minSkus ? 
parseInt(req.query.minSkus as string) : 10, - }; - - const result = await categoryService.getCategoryGrowth(days, filters); - res.json(result); - } catch (error) { - console.error('[Analytics] Category growth error:', error); - res.status(500).json({ error: 'Failed to fetch category growth' }); - } - }); - - /** - * GET /api/analytics/category/trend/:category - * Get category growth trend over time - */ - router.get('/category/trend/:category', async (req: Request, res: Response) => { - try { - const category = decodeURIComponent(req.params.category); - const days = req.query.days ? parseInt(req.query.days as string) : 90; - - const result = await categoryService.getCategoryGrowthTrend(category, days); - res.json(result); - } catch (error) { - console.error('[Analytics] Category trend error:', error); - res.status(500).json({ error: 'Failed to fetch category trend' }); - } - }); - - /** - * GET /api/analytics/category/heatmap - * Get category heatmap data - */ - router.get('/category/heatmap', async (req: Request, res: Response) => { - try { - const metric = (req.query.metric as 'skus' | 'growth' | 'price') || 'skus'; - const periods = req.query.periods ? parseInt(req.query.periods as string) : 12; - - const result = await categoryService.getCategoryHeatmap(metric, periods); - res.json(result); - } catch (error) { - console.error('[Analytics] Category heatmap error:', error); - res.status(500).json({ error: 'Failed to fetch heatmap' }); - } - }); - - /** - * GET /api/analytics/category/top-movers - * Get top growing and declining categories - */ - router.get('/category/top-movers', async (req: Request, res: Response) => { - try { - const limit = req.query.limit ? parseInt(req.query.limit as string) : 5; - const days = req.query.days ? 
parseInt(req.query.days as string) : 30; - - const result = await categoryService.getTopMovers(limit, days); - res.json(result); - } catch (error) { - console.error('[Analytics] Top movers error:', error); - res.status(500).json({ error: 'Failed to fetch top movers' }); - } - }); - - /** - * GET /api/analytics/category/:category/subcategories - * Get subcategory breakdown - */ - router.get('/category/:category/subcategories', async (req: Request, res: Response) => { - try { - const category = decodeURIComponent(req.params.category); - const result = await categoryService.getSubcategoryBreakdown(category); - res.json(result); - } catch (error) { - console.error('[Analytics] Subcategory error:', error); - res.status(500).json({ error: 'Failed to fetch subcategories' }); - } - }); - - // ============================================================ - // STORE CHANGE TRACKING - // ============================================================ - - /** - * GET /api/analytics/store/:id/summary - * Get change summary for a store - */ - router.get('/store/:id/summary', async (req: Request, res: Response) => { - try { - const storeId = parseInt(req.params.id); - const result = await storeService.getStoreChangeSummary(storeId); - - if (!result) { - return res.status(404).json({ error: 'Store not found' }); - } - - res.json(result); - } catch (error) { - console.error('[Analytics] Store summary error:', error); - res.status(500).json({ error: 'Failed to fetch store summary' }); - } - }); - - /** - * GET /api/analytics/store/:id/events - * Get recent change events for a store - */ - router.get('/store/:id/events', async (req: Request, res: Response) => { - try { - const storeId = parseInt(req.params.id); - const filters = { - eventType: req.query.type as string | undefined, - days: req.query.days ? parseInt(req.query.days as string) : 30, - limit: req.query.limit ? 
parseInt(req.query.limit as string) : 100, - }; - - const result = await storeService.getStoreChangeEvents(storeId, filters); - res.json(result); - } catch (error) { - console.error('[Analytics] Store events error:', error); - res.status(500).json({ error: 'Failed to fetch store events' }); - } - }); - - /** - * GET /api/analytics/store/:id/brands/new - * Get new brands added to a store - */ - router.get('/store/:id/brands/new', async (req: Request, res: Response) => { - try { - const storeId = parseInt(req.params.id); - const days = req.query.days ? parseInt(req.query.days as string) : 30; - - const result = await storeService.getNewBrands(storeId, days); - res.json(result); - } catch (error) { - console.error('[Analytics] New brands error:', error); - res.status(500).json({ error: 'Failed to fetch new brands' }); - } - }); - - /** - * GET /api/analytics/store/:id/brands/lost - * Get brands lost from a store - */ - router.get('/store/:id/brands/lost', async (req: Request, res: Response) => { - try { - const storeId = parseInt(req.params.id); - const days = req.query.days ? parseInt(req.query.days as string) : 30; - - const result = await storeService.getLostBrands(storeId, days); - res.json(result); - } catch (error) { - console.error('[Analytics] Lost brands error:', error); - res.status(500).json({ error: 'Failed to fetch lost brands' }); - } - }); - - /** - * GET /api/analytics/store/:id/products/changes - * Get product changes for a store - */ - router.get('/store/:id/products/changes', async (req: Request, res: Response) => { - try { - const storeId = parseInt(req.params.id); - const changeType = req.query.type as 'added' | 'discontinued' | 'price_drop' | 'price_increase' | 'restocked' | 'out_of_stock' | undefined; - const days = req.query.days ? 
parseInt(req.query.days as string) : 7; - - const result = await storeService.getProductChanges(storeId, changeType, days); - res.json(result); - } catch (error) { - console.error('[Analytics] Product changes error:', error); - res.status(500).json({ error: 'Failed to fetch product changes' }); - } - }); - - /** - * GET /api/analytics/store/leaderboard/:category - * Get category leaderboard across stores - */ - router.get('/store/leaderboard/:category', async (req: Request, res: Response) => { - try { - const category = decodeURIComponent(req.params.category); - const limit = req.query.limit ? parseInt(req.query.limit as string) : 20; - - const result = await storeService.getCategoryLeaderboard(category, limit); - res.json(result); - } catch (error) { - console.error('[Analytics] Leaderboard error:', error); - res.status(500).json({ error: 'Failed to fetch leaderboard' }); - } - }); - - /** - * GET /api/analytics/store/most-active - * Get most active stores (by changes) - */ - router.get('/store/most-active', async (req: Request, res: Response) => { - try { - const days = req.query.days ? parseInt(req.query.days as string) : 7; - const limit = req.query.limit ? 
parseInt(req.query.limit as string) : 10; - - const result = await storeService.getMostActiveStores(days, limit); - res.json(result); - } catch (error) { - console.error('[Analytics] Most active error:', error); - res.status(500).json({ error: 'Failed to fetch active stores' }); - } - }); - - /** - * GET /api/analytics/store/compare - * Compare two stores - */ - router.get('/store/compare', async (req: Request, res: Response) => { - try { - const store1 = parseInt(req.query.store1 as string); - const store2 = parseInt(req.query.store2 as string); - - if (!store1 || !store2) { - return res.status(400).json({ error: 'Both store1 and store2 are required' }); - } - - const result = await storeService.compareStores(store1, store2); - res.json(result); - } catch (error) { - console.error('[Analytics] Compare stores error:', error); - res.status(500).json({ error: 'Failed to compare stores' }); - } - }); - - // ============================================================ - // BRAND OPPORTUNITY / RISK - // ============================================================ - - /** - * GET /api/analytics/brand/:name/opportunity - * Get full opportunity analysis for a brand - */ - router.get('/brand/:name/opportunity', async (req: Request, res: Response) => { - try { - const brandName = decodeURIComponent(req.params.name); - const result = await brandOpportunityService.getBrandOpportunity(brandName); - res.json(result); - } catch (error) { - console.error('[Analytics] Brand opportunity error:', error); - res.status(500).json({ error: 'Failed to fetch brand opportunity' }); - } - }); - - /** - * GET /api/analytics/brand/:name/position - * Get market position summary for a brand - */ - router.get('/brand/:name/position', async (req: Request, res: Response) => { - try { - const brandName = decodeURIComponent(req.params.name); - const result = await brandOpportunityService.getMarketPositionSummary(brandName); - res.json(result); - } catch (error) { - console.error('[Analytics] Brand 
position error:', error); - res.status(500).json({ error: 'Failed to fetch brand position' }); - } - }); - - // ============================================================ - // ALERTS - // ============================================================ - - /** - * GET /api/analytics/alerts - * Get analytics alerts - */ - router.get('/alerts', async (req: Request, res: Response) => { - try { - const filters = { - brandName: req.query.brand as string | undefined, - storeId: req.query.storeId ? parseInt(req.query.storeId as string) : undefined, - alertType: req.query.type as string | undefined, - unreadOnly: req.query.unreadOnly === 'true', - limit: req.query.limit ? parseInt(req.query.limit as string) : 50, - }; - - const result = await brandOpportunityService.getAlerts(filters); - res.json(result); - } catch (error) { - console.error('[Analytics] Alerts error:', error); - res.status(500).json({ error: 'Failed to fetch alerts' }); - } - }); - - /** - * POST /api/analytics/alerts/mark-read - * Mark alerts as read - */ - router.post('/alerts/mark-read', async (req: Request, res: Response) => { - try { - const { alertIds } = req.body; - - if (!Array.isArray(alertIds)) { - return res.status(400).json({ error: 'alertIds must be an array' }); - } - - await brandOpportunityService.markAlertsRead(alertIds); - res.json({ success: true }); - } catch (error) { - console.error('[Analytics] Mark read error:', error); - res.status(500).json({ error: 'Failed to mark alerts as read' }); - } - }); - - // ============================================================ - // CACHE MANAGEMENT - // ============================================================ - - /** - * GET /api/analytics/cache/stats - * Get cache statistics - */ - router.get('/cache/stats', async (_req: Request, res: Response) => { - try { - const stats = await cache.getStats(); - res.json(stats); - } catch (error) { - console.error('[Analytics] Cache stats error:', error); - res.status(500).json({ error: 'Failed to get cache 
stats' }); - } - }); - - /** - * POST /api/analytics/cache/clear - * Clear cache (admin only) - */ - router.post('/cache/clear', async (req: Request, res: Response) => { - try { - const pattern = req.query.pattern as string | undefined; - - if (pattern) { - const cleared = await cache.invalidatePattern(pattern); - res.json({ success: true, clearedCount: cleared }); - } else { - await cache.cleanExpired(); - res.json({ success: true, message: 'Expired entries cleaned' }); - } - } catch (error) { - console.error('[Analytics] Cache clear error:', error); - res.status(500).json({ error: 'Failed to clear cache' }); - } - }); - - // ============================================================ - // SNAPSHOT CAPTURE (for cron/scheduled jobs) - // ============================================================ - - /** - * POST /api/analytics/snapshots/capture - * Capture daily snapshots (run by scheduler) - */ - router.post('/snapshots/capture', async (_req: Request, res: Response) => { - try { - const [brandResult, categoryResult] = await Promise.all([ - pool.query('SELECT capture_brand_snapshots() as count'), - pool.query('SELECT capture_category_snapshots() as count'), - ]); - - res.json({ - success: true, - brandSnapshots: parseInt(brandResult.rows[0]?.count || '0'), - categorySnapshots: parseInt(categoryResult.rows[0]?.count || '0'), - }); - } catch (error) { - console.error('[Analytics] Snapshot capture error:', error); - res.status(500).json({ error: 'Failed to capture snapshots' }); - } - }); - - return router; -} diff --git a/backend/src/dutchie-az/routes/index.ts b/backend/src/dutchie-az/routes/index.ts deleted file mode 100644 index 3613a30d..00000000 --- a/backend/src/dutchie-az/routes/index.ts +++ /dev/null @@ -1,2716 +0,0 @@ -/** - * Market Data API Routes - * - * Express routes for the cannabis market data pipeline. - * Provides API endpoints for stores, products, categories, and dashboard. 
- * Mounted at /api/markets (with legacy aliases at /api/az and /api/dutchie-az) - */ - -import { Router, Request, Response } from 'express'; -import { query } from '../db/connection'; -import { ensureSchema } from '../db/schema'; -import { - importAZDHSDispensaries, - importFromJSON, - getImportStats, -} from '../services/azdhs-import'; -import { - discoverDispensaries, - resolvePlatformDispensaryIds, - getAllDispensaries, - getDispensaryById, -} from '../services/discovery'; -import { crawlDispensaryProducts } from '../services/product-crawler'; - -// Use shared dispensary columns (handles optional columns like provider_detection_data) -import { DISPENSARY_COLUMNS_WITH_PROFILE as DISPENSARY_COLUMNS } from '../db/dispensary-columns'; -import { - startScheduler, - stopScheduler, - triggerImmediateCrawl, - getSchedulerStatus, - crawlSingleDispensary, - getAllSchedules, - getScheduleById, - createSchedule, - updateSchedule, - deleteSchedule, - triggerScheduleNow, - initializeDefaultSchedules, - getRunLogs, -} from '../services/scheduler'; -import { StockStatus } from '../types'; -import { getProviderDisplayName } from '../../utils/provider-display'; - -const router = Router(); - -// ============================================================ -// DASHBOARD -// ============================================================ - -/** - * GET /api/dutchie-az/dashboard - * Dashboard stats overview - */ -router.get('/dashboard', async (_req: Request, res: Response) => { - try { - const { rows } = await query<{ - dispensary_count: string; - product_count: string; - snapshots_24h: string; - last_crawl_time: Date; - failed_jobs_24h: string; - brand_count: string; - category_count: string; - }>(`SELECT * FROM v_dashboard_stats`); - - const stats = rows[0] || {}; - res.json({ - dispensaryCount: parseInt(stats.dispensary_count || '0', 10), - productCount: parseInt(stats.product_count || '0', 10), - snapshotCount24h: parseInt(stats.snapshots_24h || '0', 10), - lastCrawlTime: 
stats.last_crawl_time, - failedJobCount: parseInt(stats.failed_jobs_24h || '0', 10), - brandCount: parseInt(stats.brand_count || '0', 10), - categoryCount: parseInt(stats.category_count || '0', 10), - }); - } catch (error: any) { - res.status(500).json({ error: error.message }); - } -}); - -// ============================================================ -// DISPENSARIES (STORES) -// ============================================================ - -/** - * GET /api/dutchie-az/stores - * List all stores with optional filters - */ -router.get('/stores', async (req: Request, res: Response) => { - try { - const { city, hasPlatformId, limit = '100', offset = '0' } = req.query; - - let whereClause = 'WHERE state = \'AZ\''; - const params: any[] = []; - let paramIndex = 1; - - if (city) { - whereClause += ` AND city = $${paramIndex}`; - params.push(city); - paramIndex++; - } - - if (hasPlatformId === 'true') { - whereClause += ' AND platform_dispensary_id IS NOT NULL'; - } else if (hasPlatformId === 'false') { - whereClause += ' AND platform_dispensary_id IS NULL'; - } - - params.push(parseInt(limit as string, 10), parseInt(offset as string, 10)); - - const { rows, rowCount } = await query( - ` - SELECT ${DISPENSARY_COLUMNS}, - (SELECT COUNT(*) FROM dutchie_products WHERE dispensary_id = dispensaries.id) as product_count, - dcp.status as crawler_status, - dcp.profile_key as crawler_profile_key, - dcp.next_retry_at, - dcp.sandbox_attempt_count - FROM dispensaries - LEFT JOIN dispensary_crawler_profiles dcp - ON dcp.dispensary_id = dispensaries.id AND dcp.enabled = true - ${whereClause} - ORDER BY dispensaries.name - LIMIT $${paramIndex} OFFSET $${paramIndex + 1} - `, - params - ); - - // Get total count - const { rows: countRows } = await query( - `SELECT COUNT(*) as total FROM dispensaries ${whereClause}`, - params.slice(0, -2) - ); - - // Transform stores to include provider_display - const transformedStores = rows.map((store: any) => ({ - ...store, - provider_raw: 
store.menu_type, - provider_display: getProviderDisplayName(store.menu_type), - })); - - res.json({ - stores: transformedStores, - total: parseInt(countRows[0]?.total || '0', 10), - limit: parseInt(limit as string, 10), - offset: parseInt(offset as string, 10), - }); - } catch (error: any) { - res.status(500).json({ error: error.message }); - } -}); - -/** - * GET /api/dutchie-az/stores/slug/:slug - * Resolve a store by slug (case-insensitive) or platform_dispensary_id - */ -router.get('/stores/slug/:slug', async (req: Request, res: Response) => { - try { - const { slug } = req.params; - const normalized = slug.toLowerCase(); - - const { rows } = await query( - ` - SELECT ${DISPENSARY_COLUMNS} - FROM dispensaries - WHERE lower(slug) = $1 - OR lower(platform_dispensary_id) = $1 - LIMIT 1 - `, - [normalized] - ); - - if (!rows || rows.length === 0) { - return res.status(404).json({ error: 'Store not found' }); - } - - res.json(rows[0]); - } catch (error: any) { - res.status(500).json({ error: error.message }); - } -}); - -/** - * GET /api/dutchie-az/stores/:id - * Get a single store by ID - */ -router.get('/stores/:id', async (req: Request, res: Response) => { - try { - const { id } = req.params; - const store = await getDispensaryById(parseInt(id, 10)); - - if (!store) { - return res.status(404).json({ error: 'Store not found' }); - } - - res.json(store); - } catch (error: any) { - res.status(500).json({ error: error.message }); - } -}); - -/** - * GET /api/dutchie-az/stores/:id/summary - * Get store summary with product count, categories, and brands - * This is the main endpoint for the DispensaryDetail panel - * OPTIMIZED: Combined 5 sequential queries into 2 parallel queries - */ -router.get('/stores/:id/summary', async (req: Request, res: Response) => { - try { - const { id } = req.params; - const dispensaryId = parseInt(id, 10); - - // Run all queries in parallel using Promise.all - const [dispensaryResult, aggregateResult] = await Promise.all([ - // Query 1: 
Get dispensary info - query( - `SELECT ${DISPENSARY_COLUMNS} FROM dispensaries WHERE id = $1`, - [dispensaryId] - ), - - // Query 2: All product aggregations in one query using CTEs - query( - ` - WITH stock_counts AS ( - SELECT - COUNT(*) as total_products, - COUNT(*) FILTER (WHERE stock_status = 'in_stock') as in_stock_count, - COUNT(*) FILTER (WHERE stock_status = 'out_of_stock') as out_of_stock_count, - COUNT(*) FILTER (WHERE stock_status = 'unknown') as unknown_count, - COUNT(*) FILTER (WHERE stock_status = 'missing_from_feed') as missing_count - FROM dutchie_products - WHERE dispensary_id = $1 - ), - category_agg AS ( - SELECT jsonb_agg( - jsonb_build_object('type', type, 'subcategory', subcategory, 'product_count', cnt) - ORDER BY type, subcategory - ) as categories - FROM ( - SELECT type, subcategory, COUNT(*) as cnt - FROM dutchie_products - WHERE dispensary_id = $1 AND type IS NOT NULL - GROUP BY type, subcategory - ) cat - ), - brand_agg AS ( - SELECT jsonb_agg( - jsonb_build_object('brand_name', brand_name, 'product_count', cnt) - ORDER BY cnt DESC - ) as brands - FROM ( - SELECT brand_name, COUNT(*) as cnt - FROM dutchie_products - WHERE dispensary_id = $1 AND brand_name IS NOT NULL - GROUP BY brand_name - ) br - ), - last_crawl AS ( - SELECT - id, status, started_at, completed_at, - products_found, products_new, products_updated, error_message - FROM dispensary_crawl_jobs - WHERE dispensary_id = $1 - ORDER BY created_at DESC - LIMIT 1 - ) - SELECT - sc.total_products, sc.in_stock_count, sc.out_of_stock_count, sc.unknown_count, sc.missing_count, - COALESCE(ca.categories, '[]'::jsonb) as categories, - COALESCE(ba.brands, '[]'::jsonb) as brands, - lc.id as last_crawl_id, lc.status as last_crawl_status, - lc.started_at as last_crawl_started, lc.completed_at as last_crawl_completed, - lc.products_found, lc.products_new, lc.products_updated, lc.error_message - FROM stock_counts sc - CROSS JOIN category_agg ca - CROSS JOIN brand_agg ba - LEFT JOIN last_crawl 
lc ON true - `, - [dispensaryId] - ) - ]); - - if (dispensaryResult.rows.length === 0) { - return res.status(404).json({ error: 'Store not found' }); - } - - const dispensary = dispensaryResult.rows[0]; - const agg = aggregateResult.rows[0] || {}; - const categories = agg.categories || []; - const brands = agg.brands || []; - - res.json({ - dispensary, - totalProducts: parseInt(agg.total_products || '0', 10), - inStockCount: parseInt(agg.in_stock_count || '0', 10), - outOfStockCount: parseInt(agg.out_of_stock_count || '0', 10), - unknownStockCount: parseInt(agg.unknown_count || '0', 10), - missingFromFeedCount: parseInt(agg.missing_count || '0', 10), - categories, - brands, - brandCount: brands.length, - categoryCount: categories.length, - lastCrawl: agg.last_crawl_id ? { - id: agg.last_crawl_id, - status: agg.last_crawl_status, - started_at: agg.last_crawl_started, - completed_at: agg.last_crawl_completed, - products_found: agg.products_found, - products_new: agg.products_new, - products_updated: agg.products_updated, - error_message: agg.error_message - } : null, - }); - } catch (error: any) { - res.status(500).json({ error: error.message }); - } -}); - -/** - * GET /api/dutchie-az/stores/:id/products - * Get paginated products for a store with latest snapshot data - */ -router.get('/stores/:id/products', async (req: Request, res: Response) => { - try { - const { id } = req.params; - const { - stockStatus, - type, - subcategory, - brandName, - search, - limit = '50', - offset = '0', - } = req.query; - - let whereClause = 'WHERE p.dispensary_id = $1'; - const params: any[] = [parseInt(id, 10)]; - let paramIndex = 2; - - if (stockStatus) { - whereClause += ` AND p.stock_status = $${paramIndex}`; - params.push(stockStatus); - paramIndex++; - } - - if (type) { - whereClause += ` AND p.type = $${paramIndex}`; - params.push(type); - paramIndex++; - } - - if (subcategory) { - whereClause += ` AND p.subcategory = $${paramIndex}`; - params.push(subcategory); - 
paramIndex++; - } - - if (brandName) { - whereClause += ` AND p.brand_name ILIKE $${paramIndex}`; - params.push(`%${brandName}%`); - paramIndex++; - } - - if (search) { - whereClause += ` AND (p.name ILIKE $${paramIndex} OR p.brand_name ILIKE $${paramIndex})`; - params.push(`%${search}%`); - paramIndex++; - } - - params.push(parseInt(limit as string, 10), parseInt(offset as string, 10)); - - // Get products with their latest snapshot data - const { rows: products } = await query( - ` - SELECT - p.id, - p.external_product_id, - p.name, - p.brand_name, - p.type, - p.subcategory, - p.strain_type, - p.stock_status, - p.created_at, - p.updated_at, - p.primary_image_url, - p.thc_content, - p.cbd_content, - -- Latest snapshot data (prices in cents) - s.rec_min_price_cents, - s.rec_max_price_cents, - s.med_min_price_cents, - s.med_max_price_cents, - s.rec_min_special_price_cents, - s.med_min_special_price_cents, - s.total_quantity_available, - s.options, - s.stock_status as snapshot_stock_status, - s.crawled_at as snapshot_at - FROM dutchie_products p - LEFT JOIN LATERAL ( - SELECT * FROM dutchie_product_snapshots - WHERE dutchie_product_id = p.id - ORDER BY crawled_at DESC - LIMIT 1 - ) s ON true - ${whereClause} - ORDER BY p.updated_at DESC - LIMIT $${paramIndex} OFFSET $${paramIndex + 1} - `, - params - ); - - // Get total count - const { rows: countRows } = await query( - `SELECT COUNT(*) as total FROM dutchie_products p ${whereClause}`, - params.slice(0, -2) - ); - - // Transform products for frontend compatibility - const transformedProducts = products.map((p) => ({ - id: p.id, - external_id: p.external_product_id, - name: p.name, - brand: p.brand_name, - type: p.type, - subcategory: p.subcategory, - strain_type: p.strain_type, - stock_status: p.snapshot_stock_status || p.stock_status, - in_stock: (p.snapshot_stock_status || p.stock_status) === 'in_stock', - // Prices from latest snapshot (convert cents to dollars) - regular_price: p.rec_min_price_cents ? 
p.rec_min_price_cents / 100 : null, - regular_price_max: p.rec_max_price_cents ? p.rec_max_price_cents / 100 : null, - sale_price: p.rec_min_special_price_cents ? p.rec_min_special_price_cents / 100 : null, - med_price: p.med_min_price_cents ? p.med_min_price_cents / 100 : null, - med_price_max: p.med_max_price_cents ? p.med_max_price_cents / 100 : null, - med_sale_price: p.med_min_special_price_cents ? p.med_min_special_price_cents / 100 : null, - // Potency from products table - thc_percentage: p.thc_content, - cbd_percentage: p.cbd_content, - // Images from products table - image_url: p.primary_image_url, - // Other - options: p.options, - total_quantity: p.total_quantity_available, - // Timestamps - created_at: p.created_at, - updated_at: p.updated_at, - snapshot_at: p.snapshot_at, - })); - - res.json({ - products: transformedProducts, - total: parseInt(countRows[0]?.total || '0', 10), - limit: parseInt(limit as string, 10), - offset: parseInt(offset as string, 10), - }); - } catch (error: any) { - res.status(500).json({ error: error.message }); - } -}); - -/** - * GET /api/dutchie-az/stores/:id/brands - * Get brands for a specific store - */ -router.get('/stores/:id/brands', async (req: Request, res: Response) => { - try { - const { id } = req.params; - - const { rows: brands } = await query( - ` - SELECT - brand_name as brand, - COUNT(*) as product_count - FROM dutchie_products - WHERE dispensary_id = $1 AND brand_name IS NOT NULL - GROUP BY brand_name - ORDER BY product_count DESC - `, - [parseInt(id, 10)] - ); - - res.json({ brands }); - } catch (error: any) { - res.status(500).json({ error: error.message }); - } -}); - -/** - * GET /api/dutchie-az/stores/:id/categories - * Get categories for a specific store - */ -router.get('/stores/:id/categories', async (req: Request, res: Response) => { - try { - const { id } = req.params; - - const { rows: categories } = await query( - ` - SELECT - type, - subcategory, - COUNT(*) as product_count - FROM 
dutchie_products - WHERE dispensary_id = $1 AND type IS NOT NULL - GROUP BY type, subcategory - ORDER BY type, subcategory - `, - [parseInt(id, 10)] - ); - - res.json({ categories }); - } catch (error: any) { - res.status(500).json({ error: error.message }); - } -}); - -// ============================================================ -// PRODUCTS -// ============================================================ - -/** - * GET /api/dutchie-az/products - * List products with filtering on our own DB - */ -router.get('/products', async (req: Request, res: Response) => { - try { - const { - storeId, - stockStatus, - type, - subcategory, - brandName, - search, - limit = '50', - offset = '0', - } = req.query; - - let whereClause = 'WHERE 1=1'; - const params: any[] = []; - let paramIndex = 1; - - if (storeId) { - whereClause += ` AND dispensary_id = $${paramIndex}`; - params.push(parseInt(storeId as string, 10)); - paramIndex++; - } - - if (stockStatus) { - whereClause += ` AND stock_status = $${paramIndex}`; - params.push(stockStatus); - paramIndex++; - } - - if (type) { - whereClause += ` AND type = $${paramIndex}`; - params.push(type); - paramIndex++; - } - - if (subcategory) { - whereClause += ` AND subcategory = $${paramIndex}`; - params.push(subcategory); - paramIndex++; - } - - if (brandName) { - whereClause += ` AND brand_name ILIKE $${paramIndex}`; - params.push(`%${brandName}%`); - paramIndex++; - } - - if (search) { - whereClause += ` AND (name ILIKE $${paramIndex} OR brand_name ILIKE $${paramIndex})`; - params.push(`%${search}%`); - paramIndex++; - } - - params.push(parseInt(limit as string, 10), parseInt(offset as string, 10)); - - const { rows } = await query( - ` - SELECT - p.*, - d.name as store_name, - d.city as store_city - FROM dutchie_products p - JOIN dispensaries d ON p.dispensary_id = d.id - ${whereClause} - ORDER BY p.updated_at DESC - LIMIT $${paramIndex} OFFSET $${paramIndex + 1} - `, - params - ); - - // Get total count - const { rows: countRows } 
= await query( - `SELECT COUNT(*) as total FROM dutchie_products ${whereClause}`, - params.slice(0, -2) - ); - - res.json({ - products: rows, - total: parseInt(countRows[0]?.total || '0', 10), - limit: parseInt(limit as string, 10), - offset: parseInt(offset as string, 10), - }); - } catch (error: any) { - res.status(500).json({ error: error.message }); - } -}); - -/** - * GET /api/dutchie-az/products/:id - * Get a single product with its latest snapshot - */ -router.get('/products/:id', async (req: Request, res: Response) => { - try { - const { id } = req.params; - - const { rows: productRows } = await query( - ` - SELECT - p.*, - d.name as store_name, - d.city as store_city, - d.slug as store_slug - FROM dutchie_products p - JOIN dispensaries d ON p.dispensary_id = d.id - WHERE p.id = $1 - `, - [id] - ); - - if (productRows.length === 0) { - return res.status(404).json({ error: 'Product not found' }); - } - - // Get latest snapshot - const { rows: snapshotRows } = await query( - ` - SELECT * FROM dutchie_product_snapshots - WHERE dutchie_product_id = $1 - ORDER BY crawled_at DESC - LIMIT 1 - `, - [id] - ); - - res.json({ - product: productRows[0], - latestSnapshot: snapshotRows[0] || null, - }); - } catch (error: any) { - res.status(500).json({ error: error.message }); - } -}); - -/** - * GET /api/dutchie-az/products/:id/snapshots - * Get snapshot history for a product - */ -router.get('/products/:id/snapshots', async (req: Request, res: Response) => { - try { - const { id } = req.params; - const { limit = '50', offset = '0' } = req.query; - - const { rows } = await query( - ` - SELECT * FROM dutchie_product_snapshots - WHERE dutchie_product_id = $1 - ORDER BY crawled_at DESC - LIMIT $2 OFFSET $3 - `, - [id, parseInt(limit as string, 10), parseInt(offset as string, 10)] - ); - - res.json({ snapshots: rows }); - } catch (error: any) { - res.status(500).json({ error: error.message }); - } -}); - -/** - * GET /api/dutchie-az/products/:id/similar - * Get similar 
products (same brand + category), limited to 4 - * Returns products with lowest prices first - */ -router.get('/products/:id/similar', async (req: Request, res: Response) => { - try { - const { id } = req.params; - - // Use the exact SQL query provided - const { rows } = await query<{ - product_id: number; - name: string; - brand_name: string; - image_url: string; - rec_min_price_cents: number; - }>( - ` - WITH base AS ( - SELECT id AS base_product_id, brand_name, category - FROM dutchie_products WHERE id = $1 - ), - latest_prices AS ( - SELECT DISTINCT ON (dps.dutchie_product_id) - dps.dutchie_product_id, dps.rec_min_price_cents - FROM dutchie_product_snapshots dps - ORDER BY dps.dutchie_product_id, dps.crawled_at DESC - ) - SELECT p.id AS product_id, p.name, p.brand_name, p.primary_image_url as image_url, lp.rec_min_price_cents - FROM dutchie_products p - JOIN base b ON p.category = b.category AND p.brand_name = b.brand_name - JOIN latest_prices lp ON lp.dutchie_product_id = p.id - WHERE p.id <> b.base_product_id AND lp.rec_min_price_cents IS NOT NULL - ORDER BY lp.rec_min_price_cents ASC - LIMIT 4 - `, - [id] - ); - - // Transform to the expected response format - const similarProducts = rows.map((row) => ({ - productId: row.product_id, - name: row.name, - brandName: row.brand_name, - imageUrl: row.image_url, - price: row.rec_min_price_cents ? 
row.rec_min_price_cents / 100 : null, - })); - - res.json({ similarProducts }); - } catch (error: any) { - console.error('Error fetching similar products:', error); - res.status(500).json({ error: error.message }); - } -}); - -/** - * GET /api/dutchie-az/products/:id/availability - * Get dispensaries that carry this product, with distance from user location - * Query params: - * - lat: User latitude (required) - * - lng: User longitude (required) - * - max_radius_miles: Maximum search radius in miles (optional, default 50) - */ -router.get('/products/:id/availability', async (req: Request, res: Response) => { - try { - const { id } = req.params; - const { lat, lng, max_radius_miles = '50' } = req.query; - - // Validate required params - if (!lat || !lng) { - return res.status(400).json({ error: 'lat and lng query parameters are required' }); - } - - const userLat = parseFloat(lat as string); - const userLng = parseFloat(lng as string); - const maxRadius = parseFloat(max_radius_miles as string); - - if (isNaN(userLat) || isNaN(userLng)) { - return res.status(400).json({ error: 'lat and lng must be valid numbers' }); - } - - // First get the product to find its external_product_id - const { rows: productRows } = await query( - `SELECT external_product_id, name, brand_name FROM dutchie_products WHERE id = $1`, - [id] - ); - - if (productRows.length === 0) { - return res.status(404).json({ error: 'Product not found' }); - } - - const externalProductId = productRows[0].external_product_id; - - // Find all dispensaries carrying this product (by external_product_id match) - // with distance calculation using Haversine formula - const { rows: offers } = await query<{ - dispensary_id: number; - dispensary_name: string; - city: string; - state: string; - address: string; - latitude: number; - longitude: number; - menu_url: string; - stock_status: string; - rec_min_price_cents: number; - distance_miles: number; - }>( - ` - WITH latest_snapshots AS ( - SELECT DISTINCT ON 
(s.dutchie_product_id) - s.dutchie_product_id, - s.dispensary_id, - s.stock_status, - s.rec_min_price_cents, - s.crawled_at - FROM dutchie_product_snapshots s - JOIN dutchie_products p ON s.dutchie_product_id = p.id - WHERE p.external_product_id = $1 - ORDER BY s.dutchie_product_id, s.crawled_at DESC - ) - SELECT - d.id as dispensary_id, - d.name as dispensary_name, - d.city, - d.state, - d.address, - d.latitude, - d.longitude, - d.menu_url, - ls.stock_status, - ls.rec_min_price_cents, - -- Haversine distance formula (in miles) - (3959 * acos( - cos(radians($2)) * cos(radians(d.latitude)) * - cos(radians(d.longitude) - radians($3)) + - sin(radians($2)) * sin(radians(d.latitude)) - )) as distance_miles - FROM latest_snapshots ls - JOIN dispensaries d ON ls.dispensary_id = d.id - WHERE d.latitude IS NOT NULL - AND d.longitude IS NOT NULL - HAVING (3959 * acos( - cos(radians($2)) * cos(radians(d.latitude)) * - cos(radians(d.longitude) - radians($3)) + - sin(radians($2)) * sin(radians(d.latitude)) - )) <= $4 - ORDER BY distance_miles ASC - `, - [externalProductId, userLat, userLng, maxRadius] - ); - - // Find the best (lowest) price for isBestPrice flag - const validPrices = offers - .filter(o => o.rec_min_price_cents && o.rec_min_price_cents > 0) - .map(o => o.rec_min_price_cents); - const bestPrice = validPrices.length > 0 ? Math.min(...validPrices) : null; - - // Transform for frontend - const availability = offers.map(o => ({ - dispensaryId: o.dispensary_id, - dispensaryName: o.dispensary_name, - city: o.city, - state: o.state, - address: o.address, - latitude: o.latitude, - longitude: o.longitude, - menuUrl: o.menu_url, - stockStatus: o.stock_status || 'unknown', - price: o.rec_min_price_cents ? 
o.rec_min_price_cents / 100 : null, - distanceMiles: Math.round(o.distance_miles * 10) / 10, // Round to 1 decimal - isBestPrice: bestPrice !== null && o.rec_min_price_cents === bestPrice, - })); - - res.json({ - productId: parseInt(id, 10), - productName: productRows[0].name, - brandName: productRows[0].brand_name, - totalCount: availability.length, - offers: availability, - }); - } catch (error: any) { - console.error('Error fetching product availability:', error); - res.status(500).json({ error: error.message }); - } -}); - -// ============================================================ -// CATEGORIES -// ============================================================ - -/** - * GET /api/dutchie-az/categories - * Get all categories with counts - */ -router.get('/categories', async (_req: Request, res: Response) => { - try { - const { rows } = await query(`SELECT * FROM v_categories`); - res.json({ categories: rows }); - } catch (error: any) { - res.status(500).json({ error: error.message }); - } -}); - -// ============================================================ -// BRANDS -// ============================================================ - -/** - * GET /api/dutchie-az/brands - * Get all brands with counts - */ -router.get('/brands', async (req: Request, res: Response) => { - try { - const { limit = '100', offset = '0' } = req.query; - - const { rows } = await query( - ` - SELECT * FROM v_brands - LIMIT $1 OFFSET $2 - `, - [parseInt(limit as string, 10), parseInt(offset as string, 10)] - ); - - res.json({ brands: rows }); - } catch (error: any) { - res.status(500).json({ error: error.message }); - } -}); - -// ============================================================ -// ADMIN ACTIONS -// ============================================================ - -/** - * POST /api/dutchie-az/admin/init-schema - * Initialize the database schema - */ -router.post('/admin/init-schema', async (_req: Request, res: Response) => { - try { - await ensureSchema(); - res.json({ 
success: true, message: 'Schema initialized' }); - } catch (error: any) { - res.status(500).json({ error: error.message }); - } -}); - -/** - * POST /api/dutchie-az/admin/import-azdhs - * Import dispensaries from AZDHS (main database) - */ -router.post('/admin/import-azdhs', async (_req: Request, res: Response) => { - try { - const result = await importAZDHSDispensaries(); - res.json(result); - } catch (error: any) { - res.status(500).json({ error: error.message }); - } -}); - -/** - * POST /api/dutchie-az/admin/resolve-platform-ids - * Resolve Dutchie platform IDs for all dispensaries - */ -router.post('/admin/resolve-platform-ids', async (_req: Request, res: Response) => { - try { - const result = await resolvePlatformDispensaryIds(); - res.json(result); - } catch (error: any) { - res.status(500).json({ error: error.message }); - } -}); - -/** - * POST /api/dutchie-az/admin/crawl-store/:id - * Crawl a single store - */ -router.post('/admin/crawl-store/:id', async (req: Request, res: Response) => { - try { - const { id } = req.params; - const { pricingType = 'rec', useBothModes = true } = req.body; - - const dispensary = await getDispensaryById(parseInt(id, 10)); - if (!dispensary) { - return res.status(404).json({ error: 'Store not found' }); - } - - const result = await crawlDispensaryProducts(dispensary, pricingType, { useBothModes }); - res.json(result); - } catch (error: any) { - res.status(500).json({ error: error.message }); - } -}); - -/** - * GET /api/dutchie-az/admin/stats - * Get import and crawl statistics - */ -router.get('/admin/stats', async (_req: Request, res: Response) => { - try { - const importStats = await getImportStats(); - - // Get stock status distribution - const { rows: stockStats } = await query(` - SELECT - stock_status, - COUNT(*) as count - FROM dutchie_products - GROUP BY stock_status - `); - - // Get recent crawl jobs - const { rows: recentJobs } = await query(` - SELECT * FROM dispensary_crawl_jobs - ORDER BY created_at DESC - 
LIMIT 10 - `); - - res.json({ - import: importStats, - stockDistribution: stockStats, - recentJobs, - }); - } catch (error: any) { - res.status(500).json({ error: error.message }); - } -}); - -// ============================================================ -// SCHEDULER ADMIN -// ============================================================ - -/** - * GET /api/dutchie-az/admin/scheduler/status - * Get scheduler status - */ -router.get('/admin/scheduler/status', async (_req: Request, res: Response) => { - try { - const status = getSchedulerStatus(); - res.json(status); - } catch (error: any) { - res.status(500).json({ error: error.message }); - } -}); - -/** - * POST /api/dutchie-az/admin/scheduler/start - * Start the scheduler - */ -router.post('/admin/scheduler/start', async (_req: Request, res: Response) => { - try { - startScheduler(); - res.json({ success: true, message: 'Scheduler started' }); - } catch (error: any) { - res.status(500).json({ error: error.message }); - } -}); - -/** - * POST /api/dutchie-az/admin/scheduler/stop - * Stop the scheduler - */ -router.post('/admin/scheduler/stop', async (_req: Request, res: Response) => { - try { - stopScheduler(); - res.json({ success: true, message: 'Scheduler stopped' }); - } catch (error: any) { - res.status(500).json({ error: error.message }); - } -}); - -/** - * POST /api/dutchie-az/admin/scheduler/trigger - * Trigger an immediate crawl cycle - */ -router.post('/admin/scheduler/trigger', async (_req: Request, res: Response) => { - try { - const result = await triggerImmediateCrawl(); - res.json(result); - } catch (error: any) { - res.status(500).json({ error: error.message }); - } -}); - -/** - * POST /api/az/admin/crawl/:id - * Crawl a single dispensary with job tracking - * - * @deprecated Use POST /api/admin/crawl/:dispensaryId instead. - * This route is kept for backward compatibility only. 
- * The canonical crawl endpoint is now /api/admin/crawl/:dispensaryId - */ -router.post('/admin/crawl/:id', async (req: Request, res: Response) => { - try { - const { id } = req.params; - const { pricingType = 'rec', useBothModes = true } = req.body; - - // Fetch the dispensary first - const dispensary = await getDispensaryById(parseInt(id, 10)); - if (!dispensary) { - return res.status(404).json({ error: 'Dispensary not found' }); - } - - const result = await crawlSingleDispensary(dispensary, pricingType, { useBothModes }); - res.json(result); - } catch (error: any) { - res.status(500).json({ error: error.message }); - } -}); - -import { bulkEnqueueJobs, getQueueStats as getJobQueueStats } from '../services/job-queue'; - -/** - * GET /api/markets/admin/crawlable-stores - * Get all crawlable stores with their crawl status - * OPTIMIZED: Replaced correlated subqueries with LEFT JOINs - */ -router.get('/admin/crawlable-stores', async (_req: Request, res: Response) => { - try { - const { rows } = await query(` - WITH product_counts AS ( - SELECT dispensary_id, COUNT(*) as product_count - FROM dutchie_products - GROUP BY dispensary_id - ), - snapshot_times AS ( - SELECT p.dispensary_id, MAX(s.crawled_at) as last_snapshot_at - FROM dutchie_product_snapshots s - JOIN dutchie_products p ON s.dutchie_product_id = p.id - GROUP BY p.dispensary_id - ) - SELECT - d.id, - d.name, - d.city, - d.state, - d.menu_type, - d.platform_dispensary_id, - d.menu_url, - d.website, - d.last_crawl_at, - d.consecutive_failures, - d.failed_at, - COALESCE(pc.product_count, 0) as product_count, - st.last_snapshot_at - FROM dispensaries d - LEFT JOIN product_counts pc ON pc.dispensary_id = d.id - LEFT JOIN snapshot_times st ON st.dispensary_id = d.id - WHERE d.menu_type = 'dutchie' - AND d.state = 'AZ' - ORDER BY d.name - `); - - const ready = rows.filter((r: any) => r.platform_dispensary_id && !r.failed_at); - const needsPlatformId = rows.filter((r: any) => !r.platform_dispensary_id && 
!r.failed_at); - const failed = rows.filter((r: any) => r.failed_at); - - res.json({ - total: rows.length, - ready: ready.length, - needsPlatformId: needsPlatformId.length, - failed: failed.length, - stores: rows.map((r: any) => ({ - id: r.id, - name: r.name, - city: r.city, - state: r.state, - menuType: r.menu_type, - platformDispensaryId: r.platform_dispensary_id, - menuUrl: r.menu_url, - website: r.website, - lastCrawlAt: r.last_crawl_at, - productCount: parseInt(r.product_count || '0', 10), - lastSnapshotAt: r.last_snapshot_at, - status: r.failed_at - ? 'failed' - : r.platform_dispensary_id - ? 'ready' - : 'needs_platform_id', - })), - }); - } catch (error: any) { - res.status(500).json({ error: error.message }); - } -}); - -// Legacy alias (deprecated - use /admin/crawlable-stores) -router.get('/admin/dutchie-stores', (req: Request, res: Response) => { - res.redirect(307, '/api/markets/admin/crawlable-stores'); -}); - -/** - * POST /api/markets/admin/crawl-all - * Enqueue crawl jobs for ALL ready stores - * This is a convenience endpoint to queue all stores without triggering the scheduler - */ -router.post('/admin/crawl-all', async (req: Request, res: Response) => { - try { - const { pricingType = 'rec', useBothModes = true } = req.body; - - // Get all "ready" dispensaries (menu_type='dutchie' AND platform_dispensary_id IS NOT NULL AND not failed) - const { rows: rawRows } = await query( - ` - SELECT id, name, platform_dispensary_id FROM dispensaries - WHERE state = 'AZ' - AND menu_type = 'dutchie' - AND platform_dispensary_id IS NOT NULL - AND failed_at IS NULL - ORDER BY last_crawl_at ASC NULLS FIRST - ` - ); - - if (rawRows.length === 0) { - return res.json({ - success: true, - message: 'No ready dispensaries to crawl. 
Run menu detection first.', - enqueued: 0, - skipped: 0, - dispensaries: [], - }); - } - - const dispensaryIds = rawRows.map((r: any) => r.id); - - // Bulk enqueue jobs (skips dispensaries that already have pending/running jobs) - const { enqueued, skipped } = await bulkEnqueueJobs( - 'dutchie_product_crawl', - dispensaryIds, - { - priority: 0, - metadata: { pricingType, useBothModes }, - } - ); - - // Get current queue stats - const queueStats = await getJobQueueStats(); - - res.json({ - success: true, - message: `Enqueued ${enqueued} crawl jobs for Dutchie stores`, - totalReady: rawRows.length, - enqueued, - skipped, - queueStats, - dispensaries: rawRows.map((r: any) => ({ - id: r.id, - name: r.name, - platformDispensaryId: r.platform_dispensary_id, - })), - }); - } catch (error: any) { - res.status(500).json({ error: error.message }); - } -}); - -/** - * GET /api/dutchie-az/admin/jobs - * Get crawl job history - */ -router.get('/admin/jobs', async (req: Request, res: Response) => { - try { - const { status, dispensaryId, limit = '50', offset = '0' } = req.query; - - let whereClause = 'WHERE 1=1'; - const params: any[] = []; - let paramIndex = 1; - - if (status) { - whereClause += ` AND status = $${paramIndex}`; - params.push(status); - paramIndex++; - } - - if (dispensaryId) { - whereClause += ` AND dispensary_id = $${paramIndex}`; - params.push(parseInt(dispensaryId as string, 10)); - paramIndex++; - } - - params.push(parseInt(limit as string, 10), parseInt(offset as string, 10)); - - const { rows } = await query( - ` - SELECT - cj.*, - d.name as dispensary_name, - d.slug as dispensary_slug - FROM dispensary_crawl_jobs cj - LEFT JOIN dispensaries d ON cj.dispensary_id = d.id - ${whereClause} - ORDER BY cj.created_at DESC - LIMIT $${paramIndex} OFFSET $${paramIndex + 1} - `, - params - ); - - const { rows: countRows } = await query( - `SELECT COUNT(*) as total FROM dispensary_crawl_jobs ${whereClause}`, - params.slice(0, -2) - ); - - res.json({ - jobs: rows, - 
total: parseInt(countRows[0]?.total || '0', 10), - limit: parseInt(limit as string, 10), - offset: parseInt(offset as string, 10), - }); - } catch (error: any) { - res.status(500).json({ error: error.message }); - } -}); - -// ============================================================ -// SCHEDULES (CONFIG CRUD) -// ============================================================ - -/** - * GET /api/dutchie-az/admin/schedules - * Get all schedule configurations - */ -router.get('/admin/schedules', async (_req: Request, res: Response) => { - try { - const schedules = await getAllSchedules(); - res.json({ schedules }); - } catch (error: any) { - res.status(500).json({ error: error.message }); - } -}); - -/** - * GET /api/dutchie-az/admin/schedules/:id - * Get a single schedule by ID - */ -router.get('/admin/schedules/:id', async (req: Request, res: Response) => { - try { - const { id } = req.params; - const schedule = await getScheduleById(parseInt(id, 10)); - - if (!schedule) { - return res.status(404).json({ error: 'Schedule not found' }); - } - - res.json(schedule); - } catch (error: any) { - res.status(500).json({ error: error.message }); - } -}); - -/** - * POST /api/dutchie-az/admin/schedules - * Create a new schedule - */ -router.post('/admin/schedules', async (req: Request, res: Response) => { - try { - const { - jobName, - description, - enabled = true, - baseIntervalMinutes, - jitterMinutes, - jobConfig, - startImmediately = false, - } = req.body; - - if (!jobName || typeof baseIntervalMinutes !== 'number' || typeof jitterMinutes !== 'number') { - return res.status(400).json({ - error: 'jobName, baseIntervalMinutes, and jitterMinutes are required', - }); - } - - const schedule = await createSchedule({ - jobName, - description, - enabled, - baseIntervalMinutes, - jitterMinutes, - jobConfig, - startImmediately, - }); - - res.status(201).json(schedule); - } catch (error: any) { - // Handle unique constraint violation - if (error.code === '23505') { - return 
res.status(409).json({ error: `Schedule "${req.body.jobName}" already exists` }); - } - res.status(500).json({ error: error.message }); - } -}); - -/** - * PUT /api/dutchie-az/admin/schedules/:id - * Update a schedule - */ -router.put('/admin/schedules/:id', async (req: Request, res: Response) => { - try { - const { id } = req.params; - const { description, enabled, baseIntervalMinutes, jitterMinutes, jobConfig } = req.body; - - const schedule = await updateSchedule(parseInt(id, 10), { - description, - enabled, - baseIntervalMinutes, - jitterMinutes, - jobConfig, - }); - - if (!schedule) { - return res.status(404).json({ error: 'Schedule not found' }); - } - - res.json(schedule); - } catch (error: any) { - res.status(500).json({ error: error.message }); - } -}); - -/** - * DELETE /api/dutchie-az/admin/schedules/:id - * Delete a schedule - */ -router.delete('/admin/schedules/:id', async (req: Request, res: Response) => { - try { - const { id } = req.params; - const deleted = await deleteSchedule(parseInt(id, 10)); - - if (!deleted) { - return res.status(404).json({ error: 'Schedule not found' }); - } - - res.json({ success: true, message: 'Schedule deleted' }); - } catch (error: any) { - res.status(500).json({ error: error.message }); - } -}); - -/** - * POST /api/dutchie-az/admin/schedules/:id/trigger - * Trigger immediate execution of a schedule - */ -router.post('/admin/schedules/:id/trigger', async (req: Request, res: Response) => { - try { - const { id } = req.params; - const result = await triggerScheduleNow(parseInt(id, 10)); - - if (!result.success) { - return res.status(400).json({ error: result.message }); - } - - res.json(result); - } catch (error: any) { - res.status(500).json({ error: error.message }); - } -}); - -/** - * POST /api/dutchie-az/admin/schedules/init - * Initialize default schedules if they don't exist - */ -router.post('/admin/schedules/init', async (_req: Request, res: Response) => { - try { - await initializeDefaultSchedules(); - const 
schedules = await getAllSchedules(); - res.json({ success: true, schedules }); - } catch (error: any) { - res.status(500).json({ error: error.message }); - } -}); - -/** - * GET /api/dutchie-az/admin/schedules/:id/logs - * Get run logs for a specific schedule - */ -router.get('/admin/schedules/:id/logs', async (req: Request, res: Response) => { - try { - const { id } = req.params; - const { limit = '50', offset = '0' } = req.query; - - const result = await getRunLogs({ - scheduleId: parseInt(id, 10), - limit: parseInt(limit as string, 10), - offset: parseInt(offset as string, 10), - }); - - res.json(result); - } catch (error: any) { - res.status(500).json({ error: error.message }); - } -}); - -/** - * GET /api/dutchie-az/admin/run-logs - * Get all run logs with filtering - */ -router.get('/admin/run-logs', async (req: Request, res: Response) => { - try { - const { scheduleId, jobName, limit = '50', offset = '0' } = req.query; - - const result = await getRunLogs({ - scheduleId: scheduleId ? 
parseInt(scheduleId as string, 10) : undefined, - jobName: jobName as string | undefined, - limit: parseInt(limit as string, 10), - offset: parseInt(offset as string, 10), - }); - - res.json(result); - } catch (error: any) { - res.status(500).json({ error: error.message }); - } -}); - -// ============================================================ -// DEBUG ROUTES -// ============================================================ - -/** - * GET /api/dutchie-az/debug/summary - * Get overall system summary for debugging - */ -router.get('/debug/summary', async (_req: Request, res: Response) => { - try { - // Get table counts - const { rows: tableCounts } = await query(` - SELECT - (SELECT COUNT(*) FROM dispensaries) as dispensary_count, - (SELECT COUNT(*) FROM dispensaries WHERE platform_dispensary_id IS NOT NULL) as dispensaries_with_platform_id, - (SELECT COUNT(*) FROM dutchie_products) as product_count, - (SELECT COUNT(*) FROM dutchie_product_snapshots) as snapshot_count, - (SELECT COUNT(*) FROM dispensary_crawl_jobs) as job_count, - (SELECT COUNT(*) FROM dispensary_crawl_jobs WHERE status = 'completed') as completed_jobs, - (SELECT COUNT(*) FROM dispensary_crawl_jobs WHERE status = 'failed') as failed_jobs - `); - - // Get stock status distribution - const { rows: stockDistribution } = await query(` - SELECT - stock_status, - COUNT(*) as count - FROM dutchie_products - GROUP BY stock_status - ORDER BY count DESC - `); - - // Get products by dispensary - const { rows: productsByDispensary } = await query(` - SELECT - d.id, - d.name, - d.slug, - d.platform_dispensary_id, - COUNT(p.id) as product_count, - MAX(p.updated_at) as last_product_update - FROM dispensaries d - LEFT JOIN dutchie_products p ON d.id = p.dispensary_id - WHERE d.state = 'AZ' - GROUP BY d.id, d.name, d.slug, d.platform_dispensary_id - ORDER BY product_count DESC - LIMIT 20 - `); - - // Get recent snapshots - const { rows: recentSnapshots } = await query(` - SELECT - s.id, - s.dutchie_product_id, - 
p.name as product_name, - d.name as dispensary_name, - s.crawled_at - FROM dutchie_product_snapshots s - JOIN dutchie_products p ON s.dutchie_product_id = p.id - JOIN dispensaries d ON p.dispensary_id = d.id - ORDER BY s.crawled_at DESC - LIMIT 10 - `); - - res.json({ - tableCounts: tableCounts[0], - stockDistribution, - productsByDispensary, - recentSnapshots, - }); - } catch (error: any) { - res.status(500).json({ error: error.message }); - } -}); - -/** - * GET /api/dutchie-az/debug/store/:id - * Get detailed debug info for a specific store - */ -router.get('/debug/store/:id', async (req: Request, res: Response) => { - try { - const { id } = req.params; - - // Get dispensary info - const { rows: dispensaryRows } = await query( - `SELECT ${DISPENSARY_COLUMNS} FROM dispensaries WHERE id = $1`, - [parseInt(id, 10)] - ); - - if (dispensaryRows.length === 0) { - return res.status(404).json({ error: 'Store not found' }); - } - - const dispensary = dispensaryRows[0]; - - // Get product stats - const { rows: productStats } = await query( - ` - SELECT - COUNT(*) as total_products, - COUNT(*) FILTER (WHERE stock_status = 'in_stock') as in_stock, - COUNT(*) FILTER (WHERE stock_status = 'out_of_stock') as out_of_stock, - COUNT(*) FILTER (WHERE stock_status = 'unknown') as unknown, - COUNT(*) FILTER (WHERE stock_status = 'missing_from_feed') as missing_from_feed, - MIN(first_seen_at) as earliest_product, - MAX(last_seen_at) as latest_product, - MAX(updated_at) as last_update - FROM dutchie_products - WHERE dispensary_id = $1 - `, - [id] - ); - - // Get snapshot stats - const { rows: snapshotStats } = await query( - ` - SELECT - COUNT(*) as total_snapshots, - MIN(crawled_at) as earliest_snapshot, - MAX(crawled_at) as latest_snapshot, - COUNT(DISTINCT dutchie_product_id) as products_with_snapshots - FROM dutchie_product_snapshots s - JOIN dutchie_products p ON s.dutchie_product_id = p.id - WHERE p.dispensary_id = $1 - `, - [id] - ); - - // Get crawl job history - const { rows: 
recentJobs } = await query( - ` - SELECT - id, - status, - started_at, - completed_at, - products_found, - products_new, - products_updated, - error_message, - created_at - FROM dispensary_crawl_jobs - WHERE dispensary_id = $1 - ORDER BY created_at DESC - LIMIT 10 - `, - [id] - ); - - // Get sample products (5 in-stock, 5 out-of-stock) - const { rows: sampleInStock } = await query( - ` - SELECT - p.id, - p.name, - p.brand_name, - p.type, - p.stock_status, - p.updated_at - FROM dutchie_products p - WHERE p.dispensary_id = $1 AND p.stock_status = 'in_stock' - ORDER BY p.updated_at DESC - LIMIT 5 - `, - [id] - ); - - const { rows: sampleOutOfStock } = await query( - ` - SELECT - p.id, - p.name, - p.brand_name, - p.type, - p.stock_status, - p.updated_at - FROM dutchie_products p - WHERE p.dispensary_id = $1 AND p.stock_status = 'out_of_stock' - ORDER BY p.updated_at DESC - LIMIT 5 - `, - [id] - ); - - // Get categories breakdown - const { rows: categories } = await query( - ` - SELECT - type, - subcategory, - COUNT(*) as count - FROM dutchie_products - WHERE dispensary_id = $1 - GROUP BY type, subcategory - ORDER BY count DESC - `, - [id] - ); - - res.json({ - dispensary, - productStats: productStats[0], - snapshotStats: snapshotStats[0], - recentJobs, - sampleProducts: { - inStock: sampleInStock, - outOfStock: sampleOutOfStock, - }, - categories, - }); - } catch (error: any) { - res.status(500).json({ error: error.message }); - } -}); - -// ============================================================ -// LIVE CRAWLER STATUS ROUTES -// ============================================================ - -import { - getQueueStats, - getActiveWorkers, - getRunningJobs, - recoverStaleJobs, -} from '../services/job-queue'; - -/** - * GET /api/dutchie-az/monitor/active-jobs - * Get all currently running jobs with real-time status including worker info - * OPTIMIZED: Run all queries in parallel - */ -router.get('/monitor/active-jobs', async (_req: Request, res: Response) => { - 
try { - // Run all queries in parallel for better performance - const [scheduledJobsResult, crawlJobsResult, queueStats, activeWorkers] = await Promise.all([ - // Query 1: Running scheduled jobs from job_run_logs - query(` - SELECT - jrl.id, - jrl.schedule_id, - jrl.job_name, - jrl.status, - jrl.started_at, - jrl.items_processed, - jrl.items_succeeded, - jrl.items_failed, - jrl.metadata, - jrl.worker_name, - jrl.run_role, - js.description as job_description, - js.worker_name as schedule_worker_name, - js.worker_role as schedule_worker_role, - EXTRACT(EPOCH FROM (NOW() - jrl.started_at)) as duration_seconds - FROM job_run_logs jrl - LEFT JOIN job_schedules js ON jrl.schedule_id = js.id - WHERE jrl.status = 'running' - ORDER BY jrl.started_at DESC - `), - - // Query 2: Running crawl jobs with dispensary info - query(` - SELECT - cj.id, - cj.job_type, - cj.dispensary_id, - d.name as dispensary_name, - d.city, - d.platform_dispensary_id, - cj.status, - cj.started_at, - cj.claimed_by as worker_id, - cj.worker_hostname, - cj.claimed_at, - cj.enqueued_by_worker, - cj.products_found, - cj.products_upserted, - cj.snapshots_created, - cj.current_page, - cj.total_pages, - cj.last_heartbeat_at, - cj.retry_count, - EXTRACT(EPOCH FROM (NOW() - cj.started_at)) as duration_seconds - FROM dispensary_crawl_jobs cj - LEFT JOIN dispensaries d ON cj.dispensary_id = d.id - WHERE cj.status = 'running' - ORDER BY cj.started_at DESC - `), - - // Query 3: Queue stats - getQueueStats(), - - // Query 4: Active workers - getActiveWorkers() - ]); - - const runningScheduledJobs = scheduledJobsResult.rows; - const runningCrawlJobs = crawlJobsResult.rows; - - // Also get in-memory scrapers if any (from the legacy system) - let inMemoryScrapers: any[] = []; - try { - const { activeScrapers } = await import('../../routes/scraper-monitor'); - inMemoryScrapers = Array.from(activeScrapers.values()).map(scraper => ({ - ...scraper, - source: 'in_memory', - duration_seconds: (Date.now() - 
scraper.startTime.getTime()) / 1000, - })); - } catch { - // Legacy scraper monitor not available - } - - res.json({ - scheduledJobs: runningScheduledJobs, - crawlJobs: runningCrawlJobs, - inMemoryScrapers, - activeWorkers, - queueStats, - totalActive: runningScheduledJobs.length + runningCrawlJobs.length + inMemoryScrapers.length, - }); - } catch (error: any) { - res.status(500).json({ error: error.message }); - } -}); - -/** - * GET /api/dutchie-az/monitor/recent-jobs - * Get recent completed jobs - */ -router.get('/monitor/recent-jobs', async (req: Request, res: Response) => { - try { - const { limit = '50' } = req.query; - const limitNum = Math.min(parseInt(limit as string, 10), 200); - - // Recent job run logs - const { rows: recentJobLogs } = await query(` - SELECT - jrl.id, - jrl.schedule_id, - jrl.job_name, - jrl.status, - jrl.started_at, - jrl.completed_at, - jrl.duration_ms, - jrl.error_message, - jrl.items_processed, - jrl.items_succeeded, - jrl.items_failed, - jrl.metadata, - jrl.worker_name, - jrl.run_role, - js.description as job_description, - js.worker_name as schedule_worker_name, - js.worker_role as schedule_worker_role - FROM job_run_logs jrl - LEFT JOIN job_schedules js ON jrl.schedule_id = js.id - ORDER BY jrl.created_at DESC - LIMIT $1 - `, [limitNum]); - - // Recent crawl jobs (includes enqueued_by_worker for named workforce tracking) - const { rows: recentCrawlJobs } = await query(` - SELECT - cj.id, - cj.job_type, - cj.dispensary_id, - d.name as dispensary_name, - d.city, - cj.status, - cj.started_at, - cj.completed_at, - cj.error_message, - cj.products_found, - cj.snapshots_created, - cj.metadata, - cj.enqueued_by_worker, - EXTRACT(EPOCH FROM (COALESCE(cj.completed_at, NOW()) - cj.started_at)) * 1000 as duration_ms - FROM dispensary_crawl_jobs cj - LEFT JOIN dispensaries d ON cj.dispensary_id = d.id - ORDER BY cj.created_at DESC - LIMIT $1 - `, [limitNum]); - - res.json({ - jobLogs: recentJobLogs, - crawlJobs: recentCrawlJobs, - }); - } 
catch (error: any) { - res.status(500).json({ error: error.message }); - } -}); - -/** - * GET /api/dutchie-az/monitor/errors - * Get recent job errors - */ -router.get('/monitor/errors', async (req: Request, res: Response) => { - try { - const { limit = '20', hours = '24' } = req.query; - const limitNum = Math.min(parseInt(limit as string, 10), 100); - const hoursNum = Math.min(parseInt(hours as string, 10), 168); - - // Errors from job_run_logs - const { rows: jobErrors } = await query(` - SELECT - 'job_run_log' as source, - jrl.id, - jrl.job_name, - jrl.status, - jrl.started_at, - jrl.completed_at, - jrl.error_message, - jrl.items_processed, - jrl.items_failed, - jrl.metadata - FROM job_run_logs jrl - WHERE jrl.status IN ('error', 'partial') - AND jrl.created_at > NOW() - INTERVAL '${hoursNum} hours' - ORDER BY jrl.created_at DESC - LIMIT $1 - `, [limitNum]); - - // Errors from dispensary_crawl_jobs - const { rows: crawlErrors } = await query(` - SELECT - 'crawl_job' as source, - cj.id, - cj.job_type as job_name, - d.name as dispensary_name, - cj.status, - cj.started_at, - cj.completed_at, - cj.error_message, - cj.products_found as items_processed, - cj.metadata - FROM dispensary_crawl_jobs cj - LEFT JOIN dispensaries d ON cj.dispensary_id = d.id - WHERE cj.status = 'failed' - AND cj.created_at > NOW() - INTERVAL '${hoursNum} hours' - ORDER BY cj.created_at DESC - LIMIT $1 - `, [limitNum]); - - res.json({ - errors: [...jobErrors, ...crawlErrors].sort((a, b) => - new Date(b.started_at || b.created_at).getTime() - - new Date(a.started_at || a.created_at).getTime() - ).slice(0, limitNum), - }); - } catch (error: any) { - res.status(500).json({ error: error.message }); - } -}); - -/** - * GET /api/dutchie-az/monitor/summary - * Get overall monitoring summary - */ -router.get('/monitor/summary', async (_req: Request, res: Response) => { - try { - const { rows: stats } = await query(` - SELECT - (SELECT COUNT(*) FROM job_run_logs WHERE status = 'running') as 
running_scheduled_jobs, - (SELECT COUNT(*) FROM dispensary_crawl_jobs WHERE status = 'running') as running_dispensary_crawl_jobs, - (SELECT COUNT(*) FROM job_run_logs WHERE status = 'success' AND created_at > NOW() - INTERVAL '24 hours') as successful_jobs_24h, - (SELECT COUNT(*) FROM job_run_logs WHERE status IN ('error', 'partial') AND created_at > NOW() - INTERVAL '24 hours') as failed_jobs_24h, - (SELECT COUNT(*) FROM dispensary_crawl_jobs WHERE status = 'completed' AND created_at > NOW() - INTERVAL '24 hours') as successful_crawls_24h, - (SELECT COUNT(*) FROM dispensary_crawl_jobs WHERE status = 'failed' AND created_at > NOW() - INTERVAL '24 hours') as failed_crawls_24h, - (SELECT SUM(products_found) FROM dispensary_crawl_jobs WHERE status = 'completed' AND created_at > NOW() - INTERVAL '24 hours') as products_found_24h, - (SELECT SUM(snapshots_created) FROM dispensary_crawl_jobs WHERE status = 'completed' AND created_at > NOW() - INTERVAL '24 hours') as snapshots_created_24h, - (SELECT MAX(started_at) FROM job_run_logs) as last_job_started, - (SELECT MAX(completed_at) FROM job_run_logs WHERE status = 'success') as last_job_completed - `); - - // Get next scheduled runs (with worker names) - const { rows: nextRuns } = await query(` - SELECT - id, - job_name, - description, - worker_name, - worker_role, - enabled, - next_run_at, - last_status, - last_run_at - FROM job_schedules - WHERE enabled = true AND next_run_at IS NOT NULL - ORDER BY next_run_at ASC - LIMIT 5 - `); - - res.json({ - ...(stats[0] || {}), - nextRuns, - }); - } catch (error: any) { - res.status(500).json({ error: error.message }); - } -}); - -// ============================================================ -// MENU DETECTION ROUTES -// ============================================================ - -import { - detectAndResolveDispensary, - runBulkDetection, - getDetectionStats, - getDispensariesNeedingDetection, -} from '../services/menu-detection'; - -/** - * GET 
/api/dutchie-az/admin/detection/stats - * Get menu detection statistics - */ -router.get('/admin/detection/stats', async (_req: Request, res: Response) => { - try { - const stats = await getDetectionStats(); - res.json(stats); - } catch (error: any) { - res.status(500).json({ error: error.message }); - } -}); - -/** - * GET /api/dutchie-az/admin/detection/pending - * Get dispensaries that need menu detection - */ -router.get('/admin/detection/pending', async (req: Request, res: Response) => { - try { - const { state = 'AZ', limit = '100' } = req.query; - const dispensaries = await getDispensariesNeedingDetection({ - state: state as string, - limit: parseInt(limit as string, 10), - }); - res.json({ dispensaries, total: dispensaries.length }); - } catch (error: any) { - res.status(500).json({ error: error.message }); - } -}); - -/** - * POST /api/dutchie-az/admin/detection/detect/:id - * Detect menu provider and resolve platform ID for a single dispensary - */ -router.post('/admin/detection/detect/:id', async (req: Request, res: Response) => { - try { - const { id } = req.params; - const result = await detectAndResolveDispensary(parseInt(id, 10)); - res.json(result); - } catch (error: any) { - res.status(500).json({ error: error.message }); - } -}); - -/** - * POST /api/dutchie-az/admin/detection/detect-all - * Run bulk menu detection on all dispensaries needing it - */ -router.post('/admin/detection/detect-all', async (req: Request, res: Response) => { - try { - const { state = 'AZ', onlyUnknown = true, onlyMissingPlatformId = false, limit } = req.body; - - const result = await runBulkDetection({ - state, - onlyUnknown, - onlyMissingPlatformId, - limit, - }); - - res.json(result); - } catch (error: any) { - res.status(500).json({ error: error.message }); - } -}); - -/** - * POST /api/dutchie-az/admin/detection/trigger - * Trigger the menu detection scheduled job immediately - */ -router.post('/admin/detection/trigger', async (_req: Request, res: Response) => { - try 
{ - // Find the menu detection schedule and trigger it - const schedules = await getAllSchedules(); - const menuDetection = schedules.find(s => s.jobName === 'dutchie_az_menu_detection'); - - if (!menuDetection) { - return res.status(404).json({ error: 'Menu detection schedule not found. Run /admin/schedules/init first.' }); - } - - const result = await triggerScheduleNow(menuDetection.id); - res.json(result); - } catch (error: any) { - res.status(500).json({ error: error.message }); - } -}); - -// ============================================================ -// CRAWLER RELIABILITY / HEALTH ENDPOINTS (Phase 1) -// ============================================================ - -/** - * GET /api/dutchie-az/admin/crawler/health - * Get overall crawler health metrics - */ -router.get('/admin/crawler/health', async (_req: Request, res: Response) => { - try { - const { rows } = await query(`SELECT * FROM v_crawl_health`); - res.json(rows[0] || { - active_crawlers: 0, - degraded_crawlers: 0, - paused_crawlers: 0, - failed_crawlers: 0, - due_now: 0, - stores_with_failures: 0, - avg_consecutive_failures: 0, - successful_last_24h: 0, - }); - } catch (error: any) { - // View might not exist yet - res.json({ - active_crawlers: 0, - degraded_crawlers: 0, - paused_crawlers: 0, - failed_crawlers: 0, - due_now: 0, - error: 'View not available - run migration 046', - }); - } -}); - -/** - * GET /api/dutchie-az/admin/crawler/error-summary - * Get error summary by code over last 7 days - */ -router.get('/admin/crawler/error-summary', async (_req: Request, res: Response) => { - try { - const { rows } = await query(`SELECT * FROM v_crawl_error_summary`); - res.json({ errors: rows }); - } catch (error: any) { - res.json({ errors: [], error: 'View not available - run migration 046' }); - } -}); - -/** - * GET /api/dutchie-az/admin/crawler/status - * Get detailed status for all crawlers - */ -router.get('/admin/crawler/status', async (req: Request, res: Response) => { - try { - const { 
status, limit = '100', offset = '0' } = req.query; - - let whereClause = ''; - const params: any[] = []; - let paramIndex = 1; - - if (status) { - whereClause = `WHERE crawl_status = $${paramIndex}`; - params.push(status); - paramIndex++; - } - - params.push(parseInt(limit as string, 10), parseInt(offset as string, 10)); - - const { rows } = await query( - `SELECT * FROM v_crawler_status - ${whereClause} - ORDER BY consecutive_failures DESC, name ASC - LIMIT $${paramIndex} OFFSET $${paramIndex + 1}`, - params - ); - - const { rows: countRows } = await query( - `SELECT COUNT(*) as total FROM v_crawler_status ${whereClause}`, - params.slice(0, -2) - ); - - res.json({ - stores: rows, - total: parseInt(countRows[0]?.total || '0', 10), - limit: parseInt(limit as string, 10), - offset: parseInt(offset as string, 10), - }); - } catch (error: any) { - res.status(500).json({ error: error.message }); - } -}); - -/** - * GET /api/dutchie-az/admin/crawler/attempts - * Get recent crawl attempts (for debugging) - */ -router.get('/admin/crawler/attempts', async (req: Request, res: Response) => { - try { - const { dispensaryId, errorCode, limit = '50', offset = '0' } = req.query; - - let whereClause = 'WHERE 1=1'; - const params: any[] = []; - let paramIndex = 1; - - if (dispensaryId) { - whereClause += ` AND ca.dispensary_id = $${paramIndex}`; - params.push(parseInt(dispensaryId as string, 10)); - paramIndex++; - } - - if (errorCode) { - whereClause += ` AND ca.error_code = $${paramIndex}`; - params.push(errorCode); - paramIndex++; - } - - params.push(parseInt(limit as string, 10), parseInt(offset as string, 10)); - - const { rows } = await query( - `SELECT - ca.*, - d.name as dispensary_name, - d.city - FROM crawl_attempts ca - LEFT JOIN dispensaries d ON ca.dispensary_id = d.id - ${whereClause} - ORDER BY ca.started_at DESC - LIMIT $${paramIndex} OFFSET $${paramIndex + 1}`, - params - ); - - res.json({ attempts: rows }); - } catch (error: any) { - res.status(500).json({ error: 
error.message }); - } -}); - -/** - * POST /api/dutchie-az/admin/dispensaries/:id/pause - * Pause crawling for a dispensary - */ -router.post('/admin/dispensaries/:id/pause', async (req: Request, res: Response) => { - try { - const { id } = req.params; - - await query(` - UPDATE dispensaries - SET crawl_status = 'paused', - next_crawl_at = NULL, - updated_at = NOW() - WHERE id = $1 - `, [id]); - - res.json({ success: true, message: `Crawling paused for dispensary ${id}` }); - } catch (error: any) { - res.status(500).json({ error: error.message }); - } -}); - -/** - * POST /api/dutchie-az/admin/dispensaries/:id/resume - * Resume crawling for a paused/degraded dispensary - */ -router.post('/admin/dispensaries/:id/resume', async (req: Request, res: Response) => { - try { - const { id } = req.params; - - // Reset to active and schedule next crawl - await query(` - UPDATE dispensaries - SET crawl_status = 'active', - consecutive_failures = 0, - backoff_multiplier = 1.0, - next_crawl_at = NOW() + INTERVAL '5 minutes', - updated_at = NOW() - WHERE id = $1 - `, [id]); - - res.json({ success: true, message: `Crawling resumed for dispensary ${id}` }); - } catch (error: any) { - res.status(500).json({ error: error.message }); - } -}); - -// ============================================================ -// FAILED DISPENSARIES ROUTES -// ============================================================ - -/** - * GET /api/dutchie-az/admin/dispensaries/failed - * Get all dispensaries flagged as failed (for admin review) - */ -router.get('/admin/dispensaries/failed', async (_req: Request, res: Response) => { - try { - const { rows } = await query(` - SELECT - id, - name, - city, - state, - menu_url, - menu_type, - platform_dispensary_id, - consecutive_failures, - last_failure_at, - last_failure_reason, - failed_at, - failure_notes, - last_crawl_at, - updated_at - FROM dispensaries - WHERE failed_at IS NOT NULL - ORDER BY failed_at DESC - `); - - res.json({ - failed: rows, - total: 
rows.length, - }); - } catch (error: any) { - res.status(500).json({ error: error.message }); - } -}); - -/** - * GET /api/dutchie-az/admin/dispensaries/at-risk - * Get dispensaries with high failure counts (but not yet flagged as failed) - */ -router.get('/admin/dispensaries/at-risk', async (_req: Request, res: Response) => { - try { - const { rows } = await query(` - SELECT - id, - name, - city, - state, - menu_url, - menu_type, - consecutive_failures, - last_failure_at, - last_failure_reason, - last_crawl_at - FROM dispensaries - WHERE consecutive_failures >= 1 - AND failed_at IS NULL - ORDER BY consecutive_failures DESC, last_failure_at DESC - `); - - res.json({ - atRisk: rows, - total: rows.length, - }); - } catch (error: any) { - res.status(500).json({ error: error.message }); - } -}); - -/** - * POST /api/dutchie-az/admin/dispensaries/:id/unfail - * Restore a failed dispensary - clears failed status and resets for re-detection - */ -router.post('/admin/dispensaries/:id/unfail', async (req: Request, res: Response) => { - try { - const { id } = req.params; - - await query(` - UPDATE dispensaries - SET failed_at = NULL, - consecutive_failures = 0, - last_failure_at = NULL, - last_failure_reason = NULL, - failure_notes = NULL, - menu_type = NULL, - platform_dispensary_id = NULL, - updated_at = NOW() - WHERE id = $1 - `, [id]); - - res.json({ success: true, message: `Dispensary ${id} restored for re-detection` }); - } catch (error: any) { - res.status(500).json({ error: error.message }); - } -}); - -/** - * POST /api/dutchie-az/admin/dispensaries/:id/reset-failures - * Reset failure counter for a dispensary (without unflagging) - */ -router.post('/admin/dispensaries/:id/reset-failures', async (req: Request, res: Response) => { - try { - const { id } = req.params; - - await query(` - UPDATE dispensaries - SET consecutive_failures = 0, - last_failure_at = NULL, - last_failure_reason = NULL, - updated_at = NOW() - WHERE id = $1 - `, [id]); - - res.json({ success: 
true, message: `Failure counter reset for dispensary ${id}` }); - } catch (error: any) { - res.status(500).json({ error: error.message }); - } -}); - -/** - * GET /api/dutchie-az/admin/dispensaries/health-summary - * Get a summary of dispensary health status - */ -router.get('/admin/dispensaries/health-summary', async (_req: Request, res: Response) => { - try { - const { rows } = await query(` - SELECT - COUNT(*) as total, - COUNT(*) FILTER (WHERE state = 'AZ') as arizona_total, - COUNT(*) FILTER (WHERE failed_at IS NOT NULL) as failed, - COUNT(*) FILTER (WHERE consecutive_failures >= 1 AND failed_at IS NULL) as at_risk, - COUNT(*) FILTER (WHERE menu_type = 'dutchie' AND platform_dispensary_id IS NOT NULL AND failed_at IS NULL) as ready_to_crawl, - COUNT(*) FILTER (WHERE menu_type = 'dutchie' AND failed_at IS NULL) as dutchie_detected, - COUNT(*) FILTER (WHERE (menu_type IS NULL OR menu_type = 'unknown') AND failed_at IS NULL) as needs_detection, - COUNT(*) FILTER (WHERE menu_type NOT IN ('dutchie', 'unknown') AND menu_type IS NOT NULL AND failed_at IS NULL) as non_dutchie - FROM dispensaries - WHERE state = 'AZ' - `); - - res.json(rows[0] || {}); - } catch (error: any) { - res.status(500).json({ error: error.message }); - } -}); - -// ============================================================ -// ORCHESTRATOR TRACE ROUTES -// ============================================================ - -import { - getLatestTrace, - getTraceById, - getTracesForDispensary, - getTraceByRunId, -} from '../../services/orchestrator-trace'; - -/** - * GET /api/dutchie-az/admin/dispensaries/:id/crawl-trace/latest - * Get the latest orchestrator trace for a dispensary - */ -router.get('/admin/dispensaries/:id/crawl-trace/latest', async (req: Request, res: Response) => { - try { - const { id } = req.params; - const trace = await getLatestTrace(parseInt(id, 10)); - - if (!trace) { - return res.status(404).json({ error: 'No trace found for this dispensary' }); - } - - res.json(trace); - } 
catch (error: any) { - res.status(500).json({ error: error.message }); - } -}); - -/** - * GET /api/dutchie-az/admin/dispensaries/:id/crawl-traces - * Get paginated list of orchestrator traces for a dispensary - */ -router.get('/admin/dispensaries/:id/crawl-traces', async (req: Request, res: Response) => { - try { - const { id } = req.params; - const { limit = '20', offset = '0' } = req.query; - - const result = await getTracesForDispensary( - parseInt(id, 10), - parseInt(limit as string, 10), - parseInt(offset as string, 10) - ); - - res.json(result); - } catch (error: any) { - res.status(500).json({ error: error.message }); - } -}); - -/** - * GET /api/dutchie-az/admin/crawl-traces/:traceId - * Get a specific orchestrator trace by ID - */ -router.get('/admin/crawl-traces/:traceId', async (req: Request, res: Response) => { - try { - const { traceId } = req.params; - const trace = await getTraceById(parseInt(traceId, 10)); - - if (!trace) { - return res.status(404).json({ error: 'Trace not found' }); - } - - res.json(trace); - } catch (error: any) { - res.status(500).json({ error: error.message }); - } -}); - -/** - * GET /api/dutchie-az/admin/crawl-traces/run/:runId - * Get a specific orchestrator trace by run ID - */ -router.get('/admin/crawl-traces/run/:runId', async (req: Request, res: Response) => { - try { - const { runId } = req.params; - const trace = await getTraceByRunId(runId); - - if (!trace) { - return res.status(404).json({ error: 'Trace not found for this run ID' }); - } - - res.json(trace); - } catch (error: any) { - res.status(500).json({ error: error.message }); - } -}); - -// ============================================================ -// SCRAPER OVERVIEW DASHBOARD ENDPOINTS -// ============================================================ - -/** - * GET /api/dutchie-az/scraper/overview - * Comprehensive scraper overview for the new dashboard - * OPTIMIZED: Combined 6 queries into 4 using CTEs (was 6) - */ -router.get('/scraper/overview', async 
(_req: Request, res: Response) => { - try { - // Run all queries in parallel using Promise.all for better performance - const [kpiResult, workerResult, timeSeriesResult, visibilityResult] = await Promise.all([ - // Query 1: All KPI metrics in a single query using CTEs - query(` - WITH product_stats AS ( - SELECT - COUNT(*) AS total_products, - COUNT(*) FILTER (WHERE stock_status = 'in_stock') AS in_stock_products, - COUNT(*) FILTER (WHERE visibility_lost = true AND visibility_lost_at > NOW() - INTERVAL '24 hours') AS visibility_lost_24h, - COUNT(*) FILTER (WHERE visibility_restored_at > NOW() - INTERVAL '24 hours') AS visibility_restored_24h, - COUNT(*) FILTER (WHERE visibility_lost = true) AS total_visibility_lost - FROM dutchie_products - ), - dispensary_stats AS ( - SELECT - COUNT(*) FILTER (WHERE menu_type = 'dutchie' AND state = 'AZ') AS total_dispensaries, - COUNT(*) FILTER (WHERE menu_type = 'dutchie' AND state = 'AZ' AND platform_dispensary_id IS NOT NULL) AS crawlable_dispensaries - FROM dispensaries - ), - job_stats AS ( - SELECT - COUNT(*) FILTER (WHERE status IN ('error', 'partial') AND created_at > NOW() - INTERVAL '24 hours') AS errors_24h, - COUNT(*) FILTER (WHERE status = 'success' AND created_at > NOW() - INTERVAL '24 hours') AS successful_jobs_24h - FROM job_run_logs - ), - worker_stats AS ( - SELECT COUNT(*) AS active_workers FROM job_schedules WHERE enabled = true - ) - SELECT - ps.total_products, ps.in_stock_products, ps.visibility_lost_24h, ps.visibility_restored_24h, ps.total_visibility_lost, - ds.total_dispensaries, ds.crawlable_dispensaries, - js.errors_24h, js.successful_jobs_24h, - ws.active_workers - FROM product_stats ps, dispensary_stats ds, job_stats js, worker_stats ws - `), - - // Query 2: Active worker details - query(` - SELECT worker_name, worker_role, enabled, last_status, last_run_at, next_run_at - FROM job_schedules - WHERE enabled = true - ORDER BY next_run_at ASC NULLS LAST - `), - - // Query 3: Time-series data (activity + 
growth + recent runs) - query(` - WITH activity_by_hour AS ( - SELECT - date_trunc('hour', started_at) AS hour, - COUNT(*) FILTER (WHERE status = 'success') AS successful, - COUNT(*) FILTER (WHERE status IN ('error', 'partial')) AS failed, - COUNT(*) AS total - FROM job_run_logs - WHERE started_at > NOW() - INTERVAL '24 hours' - GROUP BY date_trunc('hour', started_at) - ), - product_growth AS ( - SELECT - date_trunc('day', created_at) AS day, - COUNT(*) AS new_products - FROM dutchie_products - WHERE created_at > NOW() - INTERVAL '7 days' - GROUP BY date_trunc('day', created_at) - ), - recent_runs AS ( - SELECT - jrl.id, - jrl.job_name, - jrl.status, - jrl.started_at, - jrl.completed_at, - jrl.items_processed, - jrl.items_succeeded, - jrl.items_failed, - jrl.metadata, - js.worker_name, - js.worker_role - FROM job_run_logs jrl - LEFT JOIN job_schedules js ON jrl.schedule_id = js.id - ORDER BY jrl.started_at DESC - LIMIT 20 - ) - SELECT - 'activity' AS query_type, - jsonb_agg(jsonb_build_object('hour', hour, 'successful', successful, 'failed', failed, 'total', total) ORDER BY hour) AS data - FROM activity_by_hour - UNION ALL - SELECT - 'growth' AS query_type, - jsonb_agg(jsonb_build_object('day', day, 'new_products', new_products) ORDER BY day) AS data - FROM product_growth - UNION ALL - SELECT - 'runs' AS query_type, - jsonb_agg(jsonb_build_object( - 'id', id, 'job_name', job_name, 'status', status, 'started_at', started_at, - 'completed_at', completed_at, 'items_processed', items_processed, - 'items_succeeded', items_succeeded, 'items_failed', items_failed, - 'metadata', metadata, 'worker_name', worker_name, 'worker_role', worker_role - ) ORDER BY started_at DESC) AS data - FROM recent_runs - `), - - // Query 4: Visibility changes by store - query(` - SELECT - d.id AS dispensary_id, - d.name AS dispensary_name, - d.state, - COUNT(dp.id) FILTER (WHERE dp.visibility_lost = true AND dp.visibility_lost_at > NOW() - INTERVAL '24 hours') AS lost_24h, - COUNT(dp.id) 
FILTER (WHERE dp.visibility_restored_at > NOW() - INTERVAL '24 hours') AS restored_24h, - MAX(dp.visibility_lost_at) AS latest_loss, - MAX(dp.visibility_restored_at) AS latest_restore - FROM dispensaries d - LEFT JOIN dutchie_products dp ON d.id = dp.dispensary_id - WHERE d.menu_type = 'dutchie' - GROUP BY d.id, d.name, d.state - HAVING COUNT(dp.id) FILTER (WHERE dp.visibility_lost = true AND dp.visibility_lost_at > NOW() - INTERVAL '24 hours') > 0 - OR COUNT(dp.id) FILTER (WHERE dp.visibility_restored_at > NOW() - INTERVAL '24 hours') > 0 - ORDER BY lost_24h DESC, restored_24h DESC - LIMIT 15 - `) - ]); - - // Parse results - const kpi = kpiResult.rows[0] || {}; - const workerRows = workerResult.rows; - const visibilityChanges = visibilityResult.rows; - - // Parse time-series aggregated results - const timeSeriesMap = Object.fromEntries( - timeSeriesResult.rows.map((r: any) => [r.query_type, r.data || []]) - ); - const activityRows = timeSeriesMap['activity'] || []; - const growthRows = timeSeriesMap['growth'] || []; - const recentRuns = timeSeriesMap['runs'] || []; - - res.json({ - kpi: { - totalProducts: parseInt(kpi.total_products || '0'), - inStockProducts: parseInt(kpi.in_stock_products || '0'), - totalDispensaries: parseInt(kpi.total_dispensaries || '0'), - crawlableDispensaries: parseInt(kpi.crawlable_dispensaries || '0'), - visibilityLost24h: parseInt(kpi.visibility_lost_24h || '0'), - visibilityRestored24h: parseInt(kpi.visibility_restored_24h || '0'), - totalVisibilityLost: parseInt(kpi.total_visibility_lost || '0'), - errors24h: parseInt(kpi.errors_24h || '0'), - successfulJobs24h: parseInt(kpi.successful_jobs_24h || '0'), - activeWorkers: parseInt(kpi.active_workers || '0'), - }, - workers: workerRows, - activityByHour: activityRows.map((row: any) => ({ - hour: row.hour, - successful: parseInt(row.successful || '0'), - failed: parseInt(row.failed || '0'), - total: parseInt(row.total || '0'), - })), - productGrowth: growthRows.map((row: any) => ({ - 
day: row.day, - newProducts: parseInt(row.new_products || '0'), - })), - recentRuns: recentRuns.map((row: any) => ({ - id: row.id, - jobName: row.job_name, - status: row.status, - startedAt: row.started_at, - completedAt: row.completed_at, - itemsProcessed: row.items_processed, - itemsSucceeded: row.items_succeeded, - itemsFailed: row.items_failed, - workerName: row.worker_name, - workerRole: row.worker_role, - visibilityLost: row.metadata?.visibilityLostCount || 0, - visibilityRestored: row.metadata?.visibilityRestoredCount || 0, - })), - visibilityChanges: visibilityChanges.map((row: any) => ({ - dispensaryId: row.dispensary_id, - dispensaryName: row.dispensary_name, - state: row.state, - lost24h: parseInt(row.lost_24h || '0'), - restored24h: parseInt(row.restored_24h || '0'), - latestLoss: row.latest_loss, - latestRestore: row.latest_restore, - })), - }); - } catch (error: any) { - console.error('Error fetching scraper overview:', error); - res.status(500).json({ error: error.message }); - } -}); - -export default router; diff --git a/backend/src/dutchie-az/scripts/stress-test.ts b/backend/src/dutchie-az/scripts/stress-test.ts deleted file mode 100644 index ad82b208..00000000 --- a/backend/src/dutchie-az/scripts/stress-test.ts +++ /dev/null @@ -1,486 +0,0 @@ -#!/usr/bin/env npx tsx -/** - * Crawler Reliability Stress Test - * - * Simulates various failure scenarios to test: - * - Retry logic with exponential backoff - * - Error taxonomy classification - * - Self-healing (proxy/UA rotation) - * - Status transitions (active -> degraded -> failed) - * - Minimum crawl gap enforcement - * - * Phase 1: Crawler Reliability & Stabilization - * - * Usage: - * DATABASE_URL="postgresql://..." 
npx tsx src/dutchie-az/scripts/stress-test.ts [test-name] - * - * Available tests: - * retry - Test retry manager with various error types - * backoff - Test exponential backoff calculation - * status - Test status transitions - * gap - Test minimum crawl gap enforcement - * rotation - Test proxy/UA rotation - * all - Run all tests - */ - -import { - CrawlErrorCode, - classifyError, - isRetryable, - shouldRotateProxy, - shouldRotateUserAgent, - getBackoffMultiplier, - getErrorMetadata, -} from '../services/error-taxonomy'; - -import { - RetryManager, - withRetry, - calculateNextCrawlDelay, - calculateNextCrawlAt, - determineCrawlStatus, - shouldAttemptRecovery, - sleep, -} from '../services/retry-manager'; - -import { - UserAgentRotator, - USER_AGENTS, -} from '../services/proxy-rotator'; - -import { - validateStoreConfig, - isCrawlable, - DEFAULT_CONFIG, - RawStoreConfig, -} from '../services/store-validator'; - -// ============================================================ -// TEST UTILITIES -// ============================================================ - -let testsPassed = 0; -let testsFailed = 0; - -function assert(condition: boolean, message: string): void { - if (condition) { - console.log(` ✓ ${message}`); - testsPassed++; - } else { - console.log(` ✗ ${message}`); - testsFailed++; - } -} - -function section(name: string): void { - console.log(`\n${'='.repeat(60)}`); - console.log(`TEST: ${name}`); - console.log('='.repeat(60)); -} - -// ============================================================ -// TEST: Error Classification -// ============================================================ - -function testErrorClassification(): void { - section('Error Classification'); - - // HTTP status codes - assert(classifyError(null, 429) === CrawlErrorCode.RATE_LIMITED, '429 -> RATE_LIMITED'); - assert(classifyError(null, 407) === CrawlErrorCode.BLOCKED_PROXY, '407 -> BLOCKED_PROXY'); - assert(classifyError(null, 401) === CrawlErrorCode.AUTH_FAILED, '401 -> 
AUTH_FAILED'); - assert(classifyError(null, 403) === CrawlErrorCode.AUTH_FAILED, '403 -> AUTH_FAILED'); - assert(classifyError(null, 503) === CrawlErrorCode.SERVICE_UNAVAILABLE, '503 -> SERVICE_UNAVAILABLE'); - assert(classifyError(null, 500) === CrawlErrorCode.SERVER_ERROR, '500 -> SERVER_ERROR'); - - // Error messages - assert(classifyError('rate limit exceeded') === CrawlErrorCode.RATE_LIMITED, 'rate limit message -> RATE_LIMITED'); - assert(classifyError('request timed out') === CrawlErrorCode.TIMEOUT, 'timeout message -> TIMEOUT'); - assert(classifyError('proxy blocked') === CrawlErrorCode.BLOCKED_PROXY, 'proxy blocked -> BLOCKED_PROXY'); - assert(classifyError('ECONNREFUSED') === CrawlErrorCode.NETWORK_ERROR, 'ECONNREFUSED -> NETWORK_ERROR'); - assert(classifyError('ENOTFOUND') === CrawlErrorCode.DNS_ERROR, 'ENOTFOUND -> DNS_ERROR'); - assert(classifyError('selector not found') === CrawlErrorCode.HTML_CHANGED, 'selector error -> HTML_CHANGED'); - assert(classifyError('JSON parse error') === CrawlErrorCode.PARSE_ERROR, 'parse error -> PARSE_ERROR'); - assert(classifyError('0 products found') === CrawlErrorCode.NO_PRODUCTS, 'no products -> NO_PRODUCTS'); - - // Retryability - assert(isRetryable(CrawlErrorCode.RATE_LIMITED) === true, 'RATE_LIMITED is retryable'); - assert(isRetryable(CrawlErrorCode.TIMEOUT) === true, 'TIMEOUT is retryable'); - assert(isRetryable(CrawlErrorCode.HTML_CHANGED) === false, 'HTML_CHANGED is NOT retryable'); - assert(isRetryable(CrawlErrorCode.INVALID_CONFIG) === false, 'INVALID_CONFIG is NOT retryable'); - - // Rotation decisions - assert(shouldRotateProxy(CrawlErrorCode.BLOCKED_PROXY) === true, 'BLOCKED_PROXY -> rotate proxy'); - assert(shouldRotateProxy(CrawlErrorCode.RATE_LIMITED) === true, 'RATE_LIMITED -> rotate proxy'); - assert(shouldRotateUserAgent(CrawlErrorCode.AUTH_FAILED) === true, 'AUTH_FAILED -> rotate UA'); -} - -// ============================================================ -// TEST: Retry Manager -// 
============================================================ - -function testRetryManager(): void { - section('Retry Manager'); - - const manager = new RetryManager({ maxRetries: 3, baseBackoffMs: 100 }); - - // Initial state - assert(manager.shouldAttempt() === true, 'Should attempt initially'); - assert(manager.getAttemptNumber() === 1, 'Attempt number starts at 1'); - - // First attempt - manager.recordAttempt(); - assert(manager.getAttemptNumber() === 2, 'Attempt number increments'); - - // Evaluate retryable error - const decision1 = manager.evaluateError(new Error('rate limit exceeded'), 429); - assert(decision1.shouldRetry === true, 'Should retry on rate limit'); - assert(decision1.errorCode === CrawlErrorCode.RATE_LIMITED, 'Error code is RATE_LIMITED'); - assert(decision1.rotateProxy === true, 'Should rotate proxy'); - assert(decision1.backoffMs > 0, 'Backoff is positive'); - - // More attempts - manager.recordAttempt(); - manager.recordAttempt(); - - // Now at max retries - const decision2 = manager.evaluateError(new Error('timeout'), 504); - assert(decision2.shouldRetry === true, 'Should still retry (at limit but not exceeded)'); - - manager.recordAttempt(); - const decision3 = manager.evaluateError(new Error('timeout')); - assert(decision3.shouldRetry === false, 'Should NOT retry after max'); - assert(decision3.reason.includes('exhausted'), 'Reason mentions exhausted'); - - // Reset - manager.reset(); - assert(manager.shouldAttempt() === true, 'Should attempt after reset'); - assert(manager.getAttemptNumber() === 1, 'Attempt number resets'); - - // Non-retryable error - const manager2 = new RetryManager({ maxRetries: 3 }); - manager2.recordAttempt(); - const nonRetryable = manager2.evaluateError(new Error('HTML structure changed')); - assert(nonRetryable.shouldRetry === false, 'Non-retryable error stops immediately'); - assert(nonRetryable.errorCode === CrawlErrorCode.HTML_CHANGED, 'Error code is HTML_CHANGED'); -} - -// 
============================================================ -// TEST: Exponential Backoff -// ============================================================ - -function testExponentialBackoff(): void { - section('Exponential Backoff'); - - // Calculate next crawl delay - const delay0 = calculateNextCrawlDelay(0, 240); // No failures - const delay1 = calculateNextCrawlDelay(1, 240); // 1 failure - const delay2 = calculateNextCrawlDelay(2, 240); // 2 failures - const delay3 = calculateNextCrawlDelay(3, 240); // 3 failures - const delay5 = calculateNextCrawlDelay(5, 240); // 5 failures (should cap) - - console.log(` Delay with 0 failures: ${delay0} minutes`); - console.log(` Delay with 1 failure: ${delay1} minutes`); - console.log(` Delay with 2 failures: ${delay2} minutes`); - console.log(` Delay with 3 failures: ${delay3} minutes`); - console.log(` Delay with 5 failures: ${delay5} minutes`); - - assert(delay1 > delay0, 'Delay increases with failures'); - assert(delay2 > delay1, 'Delay keeps increasing'); - assert(delay3 > delay2, 'More delay with more failures'); - // With jitter, exact values vary but ratio should be close to 2x - assert(delay5 <= 240 * 4 * 1.2, 'Delay is capped at max multiplier'); - - // Next crawl time calculation - const now = new Date(); - const nextAt = calculateNextCrawlAt(2, 240); - assert(nextAt > now, 'Next crawl is in future'); - assert(nextAt.getTime() - now.getTime() > 240 * 60 * 1000, 'Includes backoff'); -} - -// ============================================================ -// TEST: Status Transitions -// ============================================================ - -function testStatusTransitions(): void { - section('Status Transitions'); - - // Active status - assert(determineCrawlStatus(0) === 'active', '0 failures -> active'); - assert(determineCrawlStatus(1) === 'active', '1 failure -> active'); - assert(determineCrawlStatus(2) === 'active', '2 failures -> active'); - - // Degraded status - assert(determineCrawlStatus(3) === 
'degraded', '3 failures -> degraded'); - assert(determineCrawlStatus(5) === 'degraded', '5 failures -> degraded'); - assert(determineCrawlStatus(9) === 'degraded', '9 failures -> degraded'); - - // Failed status - assert(determineCrawlStatus(10) === 'failed', '10 failures -> failed'); - assert(determineCrawlStatus(15) === 'failed', '15 failures -> failed'); - - // Custom thresholds - const customStatus = determineCrawlStatus(5, { degraded: 5, failed: 8 }); - assert(customStatus === 'degraded', 'Custom threshold: 5 -> degraded'); - - // Recovery check - const recentFailure = new Date(Date.now() - 1 * 60 * 60 * 1000); // 1 hour ago - const oldFailure = new Date(Date.now() - 48 * 60 * 60 * 1000); // 48 hours ago - - assert(shouldAttemptRecovery(recentFailure, 1) === false, 'No recovery for recent failure'); - assert(shouldAttemptRecovery(oldFailure, 1) === true, 'Recovery allowed for old failure'); - assert(shouldAttemptRecovery(null, 0) === true, 'Recovery allowed if no previous failure'); -} - -// ============================================================ -// TEST: Store Validation -// ============================================================ - -function testStoreValidation(): void { - section('Store Validation'); - - // Valid config - const validConfig: RawStoreConfig = { - id: 1, - name: 'Test Store', - platformDispensaryId: '123abc', - menuType: 'dutchie', - }; - const validResult = validateStoreConfig(validConfig); - assert(validResult.isValid === true, 'Valid config passes'); - assert(validResult.config !== null, 'Valid config returns config'); - assert(validResult.config?.slug === 'test-store', 'Slug is generated'); - - // Missing required fields - const missingId: RawStoreConfig = { - id: 0, - name: 'Test', - platformDispensaryId: '123', - menuType: 'dutchie', - }; - const missingIdResult = validateStoreConfig(missingId); - assert(missingIdResult.isValid === false, 'Missing ID fails'); - - // Missing platform ID - const missingPlatform: RawStoreConfig = 
{ - id: 1, - name: 'Test', - menuType: 'dutchie', - }; - const missingPlatformResult = validateStoreConfig(missingPlatform); - assert(missingPlatformResult.isValid === false, 'Missing platform ID fails'); - - // Unknown menu type - const unknownMenu: RawStoreConfig = { - id: 1, - name: 'Test', - platformDispensaryId: '123', - menuType: 'unknown', - }; - const unknownMenuResult = validateStoreConfig(unknownMenu); - assert(unknownMenuResult.isValid === false, 'Unknown menu type fails'); - - // Crawlable check - assert(isCrawlable(validConfig) === true, 'Valid config is crawlable'); - assert(isCrawlable(missingPlatform) === false, 'Missing platform not crawlable'); - assert(isCrawlable({ ...validConfig, crawlStatus: 'failed' }) === false, 'Failed status not crawlable'); - assert(isCrawlable({ ...validConfig, crawlStatus: 'paused' }) === false, 'Paused status not crawlable'); -} - -// ============================================================ -// TEST: User Agent Rotation -// ============================================================ - -function testUserAgentRotation(): void { - section('User Agent Rotation'); - - const rotator = new UserAgentRotator(); - - const first = rotator.getCurrent(); - const second = rotator.getNext(); - const third = rotator.getNext(); - - assert(first !== second, 'User agents rotate'); - assert(second !== third, 'User agents keep rotating'); - assert(USER_AGENTS.includes(first), 'Returns valid UA'); - assert(USER_AGENTS.includes(second), 'Returns valid UA'); - - // Random UA - const random = rotator.getRandom(); - assert(USER_AGENTS.includes(random), 'Random returns valid UA'); - - // Count - assert(rotator.getCount() === USER_AGENTS.length, 'Reports correct count'); -} - -// ============================================================ -// TEST: WithRetry Helper -// ============================================================ - -async function testWithRetryHelper(): Promise<void> { - section('WithRetry Helper'); - - // Successful on first try 
- let attempts = 0; - const successResult = await withRetry(async () => { - attempts++; - return 'success'; - }, { maxRetries: 3 }); - assert(attempts === 1, 'Succeeds on first try'); - assert(successResult.result === 'success', 'Returns result'); - - // Fails then succeeds - let failThenSucceedAttempts = 0; - const failThenSuccessResult = await withRetry(async () => { - failThenSucceedAttempts++; - if (failThenSucceedAttempts < 3) { - throw new Error('temporary error'); - } - return 'finally succeeded'; - }, { maxRetries: 5, baseBackoffMs: 10 }); - assert(failThenSucceedAttempts === 3, 'Retries until success'); - assert(failThenSuccessResult.result === 'finally succeeded', 'Returns final result'); - assert(failThenSuccessResult.summary.attemptsMade === 3, 'Summary tracks attempts'); - - // Exhausts retries - let alwaysFailAttempts = 0; - try { - await withRetry(async () => { - alwaysFailAttempts++; - throw new Error('always fails'); - }, { maxRetries: 2, baseBackoffMs: 10 }); - assert(false, 'Should have thrown'); - } catch (error: any) { - assert(alwaysFailAttempts === 3, 'Attempts all retries'); // 1 initial + 2 retries - assert(error.name === 'RetryExhaustedError', 'Throws RetryExhaustedError'); - } - - // Non-retryable error stops immediately - let nonRetryableAttempts = 0; - try { - await withRetry(async () => { - nonRetryableAttempts++; - const err = new Error('HTML structure changed - selector not found'); - throw err; - }, { maxRetries: 3, baseBackoffMs: 10 }); - assert(false, 'Should have thrown'); - } catch { - assert(nonRetryableAttempts === 1, 'Non-retryable stops immediately'); - } -} - -// ============================================================ -// TEST: Minimum Crawl Gap -// ============================================================ - -function testMinimumCrawlGap(): void { - section('Minimum Crawl Gap'); - - // Default config - assert(DEFAULT_CONFIG.minCrawlGapMinutes === 2, 'Default gap is 2 minutes'); - 
assert(DEFAULT_CONFIG.crawlFrequencyMinutes === 240, 'Default frequency is 4 hours'); - - // Gap calculation - const gapMs = DEFAULT_CONFIG.minCrawlGapMinutes * 60 * 1000; - assert(gapMs === 120000, 'Gap is 2 minutes in ms'); - - console.log(' Note: Gap enforcement is tested at DB level (trigger) and application level'); -} - -// ============================================================ -// TEST: Error Metadata -// ============================================================ - -function testErrorMetadata(): void { - section('Error Metadata'); - - // RATE_LIMITED - const rateLimited = getErrorMetadata(CrawlErrorCode.RATE_LIMITED); - assert(rateLimited.retryable === true, 'RATE_LIMITED is retryable'); - assert(rateLimited.rotateProxy === true, 'RATE_LIMITED rotates proxy'); - assert(rateLimited.backoffMultiplier === 2.0, 'RATE_LIMITED has 2x backoff'); - assert(rateLimited.severity === 'medium', 'RATE_LIMITED is medium severity'); - - // HTML_CHANGED - const htmlChanged = getErrorMetadata(CrawlErrorCode.HTML_CHANGED); - assert(htmlChanged.retryable === false, 'HTML_CHANGED is NOT retryable'); - assert(htmlChanged.severity === 'high', 'HTML_CHANGED is high severity'); - - // INVALID_CONFIG - const invalidConfig = getErrorMetadata(CrawlErrorCode.INVALID_CONFIG); - assert(invalidConfig.retryable === false, 'INVALID_CONFIG is NOT retryable'); - assert(invalidConfig.severity === 'critical', 'INVALID_CONFIG is critical'); -} - -// ============================================================ -// MAIN -// ============================================================ - -async function runTests(testName?: string): Promise<void> { - console.log('\n'); - console.log('╔══════════════════════════════════════════════════════════╗'); - console.log('║ CRAWLER RELIABILITY STRESS TEST - PHASE 1 ║'); - 
console.log('╚══════════════════════════════════════════════════════════╝'); - - const allTests = !testName || testName === 'all'; - - if (allTests || testName === 'error' || testName === 'classification') { - testErrorClassification(); - } - - if (allTests || testName === 'retry') { - testRetryManager(); - } - - if (allTests || testName === 'backoff') { - testExponentialBackoff(); - } - - if (allTests || testName === 'status') { - testStatusTransitions(); - } - - if (allTests || testName === 'validation' || testName === 'store') { - testStoreValidation(); - } - - if (allTests || testName === 'rotation' || testName === 'ua') { - testUserAgentRotation(); - } - - if (allTests || testName === 'withRetry' || testName === 'helper') { - await testWithRetryHelper(); - } - - if (allTests || testName === 'gap') { - testMinimumCrawlGap(); - } - - if (allTests || testName === 'metadata') { - testErrorMetadata(); - } - - // Summary - console.log('\n'); - console.log('═'.repeat(60)); - console.log('SUMMARY'); - console.log('═'.repeat(60)); - console.log(` Passed: ${testsPassed}`); - console.log(` Failed: ${testsFailed}`); - console.log(` Total: ${testsPassed + testsFailed}`); - - if (testsFailed > 0) { - console.log('\n❌ SOME TESTS FAILED\n'); - process.exit(1); - } else { - console.log('\n✅ ALL TESTS PASSED\n'); - process.exit(0); - } -} - -// Run tests -const testName = process.argv[2]; -runTests(testName).catch((error) => { - console.error('Fatal error:', error); - process.exit(1); -}); diff --git a/backend/src/dutchie-az/services/analytics/brand-opportunity.ts b/backend/src/dutchie-az/services/analytics/brand-opportunity.ts deleted file mode 100644 index b23817e9..00000000 --- a/backend/src/dutchie-az/services/analytics/brand-opportunity.ts +++ /dev/null @@ -1,659 +0,0 @@ -/** - * Brand Opportunity / Risk Analytics Service - * - * Provides brand-level 
opportunity and risk analysis including: - * - Under/overpriced vs market - * - Missing SKU opportunities - * - Stores with declining/growing shelf share - * - Competitor intrusion alerts - * - * Phase 3: Analytics Dashboards - */ - -import { Pool } from 'pg'; -import { AnalyticsCache, cacheKey } from './cache'; - -export interface BrandOpportunity { - brandName: string; - underpricedVsMarket: PricePosition[]; - overpricedVsMarket: PricePosition[]; - missingSkuOpportunities: MissingSkuOpportunity[]; - storesWithDecliningShelfShare: StoreShelfShareChange[]; - storesWithGrowingShelfShare: StoreShelfShareChange[]; - competitorIntrusionAlerts: CompetitorAlert[]; - overallScore: number; // 0-100, higher = more opportunity - riskScore: number; // 0-100, higher = more risk -} - -export interface PricePosition { - category: string; - brandAvgPrice: number; - marketAvgPrice: number; - priceDifferencePercent: number; - skuCount: number; - suggestion: string; -} - -export interface MissingSkuOpportunity { - category: string; - subcategory: string | null; - marketSkuCount: number; - brandSkuCount: number; - gapPercent: number; - topCompetitors: string[]; - opportunityScore: number; // 0-100 -} - -export interface StoreShelfShareChange { - storeId: number; - storeName: string; - city: string; - state: string; - currentShelfShare: number; - previousShelfShare: number; - changePercent: number; - currentSkus: number; - competitors: string[]; -} - -export interface CompetitorAlert { - competitorBrand: string; - storeId: number; - storeName: string; - alertType: 'new_entry' | 'expanding' | 'price_undercut'; - details: string; - severity: 'low' | 'medium' | 'high'; - date: string; -} - -export interface MarketPositionSummary { - brandName: string; - marketSharePercent: number; - avgPriceVsMarket: number; // -X% to +X% - categoryStrengths: Array<{ category: string; shelfSharePercent: number }>; - categoryWeaknesses: Array<{ category: string; shelfSharePercent: number; marketLeader: 
string }>; - growthTrend: 'growing' | 'stable' | 'declining'; - competitorThreats: string[]; -} - -export class BrandOpportunityService { - private pool: Pool; - private cache: AnalyticsCache; - - constructor(pool: Pool, cache: AnalyticsCache) { - this.pool = pool; - this.cache = cache; - } - - /** - * Get full opportunity analysis for a brand - */ - async getBrandOpportunity(brandName: string): Promise<BrandOpportunity> { - const key = cacheKey('brand_opportunity', { brandName }); - - return (await this.cache.getOrCompute(key, async () => { - const [ - underpriced, - overpriced, - missingSkus, - decliningStores, - growingStores, - alerts, - ] = await Promise.all([ - this.getUnderpricedPositions(brandName), - this.getOverpricedPositions(brandName), - this.getMissingSkuOpportunities(brandName), - this.getStoresWithDecliningShare(brandName), - this.getStoresWithGrowingShare(brandName), - this.getCompetitorAlerts(brandName), - ]); - - // Calculate opportunity score (higher = more opportunity) - const opportunityFactors = [ - missingSkus.length > 0 ? 20 : 0, - underpriced.length > 0 ? 15 : 0, - growingStores.length > 5 ? 20 : growingStores.length * 3, - missingSkus.reduce((sum, m) => sum + m.opportunityScore, 0) / Math.max(1, missingSkus.length) * 0.3, - ]; - const opportunityScore = Math.min(100, opportunityFactors.reduce((a, b) => a + b, 0)); - - // Calculate risk score (higher = more risk) - const riskFactors = [ - decliningStores.length > 5 ? 30 : decliningStores.length * 5, - alerts.filter(a => a.severity === 'high').length * 15, - alerts.filter(a => a.severity === 'medium').length * 8, - overpriced.length > 3 ? 
15 : overpriced.length * 3, - ]; - const riskScore = Math.min(100, riskFactors.reduce((a, b) => a + b, 0)); - - return { - brandName, - underpricedVsMarket: underpriced, - overpricedVsMarket: overpriced, - missingSkuOpportunities: missingSkus, - storesWithDecliningShelfShare: decliningStores, - storesWithGrowingShelfShare: growingStores, - competitorIntrusionAlerts: alerts, - overallScore: Math.round(opportunityScore), - riskScore: Math.round(riskScore), - }; - }, 30)).data; - } - - /** - * Get categories where brand is underpriced vs market - */ - async getUnderpricedPositions(brandName: string): Promise { - const result = await this.pool.query(` - WITH brand_prices AS ( - SELECT - type as category, - AVG(extract_min_price(latest_raw_payload)) as brand_avg, - COUNT(*) as sku_count - FROM dutchie_products - WHERE brand_name = $1 AND type IS NOT NULL - GROUP BY type - HAVING COUNT(*) >= 3 - ), - market_prices AS ( - SELECT - type as category, - AVG(extract_min_price(latest_raw_payload)) as market_avg - FROM dutchie_products - WHERE type IS NOT NULL AND brand_name != $1 - GROUP BY type - ) - SELECT - bp.category, - bp.brand_avg, - mp.market_avg, - bp.sku_count, - ((bp.brand_avg - mp.market_avg) / NULLIF(mp.market_avg, 0)) * 100 as diff_pct - FROM brand_prices bp - JOIN market_prices mp ON bp.category = mp.category - WHERE bp.brand_avg < mp.market_avg * 0.9 -- 10% or more below market - AND bp.brand_avg IS NOT NULL - AND mp.market_avg IS NOT NULL - ORDER BY diff_pct - `, [brandName]); - - return result.rows.map(row => ({ - category: row.category, - brandAvgPrice: Math.round(parseFloat(row.brand_avg) * 100) / 100, - marketAvgPrice: Math.round(parseFloat(row.market_avg) * 100) / 100, - priceDifferencePercent: Math.round(parseFloat(row.diff_pct) * 10) / 10, - skuCount: parseInt(row.sku_count) || 0, - suggestion: `Consider price increase - ${Math.abs(Math.round(parseFloat(row.diff_pct)))}% below market average`, - })); - } - - /** - * Get categories where brand is 
overpriced vs market - */ - async getOverpricedPositions(brandName: string): Promise { - const result = await this.pool.query(` - WITH brand_prices AS ( - SELECT - type as category, - AVG(extract_min_price(latest_raw_payload)) as brand_avg, - COUNT(*) as sku_count - FROM dutchie_products - WHERE brand_name = $1 AND type IS NOT NULL - GROUP BY type - HAVING COUNT(*) >= 3 - ), - market_prices AS ( - SELECT - type as category, - AVG(extract_min_price(latest_raw_payload)) as market_avg - FROM dutchie_products - WHERE type IS NOT NULL AND brand_name != $1 - GROUP BY type - ) - SELECT - bp.category, - bp.brand_avg, - mp.market_avg, - bp.sku_count, - ((bp.brand_avg - mp.market_avg) / NULLIF(mp.market_avg, 0)) * 100 as diff_pct - FROM brand_prices bp - JOIN market_prices mp ON bp.category = mp.category - WHERE bp.brand_avg > mp.market_avg * 1.15 -- 15% or more above market - AND bp.brand_avg IS NOT NULL - AND mp.market_avg IS NOT NULL - ORDER BY diff_pct DESC - `, [brandName]); - - return result.rows.map(row => ({ - category: row.category, - brandAvgPrice: Math.round(parseFloat(row.brand_avg) * 100) / 100, - marketAvgPrice: Math.round(parseFloat(row.market_avg) * 100) / 100, - priceDifferencePercent: Math.round(parseFloat(row.diff_pct) * 10) / 10, - skuCount: parseInt(row.sku_count) || 0, - suggestion: `Price sensitivity risk - ${Math.round(parseFloat(row.diff_pct))}% above market average`, - })); - } - - /** - * Get missing SKU opportunities (category gaps) - */ - async getMissingSkuOpportunities(brandName: string): Promise { - const result = await this.pool.query(` - WITH market_categories AS ( - SELECT - type as category, - subcategory, - COUNT(*) as market_skus, - ARRAY_AGG(DISTINCT brand_name ORDER BY brand_name) FILTER (WHERE brand_name IS NOT NULL) as top_brands - FROM dutchie_products - WHERE type IS NOT NULL - GROUP BY type, subcategory - HAVING COUNT(*) >= 20 - ), - brand_presence AS ( - SELECT - type as category, - subcategory, - COUNT(*) as brand_skus - FROM 
dutchie_products - WHERE brand_name = $1 AND type IS NOT NULL - GROUP BY type, subcategory - ) - SELECT - mc.category, - mc.subcategory, - mc.market_skus, - COALESCE(bp.brand_skus, 0) as brand_skus, - mc.top_brands[1:5] as competitors - FROM market_categories mc - LEFT JOIN brand_presence bp ON mc.category = bp.category - AND (mc.subcategory = bp.subcategory OR (mc.subcategory IS NULL AND bp.subcategory IS NULL)) - WHERE COALESCE(bp.brand_skus, 0) < mc.market_skus * 0.05 -- Brand has <5% of market presence - ORDER BY mc.market_skus DESC - LIMIT 10 - `, [brandName]); - - return result.rows.map(row => { - const marketSkus = parseInt(row.market_skus) || 0; - const brandSkus = parseInt(row.brand_skus) || 0; - const gapPercent = marketSkus > 0 ? ((marketSkus - brandSkus) / marketSkus) * 100 : 100; - const opportunityScore = Math.min(100, Math.round((marketSkus / 100) * (gapPercent / 100) * 100)); - - return { - category: row.category, - subcategory: row.subcategory, - marketSkuCount: marketSkus, - brandSkuCount: brandSkus, - gapPercent: Math.round(gapPercent), - topCompetitors: (row.competitors || []).filter((c: string) => c !== brandName).slice(0, 5), - opportunityScore, - }; - }); - } - - /** - * Get stores where brand's shelf share is declining - */ - async getStoresWithDecliningShare(brandName: string): Promise { - // Use brand_snapshots for historical comparison - const result = await this.pool.query(` - WITH current_share AS ( - SELECT - dp.dispensary_id as store_id, - d.name as store_name, - d.city, - d.state, - COUNT(*) FILTER (WHERE dp.brand_name = $1) as brand_skus, - COUNT(*) as total_skus, - ARRAY_AGG(DISTINCT dp.brand_name) FILTER (WHERE dp.brand_name != $1 AND dp.brand_name IS NOT NULL) as competitors - FROM dutchie_products dp - JOIN dispensaries d ON dp.dispensary_id = d.id - GROUP BY dp.dispensary_id, d.name, d.city, d.state - HAVING COUNT(*) FILTER (WHERE dp.brand_name = $1) > 0 - ) - SELECT - cs.store_id, - cs.store_name, - cs.city, - cs.state, - 
cs.brand_skus as current_skus, - cs.total_skus, - ROUND((cs.brand_skus::NUMERIC / cs.total_skus) * 100, 2) as current_share, - cs.competitors[1:5] as top_competitors - FROM current_share cs - WHERE cs.brand_skus < 10 -- Low presence - ORDER BY cs.brand_skus - LIMIT 10 - `, [brandName]); - - return result.rows.map(row => ({ - storeId: row.store_id, - storeName: row.store_name, - city: row.city, - state: row.state, - currentShelfShare: parseFloat(row.current_share) || 0, - previousShelfShare: parseFloat(row.current_share) || 0, // Would need historical data - changePercent: 0, - currentSkus: parseInt(row.current_skus) || 0, - competitors: row.top_competitors || [], - })); - } - - /** - * Get stores where brand's shelf share is growing - */ - async getStoresWithGrowingShare(brandName: string): Promise { - const result = await this.pool.query(` - WITH store_share AS ( - SELECT - dp.dispensary_id as store_id, - d.name as store_name, - d.city, - d.state, - COUNT(*) FILTER (WHERE dp.brand_name = $1) as brand_skus, - COUNT(*) as total_skus, - ARRAY_AGG(DISTINCT dp.brand_name) FILTER (WHERE dp.brand_name != $1 AND dp.brand_name IS NOT NULL) as competitors - FROM dutchie_products dp - JOIN dispensaries d ON dp.dispensary_id = d.id - GROUP BY dp.dispensary_id, d.name, d.city, d.state - HAVING COUNT(*) FILTER (WHERE dp.brand_name = $1) > 0 - ) - SELECT - ss.store_id, - ss.store_name, - ss.city, - ss.state, - ss.brand_skus as current_skus, - ss.total_skus, - ROUND((ss.brand_skus::NUMERIC / ss.total_skus) * 100, 2) as current_share, - ss.competitors[1:5] as top_competitors - FROM store_share ss - ORDER BY current_share DESC - LIMIT 10 - `, [brandName]); - - return result.rows.map(row => ({ - storeId: row.store_id, - storeName: row.store_name, - city: row.city, - state: row.state, - currentShelfShare: parseFloat(row.current_share) || 0, - previousShelfShare: parseFloat(row.current_share) || 0, - changePercent: 0, - currentSkus: parseInt(row.current_skus) || 0, - competitors: 
row.top_competitors || [], - })); - } - - /** - * Get competitor intrusion alerts - */ - async getCompetitorAlerts(brandName: string): Promise { - // Check for competitor entries in stores where this brand has presence - const result = await this.pool.query(` - WITH brand_stores AS ( - SELECT DISTINCT dispensary_id - FROM dutchie_products - WHERE brand_name = $1 - ), - competitor_presence AS ( - SELECT - dp.brand_name as competitor, - dp.dispensary_id as store_id, - d.name as store_name, - COUNT(*) as sku_count, - MAX(dp.created_at) as latest_add - FROM dutchie_products dp - JOIN dispensaries d ON dp.dispensary_id = d.id - WHERE dp.dispensary_id IN (SELECT dispensary_id FROM brand_stores) - AND dp.brand_name != $1 - AND dp.brand_name IS NOT NULL - AND dp.created_at >= NOW() - INTERVAL '30 days' - GROUP BY dp.brand_name, dp.dispensary_id, d.name - HAVING COUNT(*) >= 5 - ) - SELECT - competitor, - store_id, - store_name, - sku_count, - latest_add - FROM competitor_presence - ORDER BY sku_count DESC - LIMIT 10 - `, [brandName]); - - return result.rows.map(row => { - const skuCount = parseInt(row.sku_count) || 0; - let severity: 'low' | 'medium' | 'high' = 'low'; - if (skuCount >= 20) severity = 'high'; - else if (skuCount >= 10) severity = 'medium'; - - return { - competitorBrand: row.competitor, - storeId: row.store_id, - storeName: row.store_name, - alertType: 'expanding' as const, - details: `${row.competitor} has ${skuCount} SKUs in ${row.store_name}`, - severity, - date: new Date(row.latest_add).toISOString().split('T')[0], - }; - }); - } - - /** - * Get market position summary for a brand - */ - async getMarketPositionSummary(brandName: string): Promise { - const key = cacheKey('market_position', { brandName }); - - return (await this.cache.getOrCompute(key, async () => { - const [shareResult, priceResult, categoryResult, threatResult] = await Promise.all([ - // Market share - this.pool.query(` - SELECT - (SELECT COUNT(*) FROM dutchie_products WHERE brand_name = 
$1) as brand_count, - (SELECT COUNT(*) FROM dutchie_products) as total_count - `, [brandName]), - - // Price vs market - this.pool.query(` - SELECT - (SELECT AVG(extract_min_price(latest_raw_payload)) FROM dutchie_products WHERE brand_name = $1) as brand_avg, - (SELECT AVG(extract_min_price(latest_raw_payload)) FROM dutchie_products WHERE brand_name != $1) as market_avg - `, [brandName]), - - // Category strengths/weaknesses - this.pool.query(` - WITH brand_by_cat AS ( - SELECT type as category, COUNT(*) as brand_count - FROM dutchie_products - WHERE brand_name = $1 AND type IS NOT NULL - GROUP BY type - ), - market_by_cat AS ( - SELECT type as category, COUNT(*) as total_count - FROM dutchie_products WHERE type IS NOT NULL - GROUP BY type - ), - leaders AS ( - SELECT type as category, brand_name, COUNT(*) as cnt, - RANK() OVER (PARTITION BY type ORDER BY COUNT(*) DESC) as rnk - FROM dutchie_products WHERE type IS NOT NULL AND brand_name IS NOT NULL - GROUP BY type, brand_name - ) - SELECT - mc.category, - COALESCE(bc.brand_count, 0) as brand_count, - mc.total_count, - ROUND((COALESCE(bc.brand_count, 0)::NUMERIC / mc.total_count) * 100, 2) as share_pct, - (SELECT brand_name FROM leaders WHERE category = mc.category AND rnk = 1) as leader - FROM market_by_cat mc - LEFT JOIN brand_by_cat bc ON mc.category = bc.category - ORDER BY share_pct DESC - `, [brandName]), - - // Top competitors - this.pool.query(` - SELECT brand_name, COUNT(*) as cnt - FROM dutchie_products - WHERE brand_name IS NOT NULL AND brand_name != $1 - GROUP BY brand_name - ORDER BY cnt DESC - LIMIT 5 - `, [brandName]), - ]); - - const brandCount = parseInt(shareResult.rows[0]?.brand_count) || 0; - const totalCount = parseInt(shareResult.rows[0]?.total_count) || 1; - const marketSharePercent = Math.round((brandCount / totalCount) * 1000) / 10; - - const brandAvg = parseFloat(priceResult.rows[0]?.brand_avg) || 0; - const marketAvg = parseFloat(priceResult.rows[0]?.market_avg) || 1; - const 
avgPriceVsMarket = Math.round(((brandAvg - marketAvg) / marketAvg) * 1000) / 10; - - const categories = categoryResult.rows; - const strengths = categories - .filter(c => parseFloat(c.share_pct) > 5) - .map(c => ({ category: c.category, shelfSharePercent: parseFloat(c.share_pct) })); - - const weaknesses = categories - .filter(c => parseFloat(c.share_pct) < 2 && c.leader !== brandName) - .map(c => ({ - category: c.category, - shelfSharePercent: parseFloat(c.share_pct), - marketLeader: c.leader || 'Unknown', - })); - - return { - brandName, - marketSharePercent, - avgPriceVsMarket, - categoryStrengths: strengths.slice(0, 5), - categoryWeaknesses: weaknesses.slice(0, 5), - growthTrend: 'stable' as const, // Would need historical data - competitorThreats: threatResult.rows.map(r => r.brand_name), - }; - }, 30)).data; - } - - /** - * Create an analytics alert - */ - async createAlert(alert: { - alertType: string; - severity: 'info' | 'warning' | 'critical'; - title: string; - description?: string; - storeId?: number; - brandName?: string; - productId?: number; - category?: string; - metadata?: Record; - }): Promise { - await this.pool.query(` - INSERT INTO analytics_alerts - (alert_type, severity, title, description, store_id, brand_name, product_id, category, metadata) - VALUES ($1, $2, $3, $4, $5, $6, $7, $8, $9) - `, [ - alert.alertType, - alert.severity, - alert.title, - alert.description || null, - alert.storeId || null, - alert.brandName || null, - alert.productId || null, - alert.category || null, - alert.metadata ? 
JSON.stringify(alert.metadata) : null, - ]); - } - - /** - * Get recent alerts - */ - async getAlerts(filters: { - brandName?: string; - storeId?: number; - alertType?: string; - unreadOnly?: boolean; - limit?: number; - } = {}): Promise> { - const { brandName, storeId, alertType, unreadOnly = false, limit = 50 } = filters; - const params: (string | number | boolean)[] = [limit]; - const conditions: string[] = []; - let paramIndex = 2; - - if (brandName) { - conditions.push(`a.brand_name = $${paramIndex++}`); - params.push(brandName); - } - if (storeId) { - conditions.push(`a.store_id = $${paramIndex++}`); - params.push(storeId); - } - if (alertType) { - conditions.push(`a.alert_type = $${paramIndex++}`); - params.push(alertType); - } - if (unreadOnly) { - conditions.push('a.is_read = false'); - } - - const whereClause = conditions.length > 0 - ? 'WHERE ' + conditions.join(' AND ') - : ''; - - const result = await this.pool.query(` - SELECT - a.id, - a.alert_type, - a.severity, - a.title, - a.description, - d.name as store_name, - a.brand_name, - a.created_at, - a.is_read - FROM analytics_alerts a - LEFT JOIN dispensaries d ON a.store_id = d.id - ${whereClause} - ORDER BY a.created_at DESC - LIMIT $1 - `, params); - - return result.rows.map(row => ({ - id: row.id, - alertType: row.alert_type, - severity: row.severity, - title: row.title, - description: row.description, - storeName: row.store_name, - brandName: row.brand_name, - createdAt: row.created_at.toISOString(), - isRead: row.is_read, - })); - } - - /** - * Mark alerts as read - */ - async markAlertsRead(alertIds: number[]): Promise { - if (alertIds.length === 0) return; - - await this.pool.query(` - UPDATE analytics_alerts - SET is_read = true - WHERE id = ANY($1) - `, [alertIds]); - } -} diff --git a/backend/src/dutchie-az/services/analytics/cache.ts b/backend/src/dutchie-az/services/analytics/cache.ts deleted file mode 100644 index 75d15a03..00000000 --- a/backend/src/dutchie-az/services/analytics/cache.ts 
+++ /dev/null
@@ -1,227 +0,0 @@
-/**
- * Analytics Cache Service
- *
- * Provides caching layer for expensive analytics queries.
- * Uses PostgreSQL for persistence with configurable TTLs.
- *
- * Phase 3: Analytics Dashboards
- */
-
-import { Pool } from 'pg';
-
-export interface CacheEntry {
-  key: string;
-  data: T;
-  computedAt: Date;
-  expiresAt: Date;
-  queryTimeMs?: number;
-}
-
-export interface CacheConfig {
-  defaultTtlMinutes: number;
-}
-
-const DEFAULT_CONFIG: CacheConfig = {
-  defaultTtlMinutes: 15,
-};
-
-export class AnalyticsCache {
-  private pool: Pool;
-  private config: CacheConfig;
-  private memoryCache: Map = new Map();
-
-  constructor(pool: Pool, config: Partial = {}) {
-    this.pool = pool;
-    this.config = { ...DEFAULT_CONFIG, ...config };
-  }
-
-  /**
-   * Get cached data or compute and cache it
-   */
-  async getOrCompute(
-    key: string,
-    computeFn: () => Promise,
-    ttlMinutes?: number
-  ): Promise<{ data: T; fromCache: boolean; queryTimeMs: number }> {
-    const ttl = ttlMinutes ?? this.config.defaultTtlMinutes;
-
-    // Check memory cache first
-    const memEntry = this.memoryCache.get(key);
-    if (memEntry && new Date() < memEntry.expiresAt) {
-      return { data: memEntry.data as T, fromCache: true, queryTimeMs: memEntry.queryTimeMs || 0 };
-    }
-
-    // Check database cache
-    const dbEntry = await this.getFromDb(key);
-    if (dbEntry && new Date() < dbEntry.expiresAt) {
-      this.memoryCache.set(key, dbEntry);
-      return { data: dbEntry.data, fromCache: true, queryTimeMs: dbEntry.queryTimeMs || 0 };
-    }
-
-    // Compute fresh data
-    const startTime = Date.now();
-    const data = await computeFn();
-    const queryTimeMs = Date.now() - startTime;
-
-    // Cache result
-    const entry: CacheEntry = {
-      key,
-      data,
-      computedAt: new Date(),
-      expiresAt: new Date(Date.now() + ttl * 60 * 1000),
-      queryTimeMs,
-    };
-
-    await this.saveToDb(entry);
-    this.memoryCache.set(key, entry);
-
-    return { data, fromCache: false, queryTimeMs };
-  }
-
-  /**
-   * Get from database cache
-   */
-  private async getFromDb(key: string): Promise | null> {
-    try {
-      const result = await this.pool.query(`
-        SELECT cache_data, computed_at, expires_at, query_time_ms
-        FROM analytics_cache
-        WHERE cache_key = $1
-          AND expires_at > NOW()
-      `, [key]);
-
-      if (result.rows.length === 0) return null;
-
-      const row = result.rows[0];
-      return {
-        key,
-        data: row.cache_data as T,
-        computedAt: row.computed_at,
-        expiresAt: row.expires_at,
-        queryTimeMs: row.query_time_ms,
-      };
-    } catch (error) {
-      console.warn(`[AnalyticsCache] Failed to get from DB: ${error}`);
-      return null;
-    }
-  }
-
-  /**
-   * Save to database cache
-   */
-  private async saveToDb(entry: CacheEntry): Promise {
-    try {
-      await this.pool.query(`
-        INSERT INTO analytics_cache (cache_key, cache_data, computed_at, expires_at, query_time_ms)
-        VALUES ($1, $2, $3, $4, $5)
-        ON CONFLICT (cache_key)
-        DO UPDATE SET
-          cache_data = EXCLUDED.cache_data,
-          computed_at = EXCLUDED.computed_at,
-          expires_at = EXCLUDED.expires_at,
-          query_time_ms = EXCLUDED.query_time_ms
-      `, [entry.key, JSON.stringify(entry.data), entry.computedAt, entry.expiresAt, entry.queryTimeMs]);
-    } catch (error) {
-      console.warn(`[AnalyticsCache] Failed to save to DB: ${error}`);
-    }
-  }
-
-  /**
-   * Invalidate a cache entry
-   */
-  async invalidate(key: string): Promise {
-    this.memoryCache.delete(key);
-    try {
-      await this.pool.query('DELETE FROM analytics_cache WHERE cache_key = $1', [key]);
-    } catch (error) {
-      console.warn(`[AnalyticsCache] Failed to invalidate: ${error}`);
-    }
-  }
-
-  /**
-   * Invalidate all entries matching a pattern
-   */
-  async invalidatePattern(pattern: string): Promise {
-    // Clear memory cache
-    for (const key of this.memoryCache.keys()) {
-      if (key.includes(pattern)) {
-        this.memoryCache.delete(key);
-      }
-    }
-
-    try {
-      const result = await this.pool.query(
-        'DELETE FROM analytics_cache WHERE cache_key LIKE $1',
-        [`%${pattern}%`]
-      );
-      return result.rowCount || 0;
-    } catch (error) {
-      console.warn(`[AnalyticsCache] Failed to invalidate pattern: ${error}`);
-      return 0;
-    }
-  }
-
-  /**
-   * Clean expired entries
-   */
-  async cleanExpired(): Promise {
-    // Clean memory cache
-    const now = new Date();
-    for (const [key, entry] of this.memoryCache.entries()) {
-      if (now >= entry.expiresAt) {
-        this.memoryCache.delete(key);
-      }
-    }
-
-    try {
-      const result = await this.pool.query('DELETE FROM analytics_cache WHERE expires_at < NOW()');
-      return result.rowCount || 0;
-    } catch (error) {
-      console.warn(`[AnalyticsCache] Failed to clean expired: ${error}`);
-      return 0;
-    }
-  }
-
-  /**
-   * Get cache statistics
-   */
-  async getStats(): Promise<{
-    memoryCacheSize: number;
-    dbCacheSize: number;
-    expiredCount: number;
-  }> {
-    try {
-      const result = await this.pool.query(`
-        SELECT
-          COUNT(*) FILTER (WHERE expires_at > NOW()) as active,
-          COUNT(*) FILTER (WHERE expires_at <= NOW()) as expired
-        FROM analytics_cache
-      `);
-
-      return {
-        memoryCacheSize: this.memoryCache.size,
-        dbCacheSize: parseInt(result.rows[0]?.active || '0'),
-        expiredCount: parseInt(result.rows[0]?.expired || '0'),
-      };
-    } catch (error) {
-      return {
-        memoryCacheSize: this.memoryCache.size,
-        dbCacheSize: 0,
-        expiredCount: 0,
-      };
-    }
-  }
-}
-
-/**
- * Generate cache key with parameters
- */
-export function cacheKey(prefix: string, params: Record = {}): string {
-  const sortedParams = Object.keys(params)
-    .sort()
-    .filter(k => params[k] !== undefined && params[k] !== null)
-    .map(k => `${k}=${params[k]}`)
-    .join('&');
-
-  return sortedParams ? `${prefix}:${sortedParams}` : prefix;
-}
diff --git a/backend/src/dutchie-az/services/analytics/category-analytics.ts b/backend/src/dutchie-az/services/analytics/category-analytics.ts
deleted file mode 100644
index 429bae8d..00000000
--- a/backend/src/dutchie-az/services/analytics/category-analytics.ts
+++ /dev/null
@@ -1,530 +0,0 @@
-/**
- * Category Growth Analytics Service
- *
- * Provides category-level analytics including:
- * - SKU count growth
- * - Price growth trends
- * - New product additions
- * - Category shrinkage
- * - Seasonality patterns
- *
- * Phase 3: Analytics Dashboards
- */
-
-import { Pool } from 'pg';
-import { AnalyticsCache, cacheKey } from './cache';
-
-export interface CategoryGrowth {
-  category: string;
-  currentSkuCount: number;
-  previousSkuCount: number;
-  skuGrowthPercent: number;
-  currentBrandCount: number;
-  previousBrandCount: number;
-  brandGrowthPercent: number;
-  currentAvgPrice: number | null;
-  previousAvgPrice: number | null;
-  priceChangePercent: number | null;
-  newProducts: number;
-  discontinuedProducts: number;
-  trend: 'growing' | 'declining' | 'stable';
-}
-
-export interface CategorySummary {
-  category: string;
-  totalSkus: number;
-  brandCount: number;
-  storeCount: number;
-  avgPrice: number | null;
-  minPrice: number | null;
-  maxPrice: number | null;
-  inStockSkus: number;
-  outOfStockSkus: number;
-  stockHealthPercent: number;
-}
-
-export interface CategoryGrowthTrend {
-  category: string;
-  dataPoints: Array<{
-    date: string;
-    skuCount: number;
-    brandCount: number;
-    avgPrice: number | null;
-    storeCount: number;
-  }>;
-  growth7d: number | null;
-  growth30d: number | null;
-  growth90d: number | null;
-}
-
-export interface CategoryHeatmapData {
-  categories: string[];
-  periods: string[];
-  data: Array<{
-    category: string;
-    period: string;
-    value: number; // SKU count, growth %, or price
-    changeFromPrevious: number | null;
-  }>;
-}
-
-export interface SeasonalityPattern {
-  category: string;
-  monthlyPattern: Array<{
-    month: number;
-    monthName: string;
-    avgSkuCount: number;
-    avgPrice: number | null;
-    seasonalityIndex: number; // 100 = average, >100 = above, <100 = below
-  }>;
-  peakMonth: number;
-  troughMonth: number;
-}
-
-export interface CategoryFilters {
-  state?: string;
-  storeId?: number;
-  minSkus?: number;
-}
-
-export class CategoryAnalyticsService {
-  private pool: Pool;
-  private cache: AnalyticsCache;
-
-  constructor(pool: Pool, cache: AnalyticsCache) {
-    this.pool = pool;
-    this.cache = cache;
-  }
-
-  /**
-   * Get current category summary
-   */
-  async getCategorySummary(
-    category?: string,
-    filters: CategoryFilters = {}
-  ): Promise {
-    const { state, storeId } = filters;
-    const key = cacheKey('category_summary', { category, state, storeId });
-
-    return (await this.cache.getOrCompute(key, async () => {
-      const params: (string | number)[] = [];
-      const conditions: string[] = [];
-      let paramIndex = 1;
-
-      if (category) {
-        conditions.push(`dp.type = $${paramIndex++}`);
-        params.push(category);
-      }
-      if (state) {
-        conditions.push(`d.state = $${paramIndex++}`);
-        params.push(state);
-      }
-      if (storeId) {
-        conditions.push(`dp.dispensary_id = $${paramIndex++}`);
-        params.push(storeId);
-      }
-
-      const whereClause = conditions.length > 0
-        ? 'WHERE dp.type IS NOT NULL AND ' + conditions.join(' AND ')
-        : 'WHERE dp.type IS NOT NULL';
-
-      const result = await this.pool.query(`
-        SELECT
-          dp.type as category,
-          COUNT(*) as total_skus,
-          COUNT(DISTINCT dp.brand_name) as brand_count,
-          COUNT(DISTINCT dp.dispensary_id) as store_count,
-          AVG(extract_min_price(dp.latest_raw_payload)) as avg_price,
-          MIN(extract_min_price(dp.latest_raw_payload)) as min_price,
-          MAX(extract_max_price(dp.latest_raw_payload)) as max_price,
-          SUM(CASE WHEN dp.stock_status = 'in_stock' THEN 1 ELSE 0 END) as in_stock,
-          SUM(CASE WHEN dp.stock_status != 'in_stock' OR dp.stock_status IS NULL THEN 1 ELSE 0 END) as out_of_stock
-        FROM dutchie_products dp
-        JOIN dispensaries d ON dp.dispensary_id = d.id
-        ${whereClause}
-        GROUP BY dp.type
-        ORDER BY total_skus DESC
-      `, params);
-
-      return result.rows.map(row => {
-        const totalSkus = parseInt(row.total_skus) || 0;
-        const inStock = parseInt(row.in_stock) || 0;
-
-        return {
-          category: row.category,
-          totalSkus,
-          brandCount: parseInt(row.brand_count) || 0,
-          storeCount: parseInt(row.store_count) || 0,
-          avgPrice: row.avg_price ? Math.round(parseFloat(row.avg_price) * 100) / 100 : null,
-          minPrice: row.min_price ? Math.round(parseFloat(row.min_price) * 100) / 100 : null,
-          maxPrice: row.max_price ? Math.round(parseFloat(row.max_price) * 100) / 100 : null,
-          inStockSkus: inStock,
-          outOfStockSkus: parseInt(row.out_of_stock) || 0,
-          stockHealthPercent: totalSkus > 0
-            ? Math.round((inStock / totalSkus) * 100)
-            : 0,
-        };
-      });
-    }, 15)).data;
-  }
-
-  /**
-   * Get category growth (comparing periods)
-   */
-  async getCategoryGrowth(
-    days: number = 7,
-    filters: CategoryFilters = {}
-  ): Promise {
-    const { state, storeId, minSkus = 10 } = filters;
-    const key = cacheKey('category_growth', { days, state, storeId, minSkus });
-
-    return (await this.cache.getOrCompute(key, async () => {
-      // Use category_snapshots for historical comparison
-      const result = await this.pool.query(`
-        WITH current_data AS (
-          SELECT
-            category,
-            total_skus,
-            brand_count,
-            avg_price,
-            store_count
-          FROM category_snapshots
-          WHERE snapshot_date = (SELECT MAX(snapshot_date) FROM category_snapshots)
-        ),
-        previous_data AS (
-          SELECT
-            category,
-            total_skus,
-            brand_count,
-            avg_price,
-            store_count
-          FROM category_snapshots
-          WHERE snapshot_date = (
-            SELECT MAX(snapshot_date)
-            FROM category_snapshots
-            WHERE snapshot_date < (SELECT MAX(snapshot_date) FROM category_snapshots) - ($1 || ' days')::INTERVAL
-          )
-        )
-        SELECT
-          c.category,
-          c.total_skus as current_skus,
-          COALESCE(p.total_skus, c.total_skus) as previous_skus,
-          c.brand_count as current_brands,
-          COALESCE(p.brand_count, c.brand_count) as previous_brands,
-          c.avg_price as current_price,
-          p.avg_price as previous_price
-        FROM current_data c
-        LEFT JOIN previous_data p ON c.category = p.category
-        WHERE c.total_skus >= $2
-        ORDER BY c.total_skus DESC
-      `, [days, minSkus]);
-
-      // If no snapshots exist, use current data
-      if (result.rows.length === 0) {
-        const fallbackResult = await this.pool.query(`
-          SELECT
-            type as category,
-            COUNT(*) as total_skus,
-            COUNT(DISTINCT brand_name) as brand_count,
-            AVG(extract_min_price(latest_raw_payload)) as avg_price
-          FROM dutchie_products
-          WHERE type IS NOT NULL
-          GROUP BY type
-          HAVING COUNT(*) >= $1
-          ORDER BY total_skus DESC
-        `, [minSkus]);
-
-        return fallbackResult.rows.map(row => ({
-          category: row.category,
-          currentSkuCount: parseInt(row.total_skus) || 0,
-          previousSkuCount: parseInt(row.total_skus) || 0,
-          skuGrowthPercent: 0,
-          currentBrandCount: parseInt(row.brand_count) || 0,
-          previousBrandCount: parseInt(row.brand_count) || 0,
-          brandGrowthPercent: 0,
-          currentAvgPrice: row.avg_price ? Math.round(parseFloat(row.avg_price) * 100) / 100 : null,
-          previousAvgPrice: row.avg_price ? Math.round(parseFloat(row.avg_price) * 100) / 100 : null,
-          priceChangePercent: null,
-          newProducts: 0,
-          discontinuedProducts: 0,
-          trend: 'stable' as const,
-        }));
-      }
-
-      return result.rows.map(row => {
-        const currentSkus = parseInt(row.current_skus) || 0;
-        const previousSkus = parseInt(row.previous_skus) || currentSkus;
-        const currentBrands = parseInt(row.current_brands) || 0;
-        const previousBrands = parseInt(row.previous_brands) || currentBrands;
-        const currentPrice = row.current_price ? parseFloat(row.current_price) : null;
-        const previousPrice = row.previous_price ? parseFloat(row.previous_price) : null;
-
-        const skuGrowth = previousSkus > 0
-          ? ((currentSkus - previousSkus) / previousSkus) * 100
-          : 0;
-        const brandGrowth = previousBrands > 0
-          ? ((currentBrands - previousBrands) / previousBrands) * 100
-          : 0;
-        const priceChange = previousPrice && currentPrice
-          ? ((currentPrice - previousPrice) / previousPrice) * 100
-          : null;
-
-        let trend: 'growing' | 'declining' | 'stable' = 'stable';
-        if (skuGrowth > 5) trend = 'growing';
-        else if (skuGrowth < -5) trend = 'declining';
-
-        return {
-          category: row.category,
-          currentSkuCount: currentSkus,
-          previousSkuCount: previousSkus,
-          skuGrowthPercent: Math.round(skuGrowth * 10) / 10,
-          currentBrandCount: currentBrands,
-          previousBrandCount: previousBrands,
-          brandGrowthPercent: Math.round(brandGrowth * 10) / 10,
-          currentAvgPrice: currentPrice ? Math.round(currentPrice * 100) / 100 : null,
-          previousAvgPrice: previousPrice ? Math.round(previousPrice * 100) / 100 : null,
-          priceChangePercent: priceChange !== null ? Math.round(priceChange * 10) / 10 : null,
-          newProducts: Math.max(0, currentSkus - previousSkus),
-          discontinuedProducts: Math.max(0, previousSkus - currentSkus),
-          trend,
-        };
-      });
-    }, 15)).data;
-  }
-
-  /**
-   * Get category growth trend over time
-   */
-  async getCategoryGrowthTrend(
-    category: string,
-    days: number = 90
-  ): Promise {
-    const key = cacheKey('category_growth_trend', { category, days });
-
-    return (await this.cache.getOrCompute(key, async () => {
-      const result = await this.pool.query(`
-        SELECT
-          snapshot_date as date,
-          total_skus as sku_count,
-          brand_count,
-          avg_price,
-          store_count
-        FROM category_snapshots
-        WHERE category = $1
-          AND snapshot_date >= CURRENT_DATE - ($2 || ' days')::INTERVAL
-        ORDER BY snapshot_date
-      `, [category, days]);
-
-      const dataPoints = result.rows.map(row => ({
-        date: row.date.toISOString().split('T')[0],
-        skuCount: parseInt(row.sku_count) || 0,
-        brandCount: parseInt(row.brand_count) || 0,
-        avgPrice: row.avg_price ? Math.round(parseFloat(row.avg_price) * 100) / 100 : null,
-        storeCount: parseInt(row.store_count) || 0,
-      }));
-
-      // Calculate growth rates
-      const calculateGrowth = (daysBack: number): number | null => {
-        if (dataPoints.length < 2) return null;
-        const targetDate = new Date();
-        targetDate.setDate(targetDate.getDate() - daysBack);
-        const targetDateStr = targetDate.toISOString().split('T')[0];
-
-        const recent = dataPoints[dataPoints.length - 1];
-        const older = dataPoints.find(d => d.date <= targetDateStr) || dataPoints[0];
-
-        if (older.skuCount === 0) return null;
-        return Math.round(((recent.skuCount - older.skuCount) / older.skuCount) * 1000) / 10;
-      };
-
-      return {
-        category,
-        dataPoints,
-        growth7d: calculateGrowth(7),
-        growth30d: calculateGrowth(30),
-        growth90d: calculateGrowth(90),
-      };
-    }, 15)).data;
-  }
-
-  /**
-   * Get category heatmap data
-   */
-  async getCategoryHeatmap(
-    metric: 'skus' | 'growth' | 'price' = 'skus',
-    periods: number = 12 // weeks
-  ): Promise {
-    const key = cacheKey('category_heatmap', { metric, periods });
-
-    return (await this.cache.getOrCompute(key, async () => {
-      const result = await this.pool.query(`
-        SELECT
-          category,
-          snapshot_date,
-          total_skus,
-          avg_price
-        FROM category_snapshots
-        WHERE snapshot_date >= CURRENT_DATE - ($1 * 7 || ' days')::INTERVAL
-        ORDER BY category, snapshot_date
-      `, [periods]);
-
-      // Get unique categories and generate weekly periods
-      const categoriesSet = new Set();
-      const periodsSet = new Set();
-
-      result.rows.forEach(row => {
-        categoriesSet.add(row.category);
-        // Group by week
-        const date = new Date(row.snapshot_date);
-        const weekStart = new Date(date);
-        weekStart.setDate(date.getDate() - date.getDay());
-        periodsSet.add(weekStart.toISOString().split('T')[0]);
-      });
-
-      const categories = Array.from(categoriesSet).sort();
-      const periodsList = Array.from(periodsSet).sort();
-
-      // Aggregate data by category and week
-      const dataMap = new Map>();
-
-      result.rows.forEach(row => {
-        const date = new Date(row.snapshot_date);
-        const weekStart = new Date(date);
-        weekStart.setDate(date.getDate() - date.getDay());
-        const period = weekStart.toISOString().split('T')[0];
-
-        if (!dataMap.has(row.category)) {
-          dataMap.set(row.category, new Map());
-        }
-        const categoryData = dataMap.get(row.category)!;
-
-        if (!categoryData.has(period)) {
-          categoryData.set(period, { skus: 0, price: null });
-        }
-        const existing = categoryData.get(period)!;
-        existing.skus = Math.max(existing.skus, parseInt(row.total_skus) || 0);
-        if (row.avg_price) {
-          existing.price = parseFloat(row.avg_price);
-        }
-      });
-
-      // Build heatmap data
-      const data: CategoryHeatmapData['data'] = [];
-
-      categories.forEach(category => {
-        let previousValue: number | null = null;
-
-        periodsList.forEach(period => {
-          const categoryData = dataMap.get(category)?.get(period);
-          let value = 0;
-
-          if (categoryData) {
-            switch (metric) {
-              case 'skus':
-                value = categoryData.skus;
-                break;
-              case 'price':
-                value = categoryData.price || 0;
-                break;
-              case 'growth':
-                value = previousValue !== null && previousValue > 0
-                  ? ((categoryData.skus - previousValue) / previousValue) * 100
-                  : 0;
-                break;
-            }
-          }
-
-          const changeFromPrevious = previousValue !== null && previousValue > 0
-            ? ((value - previousValue) / previousValue) * 100
-            : null;
-
-          data.push({
-            category,
-            period,
-            value: Math.round(value * 100) / 100,
-            changeFromPrevious: changeFromPrevious !== null
-              ? Math.round(changeFromPrevious * 10) / 10
-              : null,
-          });
-
-          if (metric !== 'growth') {
-            previousValue = value;
-          } else if (categoryData) {
-            previousValue = categoryData.skus;
-          }
-        });
-      });
-
-      return {
-        categories,
-        periods: periodsList,
-        data,
-      };
-    }, 30)).data;
-  }
-
-  /**
-   * Get top growing/declining categories
-   */
-  async getTopMovers(
-    limit: number = 5,
-    days: number = 30
-  ): Promise<{
-    growing: CategoryGrowth[];
-    declining: CategoryGrowth[];
-  }> {
-    const key = cacheKey('top_movers', { limit, days });
-
-    return (await this.cache.getOrCompute(key, async () => {
-      const allGrowth = await this.getCategoryGrowth(days);
-
-      const sorted = [...allGrowth].sort((a, b) => b.skuGrowthPercent - a.skuGrowthPercent);
-
-      return {
-        growing: sorted.filter(c => c.skuGrowthPercent > 0).slice(0, limit),
-        declining: sorted.filter(c => c.skuGrowthPercent < 0).slice(-limit).reverse(),
-      };
-    }, 15)).data;
-  }
-
-  /**
-   * Get category subcategory breakdown
-   */
-  async getSubcategoryBreakdown(category: string): Promise> {
-    const key = cacheKey('subcategory_breakdown', { category });
-
-    return (await this.cache.getOrCompute(key, async () => {
-      const result = await this.pool.query(`
-        WITH category_total AS (
-          SELECT COUNT(*) as total FROM dutchie_products WHERE type = $1
-        )
-        SELECT
-          COALESCE(dp.subcategory, 'Other') as subcategory,
-          COUNT(*) as sku_count,
-          COUNT(DISTINCT dp.brand_name) as brand_count,
-          AVG(extract_min_price(dp.latest_raw_payload)) as avg_price,
-          ct.total as category_total
-        FROM
dutchie_products dp, category_total ct - WHERE dp.type = $1 - GROUP BY dp.subcategory, ct.total - ORDER BY sku_count DESC - `, [category]); - - return result.rows.map(row => ({ - subcategory: row.subcategory, - skuCount: parseInt(row.sku_count) || 0, - brandCount: parseInt(row.brand_count) || 0, - avgPrice: row.avg_price ? Math.round(parseFloat(row.avg_price) * 100) / 100 : null, - percentOfCategory: parseInt(row.category_total) > 0 - ? Math.round((parseInt(row.sku_count) / parseInt(row.category_total)) * 1000) / 10 - : 0, - })); - }, 15)).data; - } -} diff --git a/backend/src/dutchie-az/services/analytics/index.ts b/backend/src/dutchie-az/services/analytics/index.ts deleted file mode 100644 index ac73bed0..00000000 --- a/backend/src/dutchie-az/services/analytics/index.ts +++ /dev/null @@ -1,57 +0,0 @@ -/** - * Analytics Module Index - * - * Exports all analytics services for CannaiQ dashboards. - * - * Phase 3: Analytics Dashboards - */ - -export { AnalyticsCache, cacheKey, type CacheEntry, type CacheConfig } from './cache'; - -export { - PriceTrendService, - type PricePoint, - type PriceTrend, - type PriceSummary, - type PriceCompressionResult, - type PriceFilters, -} from './price-trends'; - -export { - PenetrationService, - type BrandPenetration, - type PenetrationTrend, - type ShelfShare, - type BrandPresenceByState, - type PenetrationFilters, -} from './penetration'; - -export { - CategoryAnalyticsService, - type CategoryGrowth, - type CategorySummary, - type CategoryGrowthTrend, - type CategoryHeatmapData, - type SeasonalityPattern, - type CategoryFilters, -} from './category-analytics'; - -export { - StoreChangeService, - type StoreChangeSummary, - type StoreChangeEvent, - type BrandChange, - type ProductChange, - type CategoryLeaderboard, - type StoreFilters, -} from './store-changes'; - -export { - BrandOpportunityService, - type BrandOpportunity, - type PricePosition, - type MissingSkuOpportunity, - type StoreShelfShareChange, - type CompetitorAlert, - 
type MarketPositionSummary, -} from './brand-opportunity'; diff --git a/backend/src/dutchie-az/services/analytics/penetration.ts b/backend/src/dutchie-az/services/analytics/penetration.ts deleted file mode 100644 index 92baad60..00000000 --- a/backend/src/dutchie-az/services/analytics/penetration.ts +++ /dev/null @@ -1,556 +0,0 @@ -/** - * Brand Penetration Analytics Service - * - * Provides analytics for brand market penetration including: - * - Stores carrying brand - * - SKU counts per brand - * - Percentage of stores carrying - * - Shelf share calculations - * - Penetration trends and momentum - * - * Phase 3: Analytics Dashboards - */ - -import { Pool } from 'pg'; -import { AnalyticsCache, cacheKey } from './cache'; - -export interface BrandPenetration { - brandName: string; - brandId: string | null; - totalStores: number; - storesCarrying: number; - penetrationPercent: number; - totalSkus: number; - avgSkusPerStore: number; - shelfSharePercent: number; - categories: string[]; - avgPrice: number | null; - inStockSkus: number; -} - -export interface PenetrationTrend { - brandName: string; - dataPoints: Array<{ - date: string; - storeCount: number; - skuCount: number; - penetrationPercent: number; - }>; - momentumScore: number; // -100 to +100 - riskScore: number; // 0 to 100, higher = more risk - trend: 'growing' | 'declining' | 'stable'; -} - -export interface ShelfShare { - brandName: string; - category: string; - skuCount: number; - categoryTotalSkus: number; - shelfSharePercent: number; - rank: number; -} - -export interface BrandPresenceByState { - state: string; - storeCount: number; - skuCount: number; - avgPrice: number | null; -} - -export interface PenetrationFilters { - state?: string; - category?: string; - minStores?: number; - minSkus?: number; -} - -export class PenetrationService { - private pool: Pool; - private cache: AnalyticsCache; - - constructor(pool: Pool, cache: AnalyticsCache) { - this.pool = pool; - this.cache = cache; - } - - /** - * 
Get penetration data for a specific brand - */ - async getBrandPenetration( - brandName: string, - filters: PenetrationFilters = {} - ): Promise { - const { state, category } = filters; - const key = cacheKey('brand_penetration', { brandName, state, category }); - - return (await this.cache.getOrCompute(key, async () => { - // Build where clauses - const conditions: string[] = []; - const params: (string | number)[] = [brandName]; - let paramIndex = 2; - - if (state) { - conditions.push(`d.state = $${paramIndex++}`); - params.push(state); - } - if (category) { - conditions.push(`dp.type = $${paramIndex++}`); - params.push(category); - } - - const stateCondition = state ? `AND d.state = $${params.indexOf(state) + 1}` : ''; - const categoryCondition = category ? `AND dp.type = $${params.indexOf(category) + 1}` : ''; - - const result = await this.pool.query(` - WITH total_stores AS ( - SELECT COUNT(DISTINCT id) as total - FROM dispensaries - WHERE 1=1 ${state ? `AND state = $2` : ''} - ), - brand_data AS ( - SELECT - dp.brand_name, - dp.brand_id, - COUNT(DISTINCT dp.dispensary_id) as stores_carrying, - COUNT(*) as total_skus, - AVG(extract_min_price(dp.latest_raw_payload)) as avg_price, - SUM(CASE WHEN dp.stock_status = 'in_stock' THEN 1 ELSE 0 END) as in_stock, - ARRAY_AGG(DISTINCT dp.type) FILTER (WHERE dp.type IS NOT NULL) as categories - FROM dutchie_products dp - JOIN dispensaries d ON dp.dispensary_id = d.id - WHERE dp.brand_name = $1 - ${stateCondition} - ${categoryCondition} - GROUP BY dp.brand_name, dp.brand_id - ), - total_skus AS ( - SELECT COUNT(*) as total - FROM dutchie_products dp - JOIN dispensaries d ON dp.dispensary_id = d.id - WHERE 1=1 ${stateCondition} ${categoryCondition} - ) - SELECT - bd.brand_name, - bd.brand_id, - ts.total as total_stores, - bd.stores_carrying, - bd.total_skus, - bd.avg_price, - bd.in_stock, - bd.categories, - tsk.total as market_total_skus - FROM brand_data bd, total_stores ts, total_skus tsk - `, params); - - if 
(result.rows.length === 0) { - return { - brandName, - brandId: null, - totalStores: 0, - storesCarrying: 0, - penetrationPercent: 0, - totalSkus: 0, - avgSkusPerStore: 0, - shelfSharePercent: 0, - categories: [], - avgPrice: null, - inStockSkus: 0, - }; - } - - const row = result.rows[0]; - const totalStores = parseInt(row.total_stores) || 1; - const storesCarrying = parseInt(row.stores_carrying) || 0; - const totalSkus = parseInt(row.total_skus) || 0; - const marketTotalSkus = parseInt(row.market_total_skus) || 1; - - return { - brandName: row.brand_name, - brandId: row.brand_id, - totalStores, - storesCarrying, - penetrationPercent: Math.round((storesCarrying / totalStores) * 1000) / 10, - totalSkus, - avgSkusPerStore: storesCarrying > 0 - ? Math.round((totalSkus / storesCarrying) * 10) / 10 - : 0, - shelfSharePercent: Math.round((totalSkus / marketTotalSkus) * 1000) / 10, - categories: row.categories || [], - avgPrice: row.avg_price ? Math.round(parseFloat(row.avg_price) * 100) / 100 : null, - inStockSkus: parseInt(row.in_stock) || 0, - }; - }, 15)).data; - } - - /** - * Get top brands by penetration - */ - async getTopBrandsByPenetration( - limit: number = 20, - filters: PenetrationFilters = {} - ): Promise { - const { state, category, minStores = 2, minSkus = 5 } = filters; - const key = cacheKey('top_brands_penetration', { limit, state, category, minStores, minSkus }); - - return (await this.cache.getOrCompute(key, async () => { - const params: (string | number)[] = [limit, minStores, minSkus]; - let paramIndex = 4; - - let stateCondition = ''; - let categoryCondition = ''; - - if (state) { - stateCondition = `AND d.state = $${paramIndex++}`; - params.push(state); - } - if (category) { - categoryCondition = `AND dp.type = $${paramIndex++}`; - params.push(category); - } - - const result = await this.pool.query(` - WITH total_stores AS ( - SELECT COUNT(DISTINCT id) as total - FROM dispensaries - WHERE 1=1 ${state ? 
`AND state = $${params.indexOf(state) + 1}` : ''} - ), - total_skus AS ( - SELECT COUNT(*) as total - FROM dutchie_products dp - JOIN dispensaries d ON dp.dispensary_id = d.id - WHERE 1=1 ${stateCondition} ${categoryCondition} - ), - brand_data AS ( - SELECT - dp.brand_name, - dp.brand_id, - COUNT(DISTINCT dp.dispensary_id) as stores_carrying, - COUNT(*) as total_skus, - AVG(extract_min_price(dp.latest_raw_payload)) as avg_price, - SUM(CASE WHEN dp.stock_status = 'in_stock' THEN 1 ELSE 0 END) as in_stock, - ARRAY_AGG(DISTINCT dp.type) FILTER (WHERE dp.type IS NOT NULL) as categories - FROM dutchie_products dp - JOIN dispensaries d ON dp.dispensary_id = d.id - WHERE dp.brand_name IS NOT NULL - ${stateCondition} - ${categoryCondition} - GROUP BY dp.brand_name, dp.brand_id - HAVING COUNT(DISTINCT dp.dispensary_id) >= $2 - AND COUNT(*) >= $3 - ) - SELECT - bd.*, - ts.total as total_stores, - tsk.total as market_total_skus - FROM brand_data bd, total_stores ts, total_skus tsk - ORDER BY bd.stores_carrying DESC, bd.total_skus DESC - LIMIT $1 - `, params); - - return result.rows.map(row => { - const totalStores = parseInt(row.total_stores) || 1; - const storesCarrying = parseInt(row.stores_carrying) || 0; - const totalSkus = parseInt(row.total_skus) || 0; - const marketTotalSkus = parseInt(row.market_total_skus) || 1; - - return { - brandName: row.brand_name, - brandId: row.brand_id, - totalStores, - storesCarrying, - penetrationPercent: Math.round((storesCarrying / totalStores) * 1000) / 10, - totalSkus, - avgSkusPerStore: storesCarrying > 0 - ? Math.round((totalSkus / storesCarrying) * 10) / 10 - : 0, - shelfSharePercent: Math.round((totalSkus / marketTotalSkus) * 1000) / 10, - categories: row.categories || [], - avgPrice: row.avg_price ? 
Math.round(parseFloat(row.avg_price) * 100) / 100 : null, - inStockSkus: parseInt(row.in_stock) || 0, - }; - }); - }, 15)).data; - } - - /** - * Get penetration trend for a brand (requires historical snapshots) - */ - async getPenetrationTrend( - brandName: string, - days: number = 30 - ): Promise { - const key = cacheKey('penetration_trend', { brandName, days }); - - return (await this.cache.getOrCompute(key, async () => { - // Use brand_snapshots table for historical data - const result = await this.pool.query(` - SELECT - snapshot_date as date, - store_count, - total_skus - FROM brand_snapshots - WHERE brand_name = $1 - AND snapshot_date >= CURRENT_DATE - ($2 || ' days')::INTERVAL - ORDER BY snapshot_date - `, [brandName, days]); - - // Get total stores for penetration calculation - const totalResult = await this.pool.query( - 'SELECT COUNT(*) as total FROM dispensaries' - ); - const totalStores = parseInt(totalResult.rows[0]?.total) || 1; - - const dataPoints = result.rows.map(row => ({ - date: row.date.toISOString().split('T')[0], - storeCount: parseInt(row.store_count) || 0, - skuCount: parseInt(row.total_skus) || 0, - penetrationPercent: Math.round((parseInt(row.store_count) / totalStores) * 1000) / 10, - })); - - // Calculate momentum and risk scores - let momentumScore = 0; - let riskScore = 0; - let trend: 'growing' | 'declining' | 'stable' = 'stable'; - - if (dataPoints.length >= 2) { - const first = dataPoints[0]; - const last = dataPoints[dataPoints.length - 1]; - - // Momentum: change in store count - const storeChange = last.storeCount - first.storeCount; - const storeChangePercent = first.storeCount > 0 - ? 
(storeChange / first.storeCount) * 100 - : 0; - - // Momentum score: -100 to +100 - momentumScore = Math.max(-100, Math.min(100, storeChangePercent * 10)); - - // Risk score: higher if losing stores - if (storeChange < 0) { - riskScore = Math.min(100, Math.abs(storeChangePercent) * 5); - } - - // Determine trend - if (storeChangePercent > 5) trend = 'growing'; - else if (storeChangePercent < -5) trend = 'declining'; - } - - return { - brandName, - dataPoints, - momentumScore: Math.round(momentumScore), - riskScore: Math.round(riskScore), - trend, - }; - }, 15)).data; - } - - /** - * Get shelf share by category for a brand - */ - async getShelfShareByCategory(brandName: string): Promise { - const key = cacheKey('shelf_share_category', { brandName }); - - return (await this.cache.getOrCompute(key, async () => { - const result = await this.pool.query(` - WITH category_totals AS ( - SELECT - type as category, - COUNT(*) as total_skus - FROM dutchie_products - WHERE type IS NOT NULL - GROUP BY type - ), - brand_by_category AS ( - SELECT - type as category, - COUNT(*) as sku_count - FROM dutchie_products - WHERE brand_name = $1 - AND type IS NOT NULL - GROUP BY type - ), - ranked AS ( - SELECT - ct.category, - COALESCE(bc.sku_count, 0) as sku_count, - ct.total_skus, - RANK() OVER (PARTITION BY ct.category ORDER BY bc.sku_count DESC NULLS LAST) as rank - FROM category_totals ct - LEFT JOIN brand_by_category bc ON ct.category = bc.category - ) - SELECT - r.category, - r.sku_count, - r.total_skus as category_total_skus, - ROUND((r.sku_count::NUMERIC / r.total_skus) * 100, 2) as shelf_share_pct, - (SELECT COUNT(*) + 1 FROM ( - SELECT brand_name, COUNT(*) as cnt - FROM dutchie_products - WHERE type = r.category AND brand_name IS NOT NULL - GROUP BY brand_name - HAVING COUNT(*) > r.sku_count - ) t) as rank - FROM ranked r - WHERE r.sku_count > 0 - ORDER BY r.shelf_share_pct DESC - `, [brandName]); - - return result.rows.map(row => ({ - brandName, - category: row.category, - 
skuCount: parseInt(row.sku_count) || 0, - categoryTotalSkus: parseInt(row.category_total_skus) || 0, - shelfSharePercent: parseFloat(row.shelf_share_pct) || 0, - rank: parseInt(row.rank) || 0, - })); - }, 15)).data; - } - - /** - * Get brand presence by state/region - */ - async getBrandPresenceByState(brandName: string): Promise { - const key = cacheKey('brand_presence_state', { brandName }); - - return (await this.cache.getOrCompute(key, async () => { - const result = await this.pool.query(` - SELECT - d.state, - COUNT(DISTINCT dp.dispensary_id) as store_count, - COUNT(*) as sku_count, - AVG(extract_min_price(dp.latest_raw_payload)) as avg_price - FROM dutchie_products dp - JOIN dispensaries d ON dp.dispensary_id = d.id - WHERE dp.brand_name = $1 - GROUP BY d.state - ORDER BY store_count DESC - `, [brandName]); - - return result.rows.map(row => ({ - state: row.state, - storeCount: parseInt(row.store_count) || 0, - skuCount: parseInt(row.sku_count) || 0, - avgPrice: row.avg_price ? Math.round(parseFloat(row.avg_price) * 100) / 100 : null, - })); - }, 15)).data; - } - - /** - * Get stores carrying a brand - */ - async getStoresCarryingBrand(brandName: string): Promise> { - const key = cacheKey('stores_carrying_brand', { brandName }); - - return (await this.cache.getOrCompute(key, async () => { - const result = await this.pool.query(` - SELECT - d.id as store_id, - d.name as store_name, - d.city, - d.state, - COUNT(*) as sku_count, - AVG(extract_min_price(dp.latest_raw_payload)) as avg_price, - ARRAY_AGG(DISTINCT dp.type) FILTER (WHERE dp.type IS NOT NULL) as categories - FROM dutchie_products dp - JOIN dispensaries d ON dp.dispensary_id = d.id - WHERE dp.brand_name = $1 - GROUP BY d.id, d.name, d.city, d.state - ORDER BY sku_count DESC - `, [brandName]); - - return result.rows.map(row => ({ - storeId: row.store_id, - storeName: row.store_name, - city: row.city, - state: row.state, - skuCount: parseInt(row.sku_count) || 0, - avgPrice: row.avg_price ? 
Math.round(parseFloat(row.avg_price) * 100) / 100 : null, - categories: row.categories || [], - })); - }, 15)).data; - } - - /** - * Get penetration heatmap data (state-based) - */ - async getPenetrationHeatmap( - brandName?: string - ): Promise> { - const key = cacheKey('penetration_heatmap', { brandName }); - - return (await this.cache.getOrCompute(key, async () => { - if (brandName) { - const result = await this.pool.query(` - WITH state_totals AS ( - SELECT state, COUNT(*) as total_stores - FROM dispensaries - GROUP BY state - ), - brand_by_state AS ( - SELECT - d.state, - COUNT(DISTINCT dp.dispensary_id) as stores_with_brand, - COUNT(*) as total_skus - FROM dutchie_products dp - JOIN dispensaries d ON dp.dispensary_id = d.id - WHERE dp.brand_name = $1 - GROUP BY d.state - ) - SELECT - st.state, - st.total_stores, - COALESCE(bs.stores_with_brand, 0) as stores_with_brand, - ROUND(COALESCE(bs.stores_with_brand, 0)::NUMERIC / st.total_stores * 100, 1) as penetration_pct, - COALESCE(bs.total_skus, 0) as total_skus - FROM state_totals st - LEFT JOIN brand_by_state bs ON st.state = bs.state - ORDER BY penetration_pct DESC - `, [brandName]); - - return result.rows.map(row => ({ - state: row.state, - totalStores: parseInt(row.total_stores) || 0, - storesWithBrand: parseInt(row.stores_with_brand) || 0, - penetrationPercent: parseFloat(row.penetration_pct) || 0, - totalSkus: parseInt(row.total_skus) || 0, - })); - } else { - // Overall market data by state - const result = await this.pool.query(` - SELECT - d.state, - COUNT(DISTINCT d.id) as total_stores, - COUNT(DISTINCT dp.brand_name) as brand_count, - COUNT(*) as total_skus - FROM dispensaries d - LEFT JOIN dutchie_products dp ON d.id = dp.dispensary_id - GROUP BY d.state - ORDER BY total_stores DESC - `); - - return result.rows.map(row => ({ - state: row.state, - totalStores: parseInt(row.total_stores) || 0, - storesWithBrand: parseInt(row.brand_count) || 0, // Using brand count here - penetrationPercent: 100, // 
Full penetration for overall view - totalSkus: parseInt(row.total_skus) || 0, - })); - } - }, 30)).data; - } -} diff --git a/backend/src/dutchie-az/services/analytics/price-trends.ts b/backend/src/dutchie-az/services/analytics/price-trends.ts deleted file mode 100644 index 8c4e31bf..00000000 --- a/backend/src/dutchie-az/services/analytics/price-trends.ts +++ /dev/null @@ -1,534 +0,0 @@ -/** - * Price Trend Analytics Service - * - * Provides time-series price analytics including: - * - Price over time for products - * - Average MSRP/Wholesale by period - * - Price volatility scoring - * - Price compression detection - * - * Phase 3: Analytics Dashboards - */ - -import { Pool } from 'pg'; -import { AnalyticsCache, cacheKey } from './cache'; - -export interface PricePoint { - date: string; - minPrice: number | null; - maxPrice: number | null; - avgPrice: number | null; - wholesalePrice: number | null; - sampleSize: number; -} - -export interface PriceTrend { - productId?: number; - storeId?: number; - brandName?: string; - category?: string; - dataPoints: PricePoint[]; - summary: { - currentAvg: number | null; - previousAvg: number | null; - changePercent: number | null; - trend: 'up' | 'down' | 'stable'; - volatilityScore: number | null; - }; -} - -export interface PriceSummary { - avg7d: number | null; - avg30d: number | null; - avg90d: number | null; - wholesaleAvg7d: number | null; - wholesaleAvg30d: number | null; - wholesaleAvg90d: number | null; - minPrice: number | null; - maxPrice: number | null; - priceRange: number | null; - volatilityScore: number | null; -} - -export interface PriceCompressionResult { - category: string; - brands: Array<{ - brandName: string; - avgPrice: number; - priceDistance: number; // distance from category mean - }>; - compressionScore: number; // 0-100, higher = more compressed - standardDeviation: number; -} - -export interface PriceFilters { - storeId?: number; - brandName?: string; - category?: string; - state?: string; - days?: 
number; -} - -export class PriceTrendService { - private pool: Pool; - private cache: AnalyticsCache; - - constructor(pool: Pool, cache: AnalyticsCache) { - this.pool = pool; - this.cache = cache; - } - - /** - * Get price trend for a specific product - */ - async getProductPriceTrend( - productId: number, - storeId?: number, - days: number = 30 - ): Promise { - const key = cacheKey('price_trend_product', { productId, storeId, days }); - - return (await this.cache.getOrCompute(key, async () => { - // Try to get from snapshots first - const snapshotResult = await this.pool.query(` - SELECT - DATE(crawled_at) as date, - MIN(rec_min_price_cents) / 100.0 as min_price, - MAX(rec_max_price_cents) / 100.0 as max_price, - AVG(rec_min_price_cents) / 100.0 as avg_price, - AVG(wholesale_min_price_cents) / 100.0 as wholesale_price, - COUNT(*) as sample_size - FROM dutchie_product_snapshots - WHERE dutchie_product_id = $1 - AND crawled_at >= NOW() - ($2 || ' days')::INTERVAL - ${storeId ? 'AND dispensary_id = $3' : ''} - GROUP BY DATE(crawled_at) - ORDER BY date - `, storeId ? 
[productId, days, storeId] : [productId, days]); - - let dataPoints: PricePoint[] = snapshotResult.rows.map(row => ({ - date: row.date.toISOString().split('T')[0], - minPrice: parseFloat(row.min_price) || null, - maxPrice: parseFloat(row.max_price) || null, - avgPrice: parseFloat(row.avg_price) || null, - wholesalePrice: parseFloat(row.wholesale_price) || null, - sampleSize: parseInt(row.sample_size), - })); - - // If no snapshots, get current price from product - if (dataPoints.length === 0) { - const productResult = await this.pool.query(` - SELECT - extract_min_price(latest_raw_payload) as min_price, - extract_max_price(latest_raw_payload) as max_price, - extract_wholesale_price(latest_raw_payload) as wholesale_price - FROM dutchie_products - WHERE id = $1 - `, [productId]); - - if (productResult.rows.length > 0) { - const row = productResult.rows[0]; - dataPoints = [{ - date: new Date().toISOString().split('T')[0], - minPrice: parseFloat(row.min_price) || null, - maxPrice: parseFloat(row.max_price) || null, - avgPrice: parseFloat(row.min_price) || null, - wholesalePrice: parseFloat(row.wholesale_price) || null, - sampleSize: 1, - }]; - } - } - - const summary = this.calculatePriceSummary(dataPoints); - - return { - productId, - storeId, - dataPoints, - summary, - }; - }, 15)).data; - } - - /** - * Get price trends by brand - */ - async getBrandPriceTrend( - brandName: string, - filters: PriceFilters = {} - ): Promise { - const { storeId, category, state, days = 30 } = filters; - const key = cacheKey('price_trend_brand', { brandName, storeId, category, state, days }); - - return (await this.cache.getOrCompute(key, async () => { - // Use current product data aggregated by date - const result = await this.pool.query(` - SELECT - DATE(dp.updated_at) as date, - MIN(extract_min_price(dp.latest_raw_payload)) as min_price, - MAX(extract_max_price(dp.latest_raw_payload)) as max_price, - AVG(extract_min_price(dp.latest_raw_payload)) as avg_price, - 
AVG(extract_wholesale_price(dp.latest_raw_payload)) as wholesale_price, - COUNT(*) as sample_size - FROM dutchie_products dp - JOIN dispensaries d ON dp.dispensary_id = d.id - WHERE dp.brand_name = $1 - AND dp.updated_at >= NOW() - ($2 || ' days')::INTERVAL - ${storeId ? 'AND dp.dispensary_id = $3' : ''} - ${category ? `AND dp.type = $${storeId ? 4 : 3}` : ''} - ${state ? `AND d.state = $${storeId ? (category ? 5 : 4) : (category ? 4 : 3)}` : ''} - GROUP BY DATE(dp.updated_at) - ORDER BY date - `, this.buildParams([brandName, days], { storeId, category, state })); - - const dataPoints: PricePoint[] = result.rows.map(row => ({ - date: row.date.toISOString().split('T')[0], - minPrice: parseFloat(row.min_price) || null, - maxPrice: parseFloat(row.max_price) || null, - avgPrice: parseFloat(row.avg_price) || null, - wholesalePrice: parseFloat(row.wholesale_price) || null, - sampleSize: parseInt(row.sample_size), - })); - - return { - brandName, - storeId, - category, - dataPoints, - summary: this.calculatePriceSummary(dataPoints), - }; - }, 15)).data; - } - - /** - * Get price trends by category - */ - async getCategoryPriceTrend( - category: string, - filters: PriceFilters = {} - ): Promise { - const { storeId, brandName, state, days = 30 } = filters; - const key = cacheKey('price_trend_category', { category, storeId, brandName, state, days }); - - return (await this.cache.getOrCompute(key, async () => { - const result = await this.pool.query(` - SELECT - DATE(dp.updated_at) as date, - MIN(extract_min_price(dp.latest_raw_payload)) as min_price, - MAX(extract_max_price(dp.latest_raw_payload)) as max_price, - AVG(extract_min_price(dp.latest_raw_payload)) as avg_price, - AVG(extract_wholesale_price(dp.latest_raw_payload)) as wholesale_price, - COUNT(*) as sample_size - FROM dutchie_products dp - JOIN dispensaries d ON dp.dispensary_id = d.id - WHERE dp.type = $1 - AND dp.updated_at >= NOW() - ($2 || ' days')::INTERVAL - ${storeId ? 
'AND dp.dispensary_id = $3' : ''} - ${brandName ? `AND dp.brand_name = $${storeId ? 4 : 3}` : ''} - ${state ? `AND d.state = $${storeId ? (brandName ? 5 : 4) : (brandName ? 4 : 3)}` : ''} - GROUP BY DATE(dp.updated_at) - ORDER BY date - `, this.buildParams([category, days], { storeId, brandName, state })); - - const dataPoints: PricePoint[] = result.rows.map(row => ({ - date: row.date.toISOString().split('T')[0], - minPrice: parseFloat(row.min_price) || null, - maxPrice: parseFloat(row.max_price) || null, - avgPrice: parseFloat(row.avg_price) || null, - wholesalePrice: parseFloat(row.wholesale_price) || null, - sampleSize: parseInt(row.sample_size), - })); - - return { - category, - storeId, - brandName, - dataPoints, - summary: this.calculatePriceSummary(dataPoints), - }; - }, 15)).data; - } - - /** - * Get price summary statistics - */ - async getPriceSummary(filters: PriceFilters = {}): Promise { - const { storeId, brandName, category, state } = filters; - const key = cacheKey('price_summary', filters as Record); - - return (await this.cache.getOrCompute(key, async () => { - const whereConditions: string[] = []; - const params: (string | number)[] = []; - let paramIndex = 1; - - if (storeId) { - whereConditions.push(`dp.dispensary_id = $${paramIndex++}`); - params.push(storeId); - } - if (brandName) { - whereConditions.push(`dp.brand_name = $${paramIndex++}`); - params.push(brandName); - } - if (category) { - whereConditions.push(`dp.type = $${paramIndex++}`); - params.push(category); - } - if (state) { - whereConditions.push(`d.state = $${paramIndex++}`); - params.push(state); - } - - const whereClause = whereConditions.length > 0 - ? 
'WHERE ' + whereConditions.join(' AND ') - : ''; - - const result = await this.pool.query(` - WITH prices AS ( - SELECT - extract_min_price(dp.latest_raw_payload) as min_price, - extract_max_price(dp.latest_raw_payload) as max_price, - extract_wholesale_price(dp.latest_raw_payload) as wholesale_price - FROM dutchie_products dp - JOIN dispensaries d ON dp.dispensary_id = d.id - ${whereClause} - ) - SELECT - AVG(min_price) as avg_price, - AVG(wholesale_price) as avg_wholesale, - MIN(min_price) as min_price, - MAX(max_price) as max_price, - STDDEV(min_price) as std_dev - FROM prices - WHERE min_price IS NOT NULL - `, params); - - const row = result.rows[0]; - const avgPrice = parseFloat(row.avg_price) || null; - const stdDev = parseFloat(row.std_dev) || null; - const volatility = avgPrice && stdDev ? (stdDev / avgPrice) * 100 : null; - - return { - avg7d: avgPrice, // Using current data as proxy - avg30d: avgPrice, - avg90d: avgPrice, - wholesaleAvg7d: parseFloat(row.avg_wholesale) || null, - wholesaleAvg30d: parseFloat(row.avg_wholesale) || null, - wholesaleAvg90d: parseFloat(row.avg_wholesale) || null, - minPrice: parseFloat(row.min_price) || null, - maxPrice: parseFloat(row.max_price) || null, - priceRange: row.max_price && row.min_price - ? parseFloat(row.max_price) - parseFloat(row.min_price) - : null, - volatilityScore: volatility ? 
Math.round(volatility * 10) / 10 : null, - }; - }, 30)).data; - } - - /** - * Detect price compression in a category - */ - async detectPriceCompression( - category: string, - state?: string - ): Promise { - const key = cacheKey('price_compression', { category, state }); - - return (await this.cache.getOrCompute(key, async () => { - const result = await this.pool.query(` - WITH brand_prices AS ( - SELECT - dp.brand_name, - AVG(extract_min_price(dp.latest_raw_payload)) as avg_price, - COUNT(*) as sku_count - FROM dutchie_products dp - JOIN dispensaries d ON dp.dispensary_id = d.id - WHERE dp.type = $1 - AND dp.brand_name IS NOT NULL - ${state ? 'AND d.state = $2' : ''} - GROUP BY dp.brand_name - HAVING COUNT(*) >= 3 - ), - stats AS ( - SELECT - AVG(avg_price) as category_avg, - STDDEV(avg_price) as std_dev - FROM brand_prices - WHERE avg_price IS NOT NULL - ) - SELECT - bp.brand_name, - bp.avg_price, - ABS(bp.avg_price - s.category_avg) as price_distance, - s.category_avg, - s.std_dev - FROM brand_prices bp, stats s - WHERE bp.avg_price IS NOT NULL - ORDER BY bp.avg_price - `, state ? [category, state] : [category]); - - if (result.rows.length === 0) { - return { - category, - brands: [], - compressionScore: 0, - standardDeviation: 0, - }; - } - - const categoryAvg = parseFloat(result.rows[0].category_avg) || 0; - const stdDev = parseFloat(result.rows[0].std_dev) || 0; - - // Compression score: lower std dev relative to mean = more compression - // Scale to 0-100 where 100 = very compressed - const cv = categoryAvg > 0 ? 
(stdDev / categoryAvg) * 100 : 0; - const compressionScore = Math.max(0, Math.min(100, 100 - cv)); - - const brands = result.rows.map(row => ({ - brandName: row.brand_name, - avgPrice: parseFloat(row.avg_price) || 0, - priceDistance: parseFloat(row.price_distance) || 0, - })); - - return { - category, - brands, - compressionScore: Math.round(compressionScore), - standardDeviation: Math.round(stdDev * 100) / 100, - }; - }, 30)).data; - } - - /** - * Get global price statistics - */ - async getGlobalPriceStats(): Promise<{ - totalProductsWithPrice: number; - avgPrice: number | null; - medianPrice: number | null; - priceByCategory: Array<{ category: string; avgPrice: number; count: number }>; - priceByState: Array<{ state: string; avgPrice: number; count: number }>; - }> { - const key = 'global_price_stats'; - - return (await this.cache.getOrCompute(key, async () => { - const [countResult, categoryResult, stateResult] = await Promise.all([ - this.pool.query(` - SELECT - COUNT(*) FILTER (WHERE extract_min_price(latest_raw_payload) IS NOT NULL) as with_price, - AVG(extract_min_price(latest_raw_payload)) as avg_price, - PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY extract_min_price(latest_raw_payload)) as median - FROM dutchie_products - `), - this.pool.query(` - SELECT - type as category, - AVG(extract_min_price(latest_raw_payload)) as avg_price, - COUNT(*) as count - FROM dutchie_products - WHERE type IS NOT NULL - AND extract_min_price(latest_raw_payload) IS NOT NULL - GROUP BY type - ORDER BY avg_price DESC - `), - this.pool.query(` - SELECT - d.state, - AVG(extract_min_price(dp.latest_raw_payload)) as avg_price, - COUNT(*) as count - FROM dutchie_products dp - JOIN dispensaries d ON dp.dispensary_id = d.id - WHERE extract_min_price(dp.latest_raw_payload) IS NOT NULL - GROUP BY d.state - ORDER BY avg_price DESC - `), - ]); - - return { - totalProductsWithPrice: parseInt(countResult.rows[0]?.with_price || '0'), - avgPrice: parseFloat(countResult.rows[0]?.avg_price) || 
null, - medianPrice: parseFloat(countResult.rows[0]?.median) || null, - priceByCategory: categoryResult.rows.map(r => ({ - category: r.category, - avgPrice: parseFloat(r.avg_price) || 0, - count: parseInt(r.count), - })), - priceByState: stateResult.rows.map(r => ({ - state: r.state, - avgPrice: parseFloat(r.avg_price) || 0, - count: parseInt(r.count), - })), - }; - }, 30)).data; - } - - // ============================================================ - // HELPER METHODS - // ============================================================ - - private calculatePriceSummary(dataPoints: PricePoint[]): PriceTrend['summary'] { - if (dataPoints.length === 0) { - return { - currentAvg: null, - previousAvg: null, - changePercent: null, - trend: 'stable', - volatilityScore: null, - }; - } - - const prices = dataPoints - .map(d => d.avgPrice) - .filter((p): p is number => p !== null); - - if (prices.length === 0) { - return { - currentAvg: null, - previousAvg: null, - changePercent: null, - trend: 'stable', - volatilityScore: null, - }; - } - - const currentAvg = prices[prices.length - 1]; - const midpoint = Math.floor(prices.length / 2); - const previousAvg = prices.length > 1 ? prices[midpoint] : currentAvg; - - const changePercent = previousAvg > 0 - ? ((currentAvg - previousAvg) / previousAvg) * 100 - : null; - - // Calculate volatility (coefficient of variation) - const mean = prices.reduce((a, b) => a + b, 0) / prices.length; - const variance = prices.reduce((sum, p) => sum + Math.pow(p - mean, 2), 0) / prices.length; - const stdDev = Math.sqrt(variance); - const volatilityScore = mean > 0 ? (stdDev / mean) * 100 : null; - - let trend: 'up' | 'down' | 'stable' = 'stable'; - if (changePercent !== null) { - if (changePercent > 5) trend = 'up'; - else if (changePercent < -5) trend = 'down'; - } - - return { - currentAvg: Math.round(currentAvg * 100) / 100, - previousAvg: Math.round(previousAvg * 100) / 100, - changePercent: changePercent !== null ? 
Math.round(changePercent * 10) / 10 : null, - trend, - volatilityScore: volatilityScore !== null ? Math.round(volatilityScore * 10) / 10 : null, - }; - } - - private buildParams( - baseParams: (string | number)[], - optionalParams: Record - ): (string | number)[] { - const params = [...baseParams]; - for (const value of Object.values(optionalParams)) { - if (value !== undefined) { - params.push(value); - } - } - return params; - } -} diff --git a/backend/src/dutchie-az/services/analytics/store-changes.ts b/backend/src/dutchie-az/services/analytics/store-changes.ts deleted file mode 100644 index 5a744070..00000000 --- a/backend/src/dutchie-az/services/analytics/store-changes.ts +++ /dev/null @@ -1,587 +0,0 @@ -/** - * Store Change Tracking Service - * - * Tracks changes at the store level including: - * - New/lost brands - * - New/discontinued products - * - Stock status transitions - * - Price changes - * - Category movement leaderboards - * - * Phase 3: Analytics Dashboards - */ - -import { Pool } from 'pg'; -import { AnalyticsCache, cacheKey } from './cache'; - -export interface StoreChangeSummary { - storeId: number; - storeName: string; - city: string; - state: string; - brandsAdded7d: number; - brandsAdded30d: number; - brandsLost7d: number; - brandsLost30d: number; - productsAdded7d: number; - productsAdded30d: number; - productsDiscontinued7d: number; - productsDiscontinued30d: number; - priceDrops7d: number; - priceIncreases7d: number; - restocks7d: number; - stockOuts7d: number; -} - -export interface StoreChangeEvent { - id: number; - storeId: number; - storeName: string; - eventType: string; - eventDate: string; - brandName: string | null; - productName: string | null; - category: string | null; - oldValue: string | null; - newValue: string | null; - metadata: Record | null; -} - -export interface BrandChange { - brandName: string; - changeType: 'added' | 'removed'; - date: string; - skuCount: number; - categories: string[]; -} - -export interface 
ProductChange { - productId: number; - productName: string; - brandName: string | null; - category: string | null; - changeType: 'added' | 'discontinued' | 'price_drop' | 'price_increase' | 'restocked' | 'out_of_stock'; - date: string; - oldValue?: string; - newValue?: string; -} - -export interface CategoryLeaderboard { - category: string; - storeId: number; - storeName: string; - skuCount: number; - brandCount: number; - avgPrice: number | null; - changePercent7d: number; - rank: number; -} - -export interface StoreFilters { - storeId?: number; - state?: string; - days?: number; - eventType?: string; -} - -export class StoreChangeService { - private pool: Pool; - private cache: AnalyticsCache; - - constructor(pool: Pool, cache: AnalyticsCache) { - this.pool = pool; - this.cache = cache; - } - - /** - * Get change summary for a store - */ - async getStoreChangeSummary( - storeId: number - ): Promise { - const key = cacheKey('store_change_summary', { storeId }); - - return (await this.cache.getOrCompute(key, async () => { - // Get store info - const storeResult = await this.pool.query(` - SELECT id, name, city, state FROM dispensaries WHERE id = $1 - `, [storeId]); - - if (storeResult.rows.length === 0) return null; - const store = storeResult.rows[0]; - - // Get change events counts - const eventsResult = await this.pool.query(` - SELECT - event_type, - COUNT(*) FILTER (WHERE event_date >= CURRENT_DATE - INTERVAL '7 days') as count_7d, - COUNT(*) FILTER (WHERE event_date >= CURRENT_DATE - INTERVAL '30 days') as count_30d - FROM store_change_events - WHERE store_id = $1 - GROUP BY event_type - `, [storeId]); - - const counts: Record = {}; - eventsResult.rows.forEach(row => { - counts[row.event_type] = { - count_7d: parseInt(row.count_7d) || 0, - count_30d: parseInt(row.count_30d) || 0, - }; - }); - - return { - storeId: store.id, - storeName: store.name, - city: store.city, - state: store.state, - brandsAdded7d: counts['brand_added']?.count_7d || 0, - 
brandsAdded30d: counts['brand_added']?.count_30d || 0, - brandsLost7d: counts['brand_removed']?.count_7d || 0, - brandsLost30d: counts['brand_removed']?.count_30d || 0, - productsAdded7d: counts['product_added']?.count_7d || 0, - productsAdded30d: counts['product_added']?.count_30d || 0, - productsDiscontinued7d: counts['product_removed']?.count_7d || 0, - productsDiscontinued30d: counts['product_removed']?.count_30d || 0, - priceDrops7d: counts['price_drop']?.count_7d || 0, - priceIncreases7d: counts['price_increase']?.count_7d || 0, - restocks7d: counts['restocked']?.count_7d || 0, - stockOuts7d: counts['out_of_stock']?.count_7d || 0, - }; - }, 15)).data; - } - - /** - * Get recent change events for a store - */ - async getStoreChangeEvents( - storeId: number, - filters: { eventType?: string; days?: number; limit?: number } = {} - ): Promise { - const { eventType, days = 30, limit = 100 } = filters; - const key = cacheKey('store_change_events', { storeId, eventType, days, limit }); - - return (await this.cache.getOrCompute(key, async () => { - const params: (string | number)[] = [storeId, days, limit]; - let eventTypeCondition = ''; - - if (eventType) { - eventTypeCondition = 'AND event_type = $4'; - params.push(eventType); - } - - const result = await this.pool.query(` - SELECT - sce.id, - sce.store_id, - d.name as store_name, - sce.event_type, - sce.event_date, - sce.brand_name, - sce.product_name, - sce.category, - sce.old_value, - sce.new_value, - sce.metadata - FROM store_change_events sce - JOIN dispensaries d ON sce.store_id = d.id - WHERE sce.store_id = $1 - AND sce.event_date >= CURRENT_DATE - ($2 || ' days')::INTERVAL - ${eventTypeCondition} - ORDER BY sce.event_date DESC, sce.id DESC - LIMIT $3 - `, params); - - return result.rows.map(row => ({ - id: row.id, - storeId: row.store_id, - storeName: row.store_name, - eventType: row.event_type, - eventDate: row.event_date.toISOString().split('T')[0], - brandName: row.brand_name, - productName: 
row.product_name, - category: row.category, - oldValue: row.old_value, - newValue: row.new_value, - metadata: row.metadata, - })); - }, 5)).data; - } - - /** - * Get new brands added to a store - */ - async getNewBrands( - storeId: number, - days: number = 30 - ): Promise { - const key = cacheKey('new_brands', { storeId, days }); - - return (await this.cache.getOrCompute(key, async () => { - const result = await this.pool.query(` - SELECT - brand_name, - event_date, - metadata - FROM store_change_events - WHERE store_id = $1 - AND event_type = 'brand_added' - AND event_date >= CURRENT_DATE - ($2 || ' days')::INTERVAL - ORDER BY event_date DESC - `, [storeId, days]); - - return result.rows.map(row => ({ - brandName: row.brand_name, - changeType: 'added' as const, - date: row.event_date.toISOString().split('T')[0], - skuCount: row.metadata?.sku_count || 0, - categories: row.metadata?.categories || [], - })); - }, 15)).data; - } - - /** - * Get brands lost from a store - */ - async getLostBrands( - storeId: number, - days: number = 30 - ): Promise { - const key = cacheKey('lost_brands', { storeId, days }); - - return (await this.cache.getOrCompute(key, async () => { - const result = await this.pool.query(` - SELECT - brand_name, - event_date, - metadata - FROM store_change_events - WHERE store_id = $1 - AND event_type = 'brand_removed' - AND event_date >= CURRENT_DATE - ($2 || ' days')::INTERVAL - ORDER BY event_date DESC - `, [storeId, days]); - - return result.rows.map(row => ({ - brandName: row.brand_name, - changeType: 'removed' as const, - date: row.event_date.toISOString().split('T')[0], - skuCount: row.metadata?.sku_count || 0, - categories: row.metadata?.categories || [], - })); - }, 15)).data; - } - - /** - * Get product changes for a store - */ - async getProductChanges( - storeId: number, - changeType?: 'added' | 'discontinued' | 'price_drop' | 'price_increase' | 'restocked' | 'out_of_stock', - days: number = 7 - ): Promise { - const key = 
cacheKey('product_changes', { storeId, changeType, days }); - - return (await this.cache.getOrCompute(key, async () => { - const eventTypeMap: Record = { - 'added': 'product_added', - 'discontinued': 'product_removed', - 'price_drop': 'price_drop', - 'price_increase': 'price_increase', - 'restocked': 'restocked', - 'out_of_stock': 'out_of_stock', - }; - - const params: (string | number)[] = [storeId, days]; - let eventCondition = ''; - - if (changeType) { - eventCondition = 'AND event_type = $3'; - params.push(eventTypeMap[changeType]); - } - - const result = await this.pool.query(` - SELECT - product_id, - product_name, - brand_name, - category, - event_type, - event_date, - old_value, - new_value - FROM store_change_events - WHERE store_id = $1 - AND event_date >= CURRENT_DATE - ($2 || ' days')::INTERVAL - AND product_id IS NOT NULL - ${eventCondition} - ORDER BY event_date DESC - LIMIT 100 - `, params); - - const reverseMap: Record = { - 'product_added': 'added', - 'product_removed': 'discontinued', - 'price_drop': 'price_drop', - 'price_increase': 'price_increase', - 'restocked': 'restocked', - 'out_of_stock': 'out_of_stock', - }; - - return result.rows.map(row => ({ - productId: row.product_id, - productName: row.product_name, - brandName: row.brand_name, - category: row.category, - changeType: reverseMap[row.event_type] || 'added', - date: row.event_date.toISOString().split('T')[0], - oldValue: row.old_value, - newValue: row.new_value, - })); - }, 5)).data; - } - - /** - * Get category leaderboard across stores - */ - async getCategoryLeaderboard( - category: string, - limit: number = 20 - ): Promise { - const key = cacheKey('category_leaderboard', { category, limit }); - - return (await this.cache.getOrCompute(key, async () => { - const result = await this.pool.query(` - WITH store_category_stats AS ( - SELECT - dp.dispensary_id as store_id, - d.name as store_name, - COUNT(*) as sku_count, - COUNT(DISTINCT dp.brand_name) as brand_count, - 
AVG(extract_min_price(dp.latest_raw_payload)) as avg_price - FROM dutchie_products dp - JOIN dispensaries d ON dp.dispensary_id = d.id - WHERE dp.type = $1 - GROUP BY dp.dispensary_id, d.name - ) - SELECT - scs.*, - RANK() OVER (ORDER BY scs.sku_count DESC) as rank - FROM store_category_stats scs - ORDER BY scs.sku_count DESC - LIMIT $2 - `, [category, limit]); - - return result.rows.map(row => ({ - category, - storeId: row.store_id, - storeName: row.store_name, - skuCount: parseInt(row.sku_count) || 0, - brandCount: parseInt(row.brand_count) || 0, - avgPrice: row.avg_price ? Math.round(parseFloat(row.avg_price) * 100) / 100 : null, - changePercent7d: 0, // Would need historical data - rank: parseInt(row.rank) || 0, - })); - }, 15)).data; - } - - /** - * Get stores with most activity (changes) - */ - async getMostActiveStores( - days: number = 7, - limit: number = 10 - ): Promise> { - const key = cacheKey('most_active_stores', { days, limit }); - - return (await this.cache.getOrCompute(key, async () => { - const result = await this.pool.query(` - SELECT - d.id as store_id, - d.name as store_name, - d.city, - d.state, - COUNT(*) as total_changes, - COUNT(*) FILTER (WHERE sce.event_type IN ('brand_added', 'brand_removed')) as brands_changed, - COUNT(*) FILTER (WHERE sce.event_type IN ('product_added', 'product_removed')) as products_changed, - COUNT(*) FILTER (WHERE sce.event_type IN ('price_drop', 'price_increase')) as price_changes, - COUNT(*) FILTER (WHERE sce.event_type IN ('restocked', 'out_of_stock')) as stock_changes - FROM store_change_events sce - JOIN dispensaries d ON sce.store_id = d.id - WHERE sce.event_date >= CURRENT_DATE - ($1 || ' days')::INTERVAL - GROUP BY d.id, d.name, d.city, d.state - ORDER BY total_changes DESC - LIMIT $2 - `, [days, limit]); - - return result.rows.map(row => ({ - storeId: row.store_id, - storeName: row.store_name, - city: row.city, - state: row.state, - totalChanges: parseInt(row.total_changes) || 0, - brandsChanged: 
parseInt(row.brands_changed) || 0, - productsChanged: parseInt(row.products_changed) || 0, - priceChanges: parseInt(row.price_changes) || 0, - stockChanges: parseInt(row.stock_changes) || 0, - })); - }, 15)).data; - } - - /** - * Compare two stores - */ - async compareStores( - storeId1: number, - storeId2: number - ): Promise<{ - store1: { id: number; name: string; brands: string[]; categories: string[]; skuCount: number }; - store2: { id: number; name: string; brands: string[]; categories: string[]; skuCount: number }; - sharedBrands: string[]; - uniqueToStore1: string[]; - uniqueToStore2: string[]; - categoryComparison: Array<{ - category: string; - store1Skus: number; - store2Skus: number; - difference: number; - }>; - }> { - const key = cacheKey('compare_stores', { storeId1, storeId2 }); - - return (await this.cache.getOrCompute(key, async () => { - const [store1Data, store2Data] = await Promise.all([ - this.pool.query(` - SELECT - d.id, d.name, - ARRAY_AGG(DISTINCT dp.brand_name) FILTER (WHERE dp.brand_name IS NOT NULL) as brands, - ARRAY_AGG(DISTINCT dp.type) FILTER (WHERE dp.type IS NOT NULL) as categories, - COUNT(*) as sku_count - FROM dispensaries d - LEFT JOIN dutchie_products dp ON d.id = dp.dispensary_id - WHERE d.id = $1 - GROUP BY d.id, d.name - `, [storeId1]), - this.pool.query(` - SELECT - d.id, d.name, - ARRAY_AGG(DISTINCT dp.brand_name) FILTER (WHERE dp.brand_name IS NOT NULL) as brands, - ARRAY_AGG(DISTINCT dp.type) FILTER (WHERE dp.type IS NOT NULL) as categories, - COUNT(*) as sku_count - FROM dispensaries d - LEFT JOIN dutchie_products dp ON d.id = dp.dispensary_id - WHERE d.id = $1 - GROUP BY d.id, d.name - `, [storeId2]), - ]); - - const s1 = store1Data.rows[0]; - const s2 = store2Data.rows[0]; - - const brands1Array: string[] = (s1?.brands || []).filter((b: string | null): b is string => b !== null); - const brands2Array: string[] = (s2?.brands || []).filter((b: string | null): b is string => b !== null); - const brands1 = new 
Set(brands1Array); - const brands2 = new Set(brands2Array); - - const sharedBrands: string[] = brands1Array.filter(b => brands2.has(b)); - const uniqueToStore1: string[] = brands1Array.filter(b => !brands2.has(b)); - const uniqueToStore2: string[] = brands2Array.filter(b => !brands1.has(b)); - - // Category comparison - const categoryResult = await this.pool.query(` - WITH store1_cats AS ( - SELECT type as category, COUNT(*) as sku_count - FROM dutchie_products WHERE dispensary_id = $1 AND type IS NOT NULL - GROUP BY type - ), - store2_cats AS ( - SELECT type as category, COUNT(*) as sku_count - FROM dutchie_products WHERE dispensary_id = $2 AND type IS NOT NULL - GROUP BY type - ), - all_cats AS ( - SELECT category FROM store1_cats - UNION - SELECT category FROM store2_cats - ) - SELECT - ac.category, - COALESCE(s1.sku_count, 0) as store1_skus, - COALESCE(s2.sku_count, 0) as store2_skus - FROM all_cats ac - LEFT JOIN store1_cats s1 ON ac.category = s1.category - LEFT JOIN store2_cats s2 ON ac.category = s2.category - ORDER BY (COALESCE(s1.sku_count, 0) + COALESCE(s2.sku_count, 0)) DESC - `, [storeId1, storeId2]); - - return { - store1: { - id: s1?.id || storeId1, - name: s1?.name || 'Unknown', - brands: s1?.brands || [], - categories: s1?.categories || [], - skuCount: parseInt(s1?.sku_count) || 0, - }, - store2: { - id: s2?.id || storeId2, - name: s2?.name || 'Unknown', - brands: s2?.brands || [], - categories: s2?.categories || [], - skuCount: parseInt(s2?.sku_count) || 0, - }, - sharedBrands, - uniqueToStore1, - uniqueToStore2, - categoryComparison: categoryResult.rows.map(row => ({ - category: row.category, - store1Skus: parseInt(row.store1_skus) || 0, - store2Skus: parseInt(row.store2_skus) || 0, - difference: (parseInt(row.store1_skus) || 0) - (parseInt(row.store2_skus) || 0), - })), - }; - }, 15)).data; - } - - /** - * Record a change event (used by crawler/worker) - */ - async recordChangeEvent(event: { - storeId: number; - eventType: string; - brandName?: 
string; - productId?: number; - productName?: string; - category?: string; - oldValue?: string; - newValue?: string; - metadata?: Record; - }): Promise { - await this.pool.query(` - INSERT INTO store_change_events - (store_id, event_type, event_date, brand_name, product_id, product_name, category, old_value, new_value, metadata) - VALUES ($1, $2, CURRENT_DATE, $3, $4, $5, $6, $7, $8, $9) - `, [ - event.storeId, - event.eventType, - event.brandName || null, - event.productId || null, - event.productName || null, - event.category || null, - event.oldValue || null, - event.newValue || null, - event.metadata ? JSON.stringify(event.metadata) : null, - ]); - - // Invalidate cache - await this.cache.invalidatePattern(`store_change_summary:storeId=${event.storeId}`); - } -} diff --git a/backend/src/dutchie-az/services/azdhs-import.ts b/backend/src/dutchie-az/services/azdhs-import.ts deleted file mode 100644 index 9f944518..00000000 --- a/backend/src/dutchie-az/services/azdhs-import.ts +++ /dev/null @@ -1,266 +0,0 @@ -/** - * LEGACY SERVICE - AZDHS Import - * - * DEPRECATED: This service creates its own database pool. - * Future implementations should use the canonical CannaiQ connection. - * - * Imports Arizona dispensaries from the main database's dispensaries table - * (which was populated from AZDHS data) into the isolated Dutchie AZ database. - * - * This establishes the canonical list of AZ dispensaries to match against Dutchie. 
- * - * DO NOT: - * - Run this in automated jobs - * - Use DATABASE_URL directly - */ - -import { Pool } from 'pg'; -import { query as dutchieQuery } from '../db/connection'; -import { Dispensary } from '../types'; - -// Single database connection (cannaiq in cannaiq-postgres container) -// Use CANNAIQ_DB_* env vars or defaults -const MAIN_DB_CONNECTION = process.env.CANNAIQ_DB_URL || - `postgresql://${process.env.CANNAIQ_DB_USER || 'dutchie'}:${process.env.CANNAIQ_DB_PASS || 'dutchie_local_pass'}@${process.env.CANNAIQ_DB_HOST || 'localhost'}:${process.env.CANNAIQ_DB_PORT || '54320'}/${process.env.CANNAIQ_DB_NAME || 'cannaiq'}`; - -/** - * AZDHS dispensary record from the main database - */ -interface AZDHSDispensary { - id: number; - azdhs_id: number; - name: string; - company_name?: string; - address?: string; - city: string; - state: string; - zip?: string; - latitude?: number; - longitude?: number; - dba_name?: string; - phone?: string; - email?: string; - website?: string; - google_rating?: string; - google_review_count?: number; - slug: string; - menu_provider?: string; - product_provider?: string; - created_at: Date; - updated_at: Date; -} - -/** - * Import result statistics - */ -interface ImportResult { - total: number; - imported: number; - skipped: number; - errors: string[]; -} - -/** - * Create a temporary connection to the main database - */ -function getMainDBPool(): Pool { - console.warn('[AZDHS Import] LEGACY: Using separate pool. 
Should use canonical CannaiQ connection.'); - return new Pool({ - connectionString: MAIN_DB_CONNECTION, - max: 5, - idleTimeoutMillis: 30000, - connectionTimeoutMillis: 5000, - }); -} - -/** - * Fetch all AZ dispensaries from the main database - */ -async function fetchAZDHSDispensaries(): Promise { - const pool = getMainDBPool(); - - try { - const result = await pool.query(` - SELECT - id, azdhs_id, name, company_name, address, city, state, zip, - latitude, longitude, dba_name, phone, email, website, - google_rating, google_review_count, slug, - menu_provider, product_provider, - created_at, updated_at - FROM dispensaries - WHERE state = 'AZ' - ORDER BY id - `); - - return result.rows; - } finally { - await pool.end(); - } -} - -/** - * Import a single dispensary into the Dutchie AZ database - */ -async function importDispensary(disp: AZDHSDispensary): Promise { - const result = await dutchieQuery<{ id: number }>( - ` - INSERT INTO dispensaries ( - platform, name, slug, city, state, postal_code, address, - latitude, longitude, is_delivery, is_pickup, raw_metadata, updated_at - ) VALUES ( - $1, $2, $3, $4, $5, $6, $7, - $8, $9, $10, $11, $12, NOW() - ) - ON CONFLICT (platform, slug, city, state) DO UPDATE SET - name = EXCLUDED.name, - postal_code = EXCLUDED.postal_code, - address = EXCLUDED.address, - latitude = EXCLUDED.latitude, - longitude = EXCLUDED.longitude, - raw_metadata = EXCLUDED.raw_metadata, - updated_at = NOW() - RETURNING id - `, - [ - 'dutchie', // Will be updated when Dutchie match is found - disp.dba_name || disp.name, - disp.slug, - disp.city, - disp.state, - disp.zip, - disp.address, - disp.latitude, - disp.longitude, - false, // is_delivery - unknown - true, // is_pickup - assume true - JSON.stringify({ - azdhs_id: disp.azdhs_id, - main_db_id: disp.id, - company_name: disp.company_name, - phone: disp.phone, - email: disp.email, - website: disp.website, - google_rating: disp.google_rating, - google_review_count: disp.google_review_count, - 
menu_provider: disp.menu_provider, - product_provider: disp.product_provider, - }), - ] - ); - - return result.rows[0].id; -} - -/** - * Import all AZDHS dispensaries into the Dutchie AZ database - */ -export async function importAZDHSDispensaries(): Promise { - console.log('[AZDHS Import] Starting import from main database...'); - - const result: ImportResult = { - total: 0, - imported: 0, - skipped: 0, - errors: [], - }; - - try { - const dispensaries = await fetchAZDHSDispensaries(); - result.total = dispensaries.length; - - console.log(`[AZDHS Import] Found ${dispensaries.length} AZ dispensaries in main DB`); - - for (const disp of dispensaries) { - try { - const id = await importDispensary(disp); - result.imported++; - console.log(`[AZDHS Import] Imported: ${disp.name} (${disp.city}) -> id=${id}`); - } catch (error: any) { - if (error.message.includes('duplicate')) { - result.skipped++; - } else { - result.errors.push(`${disp.name}: ${error.message}`); - } - } - } - } catch (error: any) { - result.errors.push(`Failed to fetch from main DB: ${error.message}`); - } - - console.log(`[AZDHS Import] Complete: ${result.imported} imported, ${result.skipped} skipped, ${result.errors.length} errors`); - return result; -} - -/** - * Import dispensaries from JSON file (backup export) - */ -export async function importFromJSON(jsonPath: string): Promise { - console.log(`[AZDHS Import] Importing from JSON: ${jsonPath}`); - - const result: ImportResult = { - total: 0, - imported: 0, - skipped: 0, - errors: [], - }; - - try { - const fs = await import('fs/promises'); - const data = await fs.readFile(jsonPath, 'utf-8'); - const dispensaries: AZDHSDispensary[] = JSON.parse(data); - - result.total = dispensaries.length; - console.log(`[AZDHS Import] Found ${dispensaries.length} dispensaries in JSON file`); - - for (const disp of dispensaries) { - try { - const id = await importDispensary(disp); - result.imported++; - } catch (error: any) { - if 
(error.message.includes('duplicate')) { - result.skipped++; - } else { - result.errors.push(`${disp.name}: ${error.message}`); - } - } - } - } catch (error: any) { - result.errors.push(`Failed to read JSON file: ${error.message}`); - } - - console.log(`[AZDHS Import] Complete: ${result.imported} imported, ${result.skipped} skipped`); - return result; -} - -/** - * Get import statistics - */ -export async function getImportStats(): Promise<{ - totalDispensaries: number; - withPlatformIds: number; - withoutPlatformIds: number; - lastImportedAt?: Date; -}> { - const { rows } = await dutchieQuery<{ - total: string; - with_platform_id: string; - without_platform_id: string; - last_updated: Date; - }>(` - SELECT - COUNT(*) as total, - COUNT(platform_dispensary_id) as with_platform_id, - COUNT(*) - COUNT(platform_dispensary_id) as without_platform_id, - MAX(updated_at) as last_updated - FROM dispensaries - WHERE state = 'AZ' - `); - - const stats = rows[0]; - return { - totalDispensaries: parseInt(stats.total, 10), - withPlatformIds: parseInt(stats.with_platform_id, 10), - withoutPlatformIds: parseInt(stats.without_platform_id, 10), - lastImportedAt: stats.last_updated, - }; -} diff --git a/backend/src/dutchie-az/services/directory-matcher.ts b/backend/src/dutchie-az/services/directory-matcher.ts deleted file mode 100644 index 39fc5af7..00000000 --- a/backend/src/dutchie-az/services/directory-matcher.ts +++ /dev/null @@ -1,481 +0,0 @@ -/** - * Directory-Based Store Matcher - * - * Scrapes provider directory pages (Curaleaf, Sol, etc.) to get store lists, - * then matches them to existing dispensaries by fuzzy name/city/address matching. - * - * This allows us to: - * 1. Find specific store URLs for directory-style websites - * 2. Match stores confidently by name+city - * 3. 
Mark non-Dutchie providers as not_crawlable until we build crawlers - */ - -import { query } from '../db/connection'; - -// ============================================================ -// TYPES -// ============================================================ - -export interface DirectoryStore { - name: string; - city: string; - state: string; - address: string | null; - storeUrl: string; -} - -export interface MatchResult { - directoryStore: DirectoryStore; - dispensaryId: number | null; - dispensaryName: string | null; - confidence: 'high' | 'medium' | 'low' | 'none'; - matchReason: string; -} - -export interface DirectoryMatchReport { - provider: string; - totalDirectoryStores: number; - highConfidenceMatches: number; - mediumConfidenceMatches: number; - lowConfidenceMatches: number; - unmatched: number; - results: MatchResult[]; -} - -// ============================================================ -// NORMALIZATION FUNCTIONS -// ============================================================ - -/** - * Normalize a string for comparison: - * - Lowercase - * - Remove common suffixes (dispensary, cannabis, etc.) - * - Remove punctuation - * - Collapse whitespace - */ -function normalizeForComparison(str: string): string { - if (!str) return ''; - - return str - .toLowerCase() - .replace(/\s+(dispensary|cannabis|marijuana|medical|recreational|shop|store|flower|wellness)(\s|$)/gi, ' ') - .replace(/[^\w\s]/g, ' ') // Remove punctuation - .replace(/\s+/g, ' ') // Collapse whitespace - .trim(); -} - -/** - * Normalize city name for comparison - */ -function normalizeCity(city: string): string { - if (!city) return ''; - - return city - .toLowerCase() - .replace(/[^\w\s]/g, '') - .trim(); -} - -/** - * Calculate similarity between two strings (0-1) - * Uses Levenshtein distance normalized by max length - */ -function stringSimilarity(a: string, b: string): number { - if (!a || !b) return 0; - if (a === b) return 1; - - const longer = a.length > b.length ? 
a : b; - const shorter = a.length > b.length ? b : a; - - if (longer.length === 0) return 1; - - const distance = levenshteinDistance(longer, shorter); - return (longer.length - distance) / longer.length; -} - -/** - * Levenshtein distance between two strings - */ -function levenshteinDistance(a: string, b: string): number { - const matrix: number[][] = []; - - for (let i = 0; i <= b.length; i++) { - matrix[i] = [i]; - } - - for (let j = 0; j <= a.length; j++) { - matrix[0][j] = j; - } - - for (let i = 1; i <= b.length; i++) { - for (let j = 1; j <= a.length; j++) { - if (b.charAt(i - 1) === a.charAt(j - 1)) { - matrix[i][j] = matrix[i - 1][j - 1]; - } else { - matrix[i][j] = Math.min( - matrix[i - 1][j - 1] + 1, // substitution - matrix[i][j - 1] + 1, // insertion - matrix[i - 1][j] + 1 // deletion - ); - } - } - } - - return matrix[b.length][a.length]; -} - -/** - * Check if string contains another (with normalization) - */ -function containsNormalized(haystack: string, needle: string): boolean { - return normalizeForComparison(haystack).includes(normalizeForComparison(needle)); -} - -// ============================================================ -// PROVIDER DIRECTORY SCRAPERS -// ============================================================ - -/** - * Sol Flower (livewithsol.com) - Static HTML, easy to scrape - */ -export async function scrapeSolDirectory(): Promise { - console.log('[DirectoryMatcher] Scraping Sol Flower directory...'); - - try { - const response = await fetch('https://www.livewithsol.com/locations/', { - headers: { - 'User-Agent': - 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36', - Accept: 'text/html', - }, - }); - - if (!response.ok) { - throw new Error(`HTTP ${response.status}`); - } - - const html = await response.text(); - - // Extract store entries from HTML - // Sol's structure: Each location has name, address in specific divs - const stores: DirectoryStore[] = []; - - // 
Pattern to find location cards - // Format: NAME with address nearby - const locationRegex = - /]+href="(\/locations\/[^"]+)"[^>]*>([^<]+)<\/a>[\s\S]*?(\d+[^<]+(?:Ave|St|Blvd|Dr|Rd|Way)[^<]*)/gi; - - let match; - while ((match = locationRegex.exec(html)) !== null) { - const [, path, name, address] = match; - - // Extract city from common Arizona cities - let city = 'Unknown'; - const cityPatterns = [ - { pattern: /phoenix/i, city: 'Phoenix' }, - { pattern: /scottsdale/i, city: 'Scottsdale' }, - { pattern: /tempe/i, city: 'Tempe' }, - { pattern: /tucson/i, city: 'Tucson' }, - { pattern: /mesa/i, city: 'Mesa' }, - { pattern: /sun city/i, city: 'Sun City' }, - { pattern: /glendale/i, city: 'Glendale' }, - ]; - - for (const { pattern, city: cityName } of cityPatterns) { - if (pattern.test(name) || pattern.test(address)) { - city = cityName; - break; - } - } - - stores.push({ - name: name.trim(), - city, - state: 'AZ', - address: address.trim(), - storeUrl: `https://www.livewithsol.com${path}`, - }); - } - - // If regex didn't work, use known hardcoded values (fallback) - if (stores.length === 0) { - console.log('[DirectoryMatcher] Using hardcoded Sol locations'); - return [ - { name: 'Sol Flower 32nd & Shea', city: 'Phoenix', state: 'AZ', address: '3217 E Shea Blvd Suite 1 A', storeUrl: 'https://www.livewithsol.com/locations/deer-valley/' }, - { name: 'Sol Flower Scottsdale Airpark', city: 'Scottsdale', state: 'AZ', address: '14980 N 78th Way Ste 204', storeUrl: 'https://www.livewithsol.com/locations/scottsdale-airpark/' }, - { name: 'Sol Flower Sun City', city: 'Sun City', state: 'AZ', address: '13650 N 99th Ave', storeUrl: 'https://www.livewithsol.com/locations/sun-city/' }, - { name: 'Sol Flower Tempe McClintock', city: 'Tempe', state: 'AZ', address: '1322 N McClintock Dr', storeUrl: 'https://www.livewithsol.com/locations/tempe-mcclintock/' }, - { name: 'Sol Flower Tempe University', city: 'Tempe', state: 'AZ', address: '2424 W University Dr', storeUrl: 
'https://www.livewithsol.com/locations/tempe-university/' }, - { name: 'Sol Flower Foothills Tucson', city: 'Tucson', state: 'AZ', address: '6026 N Oracle Rd', storeUrl: 'https://www.livewithsol.com/locations/foothills-tucson/' }, - { name: 'Sol Flower South Tucson', city: 'Tucson', state: 'AZ', address: '3000 W Valencia Rd Ste 210', storeUrl: 'https://www.livewithsol.com/locations/south-tucson/' }, - { name: 'Sol Flower North Tucson', city: 'Tucson', state: 'AZ', address: '4837 N 1st Ave', storeUrl: 'https://www.livewithsol.com/locations/north-tucson/' }, - { name: 'Sol Flower Casas Adobes', city: 'Tucson', state: 'AZ', address: '6437 N Oracle Rd', storeUrl: 'https://www.livewithsol.com/locations/casas-adobes/' }, - ]; - } - - console.log(`[DirectoryMatcher] Found ${stores.length} Sol Flower locations`); - return stores; - } catch (error: any) { - console.error('[DirectoryMatcher] Error scraping Sol directory:', error.message); - // Return hardcoded fallback - return [ - { name: 'Sol Flower 32nd & Shea', city: 'Phoenix', state: 'AZ', address: '3217 E Shea Blvd Suite 1 A', storeUrl: 'https://www.livewithsol.com/locations/deer-valley/' }, - { name: 'Sol Flower Scottsdale Airpark', city: 'Scottsdale', state: 'AZ', address: '14980 N 78th Way Ste 204', storeUrl: 'https://www.livewithsol.com/locations/scottsdale-airpark/' }, - { name: 'Sol Flower Sun City', city: 'Sun City', state: 'AZ', address: '13650 N 99th Ave', storeUrl: 'https://www.livewithsol.com/locations/sun-city/' }, - { name: 'Sol Flower Tempe McClintock', city: 'Tempe', state: 'AZ', address: '1322 N McClintock Dr', storeUrl: 'https://www.livewithsol.com/locations/tempe-mcclintock/' }, - { name: 'Sol Flower Tempe University', city: 'Tempe', state: 'AZ', address: '2424 W University Dr', storeUrl: 'https://www.livewithsol.com/locations/tempe-university/' }, - { name: 'Sol Flower Foothills Tucson', city: 'Tucson', state: 'AZ', address: '6026 N Oracle Rd', storeUrl: 
'https://www.livewithsol.com/locations/foothills-tucson/' }, - { name: 'Sol Flower South Tucson', city: 'Tucson', state: 'AZ', address: '3000 W Valencia Rd Ste 210', storeUrl: 'https://www.livewithsol.com/locations/south-tucson/' }, - { name: 'Sol Flower North Tucson', city: 'Tucson', state: 'AZ', address: '4837 N 1st Ave', storeUrl: 'https://www.livewithsol.com/locations/north-tucson/' }, - { name: 'Sol Flower Casas Adobes', city: 'Tucson', state: 'AZ', address: '6437 N Oracle Rd', storeUrl: 'https://www.livewithsol.com/locations/casas-adobes/' }, - ]; - } -} - -/** - * Curaleaf - Has age-gate, so we need hardcoded AZ locations - * In production, this would use Playwright to bypass age-gate - */ -export async function scrapeCuraleafDirectory(): Promise { - console.log('[DirectoryMatcher] Using hardcoded Curaleaf AZ locations (age-gate blocks simple fetch)...'); - - // Hardcoded Arizona Curaleaf locations from public knowledge - // These would be scraped via Playwright in production - return [ - { name: 'Curaleaf Phoenix Camelback', city: 'Phoenix', state: 'AZ', address: '4811 E Camelback Rd', storeUrl: 'https://curaleaf.com/stores/curaleaf-az-phoenix-camelback' }, - { name: 'Curaleaf Phoenix Midtown', city: 'Phoenix', state: 'AZ', address: '1928 E Highland Ave', storeUrl: 'https://curaleaf.com/stores/curaleaf-az-phoenix-midtown' }, - { name: 'Curaleaf Glendale East', city: 'Glendale', state: 'AZ', address: '5150 W Glendale Ave', storeUrl: 'https://curaleaf.com/stores/curaleaf-az-glendale-east' }, - { name: 'Curaleaf Glendale West', city: 'Glendale', state: 'AZ', address: '6501 W Glendale Ave', storeUrl: 'https://curaleaf.com/stores/curaleaf-az-glendale-west' }, - { name: 'Curaleaf Gilbert', city: 'Gilbert', state: 'AZ', address: '1736 E Williams Field Rd', storeUrl: 'https://curaleaf.com/stores/curaleaf-az-gilbert' }, - { name: 'Curaleaf Mesa', city: 'Mesa', state: 'AZ', address: '1540 S Power Rd', storeUrl: 'https://curaleaf.com/stores/curaleaf-az-mesa' }, - { 
name: 'Curaleaf Tempe', city: 'Tempe', state: 'AZ', address: '1815 E Broadway Rd', storeUrl: 'https://curaleaf.com/stores/curaleaf-az-tempe' }, - { name: 'Curaleaf Scottsdale', city: 'Scottsdale', state: 'AZ', address: '8904 E Indian Bend Rd', storeUrl: 'https://curaleaf.com/stores/curaleaf-az-scottsdale' }, - { name: 'Curaleaf Tucson Prince', city: 'Tucson', state: 'AZ', address: '3955 W Prince Rd', storeUrl: 'https://curaleaf.com/stores/curaleaf-az-tucson-prince' }, - { name: 'Curaleaf Tucson Midvale', city: 'Tucson', state: 'AZ', address: '2936 N Midvale Park Rd', storeUrl: 'https://curaleaf.com/stores/curaleaf-az-tucson-midvale' }, - { name: 'Curaleaf Sedona', city: 'Sedona', state: 'AZ', address: '525 AZ-179', storeUrl: 'https://curaleaf.com/stores/curaleaf-az-sedona' }, - { name: 'Curaleaf Youngtown', city: 'Youngtown', state: 'AZ', address: '11125 W Grand Ave', storeUrl: 'https://curaleaf.com/stores/curaleaf-az-youngtown' }, - ]; -} - -// ============================================================ -// MATCHING LOGIC -// ============================================================ - -interface Dispensary { - id: number; - name: string; - city: string | null; - state: string | null; - address: string | null; - menu_type: string | null; - menu_url: string | null; - website: string | null; -} - -/** - * Match a directory store to an existing dispensary - */ -function matchStoreToDispensary(store: DirectoryStore, dispensaries: Dispensary[]): MatchResult { - const normalizedStoreName = normalizeForComparison(store.name); - const normalizedStoreCity = normalizeCity(store.city); - - let bestMatch: Dispensary | null = null; - let bestScore = 0; - let matchReason = ''; - - for (const disp of dispensaries) { - const normalizedDispName = normalizeForComparison(disp.name); - const normalizedDispCity = normalizeCity(disp.city || ''); - - let score = 0; - const reasons: string[] = []; - - // 1. 
Name similarity (max 50 points) - const nameSimilarity = stringSimilarity(normalizedStoreName, normalizedDispName); - score += nameSimilarity * 50; - if (nameSimilarity > 0.8) reasons.push(`name_match(${(nameSimilarity * 100).toFixed(0)}%)`); - - // 2. City match (25 points for exact, 15 for partial) - if (normalizedStoreCity && normalizedDispCity) { - if (normalizedStoreCity === normalizedDispCity) { - score += 25; - reasons.push('city_exact'); - } else if ( - normalizedStoreCity.includes(normalizedDispCity) || - normalizedDispCity.includes(normalizedStoreCity) - ) { - score += 15; - reasons.push('city_partial'); - } - } - - // 3. Address contains street name (15 points) - if (store.address && disp.address) { - const storeStreet = store.address.toLowerCase().split(/\s+/).slice(1, 4).join(' '); - const dispStreet = disp.address.toLowerCase().split(/\s+/).slice(1, 4).join(' '); - if (storeStreet && dispStreet && stringSimilarity(storeStreet, dispStreet) > 0.7) { - score += 15; - reasons.push('address_match'); - } - } - - // 4. 
Brand name in dispensary name (10 points) - const brandName = store.name.split(' ')[0].toLowerCase(); // e.g., "Curaleaf", "Sol" - if (disp.name.toLowerCase().includes(brandName)) { - score += 10; - reasons.push('brand_match'); - } - - if (score > bestScore) { - bestScore = score; - bestMatch = disp; - matchReason = reasons.join(', '); - } - } - - // Determine confidence level - let confidence: 'high' | 'medium' | 'low' | 'none'; - if (bestScore >= 70) { - confidence = 'high'; - } else if (bestScore >= 50) { - confidence = 'medium'; - } else if (bestScore >= 30) { - confidence = 'low'; - } else { - confidence = 'none'; - } - - return { - directoryStore: store, - dispensaryId: bestMatch?.id || null, - dispensaryName: bestMatch?.name || null, - confidence, - matchReason: matchReason || 'no_match', - }; -} - -// ============================================================ -// MAIN FUNCTIONS -// ============================================================ - -/** - * Run directory matching for a provider and update database - * Only applies high-confidence matches automatically - */ -export async function matchDirectoryToDispensaries( - provider: 'curaleaf' | 'sol', - dryRun: boolean = true -): Promise { - console.log(`[DirectoryMatcher] Running ${provider} directory matching (dryRun=${dryRun})...`); - - // Get directory stores - let directoryStores: DirectoryStore[]; - if (provider === 'curaleaf') { - directoryStores = await scrapeCuraleafDirectory(); - } else if (provider === 'sol') { - directoryStores = await scrapeSolDirectory(); - } else { - throw new Error(`Unknown provider: ${provider}`); - } - - // Get all AZ dispensaries from database - const { rows: dispensaries } = await query( - `SELECT id, name, city, state, address, menu_type, menu_url, website - FROM dispensaries - WHERE state = 'AZ'` - ); - - console.log(`[DirectoryMatcher] Matching ${directoryStores.length} directory stores against ${dispensaries.length} dispensaries`); - - // Match each directory store 
- const results: MatchResult[] = []; - for (const store of directoryStores) { - const match = matchStoreToDispensary(store, dispensaries); - results.push(match); - - // Only apply high-confidence matches if not dry run - if (!dryRun && match.confidence === 'high' && match.dispensaryId) { - await applyDirectoryMatch(match.dispensaryId, provider, store); - } - } - - // Count results - const report: DirectoryMatchReport = { - provider, - totalDirectoryStores: directoryStores.length, - highConfidenceMatches: results.filter((r) => r.confidence === 'high').length, - mediumConfidenceMatches: results.filter((r) => r.confidence === 'medium').length, - lowConfidenceMatches: results.filter((r) => r.confidence === 'low').length, - unmatched: results.filter((r) => r.confidence === 'none').length, - results, - }; - - console.log(`[DirectoryMatcher] ${provider} matching complete:`); - console.log(` - High confidence: ${report.highConfidenceMatches}`); - console.log(` - Medium confidence: ${report.mediumConfidenceMatches}`); - console.log(` - Low confidence: ${report.lowConfidenceMatches}`); - console.log(` - Unmatched: ${report.unmatched}`); - - return report; -} - -/** - * Apply a directory match to a dispensary - */ -async function applyDirectoryMatch( - dispensaryId: number, - provider: string, - store: DirectoryStore -): Promise<void> { - console.log(`[DirectoryMatcher] Applying match: dispensary ${dispensaryId} -> ${store.storeUrl}`); - - await query( - ` - UPDATE dispensaries SET - menu_type = $1, - menu_url = $2, - platform_dispensary_id = NULL, - provider_detection_data = COALESCE(provider_detection_data, '{}'::jsonb) || - jsonb_build_object( - 'detected_provider', $1::text, - 'detection_method', 'directory_match'::text, - 'detected_at', NOW(), - 'directory_store_name', $3::text, - 'directory_store_url', $2::text, - 'directory_store_city', $4::text, - 'directory_store_address', $5::text, - 'not_crawlable', true, - 'not_crawlable_reason', $6::text - ), - updated_at = NOW() - 
WHERE id = $7 - `, - [ - provider, - store.storeUrl, - store.name, - store.city, - store.address, - `${provider} proprietary menu - no crawler available`, - dispensaryId, - ] - ); -} - -/** - * Preview matches without applying them - */ -export async function previewDirectoryMatches( - provider: 'curaleaf' | 'sol' -): Promise<DirectoryMatchReport> { - return matchDirectoryToDispensaries(provider, true); -} - -/** - * Apply high-confidence matches - */ -export async function applyHighConfidenceMatches( - provider: 'curaleaf' | 'sol' -): Promise<DirectoryMatchReport> { - return matchDirectoryToDispensaries(provider, false); -} diff --git a/backend/src/dutchie-az/services/discovery.ts b/backend/src/dutchie-az/services/discovery.ts deleted file mode 100644 index de2f3ba1..00000000 --- a/backend/src/dutchie-az/services/discovery.ts +++ /dev/null @@ -1,592 +0,0 @@ -/** - * Dutchie AZ Discovery Service - * - * Discovers and manages dispensaries from Dutchie for Arizona. - */ - -import { query, getClient } from '../db/connection'; -import { discoverArizonaDispensaries, resolveDispensaryId, resolveDispensaryIdWithDetails, ResolveDispensaryResult } from './graphql-client'; -import { Dispensary } from '../types'; - -/** - * Upsert a dispensary record - */ -async function upsertDispensary(dispensary: Partial<Dispensary>): Promise<number> { - const result = await query<{ id: number }>( - ` - INSERT INTO dispensaries ( - platform, name, slug, city, state, postal_code, address, - latitude, longitude, platform_dispensary_id, - is_delivery, is_pickup, raw_metadata, updated_at - ) VALUES ( - $1, $2, $3, $4, $5, $6, $7, - $8, $9, $10, - $11, $12, $13, NOW() - ) - ON CONFLICT (platform, slug, city, state) DO UPDATE SET - name = EXCLUDED.name, - postal_code = EXCLUDED.postal_code, - address = EXCLUDED.address, - latitude = EXCLUDED.latitude, - longitude = EXCLUDED.longitude, - platform_dispensary_id = COALESCE(EXCLUDED.platform_dispensary_id, dispensaries.platform_dispensary_id), - is_delivery = EXCLUDED.is_delivery, - is_pickup = 
EXCLUDED.is_pickup, - raw_metadata = EXCLUDED.raw_metadata, - updated_at = NOW() - RETURNING id - `, - [ - dispensary.platform || 'dutchie', - dispensary.name, - dispensary.slug, - dispensary.city, - dispensary.state || 'AZ', - dispensary.postalCode, - dispensary.address, - dispensary.latitude, - dispensary.longitude, - dispensary.platformDispensaryId, - dispensary.isDelivery || false, - dispensary.isPickup || true, - dispensary.rawMetadata ? JSON.stringify(dispensary.rawMetadata) : null, - ] - ); - - return result.rows[0].id; -} - -/** - * Normalize a raw discovery result to Dispensary - */ -function normalizeDispensary(raw: any): Partial<Dispensary> { - return { - platform: 'dutchie', - name: raw.name || raw.Name || '', - slug: raw.slug || raw.cName || raw.id || '', - city: raw.city || raw.address?.city || '', - state: 'AZ', - postalCode: raw.postalCode || raw.address?.postalCode || raw.address?.zip, - address: raw.streetAddress || raw.address?.streetAddress, - latitude: raw.latitude || raw.location?.lat, - longitude: raw.longitude || raw.location?.lng, - platformDispensaryId: raw.dispensaryId || raw.id || null, - isDelivery: raw.isDelivery || raw.delivery || false, - isPickup: raw.isPickup || raw.pickup || true, - rawMetadata: raw, - }; -} - -/** - * Import dispensaries from the existing dispensaries table (from AZDHS data) - * This creates records in the dutchie_az database for AZ dispensaries - */ -export async function importFromExistingDispensaries(): Promise<{ imported: number }> { - console.log('[Discovery] Importing from existing dispensaries table...'); - - // This is a workaround - we'll use the dispensaries we already know about - // and try to resolve their Dutchie IDs - const knownDispensaries = [ - { name: 'Deeply Rooted', slug: 'AZ-Deeply-Rooted', city: 'Phoenix', state: 'AZ' }, - { name: 'Curaleaf Gilbert', slug: 'curaleaf-gilbert', city: 'Gilbert', state: 'AZ' }, - { name: 'Zen Leaf Prescott', slug: 'AZ-zen-leaf-prescott', city: 'Prescott', state: 'AZ' }, - 
// Add more known Dutchie stores here - ]; - - let imported = 0; - - for (const disp of knownDispensaries) { - try { - const id = await upsertDispensary({ - platform: 'dutchie', - name: disp.name, - slug: disp.slug, - city: disp.city, - state: disp.state, - }); - imported++; - console.log(`[Discovery] Imported: ${disp.name} (id=${id})`); - } catch (error: any) { - console.error(`[Discovery] Failed to import ${disp.name}:`, error.message); - } - } - - return { imported }; -} - -/** - * Discover all Arizona Dutchie dispensaries via GraphQL - */ -export async function discoverDispensaries(): Promise<{ discovered: number; errors: string[] }> { - console.log('[Discovery] Starting Arizona dispensary discovery...'); - const errors: string[] = []; - let discovered = 0; - - try { - const rawDispensaries = await discoverArizonaDispensaries(); - console.log(`[Discovery] Found ${rawDispensaries.length} dispensaries from GraphQL`); - - for (const raw of rawDispensaries) { - try { - const normalized = normalizeDispensary(raw); - if (normalized.name && normalized.slug && normalized.city) { - await upsertDispensary(normalized); - discovered++; - } - } catch (error: any) { - errors.push(`${raw.name || raw.slug}: ${error.message}`); - } - } - } catch (error: any) { - errors.push(`Discovery failed: ${error.message}`); - } - - console.log(`[Discovery] Completed: ${discovered} dispensaries, ${errors.length} errors`); - return { discovered, errors }; -} - -/** - * Check if a string looks like a MongoDB ObjectId (24 hex chars) - */ -export function isObjectId(value: string): boolean { - return /^[a-f0-9]{24}$/i.test(value); -} - -/** - * Extract cName (slug) or platform_dispensary_id from a Dutchie menu_url - * - * Supports formats: - * - https://dutchie.com/embedded-menu/ -> returns { type: 'cName', value: '' } - * - https://dutchie.com/dispensary/ -> returns { type: 'cName', value: '' } - * - https://dutchie.com/api/v2/embedded-menu/.js -> returns { type: 'platformId', value: '' } - * 
- * For backward compatibility, extractCNameFromMenuUrl still returns just the string value. - */ -export interface MenuUrlExtraction { - type: 'cName' | 'platformId'; - value: string; -} - -export function extractFromMenuUrl(menuUrl: string | null | undefined): MenuUrlExtraction | null { - if (!menuUrl) return null; - - try { - const url = new URL(menuUrl); - const pathname = url.pathname; - - // Match /api/v2/embedded-menu/.js - this contains the platform_dispensary_id directly - const apiMatch = pathname.match(/^\/api\/v2\/embedded-menu\/([a-f0-9]{24})\.js$/i); - if (apiMatch) { - return { type: 'platformId', value: apiMatch[1] }; - } - - // Match /embedded-menu/ or /dispensary/ - const embeddedMatch = pathname.match(/^\/embedded-menu\/([^/?]+)/); - if (embeddedMatch) { - const value = embeddedMatch[1]; - // Check if it's actually an ObjectId (some URLs use ID directly) - if (isObjectId(value)) { - return { type: 'platformId', value }; - } - return { type: 'cName', value }; - } - - const dispensaryMatch = pathname.match(/^\/dispensary\/([^/?]+)/); - if (dispensaryMatch) { - const value = dispensaryMatch[1]; - if (isObjectId(value)) { - return { type: 'platformId', value }; - } - return { type: 'cName', value }; - } - - return null; - } catch { - return null; - } -} - -/** - * Extract cName (slug) from a Dutchie menu_url - * Backward compatible - use extractFromMenuUrl for full info - */ -export function extractCNameFromMenuUrl(menuUrl: string | null | undefined): string | null { - const extraction = extractFromMenuUrl(menuUrl); - return extraction?.value || null; -} - -/** - * Resolve platform dispensary IDs for all dispensaries that don't have one - * CRITICAL: Uses cName extracted from menu_url, NOT the slug column! - * - * Uses the new resolveDispensaryIdWithDetails which: - * 1. Extracts dispensaryId from window.reactEnv in the embedded menu page (preferred) - * 2. Falls back to GraphQL if reactEnv extraction fails - * 3. 
Returns HTTP status so we can mark 403/404 stores as not_crawlable - */ -export async function resolvePlatformDispensaryIds(): Promise<{ resolved: number; failed: number; skipped: number; notCrawlable: number }> { - console.log('[Discovery] Resolving platform dispensary IDs...'); - - const { rows: dispensaries } = await query( - ` - SELECT id, name, slug, menu_url, menu_type, platform_dispensary_id, crawl_status - FROM dispensaries - WHERE menu_type = 'dutchie' - AND platform_dispensary_id IS NULL - AND menu_url IS NOT NULL - AND (crawl_status IS NULL OR crawl_status != 'not_crawlable') - ORDER BY id - ` - ); - - let resolved = 0; - let failed = 0; - let skipped = 0; - let notCrawlable = 0; - - for (const dispensary of dispensaries) { - try { - // Extract cName from menu_url - this is the CORRECT way to get the Dutchie slug - const cName = extractCNameFromMenuUrl(dispensary.menu_url); - - if (!cName) { - console.log(`[Discovery] Skipping ${dispensary.name}: Could not extract cName from menu_url: ${dispensary.menu_url}`); - skipped++; - continue; - } - - console.log(`[Discovery] Resolving ID for: ${dispensary.name} (cName=${cName}, menu_url=${dispensary.menu_url})`); - - // Use the new detailed resolver that extracts from reactEnv first - const result = await resolveDispensaryIdWithDetails(cName); - - if (result.dispensaryId) { - // SUCCESS: Store resolved - await query( - ` - UPDATE dispensaries - SET platform_dispensary_id = $1, - platform_dispensary_id_resolved_at = NOW(), - crawl_status = 'ready', - crawl_status_reason = $2, - crawl_status_updated_at = NOW(), - last_tested_menu_url = $3, - last_http_status = $4, - updated_at = NOW() - WHERE id = $5 - `, - [ - result.dispensaryId, - `Resolved from ${result.source || 'page'}`, - dispensary.menu_url, - result.httpStatus, - dispensary.id, - ] - ); - resolved++; - console.log(`[Discovery] Resolved: ${cName} -> ${result.dispensaryId} (source: ${result.source})`); - } else if (result.httpStatus === 403 || 
result.httpStatus === 404) { - // NOT CRAWLABLE: Store removed or not accessible - await query( - ` - UPDATE dispensaries - SET platform_dispensary_id = NULL, - crawl_status = 'not_crawlable', - crawl_status_reason = $1, - crawl_status_updated_at = NOW(), - last_tested_menu_url = $2, - last_http_status = $3, - updated_at = NOW() - WHERE id = $4 - `, - [ - result.error || `HTTP ${result.httpStatus}: Removed from Dutchie`, - dispensary.menu_url, - result.httpStatus, - dispensary.id, - ] - ); - notCrawlable++; - console.log(`[Discovery] Marked not crawlable: ${cName} (HTTP ${result.httpStatus})`); - } else { - // FAILED: Could not resolve but page loaded - await query( - ` - UPDATE dispensaries - SET crawl_status = 'not_ready', - crawl_status_reason = $1, - crawl_status_updated_at = NOW(), - last_tested_menu_url = $2, - last_http_status = $3, - updated_at = NOW() - WHERE id = $4 - `, - [ - result.error || 'Could not extract dispensaryId from page', - dispensary.menu_url, - result.httpStatus, - dispensary.id, - ] - ); - failed++; - console.log(`[Discovery] Could not resolve: ${cName} - ${result.error}`); - } - - // Delay between requests - await new Promise((r) => setTimeout(r, 2000)); - } catch (error: any) { - failed++; - console.error(`[Discovery] Error resolving ${dispensary.name}:`, error.message); - } - } - - console.log(`[Discovery] Completed: ${resolved} resolved, ${failed} failed, ${skipped} skipped, ${notCrawlable} not crawlable`); - return { resolved, failed, skipped, notCrawlable }; -} - -// Use shared dispensary columns (handles optional columns like provider_detection_data) -import { DISPENSARY_COLUMNS } from '../db/dispensary-columns'; - -/** - * Get all dispensaries - */ - -export async function getAllDispensaries(): Promise { - const { rows } = await query( - `SELECT ${DISPENSARY_COLUMNS} FROM dispensaries WHERE menu_type = 'dutchie' ORDER BY name` - ); - return rows.map(mapDbRowToDispensary); -} - -/** - * Map snake_case DB row to camelCase Dispensary 
object - * CRITICAL: DB returns snake_case (platform_dispensary_id) but TypeScript expects camelCase (platformDispensaryId) - * This function is exported for use in other modules that query dispensaries directly. - * - * NOTE: The consolidated dispensaries table column mappings: - * - zip → postalCode - * - menu_type → menuType (keep platform as 'dutchie') - * - last_crawl_at → lastCrawledAt - * - platform_dispensary_id → platformDispensaryId - */ -export function mapDbRowToDispensary(row: any): Dispensary { - // Extract website from raw_metadata if available (field may not exist in all environments) - let rawMetadata = undefined; - if (row.raw_metadata !== undefined) { - rawMetadata = typeof row.raw_metadata === 'string' - ? JSON.parse(row.raw_metadata) - : row.raw_metadata; - } - const website = row.website || rawMetadata?.website || undefined; - - return { - id: row.id, - platform: row.platform || 'dutchie', // keep platform as-is, default to 'dutchie' - name: row.name, - dbaName: row.dbaName || row.dba_name || undefined, // dba_name column is optional - slug: row.slug, - city: row.city, - state: row.state, - postalCode: row.postalCode || row.zip || row.postal_code, - latitude: row.latitude ? parseFloat(row.latitude) : undefined, - longitude: row.longitude ? parseFloat(row.longitude) : undefined, - address: row.address, - platformDispensaryId: row.platformDispensaryId || row.platform_dispensary_id, // CRITICAL mapping! - isDelivery: row.is_delivery, - isPickup: row.is_pickup, - rawMetadata: rawMetadata, - lastCrawledAt: row.lastCrawledAt || row.last_crawl_at, // use last_crawl_at - productCount: row.product_count, - createdAt: row.created_at, - updatedAt: row.updated_at, - menuType: row.menuType || row.menu_type, - menuUrl: row.menuUrl || row.menu_url, - scrapeEnabled: row.scrapeEnabled ?? 
row.scrape_enabled, - providerDetectionData: row.provider_detection_data, - platformDispensaryIdResolvedAt: row.platform_dispensary_id_resolved_at, - website, - }; -} - -/** - * Get dispensary by ID - * NOTE: Uses SQL aliases to map snake_case → camelCase directly - */ -export async function getDispensaryById(id: number): Promise { - const { rows } = await query( - ` - SELECT - id, - name, - slug, - city, - state, - zip AS "postalCode", - address, - latitude, - longitude, - menu_type AS "menuType", - menu_url AS "menuUrl", - platform_dispensary_id AS "platformDispensaryId", - website, - provider_detection_data AS "providerDetectionData", - created_at, - updated_at - FROM dispensaries - WHERE id = $1 - `, - [id] - ); - if (!rows[0]) return null; - return mapDbRowToDispensary(rows[0]); -} - -/** - * Get dispensaries with platform IDs (ready for crawling) - */ -export async function getDispensariesWithPlatformIds(): Promise { - const { rows } = await query( - ` - SELECT ${DISPENSARY_COLUMNS} FROM dispensaries - WHERE menu_type = 'dutchie' AND platform_dispensary_id IS NOT NULL - ORDER BY name - ` - ); - return rows.map(mapDbRowToDispensary); -} - -/** - * Re-resolve a single dispensary's platform ID - * Clears the existing ID and re-resolves from the menu_url cName - */ -export async function reResolveDispensaryPlatformId(dispensaryId: number): Promise<{ - success: boolean; - platformId: string | null; - cName: string | null; - error?: string; -}> { - console.log(`[Discovery] Re-resolving platform ID for dispensary ${dispensaryId}...`); - - const dispensary = await getDispensaryById(dispensaryId); - if (!dispensary) { - return { success: false, platformId: null, cName: null, error: 'Dispensary not found' }; - } - - const cName = extractCNameFromMenuUrl(dispensary.menuUrl); - if (!cName) { - console.log(`[Discovery] Could not extract cName from menu_url: ${dispensary.menuUrl}`); - return { - success: false, - platformId: null, - cName: null, - error: `Could not extract 
cName from menu_url: ${dispensary.menuUrl}`, - }; - } - - console.log(`[Discovery] Extracted cName: ${cName} from menu_url: ${dispensary.menuUrl}`); - - try { - const platformId = await resolveDispensaryId(cName); - - if (platformId) { - await query( - ` - UPDATE dispensaries - SET platform_dispensary_id = $1, - platform_dispensary_id_resolved_at = NOW(), - updated_at = NOW() - WHERE id = $2 - `, - [platformId, dispensaryId] - ); - console.log(`[Discovery] Resolved: ${cName} -> ${platformId}`); - return { success: true, platformId, cName }; - } else { - // Clear the invalid platform ID and mark as not crawlable - await query( - ` - UPDATE dispensaries - SET platform_dispensary_id = NULL, - provider_detection_data = COALESCE(provider_detection_data, '{}'::jsonb) || - '{"resolution_error": "cName no longer exists on Dutchie", "not_crawlable": true}'::jsonb, - updated_at = NOW() - WHERE id = $1 - `, - [dispensaryId] - ); - console.log(`[Discovery] Could not resolve: ${cName} - marked as not crawlable`); - return { - success: false, - platformId: null, - cName, - error: `cName "${cName}" no longer exists on Dutchie`, - }; - } - } catch (error: any) { - console.error(`[Discovery] Error resolving ${cName}:`, error.message); - return { success: false, platformId: null, cName, error: error.message }; - } -} - -/** - * Update menu_url for a dispensary and re-resolve platform ID - */ -export async function updateMenuUrlAndResolve(dispensaryId: number, newMenuUrl: string): Promise<{ - success: boolean; - platformId: string | null; - cName: string | null; - error?: string; -}> { - console.log(`[Discovery] Updating menu_url for dispensary ${dispensaryId} to: ${newMenuUrl}`); - - const cName = extractCNameFromMenuUrl(newMenuUrl); - if (!cName) { - return { - success: false, - platformId: null, - cName: null, - error: `Could not extract cName from new menu_url: ${newMenuUrl}`, - }; - } - - // Update the menu_url first - await query( - ` - UPDATE dispensaries - SET menu_url = $1, 
- menu_type = 'dutchie', - platform_dispensary_id = NULL, - updated_at = NOW() - WHERE id = $2 - `, - [newMenuUrl, dispensaryId] - ); - - // Now resolve the platform ID with the new cName - return await reResolveDispensaryPlatformId(dispensaryId); -} - -/** - * Mark a dispensary as not crawlable (when resolution fails permanently) - */ -export async function markDispensaryNotCrawlable(dispensaryId: number, reason: string): Promise<void> { - await query( - ` - UPDATE dispensaries - SET platform_dispensary_id = NULL, - provider_detection_data = COALESCE(provider_detection_data, '{}'::jsonb) || - jsonb_build_object('not_crawlable', true, 'not_crawlable_reason', $1::text, 'not_crawlable_at', NOW()::text), - updated_at = NOW() - WHERE id = $2 - `, - [reason, dispensaryId] - ); - console.log(`[Discovery] Marked dispensary ${dispensaryId} as not crawlable: ${reason}`); -} - -/** - * Get the cName for a dispensary (extracted from menu_url) - */ -export function getDispensaryCName(dispensary: Dispensary): string | null { - return extractCNameFromMenuUrl(dispensary.menuUrl); -} diff --git a/backend/src/dutchie-az/services/error-taxonomy.ts b/backend/src/dutchie-az/services/error-taxonomy.ts deleted file mode 100644 index d2cb2929..00000000 --- a/backend/src/dutchie-az/services/error-taxonomy.ts +++ /dev/null @@ -1,491 +0,0 @@ -/** - * Error Taxonomy Module - * - * Standardized error codes and classification for crawler reliability. - * All crawl results must use these codes for consistent error handling. - * - * Phase 1: Crawler Reliability & Stabilization - */ - -// ============================================================ -// ERROR CODES -// ============================================================ - -/** - * Standardized error codes for all crawl operations. - * These codes are stored in the database for analytics and debugging. 
- */ -export const CrawlErrorCode = { - // Success states - SUCCESS: 'SUCCESS', - - // Rate limiting - RATE_LIMITED: 'RATE_LIMITED', // 429 responses - - // Proxy issues - BLOCKED_PROXY: 'BLOCKED_PROXY', // 407 or proxy-related blocks - PROXY_TIMEOUT: 'PROXY_TIMEOUT', // Proxy connection timeout - - // Content issues - HTML_CHANGED: 'HTML_CHANGED', // Page structure changed - NO_PRODUCTS: 'NO_PRODUCTS', // Empty response (valid but no data) - PARSE_ERROR: 'PARSE_ERROR', // Failed to parse response - - // Network issues - TIMEOUT: 'TIMEOUT', // Request timeout - NETWORK_ERROR: 'NETWORK_ERROR', // Connection failed - DNS_ERROR: 'DNS_ERROR', // DNS resolution failed - - // Authentication - AUTH_FAILED: 'AUTH_FAILED', // Authentication/session issues - - // Server errors - SERVER_ERROR: 'SERVER_ERROR', // 5xx responses - SERVICE_UNAVAILABLE: 'SERVICE_UNAVAILABLE', // 503 - - // Configuration issues - INVALID_CONFIG: 'INVALID_CONFIG', // Bad store configuration - MISSING_PLATFORM_ID: 'MISSING_PLATFORM_ID', // No platform_dispensary_id - - // Unknown - UNKNOWN_ERROR: 'UNKNOWN_ERROR', // Catch-all for unclassified errors -} as const; - -export type CrawlErrorCodeType = typeof CrawlErrorCode[keyof typeof CrawlErrorCode]; - -// ============================================================ -// ERROR CLASSIFICATION -// ============================================================ - -/** - * Error metadata for each error code - */ -interface ErrorMetadata { - code: CrawlErrorCodeType; - retryable: boolean; - rotateProxy: boolean; - rotateUserAgent: boolean; - backoffMultiplier: number; - severity: 'low' | 'medium' | 'high' | 'critical'; - description: string; -} - -/** - * Metadata for each error code - defines retry behavior - */ -export const ERROR_METADATA: Record<CrawlErrorCodeType, ErrorMetadata> = { - [CrawlErrorCode.SUCCESS]: { - code: CrawlErrorCode.SUCCESS, - retryable: false, - rotateProxy: false, - rotateUserAgent: false, - backoffMultiplier: 0, - severity: 'low', - description: 'Crawl completed 
successfully', - }, - - [CrawlErrorCode.RATE_LIMITED]: { - code: CrawlErrorCode.RATE_LIMITED, - retryable: true, - rotateProxy: true, - rotateUserAgent: true, - backoffMultiplier: 2.0, - severity: 'medium', - description: 'Rate limited by target (429)', - }, - - [CrawlErrorCode.BLOCKED_PROXY]: { - code: CrawlErrorCode.BLOCKED_PROXY, - retryable: true, - rotateProxy: true, - rotateUserAgent: true, - backoffMultiplier: 1.5, - severity: 'medium', - description: 'Proxy blocked or rejected (407)', - }, - - [CrawlErrorCode.PROXY_TIMEOUT]: { - code: CrawlErrorCode.PROXY_TIMEOUT, - retryable: true, - rotateProxy: true, - rotateUserAgent: false, - backoffMultiplier: 1.0, - severity: 'low', - description: 'Proxy connection timed out', - }, - - [CrawlErrorCode.HTML_CHANGED]: { - code: CrawlErrorCode.HTML_CHANGED, - retryable: false, - rotateProxy: false, - rotateUserAgent: false, - backoffMultiplier: 1.0, - severity: 'high', - description: 'Page structure changed - needs selector update', - }, - - [CrawlErrorCode.NO_PRODUCTS]: { - code: CrawlErrorCode.NO_PRODUCTS, - retryable: true, - rotateProxy: false, - rotateUserAgent: false, - backoffMultiplier: 1.0, - severity: 'low', - description: 'No products returned (may be temporary)', - }, - - [CrawlErrorCode.PARSE_ERROR]: { - code: CrawlErrorCode.PARSE_ERROR, - retryable: true, - rotateProxy: false, - rotateUserAgent: false, - backoffMultiplier: 1.0, - severity: 'medium', - description: 'Failed to parse response data', - }, - - [CrawlErrorCode.TIMEOUT]: { - code: CrawlErrorCode.TIMEOUT, - retryable: true, - rotateProxy: true, - rotateUserAgent: false, - backoffMultiplier: 1.5, - severity: 'medium', - description: 'Request timed out', - }, - - [CrawlErrorCode.NETWORK_ERROR]: { - code: CrawlErrorCode.NETWORK_ERROR, - retryable: true, - rotateProxy: true, - rotateUserAgent: false, - backoffMultiplier: 1.0, - severity: 'medium', - description: 'Network connection failed', - }, - - [CrawlErrorCode.DNS_ERROR]: { - code: 
CrawlErrorCode.DNS_ERROR, - retryable: true, - rotateProxy: true, - rotateUserAgent: false, - backoffMultiplier: 1.0, - severity: 'medium', - description: 'DNS resolution failed', - }, - - [CrawlErrorCode.AUTH_FAILED]: { - code: CrawlErrorCode.AUTH_FAILED, - retryable: true, - rotateProxy: false, - rotateUserAgent: true, - backoffMultiplier: 2.0, - severity: 'high', - description: 'Authentication or session failed', - }, - - [CrawlErrorCode.SERVER_ERROR]: { - code: CrawlErrorCode.SERVER_ERROR, - retryable: true, - rotateProxy: false, - rotateUserAgent: false, - backoffMultiplier: 1.5, - severity: 'medium', - description: 'Server error (5xx)', - }, - - [CrawlErrorCode.SERVICE_UNAVAILABLE]: { - code: CrawlErrorCode.SERVICE_UNAVAILABLE, - retryable: true, - rotateProxy: false, - rotateUserAgent: false, - backoffMultiplier: 2.0, - severity: 'high', - description: 'Service temporarily unavailable (503)', - }, - - [CrawlErrorCode.INVALID_CONFIG]: { - code: CrawlErrorCode.INVALID_CONFIG, - retryable: false, - rotateProxy: false, - rotateUserAgent: false, - backoffMultiplier: 0, - severity: 'critical', - description: 'Invalid store configuration', - }, - - [CrawlErrorCode.MISSING_PLATFORM_ID]: { - code: CrawlErrorCode.MISSING_PLATFORM_ID, - retryable: false, - rotateProxy: false, - rotateUserAgent: false, - backoffMultiplier: 0, - severity: 'critical', - description: 'Missing platform_dispensary_id', - }, - - [CrawlErrorCode.UNKNOWN_ERROR]: { - code: CrawlErrorCode.UNKNOWN_ERROR, - retryable: true, - rotateProxy: false, - rotateUserAgent: false, - backoffMultiplier: 1.0, - severity: 'high', - description: 'Unknown/unclassified error', - }, -}; - -// ============================================================ -// ERROR CLASSIFICATION FUNCTIONS -// ============================================================ - -/** - * Classify an error into a standardized error code. 
- *
- * @param error - The error to classify (Error object, string, or HTTP status)
- * @param httpStatus - Optional HTTP status code
- * @returns Standardized error code
- */
-export function classifyError(
-  error: Error | string | null,
-  httpStatus?: number
-): CrawlErrorCodeType {
-  // Check HTTP status first
-  if (httpStatus) {
-    if (httpStatus === 429) return CrawlErrorCode.RATE_LIMITED;
-    if (httpStatus === 407) return CrawlErrorCode.BLOCKED_PROXY;
-    if (httpStatus === 401 || httpStatus === 403) return CrawlErrorCode.AUTH_FAILED;
-    if (httpStatus === 503) return CrawlErrorCode.SERVICE_UNAVAILABLE;
-    if (httpStatus >= 500) return CrawlErrorCode.SERVER_ERROR;
-  }
-
-  if (!error) return CrawlErrorCode.UNKNOWN_ERROR;
-
-  const message = typeof error === 'string' ? error.toLowerCase() : error.message.toLowerCase();
-
-  // Rate limiting patterns
-  if (message.includes('rate limit') || message.includes('too many requests') || message.includes('429')) {
-    return CrawlErrorCode.RATE_LIMITED;
-  }
-
-  // Proxy patterns
-  if (message.includes('proxy') && (message.includes('block') || message.includes('reject') || message.includes('407'))) {
-    return CrawlErrorCode.BLOCKED_PROXY;
-  }
-
-  // Timeout patterns
-  if (message.includes('timeout') || message.includes('timed out') || message.includes('etimedout')) {
-    if (message.includes('proxy')) {
-      return CrawlErrorCode.PROXY_TIMEOUT;
-    }
-    return CrawlErrorCode.TIMEOUT;
-  }
-
-  // Network patterns
-  if (message.includes('econnrefused') || message.includes('econnreset') || message.includes('network')) {
-    return CrawlErrorCode.NETWORK_ERROR;
-  }
-
-  // DNS patterns
-  if (message.includes('enotfound') || message.includes('dns') || message.includes('getaddrinfo')) {
-    return CrawlErrorCode.DNS_ERROR;
-  }
-
-  // Auth patterns
-  if (message.includes('auth') || message.includes('unauthorized') || message.includes('forbidden') || message.includes('401') || message.includes('403')) {
-    return CrawlErrorCode.AUTH_FAILED;
-  }
-
-  // HTML change patterns
-  if (message.includes('selector') || message.includes('element not found') || message.includes('structure changed')) {
-    return CrawlErrorCode.HTML_CHANGED;
-  }
-
-  // Parse patterns
-  if (message.includes('parse') || message.includes('json') || message.includes('syntax')) {
-    return CrawlErrorCode.PARSE_ERROR;
-  }
-
-  // No products patterns
-  if (message.includes('no products') || message.includes('empty') || message.includes('0 products')) {
-    return CrawlErrorCode.NO_PRODUCTS;
-  }
-
-  // Server error patterns
-  if (message.includes('500') || message.includes('502') || message.includes('503') || message.includes('504')) {
-    return CrawlErrorCode.SERVER_ERROR;
-  }
-
-  // Config patterns
-  if (message.includes('config') || message.includes('invalid') || message.includes('missing')) {
-    if (message.includes('platform') || message.includes('dispensary_id')) {
-      return CrawlErrorCode.MISSING_PLATFORM_ID;
-    }
-    return CrawlErrorCode.INVALID_CONFIG;
-  }
-
-  return CrawlErrorCode.UNKNOWN_ERROR;
-}
-
-/**
- * Get metadata for an error code
- */
-export function getErrorMetadata(code: CrawlErrorCodeType): ErrorMetadata {
-  return ERROR_METADATA[code] || ERROR_METADATA[CrawlErrorCode.UNKNOWN_ERROR];
-}
-
-/**
- * Check if an error is retryable
- */
-export function isRetryable(code: CrawlErrorCodeType): boolean {
-  return getErrorMetadata(code).retryable;
-}
-
-/**
- * Check if proxy should be rotated for this error
- */
-export function shouldRotateProxy(code: CrawlErrorCodeType): boolean {
-  return getErrorMetadata(code).rotateProxy;
-}
-
-/**
- * Check if user agent should be rotated for this error
- */
-export function shouldRotateUserAgent(code: CrawlErrorCodeType): boolean {
-  return getErrorMetadata(code).rotateUserAgent;
-}
-
-/**
- * Get backoff multiplier for this error
- */
-export function getBackoffMultiplier(code: CrawlErrorCodeType): number {
-  return getErrorMetadata(code).backoffMultiplier;
-}
-
-// ============================================================
-// CRAWL RESULT TYPE
-// ============================================================
-
-/**
- * Standardized crawl result with error taxonomy
- */
-export interface CrawlResult {
-  success: boolean;
-  dispensaryId: number;
-
-  // Error info
-  errorCode: CrawlErrorCodeType;
-  errorMessage?: string;
-  httpStatus?: number;
-
-  // Timing
-  startedAt: Date;
-  finishedAt: Date;
-  durationMs: number;
-
-  // Context
-  attemptNumber: number;
-  proxyUsed?: string;
-  userAgentUsed?: string;
-
-  // Metrics (on success)
-  productsFound?: number;
-  productsUpserted?: number;
-  snapshotsCreated?: number;
-  imagesDownloaded?: number;
-
-  // Metadata
-  metadata?: Record;
-}
-
-/**
- * Create a success result
- */
-export function createSuccessResult(
-  dispensaryId: number,
-  startedAt: Date,
-  metrics: {
-    productsFound: number;
-    productsUpserted: number;
-    snapshotsCreated: number;
-    imagesDownloaded?: number;
-  },
-  context?: {
-    attemptNumber?: number;
-    proxyUsed?: string;
-    userAgentUsed?: string;
-  }
-): CrawlResult {
-  const finishedAt = new Date();
-  return {
-    success: true,
-    dispensaryId,
-    errorCode: CrawlErrorCode.SUCCESS,
-    startedAt,
-    finishedAt,
-    durationMs: finishedAt.getTime() - startedAt.getTime(),
-    attemptNumber: context?.attemptNumber || 1,
-    proxyUsed: context?.proxyUsed,
-    userAgentUsed: context?.userAgentUsed,
-    ...metrics,
-  };
-}
-
-/**
- * Create a failure result
- */
-export function createFailureResult(
-  dispensaryId: number,
-  startedAt: Date,
-  error: Error | string,
-  httpStatus?: number,
-  context?: {
-    attemptNumber?: number;
-    proxyUsed?: string;
-    userAgentUsed?: string;
-  }
-): CrawlResult {
-  const finishedAt = new Date();
-  const errorCode = classifyError(error, httpStatus);
-  const errorMessage = typeof error === 'string' ? error : error.message;
-
-  return {
-    success: false,
-    dispensaryId,
-    errorCode,
-    errorMessage,
-    httpStatus,
-    startedAt,
-    finishedAt,
-    durationMs: finishedAt.getTime() - startedAt.getTime(),
-    attemptNumber: context?.attemptNumber || 1,
-    proxyUsed: context?.proxyUsed,
-    userAgentUsed: context?.userAgentUsed,
-  };
-}
-
-// ============================================================
-// LOGGING HELPERS
-// ============================================================
-
-/**
- * Format error code for logging
- */
-export function formatErrorForLog(result: CrawlResult): string {
-  const metadata = getErrorMetadata(result.errorCode);
-  const retryInfo = metadata.retryable ? '(retryable)' : '(non-retryable)';
-  const proxyInfo = result.proxyUsed ? ` via ${result.proxyUsed}` : '';
-
-  if (result.success) {
-    return `[${result.errorCode}] Crawl successful: ${result.productsFound} products${proxyInfo}`;
-  }
-
-  return `[${result.errorCode}] ${result.errorMessage}${proxyInfo} ${retryInfo}`;
-}
-
-/**
- * Get user-friendly error description
- */
-export function getErrorDescription(code: CrawlErrorCodeType): string {
-  return getErrorMetadata(code).description;
-}
diff --git a/backend/src/dutchie-az/services/graphql-client.ts b/backend/src/dutchie-az/services/graphql-client.ts
deleted file mode 100644
index 95e45d98..00000000
--- a/backend/src/dutchie-az/services/graphql-client.ts
+++ /dev/null
@@ -1,712 +0,0 @@
-/**
- * Dutchie GraphQL Client
- *
- * Uses Puppeteer to establish a session (get CF cookies), then makes
- * SERVER-SIDE fetch calls to api-gw.dutchie.com with those cookies.
- *
- * DUTCHIE FETCH RULES:
- * 1. Server-side only - use axios (never browser fetch with CORS)
- * 2. Use dispensaryFilter.cNameOrID, NOT dispensaryId directly
- * 3. Headers must mimic Chrome: User-Agent, Origin, Referer
- * 4. If 403, extract CF cookies from Puppeteer session and include them
- * 5. Log status codes, error bodies, and product counts
- */
-
-import axios, { AxiosError } from 'axios';
-import puppeteer from 'puppeteer-extra';
-import type { Browser, Page, Protocol } from 'puppeteer';
-import StealthPlugin from 'puppeteer-extra-plugin-stealth';
-import {
-  DutchieRawProduct,
-  DutchiePOSChild,
-  CrawlMode,
-} from '../types';
-import { dutchieConfig, GRAPHQL_HASHES, ARIZONA_CENTERPOINTS } from '../config/dutchie';
-
-puppeteer.use(StealthPlugin());
-
-// Re-export for backward compatibility
-export { GRAPHQL_HASHES, ARIZONA_CENTERPOINTS };
-
-// ============================================================
-// SESSION MANAGEMENT - Get CF cookies via Puppeteer
-// ============================================================
-
-interface SessionCredentials {
-  cookies: string; // Cookie header string
-  userAgent: string;
-  browser: Browser;
-  page: Page; // Keep page reference for extracting dispensaryId
-  dispensaryId?: string; // Extracted from window.reactEnv if available
-  httpStatus?: number; // HTTP status code from navigation
-}
-
-/**
- * Create a session by navigating to the embedded menu page
- * and extracting CF clearance cookies for server-side requests.
- * Also extracts dispensaryId from window.reactEnv if available.
- */
-async function createSession(cName: string): Promise<SessionCredentials> {
-  const browser = await puppeteer.launch({
-    headless: 'new',
-    args: dutchieConfig.browserArgs,
-  });
-
-  const page = await browser.newPage();
-  const userAgent = dutchieConfig.userAgent;
-
-  await page.setUserAgent(userAgent);
-  await page.setViewport({ width: 1920, height: 1080 });
-  await page.evaluateOnNewDocument(() => {
-    Object.defineProperty(navigator, 'webdriver', { get: () => false });
-    (window as any).chrome = { runtime: {} };
-  });
-
-  // Navigate to the embedded menu page for this dispensary
-  const embeddedMenuUrl = `https://dutchie.com/embedded-menu/${cName}`;
-  console.log(`[GraphQL Client] Loading ${embeddedMenuUrl} to get CF cookies...`);
-
-  let httpStatus: number | undefined;
-  let dispensaryId: string | undefined;
-
-  try {
-    const response = await page.goto(embeddedMenuUrl, {
-      waitUntil: 'networkidle2',
-      timeout: dutchieConfig.navigationTimeout,
-    });
-    httpStatus = response?.status();
-    await new Promise((r) => setTimeout(r, dutchieConfig.pageLoadDelay));
-
-    // Try to extract dispensaryId from window.reactEnv
-    try {
-      dispensaryId = await page.evaluate(() => {
-        return (window as any).reactEnv?.dispensaryId || null;
-      });
-      if (dispensaryId) {
-        console.log(`[GraphQL Client] Extracted dispensaryId from reactEnv: ${dispensaryId}`);
-      }
-    } catch (evalError: any) {
-      console.log(`[GraphQL Client] Could not extract dispensaryId from reactEnv: ${evalError.message}`);
-    }
-  } catch (error: any) {
-    console.warn(`[GraphQL Client] Navigation warning: ${error.message}`);
-    // Continue anyway - we may have gotten cookies
-  }
-
-  // Extract cookies
-  const cookies = await page.cookies();
-  const cookieString = cookies.map((c: Protocol.Network.Cookie) => `${c.name}=${c.value}`).join('; ');
-
-  console.log(`[GraphQL Client] Got ${cookies.length} cookies, HTTP status: ${httpStatus}`);
-  if (cookies.length > 0) {
-    console.log(`[GraphQL Client] Cookie names: ${cookies.map(c => c.name).join(', ')}`);
-  }
-
-  return { cookies: cookieString, userAgent, browser, page, dispensaryId, httpStatus };
-}
-
-/**
- * Close session (browser)
- */
-async function closeSession(session: SessionCredentials): Promise<void> {
-  await session.browser.close();
-}
-
-// ============================================================
-// SERVER-SIDE GRAPHQL FETCH USING AXIOS
-// ============================================================
-
-/**
- * Build headers that mimic a real browser request
- */
-function buildHeaders(session: SessionCredentials, cName: string): Record<string, string> {
-  const embeddedMenuUrl = `https://dutchie.com/embedded-menu/${cName}`;
-
-  return {
-    'accept': 'application/json, text/plain, */*',
-    'accept-language': 'en-US,en;q=0.9',
-    'accept-encoding': 'gzip, deflate, br',
-    'content-type': 'application/json',
-    'origin': 'https://dutchie.com',
-    'referer': embeddedMenuUrl,
-    'user-agent': session.userAgent,
-    'apollographql-client-name': 'Marketplace (production)',
-    'sec-ch-ua': '"Chromium";v="120", "Google Chrome";v="120", "Not-A.Brand";v="99"',
-    'sec-ch-ua-mobile': '?0',
-    'sec-ch-ua-platform': '"Windows"',
-    'sec-fetch-dest': 'empty',
-    'sec-fetch-mode': 'cors',
-    'sec-fetch-site': 'same-site',
-    ...(session.cookies ? { 'cookie': session.cookies } : {}),
-  };
-}
-
-/**
- * Execute GraphQL query server-side using axios
- * Uses cookies from the browser session to bypass CF
- */
-async function executeGraphQL(
-  session: SessionCredentials,
-  operationName: string,
-  variables: any,
-  hash: string,
-  cName: string
-): Promise<any> {
-  const endpoint = dutchieConfig.graphqlEndpoint;
-  const headers = buildHeaders(session, cName);
-
-  // Build request body for POST
-  const body = {
-    operationName,
-    variables,
-    extensions: {
-      persistedQuery: { version: 1, sha256Hash: hash },
-    },
-  };
-
-  console.log(`[GraphQL Client] POST: ${operationName} -> ${endpoint}`);
-  console.log(`[GraphQL Client] Variables: ${JSON.stringify(variables).slice(0, 300)}...`);
-
-  try {
-    const response = await axios.post(endpoint, body, {
-      headers,
-      timeout: 30000,
-      validateStatus: () => true, // Don't throw on non-2xx
-    });
-
-    // Log response details
-    console.log(`[GraphQL Client] Response status: ${response.status}`);
-
-    if (response.status !== 200) {
-      const bodyPreview = typeof response.data === 'string'
-        ? response.data.slice(0, 500)
-        : JSON.stringify(response.data).slice(0, 500);
-      console.error(`[GraphQL Client] HTTP ${response.status}: ${bodyPreview}`);
-      throw new Error(`HTTP ${response.status}`);
-    }
-
-    // Check for GraphQL errors
-    if (response.data?.errors && response.data.errors.length > 0) {
-      console.error(`[GraphQL Client] GraphQL errors: ${JSON.stringify(response.data.errors[0])}`);
-    }
-
-    return response.data;
-  } catch (error: any) {
-    if (axios.isAxiosError(error)) {
-      const axiosError = error as AxiosError;
-      console.error(`[GraphQL Client] Axios error: ${axiosError.message}`);
-      if (axiosError.response) {
-        console.error(`[GraphQL Client] Response status: ${axiosError.response.status}`);
-        console.error(`[GraphQL Client] Response data: ${JSON.stringify(axiosError.response.data).slice(0, 500)}`);
-      }
-      if (axiosError.code) {
-        console.error(`[GraphQL Client] Error code: ${axiosError.code}`);
-      }
-    } else {
-      console.error(`[GraphQL Client] Error: ${error.message}`);
-    }
-    throw error;
-  }
-}
-
-// ============================================================
-// DISPENSARY ID RESOLUTION
-// ============================================================
-
-/**
- * Resolution result with HTTP status for error handling
- */
-export interface ResolveDispensaryResult {
-  dispensaryId: string | null;
-  httpStatus?: number;
-  error?: string;
-  source?: 'reactEnv' | 'graphql';
-}
-
-/**
- * Resolve a dispensary slug to its internal platform ID.
- *
- * STRATEGY:
- * 1. Navigate to embedded menu page and extract window.reactEnv.dispensaryId (preferred)
- * 2. Fall back to GraphQL GetAddressBasedDispensaryData query if reactEnv fails
- *
- * Returns the dispensaryId (platform_dispensary_id) or null if not found.
- * Throws if page returns 403/404 so caller can mark as not_crawlable.
- */
-export async function resolveDispensaryId(slug: string): Promise<string | null> {
-  const result = await resolveDispensaryIdWithDetails(slug);
-  return result.dispensaryId;
-}
-
-/**
- * Resolve a dispensary slug with full details (HTTP status, source, error).
- * Use this when you need to know WHY resolution failed.
- */
-export async function resolveDispensaryIdWithDetails(slug: string): Promise<ResolveDispensaryResult> {
-  console.log(`[GraphQL Client] Resolving dispensary ID for slug: ${slug}`);
-
-  const session = await createSession(slug);
-
-  try {
-    // Check HTTP status first - if 403/404, the store is not crawlable
-    if (session.httpStatus && (session.httpStatus === 403 || session.httpStatus === 404)) {
-      console.log(`[GraphQL Client] Page returned HTTP ${session.httpStatus} for ${slug} - not crawlable`);
-      return {
-        dispensaryId: null,
-        httpStatus: session.httpStatus,
-        error: `HTTP ${session.httpStatus}: Store removed or not accessible`,
-        source: 'reactEnv',
-      };
-    }
-
-    // PREFERRED: Use dispensaryId from window.reactEnv (extracted during createSession)
-    if (session.dispensaryId) {
-      console.log(`[GraphQL Client] Resolved ${slug} -> ${session.dispensaryId} (from reactEnv)`);
-      return {
-        dispensaryId: session.dispensaryId,
-        httpStatus: session.httpStatus,
-        source: 'reactEnv',
-      };
-    }
-
-    // FALLBACK: Try GraphQL query
-    console.log(`[GraphQL Client] reactEnv.dispensaryId not found for ${slug}, trying GraphQL...`);
-
-    const variables = {
-      dispensaryFilter: {
-        cNameOrID: slug,
-      },
-    };
-
-    const result = await executeGraphQL(
-      session,
-      'GetAddressBasedDispensaryData',
-      variables,
-      GRAPHQL_HASHES.GetAddressBasedDispensaryData,
-      slug
-    );
-
-    const dispensaryId = result?.data?.dispensaryBySlug?.id ||
-      result?.data?.dispensary?.id ||
-      result?.data?.getAddressBasedDispensaryData?.dispensary?.id;
-
-    if (dispensaryId) {
-      console.log(`[GraphQL Client] Resolved ${slug} -> ${dispensaryId} (from GraphQL)`);
-      return {
-        dispensaryId,
-        httpStatus: session.httpStatus,
-        source: 'graphql',
-      };
-    }
-
-    console.log(`[GraphQL Client] Could not resolve ${slug}, GraphQL response:`, JSON.stringify(result).slice(0, 300));
-    return {
-      dispensaryId: null,
-      httpStatus: session.httpStatus,
-      error: 'Could not extract dispensaryId from reactEnv or GraphQL',
-    };
-  } finally {
-    await closeSession(session);
-  }
-}
-
-/**
- * Discover Arizona dispensaries via geo-based query
- */
-export async function discoverArizonaDispensaries(): Promise<any[]> {
-  console.log('[GraphQL Client] Discovering Arizona dispensaries...');
-
-  // Use Phoenix as the default center
-  const session = await createSession('AZ-Deeply-Rooted');
-  const allDispensaries: any[] = [];
-  const seenIds = new Set();
-
-  try {
-    for (const centerpoint of ARIZONA_CENTERPOINTS) {
-      console.log(`[GraphQL Client] Scanning ${centerpoint.name}...`);
-
-      const variables = {
-        dispensariesFilter: {
-          latitude: centerpoint.lat,
-          longitude: centerpoint.lng,
-          distance: 100,
-          state: 'AZ',
-        },
-      };
-
-      try {
-        const result = await executeGraphQL(
-          session,
-          'ConsumerDispensaries',
-          variables,
-          GRAPHQL_HASHES.ConsumerDispensaries,
-          'AZ-Deeply-Rooted'
-        );
-
-        const dispensaries = result?.data?.consumerDispensaries || [];
-
-        for (const d of dispensaries) {
-          const id = d.id || d.dispensaryId;
-          if (id && !seenIds.has(id)) {
-            seenIds.add(id);
-            allDispensaries.push(d);
-          }
-        }
-
-        console.log(`[GraphQL Client] Found ${dispensaries.length} in ${centerpoint.name} (${allDispensaries.length} total unique)`);
-      } catch (error: any) {
-        console.warn(`[GraphQL Client] Error scanning ${centerpoint.name}: ${error.message}`);
-      }
-
-      // Delay between requests
-      await new Promise((r) => setTimeout(r, 1000));
-    }
-  } finally {
-    await closeSession(session);
-  }
-
-  console.log(`[GraphQL Client] Discovery complete: ${allDispensaries.length} dispensaries`);
-  return allDispensaries;
-}
-
-// ============================================================
-// PRODUCT FILTERING VARIABLES
-// ============================================================
-
-/**
- * Build filter variables for FilteredProducts query
- *
- * CRITICAL: Uses dispensaryId directly (the MongoDB ObjectId, e.g. "6405ef617056e8014d79101b")
- * NOT dispensaryFilter.cNameOrID!
- *
- * The actual browser request structure is:
- * {
- *   "productsFilter": {
- *     "dispensaryId": "6405ef617056e8014d79101b",
- *     "pricingType": "rec",
- *     "Status": "Active", // Mode A only
- *     "strainTypes": [],
- *     "subcategories": [],
- *     "types": [],
- *     "useCache": true,
- *     ...
- *   },
- *   "page": 0,
- *   "perPage": 100
- * }
- *
- * Mode A = UI parity (Status: "Active")
- * Mode B = MAX COVERAGE (no Status filter)
- */
-function buildFilterVariables(
-  platformDispensaryId: string,
-  pricingType: 'rec' | 'med',
-  crawlMode: CrawlMode,
-  page: number,
-  perPage: number
-): any {
-  const isModeA = crawlMode === 'mode_a';
-
-  // Per CLAUDE.md Rule #11: Use simple productsFilter with dispensaryId directly
-  // Do NOT use dispensaryFilter.cNameOrID - that's outdated
-  const productsFilter: Record = {
-    dispensaryId: platformDispensaryId,
-    pricingType: pricingType,
-  };
-
-  // Mode A: Only active products (UI parity) - Status: "Active"
-  // Mode B: MAX COVERAGE (OOS/inactive) - omit Status or set to null
-  if (isModeA) {
-    productsFilter.Status = 'Active';
-  }
-  // Mode B: No Status filter = returns all products including OOS/inactive
-
-  return {
-    productsFilter,
-    page,
-    perPage,
-  };
-}
-
-// ============================================================
-// PRODUCT FETCHING WITH PAGINATION
-// ============================================================
-
-/**
- * Fetch products for a single mode with pagination
- */
-async function fetchProductsForMode(
-  session: SessionCredentials,
-  platformDispensaryId: string,
-  cName: string,
-  pricingType: 'rec' | 'med',
-  crawlMode: CrawlMode
-): Promise<{ products: DutchieRawProduct[]; totalCount: number; crawlMode: CrawlMode }> {
-  const perPage = dutchieConfig.perPage;
-  const maxPages = dutchieConfig.maxPages;
-  const maxRetries = dutchieConfig.maxRetries;
-  const pageDelayMs = dutchieConfig.pageDelayMs;
-
-  const allProducts: DutchieRawProduct[] = [];
-  let pageNum = 0;
-  let totalCount = 0;
-  let consecutiveEmptyPages = 0;
-
-  console.log(`[GraphQL Client] Fetching products for ${cName} (platformId: ${platformDispensaryId}, ${pricingType}, ${crawlMode})...`);
-
-  while (pageNum < maxPages) {
-    const variables = buildFilterVariables(platformDispensaryId, pricingType, crawlMode, pageNum, perPage);
-
-    let result: any = null;
-    let lastError: Error | null = null;
-
-    // Retry logic
-    for (let attempt = 0; attempt <= maxRetries; attempt++) {
-      try {
-        result = await executeGraphQL(
-          session,
-          'FilteredProducts',
-          variables,
-          GRAPHQL_HASHES.FilteredProducts,
-          cName
-        );
-        lastError = null;
-        break;
-      } catch (error: any) {
-        lastError = error;
-        console.warn(`[GraphQL Client] Page ${pageNum} attempt ${attempt + 1} failed: ${error.message}`);
-        if (attempt < maxRetries) {
-          await new Promise((r) => setTimeout(r, 1000 * (attempt + 1)));
-        }
-      }
-    }
-
-    if (lastError) {
-      console.error(`[GraphQL Client] Page ${pageNum} failed after ${maxRetries + 1} attempts`);
-      break;
-    }
-
-    if (result?.errors) {
-      console.error('[GraphQL Client] GraphQL errors:', JSON.stringify(result.errors));
-      break;
-    }
-
-    // Log response shape on first page
-    if (pageNum === 0) {
-      console.log(`[GraphQL Client] Response keys: ${Object.keys(result || {}).join(', ')}`);
-      if (result?.data) {
-        console.log(`[GraphQL Client] data keys: ${Object.keys(result.data || {}).join(', ')}`);
-      }
-      if (!result?.data?.filteredProducts) {
-        console.log(`[GraphQL Client] WARNING: No filteredProducts in response!`);
-        console.log(`[GraphQL Client] Full response: ${JSON.stringify(result).slice(0, 1000)}`);
-      }
-    }
-
-    const products = result?.data?.filteredProducts?.products || [];
-    const queryInfo = result?.data?.filteredProducts?.queryInfo;
-
-    if (queryInfo?.totalCount) {
-      totalCount = queryInfo.totalCount;
-    }
-
-    console.log(
-      `[GraphQL Client] Page ${pageNum}: ${products.length} products (total so far: ${allProducts.length + products.length}/${totalCount})`
-    );
-
-    if (products.length === 0) {
-      consecutiveEmptyPages++;
-      if (consecutiveEmptyPages >= 2) {
-        console.log('[GraphQL Client] Multiple empty pages, stopping pagination');
-        break;
-      }
-    } else {
-      consecutiveEmptyPages = 0;
-      allProducts.push(...products);
-    }
-
-    // Stop if incomplete page (last page)
-    if (products.length < perPage) {
-      console.log(`[GraphQL Client] Incomplete page (${products.length} < ${perPage}), stopping`);
-      break;
-    }
-
-    pageNum++;
-    await new Promise((r) => setTimeout(r, pageDelayMs));
-  }
-
-  console.log(`[GraphQL Client] Fetched ${allProducts.length} total products (${crawlMode})`);
-  return { products: allProducts, totalCount: totalCount || allProducts.length, crawlMode };
-}
-
-// ============================================================
-// LEGACY SINGLE-MODE INTERFACE
-// ============================================================
-
-/**
- * Fetch all products for a dispensary (single mode)
- */
-export async function fetchAllProducts(
-  platformDispensaryId: string,
-  pricingType: 'rec' | 'med' = 'rec',
-  options: {
-    perPage?: number;
-    maxPages?: number;
-    menuUrl?: string;
-    crawlMode?: CrawlMode;
-    cName?: string;
-  } = {}
-): Promise<{ products: DutchieRawProduct[]; totalCount: number; crawlMode: CrawlMode }> {
-  const { crawlMode = 'mode_a' } = options;
-
-  // cName is now REQUIRED - no default fallback to avoid using wrong store's session
-  const cName = options.cName;
-  if (!cName) {
-    throw new Error('[GraphQL Client] cName is required for fetchAllProducts - cannot use another store\'s session');
-  }
-
-  const session = await createSession(cName);
-
-  try {
-    return await fetchProductsForMode(session, platformDispensaryId, cName, pricingType, crawlMode);
-  } finally {
-    await closeSession(session);
-  }
-}
-
-// ============================================================
-// MODE A+B MERGING
-// ============================================================
-
-/**
- * Merge POSMetaData.children arrays from Mode A and Mode B products
- */
-function mergeProductOptions(
-  modeAProduct: DutchieRawProduct,
-  modeBProduct: DutchieRawProduct
-): DutchiePOSChild[] {
-  const modeAChildren = modeAProduct.POSMetaData?.children || [];
-  const modeBChildren = modeBProduct.POSMetaData?.children || [];
-
-  const getOptionKey = (child: DutchiePOSChild): string => {
-    return child.canonicalID || child.canonicalSKU || child.canonicalPackageId || child.option || '';
-  };
-
-  const mergedMap = new Map();
-
-  for (const child of modeAChildren) {
-    const key = getOptionKey(child);
-    if (key) mergedMap.set(key, child);
-  }
-
-  for (const child of modeBChildren) {
-    const key = getOptionKey(child);
-    if (key && !mergedMap.has(key)) {
-      mergedMap.set(key, child);
-    }
-  }
-
-  return Array.from(mergedMap.values());
-}
-
-/**
- * Merge a Mode A product with a Mode B product
- */
-function mergeProducts(
-  modeAProduct: DutchieRawProduct,
-  modeBProduct: DutchieRawProduct | undefined
-): DutchieRawProduct {
-  if (!modeBProduct) {
-    return modeAProduct;
-  }
-
-  const mergedChildren = mergeProductOptions(modeAProduct, modeBProduct);
-
-  return {
-    ...modeAProduct,
-    POSMetaData: {
-      ...modeAProduct.POSMetaData,
-      children: mergedChildren,
-    },
-  };
-}
-
-// ============================================================
-// MAIN EXPORT: TWO-MODE CRAWL
-// ============================================================
-
-/**
- * Fetch products using BOTH crawl modes with SINGLE session
- * Runs Mode A then Mode B, merges results
- */
-export async function fetchAllProductsBothModes(
-  platformDispensaryId: string,
-  pricingType: 'rec' | 'med' = 'rec',
-  options: {
-    perPage?: number;
-    maxPages?: number;
-    menuUrl?: string;
-    cName?: string;
-  } = {}
-): Promise<{
-  modeA: { products: DutchieRawProduct[]; totalCount: number };
-  modeB: { products: DutchieRawProduct[]; totalCount: number };
-  merged: { products: DutchieRawProduct[]; totalCount: number };
-}> {
-  // cName is now REQUIRED - no default fallback to avoid using wrong store's session
-  const cName = options.cName;
-  if (!cName) {
-    throw new Error('[GraphQL Client] cName is required for fetchAllProductsBothModes - cannot use another store\'s session');
-  }
-
-  console.log(`[GraphQL Client] Running two-mode crawl for ${cName} (${pricingType})...`);
-  console.log(`[GraphQL Client] Platform ID: ${platformDispensaryId}, cName: ${cName}`);
-
-  const session = await createSession(cName);
-
-  try {
-    // Mode A (UI parity)
-    const modeAResult = await fetchProductsForMode(session, platformDispensaryId, cName, pricingType, 'mode_a');
-
-    // Delay between modes
-    await new Promise((r) => setTimeout(r, dutchieConfig.modeDelayMs));
-
-    // Mode B (MAX COVERAGE)
-    const modeBResult = await fetchProductsForMode(session, platformDispensaryId, cName, pricingType, 'mode_b');
-
-    // Merge results
-    const modeBMap = new Map();
-    for (const product of modeBResult.products) {
-      modeBMap.set(product._id, product);
-    }
-
-    const productMap = new Map();
-
-    // Add Mode A products, merging with Mode B if exists
-    for (const product of modeAResult.products) {
-      const modeBProduct = modeBMap.get(product._id);
-      const mergedProduct = mergeProducts(product, modeBProduct);
-      productMap.set(product._id, mergedProduct);
-    }
-
-    // Add Mode B products not in Mode A
-    for (const product of modeBResult.products) {
-      if (!productMap.has(product._id)) {
-        productMap.set(product._id, product);
-      }
-    }
-
-    const mergedProducts = Array.from(productMap.values());
-
-    console.log(`[GraphQL Client] Merged: ${mergedProducts.length} unique products`);
-    console.log(`[GraphQL Client] Mode A: ${modeAResult.products.length}, Mode B: ${modeBResult.products.length}`);
-
-    return {
-      modeA: { products: modeAResult.products, totalCount: modeAResult.totalCount },
-      modeB: { products: modeBResult.products, totalCount: modeBResult.totalCount },
-      merged: { products: mergedProducts, totalCount: mergedProducts.length },
-    };
-  } finally {
-    await closeSession(session);
-  }
-}
diff --git a/backend/src/dutchie-az/services/job-queue.ts b/backend/src/dutchie-az/services/job-queue.ts
deleted file mode 100644
index d2908b30..00000000
--- a/backend/src/dutchie-az/services/job-queue.ts
+++ /dev/null
@@ -1,665 +0,0 @@
-/**
- * Job Queue Service
- *
- * DB-backed job queue with claiming/locking for distributed workers.
- * Ensures only one worker processes a given store at a time.
- */
-
-import { query, getClient } from '../db/connection';
-import { v4 as uuidv4 } from 'uuid';
-import * as os from 'os';
-import { DEFAULT_CONFIG } from './store-validator';
-
-// Minimum gap between crawls for the same dispensary (in minutes)
-const MIN_CRAWL_GAP_MINUTES = DEFAULT_CONFIG.minCrawlGapMinutes; // 2 minutes
-
-// ============================================================
-// TYPES
-// ============================================================
-
-export interface QueuedJob {
-  id: number;
-  jobType: string;
-  dispensaryId: number | null;
-  status: 'pending' | 'running' | 'completed' | 'failed';
-  priority: number;
-  retryCount: number;
-  maxRetries: number;
-  claimedBy: string | null;
-  claimedAt: Date | null;
-  workerHostname: string | null;
-  startedAt: Date | null;
-  completedAt: Date | null;
-  errorMessage: string | null;
-  productsFound: number;
-  productsUpserted: number;
-  snapshotsCreated: number;
-  currentPage: number;
-  totalPages: number | null;
-  lastHeartbeatAt: Date | null;
-  metadata: Record | null;
-  createdAt: Date;
-}
-
-export interface EnqueueJobOptions {
-  jobType: string;
-  dispensaryId?: number;
-  priority?: number;
-  metadata?: Record;
-  maxRetries?: number;
-}
-
-export interface ClaimJobOptions {
-  workerId: string;
-  jobTypes?: string[];
-  lockDurationMinutes?: number;
-}
-
-export interface JobProgress { - productsFound?: number; - productsUpserted?: number; - snapshotsCreated?: number; - currentPage?: number; - totalPages?: number; -} - -// ============================================================ -// WORKER IDENTITY -// ============================================================ - -let _workerId: string | null = null; - -/** - * Get or create a unique worker ID for this process - * In Kubernetes, uses POD_NAME for clarity; otherwise generates a unique ID - */ -export function getWorkerId(): string { - if (!_workerId) { - // Prefer POD_NAME in K8s (set via fieldRef) - const podName = process.env.POD_NAME; - if (podName) { - _workerId = podName; - } else { - const hostname = os.hostname(); - const pid = process.pid; - const uuid = uuidv4().slice(0, 8); - _workerId = `${hostname}-${pid}-${uuid}`; - } - } - return _workerId; -} - -/** - * Get hostname for worker tracking - * In Kubernetes, uses POD_NAME; otherwise uses os.hostname() - */ -export function getWorkerHostname(): string { - return process.env.POD_NAME || os.hostname(); -} - -// ============================================================ -// JOB ENQUEUEING -// ============================================================ - -export interface EnqueueResult { - jobId: number | null; - skipped: boolean; - reason?: 'already_queued' | 'too_soon' | 'error'; - message?: string; -} - -/** - * Enqueue a new job for processing - * Returns null if a pending/running job already exists for this dispensary - * or if a job was completed/failed within the minimum gap period - */ -export async function enqueueJob(options: EnqueueJobOptions): Promise<number | null> { - const result = await enqueueJobWithReason(options); - return result.jobId; -} - -/** - * Enqueue a new job with detailed result info - * Enforces: - * 1. No duplicate pending/running jobs for same dispensary - * 2. 
Minimum 2-minute gap between crawls for same dispensary - */ -export async function enqueueJobWithReason(options: EnqueueJobOptions): Promise<EnqueueResult> { - const { - jobType, - dispensaryId, - priority = 0, - metadata, - maxRetries = 3, - } = options; - - // Check if there's already a pending/running job for this dispensary - if (dispensaryId) { - const { rows: existing } = await query( - `SELECT id FROM dispensary_crawl_jobs - WHERE dispensary_id = $1 AND status IN ('pending', 'running') - LIMIT 1`, - [dispensaryId] - ); - - if (existing.length > 0) { - console.log(`[JobQueue] Skipping enqueue - job already exists for dispensary ${dispensaryId}`); - return { - jobId: null, - skipped: true, - reason: 'already_queued', - message: `Job already pending/running for dispensary ${dispensaryId}`, - }; - } - - // Check minimum gap since last job (2 minutes) - const { rows: recent } = await query( - `SELECT id, created_at, status - FROM dispensary_crawl_jobs - WHERE dispensary_id = $1 - ORDER BY created_at DESC - LIMIT 1`, - [dispensaryId] - ); - - if (recent.length > 0) { - const lastJobTime = new Date(recent[0].created_at); - const minGapMs = MIN_CRAWL_GAP_MINUTES * 60 * 1000; - const timeSinceLastJob = Date.now() - lastJobTime.getTime(); - - if (timeSinceLastJob < minGapMs) { - const waitSeconds = Math.ceil((minGapMs - timeSinceLastJob) / 1000); - console.log(`[JobQueue] Skipping enqueue - minimum ${MIN_CRAWL_GAP_MINUTES}min gap not met for dispensary ${dispensaryId}. Wait ${waitSeconds}s`); - return { - jobId: null, - skipped: true, - reason: 'too_soon', - message: `Minimum ${MIN_CRAWL_GAP_MINUTES}-minute gap required. Try again in ${waitSeconds} seconds.`, - }; - } - } - } - - try { - const { rows } = await query( - `INSERT INTO dispensary_crawl_jobs (job_type, dispensary_id, status, priority, max_retries, metadata, created_at) - VALUES ($1, $2, 'pending', $3, $4, $5, NOW()) - RETURNING id`, - [jobType, dispensaryId || null, priority, maxRetries, metadata ? 
JSON.stringify(metadata) : null] - ); - - const jobId = rows[0].id; - console.log(`[JobQueue] Enqueued job ${jobId} (type=${jobType}, dispensary=${dispensaryId})`); - return { jobId, skipped: false }; - } catch (error: any) { - // Handle database trigger rejection for minimum gap - if (error.message?.includes('Minimum') && error.message?.includes('gap')) { - console.log(`[JobQueue] DB rejected - minimum gap not met for dispensary ${dispensaryId}`); - return { - jobId: null, - skipped: true, - reason: 'too_soon', - message: error.message, - }; - } - throw error; - } -} - -export interface BulkEnqueueResult { - enqueued: number; - skipped: number; - skippedReasons: { - alreadyQueued: number; - tooSoon: number; - }; -} - -/** - * Bulk enqueue jobs for multiple dispensaries - * Skips dispensaries that already have pending/running jobs - * or have jobs within the minimum gap period - */ -export async function bulkEnqueueJobs( - jobType: string, - dispensaryIds: number[], - options: { priority?: number; metadata?: Record } = {} -): Promise { - const { priority = 0, metadata } = options; - - // Get dispensaries that already have pending/running jobs - const { rows: existing } = await query( - `SELECT DISTINCT dispensary_id FROM dispensary_crawl_jobs - WHERE dispensary_id = ANY($1) AND status IN ('pending', 'running')`, - [dispensaryIds] - ); - const existingSet = new Set(existing.map((r: any) => r.dispensary_id)); - - // Get dispensaries that have recent jobs within minimum gap - const { rows: recent } = await query( - `SELECT DISTINCT dispensary_id FROM dispensary_crawl_jobs - WHERE dispensary_id = ANY($1) - AND created_at > NOW() - ($2 || ' minutes')::INTERVAL - AND dispensary_id NOT IN ( - SELECT dispensary_id FROM dispensary_crawl_jobs - WHERE dispensary_id = ANY($1) AND status IN ('pending', 'running') - )`, - [dispensaryIds, MIN_CRAWL_GAP_MINUTES] - ); - const recentSet = new Set(recent.map((r: any) => r.dispensary_id)); - - // Filter out dispensaries with existing 
or recent jobs - const toEnqueue = dispensaryIds.filter(id => !existingSet.has(id) && !recentSet.has(id)); - - if (toEnqueue.length === 0) { - return { - enqueued: 0, - skipped: dispensaryIds.length, - skippedReasons: { - alreadyQueued: existingSet.size, - tooSoon: recentSet.size, - }, - }; - } - - // Bulk insert - each row needs 4 params: job_type, dispensary_id, priority, metadata - const metadataJson = metadata ? JSON.stringify(metadata) : null; - const values = toEnqueue.map((_, i) => { - const offset = i * 4; - return `($${offset + 1}, $${offset + 2}, 'pending', $${offset + 3}, 3, $${offset + 4}, NOW())`; - }).join(', '); - - const params: any[] = []; - toEnqueue.forEach(dispensaryId => { - params.push(jobType, dispensaryId, priority, metadataJson); - }); - - await query( - `INSERT INTO dispensary_crawl_jobs (job_type, dispensary_id, status, priority, max_retries, metadata, created_at) - VALUES ${values}`, - params - ); - - console.log(`[JobQueue] Bulk enqueued ${toEnqueue.length} jobs, skipped ${existingSet.size} (queued) + ${recentSet.size} (recent)`); - return { - enqueued: toEnqueue.length, - skipped: existingSet.size + recentSet.size, - skippedReasons: { - alreadyQueued: existingSet.size, - tooSoon: recentSet.size, - }, - }; -} - -// ============================================================ -// JOB CLAIMING (with locking) -// ============================================================ - -/** - * Claim the next available job from the queue - * Uses SELECT FOR UPDATE SKIP LOCKED to prevent double-claims - */ -export async function claimNextJob(options: ClaimJobOptions): Promise<QueuedJob | null> { - const { workerId, jobTypes, lockDurationMinutes = 30 } = options; - const hostname = getWorkerHostname(); - - const client = await getClient(); - - try { - await client.query('BEGIN'); - - // Build job type filter - let typeFilter = ''; - const params: any[] = [workerId, hostname, lockDurationMinutes]; - let paramIndex = 4; - - if (jobTypes && jobTypes.length > 0) { - 
typeFilter = `AND job_type = ANY($${paramIndex})`; - params.push(jobTypes); - paramIndex++; - } - - // Claim the next pending job using FOR UPDATE SKIP LOCKED - // This atomically selects and locks a row, skipping any already locked by other workers - const { rows } = await client.query( - `UPDATE dispensary_crawl_jobs - SET - status = 'running', - claimed_by = $1, - claimed_at = NOW(), - worker_id = $1, - worker_hostname = $2, - started_at = NOW(), - locked_until = NOW() + ($3 || ' minutes')::INTERVAL, - last_heartbeat_at = NOW(), - updated_at = NOW() - WHERE id = ( - SELECT id FROM dispensary_crawl_jobs - WHERE status = 'pending' - ${typeFilter} - ORDER BY priority DESC, created_at ASC - FOR UPDATE SKIP LOCKED - LIMIT 1 - ) - RETURNING *`, - params - ); - - await client.query('COMMIT'); - - if (rows.length === 0) { - return null; - } - - const job = mapDbRowToJob(rows[0]); - console.log(`[JobQueue] Worker ${workerId} claimed job ${job.id} (type=${job.jobType}, dispensary=${job.dispensaryId})`); - return job; - } catch (error) { - await client.query('ROLLBACK'); - throw error; - } finally { - client.release(); - } -} - -// ============================================================ -// JOB PROGRESS & COMPLETION -// ============================================================ - -/** - * Update job progress (for live monitoring) - */ -export async function updateJobProgress(jobId: number, progress: JobProgress): Promise<void> { - const updates: string[] = ['last_heartbeat_at = NOW()', 'updated_at = NOW()']; - const params: any[] = []; - let paramIndex = 1; - - if (progress.productsFound !== undefined) { - updates.push(`products_found = $${paramIndex++}`); - params.push(progress.productsFound); - } - if (progress.productsUpserted !== undefined) { - updates.push(`products_upserted = $${paramIndex++}`); - params.push(progress.productsUpserted); - } - if (progress.snapshotsCreated !== undefined) { - updates.push(`snapshots_created = $${paramIndex++}`); - 
params.push(progress.snapshotsCreated); - } - if (progress.currentPage !== undefined) { - updates.push(`current_page = $${paramIndex++}`); - params.push(progress.currentPage); - } - if (progress.totalPages !== undefined) { - updates.push(`total_pages = $${paramIndex++}`); - params.push(progress.totalPages); - } - - params.push(jobId); - - await query( - `UPDATE dispensary_crawl_jobs SET ${updates.join(', ')} WHERE id = $${paramIndex}`, - params - ); -} - -/** - * Send heartbeat to keep job alive (prevents timeout) - */ -export async function heartbeat(jobId: number): Promise { - await query( - `UPDATE dispensary_crawl_jobs - SET last_heartbeat_at = NOW(), locked_until = NOW() + INTERVAL '30 minutes' - WHERE id = $1 AND status = 'running'`, - [jobId] - ); -} - -/** - * Mark job as completed - * - * Stores visibility tracking stats (visibilityLostCount, visibilityRestoredCount) - * in the metadata JSONB column for dashboard analytics. - */ -export async function completeJob( - jobId: number, - result: { - productsFound?: number; - productsUpserted?: number; - snapshotsCreated?: number; - visibilityLostCount?: number; - visibilityRestoredCount?: number; - } -): Promise { - // Build metadata with visibility stats if provided - const metadata: Record = {}; - if (result.visibilityLostCount !== undefined) { - metadata.visibilityLostCount = result.visibilityLostCount; - } - if (result.visibilityRestoredCount !== undefined) { - metadata.visibilityRestoredCount = result.visibilityRestoredCount; - } - if (result.snapshotsCreated !== undefined) { - metadata.snapshotsCreated = result.snapshotsCreated; - } - - await query( - `UPDATE dispensary_crawl_jobs - SET - status = 'completed', - completed_at = NOW(), - products_found = COALESCE($2, products_found), - products_updated = COALESCE($3, products_updated), - metadata = COALESCE(metadata, '{}'::jsonb) || $4::jsonb, - updated_at = NOW() - WHERE id = $1`, - [ - jobId, - result.productsFound, - result.productsUpserted, - 
JSON.stringify(metadata), - ] - ); - console.log(`[JobQueue] Job ${jobId} completed`); -} - -/** - * Mark job as failed - */ -export async function failJob(jobId: number, errorMessage: string): Promise<boolean> { - // Check if we should retry - const { rows } = await query( - `SELECT retry_count, max_retries FROM dispensary_crawl_jobs WHERE id = $1`, - [jobId] - ); - - if (rows.length === 0) return false; - - const { retry_count, max_retries } = rows[0]; - - if (retry_count < max_retries) { - // Re-queue for retry - await query( - `UPDATE dispensary_crawl_jobs - SET - status = 'pending', - retry_count = retry_count + 1, - claimed_by = NULL, - claimed_at = NULL, - worker_id = NULL, - worker_hostname = NULL, - started_at = NULL, - locked_until = NULL, - last_heartbeat_at = NULL, - error_message = $2, - updated_at = NOW() - WHERE id = $1`, - [jobId, errorMessage] - ); - console.log(`[JobQueue] Job ${jobId} failed, re-queued for retry (${retry_count + 1}/${max_retries})`); - return true; // Will retry - } else { - // Mark as failed permanently - await query( - `UPDATE dispensary_crawl_jobs - SET - status = 'failed', - completed_at = NOW(), - error_message = $2, - updated_at = NOW() - WHERE id = $1`, - [jobId, errorMessage] - ); - console.log(`[JobQueue] Job ${jobId} failed permanently after ${retry_count} retries`); - return false; // No more retries - } -} - -// ============================================================ -// QUEUE MONITORING -// ============================================================ - -/** - * Get queue statistics - */ -export async function getQueueStats(): Promise<{ - pending: number; - running: number; - completed1h: number; - failed1h: number; - activeWorkers: number; - avgDurationSeconds: number | null; -}> { - const { rows } = await query(`SELECT * FROM v_queue_stats`); - const stats = rows[0] || {}; - - return { - pending: parseInt(stats.pending_jobs || '0', 10), - running: parseInt(stats.running_jobs || '0', 10), - completed1h: 
parseInt(stats.completed_1h || '0', 10), - failed1h: parseInt(stats.failed_1h || '0', 10), - activeWorkers: parseInt(stats.active_workers || '0', 10), - avgDurationSeconds: stats.avg_duration_seconds ? parseFloat(stats.avg_duration_seconds) : null, - }; -} - -/** - * Get active workers - */ -export async function getActiveWorkers(): Promise> { - const { rows } = await query(`SELECT * FROM v_active_workers`); - - return rows.map((row: any) => ({ - workerId: row.worker_id, - hostname: row.worker_hostname, - currentJobs: parseInt(row.current_jobs || '0', 10), - totalProductsFound: parseInt(row.total_products_found || '0', 10), - totalProductsUpserted: parseInt(row.total_products_upserted || '0', 10), - totalSnapshots: parseInt(row.total_snapshots || '0', 10), - firstClaimedAt: new Date(row.first_claimed_at), - lastHeartbeat: row.last_heartbeat ? new Date(row.last_heartbeat) : null, - })); -} - -/** - * Get running jobs with worker info - */ -export async function getRunningJobs(): Promise { - const { rows } = await query( - `SELECT cj.*, d.name as dispensary_name, d.city - FROM dispensary_crawl_jobs cj - LEFT JOIN dispensaries d ON cj.dispensary_id = d.id - WHERE cj.status = 'running' - ORDER BY cj.started_at DESC` - ); - - return rows.map(mapDbRowToJob); -} - -/** - * Recover stale jobs (workers that died without completing) - */ -export async function recoverStaleJobs(staleMinutes: number = 15): Promise { - const { rowCount } = await query( - `UPDATE dispensary_crawl_jobs - SET - status = 'pending', - claimed_by = NULL, - claimed_at = NULL, - worker_id = NULL, - worker_hostname = NULL, - started_at = NULL, - locked_until = NULL, - error_message = 'Recovered from stale worker', - retry_count = retry_count + 1, - updated_at = NOW() - WHERE status = 'running' - AND last_heartbeat_at < NOW() - ($1 || ' minutes')::INTERVAL - AND retry_count < max_retries`, - [staleMinutes] - ); - - if (rowCount && rowCount > 0) { - console.log(`[JobQueue] Recovered ${rowCount} stale 
jobs`); - } - return rowCount || 0; -} - -/** - * Clean up old completed/failed jobs - */ -export async function cleanupOldJobs(olderThanDays: number = 7): Promise<number> { - const { rowCount } = await query( - `DELETE FROM dispensary_crawl_jobs - WHERE status IN ('completed', 'failed') - AND completed_at < NOW() - ($1 || ' days')::INTERVAL`, - [olderThanDays] - ); - - if (rowCount && rowCount > 0) { - console.log(`[JobQueue] Cleaned up ${rowCount} old jobs`); - } - return rowCount || 0; -} - -// ============================================================ -// HELPERS -// ============================================================ - -function mapDbRowToJob(row: any): QueuedJob { - return { - id: row.id, - jobType: row.job_type, - dispensaryId: row.dispensary_id, - status: row.status, - priority: row.priority || 0, - retryCount: row.retry_count || 0, - maxRetries: row.max_retries || 3, - claimedBy: row.claimed_by, - claimedAt: row.claimed_at ? new Date(row.claimed_at) : null, - workerHostname: row.worker_hostname, - startedAt: row.started_at ? new Date(row.started_at) : null, - completedAt: row.completed_at ? new Date(row.completed_at) : null, - errorMessage: row.error_message, - productsFound: row.products_found || 0, - productsUpserted: row.products_upserted || 0, - snapshotsCreated: row.snapshots_created || 0, - currentPage: row.current_page || 0, - totalPages: row.total_pages, - lastHeartbeatAt: row.last_heartbeat_at ? 
new Date(row.last_heartbeat_at) : null, - metadata: row.metadata, - createdAt: new Date(row.created_at), - // Add extra fields from join if present - ...(row.dispensary_name && { dispensaryName: row.dispensary_name }), - ...(row.city && { city: row.city }), - }; -} diff --git a/backend/src/dutchie-az/services/menu-detection.ts b/backend/src/dutchie-az/services/menu-detection.ts deleted file mode 100644 index a8b094c9..00000000 --- a/backend/src/dutchie-az/services/menu-detection.ts +++ /dev/null @@ -1,1173 +0,0 @@ -/** - * Menu Detection Service - * - * Detects menu provider (dutchie, treez, jane, etc.) from dispensary menu_url - * and resolves platform_dispensary_id for dutchie stores. - * - * This service: - * 1. Iterates dispensaries with unknown/missing menu_type or platform_dispensary_id - * 2. Detects provider from menu_url patterns - * 3. For dutchie: extracts cName and resolves platform_dispensary_id via GraphQL - * 4. Logs results to job_run_logs - */ - -import { query } from '../db/connection'; -import { extractCNameFromMenuUrl, extractFromMenuUrl, mapDbRowToDispensary } from './discovery'; -import { resolveDispensaryId } from './graphql-client'; -import { Dispensary, JobStatus } from '../types'; - -// Use shared dispensary columns (handles optional columns like provider_detection_data) -import { DISPENSARY_COLUMNS } from '../db/dispensary-columns'; - -// ============================================================ -// TYPES -// ============================================================ - -export type MenuProvider = - | 'dutchie' - | 'treez' - | 'jane' - | 'iheartjane' - | 'weedmaps' - | 'leafly' - | 'meadow' - | 'blaze' - | 'flowhub' - | 'dispense' - | 'custom' - | 'unknown'; - -export interface DetectionResult { - dispensaryId: number; - dispensaryName: string; - previousMenuType: string | null; - detectedProvider: MenuProvider; - cName: string | null; - platformDispensaryId: string | null; - success: boolean; - error?: string; -} - -export interface 
BulkDetectionResult { - totalProcessed: number; - totalSucceeded: number; - totalFailed: number; - totalSkipped: number; - results: DetectionResult[]; - errors: string[]; -} - -// ============================================================ -// PROVIDER DETECTION PATTERNS -// ============================================================ - -const PROVIDER_URL_PATTERNS: Array<{ provider: MenuProvider; patterns: RegExp[] }> = [ - // We detect provider based on the actual menu link we find, not just the site domain. - { - provider: 'dutchie', - patterns: [ - /dutchie\.com/i, - /\/embedded-menu\//i, - /\/dispensary\/[A-Z]{2}-/i, // e.g., /dispensary/AZ-store-name - /dutchie-plus/i, - /curaleaf\.com/i, // Curaleaf uses Dutchie platform - /livewithsol\.com/i, // Sol Flower uses Dutchie platform - ], - }, - { - provider: 'treez', - patterns: [ - /treez\.io/i, - /shop\.treez/i, - /treez-ecommerce/i, - ], - }, - { - provider: 'jane', - patterns: [ - /jane\.co/i, - /iheartjane\.com/i, - /embed\.iheartjane/i, - ], - }, - { - provider: 'weedmaps', - patterns: [ - /weedmaps\.com/i, - /menu\.weedmaps/i, - ], - }, - { - provider: 'leafly', - patterns: [ - /leafly\.com/i, - /order\.leafly/i, - ], - }, - { - provider: 'meadow', - patterns: [ - /getmeadow\.com/i, - /meadow\.co/i, - ], - }, - { - provider: 'blaze', - patterns: [ - /blaze\.me/i, - /blazepos\.com/i, - ], - }, - { - provider: 'flowhub', - patterns: [ - /flowhub\.com/i, - /flowhub\.co/i, - ], - }, - { - provider: 'dispense', - patterns: [ - /dispense\.io/i, - /dispenseapp\.com/i, - ], - }, -]; - -// ============================================================ -// WEBSITE CRAWL FUNCTIONS -// ============================================================ - -/** - * Result from crawling a website to find menu links - */ -export interface WebsiteCrawlResult { - menuUrl: string | null; - provider: MenuProvider; - foundLinks: string[]; - crawledPages: string[]; - platformDispensaryId?: string | null; - error?: string; -} - -/** - 
* Link patterns that suggest a menu or ordering page - */ -const MENU_LINK_PATTERNS = [ - /\/menu/i, - /\/order/i, - /\/shop/i, - /\/products/i, - /\/dispensary/i, - /\/store/i, - /curaleaf\.com/i, - /dutchie\.com/i, - /treez\.io/i, - /jane\.co/i, - /iheartjane\.com/i, - /weedmaps\.com/i, - /leafly\.com/i, - /getmeadow\.com/i, - /blaze\.me/i, - /flowhub\.com/i, - /dispense\.io/i, -]; - -/** - * Check if a URL is a Curaleaf store URL - */ -function isCuraleafUrl(url: string | null | undefined): boolean { - if (!url) return false; - return /curaleaf\.com\/(stores|dispensary)\//i.test(url); -} - -/** - * Fetch a page and extract all links - */ -async function fetchPageLinks(url: string, timeout: number = 10000): Promise<{ links: string[]; error?: string }> { - try { - const controller = new AbortController(); - const timeoutId = setTimeout(() => controller.abort(), timeout); - - // Use Googlebot User-Agent to bypass age gates on dispensary websites - const response = await fetch(url, { - signal: controller.signal, - headers: { - 'User-Agent': 'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)', - 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8', - }, - redirect: 'follow', - }); - - clearTimeout(timeoutId); - - if (!response.ok) { - return { links: [], error: `HTTP ${response.status}` }; - } - - const html = await response.text(); - - // Quick check: if the page contains reactEnv.dispensaryId, treat it as Dutchie - // Use direct match for dispensaryId - the [^}]* pattern fails with nested braces in JSON - const reactEnvMatch = /"dispensaryId"\s*:\s*"([a-fA-F0-9]{24})"/i.exec(html); - if (reactEnvMatch && reactEnvMatch[1]) { - return { links: [`dutchie-reactenv:${reactEnvMatch[1]}`] }; - } - - // Extract all href attributes from anchor tags - const linkRegex = /href=["']([^"']+)["']/gi; - const links: string[] = []; - let match; - - while ((match = linkRegex.exec(html)) !== null) { - const href = match[1]; - // Convert 
relative URLs to absolute - try { - const absoluteUrl = new URL(href, url).href; - links.push(absoluteUrl); - } catch { - // Skip invalid URLs - } - } - - // Also look for iframe src attributes (common for embedded menus) - const iframeRegex = /src=["']([^"']+)["']/gi; - while ((match = iframeRegex.exec(html)) !== null) { - const src = match[1]; - try { - const absoluteUrl = new URL(src, url).href; - // Only add if it matches a provider pattern - for (const { patterns } of PROVIDER_URL_PATTERNS) { - if (patterns.some(p => p.test(absoluteUrl))) { - links.push(absoluteUrl); - break; - } - } - } catch { - // Skip invalid URLs - } - } - - return { links: [...new Set(links)] }; // Deduplicate - } catch (error: any) { - if (error.name === 'AbortError') { - return { links: [], error: 'Timeout' }; - } - return { links: [], error: error.message }; - } -} - -/** - * Crawl a dispensary's website to find menu provider links - * - * Strategy: - * 1. Fetch the homepage and extract all links - * 2. Look for links that match known provider patterns (dutchie, treez, etc.) - * 3. If no direct match, look for menu/order/shop links and follow them (1-2 hops) - * 4. 
Check followed pages for provider patterns - */ -export async function crawlWebsiteForMenuLinks(websiteUrl: string): Promise { - console.log(`[WebsiteCrawl] Crawling ${websiteUrl} for menu links...`); - - const result: WebsiteCrawlResult = { - menuUrl: null, - provider: 'unknown', - foundLinks: [], - crawledPages: [], - }; - - // Normalize URL - let baseUrl: URL; - try { - baseUrl = new URL(websiteUrl); - if (!baseUrl.protocol.startsWith('http')) { - baseUrl = new URL(`https://${websiteUrl}`); - } - } catch { - result.error = 'Invalid website URL'; - return result; - } - - // Step 1: Fetch the homepage - const homepage = baseUrl.href; - result.crawledPages.push(homepage); - - const { links: homepageLinks, error: homepageError } = await fetchPageLinks(homepage); - if (homepageError) { - result.error = `Failed to fetch homepage: ${homepageError}`; - return result; - } - - result.foundLinks = homepageLinks; - - // Step 2: Try to extract reactEnv.dispensaryId (embedded Dutchie menu) from homepage HTML - try { - // Use Googlebot User-Agent to bypass age gates on dispensary websites - const resp = await fetch(homepage, { - headers: { - 'User-Agent': 'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)', - 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8', - }, - redirect: 'follow', - }); - if (resp.ok) { - const html = await resp.text(); - // Look for dispensaryId directly - the [^}]* pattern fails with nested braces - const reactEnvMatch = /"dispensaryId"\s*:\s*"([a-fA-F0-9]{24})"/i.exec(html); - if (reactEnvMatch && reactEnvMatch[1]) { - result.provider = 'dutchie'; - result.menuUrl = homepage; - result.platformDispensaryId = reactEnvMatch[1]; - console.log(`[WebsiteCrawl] Found reactEnv.dispensaryId=${reactEnvMatch[1]} on homepage ${homepage}`); - return result; - } - } - } catch (err: any) { - console.log(`[WebsiteCrawl] reactEnv check failed for ${homepage}: ${err.message}`); - } - - // Step 2: Check for reactEnv token 
from fetchPageLinks (encoded as dutchie-reactenv:) - for (const link of homepageLinks) { - const reactEnvToken = /^dutchie-reactenv:(.+)$/.exec(link); - if (reactEnvToken) { - result.menuUrl = homepage; - result.provider = 'dutchie'; - result.platformDispensaryId = reactEnvToken[1]; - console.log(`[WebsiteCrawl] Found reactEnv.dispensaryId=${reactEnvToken[1]} on ${homepage}`); - return result; - } - } - - // Step 3: Check for direct provider matches in homepage links - for (const link of homepageLinks) { - for (const { provider, patterns } of PROVIDER_URL_PATTERNS) { - if (patterns.some(p => p.test(link))) { - console.log(`[WebsiteCrawl] Found ${provider} link on homepage: ${link}`); - result.menuUrl = link; - result.provider = provider; - return result; - } - } - } - - // Step 4: Find menu/order/shop links to follow - const menuLinks = homepageLinks.filter(link => { - // Must be same domain or a known provider domain - try { - const linkUrl = new URL(link); - const isSameDomain = linkUrl.hostname === baseUrl.hostname || - linkUrl.hostname.endsWith(`.${baseUrl.hostname}`); - const isProviderDomain = PROVIDER_URL_PATTERNS.some(({ patterns }) => - patterns.some(p => p.test(link)) - ); - const isMenuPath = MENU_LINK_PATTERNS.some(p => p.test(link)); - - return (isSameDomain && isMenuPath) || isProviderDomain; - } catch { - return false; - } - }); - - console.log(`[WebsiteCrawl] Found ${menuLinks.length} potential menu links to follow`); - - // Step 4: Follow menu links (limit to 3 to avoid excessive crawling) - for (const menuLink of menuLinks.slice(0, 3)) { - // Skip if we've already crawled this page - if (result.crawledPages.includes(menuLink)) continue; - - // Check if this link itself is a provider URL - for (const { provider, patterns } of PROVIDER_URL_PATTERNS) { - if (patterns.some(p => p.test(menuLink))) { - console.log(`[WebsiteCrawl] Menu link is a ${provider} URL: ${menuLink}`); - result.menuUrl = menuLink; - result.provider = provider; - return result; - 
} - } - - result.crawledPages.push(menuLink); - - // Rate limit - await new Promise(r => setTimeout(r, 500)); - - const { links: pageLinks, error: pageError } = await fetchPageLinks(menuLink); - if (pageError) { - console.log(`[WebsiteCrawl] Failed to fetch ${menuLink}: ${pageError}`); - continue; - } - - result.foundLinks.push(...pageLinks); - - // Check for provider matches on this page - for (const link of pageLinks) { - for (const { provider, patterns } of PROVIDER_URL_PATTERNS) { - if (patterns.some(p => p.test(link))) { - console.log(`[WebsiteCrawl] Found ${provider} link on ${menuLink}: ${link}`); - result.menuUrl = link; - result.provider = provider; - return result; - } - } - } - } - - console.log(`[WebsiteCrawl] No menu provider found on ${websiteUrl}`); - return result; -} - -// ============================================================ -// CORE DETECTION FUNCTIONS -// ============================================================ - -/** - * Detect menu provider from a URL - */ -export function detectProviderFromUrl(menuUrl: string | null | undefined): MenuProvider { - if (!menuUrl) return 'unknown'; - - for (const { provider, patterns } of PROVIDER_URL_PATTERNS) { - for (const pattern of patterns) { - if (pattern.test(menuUrl)) { - return provider; - } - } - } - - // Check if it's a custom website (has a domain but doesn't match known providers) - try { - const url = new URL(menuUrl); - if (url.hostname && !url.hostname.includes('localhost')) { - return 'custom'; - } - } catch { - // Invalid URL - } - - return 'unknown'; -} - -/** - * Detect provider and resolve platform ID for a single dispensary - */ -export async function detectAndResolveDispensary(dispensaryId: number): Promise { - console.log(`[MenuDetection] Processing dispensary ${dispensaryId}...`); - - // Get dispensary record - const { rows } = await query( - `SELECT ${DISPENSARY_COLUMNS} FROM dispensaries WHERE id = $1`, - [dispensaryId] - ); - - if (rows.length === 0) { - return { - 
dispensaryId, - dispensaryName: 'Unknown', - previousMenuType: null, - detectedProvider: 'unknown', - cName: null, - platformDispensaryId: null, - success: false, - error: 'Dispensary not found', - }; - } - - const dispensary = mapDbRowToDispensary(rows[0]); - let menuUrl = dispensary.menuUrl; - const previousMenuType = dispensary.menuType || null; - const website = dispensary.website; - - // If menu_url is null or empty, try to discover it by crawling the dispensary website - if (!menuUrl || menuUrl.trim() === '') { - console.log(`[MenuDetection] ${dispensary.name}: No menu_url - attempting website crawl`); - - // Check if website is available - if (!website || website.trim() === '') { - console.log(`[MenuDetection] ${dispensary.name}: No website available - marking as not crawlable`); - - await query( - ` - UPDATE dispensaries SET - menu_type = 'unknown', - provider_detection_data = COALESCE(provider_detection_data, '{}'::jsonb) || - jsonb_build_object( - 'detected_provider', 'unknown'::text, - 'detection_method', 'no_data'::text, - 'detected_at', NOW(), - 'resolution_error', 'No menu_url and no website available'::text, - 'not_crawlable', true, - 'website_crawl_attempted', false - ), - updated_at = NOW() - WHERE id = $1 - `, - [dispensaryId] - ); - - return { - dispensaryId, - dispensaryName: dispensary.name, - previousMenuType, - detectedProvider: 'unknown', - cName: null, - platformDispensaryId: null, - success: true, - error: 'No menu_url and no website available - marked as not crawlable', - }; - } - - // Crawl the website to find menu provider links - console.log(`[MenuDetection] ${dispensary.name}: Crawling website ${website} for menu links...`); - const crawlResult = await crawlWebsiteForMenuLinks(website); - - if (crawlResult.menuUrl && crawlResult.provider !== 'unknown') { - // SUCCESS: Found a menu URL from website crawl! 
- console.log(`[MenuDetection] ${dispensary.name}: Found ${crawlResult.provider} menu at ${crawlResult.menuUrl}`); - menuUrl = crawlResult.menuUrl; - - // Update the dispensary with the discovered menu_url - await query( - ` - UPDATE dispensaries SET - menu_url = $1, - menu_type = $2, - provider_detection_data = COALESCE(provider_detection_data, '{}'::jsonb) || - jsonb_build_object( - 'detected_provider', $2::text, - 'detection_method', 'website_crawl'::text, - 'detected_at', NOW(), - 'website_crawled', $3::text, - 'website_crawl_pages', $4::jsonb, - 'not_crawlable', false - ), - updated_at = NOW() - WHERE id = $5 - `, - [ - crawlResult.menuUrl, - crawlResult.provider, - website, - JSON.stringify(crawlResult.crawledPages), - dispensaryId - ] - ); - - // Continue with full detection flow using the discovered menu_url - } else { - // Website crawl failed to find a menu provider - const errorReason = crawlResult.error || 'No menu provider links found on website'; - console.log(`[MenuDetection] ${dispensary.name}: Website crawl failed - ${errorReason}`); - - await query( - ` - UPDATE dispensaries SET - menu_type = 'unknown', - provider_detection_data = COALESCE(provider_detection_data, '{}'::jsonb) || - jsonb_build_object( - 'detected_provider', 'unknown'::text, - 'detection_method', 'website_crawl'::text, - 'detected_at', NOW(), - 'website_crawled', $1::text, - 'website_crawl_pages', $2::jsonb, - 'resolution_error', $3::text, - 'not_crawlable', true - ), - updated_at = NOW() - WHERE id = $4 - `, - [ - website, - JSON.stringify(crawlResult.crawledPages), - errorReason, - dispensaryId - ] - ); - - return { - dispensaryId, - dispensaryName: dispensary.name, - previousMenuType, - detectedProvider: 'unknown', - cName: null, - platformDispensaryId: null, - success: true, - error: `Website crawl failed: ${errorReason}`, - }; - } - } - - // Detect provider from URL - const detectedProvider = detectProviderFromUrl(menuUrl); - console.log(`[MenuDetection] ${dispensary.name}: 
Detected provider = ${detectedProvider} from URL: ${menuUrl}`); - - // Initialize result - const result: DetectionResult = { - dispensaryId, - dispensaryName: dispensary.name, - previousMenuType, - detectedProvider, - cName: null, - platformDispensaryId: null, - success: false, - }; - - // If not dutchie, just update menu_type (non-dutchie providers) - // Note: curaleaf.com and livewithsol.com are detected directly as 'dutchie' via PROVIDER_URL_PATTERNS - if (detectedProvider !== 'dutchie') { - await query( - ` - UPDATE dispensaries SET - menu_type = $1, - provider_detection_data = COALESCE(provider_detection_data, '{}'::jsonb) || - jsonb_build_object( - 'detected_provider', $1::text, - 'detection_method', 'url_pattern'::text, - 'detected_at', NOW(), - 'not_crawlable', false - ), - updated_at = NOW() - WHERE id = $2 - `, - [detectedProvider, dispensaryId] - ); - result.success = true; - console.log(`[MenuDetection] ${dispensary.name}: Updated menu_type to ${detectedProvider}`); - return result; - } - - // For dutchie: extract cName or platformId from menu_url - const extraction = extractFromMenuUrl(menuUrl); - - if (!extraction) { - result.error = `Could not extract cName or platformId from menu_url: ${menuUrl}`; - await query( - ` - UPDATE dispensaries SET - menu_type = 'dutchie', - last_id_resolution_at = NOW(), - id_resolution_attempts = COALESCE(id_resolution_attempts, 0) + 1, - id_resolution_error = $1, - provider_detection_data = COALESCE(provider_detection_data, '{}'::jsonb) || - jsonb_build_object( - 'detected_provider', 'dutchie'::text, - 'detection_method', 'url_pattern'::text, - 'detected_at', NOW(), - 'resolution_error', $1::text, - 'not_crawlable', true - ), - updated_at = NOW() - WHERE id = $2 - `, - [result.error, dispensaryId] - ); - console.log(`[Henry - Entry Point Finder] ${dispensary.name}: ${result.error}`); - return result; - } - - // If URL contains platform_dispensary_id directly (e.g., /api/v2/embedded-menu/.js), skip GraphQL resolution - 
if (extraction.type === 'platformId') { - const platformId = extraction.value; - result.platformDispensaryId = platformId; - result.success = true; - - await query( - ` - UPDATE dispensaries SET - menu_type = 'dutchie', - platform_dispensary_id = $1, - last_id_resolution_at = NOW(), - id_resolution_attempts = COALESCE(id_resolution_attempts, 0) + 1, - id_resolution_error = NULL, - provider_detection_data = COALESCE(provider_detection_data, '{}'::jsonb) || - jsonb_build_object( - 'detected_provider', 'dutchie'::text, - 'detection_method', 'url_direct_platform_id'::text, - 'detected_at', NOW(), - 'platform_id_source', 'url_embedded'::text, - 'platform_id_resolved', true, - 'platform_id_resolved_at', NOW(), - 'resolution_error', NULL::text, - 'not_crawlable', false - ), - updated_at = NOW() - WHERE id = $2 - `, - [platformId, dispensaryId] - ); - console.log(`[Henry - Entry Point Finder] ${dispensary.name}: Platform ID extracted directly from URL = ${platformId}`); - return result; - } - - // Otherwise, we have a cName that needs GraphQL resolution - const cName = extraction.value; - result.cName = cName; - - // Resolve platform_dispensary_id from cName - console.log(`[MenuDetection] ${dispensary.name}: Resolving platform ID for cName = ${cName}`); - - try { - const platformId = await resolveDispensaryId(cName); - - if (platformId) { - result.platformDispensaryId = platformId; - result.success = true; - - await query( - ` - UPDATE dispensaries SET - menu_type = 'dutchie', - platform_dispensary_id = $1, - last_id_resolution_at = NOW(), - id_resolution_attempts = COALESCE(id_resolution_attempts, 0) + 1, - id_resolution_error = NULL, - provider_detection_data = COALESCE(provider_detection_data, '{}'::jsonb) || - jsonb_build_object( - 'detected_provider', 'dutchie'::text, - 'detection_method', 'url_pattern'::text, - 'detected_at', NOW(), - 'cname_extracted', $2::text, - 'platform_id_resolved', true, - 'platform_id_resolved_at', NOW(), - 'resolution_error', NULL::text, - 
'not_crawlable', false - ), - updated_at = NOW() - WHERE id = $3 - `, - [platformId, cName, dispensaryId] - ); - console.log(`[Henry - Entry Point Finder] ${dispensary.name}: Resolved platform ID = ${platformId}`); - } else { - // cName resolution failed - try crawling website as fallback - console.log(`[Henry - Entry Point Finder] ${dispensary.name}: cName "${cName}" not found on Dutchie, trying website crawl fallback...`); - - if (website && website.trim() !== '') { - const fallbackCrawl = await crawlWebsiteForMenuLinks(website); - - if (fallbackCrawl.menuUrl && fallbackCrawl.provider === 'dutchie') { - // Found Dutchie menu via website crawl! - console.log(`[MenuDetection] ${dispensary.name}: Found Dutchie menu via website crawl: ${fallbackCrawl.menuUrl}`); - - // Extract from the new menu URL - const newExtraction = extractFromMenuUrl(fallbackCrawl.menuUrl); - if (newExtraction) { - let fallbackPlatformId: string | null = null; - - if (newExtraction.type === 'platformId') { - fallbackPlatformId = newExtraction.value; - } else { - // Try to resolve the new cName - fallbackPlatformId = await resolveDispensaryId(newExtraction.value); - } - - if (fallbackPlatformId) { - result.platformDispensaryId = fallbackPlatformId; - result.success = true; - result.cName = newExtraction.value; - - await query( - ` - UPDATE dispensaries SET - menu_type = 'dutchie', - menu_url = $1, - platform_dispensary_id = $2, - provider_detection_data = COALESCE(provider_detection_data, '{}'::jsonb) || - jsonb_build_object( - 'detected_provider', 'dutchie'::text, - 'detection_method', 'website_crawl_fallback'::text, - 'detected_at', NOW(), - 'original_cname', $3::text, - 'fallback_cname', $4::text, - 'website_crawled', $5::text, - 'platform_id_resolved', true, - 'platform_id_resolved_at', NOW(), - 'not_crawlable', false - ), - updated_at = NOW() - WHERE id = $6 - `, - [fallbackCrawl.menuUrl, fallbackPlatformId, cName, newExtraction.value, website, dispensaryId] - ); - 
console.log(`[MenuDetection] ${dispensary.name}: Resolved via website crawl, platform ID = ${fallbackPlatformId}`); - return result; - } - } - } - } - - // Website crawl fallback didn't work either - result.error = `cName "${cName}" could not be resolved - may not exist on Dutchie`; - await query( - ` - UPDATE dispensaries SET - menu_type = 'dutchie', - platform_dispensary_id = NULL, - last_id_resolution_at = NOW(), - id_resolution_attempts = COALESCE(id_resolution_attempts, 0) + 1, - id_resolution_error = $2, - provider_detection_data = COALESCE(provider_detection_data, '{}'::jsonb) || - jsonb_build_object( - 'detected_provider', 'dutchie'::text, - 'detection_method', 'url_pattern'::text, - 'detected_at', NOW(), - 'cname_extracted', $1::text, - 'platform_id_resolved', false, - 'resolution_error', $2::text, - 'website_crawl_attempted', true, - 'not_crawlable', true - ), - updated_at = NOW() - WHERE id = $3 - `, - [cName, result.error, dispensaryId] - ); - console.log(`[Henry - Entry Point Finder] ${dispensary.name}: ${result.error}`); - } - } catch (error: any) { - result.error = `Resolution failed: ${error.message}`; - await query( - ` - UPDATE dispensaries SET - menu_type = 'dutchie', - last_id_resolution_at = NOW(), - id_resolution_attempts = COALESCE(id_resolution_attempts, 0) + 1, - id_resolution_error = $2, - provider_detection_data = COALESCE(provider_detection_data, '{}'::jsonb) || - jsonb_build_object( - 'detected_provider', 'dutchie'::text, - 'detection_method', 'url_pattern'::text, - 'detected_at', NOW(), - 'cname_extracted', $1::text, - 'platform_id_resolved', false, - 'resolution_error', $2::text, - 'not_crawlable', true - ), - updated_at = NOW() - WHERE id = $3 - `, - [cName, result.error, dispensaryId] - ); - console.error(`[Henry - Entry Point Finder] ${dispensary.name}: ${result.error}`); - } - - return result; -} - -/** - * Run bulk detection on all dispensaries with unknown/missing menu_type or platform_dispensary_id - * Also includes 
dispensaries with no menu_url but with a website (for website crawl discovery) - * - * Enhanced for Henry (Entry Point Finder) to also process: - * - Stores with slug changes that need re-resolution - * - Recently added stores from Alice's discovery - * - Stores that failed resolution and need retry - */ -export async function runBulkDetection(options: { - state?: string; - onlyUnknown?: boolean; - onlyMissingPlatformId?: boolean; - includeWebsiteCrawl?: boolean; // Include dispensaries with website but no menu_url - includeDutchieMissingPlatformId?: boolean; // include menu_type='dutchie' with null platform_id - includeSlugChanges?: boolean; // Include stores where Alice detected slug changes - includeRecentlyAdded?: boolean; // Include stores recently added by Alice - scope?: { states?: string[]; storeIds?: number[] }; // Scope filtering for sharding - limit?: number; -} = {}): Promise<BulkDetectionResult> { - const { - state, - onlyUnknown = true, - onlyMissingPlatformId = false, - includeWebsiteCrawl = true, - includeDutchieMissingPlatformId = true, - includeSlugChanges = true, - includeRecentlyAdded = true, - scope, - limit, - } = options; - - const scopeDesc = scope?.states?.length - ? ` (states: ${scope.states.join(', ')})` - : scope?.storeIds?.length - ? ` (${scope.storeIds.length} specific stores)` - : state ? ` (state: ${state})` : ''; - - console.log(`[Henry - Entry Point Finder] Starting bulk detection${scopeDesc}...`); - - // Build query to find dispensaries needing detection - // Includes: dispensaries with menu_url OR (no menu_url but has website and not already marked not_crawlable) - // Optionally includes dutchie stores missing platform ID, slug changes, and recently added stores - let whereClause = `WHERE ( - menu_url IS NOT NULL - ${includeWebsiteCrawl ? 
`OR ( - menu_url IS NULL - AND website IS NOT NULL - AND website != '' - AND (provider_detection_data IS NULL OR NOT (provider_detection_data->>'not_crawlable')::boolean) - )` : ''} - ${includeDutchieMissingPlatformId ? `OR ( - menu_type = 'dutchie' AND platform_dispensary_id IS NULL - )` : ''} - )`; - const params: any[] = []; - let paramIndex = 1; - - // Apply scope filtering (takes precedence over single state filter) - if (scope?.storeIds?.length) { - whereClause += ` AND id = ANY($${paramIndex++})`; - params.push(scope.storeIds); - } else if (scope?.states?.length) { - whereClause += ` AND state = ANY($${paramIndex++})`; - params.push(scope.states); - } else if (state) { - whereClause += ` AND state = $${paramIndex++}`; - params.push(state); - } - - // Handle filters for unknown and/or missing platform IDs - if (onlyUnknown && onlyMissingPlatformId) { - whereClause += ` AND ( - (menu_type IS NULL OR menu_type = '' OR menu_type = 'unknown') - OR (menu_type = 'dutchie' AND platform_dispensary_id IS NULL) - )`; - } else if (onlyUnknown) { - whereClause += ` AND ( - (menu_type IS NULL OR menu_type = '' OR menu_type = 'unknown') - ${includeDutchieMissingPlatformId ? 
`OR (menu_type = 'dutchie' AND platform_dispensary_id IS NULL)` : ''} - )`; - } else if (onlyMissingPlatformId) { - whereClause += ` AND (menu_type = 'dutchie' AND platform_dispensary_id IS NULL)`; - } else if (includeDutchieMissingPlatformId) { - // Always attempt to resolve dutchie stores missing platform IDs - whereClause += ` AND (menu_type = 'dutchie' AND platform_dispensary_id IS NULL)`; - } - - let query_str = ` - SELECT ${DISPENSARY_COLUMNS} FROM dispensaries - ${whereClause} - ORDER BY name - `; - - if (limit) { - query_str += ` LIMIT $${paramIndex}`; - params.push(limit); - } - - const { rows: dispensaries } = await query(query_str, params); - console.log(`[MenuDetection] Found ${dispensaries.length} dispensaries to process (includeWebsiteCrawl=${includeWebsiteCrawl})`); - - const result: BulkDetectionResult = { - totalProcessed: 0, - totalSucceeded: 0, - totalFailed: 0, - totalSkipped: 0, - results: [], - errors: [], - }; - - for (const row of dispensaries) { - result.totalProcessed++; - - try { - const detectionResult = await detectAndResolveDispensary(row.id); - result.results.push(detectionResult); - - if (detectionResult.success) { - result.totalSucceeded++; - } else { - result.totalFailed++; - if (detectionResult.error) { - result.errors.push(`${detectionResult.dispensaryName}: ${detectionResult.error}`); - } - } - - // Rate limit between requests - await new Promise(r => setTimeout(r, 1000)); - } catch (error: any) { - result.totalFailed++; - result.errors.push(`${row.name || row.id}: ${error.message}`); - } - } - - console.log(`[MenuDetection] Bulk detection complete: ${result.totalSucceeded} succeeded, ${result.totalFailed} failed`); - return result; -} - -// ============================================================ -// SCHEDULED JOB EXECUTOR -// ============================================================ - -/** - * Execute the menu detection job (called by scheduler) - * - * Worker: Henry (Entry Point Finder) - * Uses METHOD 1 (reactEnv 
extraction) as primary method per user requirements. - * - * Scope filtering: - * - config.scope.states: Array of state codes to limit detection (e.g., ["AZ", "CA"]) - * - config.scope.storeIds: Array of specific store IDs to process - * - * Processes: - * - Stores with unknown/missing menu_type - * - Stores with missing platform_dispensary_id - * - Stores with slug changes that need re-resolution (from Alice) - * - Recently added stores (discovered by Alice) - */ -export async function executeMenuDetectionJob(config: Record = {}): Promise<{ - status: JobStatus; - itemsProcessed: number; - itemsSucceeded: number; - itemsFailed: number; - errorMessage?: string; - metadata?: any; -}> { - const state = config.state || 'AZ'; - const scope = config.scope as { states?: string[]; storeIds?: number[] } | undefined; - const onlyUnknown = config.onlyUnknown !== false; - // Default to true - always try to resolve platform IDs for dutchie stores - const onlyMissingPlatformId = config.onlyMissingPlatformId !== false; - const includeDutchieMissingPlatformId = config.includeDutchieMissingPlatformId !== false; - const includeSlugChanges = config.includeSlugChanges !== false; - const includeRecentlyAdded = config.includeRecentlyAdded !== false; - - const scopeDesc = scope?.states?.length - ? ` (states: ${scope.states.join(', ')})` - : scope?.storeIds?.length - ? ` (${scope.storeIds.length} specific stores)` - : ` (state: ${state})`; - - console.log(`[Henry - Entry Point Finder] Executing scheduled job${scopeDesc}...`); - - try { - const result = await runBulkDetection({ - state: scope ? undefined : state, // Use scope if provided, otherwise fall back to state - scope, - onlyUnknown, - onlyMissingPlatformId, - includeDutchieMissingPlatformId, - includeSlugChanges, - includeRecentlyAdded, - }); - - const status: JobStatus = - result.totalFailed === 0 ? 'success' : - result.totalSucceeded === 0 ? 
'error' : 'partial'; - - return { - status, - itemsProcessed: result.totalProcessed, - itemsSucceeded: result.totalSucceeded, - itemsFailed: result.totalFailed, - errorMessage: result.errors.length > 0 ? result.errors.slice(0, 5).join('; ') : undefined, - metadata: { - scope: scope || { states: [state] }, - onlyUnknown, - onlyMissingPlatformId, - includeSlugChanges, - includeRecentlyAdded, - providerCounts: countByProvider(result.results), - }, - }; - } catch (error: any) { - return { - status: 'error', - itemsProcessed: 0, - itemsSucceeded: 0, - itemsFailed: 0, - errorMessage: error.message, - metadata: { scope: scope || { states: [state] } }, - }; - } -} - -/** - * Count results by detected provider - */ -function countByProvider(results: DetectionResult[]): Record<string, number> { - const counts: Record<string, number> = {}; - for (const r of results) { - counts[r.detectedProvider] = (counts[r.detectedProvider] || 0) + 1; - } - return counts; -} - -// ============================================================ -// UTILITY FUNCTIONS -// ============================================================ - -/** - * Get detection stats for dashboard - */ -export async function getDetectionStats(): Promise<{ - totalDispensaries: number; - withMenuType: number; - withPlatformId: number; - needsDetection: number; - byProvider: Record<string, number>; -}> { - const { rows } = await query(` - SELECT - COUNT(*) as total, - COUNT(*) FILTER (WHERE menu_type IS NOT NULL AND menu_type != '' AND menu_type != 'unknown') as with_menu_type, - COUNT(*) FILTER (WHERE platform_dispensary_id IS NOT NULL) as with_platform_id, - COUNT(*) FILTER (WHERE menu_url IS NOT NULL AND (menu_type IS NULL OR menu_type = '' OR menu_type = 'unknown')) as needs_detection - FROM dispensaries - WHERE state = 'AZ' - `); - - const stats = rows[0] || {}; - - // Get provider breakdown - const { rows: providerRows } = await query(` - SELECT menu_type, COUNT(*) as count - FROM dispensaries - WHERE state = 'AZ' AND menu_type IS NOT NULL AND menu_type != '' - 
GROUP BY menu_type - ORDER BY count DESC - `); - - const byProvider: Record<string, number> = {}; - for (const row of providerRows) { - byProvider[row.menu_type] = parseInt(row.count, 10); - } - - return { - totalDispensaries: parseInt(stats.total || '0', 10), - withMenuType: parseInt(stats.with_menu_type || '0', 10), - withPlatformId: parseInt(stats.with_platform_id || '0', 10), - needsDetection: parseInt(stats.needs_detection || '0', 10), - byProvider, - }; -} - -/** - * Get dispensaries needing detection - * Includes dispensaries with website but no menu_url for website crawl discovery - */ -export async function getDispensariesNeedingDetection(options: { - state?: string; - limit?: number; - includeWebsiteCrawl?: boolean; -} = {}): Promise<Dispensary[]> { - const { state = 'AZ', limit = 100, includeWebsiteCrawl = true } = options; - - const { rows } = await query( - ` - SELECT ${DISPENSARY_COLUMNS} FROM dispensaries - WHERE state = $1 - AND ( - (menu_url IS NOT NULL AND (menu_type IS NULL OR menu_type = '' OR menu_type = 'unknown' - OR (menu_type = 'dutchie' AND platform_dispensary_id IS NULL))) - ${includeWebsiteCrawl ? `OR ( - menu_url IS NULL - AND website IS NOT NULL - AND website != '' - AND (provider_detection_data IS NULL OR NOT (provider_detection_data->>'not_crawlable')::boolean) - )` : ''} - ) - ORDER BY name - LIMIT $2 - `, - [state, limit] - ); - - return rows.map(mapDbRowToDispensary); -} diff --git a/backend/src/dutchie-az/services/product-crawler.ts b/backend/src/dutchie-az/services/product-crawler.ts deleted file mode 100644 index 9b48c246..00000000 --- a/backend/src/dutchie-az/services/product-crawler.ts +++ /dev/null @@ -1,1197 +0,0 @@ -/** - * Dutchie AZ Product Crawler Service - * - * Crawls products from Dutchie dispensaries and stores them in the dutchie_az database. - * Handles normalization from GraphQL response to database entities. - * - * IMPORTANT: Uses chunked batch processing per CLAUDE.md Rule #15 to avoid OOM. 
- */ - -import { query, getClient } from '../db/connection'; -import { fetchAllProducts, fetchAllProductsBothModes } from './graphql-client'; -import { mapDbRowToDispensary } from './discovery'; -import { - DutchieRawProduct, - DutchieProduct, - DutchieProductSnapshot, - DutchieProductOptionSnapshot, - DutchiePOSChild, - Dispensary, - CrawlMode, - StockStatus, - deriveStockStatus, - calculateTotalQuantity, -} from '../types'; -import { downloadProductImage, imageExists } from '../../utils/image-storage'; - -// Use shared dispensary columns (handles optional columns like provider_detection_data) -import { DISPENSARY_COLUMNS } from '../db/dispensary-columns'; - -// ============================================================ -// BATCH PROCESSING CONFIGURATION -// ============================================================ - -/** Chunk size for batch DB writes (per CLAUDE.md Rule #15) */ -const BATCH_CHUNK_SIZE = 100; - -// ============================================================ -// NORMALIZATION FUNCTIONS -// ============================================================ - -/** - * Convert price to cents - */ -function toCents(price?: number): number | undefined { - if (price === undefined || price === null) return undefined; - return Math.round(price * 100); -} - -/** - * Get min value from array of numbers - */ -function getMin(arr?: number[]): number | undefined { - if (!arr || arr.length === 0) return undefined; - return Math.min(...arr.filter((n) => n !== null && n !== undefined)); -} - -/** - * Get max value from array of numbers - */ -function getMax(arr?: number[]): number | undefined { - if (!arr || arr.length === 0) return undefined; - return Math.max(...arr.filter((n) => n !== null && n !== undefined)); -} - -/** - * Normalize a value to boolean - * Handles Dutchie API returning {} or [] or other non-boolean values - * that would cause "invalid input syntax for type boolean" errors - */ -function normBool(v: any, defaultVal: boolean = false): boolean { 
- if (v === true) return true; - if (v === false) return false; - // Log unexpected object/array values once for debugging - if (v !== null && v !== undefined && typeof v === 'object') { - console.warn(`[normBool] Unexpected object value, coercing to ${defaultVal}:`, JSON.stringify(v)); - } - return defaultVal; -} - -/** - * Normalize a value to Date or undefined - * Handles Dutchie API returning {} or [] or other non-date values - * that would cause "invalid input syntax for type timestamp" errors - */ -function normDate(v: any): Date | undefined { - if (!v) return undefined; - // Reject objects/arrays that aren't dates - if (typeof v === 'object' && !(v instanceof Date)) { - console.warn(`[normDate] Unexpected object value, ignoring:`, JSON.stringify(v)); - return undefined; - } - // Try parsing - const d = new Date(v); - if (isNaN(d.getTime())) { - console.warn(`[normDate] Invalid date value, ignoring:`, v); - return undefined; - } - return d; -} - -/** - * Extract cName (Dutchie slug) from menuUrl or dispensary slug - * Handles URL formats: - * - https://dutchie.com/embedded-menu/AZ-Deeply-Rooted -> AZ-Deeply-Rooted - * - https://dutchie.com/dispensary/sol-flower-dispensary-mcclintock -> sol-flower-dispensary-mcclintock - * Falls back to dispensary.slug if menuUrl extraction fails - */ -function extractCName(dispensary: Dispensary): string { - if (dispensary.menuUrl) { - try { - const url = new URL(dispensary.menuUrl); - // Extract last path segment: /embedded-menu/X or /dispensary/X - const segments = url.pathname.split('/').filter(Boolean); - if (segments.length >= 2) { - const cName = segments[segments.length - 1]; - if (cName) { - console.log(`[ProductCrawler] Extracted cName "${cName}" from menuUrl`); - return cName; - } - } - } catch (e) { - console.warn(`[ProductCrawler] Failed to parse menuUrl: ${dispensary.menuUrl}`); - } - } - // Fallback to slug - console.log(`[ProductCrawler] Using dispensary slug "${dispensary.slug}" as cName`); - return 
dispensary.slug; -} - -/** - * Normalize a POSMetaData.children entry to DutchieProductOptionSnapshot - */ -function normalizeOption(child: DutchiePOSChild): DutchieProductOptionSnapshot { - return { - optionId: child.canonicalID || child.canonicalPackageId || child.canonicalSKU || child.option || 'unknown', - canonicalId: child.canonicalID, - canonicalPackageId: child.canonicalPackageId, - canonicalSKU: child.canonicalSKU, - canonicalName: child.canonicalName, - canonicalCategory: child.canonicalCategory, - canonicalCategoryId: child.canonicalCategoryId, - canonicalBrandId: child.canonicalBrandId, - canonicalBrandName: child.canonicalBrandName, - canonicalStrainId: child.canonicalStrainId, - canonicalVendorId: child.canonicalVendorId, - optionLabel: child.option, - packageQuantity: child.packageQuantity, - recEquivalent: child.recEquivalent, - standardEquivalent: child.standardEquivalent, - priceCents: toCents(child.price), - recPriceCents: toCents(child.recPrice), - medPriceCents: toCents(child.medPrice), - quantity: child.quantity, - quantityAvailable: child.quantityAvailable, - kioskQuantityAvailable: child.kioskQuantityAvailable, - activeBatchTags: child.activeBatchTags, - canonicalImgUrl: child.canonicalImgUrl, - canonicalLabResultUrl: child.canonicalLabResultUrl, - canonicalEffectivePotencyMg: child.canonicalEffectivePotencyMg, - rawChildPayload: child, - }; -} - -/** - * Normalize a raw Dutchie product to DutchieProduct (canonical identity) - */ -export function normalizeProduct( - raw: DutchieRawProduct, - dispensaryId: number, - platformDispensaryId: string -): Partial<DutchieProduct> { - return { - dispensaryId, - platform: 'dutchie', - externalProductId: raw._id || raw.id || '', - platformDispensaryId, - cName: raw.cName, - name: raw.Name, - - // Brand - brandName: raw.brandName || raw.brand?.name, - brandId: raw.brandId || raw.brand?.id, - brandLogoUrl: raw.brandLogo || raw.brand?.imageUrl, - - // Classification - type: raw.type, - subcategory: raw.subcategory, - 
strainType: raw.strainType, - provider: raw.provider, - - // Potency - thc: raw.THC, - thcContent: raw.THCContent?.range?.[0], - cbd: raw.CBD, - cbdContent: raw.CBDContent?.range?.[0], - cannabinoidsV2: raw.cannabinoidsV2, - effects: raw.effects, - - // Status / flags - status: raw.Status, - medicalOnly: normBool(raw.medicalOnly, false), - recOnly: normBool(raw.recOnly, false), - featured: normBool(raw.featured, false), - comingSoon: normBool(raw.comingSoon, false), - certificateOfAnalysisEnabled: normBool(raw.certificateOfAnalysisEnabled, false), - - isBelowThreshold: normBool(raw.isBelowThreshold, false), - isBelowKioskThreshold: normBool(raw.isBelowKioskThreshold, false), - optionsBelowThreshold: normBool(raw.optionsBelowThreshold, false), - optionsBelowKioskThreshold: normBool(raw.optionsBelowKioskThreshold, false), - - // Derived stock status - stockStatus: deriveStockStatus(raw), - totalQuantityAvailable: calculateTotalQuantity(raw), - - // Images - primaryImageUrl: raw.Image || raw.images?.[0]?.url, - images: raw.images, - - // Misc - measurements: raw.measurements, - weight: typeof raw.weight === 'number' ? 
String(raw.weight) : raw.weight, - pastCNames: raw.pastCNames, - - createdAtDutchie: normDate(raw.createdAt), - updatedAtDutchie: normDate(raw.updatedAt), - - latestRawPayload: raw, - }; -} - -/** - * Normalize a raw Dutchie product to DutchieProductSnapshot (time-series data) - */ -export function normalizeSnapshot( - raw: DutchieRawProduct, - dutchieProductId: number, - dispensaryId: number, - platformDispensaryId: string, - pricingType: 'rec' | 'med' | 'unknown', - crawlMode: CrawlMode = 'mode_a' -): Partial<DutchieProductSnapshot> { - const children = raw.POSMetaData?.children || []; - const options = children.map(normalizeOption); - - // Aggregate prices from various sources - const recPrices = raw.recPrices || []; - const medPrices = raw.medicalPrices || []; - const recSpecialPrices = raw.recSpecialPrices || []; - const medSpecialPrices = raw.medicalSpecialPrices || []; - const wholesalePrices = raw.wholesalePrices || []; - - // Also consider child prices - const childRecPrices = children.map((c) => c.recPrice).filter((p) => p !== undefined) as number[]; - const childMedPrices = children.map((c) => c.medPrice).filter((p) => p !== undefined) as number[]; - const childPrices = children.map((c) => c.price).filter((p) => p !== undefined) as number[]; - - // Aggregate inventory - use calculateTotalQuantity for proper null handling - const totalQty = calculateTotalQuantity(raw); - const hasAnyKioskQty = children.some(c => typeof c.kioskQuantityAvailable === 'number'); - const totalKioskQty = hasAnyKioskQty - ? 
children.reduce((sum, c) => sum + (c.kioskQuantityAvailable || 0), 0) - : null; - - // Determine if on special - const isOnSpecial = - raw.special === true || - (raw.specialData?.saleSpecials && raw.specialData.saleSpecials.length > 0) || - (recSpecialPrices.length > 0 && recSpecialPrices[0] !== null) || - (medSpecialPrices.length > 0 && medSpecialPrices[0] !== null); - - return { - dutchieProductId, - dispensaryId, - platformDispensaryId, - externalProductId: raw._id || raw.id || '', - pricingType, - crawlMode, - - status: raw.Status, - featured: normBool(raw.featured, false), - special: normBool(isOnSpecial, false), - medicalOnly: normBool(raw.medicalOnly, false), - recOnly: normBool(raw.recOnly, false), - - // Product was present in feed - isPresentInFeed: true, - - // Derived stock status - stockStatus: deriveStockStatus(raw), - - // Price summary - recMinPriceCents: toCents(getMin([...recPrices, ...childRecPrices, ...childPrices])), - recMaxPriceCents: toCents(getMax([...recPrices, ...childRecPrices, ...childPrices])), - recMinSpecialPriceCents: toCents(getMin(recSpecialPrices)), - medMinPriceCents: toCents(getMin([...medPrices, ...childMedPrices])), - medMaxPriceCents: toCents(getMax([...medPrices, ...childMedPrices])), - medMinSpecialPriceCents: toCents(getMin(medSpecialPrices)), - wholesaleMinPriceCents: toCents(getMin(wholesalePrices)), - - // Inventory summary - null = unknown, 0 = all OOS - totalQuantityAvailable: totalQty, - totalKioskQuantityAvailable: totalKioskQty, - manualInventory: normBool(raw.manualInventory, false), - isBelowThreshold: normBool(raw.isBelowThreshold, false), - isBelowKioskThreshold: normBool(raw.isBelowKioskThreshold, false), - - options, - rawPayload: raw, - crawledAt: new Date(), - }; -} - -// ============================================================ -// DATABASE OPERATIONS -// ============================================================ - -/** - * Upsert a DutchieProduct record - */ -async function upsertProduct(product: 
Partial): Promise { - const result = await query<{ id: number }>( - ` - INSERT INTO dutchie_products ( - dispensary_id, platform, external_product_id, platform_dispensary_id, - c_name, name, brand_name, brand_id, brand_logo_url, - type, subcategory, strain_type, provider, - thc, thc_content, cbd, cbd_content, cannabinoids_v2, effects, - status, medical_only, rec_only, featured, coming_soon, certificate_of_analysis_enabled, - is_below_threshold, is_below_kiosk_threshold, options_below_threshold, options_below_kiosk_threshold, - stock_status, total_quantity_available, - primary_image_url, images, measurements, weight, past_c_names, - created_at_dutchie, updated_at_dutchie, latest_raw_payload, updated_at - ) VALUES ( - $1, $2, $3, $4, - $5, $6, $7, $8, $9, - $10, $11, $12, $13, - $14, $15, $16, $17, $18, $19, - $20, $21, $22, $23, $24, $25, - $26, $27, $28, $29, - $30, $31, - $32, $33, $34, $35, $36, - $37, $38, $39, NOW() - ) - ON CONFLICT (dispensary_id, external_product_id) DO UPDATE SET - c_name = EXCLUDED.c_name, - name = EXCLUDED.name, - brand_name = EXCLUDED.brand_name, - brand_id = EXCLUDED.brand_id, - brand_logo_url = EXCLUDED.brand_logo_url, - type = EXCLUDED.type, - subcategory = EXCLUDED.subcategory, - strain_type = EXCLUDED.strain_type, - provider = EXCLUDED.provider, - thc = EXCLUDED.thc, - thc_content = EXCLUDED.thc_content, - cbd = EXCLUDED.cbd, - cbd_content = EXCLUDED.cbd_content, - cannabinoids_v2 = EXCLUDED.cannabinoids_v2, - effects = EXCLUDED.effects, - status = EXCLUDED.status, - medical_only = EXCLUDED.medical_only, - rec_only = EXCLUDED.rec_only, - featured = EXCLUDED.featured, - coming_soon = EXCLUDED.coming_soon, - certificate_of_analysis_enabled = EXCLUDED.certificate_of_analysis_enabled, - is_below_threshold = EXCLUDED.is_below_threshold, - is_below_kiosk_threshold = EXCLUDED.is_below_kiosk_threshold, - options_below_threshold = EXCLUDED.options_below_threshold, - options_below_kiosk_threshold = EXCLUDED.options_below_kiosk_threshold, - 
stock_status = EXCLUDED.stock_status, - total_quantity_available = EXCLUDED.total_quantity_available, - primary_image_url = EXCLUDED.primary_image_url, - images = EXCLUDED.images, - measurements = EXCLUDED.measurements, - weight = EXCLUDED.weight, - past_c_names = EXCLUDED.past_c_names, - created_at_dutchie = EXCLUDED.created_at_dutchie, - updated_at_dutchie = EXCLUDED.updated_at_dutchie, - latest_raw_payload = EXCLUDED.latest_raw_payload, - updated_at = NOW() - RETURNING id - `, - [ - product.dispensaryId, - product.platform, - product.externalProductId, - product.platformDispensaryId, - product.cName, - product.name, - product.brandName, - product.brandId, - product.brandLogoUrl, - product.type, - product.subcategory, - product.strainType, - product.provider, - product.thc, - product.thcContent, - product.cbd, - product.cbdContent, - product.cannabinoidsV2 ? JSON.stringify(product.cannabinoidsV2) : null, - product.effects ? JSON.stringify(product.effects) : null, - product.status, - product.medicalOnly, - product.recOnly, - product.featured, - product.comingSoon, - product.certificateOfAnalysisEnabled, - product.isBelowThreshold, - product.isBelowKioskThreshold, - product.optionsBelowThreshold, - product.optionsBelowKioskThreshold, - product.stockStatus, - product.totalQuantityAvailable, - product.primaryImageUrl, - product.images ? JSON.stringify(product.images) : null, - product.measurements ? JSON.stringify(product.measurements) : null, - product.weight, - product.pastCNames, - product.createdAtDutchie, - product.updatedAtDutchie, - product.latestRawPayload ? 
JSON.stringify(product.latestRawPayload) : null, - ] - ); - - return result.rows[0].id; -} - -/** - * Download product image and update local image URLs - * Skips download if local image already exists for this product+URL combo - */ -async function downloadAndUpdateProductImage( - productId: number, - dispensaryId: number, - externalProductId: string, - primaryImageUrl: string | undefined -): Promise<{ downloaded: boolean; error?: string }> { - if (!primaryImageUrl) { - return { downloaded: false, error: 'No image URL' }; - } - - try { - // Check if we already have this image locally - const exists = await imageExists(dispensaryId, externalProductId, primaryImageUrl); - if (exists) { - return { downloaded: false }; - } - - // Download and process the image - const result = await downloadProductImage(primaryImageUrl, dispensaryId, externalProductId); - - if (!result.success || !result.urls) { - return { downloaded: false, error: result.error }; - } - - // Update the product record with local image URLs - await query( - ` - UPDATE dutchie_products - SET - local_image_url = $1, - local_image_thumb_url = $2, - local_image_medium_url = $3, - original_image_url = COALESCE(original_image_url, primary_image_url), - updated_at = NOW() - WHERE id = $4 - `, - [result.urls.full, result.urls.thumb, result.urls.medium, productId] - ); - - return { downloaded: true }; - } catch (error: any) { - return { downloaded: false, error: error.message }; - } -} - -/** - * Insert a snapshot record - */ -async function insertSnapshot(snapshot: Partial): Promise { - const result = await query<{ id: number }>( - ` - INSERT INTO dutchie_product_snapshots ( - dutchie_product_id, dispensary_id, platform_dispensary_id, external_product_id, - pricing_type, crawl_mode, status, featured, special, medical_only, rec_only, - is_present_in_feed, stock_status, - rec_min_price_cents, rec_max_price_cents, rec_min_special_price_cents, - med_min_price_cents, med_max_price_cents, med_min_special_price_cents, 
- wholesale_min_price_cents, - total_quantity_available, total_kiosk_quantity_available, manual_inventory, - is_below_threshold, is_below_kiosk_threshold, - options, raw_payload, crawled_at - ) VALUES ( - $1, $2, $3, $4, - $5, $6, $7, $8, $9, $10, $11, - $12, $13, - $14, $15, $16, - $17, $18, $19, - $20, - $21, $22, $23, - $24, $25, - $26, $27, $28 - ) - RETURNING id - `, - [ - snapshot.dutchieProductId, - snapshot.dispensaryId, - snapshot.platformDispensaryId, - snapshot.externalProductId, - snapshot.pricingType, - snapshot.crawlMode, - snapshot.status, - snapshot.featured, - snapshot.special, - snapshot.medicalOnly, - snapshot.recOnly, - snapshot.isPresentInFeed ?? true, - snapshot.stockStatus, - snapshot.recMinPriceCents, - snapshot.recMaxPriceCents, - snapshot.recMinSpecialPriceCents, - snapshot.medMinPriceCents, - snapshot.medMaxPriceCents, - snapshot.medMinSpecialPriceCents, - snapshot.wholesaleMinPriceCents, - snapshot.totalQuantityAvailable, - snapshot.totalKioskQuantityAvailable, - snapshot.manualInventory, - snapshot.isBelowThreshold, - snapshot.isBelowKioskThreshold, - JSON.stringify(snapshot.options || []), - JSON.stringify(snapshot.rawPayload || {}), - snapshot.crawledAt, - ] - ); - - return result.rows[0].id; -} - -// ============================================================ -// BATCH DATABASE OPERATIONS (per CLAUDE.md Rule #15) -// ============================================================ - -/** - * Helper to chunk an array into smaller arrays - */ -function chunkArray<T>(array: T[], size: number): T[][] { - const chunks: T[][] = []; - for (let i = 0; i < array.length; i += size) { - chunks.push(array.slice(i, i + size)); - } - return chunks; -} - -/** - * Batch upsert products - processes in chunks to avoid OOM - * Returns a Map of externalProductId -> database id - */ -async function batchUpsertProducts( - products: Partial<DutchieProduct>[] -): Promise<Map<string, number>> { - const productIdMap = new Map<string, number>(); - const chunks = chunkArray(products, BATCH_CHUNK_SIZE); - - 
console.log(`[ProductCrawler] Batch upserting ${products.length} products in ${chunks.length} chunks of ${BATCH_CHUNK_SIZE}...`); - - for (let i = 0; i < chunks.length; i++) { - const chunk = chunks[i]; - - // Process each product in the chunk - for (const product of chunk) { - try { - const id = await upsertProduct(product); - if (product.externalProductId) { - productIdMap.set(product.externalProductId, id); - } - } catch (error: any) { - console.error(`[ProductCrawler] Error upserting product ${product.externalProductId}:`, error.message); - } - } - - // Log progress - if ((i + 1) % 5 === 0 || i === chunks.length - 1) { - console.log(`[ProductCrawler] Upserted chunk ${i + 1}/${chunks.length} (${productIdMap.size} products so far)`); - } - } - - return productIdMap; -} - -/** - * Batch insert snapshots - processes in chunks to avoid OOM - */ -async function batchInsertSnapshots( - snapshots: Partial<DutchieProductSnapshot>[] -): Promise<number> { - const chunks = chunkArray(snapshots, BATCH_CHUNK_SIZE); - let inserted = 0; - - console.log(`[ProductCrawler] Batch inserting ${snapshots.length} snapshots in ${chunks.length} chunks of ${BATCH_CHUNK_SIZE}...`); - - for (let i = 0; i < chunks.length; i++) { - const chunk = chunks[i]; - - // Process each snapshot in the chunk - for (const snapshot of chunk) { - try { - await insertSnapshot(snapshot); - inserted++; - } catch (error: any) { - console.error(`[ProductCrawler] Error inserting snapshot for ${snapshot.externalProductId}:`, error.message); - } - } - - // Log progress - if ((i + 1) % 5 === 0 || i === chunks.length - 1) { - console.log(`[ProductCrawler] Inserted snapshot chunk ${i + 1}/${chunks.length} (${inserted} snapshots so far)`); - } - } - - return inserted; -} - -/** - * Update dispensary last_crawled_at and product_count - */ -async function updateDispensaryCrawlStats( - dispensaryId: number, - productCount: number -): Promise<void> { - // Update last_crawl_at to track when we last crawled - // Skip product_count as that column may not exist 
- await query( - ` - UPDATE dispensaries - SET last_crawl_at = NOW(), updated_at = NOW() - WHERE id = $1 - `, - [dispensaryId] - ); -} - -/** - * Mark products as missing from feed (visibility-loss detection) - * Creates a snapshot with isPresentInFeed=false and stockStatus='missing_from_feed' - * for products that were NOT in the UNION of Mode A and Mode B product lists - * - * Bella (Product Sync) visibility tracking: - * - Sets visibility_lost=TRUE and visibility_lost_at=NOW() for disappearing products - * - Records visibility event in snapshot metadata JSONB - * - NEVER deletes products, just marks them as visibility-lost - * - * IMPORTANT: Uses UNION of both modes to avoid false positives - * If the union is empty (possible outage), we skip marking to avoid data corruption - */ -async function markMissingProducts( - dispensaryId: number, - platformDispensaryId: string, - modeAProductIds: Set<string>, - modeBProductIds: Set<string>, - pricingType: 'rec' | 'med', - workerName: string = 'Bella' -): Promise<{ markedMissing: number; newlyLost: number }> { - // Build UNION of Mode A + Mode B product IDs - const unionProductIds = new Set([...Array.from(modeAProductIds), ...Array.from(modeBProductIds)]); - - // OUTAGE DETECTION: If union is empty, something went wrong - don't mark anything as missing - if (unionProductIds.size === 0) { - console.warn(`[${workerName} - Product Sync] OUTAGE DETECTED: Both Mode A and Mode B returned 0 products. 
Skipping visibility-loss marking.`); - return { markedMissing: 0, newlyLost: 0 }; - } - - // Get all existing products for this dispensary that were not in the UNION - // Also check if they were already marked as visibility_lost to track new losses - const { rows: missingProducts } = await query<{ - id: number; - external_product_id: string; - name: string; - visibility_lost: boolean; - }>( - ` - SELECT id, external_product_id, name, COALESCE(visibility_lost, FALSE) as visibility_lost - FROM dutchie_products - WHERE dispensary_id = $1 - AND external_product_id NOT IN (SELECT unnest($2::text[])) - `, - [dispensaryId, Array.from(unionProductIds)] - ); - - if (missingProducts.length === 0) { - return { markedMissing: 0, newlyLost: 0 }; - } - - // Separate newly lost products from already-lost products - const newlyLostProducts = missingProducts.filter(p => !p.visibility_lost); - const alreadyLostProducts = missingProducts.filter(p => p.visibility_lost); - - console.log(`[${workerName} - Product Sync] Visibility check: ${missingProducts.length} products missing (${newlyLostProducts.length} newly lost, ${alreadyLostProducts.length} already lost)`); - - const crawledAt = new Date(); - - // Build all missing snapshots with visibility_events metadata - const missingSnapshots: Partial[] = missingProducts.map(product => { - const isNewlyLost = !product.visibility_lost; - return { - dutchieProductId: product.id, - dispensaryId, - platformDispensaryId, - externalProductId: product.external_product_id, - pricingType, - crawlMode: 'mode_a' as CrawlMode, - status: undefined, - featured: false, - special: false, - medicalOnly: false, - recOnly: false, - isPresentInFeed: false, - stockStatus: 'missing_from_feed' as StockStatus, - totalQuantityAvailable: undefined, - manualInventory: false, - isBelowThreshold: false, - isBelowKioskThreshold: false, - options: [], - rawPayload: { - _missingFromFeed: true, - lastKnownName: product.name, - visibility_events: isNewlyLost ? 
[{ - event_type: 'visibility_lost', - timestamp: crawledAt.toISOString(), - worker_name: workerName, - }] : [], - }, - crawledAt, - }; - }); - - // Batch insert missing snapshots - const snapshotsInserted = await batchInsertSnapshots(missingSnapshots); - - // Batch update product visibility status in chunks - const productIds = missingProducts.map(p => p.id); - const productChunks = chunkArray(productIds, BATCH_CHUNK_SIZE); - - console.log(`[${workerName} - Product Sync] Updating ${productIds.length} product visibility in ${productChunks.length} chunks...`); - - for (const chunk of productChunks) { - // Update all products: set stock_status to missing - // Only set visibility_lost_at for NEWLY lost products (not already lost) - await query( - ` - UPDATE dutchie_products - SET - stock_status = 'missing_from_feed', - total_quantity_available = NULL, - visibility_lost = TRUE, - visibility_lost_at = CASE - WHEN visibility_lost IS NULL OR visibility_lost = FALSE THEN NOW() - ELSE visibility_lost_at -- Keep existing timestamp for already-lost products - END, - updated_at = NOW() - WHERE id = ANY($1::int[]) - `, - [chunk] - ); - } - - console.log(`[${workerName} - Product Sync] Marked ${snapshotsInserted} products as missing, ${newlyLostProducts.length} newly visibility-lost`); - return { markedMissing: snapshotsInserted, newlyLost: newlyLostProducts.length }; -} - -/** - * Restore visibility for products that reappeared in the feed - * Called when products that were previously visibility_lost=TRUE are now found in the feed - * - * Bella (Product Sync) visibility tracking: - * - Sets visibility_lost=FALSE and visibility_restored_at=NOW() - * - Logs the restoration event - */ -async function restoreVisibilityForProducts( - dispensaryId: number, - productIds: Set, - workerName: string = 'Bella' -): Promise { - if (productIds.size === 0) { - return 0; - } - - // Find products that were visibility_lost and are now in the feed - const { rows: restoredProducts } = await query<{ 
id: number; external_product_id: string }>( - ` - SELECT id, external_product_id - FROM dutchie_products - WHERE dispensary_id = $1 - AND visibility_lost = TRUE - AND external_product_id = ANY($2::text[]) - `, - [dispensaryId, Array.from(productIds)] - ); - - if (restoredProducts.length === 0) { - return 0; - } - - console.log(`[${workerName} - Product Sync] Restoring visibility for ${restoredProducts.length} products that reappeared`); - - // Batch update restored products - const restoredIds = restoredProducts.map(p => p.id); - const chunks = chunkArray(restoredIds, BATCH_CHUNK_SIZE); - - for (const chunk of chunks) { - await query( - ` - UPDATE dutchie_products - SET - visibility_lost = FALSE, - visibility_restored_at = NOW(), - updated_at = NOW() - WHERE id = ANY($1::int[]) - `, - [chunk] - ); - } - - console.log(`[${workerName} - Product Sync] Restored visibility for ${restoredProducts.length} products`); - return restoredProducts.length; -} - -// ============================================================ -// CRAWL ORCHESTRATION -// ============================================================ - -export interface CrawlResult { - success: boolean; - dispensaryId: number; - productsFound: number; - productsFetched: number; // Alias for productsFound (used by worker) - productsUpserted: number; - snapshotsCreated: number; - modeAProducts?: number; - modeBProducts?: number; - missingProductsMarked?: number; - visibilityLostCount?: number; // Products newly marked as visibility_lost - visibilityRestoredCount?: number; // Products restored from visibility_lost - imagesDownloaded?: number; - imageErrors?: number; - errorMessage?: string; - httpStatus?: number; // HTTP status code for error classification - durationMs: number; -} - -/** - * Process a batch of products from a single crawl mode - * IMPORTANT: Stores ALL products, never filters before DB - * Uses chunked batch processing per CLAUDE.md Rule #15 to avoid OOM - * Returns the set of external product IDs 
that were processed - */ -async function processProducts( - products: DutchieRawProduct[], - dispensary: Dispensary, - pricingType: 'rec' | 'med', - crawlMode: CrawlMode, - options: { downloadImages?: boolean } = {} -): Promise<{ upserted: number; snapshots: number; productIds: Set; imagesDownloaded: number; imageErrors: number }> { - const { downloadImages = true } = options; - const productIds = new Set(); - let imagesDownloaded = 0; - let imageErrors = 0; - - console.log(`[ProductCrawler] Processing ${products.length} products using chunked batch processing...`); - - // Step 1: Normalize all products and collect IDs - const normalizedProducts: Partial[] = []; - const rawByExternalId = new Map(); - - for (const raw of products) { - const externalId = raw._id || raw.id || ''; - productIds.add(externalId); - rawByExternalId.set(externalId, raw); - - const normalized = normalizeProduct( - raw, - dispensary.id, - dispensary.platformDispensaryId! - ); - normalizedProducts.push(normalized); - } - - // Step 2: Batch upsert products (chunked) - const productIdMap = await batchUpsertProducts(normalizedProducts); - const upserted = productIdMap.size; - - // Step 3: Create and batch insert snapshots (chunked) - // IMPORTANT: Do this BEFORE image downloads to ensure snapshots are created even if images fail - const snapshots: Partial[] = []; - for (const [externalId, productId] of Array.from(productIdMap.entries())) { - const raw = rawByExternalId.get(externalId); - if (raw) { - const snapshot = normalizeSnapshot( - raw, - productId, - dispensary.id, - dispensary.platformDispensaryId!, - pricingType, - crawlMode - ); - snapshots.push(snapshot); - } - } - - const snapshotsInserted = await batchInsertSnapshots(snapshots); - - // Step 4: Download images in chunks (if enabled) - // This is done AFTER snapshots to ensure core data is saved even if image downloads fail - if (downloadImages) { - const imageChunks = chunkArray(Array.from(productIdMap.entries()), BATCH_CHUNK_SIZE); - 
console.log(`[ProductCrawler] Downloading images in ${imageChunks.length} chunks...`); - - for (let i = 0; i < imageChunks.length; i++) { - const chunk = imageChunks[i]; - for (const [externalId, productId] of chunk) { - const normalized = normalizedProducts.find(p => p.externalProductId === externalId); - if (normalized?.primaryImageUrl) { - try { - const imageResult = await downloadAndUpdateProductImage( - productId, - dispensary.id, - externalId, - normalized.primaryImageUrl - ); - if (imageResult.downloaded) { - imagesDownloaded++; - } else if (imageResult.error && imageResult.error !== 'No image URL') { - imageErrors++; - } - } catch (error: any) { - imageErrors++; - } - } - } - - if ((i + 1) % 5 === 0 || i === imageChunks.length - 1) { - console.log(`[ProductCrawler] Image download chunk ${i + 1}/${imageChunks.length} (${imagesDownloaded} downloaded, ${imageErrors} errors)`); - } - } - } - - // Clear references to help GC - normalizedProducts.length = 0; - rawByExternalId.clear(); - - return { upserted, snapshots: snapshotsInserted, productIds, imagesDownloaded, imageErrors }; -} - -/** - * Crawl all products for a single dispensary using BOTH modes - * Mode A: UI parity (Status: Active) - * Mode B: MAX COVERAGE (no Status filter, bypass thresholds) - * - * This ensures we capture ALL products including out-of-stock items - */ -export interface CrawlProgressCallback { - productsFound: number; - productsUpserted: number; - snapshotsCreated: number; - currentPage: number; - totalPages?: number; -} - -export async function crawlDispensaryProducts( - dispensary: Dispensary, - pricingType: 'rec' | 'med' = 'rec', - options: { - useBothModes?: boolean; - downloadImages?: boolean; - onProgress?: (progress: CrawlProgressCallback) => Promise; - } = {} -): Promise { - const { useBothModes = true, downloadImages = true, onProgress } = options; - const startTime = Date.now(); - - if (!dispensary.platformDispensaryId) { - return { - success: false, - dispensaryId: 
dispensary.id, - productsFound: 0, - productsFetched: 0, - productsUpserted: 0, - snapshotsCreated: 0, - errorMessage: 'Missing platformDispensaryId', - durationMs: Date.now() - startTime, - }; - } - - try { - console.log(`[ProductCrawler] Crawling ${dispensary.name} (${dispensary.platformDispensaryId})...`); - - let totalUpserted = 0; - let totalSnapshots = 0; - let totalImagesDownloaded = 0; - let totalImageErrors = 0; - let modeAProducts = 0; - let modeBProducts = 0; - let missingMarked = 0; - - // Track product IDs separately for each mode (needed for missing product detection) - const modeAProductIds = new Set(); - const modeBProductIds = new Set(); - - // Extract cName for this specific dispensary (used for Puppeteer session & headers) - const cName = extractCName(dispensary); - console.log(`[ProductCrawler] Using cName="${cName}" for dispensary ${dispensary.name}`); - - if (useBothModes) { - // Run two-mode crawl for maximum coverage - const bothResults = await fetchAllProductsBothModes( - dispensary.platformDispensaryId, - pricingType, - { cName } - ); - - modeAProducts = bothResults.modeA.products.length; - modeBProducts = bothResults.modeB.products.length; - - console.log(`[ProductCrawler] Two-mode crawl: Mode A=${modeAProducts}, Mode B=${modeBProducts}, Merged=${bothResults.merged.products.length}`); - - // Collect Mode A product IDs - for (const p of bothResults.modeA.products) { - modeAProductIds.add(p._id); - } - - // Collect Mode B product IDs - for (const p of bothResults.modeB.products) { - modeBProductIds.add(p._id); - } - - // Process MERGED products (includes options from both modes) - if (bothResults.merged.products.length > 0) { - const mergedResult = await processProducts( - bothResults.merged.products, - dispensary, - pricingType, - 'mode_a', // Use mode_a for merged products (convention) - { downloadImages } - ); - totalUpserted = mergedResult.upserted; - totalSnapshots = mergedResult.snapshots; - totalImagesDownloaded = 
mergedResult.imagesDownloaded; - totalImageErrors = mergedResult.imageErrors; - - // Report progress - if (onProgress) { - await onProgress({ - productsFound: bothResults.merged.products.length, - productsUpserted: totalUpserted, - snapshotsCreated: totalSnapshots, - currentPage: 1, - totalPages: 1, - }); - } - } - } else { - // Single mode crawl (Mode A only) - const { products, crawlMode } = await fetchAllProducts( - dispensary.platformDispensaryId, - pricingType, - { crawlMode: 'mode_a', cName } - ); - - modeAProducts = products.length; - - // Collect Mode A product IDs - for (const p of products) { - modeAProductIds.add(p._id); - } - - const result = await processProducts(products, dispensary, pricingType, crawlMode, { downloadImages }); - totalUpserted = result.upserted; - totalSnapshots = result.snapshots; - totalImagesDownloaded = result.imagesDownloaded; - totalImageErrors = result.imageErrors; - - // Report progress - if (onProgress) { - await onProgress({ - productsFound: products.length, - productsUpserted: totalUpserted, - snapshotsCreated: totalSnapshots, - currentPage: 1, - totalPages: 1, - }); - } - } - - // Build union of all product IDs found in both modes - const allFoundProductIds = new Set([ - ...Array.from(modeAProductIds), - ...Array.from(modeBProductIds), - ]); - - // VISIBILITY RESTORATION: Check if any previously-lost products have reappeared - const visibilityRestored = await restoreVisibilityForProducts( - dispensary.id, - allFoundProductIds, - 'Bella' - ); - - // Mark products as missing using UNION of Mode A + Mode B - // The function handles outage detection (empty union = skip marking) - // Now also tracks newly lost products vs already-lost products - const missingResult = await markMissingProducts( - dispensary.id, - dispensary.platformDispensaryId, - modeAProductIds, - modeBProductIds, - pricingType, - 'Bella' - ); - missingMarked = missingResult.markedMissing; - const newlyLostCount = missingResult.newlyLost; - totalSnapshots += 
missingMarked; - - // Update dispensary stats - await updateDispensaryCrawlStats(dispensary.id, totalUpserted); - - console.log(`[Bella - Product Sync] Completed: ${totalUpserted} products, ${totalSnapshots} snapshots, ${missingMarked} missing, ${newlyLostCount} newly lost, ${visibilityRestored} restored, ${totalImagesDownloaded} images`); - - const totalProductsFound = modeAProducts + modeBProducts; - return { - success: true, - dispensaryId: dispensary.id, - productsFound: totalProductsFound, - productsFetched: totalProductsFound, - productsUpserted: totalUpserted, - snapshotsCreated: totalSnapshots, - modeAProducts, - modeBProducts, - missingProductsMarked: missingMarked, - visibilityLostCount: newlyLostCount, - visibilityRestoredCount: visibilityRestored, - imagesDownloaded: totalImagesDownloaded, - imageErrors: totalImageErrors, - durationMs: Date.now() - startTime, - }; - } catch (error: any) { - console.error(`[ProductCrawler] Failed to crawl ${dispensary.name}:`, error.message); - return { - success: false, - dispensaryId: dispensary.id, - productsFound: 0, - productsFetched: 0, - productsUpserted: 0, - snapshotsCreated: 0, - errorMessage: error.message, - durationMs: Date.now() - startTime, - }; - } -} - -/** - * Crawl all Arizona dispensaries - */ -export async function crawlAllArizonaDispensaries( - pricingType: 'rec' | 'med' = 'rec' -): Promise { - const results: CrawlResult[] = []; - - // Get all AZ dispensaries with platform IDs - const { rows: rawRows } = await query( - ` - SELECT ${DISPENSARY_COLUMNS} FROM dispensaries - WHERE state = 'AZ' AND menu_type = 'dutchie' AND platform_dispensary_id IS NOT NULL - ORDER BY id - ` - ); - const dispensaries = rawRows.map(mapDbRowToDispensary); - - console.log(`[ProductCrawler] Starting crawl of ${dispensaries.length} dispensaries...`); - - for (const dispensary of dispensaries) { - const result = await crawlDispensaryProducts(dispensary, pricingType); - results.push(result); - - // Delay between dispensaries - 
await new Promise((r) => setTimeout(r, 2000)); - } - - const successful = results.filter((r) => r.success).length; - const totalProducts = results.reduce((sum, r) => sum + r.productsUpserted, 0); - const totalSnapshots = results.reduce((sum, r) => sum + r.snapshotsCreated, 0); - - console.log(`[ProductCrawler] Completed: ${successful}/${dispensaries.length} stores, ${totalProducts} products, ${totalSnapshots} snapshots`); - - return results; -} diff --git a/backend/src/dutchie-az/services/retry-manager.ts b/backend/src/dutchie-az/services/retry-manager.ts deleted file mode 100644 index 95bad61b..00000000 --- a/backend/src/dutchie-az/services/retry-manager.ts +++ /dev/null @@ -1,435 +0,0 @@ -/** - * Unified Retry Manager - * - * Handles retry logic with exponential backoff, jitter, and - * intelligent error-based decisions (rotate proxy, rotate UA, etc.) - * - * Phase 1: Crawler Reliability & Stabilization - */ - -import { - CrawlErrorCodeType, - CrawlErrorCode, - classifyError, - getErrorMetadata, - isRetryable, - shouldRotateProxy, - shouldRotateUserAgent, - getBackoffMultiplier, -} from './error-taxonomy'; -import { DEFAULT_CONFIG } from './store-validator'; - -// ============================================================ -// RETRY CONFIGURATION -// ============================================================ - -export interface RetryConfig { - maxRetries: number; - baseBackoffMs: number; - maxBackoffMs: number; - backoffMultiplier: number; - jitterFactor: number; // 0.0 - 1.0 (percentage of backoff to randomize) -} - -export const DEFAULT_RETRY_CONFIG: RetryConfig = { - maxRetries: DEFAULT_CONFIG.maxRetries, - baseBackoffMs: DEFAULT_CONFIG.baseBackoffMs, - maxBackoffMs: DEFAULT_CONFIG.maxBackoffMs, - backoffMultiplier: DEFAULT_CONFIG.backoffMultiplier, - jitterFactor: 0.25, // +/- 25% jitter -}; - -// ============================================================ -// RETRY CONTEXT -// ============================================================ - -/** - * 
Context for tracking retry state across attempts - */ -export interface RetryContext { - attemptNumber: number; - maxAttempts: number; - lastErrorCode: CrawlErrorCodeType | null; - lastHttpStatus: number | null; - totalBackoffMs: number; - proxyRotated: boolean; - userAgentRotated: boolean; - startedAt: Date; -} - -/** - * Decision about what to do after an error - */ -export interface RetryDecision { - shouldRetry: boolean; - reason: string; - backoffMs: number; - rotateProxy: boolean; - rotateUserAgent: boolean; - errorCode: CrawlErrorCodeType; - attemptNumber: number; -} - -// ============================================================ -// RETRY MANAGER CLASS -// ============================================================ - -export class RetryManager { - private config: RetryConfig; - private context: RetryContext; - - constructor(config: Partial<RetryConfig> = {}) { - this.config = { ...DEFAULT_RETRY_CONFIG, ...config }; - this.context = this.createInitialContext(); - } - - /** - * Create initial retry context - */ - private createInitialContext(): RetryContext { - return { - attemptNumber: 0, - maxAttempts: this.config.maxRetries + 1, // +1 for initial attempt - lastErrorCode: null, - lastHttpStatus: null, - totalBackoffMs: 0, - proxyRotated: false, - userAgentRotated: false, - startedAt: new Date(), - }; - } - - /** - * Reset retry state for a new operation - */ - reset(): void { - this.context = this.createInitialContext(); - } - - /** - * Get current attempt number (1-based) - */ - getAttemptNumber(): number { - return this.context.attemptNumber + 1; - } - - /** - * Check if we should attempt (call before each attempt) - */ - shouldAttempt(): boolean { - return this.context.attemptNumber < this.context.maxAttempts; - } - - /** - * Record an attempt (call at start of each attempt) - */ - recordAttempt(): void { - this.context.attemptNumber++; - } - - /** - * Evaluate an error and decide what to do - */ - evaluateError( - error: Error | string | null, - httpStatus?: 
number - ): RetryDecision { - const errorCode = classifyError(error, httpStatus); - const metadata = getErrorMetadata(errorCode); - const attemptNumber = this.context.attemptNumber; - - // Update context - this.context.lastErrorCode = errorCode; - this.context.lastHttpStatus = httpStatus || null; - - // Check if error is retryable - if (!isRetryable(errorCode)) { - return { - shouldRetry: false, - reason: `Error ${errorCode} is not retryable: ${metadata.description}`, - backoffMs: 0, - rotateProxy: false, - rotateUserAgent: false, - errorCode, - attemptNumber, - }; - } - - // Check if we've exhausted retries - if (!this.shouldAttempt()) { - return { - shouldRetry: false, - reason: `Max retries (${this.config.maxRetries}) exhausted`, - backoffMs: 0, - rotateProxy: false, - rotateUserAgent: false, - errorCode, - attemptNumber, - }; - } - - // Calculate backoff with exponential increase and jitter - const baseBackoff = this.calculateBackoff(attemptNumber, errorCode); - const backoffWithJitter = this.addJitter(baseBackoff); - - // Track total backoff - this.context.totalBackoffMs += backoffWithJitter; - - // Determine rotation needs - const rotateProxy = shouldRotateProxy(errorCode); - const rotateUserAgent = shouldRotateUserAgent(errorCode); - - if (rotateProxy) this.context.proxyRotated = true; - if (rotateUserAgent) this.context.userAgentRotated = true; - - const rotationInfo = []; - if (rotateProxy) rotationInfo.push('rotate proxy'); - if (rotateUserAgent) rotationInfo.push('rotate UA'); - const rotationStr = rotationInfo.length > 0 ? 
` (${rotationInfo.join(', ')})` : ''; - - return { - shouldRetry: true, - reason: `Retrying after ${errorCode}${rotationStr}, backoff ${backoffWithJitter}ms`, - backoffMs: backoffWithJitter, - rotateProxy, - rotateUserAgent, - errorCode, - attemptNumber, - }; - } - - /** - * Calculate exponential backoff for an attempt - */ - private calculateBackoff(attemptNumber: number, errorCode: CrawlErrorCodeType): number { - // Base exponential: baseBackoff * multiplier^(attempt-1) - const exponential = this.config.baseBackoffMs * - Math.pow(this.config.backoffMultiplier, attemptNumber - 1); - - // Apply error-specific multiplier - const errorMultiplier = getBackoffMultiplier(errorCode); - const adjusted = exponential * errorMultiplier; - - // Cap at max backoff - return Math.min(adjusted, this.config.maxBackoffMs); - } - - /** - * Add jitter to backoff to prevent thundering herd - */ - private addJitter(backoffMs: number): number { - const jitterRange = backoffMs * this.config.jitterFactor; - // Random between -jitterRange and +jitterRange - const jitter = (Math.random() * 2 - 1) * jitterRange; - return Math.max(0, Math.round(backoffMs + jitter)); - } - - /** - * Get retry context summary - */ - getSummary(): RetryContextSummary { - const elapsedMs = Date.now() - this.context.startedAt.getTime(); - return { - attemptsMade: this.context.attemptNumber, - maxAttempts: this.context.maxAttempts, - lastErrorCode: this.context.lastErrorCode, - lastHttpStatus: this.context.lastHttpStatus, - totalBackoffMs: this.context.totalBackoffMs, - totalElapsedMs: elapsedMs, - proxyWasRotated: this.context.proxyRotated, - userAgentWasRotated: this.context.userAgentRotated, - }; - } -} - -export interface RetryContextSummary { - attemptsMade: number; - maxAttempts: number; - lastErrorCode: CrawlErrorCodeType | null; - lastHttpStatus: number | null; - totalBackoffMs: number; - totalElapsedMs: number; - proxyWasRotated: boolean; - userAgentWasRotated: boolean; -} - -// 
============================================================ -// CONVENIENCE FUNCTIONS -// ============================================================ - -/** - * Sleep for specified milliseconds - */ -export function sleep(ms: number): Promise<void> { - return new Promise(resolve => setTimeout(resolve, ms)); -} - -/** - * Execute a function with automatic retry logic - */ -export async function withRetry<T>( - fn: (attemptNumber: number) => Promise<T>, - config: Partial<RetryConfig> = {}, - callbacks?: { - onRetry?: (decision: RetryDecision) => void | Promise<void>; - onRotateProxy?: () => void | Promise<void>; - onRotateUserAgent?: () => void | Promise<void>; - } -): Promise<{ result: T; summary: RetryContextSummary }> { - const manager = new RetryManager(config); - - while (manager.shouldAttempt()) { - manager.recordAttempt(); - const attemptNumber = manager.getAttemptNumber(); - - try { - const result = await fn(attemptNumber); - return { result, summary: manager.getSummary() }; - } catch (error) { - const err = error instanceof Error ? error : new Error(String(error)); - const httpStatus = (error as any)?.status || (error as any)?.statusCode; - - const decision = manager.evaluateError(err, httpStatus); - - if (!decision.shouldRetry) { - // Re-throw with enhanced context - const enhancedError = new RetryExhaustedError( - `${err.message} (${decision.reason})`, - err, - manager.getSummary() - ); - throw enhancedError; - } - - // Notify callbacks - if (callbacks?.onRetry) { - await callbacks.onRetry(decision); - } - if (decision.rotateProxy && callbacks?.onRotateProxy) { - await callbacks.onRotateProxy(); - } - if (decision.rotateUserAgent && callbacks?.onRotateUserAgent) { - await callbacks.onRotateUserAgent(); - } - - // Log retry decision - console.log( - `[RetryManager] Attempt ${attemptNumber} failed: ${decision.errorCode}. ` + - `${decision.reason}. 
Waiting ${decision.backoffMs}ms before retry.` - ); - - // Wait before retry - await sleep(decision.backoffMs); - } - } - - // Should not reach here, but handle edge case - throw new RetryExhaustedError( - 'Max retries exhausted', - null, - manager.getSummary() - ); -} - -// ============================================================ -// CUSTOM ERROR CLASS -// ============================================================ - -export class RetryExhaustedError extends Error { - public readonly originalError: Error | null; - public readonly summary: RetryContextSummary; - public readonly errorCode: CrawlErrorCodeType; - - constructor( - message: string, - originalError: Error | null, - summary: RetryContextSummary - ) { - super(message); - this.name = 'RetryExhaustedError'; - this.originalError = originalError; - this.summary = summary; - this.errorCode = summary.lastErrorCode || CrawlErrorCode.UNKNOWN_ERROR; - } -} - -// ============================================================ -// BACKOFF CALCULATOR (for external use) -// ============================================================ - -/** - * Calculate next crawl time based on consecutive failures - */ -export function calculateNextCrawlDelay( - consecutiveFailures: number, - baseFrequencyMinutes: number, - maxBackoffMultiplier: number = 4.0 -): number { - // Each failure doubles the delay, up to max multiplier - const multiplier = Math.min( - Math.pow(2, consecutiveFailures), - maxBackoffMultiplier - ); - - const delayMinutes = baseFrequencyMinutes * multiplier; - - // Add jitter (0-10% of delay) - const jitterMinutes = delayMinutes * Math.random() * 0.1; - - return Math.round(delayMinutes + jitterMinutes); -} - -/** - * Calculate next crawl timestamp - */ -export function calculateNextCrawlAt( - consecutiveFailures: number, - baseFrequencyMinutes: number -): Date { - const delayMinutes = calculateNextCrawlDelay(consecutiveFailures, baseFrequencyMinutes); - return new Date(Date.now() + delayMinutes * 60 * 1000); 
-} - -// ============================================================ -// STATUS DETERMINATION -// ============================================================ - -/** - * Determine crawl status based on failure count - */ -export function determineCrawlStatus( - consecutiveFailures: number, - thresholds: { degraded: number; failed: number } = { degraded: 3, failed: 10 } -): 'active' | 'degraded' | 'failed' { - if (consecutiveFailures >= thresholds.failed) { - return 'failed'; - } - if (consecutiveFailures >= thresholds.degraded) { - return 'degraded'; - } - return 'active'; -} - -/** - * Determine if store should be auto-recovered - * (Called periodically to check if failed stores can be retried) - */ -export function shouldAttemptRecovery( - lastFailureAt: Date | null, - consecutiveFailures: number, - recoveryIntervalHours: number = 24 -): boolean { - if (!lastFailureAt) return true; - - // Wait longer for more failures - const waitHours = recoveryIntervalHours * Math.min(consecutiveFailures, 5); - const recoveryTime = new Date(lastFailureAt.getTime() + waitHours * 60 * 60 * 1000); - - return new Date() >= recoveryTime; -} - -// ============================================================ -// SINGLETON INSTANCE -// ============================================================ - -export const retryManager = new RetryManager(); diff --git a/backend/src/dutchie-az/services/scheduler.ts b/backend/src/dutchie-az/services/scheduler.ts deleted file mode 100644 index 25ce5f42..00000000 --- a/backend/src/dutchie-az/services/scheduler.ts +++ /dev/null @@ -1,1066 +0,0 @@ -/** - * Dutchie AZ Scheduler Service - * - * Handles scheduled crawling with JITTER - no fixed intervals! - * Each job re-schedules itself with a NEW random offset after each run. - * This makes timing "wander" around the clock, avoiding detectable patterns. 
- * - * Jitter Logic: - * nextRunAt = lastRunAt + baseIntervalMinutes + random(-jitterMinutes, +jitterMinutes) - * - * Example: 4-hour base with ±30min jitter = runs anywhere from 3h30m to 4h30m apart - */ - -import { query, getClient, getPool } from '../db/connection'; -import { crawlDispensaryProducts, CrawlResult } from './product-crawler'; -import { mapDbRowToDispensary } from './discovery'; -import { executeMenuDetectionJob } from './menu-detection'; -import { bulkEnqueueJobs, enqueueJob, getQueueStats } from './job-queue'; -import { JobSchedule, JobStatus, Dispensary } from '../types'; -import { DtLocationDiscoveryService } from '../discovery/DtLocationDiscoveryService'; -import { StateQueryService } from '../../multi-state/state-query-service'; - -// Scheduler poll interval (how often we check for due jobs) -const SCHEDULER_POLL_INTERVAL_MS = 60 * 1000; // 1 minute - -// Track running state -let isSchedulerRunning = false; -let schedulerInterval: NodeJS.Timeout | null = null; - -// ============================================================ -// JITTER CALCULATION -// ============================================================ - -/** - * Generate a random jitter value in minutes - * Returns a value between -jitterMinutes and +jitterMinutes - */ -function getRandomJitterMinutes(jitterMinutes: number): number { - // random() returns [0, 1), we want [-jitter, +jitter] - return (Math.random() * 2 - 1) * jitterMinutes; -} - -/** - * Calculate next run time with jitter - * nextRunAt = baseTime + baseIntervalMinutes + random(-jitter, +jitter) - */ -function calculateNextRunAt( - baseTime: Date, - baseIntervalMinutes: number, - jitterMinutes: number -): Date { - const jitter = getRandomJitterMinutes(jitterMinutes); - const totalMinutes = baseIntervalMinutes + jitter; - const totalMs = totalMinutes * 60 * 1000; - return new Date(baseTime.getTime() + totalMs); -} - -// ============================================================ -// DATABASE OPERATIONS -// 
============================================================ - -/** - * Get all job schedules - */ -export async function getAllSchedules(): Promise<JobSchedule[]> { - const { rows } = await query(` - SELECT - id, job_name, description, enabled, - base_interval_minutes, jitter_minutes, - worker_name, worker_role, - last_run_at, last_status, last_error_message, last_duration_ms, - next_run_at, job_config, created_at, updated_at - FROM job_schedules - ORDER BY job_name - `); - - return rows.map(row => ({ - id: row.id, - jobName: row.job_name, - description: row.description, - enabled: row.enabled, - baseIntervalMinutes: row.base_interval_minutes, - jitterMinutes: row.jitter_minutes, - workerName: row.worker_name, - workerRole: row.worker_role, - lastRunAt: row.last_run_at, - lastStatus: row.last_status, - lastErrorMessage: row.last_error_message, - lastDurationMs: row.last_duration_ms, - nextRunAt: row.next_run_at, - jobConfig: row.job_config, - createdAt: row.created_at, - updatedAt: row.updated_at, - })); -} - -/** - * Get a single schedule by ID - */ -export async function getScheduleById(id: number): Promise<JobSchedule | null> { - const { rows } = await query( - `SELECT * FROM job_schedules WHERE id = $1`, - [id] - ); - - if (rows.length === 0) return null; - - const row = rows[0]; - return { - id: row.id, - jobName: row.job_name, - description: row.description, - enabled: row.enabled, - baseIntervalMinutes: row.base_interval_minutes, - jitterMinutes: row.jitter_minutes, - workerName: row.worker_name, - workerRole: row.worker_role, - lastRunAt: row.last_run_at, - lastStatus: row.last_status, - lastErrorMessage: row.last_error_message, - lastDurationMs: row.last_duration_ms, - nextRunAt: row.next_run_at, - jobConfig: row.job_config, - createdAt: row.created_at, - updatedAt: row.updated_at, - }; -} - -/** - * Create a new schedule - */ -export async function createSchedule(schedule: { - jobName: string; - description?: string; - enabled?: boolean; - baseIntervalMinutes: number; - jitterMinutes: 
number; - workerName?: string; - workerRole?: string; - jobConfig?: Record; - startImmediately?: boolean; -}): Promise { - // Calculate initial nextRunAt - const nextRunAt = schedule.startImmediately - ? new Date() // Start immediately - : calculateNextRunAt(new Date(), schedule.baseIntervalMinutes, schedule.jitterMinutes); - - const { rows } = await query( - ` - INSERT INTO job_schedules ( - job_name, description, enabled, - base_interval_minutes, jitter_minutes, - worker_name, worker_role, - next_run_at, job_config - ) VALUES ($1, $2, $3, $4, $5, $6, $7, $8, $9) - RETURNING * - `, - [ - schedule.jobName, - schedule.description || null, - schedule.enabled ?? true, - schedule.baseIntervalMinutes, - schedule.jitterMinutes, - schedule.workerName || null, - schedule.workerRole || null, - nextRunAt, - schedule.jobConfig ? JSON.stringify(schedule.jobConfig) : null, - ] - ); - - const row = rows[0]; - const workerInfo = schedule.workerName ? ` (Worker: ${schedule.workerName})` : ''; - console.log(`[Scheduler] Created schedule "${schedule.jobName}"${workerInfo} - next run at ${nextRunAt.toISOString()}`); - - return { - id: row.id, - jobName: row.job_name, - description: row.description, - enabled: row.enabled, - baseIntervalMinutes: row.base_interval_minutes, - jitterMinutes: row.jitter_minutes, - workerName: row.worker_name, - workerRole: row.worker_role, - lastRunAt: row.last_run_at, - lastStatus: row.last_status, - lastErrorMessage: row.last_error_message, - lastDurationMs: row.last_duration_ms, - nextRunAt: row.next_run_at, - jobConfig: row.job_config, - createdAt: row.created_at, - updatedAt: row.updated_at, - }; -} - -/** - * Update a schedule - */ -export async function updateSchedule( - id: number, - updates: { - description?: string; - enabled?: boolean; - baseIntervalMinutes?: number; - jitterMinutes?: number; - jobConfig?: Record; - } -): Promise { - const setClauses: string[] = []; - const params: any[] = []; - let paramIndex = 1; - - if (updates.description 
!== undefined) { - setClauses.push(`description = $${paramIndex++}`); - params.push(updates.description); - } - if (updates.enabled !== undefined) { - setClauses.push(`enabled = $${paramIndex++}`); - params.push(updates.enabled); - } - if (updates.baseIntervalMinutes !== undefined) { - setClauses.push(`base_interval_minutes = $${paramIndex++}`); - params.push(updates.baseIntervalMinutes); - } - if (updates.jitterMinutes !== undefined) { - setClauses.push(`jitter_minutes = $${paramIndex++}`); - params.push(updates.jitterMinutes); - } - if (updates.jobConfig !== undefined) { - setClauses.push(`job_config = $${paramIndex++}`); - params.push(JSON.stringify(updates.jobConfig)); - } - - if (setClauses.length === 0) { - return getScheduleById(id); - } - - setClauses.push(`updated_at = NOW()`); - params.push(id); - - const { rows } = await query( - `UPDATE job_schedules SET ${setClauses.join(', ')} WHERE id = $${paramIndex} RETURNING *`, - params - ); - - if (rows.length === 0) return null; - - const row = rows[0]; - return { - id: row.id, - jobName: row.job_name, - description: row.description, - enabled: row.enabled, - baseIntervalMinutes: row.base_interval_minutes, - jitterMinutes: row.jitter_minutes, - lastRunAt: row.last_run_at, - lastStatus: row.last_status, - lastErrorMessage: row.last_error_message, - lastDurationMs: row.last_duration_ms, - nextRunAt: row.next_run_at, - jobConfig: row.job_config, - createdAt: row.created_at, - updatedAt: row.updated_at, - }; -} - -/** - * Delete a schedule - */ -export async function deleteSchedule(id: number): Promise<boolean> { - const result = await query(`DELETE FROM job_schedules WHERE id = $1`, [id]); - return (result.rowCount || 0) > 0; -} - -/** - * Mark a schedule as running - */ -async function markScheduleRunning(id: number): Promise<void> { - await query( - `UPDATE job_schedules SET last_status = 'running', updated_at = NOW() WHERE id = $1`, - [id] - ); -} - -/** - * Update schedule after job completion with NEW jittered next_run_at - 
*/ -async function updateScheduleAfterRun( - id: number, - status: JobStatus, - durationMs: number, - errorMessage?: string -): Promise<void> { - // Get current schedule to calculate new nextRunAt - const schedule = await getScheduleById(id); - if (!schedule) return; - - const now = new Date(); - const newNextRunAt = calculateNextRunAt( - now, - schedule.baseIntervalMinutes, - schedule.jitterMinutes - ); - - console.log(`[Scheduler] Schedule "${schedule.jobName}" completed (${status}). Next run: ${newNextRunAt.toISOString()}`); - - await query( - ` - UPDATE job_schedules SET - last_run_at = $2, - last_status = $3, - last_error_message = $4, - last_duration_ms = $5, - next_run_at = $6, - updated_at = NOW() - WHERE id = $1 - `, - [id, now, status, errorMessage || null, durationMs, newNextRunAt] - ); -} - -/** - * Create a job run log entry with worker metadata propagated from schedule - */ -async function createRunLog( - scheduleId: number, - jobName: string, - status: 'pending' | 'running', - workerName?: string, - workerRole?: string -): Promise<number> { - const { rows } = await query<{ id: number }>( - ` - INSERT INTO job_run_logs (schedule_id, job_name, status, worker_name, run_role, started_at) - VALUES ($1, $2, $3, $4, $5, NOW()) - RETURNING id - `, - [scheduleId, jobName, status, workerName || null, workerRole || null] - ); - return rows[0].id; -} - -/** - * Update a job run log entry - */ -async function updateRunLog( - runLogId: number, - status: 'success' | 'error' | 'partial', - results: { - durationMs: number; - errorMessage?: string; - itemsProcessed?: number; - itemsSucceeded?: number; - itemsFailed?: number; - metadata?: any; - } -): Promise<void> { - await query( - ` - UPDATE job_run_logs SET - status = $2, - completed_at = NOW(), - duration_ms = $3, - error_message = $4, - items_processed = $5, - items_succeeded = $6, - items_failed = $7, - metadata = $8 - WHERE id = $1 - `, - [ - runLogId, - status, - results.durationMs, - results.errorMessage || null, - 
results.itemsProcessed || 0, - results.itemsSucceeded || 0, - results.itemsFailed || 0, - results.metadata ? JSON.stringify(results.metadata) : null, - ] - ); -} - -/** - * Get job run logs - */ -export async function getRunLogs(options: { - scheduleId?: number; - jobName?: string; - limit?: number; - offset?: number; -}): Promise<{ logs: any[]; total: number }> { - const { scheduleId, jobName, limit = 50, offset = 0 } = options; - - let whereClause = 'WHERE 1=1'; - const params: any[] = []; - let paramIndex = 1; - - if (scheduleId) { - whereClause += ` AND schedule_id = $${paramIndex++}`; - params.push(scheduleId); - } - if (jobName) { - whereClause += ` AND job_name = $${paramIndex++}`; - params.push(jobName); - } - - params.push(limit, offset); - - const { rows } = await query( - ` - SELECT * FROM job_run_logs - ${whereClause} - ORDER BY created_at DESC - LIMIT $${paramIndex} OFFSET $${paramIndex + 1} - `, - params - ); - - const { rows: countRows } = await query( - `SELECT COUNT(*) as total FROM job_run_logs ${whereClause}`, - params.slice(0, -2) - ); - - return { - logs: rows, - total: parseInt(countRows[0]?.total || '0', 10), - }; -} - -// ============================================================ -// JOB EXECUTION -// ============================================================ - -/** - * Execute a job based on its name - */ -async function executeJob(schedule: JobSchedule): Promise<{ - status: JobStatus; - itemsProcessed: number; - itemsSucceeded: number; - itemsFailed: number; - errorMessage?: string; - metadata?: any; -}> { - const config = schedule.jobConfig || {}; - - switch (schedule.jobName) { - case 'dutchie_az_product_crawl': - return executeProductCrawl(config); - case 'dutchie_az_discovery': - return executeDiscovery(config); - case 'dutchie_az_menu_detection': - return executeMenuDetectionJob(config); - case 'dutchie_store_discovery': - return executeStoreDiscovery(config); - case 'analytics_refresh': - return executeAnalyticsRefresh(config); - 
default: - throw new Error(`Unknown job type: ${schedule.jobName}`); - } -} - -/** - * Execute the AZ Dutchie product crawl job (Worker: Bella) - * - * NEW BEHAVIOR: Instead of running crawls directly, this now ENQUEUES jobs - * into the crawl_jobs queue. Workers (running as separate replicas) will - * pick up and process these jobs. - * - * Scope filtering: - * - config.scope.states: Array of state codes to limit crawl (e.g., ["AZ", "CA"]) - * - config.scope.storeIds: Array of specific store IDs to crawl - * - * This allows: - * - Multiple workers to process jobs in parallel - * - No double-crawls (DB-level locking per dispensary) - * - Better scalability (add more worker replicas) - * - Sharding by state or store for parallel execution - * - Live monitoring of individual job progress - */ -async function executeProductCrawl(config: Record): Promise<{ - status: JobStatus; - itemsProcessed: number; - itemsSucceeded: number; - itemsFailed: number; - errorMessage?: string; - metadata?: any; -}> { - const pricingType = config.pricingType || 'rec'; - const useBothModes = config.useBothModes !== false; - const scope = config.scope as { states?: string[]; storeIds?: number[] } | undefined; - - const scopeDesc = scope?.states?.length - ? ` (states: ${scope.states.join(', ')})` - : scope?.storeIds?.length - ? 
` (${scope.storeIds.length} specific stores)` - : ' (all AZ stores)'; - - console.log(`[Bella - Product Sync] Starting product crawl job${scopeDesc}...`); - - // Build query based on scope - let whereClause = ` - WHERE menu_type = 'dutchie' - AND platform_dispensary_id IS NOT NULL - AND failed_at IS NULL - `; - const params: any[] = []; - let paramIndex = 1; - - // Apply scope filtering - if (scope?.storeIds?.length) { - whereClause += ` AND id = ANY($${paramIndex++})`; - params.push(scope.storeIds); - } else if (scope?.states?.length) { - whereClause += ` AND state = ANY($${paramIndex++})`; - params.push(scope.states); - } else { - // Default to AZ if no scope specified - whereClause += ` AND state = 'AZ'`; - } - - // Get all "ready" dispensaries matching scope - const { rows: rawRows } = await query( - ` - SELECT id FROM dispensaries - ${whereClause} - ORDER BY last_crawl_at ASC NULLS FIRST - `, - params - ); - const dispensaryIds = rawRows.map((r: any) => r.id); - - if (dispensaryIds.length === 0) { - return { - status: 'success', - itemsProcessed: 0, - itemsSucceeded: 0, - itemsFailed: 0, - metadata: { - message: 'No ready dispensaries to crawl. 
Run menu detection to discover more.', - scope: scope || 'all', - }, - }; - } - - console.log(`[Bella - Product Sync] Enqueueing crawl jobs for ${dispensaryIds.length} dispensaries...`); - - // Bulk enqueue jobs (skips dispensaries that already have pending/running jobs) - const { enqueued, skipped } = await bulkEnqueueJobs( - 'dutchie_product_crawl', - dispensaryIds, - { - priority: 0, - metadata: { pricingType, useBothModes }, - } - ); - - console.log(`[Bella - Product Sync] Enqueued ${enqueued} jobs, skipped ${skipped} (already queued)`); - - // Get current queue stats - const queueStats = await getQueueStats(); - - return { - status: 'success', - itemsProcessed: dispensaryIds.length, - itemsSucceeded: enqueued, - itemsFailed: 0, // Enqueue itself doesn't fail - metadata: { - enqueued, - skipped, - queueStats, - pricingType, - useBothModes, - scope: scope || 'all', - message: `Enqueued ${enqueued} jobs. Workers will process them. Check /scraper-monitor for progress.`, - }, - }; -} - -/** - * Execute the AZ Dutchie discovery job (placeholder) - */ -async function executeDiscovery(_config: Record): Promise<{ - status: JobStatus; - itemsProcessed: number; - itemsSucceeded: number; - itemsFailed: number; - errorMessage?: string; - metadata?: any; -}> { - // Placeholder - implement discovery logic - return { - status: 'success', - itemsProcessed: 0, - itemsSucceeded: 0, - itemsFailed: 0, - metadata: { message: 'Discovery not yet implemented' }, - }; -} - -/** - * Execute the Store Discovery job (Worker: Alice) - * - * Full discovery workflow: - * 1. Fetch master cities page from https://dutchie.com/cities - * 2. Upsert discovered states/cities into dutchie_discovery_cities - * 3. Crawl each city page to discover all stores - * 4. Detect new stores, slug changes, and removed stores - * 5. 
Mark retired stores (never delete) - * - * Scope filtering: - * - config.scope.states: Array of state codes to limit discovery (e.g., ["AZ", "CA"]) - * - config.scope.storeIds: Array of specific store IDs to process - */ -async function executeStoreDiscovery(config: Record): Promise<{ - status: JobStatus; - itemsProcessed: number; - itemsSucceeded: number; - itemsFailed: number; - errorMessage?: string; - metadata?: any; -}> { - const delayMs = config.delayMs || 2000; // Delay between cities - const scope = config.scope as { states?: string[]; storeIds?: number[] } | undefined; - - const scopeDesc = scope?.states?.length - ? ` (states: ${scope.states.join(', ')})` - : scope?.storeIds?.length - ? ` (${scope.storeIds.length} specific stores)` - : ' (all states)'; - - console.log(`[Alice - Store Discovery] Starting store discovery job${scopeDesc}...`); - - try { - const pool = getPool(); - const discoveryService = new DtLocationDiscoveryService(pool); - - // Get stats before - const statsBefore = await discoveryService.getStats(); - console.log(`[Alice - Store Discovery] Current stats: ${statsBefore.total} total locations, ${statsBefore.withCoordinates} with coordinates`); - - // Run full discovery with change detection - const result = await discoveryService.runFullDiscoveryWithChangeDetection({ - scope, - delayMs, - }); - - console.log(`[Alice - Store Discovery] Completed: ${result.statesDiscovered} states, ${result.citiesDiscovered} cities`); - console.log(`[Alice - Store Discovery] Stores found: ${result.totalLocationsFound} total`); - console.log(`[Alice - Store Discovery] Changes: +${result.newStoreCount} new, ~${result.updatedStoreCount} updated, =${result.slugChangedCount} slug changes, -${result.removedStoreCount} retired`); - - const totalChanges = result.newStoreCount + result.updatedStoreCount + result.slugChangedCount; - - return { - status: result.errors.length > 0 ? 
'partial' : 'success', - itemsProcessed: result.totalLocationsFound, - itemsSucceeded: totalChanges, - itemsFailed: result.errors.length, - errorMessage: result.errors.length > 0 ? result.errors.slice(0, 5).join('; ') : undefined, - metadata: { - statesDiscovered: result.statesDiscovered, - citiesDiscovered: result.citiesDiscovered, - totalLocationsFound: result.totalLocationsFound, - newStoreCount: result.newStoreCount, - updatedStoreCount: result.updatedStoreCount, - slugChangedCount: result.slugChangedCount, - removedStoreCount: result.removedStoreCount, - durationMs: result.durationMs, - errorCount: result.errors.length, - scope: scope || 'all', - statsBefore: { - total: statsBefore.total, - withCoordinates: statsBefore.withCoordinates, - }, - }, - }; - } catch (error: any) { - console.error('[Alice - Store Discovery] Job failed:', error.message); - return { - status: 'error', - itemsProcessed: 0, - itemsSucceeded: 0, - itemsFailed: 1, - errorMessage: error.message, - metadata: { error: error.message, scope: scope || 'all' }, - }; - } -} - -/** - * Execute the Analytics Refresh job (Worker: Oscar) - * - * Refreshes materialized views and analytics data. - * Uses StateQueryService to refresh mv_state_metrics and other views. 
- */ -async function executeAnalyticsRefresh(config: Record): Promise<{ - status: JobStatus; - itemsProcessed: number; - itemsSucceeded: number; - itemsFailed: number; - errorMessage?: string; - metadata?: any; -}> { - console.log('[Oscar - Analytics Refresh] Starting analytics refresh job...'); - - const startTime = Date.now(); - const refreshedViews: string[] = []; - const errors: string[] = []; - - try { - const pool = getPool(); - const stateService = new StateQueryService(pool); - - // Refresh state metrics materialized view - console.log('[Oscar - Analytics Refresh] Refreshing mv_state_metrics...'); - try { - await stateService.refreshMetrics(); - refreshedViews.push('mv_state_metrics'); - console.log('[Oscar - Analytics Refresh] mv_state_metrics refreshed successfully'); - } catch (error: any) { - console.error('[Oscar - Analytics Refresh] Failed to refresh mv_state_metrics:', error.message); - errors.push(`mv_state_metrics: ${error.message}`); - } - - // Refresh other analytics views if configured - if (config.refreshBrandViews !== false) { - console.log('[Oscar - Analytics Refresh] Refreshing brand analytics views...'); - try { - // Check if v_brand_state_presence exists and refresh if needed - await pool.query(` - SELECT 1 FROM pg_matviews WHERE matviewname = 'v_brand_state_presence' LIMIT 1 - `).then(async (result) => { - if (result.rows.length > 0) { - await pool.query('REFRESH MATERIALIZED VIEW CONCURRENTLY v_brand_state_presence'); - refreshedViews.push('v_brand_state_presence'); - console.log('[Oscar - Analytics Refresh] v_brand_state_presence refreshed'); - } - }).catch(() => { - // View doesn't exist, skip - }); - } catch (error: any) { - errors.push(`v_brand_state_presence: ${error.message}`); - } - } - - const durationMs = Date.now() - startTime; - - console.log(`[Oscar - Analytics Refresh] Completed: ${refreshedViews.length} views refreshed in ${Math.round(durationMs / 1000)}s`); - - return { - status: errors.length > 0 ? 
(refreshedViews.length > 0 ? 'partial' : 'error') : 'success', - itemsProcessed: refreshedViews.length + errors.length, - itemsSucceeded: refreshedViews.length, - itemsFailed: errors.length, - errorMessage: errors.length > 0 ? errors.join('; ') : undefined, - metadata: { - refreshedViews, - errorCount: errors.length, - errors: errors.length > 0 ? errors : undefined, - durationMs, - }, - }; - } catch (error: any) { - console.error('[Oscar - Analytics Refresh] Job failed:', error.message); - return { - status: 'error', - itemsProcessed: 0, - itemsSucceeded: 0, - itemsFailed: 1, - errorMessage: error.message, - metadata: { error: error.message }, - }; - } -} - -// ============================================================ -// SCHEDULER RUNNER -// ============================================================ - -/** - * Check for due jobs and run them - */ -async function checkAndRunDueJobs(): Promise<void> { - try { - // Get enabled schedules where nextRunAt <= now - const { rows } = await query( - ` - SELECT * FROM job_schedules - WHERE enabled = true - AND next_run_at IS NOT NULL - AND next_run_at <= NOW() - AND (last_status IS NULL OR last_status != 'running') - ORDER BY next_run_at ASC - ` - ); - - if (rows.length === 0) return; - - console.log(`[Scheduler] Found ${rows.length} due job(s)`); - - for (const row of rows) { - const schedule: JobSchedule = { - id: row.id, - jobName: row.job_name, - description: row.description, - enabled: row.enabled, - baseIntervalMinutes: row.base_interval_minutes, - jitterMinutes: row.jitter_minutes, - lastRunAt: row.last_run_at, - lastStatus: row.last_status, - lastErrorMessage: row.last_error_message, - lastDurationMs: row.last_duration_ms, - nextRunAt: row.next_run_at, - jobConfig: row.job_config, - createdAt: row.created_at, - updatedAt: row.updated_at, - }; - - await runScheduledJob(schedule); - } - } catch (error) { - console.error('[Scheduler] Error checking for due jobs:', error); - } -} - -/** - * Run a single scheduled job - */ 
-async function runScheduledJob(schedule: JobSchedule): Promise<void> { - const startTime = Date.now(); - const workerInfo = schedule.workerName ? ` [Worker: ${schedule.workerName}]` : ''; - - console.log(`[Scheduler]${workerInfo} Starting job "${schedule.jobName}"...`); - - // Mark as running - await markScheduleRunning(schedule.id); - - // Create run log entry with worker metadata propagated from schedule - const runLogId = await createRunLog( - schedule.id, - schedule.jobName, - 'running', - schedule.workerName, - schedule.workerRole - ); - - try { - // Execute the job - const result = await executeJob(schedule); - - const durationMs = Date.now() - startTime; - - // Determine final status (exclude 'running' and null) - const finalStatus: 'success' | 'error' | 'partial' = - result.status === 'running' || result.status === null - ? 'success' - : result.status; - - // Update run log - await updateRunLog(runLogId, finalStatus, { - durationMs, - errorMessage: result.errorMessage, - itemsProcessed: result.itemsProcessed, - itemsSucceeded: result.itemsSucceeded, - itemsFailed: result.itemsFailed, - metadata: result.metadata, - }); - - // Update schedule with NEW jittered next_run_at - await updateScheduleAfterRun( - schedule.id, - result.status, - durationMs, - result.errorMessage - ); - - console.log(`[Scheduler] Job "${schedule.jobName}" completed in ${Math.round(durationMs / 1000)}s (${result.status})`); - - } catch (error: any) { - const durationMs = Date.now() - startTime; - - console.error(`[Scheduler] Job "${schedule.jobName}" failed:`, error.message); - - // Update run log with error - await updateRunLog(runLogId, 'error', { - durationMs, - errorMessage: error.message, - itemsProcessed: 0, - itemsSucceeded: 0, - itemsFailed: 0, - }); - - // Update schedule with NEW jittered next_run_at - await updateScheduleAfterRun(schedule.id, 'error', durationMs, error.message); - } -} - -// ============================================================ -// PUBLIC API -// 
============================================================ - -/** - * Start the scheduler - */ -export function startScheduler(): void { - if (isSchedulerRunning) { - console.log('[Scheduler] Scheduler is already running'); - return; - } - - isSchedulerRunning = true; - console.log(`[Scheduler] Starting scheduler (polling every ${SCHEDULER_POLL_INTERVAL_MS / 1000}s)...`); - - // Immediately check for due jobs - checkAndRunDueJobs(); - - // Set up interval to check for due jobs - schedulerInterval = setInterval(checkAndRunDueJobs, SCHEDULER_POLL_INTERVAL_MS); -} - -/** - * Stop the scheduler - */ -export function stopScheduler(): void { - if (!isSchedulerRunning) { - console.log('[Scheduler] Scheduler is not running'); - return; - } - - isSchedulerRunning = false; - - if (schedulerInterval) { - clearInterval(schedulerInterval); - schedulerInterval = null; - } - - console.log('[Scheduler] Scheduler stopped'); -} - -/** - * Get scheduler status - */ -export function getSchedulerStatus(): { - running: boolean; - pollIntervalMs: number; -} { - return { - running: isSchedulerRunning, - pollIntervalMs: SCHEDULER_POLL_INTERVAL_MS, - }; -} - -/** - * Trigger immediate execution of a schedule - */ -export async function triggerScheduleNow(scheduleId: number): Promise<{ - success: boolean; - message: string; -}> { - const schedule = await getScheduleById(scheduleId); - if (!schedule) { - return { success: false, message: 'Schedule not found' }; - } - - if (schedule.lastStatus === 'running') { - return { success: false, message: 'Job is already running' }; - } - - // Run the job - await runScheduledJob(schedule); - - return { success: true, message: 'Job triggered successfully' }; -} - -/** - * Initialize default schedules if they don't exist - * - * Named Workers: - * - Bella: GraphQL Product Sync (crawls products from Dutchie) - 4hr - * - Henry: Entry Point Finder (detects menu providers and resolves platform IDs) - 24hr - * - Alice: Store Discovery (discovers new 
locations from city pages) - 24hr - * - Oscar: Analytics Refresh (refreshes materialized views) - 1hr - */ -export async function initializeDefaultSchedules(): Promise<void> { - const schedules = await getAllSchedules(); - - // Check if product crawl schedule exists (Worker: Bella) - const productCrawlExists = schedules.some(s => s.jobName === 'dutchie_az_product_crawl'); - if (!productCrawlExists) { - await createSchedule({ - jobName: 'dutchie_az_product_crawl', - description: 'Crawl all AZ Dutchie dispensary products', - enabled: true, - baseIntervalMinutes: 240, // 4 hours - jitterMinutes: 30, // ±30 minutes - workerName: 'Bella', - workerRole: 'GraphQL Product Sync', - jobConfig: { pricingType: 'rec', useBothModes: true }, - startImmediately: false, - }); - console.log('[Scheduler] Created default product crawl schedule (Worker: Bella)'); - } - - // Check if menu detection schedule exists (Worker: Henry) - const menuDetectionExists = schedules.some(s => s.jobName === 'dutchie_az_menu_detection'); - if (!menuDetectionExists) { - await createSchedule({ - jobName: 'dutchie_az_menu_detection', - description: 'Detect menu providers and resolve platform IDs for AZ dispensaries', - enabled: true, - baseIntervalMinutes: 1440, // 24 hours - jitterMinutes: 60, // ±1 hour - workerName: 'Henry', - workerRole: 'Entry Point Finder', - jobConfig: { state: 'AZ', onlyUnknown: true }, - startImmediately: false, - }); - console.log('[Scheduler] Created default menu detection schedule (Worker: Henry)'); - } - - // Check if store discovery schedule exists (Worker: Alice) - const storeDiscoveryExists = schedules.some(s => s.jobName === 'dutchie_store_discovery'); - if (!storeDiscoveryExists) { - await createSchedule({ - jobName: 'dutchie_store_discovery', - description: 'Discover new Dutchie dispensary locations from city pages', - enabled: true, - baseIntervalMinutes: 1440, // 24 hours - jitterMinutes: 120, // ±2 hours - workerName: 'Alice', - workerRole: 'Store Discovery', - jobConfig: 
{ delayMs: 2000 }, - startImmediately: false, - }); - console.log('[Scheduler] Created default store discovery schedule (Worker: Alice)'); - } - - // Check if analytics refresh schedule exists (Worker: Oscar) - const analyticsRefreshExists = schedules.some(s => s.jobName === 'analytics_refresh'); - if (!analyticsRefreshExists) { - await createSchedule({ - jobName: 'analytics_refresh', - description: 'Refresh analytics materialized views (mv_state_metrics, etc.)', - enabled: true, - baseIntervalMinutes: 60, // 1 hour - jitterMinutes: 10, // ±10 minutes - workerName: 'Oscar', - workerRole: 'Analytics Refresh', - jobConfig: { refreshBrandViews: true }, - startImmediately: false, - }); - console.log('[Scheduler] Created default analytics refresh schedule (Worker: Oscar)'); - } -} - -// Re-export for backward compatibility -export { crawlDispensaryProducts as crawlSingleDispensary } from './product-crawler'; - -export async function triggerImmediateCrawl(): Promise<{ success: boolean; message: string }> { - const schedules = await getAllSchedules(); - const productCrawl = schedules.find(s => s.jobName === 'dutchie_az_product_crawl'); - if (productCrawl) { - return triggerScheduleNow(productCrawl.id); - } - return { success: false, message: 'Product crawl schedule not found' }; -} diff --git a/backend/src/dutchie-az/services/store-validator.ts b/backend/src/dutchie-az/services/store-validator.ts deleted file mode 100644 index e3dbd878..00000000 --- a/backend/src/dutchie-az/services/store-validator.ts +++ /dev/null @@ -1,465 +0,0 @@ -/** - * Store Configuration Validator - * - * Validates and sanitizes store configurations before crawling. - * Applies defaults for missing values and logs warnings. 
- * - * Phase 1: Crawler Reliability & Stabilization - */ - -import { CrawlErrorCode, CrawlErrorCodeType } from './error-taxonomy'; - -// ============================================================ -// DEFAULT CONFIGURATION -// ============================================================ - -/** - * Default crawl configuration values - */ -export const DEFAULT_CONFIG = { - // Scheduling - crawlFrequencyMinutes: 240, // 4 hours - minCrawlGapMinutes: 2, // Minimum 2 minutes between crawls - - // Retries - maxRetries: 3, - baseBackoffMs: 1000, // 1 second - maxBackoffMs: 60000, // 1 minute - backoffMultiplier: 2.0, // Exponential backoff - - // Timeouts - requestTimeoutMs: 30000, // 30 seconds - pageLoadTimeoutMs: 60000, // 60 seconds - - // Limits - maxProductsPerPage: 100, - maxPages: 50, - - // Proxy - proxyRotationEnabled: true, - proxyRotationOnFailure: true, - - // User Agent - userAgentRotationEnabled: true, - userAgentRotationOnFailure: true, -} as const; - -// ============================================================ -// STORE CONFIG INTERFACE -// ============================================================ - -/** - * Raw store configuration from database - */ -export interface RawStoreConfig { - id: number; - name: string; - slug?: string; - platform?: string; - menuType?: string; - platformDispensaryId?: string; - menuUrl?: string; - website?: string; - - // Crawl config - crawlFrequencyMinutes?: number; - maxRetries?: number; - currentProxyId?: number; - currentUserAgent?: string; - - // Status - crawlStatus?: string; - consecutiveFailures?: number; - backoffMultiplier?: number; - lastCrawlAt?: Date; - lastSuccessAt?: Date; - lastFailureAt?: Date; - lastErrorCode?: string; - nextCrawlAt?: Date; -} - -/** - * Validated and sanitized store configuration - */ -export interface ValidatedStoreConfig { - id: number; - name: string; - slug: string; - platform: string; - menuType: string; - platformDispensaryId: string; - menuUrl: string; - - // Crawl config 
(with defaults applied) - crawlFrequencyMinutes: number; - maxRetries: number; - currentProxyId: number | null; - currentUserAgent: string | null; - - // Status - crawlStatus: 'active' | 'degraded' | 'paused' | 'failed'; - consecutiveFailures: number; - backoffMultiplier: number; - lastCrawlAt: Date | null; - lastSuccessAt: Date | null; - lastFailureAt: Date | null; - lastErrorCode: CrawlErrorCodeType | null; - nextCrawlAt: Date | null; - - // Validation metadata - isValid: boolean; - validationErrors: ValidationError[]; - validationWarnings: ValidationWarning[]; -} - -// ============================================================ -// VALIDATION TYPES -// ============================================================ - -export interface ValidationError { - field: string; - message: string; - code: CrawlErrorCodeType; -} - -export interface ValidationWarning { - field: string; - message: string; - appliedDefault?: any; -} - -export interface ValidationResult { - isValid: boolean; - config: ValidatedStoreConfig | null; - errors: ValidationError[]; - warnings: ValidationWarning[]; -} - -// ============================================================ -// VALIDATOR CLASS -// ============================================================ - -export class StoreValidator { - private errors: ValidationError[] = []; - private warnings: ValidationWarning[] = []; - - /** - * Validate and sanitize a store configuration - */ - validate(raw: RawStoreConfig): ValidationResult { - this.errors = []; - this.warnings = []; - - // Required field validation - this.validateRequired(raw); - - // If critical errors, return early - if (this.errors.length > 0) { - return { - isValid: false, - config: null, - errors: this.errors, - warnings: this.warnings, - }; - } - - // Build validated config with defaults - const config = this.buildValidatedConfig(raw); - - return { - isValid: this.errors.length === 0, - config, - errors: this.errors, - warnings: this.warnings, - }; - } - - /** - * Validate 
required fields - */ - private validateRequired(raw: RawStoreConfig): void { - if (!raw.id) { - this.addError('id', 'Store ID is required', CrawlErrorCode.INVALID_CONFIG); - } - - if (!raw.name) { - this.addError('name', 'Store name is required', CrawlErrorCode.INVALID_CONFIG); - } - - if (!raw.platformDispensaryId) { - this.addError( - 'platformDispensaryId', - 'Platform dispensary ID is required for crawling', - CrawlErrorCode.MISSING_PLATFORM_ID - ); - } - - if (!raw.menuType || raw.menuType === 'unknown') { - this.addError( - 'menuType', - 'Menu type must be detected before crawling', - CrawlErrorCode.INVALID_CONFIG - ); - } - } - - /** - * Build validated config with defaults applied - */ - private buildValidatedConfig(raw: RawStoreConfig): ValidatedStoreConfig { - // Slug - const slug = raw.slug || this.generateSlug(raw.name); - if (!raw.slug) { - this.addWarning('slug', 'Slug was missing, generated from name', slug); - } - - // Platform - const platform = raw.platform || 'dutchie'; - if (!raw.platform) { - this.addWarning('platform', 'Platform was missing, defaulting to dutchie', platform); - } - - // Menu URL - const menuUrl = raw.menuUrl || this.generateMenuUrl(raw.platformDispensaryId!, platform); - if (!raw.menuUrl) { - this.addWarning('menuUrl', 'Menu URL was missing, generated from platform ID', menuUrl); - } - - // Crawl frequency - const crawlFrequencyMinutes = this.validateNumeric( - raw.crawlFrequencyMinutes, - 'crawlFrequencyMinutes', - DEFAULT_CONFIG.crawlFrequencyMinutes, - 60, // min: 1 hour - 1440 // max: 24 hours - ); - - // Max retries - const maxRetries = this.validateNumeric( - raw.maxRetries, - 'maxRetries', - DEFAULT_CONFIG.maxRetries, - 1, // min - 10 // max - ); - - // Backoff multiplier - const backoffMultiplier = this.validateNumeric( - raw.backoffMultiplier, - 'backoffMultiplier', - 1.0, - 1.0, // min - 10.0 // max - ); - - // Crawl status - const crawlStatus = this.validateCrawlStatus(raw.crawlStatus); - - // Consecutive failures - 
const consecutiveFailures = Math.max(0, raw.consecutiveFailures || 0); - - // Last error code - const lastErrorCode = this.validateErrorCode(raw.lastErrorCode); - - return { - id: raw.id, - name: raw.name, - slug, - platform, - menuType: raw.menuType!, - platformDispensaryId: raw.platformDispensaryId!, - menuUrl, - - crawlFrequencyMinutes, - maxRetries, - currentProxyId: raw.currentProxyId || null, - currentUserAgent: raw.currentUserAgent || null, - - crawlStatus, - consecutiveFailures, - backoffMultiplier, - lastCrawlAt: raw.lastCrawlAt || null, - lastSuccessAt: raw.lastSuccessAt || null, - lastFailureAt: raw.lastFailureAt || null, - lastErrorCode, - nextCrawlAt: raw.nextCrawlAt || null, - - isValid: true, - validationErrors: [], - validationWarnings: this.warnings, - }; - } - - /** - * Validate numeric value with bounds - */ - private validateNumeric( - value: number | undefined, - field: string, - defaultValue: number, - min: number, - max: number - ): number { - if (value === undefined || value === null) { - this.addWarning(field, `Missing, defaulting to ${defaultValue}`, defaultValue); - return defaultValue; - } - - if (value < min) { - this.addWarning(field, `Value ${value} below minimum ${min}, using minimum`, min); - return min; - } - - if (value > max) { - this.addWarning(field, `Value ${value} above maximum ${max}, using maximum`, max); - return max; - } - - return value; - } - - /** - * Validate crawl status - */ - private validateCrawlStatus(status?: string): 'active' | 'degraded' | 'paused' | 'failed' { - const validStatuses = ['active', 'degraded', 'paused', 'failed']; - if (!status || !validStatuses.includes(status)) { - if (status) { - this.addWarning('crawlStatus', `Invalid status "${status}", defaulting to active`, 'active'); - } - return 'active'; - } - return status as 'active' | 'degraded' | 'paused' | 'failed'; - } - - /** - * Validate error code - */ - private validateErrorCode(code?: string): CrawlErrorCodeType | null { - if (!code) return 
null; - const validCodes = Object.values(CrawlErrorCode); - if (!validCodes.includes(code as CrawlErrorCodeType)) { - this.addWarning('lastErrorCode', `Invalid error code "${code}"`, null); - return CrawlErrorCode.UNKNOWN_ERROR; - } - return code as CrawlErrorCodeType; - } - - /** - * Generate slug from name - */ - private generateSlug(name: string): string { - return name - .toLowerCase() - .replace(/[^a-z0-9]+/g, '-') - .replace(/^-+|-+$/g, '') - .substring(0, 100); - } - - /** - * Generate menu URL from platform ID - */ - private generateMenuUrl(platformId: string, platform: string): string { - if (platform === 'dutchie') { - return `https://dutchie.com/embedded-menu/${platformId}`; - } - return `https://${platform}.com/menu/${platformId}`; - } - - /** - * Add validation error - */ - private addError(field: string, message: string, code: CrawlErrorCodeType): void { - this.errors.push({ field, message, code }); - console.warn(`[StoreValidator] ERROR ${field}: ${message}`); - } - - /** - * Add validation warning - */ - private addWarning(field: string, message: string, appliedDefault?: any): void { - this.warnings.push({ field, message, appliedDefault }); - // Log at debug level - warnings are expected for incomplete configs - console.debug(`[StoreValidator] WARNING ${field}: ${message}`); - } -} - -// ============================================================ -// CONVENIENCE FUNCTIONS -// ============================================================ - -/** - * Validate a single store config - */ -export function validateStoreConfig(raw: RawStoreConfig): ValidationResult { - const validator = new StoreValidator(); - return validator.validate(raw); -} - -/** - * Validate multiple store configs - */ -export function validateStoreConfigs(raws: RawStoreConfig[]): { - valid: ValidatedStoreConfig[]; - invalid: { raw: RawStoreConfig; errors: ValidationError[] }[]; - warnings: { storeId: number; warnings: ValidationWarning[] }[]; -} { - const valid: 
ValidatedStoreConfig[] = []; - const invalid: { raw: RawStoreConfig; errors: ValidationError[] }[] = []; - const warnings: { storeId: number; warnings: ValidationWarning[] }[] = []; - - for (const raw of raws) { - const result = validateStoreConfig(raw); - - if (result.isValid && result.config) { - valid.push(result.config); - if (result.warnings.length > 0) { - warnings.push({ storeId: raw.id, warnings: result.warnings }); - } - } else { - invalid.push({ raw, errors: result.errors }); - } - } - - return { valid, invalid, warnings }; -} - -/** - * Quick check if a store is crawlable - */ -export function isCrawlable(raw: RawStoreConfig): boolean { - return !!( - raw.id && - raw.name && - raw.platformDispensaryId && - raw.menuType && - raw.menuType !== 'unknown' && - raw.crawlStatus !== 'failed' && - raw.crawlStatus !== 'paused' - ); -} - -/** - * Get reason why store is not crawlable - */ -export function getNotCrawlableReason(raw: RawStoreConfig): string | null { - if (!raw.platformDispensaryId) { - return 'Missing platform_dispensary_id'; - } - if (!raw.menuType || raw.menuType === 'unknown') { - return 'Menu type not detected'; - } - if (raw.crawlStatus === 'failed') { - return 'Store is marked as failed'; - } - if (raw.crawlStatus === 'paused') { - return 'Crawling is paused'; - } - return null; -} - -// ============================================================ -// SINGLETON INSTANCE -// ============================================================ - -export const storeValidator = new StoreValidator(); diff --git a/backend/src/dutchie-az/services/worker.ts b/backend/src/dutchie-az/services/worker.ts deleted file mode 100644 index 9269e4c6..00000000 --- a/backend/src/dutchie-az/services/worker.ts +++ /dev/null @@ -1,750 +0,0 @@ -/** - * Worker Service - * - * Polls the job queue and processes crawl jobs. - * Each worker instance runs independently, claiming jobs atomically. 
- * - * Phase 1: Enhanced with self-healing logic, error taxonomy, and retry management. - */ - -import { - claimNextJob, - completeJob, - failJob, - updateJobProgress, - heartbeat, - getWorkerId, - getWorkerHostname, - recoverStaleJobs, - QueuedJob, -} from './job-queue'; -import { crawlDispensaryProducts } from './product-crawler'; -import { mapDbRowToDispensary } from './discovery'; -import { query } from '../db/connection'; - -// Phase 1: Error taxonomy and retry management -import { - CrawlErrorCode, - CrawlErrorCodeType, - classifyError, - isRetryable, - shouldRotateProxy, - shouldRotateUserAgent, - createSuccessResult, - createFailureResult, - CrawlResult, -} from './error-taxonomy'; -import { - RetryManager, - RetryDecision, - calculateNextCrawlAt, - determineCrawlStatus, - shouldAttemptRecovery, - sleep, -} from './retry-manager'; -import { - CrawlRotator, - userAgentRotator, - updateDispensaryRotation, -} from './proxy-rotator'; -import { DEFAULT_CONFIG, validateStoreConfig, isCrawlable } from './store-validator'; - -// Use shared dispensary columns (handles optional columns like provider_detection_data) -// NOTE: Using WITH_FAILED variant for worker compatibility checks -import { DISPENSARY_COLUMNS_WITH_FAILED as DISPENSARY_COLUMNS } from '../db/dispensary-columns'; - -// ============================================================ -// WORKER CONFIG -// ============================================================ - -const POLL_INTERVAL_MS = 5000; // Check for jobs every 5 seconds -const HEARTBEAT_INTERVAL_MS = 60000; // Send heartbeat every 60 seconds -const STALE_CHECK_INTERVAL_MS = 300000; // Check for stale jobs every 5 minutes -const SHUTDOWN_GRACE_PERIOD_MS = 30000; // Wait 30s for job to complete on shutdown - -// ============================================================ -// WORKER STATE -// ============================================================ - -let isRunning = false; -let currentJob: QueuedJob | null = null; -let pollTimer: 
NodeJS.Timeout | null = null; -let heartbeatTimer: NodeJS.Timeout | null = null; -let staleCheckTimer: NodeJS.Timeout | null = null; -let shutdownPromise: Promise | null = null; - -// ============================================================ -// WORKER LIFECYCLE -// ============================================================ - -/** - * Start the worker - */ -export async function startWorker(): Promise { - if (isRunning) { - console.log('[Worker] Already running'); - return; - } - - const workerId = getWorkerId(); - const hostname = getWorkerHostname(); - - console.log(`[Worker] Starting worker ${workerId} on ${hostname}`); - isRunning = true; - - // Set up graceful shutdown - setupShutdownHandlers(); - - // Start polling for jobs - pollTimer = setInterval(pollForJobs, POLL_INTERVAL_MS); - - // Start stale job recovery (only one worker should do this, but it's idempotent) - staleCheckTimer = setInterval(async () => { - try { - await recoverStaleJobs(15); - } catch (error) { - console.error('[Worker] Error recovering stale jobs:', error); - } - }, STALE_CHECK_INTERVAL_MS); - - // Immediately poll for a job - await pollForJobs(); - - console.log(`[Worker] Worker ${workerId} started, polling every ${POLL_INTERVAL_MS}ms`); -} - -/** - * Stop the worker gracefully - */ -export async function stopWorker(): Promise { - if (!isRunning) return; - - console.log('[Worker] Stopping worker...'); - isRunning = false; - - // Clear timers - if (pollTimer) { - clearInterval(pollTimer); - pollTimer = null; - } - if (heartbeatTimer) { - clearInterval(heartbeatTimer); - heartbeatTimer = null; - } - if (staleCheckTimer) { - clearInterval(staleCheckTimer); - staleCheckTimer = null; - } - - // Wait for current job to complete - if (currentJob) { - console.log(`[Worker] Waiting for job ${currentJob.id} to complete...`); - const startWait = Date.now(); - - while (currentJob && Date.now() - startWait < SHUTDOWN_GRACE_PERIOD_MS) { - await new Promise(r => setTimeout(r, 1000)); - } - - if 
(currentJob) { - console.log(`[Worker] Job ${currentJob.id} did not complete in time, marking for retry`); - await failJob(currentJob.id, 'Worker shutdown'); - } - } - - console.log('[Worker] Worker stopped'); -} - -/** - * Get worker status - */ -export function getWorkerStatus(): { - isRunning: boolean; - workerId: string; - hostname: string; - currentJob: QueuedJob | null; -} { - return { - isRunning, - workerId: getWorkerId(), - hostname: getWorkerHostname(), - currentJob, - }; -} - -// ============================================================ -// JOB PROCESSING -// ============================================================ - -/** - * Poll for and process the next available job - */ -async function pollForJobs(): Promise { - if (!isRunning || currentJob) { - return; // Already processing a job - } - - try { - const workerId = getWorkerId(); - - // Try to claim a job - const job = await claimNextJob({ - workerId, - jobTypes: ['dutchie_product_crawl', 'menu_detection', 'menu_detection_single'], - lockDurationMinutes: 30, - }); - - if (!job) { - return; // No jobs available - } - - currentJob = job; - console.log(`[Worker] Processing job ${job.id} (type=${job.jobType}, dispensary=${job.dispensaryId})`); - - // Start heartbeat for this job - heartbeatTimer = setInterval(async () => { - if (currentJob) { - try { - await heartbeat(currentJob.id); - } catch (error) { - console.error('[Worker] Heartbeat error:', error); - } - } - }, HEARTBEAT_INTERVAL_MS); - - // Process the job - await processJob(job); - - } catch (error: any) { - console.error('[Worker] Error polling for jobs:', error); - - if (currentJob) { - try { - await failJob(currentJob.id, error.message); - } catch (failError) { - console.error('[Worker] Error failing job:', failError); - } - } - } finally { - // Clear heartbeat timer - if (heartbeatTimer) { - clearInterval(heartbeatTimer); - heartbeatTimer = null; - } - currentJob = null; - } -} - -/** - * Process a single job - */ -async function 
processJob(job: QueuedJob): Promise { - try { - switch (job.jobType) { - case 'dutchie_product_crawl': - await processProductCrawlJob(job); - break; - - case 'menu_detection': - await processMenuDetectionJob(job); - break; - - case 'menu_detection_single': - await processSingleDetectionJob(job); - break; - - default: - throw new Error(`Unknown job type: ${job.jobType}`); - } - } catch (error: any) { - console.error(`[Worker] Job ${job.id} failed:`, error); - await failJob(job.id, error.message); - } -} - -// Thresholds for crawl status transitions -const DEGRADED_THRESHOLD = 3; // Mark as degraded after 3 consecutive failures -const FAILED_THRESHOLD = 10; // Mark as failed after 10 consecutive failures - -// For backwards compatibility -const MAX_CONSECUTIVE_FAILURES = FAILED_THRESHOLD; - -/** - * Record a successful crawl - resets failure counter and restores active status - */ -async function recordCrawlSuccess( - dispensaryId: number, - result: CrawlResult -): Promise { - // Calculate next crawl time (use store's frequency or default) - const { rows: storeRows } = await query( - `SELECT crawl_frequency_minutes FROM dispensaries WHERE id = $1`, - [dispensaryId] - ); - const frequencyMinutes = storeRows[0]?.crawl_frequency_minutes || DEFAULT_CONFIG.crawlFrequencyMinutes; - const nextCrawlAt = calculateNextCrawlAt(0, frequencyMinutes); - - // Reset failure state and schedule next crawl - await query( - `UPDATE dispensaries - SET consecutive_failures = 0, - crawl_status = 'active', - backoff_multiplier = 1.0, - last_crawl_at = NOW(), - last_success_at = NOW(), - last_error_code = NULL, - next_crawl_at = $2, - total_attempts = COALESCE(total_attempts, 0) + 1, - total_successes = COALESCE(total_successes, 0) + 1, - updated_at = NOW() - WHERE id = $1`, - [dispensaryId, nextCrawlAt] - ); - - // Log to crawl_attempts table for analytics - await logCrawlAttempt(dispensaryId, result); - - console.log(`[Worker] Dispensary ${dispensaryId} crawl success. 
Next crawl at ${nextCrawlAt.toISOString()}`); -} - -/** - * Record a crawl failure with self-healing logic - * - Rotates proxy/UA based on error type - * - Transitions through: active -> degraded -> failed - * - Calculates backoff for next attempt - */ -async function recordCrawlFailure( - dispensaryId: number, - errorMessage: string, - errorCode?: CrawlErrorCodeType, - httpStatus?: number, - context?: { - proxyUsed?: string; - userAgentUsed?: string; - attemptNumber?: number; - } -): Promise<{ wasFlagged: boolean; newStatus: string; shouldRotateProxy: boolean; shouldRotateUA: boolean }> { - // Classify the error if not provided - const code = errorCode || classifyError(errorMessage, httpStatus); - - // Get current state - const { rows: storeRows } = await query( - `SELECT - consecutive_failures, - crawl_status, - backoff_multiplier, - crawl_frequency_minutes, - current_proxy_id, - current_user_agent - FROM dispensaries WHERE id = $1`, - [dispensaryId] - ); - - if (storeRows.length === 0) { - return { wasFlagged: false, newStatus: 'unknown', shouldRotateProxy: false, shouldRotateUA: false }; - } - - const store = storeRows[0]; - const currentFailures = (store.consecutive_failures || 0) + 1; - const frequencyMinutes = store.crawl_frequency_minutes || DEFAULT_CONFIG.crawlFrequencyMinutes; - - // Determine if we should rotate proxy/UA based on error type - const rotateProxy = shouldRotateProxy(code); - const rotateUA = shouldRotateUserAgent(code); - - // Get new proxy/UA if rotation is needed - let newProxyId = store.current_proxy_id; - let newUserAgent = store.current_user_agent; - - if (rotateUA) { - newUserAgent = userAgentRotator.getNext(); - console.log(`[Worker] Rotating user agent for dispensary ${dispensaryId} after ${code}`); - } - - // Determine new crawl status - const newStatus = determineCrawlStatus(currentFailures, { - degraded: DEGRADED_THRESHOLD, - failed: FAILED_THRESHOLD, - }); - - // Calculate backoff multiplier and next crawl time - const 
newBackoffMultiplier = Math.min( - (store.backoff_multiplier || 1.0) * 1.5, - 4.0 // Max 4x backoff - ); - const nextCrawlAt = calculateNextCrawlAt(currentFailures, frequencyMinutes); - - // Update dispensary with new failure state - if (newStatus === 'failed') { - // Mark as failed - won't be crawled again until manual intervention - await query( - `UPDATE dispensaries - SET consecutive_failures = $2, - crawl_status = $3, - backoff_multiplier = $4, - last_failure_at = NOW(), - last_error_code = $5, - failed_at = NOW(), - failure_notes = $6, - next_crawl_at = NULL, - current_proxy_id = $7, - current_user_agent = $8, - total_attempts = COALESCE(total_attempts, 0) + 1, - updated_at = NOW() - WHERE id = $1`, - [ - dispensaryId, - currentFailures, - newStatus, - newBackoffMultiplier, - code, - `Auto-flagged after ${currentFailures} consecutive failures. Last error: ${errorMessage}`, - newProxyId, - newUserAgent, - ] - ); - console.log(`[Worker] Dispensary ${dispensaryId} marked as FAILED after ${currentFailures} failures (${code})`); - } else { - // Update failure count but keep crawling (active or degraded) - await query( - `UPDATE dispensaries - SET consecutive_failures = $2, - crawl_status = $3, - backoff_multiplier = $4, - last_failure_at = NOW(), - last_error_code = $5, - next_crawl_at = $6, - current_proxy_id = $7, - current_user_agent = $8, - total_attempts = COALESCE(total_attempts, 0) + 1, - updated_at = NOW() - WHERE id = $1`, - [ - dispensaryId, - currentFailures, - newStatus, - newBackoffMultiplier, - code, - nextCrawlAt, - newProxyId, - newUserAgent, - ] - ); - - if (newStatus === 'degraded') { - console.log(`[Worker] Dispensary ${dispensaryId} marked as DEGRADED (${currentFailures}/${FAILED_THRESHOLD} failures). Next crawl: ${nextCrawlAt.toISOString()}`); - } else { - console.log(`[Worker] Dispensary ${dispensaryId} failure recorded (${currentFailures}/${DEGRADED_THRESHOLD}). 
Next crawl: ${nextCrawlAt.toISOString()}`); - } - } - - // Log to crawl_attempts table - const result = createFailureResult( - dispensaryId, - new Date(), - errorMessage, - httpStatus, - context - ); - await logCrawlAttempt(dispensaryId, result); - - return { - wasFlagged: newStatus === 'failed', - newStatus, - shouldRotateProxy: rotateProxy, - shouldRotateUA: rotateUA, - }; -} - -/** - * Log a crawl attempt to the crawl_attempts table for analytics - */ -async function logCrawlAttempt( - dispensaryId: number, - result: CrawlResult -): Promise { - try { - await query( - `INSERT INTO crawl_attempts ( - dispensary_id, started_at, finished_at, duration_ms, - error_code, error_message, http_status, - attempt_number, proxy_used, user_agent_used, - products_found, products_upserted, snapshots_created, - created_at - ) VALUES ($1, $2, $3, $4, $5, $6, $7, $8, $9, $10, $11, $12, $13, NOW())`, - [ - dispensaryId, - result.startedAt, - result.finishedAt, - result.durationMs, - result.errorCode, - result.errorMessage || null, - result.httpStatus || null, - result.attemptNumber, - result.proxyUsed || null, - result.userAgentUsed || null, - result.productsFound || 0, - result.productsUpserted || 0, - result.snapshotsCreated || 0, - ] - ); - } catch (error) { - // Don't fail the job if logging fails - console.error(`[Worker] Failed to log crawl attempt for dispensary ${dispensaryId}:`, error); - } -} - -/** - * Process a product crawl job for a single dispensary - */ -async function processProductCrawlJob(job: QueuedJob): Promise { - const startedAt = new Date(); - const userAgent = userAgentRotator.getCurrent(); - - if (!job.dispensaryId) { - throw new Error('Product crawl job requires dispensary_id'); - } - - // Get dispensary details - const { rows } = await query( - `SELECT ${DISPENSARY_COLUMNS} FROM dispensaries WHERE id = $1`, - [job.dispensaryId] - ); - - if (rows.length === 0) { - throw new Error(`Dispensary ${job.dispensaryId} not found`); - } - - const dispensary = 
mapDbRowToDispensary(rows[0]); - const rawDispensary = rows[0]; - - // Check if dispensary is already flagged as failed - if (rawDispensary.failed_at) { - console.log(`[Worker] Skipping dispensary ${job.dispensaryId} - already flagged as failed`); - await completeJob(job.id, { productsFound: 0, productsUpserted: 0 }); - return; - } - - // Check crawl status - skip if paused or failed - if (rawDispensary.crawl_status === 'paused' || rawDispensary.crawl_status === 'failed') { - console.log(`[Worker] Skipping dispensary ${job.dispensaryId} - crawl_status is ${rawDispensary.crawl_status}`); - await completeJob(job.id, { productsFound: 0, productsUpserted: 0 }); - return; - } - - if (!dispensary.platformDispensaryId) { - // Record failure with error taxonomy - const { wasFlagged } = await recordCrawlFailure( - job.dispensaryId, - 'Missing platform_dispensary_id', - CrawlErrorCode.MISSING_PLATFORM_ID, - undefined, - { userAgentUsed: userAgent, attemptNumber: job.retryCount + 1 } - ); - if (wasFlagged) { - await completeJob(job.id, { productsFound: 0, productsUpserted: 0 }); - return; - } - throw new Error(`Dispensary ${job.dispensaryId} has no platform_dispensary_id`); - } - - // Get crawl options from job metadata - const pricingType = job.metadata?.pricingType || 'rec'; - const useBothModes = job.metadata?.useBothModes !== false; - - try { - // Crawl the dispensary - const result = await crawlDispensaryProducts(dispensary, pricingType, { - useBothModes, - onProgress: async (progress) => { - // Update progress for live monitoring - await updateJobProgress(job.id, { - productsFound: progress.productsFound, - productsUpserted: progress.productsUpserted, - snapshotsCreated: progress.snapshotsCreated, - currentPage: progress.currentPage, - totalPages: progress.totalPages, - }); - }, - }); - - if (result.success) { - // Success! 
Create result and record - const crawlResult = createSuccessResult( - job.dispensaryId, - startedAt, - { - productsFound: result.productsFetched, - productsUpserted: result.productsUpserted, - snapshotsCreated: result.snapshotsCreated, - }, - { - attemptNumber: job.retryCount + 1, - userAgentUsed: userAgent, - } - ); - await recordCrawlSuccess(job.dispensaryId, crawlResult); - await completeJob(job.id, { - productsFound: result.productsFetched, - productsUpserted: result.productsUpserted, - snapshotsCreated: result.snapshotsCreated, - // Visibility tracking stats for dashboard - visibilityLostCount: result.visibilityLostCount || 0, - visibilityRestoredCount: result.visibilityRestoredCount || 0, - }); - } else { - // Crawl returned failure - classify error and record - const errorCode = classifyError(result.errorMessage || 'Crawl failed', result.httpStatus); - const { wasFlagged } = await recordCrawlFailure( - job.dispensaryId, - result.errorMessage || 'Crawl failed', - errorCode, - result.httpStatus, - { userAgentUsed: userAgent, attemptNumber: job.retryCount + 1 } - ); - - if (wasFlagged) { - // Dispensary is now flagged - complete the job - await completeJob(job.id, { productsFound: 0, productsUpserted: 0 }); - } else if (!isRetryable(errorCode)) { - // Non-retryable error - complete as failed - await completeJob(job.id, { productsFound: 0, productsUpserted: 0 }); - } else { - // Retryable error - let job queue handle retry - throw new Error(result.errorMessage || 'Crawl failed'); - } - } - } catch (error: any) { - // Record the failure with error taxonomy - const errorCode = classifyError(error.message); - const { wasFlagged } = await recordCrawlFailure( - job.dispensaryId, - error.message, - errorCode, - undefined, - { userAgentUsed: userAgent, attemptNumber: job.retryCount + 1 } - ); - - if (wasFlagged) { - // Dispensary is now flagged - complete the job - await completeJob(job.id, { productsFound: 0, productsUpserted: 0 }); - } else if 
(!isRetryable(errorCode)) { - // Non-retryable error - complete as failed - await completeJob(job.id, { productsFound: 0, productsUpserted: 0 }); - } else { - throw error; - } - } -} - -/** - * Process a menu detection job (bulk) - */ -async function processMenuDetectionJob(job: QueuedJob): Promise<void> { - const { executeMenuDetectionJob } = await import('./menu-detection'); - - const config = job.metadata || {}; - const result = await executeMenuDetectionJob(config); - - if (result.status === 'error') { - throw new Error(result.errorMessage || 'Menu detection failed'); - } - - await completeJob(job.id, { - productsFound: result.itemsProcessed, - productsUpserted: result.itemsSucceeded, - }); -} - -/** - * Process a single dispensary menu detection job - * This is the parallelizable version - each worker can detect one dispensary at a time - */ -async function processSingleDetectionJob(job: QueuedJob): Promise<void> { - if (!job.dispensaryId) { - throw new Error('Single detection job requires dispensary_id'); - } - - const { detectAndResolveDispensary } = await import('./menu-detection'); - - // Get dispensary details - const { rows } = await query( - `SELECT ${DISPENSARY_COLUMNS} FROM dispensaries WHERE id = $1`, - [job.dispensaryId] - ); - - if (rows.length === 0) { - throw new Error(`Dispensary ${job.dispensaryId} not found`); - } - - const dispensary = rows[0]; - - // Skip if already detected or failed - if (dispensary.failed_at) { - console.log(`[Worker] Skipping dispensary ${job.dispensaryId} - already flagged as failed`); - await completeJob(job.id, { productsFound: 0, productsUpserted: 0 }); - return; - } - - if (dispensary.menu_type && dispensary.menu_type !== 'unknown') { - console.log(`[Worker] Skipping dispensary ${job.dispensaryId} - already detected as ${dispensary.menu_type}`); - await completeJob(job.id, { productsFound: 0, productsUpserted: 1 }); - return; - } - - console.log(`[Worker] Detecting menu for dispensary ${job.dispensaryId} 
(${dispensary.name})...`); - - try { - const result = await detectAndResolveDispensary(job.dispensaryId); - - if (result.success) { - console.log(`[Worker] Dispensary ${job.dispensaryId}: detected ${result.detectedProvider}, platformId=${result.platformDispensaryId || 'none'}`); - await completeJob(job.id, { - productsFound: 1, - productsUpserted: result.platformDispensaryId ? 1 : 0, - }); - } else { - // Detection failed - record failure - await recordCrawlFailure(job.dispensaryId, result.error || 'Detection failed'); - throw new Error(result.error || 'Detection failed'); - } - } catch (error: any) { - // Record the failure - const wasFlagged = await recordCrawlFailure(job.dispensaryId, error.message); - if (wasFlagged) { - // Dispensary is now flagged - complete the job rather than fail it - await completeJob(job.id, { productsFound: 0, productsUpserted: 0 }); - } else { - throw error; - } - } -} - -// ============================================================ -// SHUTDOWN HANDLING -// ============================================================ - -function setupShutdownHandlers(): void { - const shutdown = async (signal: string) => { - if (shutdownPromise) return shutdownPromise; - - console.log(`\n[Worker] Received ${signal}, shutting down...`); - shutdownPromise = stopWorker(); - await shutdownPromise; - process.exit(0); - }; - - process.on('SIGTERM', () => shutdown('SIGTERM')); - process.on('SIGINT', () => shutdown('SIGINT')); -} - -// ============================================================ -// STANDALONE WORKER ENTRY POINT -// ============================================================ - -if (require.main === module) { - // Run as standalone worker - startWorker().catch((error) => { - console.error('[Worker] Fatal error:', error); - process.exit(1); - }); -} diff --git a/backend/src/dutchie-az/types/index.ts b/backend/src/dutchie-az/types/index.ts deleted file mode 100644 index df7f68b4..00000000 --- a/backend/src/dutchie-az/types/index.ts +++ 
/dev/null @@ -1,751 +0,0 @@ -/** - * Dutchie AZ Data Types - * - * Complete TypeScript interfaces for the isolated Dutchie Arizona data pipeline. - * These types map directly to Dutchie's GraphQL FilteredProducts response. - */ - -// ============================================================ -// GRAPHQL RESPONSE TYPES (from Dutchie API) -// ============================================================ - -/** - * Raw Dutchie brand object from GraphQL - */ -export interface DutchieBrand { - id: string; - _id?: string; - name: string; - parentBrandId?: string; - imageUrl?: string; - description?: string; - __typename?: string; -} - -/** - * Raw Dutchie image object from GraphQL - */ -export interface DutchieImage { - url: string; - description?: string; - active?: boolean; - __typename?: string; -} - -/** - * POSMetaData.children - option-level inventory/pricing - */ -export interface DutchiePOSChild { - activeBatchTags?: any; - canonicalBrandId?: string; - canonicalBrandName?: string; - canonicalCategory?: string; - canonicalCategoryId?: string; - canonicalEffectivePotencyMg?: number; - canonicalID?: string; - canonicalPackageId?: string; - canonicalImgUrl?: string; - canonicalLabResultUrl?: string; - canonicalName?: string; - canonicalSKU?: string; - canonicalProductTags?: string[]; - canonicalStrainId?: string; - canonicalVendorId?: string; - kioskQuantityAvailable?: number; - medPrice?: number; - option?: string; - packageQuantity?: number; - price?: number; - quantity?: number; - quantityAvailable?: number; - recEquivalent?: number; - recPrice?: number; - standardEquivalent?: number; - __typename?: string; -} - -/** - * POSMetaData object from GraphQL - */ -export interface DutchiePOSMetaData { - activeBatchTags?: any; - canonicalBrandId?: string; - canonicalBrandName?: string; - canonicalCategory?: string; - canonicalCategoryId?: string; - canonicalID?: string; - canonicalPackageId?: string; - canonicalImgUrl?: string; - canonicalLabResultUrl?: string; - 
canonicalName?: string; - canonicalProductTags?: string[]; - canonicalSKU?: string; - canonicalStrainId?: string; - canonicalVendorId?: string; - children?: DutchiePOSChild[]; - integrationID?: string; - __typename?: string; -} - -/** - * THC/CBD Content structure - */ -export interface DutchiePotencyContent { - unit?: string; - range?: number[]; -} - -/** - * CannabinoidV2 structure - */ -export interface DutchieCannabinoidV2 { - value: number; - unit: string; - cannabinoid: { - name: string; - }; -} - -/** - * Special data structure - */ -export interface DutchieSpecialData { - saleSpecials?: Array<{ - specialId: string; - specialName: string; - discount: number; - percentDiscount: boolean; - dollarDiscount: boolean; - specialType: string; - }>; - bogoSpecials?: any; -} - -/** - * Complete raw product from Dutchie GraphQL FilteredProducts - */ -export interface DutchieRawProduct { - _id: string; - id?: string; - AdditionalOptions?: any; - duplicatedProductId?: string; - libraryProductId?: string; - libraryProductScore?: number; - - // Brand - brand?: DutchieBrand; - brandId?: string; - brandName?: string; - brandLogo?: string; - - // Potency - CBD?: number; - CBDContent?: DutchiePotencyContent; - THC?: number; - THCContent?: DutchiePotencyContent; - cannabinoidsV2?: DutchieCannabinoidV2[]; - - // Flags - certificateOfAnalysisEnabled?: boolean; - collectionCardBadge?: string; - comingSoon?: boolean; - featured?: boolean; - medicalOnly?: boolean; - recOnly?: boolean; - nonArmsLength?: boolean; - vapeTaxApplicable?: boolean; - useBetterPotencyTaxes?: boolean; - - // Timestamps - createdAt?: string; - updatedAt?: string; - - // Dispensary - DispensaryID: string; - enterpriseProductId?: string; - - // Images - Image?: string; - images?: DutchieImage[]; - - // Measurements - measurements?: { - netWeight?: { - unit: string; - values: number[]; - }; - volume?: any; - }; - weight?: number | string; - - // Product identity - Name: string; - cName: string; - pastCNames?: 
string[]; - - // Options - Options?: string[]; - rawOptions?: string[]; - limitsPerCustomer?: any; - manualInventory?: boolean; - - // POS data - POSMetaData?: DutchiePOSMetaData; - - // Pricing - Prices?: number[]; - recPrices?: number[]; - medicalPrices?: number[]; - recSpecialPrices?: number[]; - medicalSpecialPrices?: number[]; - wholesalePrices?: number[]; - pricingTierData?: any; - specialIdsPerOption?: any; - - // Specials - special?: boolean; - specialData?: DutchieSpecialData; - - // Classification - Status?: string; - strainType?: string; - subcategory?: string; - type?: string; - provider?: string; - effects?: Record; - - // Threshold flags - isBelowThreshold?: boolean; - isBelowKioskThreshold?: boolean; - optionsBelowThreshold?: boolean; - optionsBelowKioskThreshold?: boolean; - - // Misc - bottleDepositTaxCents?: number; - __typename?: string; -} - -// ============================================================ -// DERIVED TYPES -// ============================================================ - -/** - * StockStatus - derived from POSMetaData.children quantityAvailable - * - 'in_stock': At least one option has quantityAvailable > 0 - * - 'out_of_stock': All options have quantityAvailable === 0 - * - 'unknown': No POSMetaData.children or quantityAvailable data - * - 'missing_from_feed': Product was not present in the latest crawl feed - */ -export type StockStatus = 'in_stock' | 'out_of_stock' | 'unknown' | 'missing_from_feed'; - -/** - * CrawlMode - defines how products are fetched from Dutchie - * - 'mode_a': UI parity - Status: 'Active', threshold removal ON - * - 'mode_b': MAX COVERAGE - No Status filter, bypass thresholds - */ -export type CrawlMode = 'mode_a' | 'mode_b'; - -/** - * Per-option stock status type - */ -export type OptionStockStatus = 'in_stock' | 'out_of_stock' | 'unknown'; - -/** - * Get available quantity for a single option - * Priority: quantityAvailable > kioskQuantityAvailable > quantity - */ -export function 
getOptionQuantity(child: DutchiePOSChild): number | null { - if (typeof child.quantityAvailable === 'number') return child.quantityAvailable; - if (typeof child.kioskQuantityAvailable === 'number') return child.kioskQuantityAvailable; - if (typeof child.quantity === 'number') return child.quantity; - return null; // No quantity data available -} - -/** - * Derive stock status for a single option - * Returns: 'in_stock' if qty > 0, 'out_of_stock' if qty === 0, 'unknown' if no data - */ -export function deriveOptionStockStatus(child: DutchiePOSChild): OptionStockStatus { - const qty = getOptionQuantity(child); - if (qty === null) return 'unknown'; - return qty > 0 ? 'in_stock' : 'out_of_stock'; -} - -/** - * Derive product-level stock status from POSMetaData.children - * - * Logic per spec: - * - If ANY child is "in_stock" → product is "in_stock" - * - Else if ALL children are "out_of_stock" → product is "out_of_stock" - * - Else → product is "unknown" - * - * IMPORTANT: Threshold flags (isBelowThreshold, etc.) do NOT override stock status. - * They only indicate "low stock" - if qty > 0, status stays "in_stock". 
- */ -export function deriveStockStatus(product: DutchieRawProduct): StockStatus { - const children = product.POSMetaData?.children; - - // No children data - unknown - if (!children || children.length === 0) { - return 'unknown'; - } - - // Get stock status for each option - const optionStatuses = children.map(deriveOptionStockStatus); - - // If ANY option is in_stock → product is in_stock - if (optionStatuses.some(status => status === 'in_stock')) { - return 'in_stock'; - } - - // If ALL options are out_of_stock → product is out_of_stock - if (optionStatuses.every(status => status === 'out_of_stock')) { - return 'out_of_stock'; - } - - // Otherwise (mix of out_of_stock and unknown) → unknown - return 'unknown'; -} - -/** - * Calculate total quantity available across all options - * Returns null if no children data (unknown inventory), 0 if children exist but all have 0 qty - */ -export function calculateTotalQuantity(product: DutchieRawProduct): number | null { - const children = product.POSMetaData?.children; - // No children = unknown inventory, return null (NOT 0) - if (!children || children.length === 0) return null; - - // Check if any child has quantity data - const hasAnyQtyData = children.some(child => getOptionQuantity(child) !== null); - if (!hasAnyQtyData) return null; // All children lack qty data = unknown - - return children.reduce((sum, child) => { - const qty = getOptionQuantity(child); - return sum + (qty ?? 0); - }, 0); -} - -/** - * Calculate total kiosk quantity available across all options - */ -export function calculateTotalKioskQuantity(product: DutchieRawProduct): number | null { - const children = product.POSMetaData?.children; - if (!children || children.length === 0) return null; - - const hasAnyKioskQty = children.some(child => typeof child.kioskQuantityAvailable === 'number'); - if (!hasAnyKioskQty) return null; - - return children.reduce((sum, child) => sum + (child.kioskQuantityAvailable ?? 
0), 0); -} - -// ============================================================ -// DATABASE ENTITY TYPES -// ============================================================ - -/** - * Dispensary - represents a Dutchie store in Arizona - */ -export interface Dispensary { - id: number; - platform: 'dutchie'; - name: string; - dbaName?: string; - slug: string; - city: string; - state: string; - postalCode?: string; - latitude?: number; - longitude?: number; - address?: string; - platformDispensaryId?: string; // Resolved internal ID (e.g., "6405ef617056e8014d79101b") - isDelivery?: boolean; - isPickup?: boolean; - rawMetadata?: any; // Full discovery node - lastCrawledAt?: Date; - productCount?: number; - createdAt: Date; - updatedAt: Date; - menuType?: string; - menuUrl?: string; - scrapeEnabled?: boolean; - providerDetectionData?: any; - platformDispensaryIdResolvedAt?: Date; - website?: string; // The dispensary's own website (from raw_metadata or direct column) -} - -/** - * DutchieProduct - canonical product identity per store - */ -export interface DutchieProduct { - id: number; - dispensaryId: number; - platform: 'dutchie'; - - externalProductId: string; // from _id or id - platformDispensaryId: string; // mirror of Dispensary.platformDispensaryId - cName?: string; // cName / slug - name: string; // Name - - // Brand - brandName?: string; - brandId?: string; - brandLogoUrl?: string; - - // Classification - type?: string; - subcategory?: string; - strainType?: string; - provider?: string; - - // Potency - thc?: number; - thcContent?: number; - cbd?: number; - cbdContent?: number; - cannabinoidsV2?: DutchieCannabinoidV2[]; - effects?: Record; - - // Status / flags - status?: string; - medicalOnly: boolean; - recOnly: boolean; - featured: boolean; - comingSoon: boolean; - certificateOfAnalysisEnabled: boolean; - - isBelowThreshold: boolean; - isBelowKioskThreshold: boolean; - optionsBelowThreshold: boolean; - optionsBelowKioskThreshold: boolean; - - // Derived stock 
status (from POSMetaData.children quantityAvailable) - stockStatus: StockStatus; - totalQuantityAvailable?: number | null; // null = unknown (no children), 0 = all OOS - - // Images - primaryImageUrl?: string; - images?: DutchieImage[]; - - // Misc - measurements?: any; - weight?: string; - pastCNames?: string[]; - - createdAtDutchie?: Date; - updatedAtDutchie?: Date; - - latestRawPayload?: any; // Full product node from last crawl - - createdAt: Date; - updatedAt: Date; -} - -/** - * DutchieProductOptionSnapshot - child-level option data from POSMetaData.children - */ -export interface DutchieProductOptionSnapshot { - optionId: string; // canonicalID or canonicalPackageId or canonicalSKU - canonicalId?: string; - canonicalPackageId?: string; - canonicalSKU?: string; - canonicalName?: string; - - canonicalCategory?: string; - canonicalCategoryId?: string; - canonicalBrandId?: string; - canonicalBrandName?: string; - canonicalStrainId?: string; - canonicalVendorId?: string; - - optionLabel?: string; // from option field - packageQuantity?: number; - recEquivalent?: number; - standardEquivalent?: number; - - priceCents?: number; // price * 100 - recPriceCents?: number; // recPrice * 100 - medPriceCents?: number; // medPrice * 100 - - quantity?: number; - quantityAvailable?: number; - kioskQuantityAvailable?: number; - - activeBatchTags?: any; - canonicalImgUrl?: string; - canonicalLabResultUrl?: string; - canonicalEffectivePotencyMg?: number; - - rawChildPayload?: any; // Full POSMetaData.children node -} - -/** - * DutchieProductSnapshot - per crawl, includes options[] - */ -export interface DutchieProductSnapshot { - id: number; - dutchieProductId: number; - dispensaryId: number; - platformDispensaryId: string; - externalProductId: string; - pricingType: 'rec' | 'med' | 'unknown'; - crawlMode: CrawlMode; // Which crawl mode captured this snapshot - - status?: string; - featured: boolean; - special: boolean; - medicalOnly: boolean; - recOnly: boolean; - - // Flag 
indicating if product was present in feed (false = missing_from_feed snapshot) - isPresentInFeed: boolean; - - // Derived stock status for this snapshot - stockStatus: StockStatus; - - // Price summary (aggregated from children, in cents) - recMinPriceCents?: number; - recMaxPriceCents?: number; - recMinSpecialPriceCents?: number; - medMinPriceCents?: number; - medMaxPriceCents?: number; - medMinSpecialPriceCents?: number; - wholesaleMinPriceCents?: number; - - // Inventory summary (aggregated from POSMetaData.children) - totalQuantityAvailable?: number | null; // null = unknown (no children), 0 = all OOS - totalKioskQuantityAvailable?: number | null; - manualInventory: boolean; - isBelowThreshold: boolean; - isBelowKioskThreshold: boolean; - - // Option-level data - options: DutchieProductOptionSnapshot[]; - - // Full raw product node at this crawl time - rawPayload: any; - - crawledAt: Date; - createdAt: Date; - updatedAt: Date; -} - -/** - * CrawlJob - tracks crawl execution status - */ -export interface CrawlJob { - id: number; - jobType: 'discovery' | 'product_crawl' | 'resolve_ids'; - dispensaryId?: number; - status: 'pending' | 'running' | 'completed' | 'failed'; - startedAt?: Date; - completedAt?: Date; - errorMessage?: string; - productsFound?: number; - snapshotsCreated?: number; - metadata?: any; - createdAt: Date; - updatedAt: Date; -} - -/** - * JobSchedule - recurring job configuration with jitter support - * Times "wander" around the clock due to random jitter after each run - */ -export type JobStatus = 'success' | 'error' | 'partial' | 'running' | null; - -export interface JobSchedule { - id: number; - jobName: string; - description?: string; - enabled: boolean; - - // Timing configuration - baseIntervalMinutes: number; // e.g., 240 (4 hours) - jitterMinutes: number; // e.g., 30 (±30 minutes) - - // Worker identity - workerName?: string; // e.g., "Alice", "Henry", "Bella", "Oscar" - workerRole?: string; // e.g., "Store Discovery Worker", "GraphQL 
Product Sync" - - // Last run tracking - lastRunAt?: Date; - lastStatus?: JobStatus; - lastErrorMessage?: string; - lastDurationMs?: number; - - // Next run (calculated with jitter) - nextRunAt?: Date; - - // Job-specific config - jobConfig?: Record; - - createdAt: Date; - updatedAt: Date; -} - -/** - * JobRunLog - history of job executions - */ -export interface JobRunLog { - id: number; - scheduleId: number; - jobName: string; - status: 'pending' | 'running' | 'success' | 'error' | 'partial'; - startedAt?: Date; - completedAt?: Date; - durationMs?: number; - errorMessage?: string; - - // Worker identity (propagated from schedule) - workerName?: string; // e.g., "Alice", "Henry", "Bella", "Oscar" - runRole?: string; // e.g., "Store Discovery Worker" - - // Results summary - itemsProcessed?: number; - itemsSucceeded?: number; - itemsFailed?: number; - - metadata?: any; - createdAt: Date; -} - -// ============================================================ -// GRAPHQL OPERATION TYPES -// ============================================================ - -export interface FilteredProductsVariables { - includeEnterpriseSpecials: boolean; - productsFilter: { - dispensaryId: string; - pricingType: 'rec' | 'med'; - strainTypes?: string[]; - subcategories?: string[]; - Status?: string; - types?: string[]; - useCache?: boolean; - isDefaultSort?: boolean; - sortBy?: string; - sortDirection?: number; - bypassOnlineThresholds?: boolean; - isKioskMenu?: boolean; - removeProductsBelowOptionThresholds?: boolean; - }; - page: number; - perPage: number; -} - -export interface GetAddressBasedDispensaryDataVariables { - input: { - dispensaryId: string; // The slug like "AZ-Deeply-Rooted" - }; -} - -export interface ConsumerDispensariesVariables { - filter: { - lat: number; - lng: number; - radius: number; // in meters or km - isDelivery?: boolean; - searchText?: string; - }; -} - -// ============================================================ -// API RESPONSE TYPES -// 
============================================================ - -export interface DashboardStats { - dispensaryCount: number; - productCount: number; - snapshotCount24h: number; - lastCrawlTime?: Date; - failedJobCount: number; - brandCount: number; - categoryCount: number; -} - -export interface CategorySummary { - type: string; - subcategory: string; - productCount: number; - dispensaryCount: number; - avgPrice?: number; -} - -export interface BrandSummary { - brandName: string; - brandId?: string; - brandLogoUrl?: string; - productCount: number; - dispensaryCount: number; -} - -// ============================================================ -// CRAWLER PROFILE TYPES -// ============================================================ - -/** - * DispensaryCrawlerProfile - per-store crawler configuration - * - * Allows each dispensary to have customized crawler settings without - * affecting shared crawler logic. A dispensary can have multiple profiles - * but only one is active at a time (via dispensaries.active_crawler_profile_id). 
- */ -export interface DispensaryCrawlerProfile { - id: number; - dispensaryId: number; - profileName: string; - crawlerType: string; // 'dutchie', 'treez', 'jane', 'sandbox', 'custom' - profileKey: string | null; // Optional key for per-store module mapping - config: Record; // Crawler-specific configuration - timeoutMs: number | null; - downloadImages: boolean; - trackStock: boolean; - version: number; - enabled: boolean; - createdAt: Date; - updatedAt: Date; -} - -/** - * DispensaryCrawlerProfileCreate - input type for creating a new profile - */ -export interface DispensaryCrawlerProfileCreate { - dispensaryId: number; - profileName: string; - crawlerType: string; - profileKey?: string | null; - config?: Record; - timeoutMs?: number | null; - downloadImages?: boolean; - trackStock?: boolean; - version?: number; - enabled?: boolean; -} - -/** - * DispensaryCrawlerProfileUpdate - input type for updating an existing profile - */ -export interface DispensaryCrawlerProfileUpdate { - profileName?: string; - crawlerType?: string; - profileKey?: string | null; - config?: Record; - timeoutMs?: number | null; - downloadImages?: boolean; - trackStock?: boolean; - version?: number; - enabled?: boolean; -} - -/** - * CrawlerProfileOptions - runtime options derived from a profile - * Used when invoking the actual crawler - */ -export interface CrawlerProfileOptions { - timeoutMs: number; - downloadImages: boolean; - trackStock: boolean; - config: Record; -} diff --git a/backend/src/hydration/incremental-sync.ts b/backend/src/hydration/incremental-sync.ts index d8db1045..e0a5a074 100644 --- a/backend/src/hydration/incremental-sync.ts +++ b/backend/src/hydration/incremental-sync.ts @@ -669,12 +669,4 @@ export async function syncRecentCrawls( return { synced, errors }; } -// ============================================================ -// EXPORTS -// ============================================================ - -export { - CrawlResult, - SyncOptions, - SyncResult, -}; +// 
Types CrawlResult, SyncOptions, and SyncResult are already exported at their declarations diff --git a/backend/src/index.ts b/backend/src/index.ts index 0c16274e..ff07e99c 100755 --- a/backend/src/index.ts +++ b/backend/src/index.ts @@ -6,6 +6,7 @@ import { initializeMinio, isMinioEnabled } from './utils/minio'; import { initializeImageStorage } from './utils/image-storage'; import { logger } from './services/logger'; import { cleanupOrphanedJobs } from './services/proxyTestQueue'; +import healthRoutes from './routes/health'; dotenv.config(); @@ -58,22 +59,15 @@ import scraperMonitorRoutes from './routes/scraper-monitor'; import apiTokensRoutes from './routes/api-tokens'; import apiPermissionsRoutes from './routes/api-permissions'; import parallelScrapeRoutes from './routes/parallel-scrape'; -import scheduleRoutes from './routes/schedule'; import crawlerSandboxRoutes from './routes/crawler-sandbox'; import versionRoutes from './routes/version'; import publicApiRoutes from './routes/public-api'; import usersRoutes from './routes/users'; import staleProcessesRoutes from './routes/stale-processes'; import orchestratorAdminRoutes from './routes/orchestrator-admin'; -import adminRoutes from './routes/admin'; -import healthRoutes from './routes/health'; import workersRoutes from './routes/workers'; -import { dutchieAZRouter, startScheduler as startDutchieAZScheduler, initializeDefaultSchedules } from './dutchie-az'; -import { getPool } from './dutchie-az/db/connection'; -import { createAnalyticsRouter } from './dutchie-az/routes/analytics'; import { createMultiStateRoutes } from './multi-state'; import { trackApiUsage, checkRateLimit } from './middleware/apiTokenTracker'; -import { startCrawlScheduler } from './services/crawl-scheduler'; import { validateWordPressPermissions } from './middleware/wordpressPermissions'; import { markTrustedDomains } from './middleware/trustedDomains'; import { createSystemRouter, createPrometheusRouter } from './system/routes'; @@ -81,7 
+75,7 @@ import { createPortalRoutes } from './portals'; import { createStatesRouter } from './routes/states'; import { createAnalyticsV2Router } from './routes/analytics-v2'; import { createDiscoveryRoutes } from './discovery'; -import { createDutchieDiscoveryRoutes, promoteDiscoveryLocation } from './dutchie-az/discovery'; +import { getPool } from './db/pool'; // Consumer API routes (findadispo.com, findagram.co) import consumerAuthRoutes from './routes/consumer-auth'; @@ -132,41 +126,22 @@ app.use('/api/scraper-monitor', scraperMonitorRoutes); app.use('/api/api-tokens', apiTokensRoutes); app.use('/api/api-permissions', apiPermissionsRoutes); app.use('/api/parallel-scrape', parallelScrapeRoutes); -app.use('/api/schedule', scheduleRoutes); app.use('/api/crawler-sandbox', crawlerSandboxRoutes); app.use('/api/version', versionRoutes); app.use('/api/users', usersRoutes); app.use('/api/stale-processes', staleProcessesRoutes); -// Admin routes - operator actions (crawl triggers, health checks) -app.use('/api/admin', adminRoutes); +// Admin routes - orchestrator actions app.use('/api/admin/orchestrator', orchestratorAdminRoutes); // SEO orchestrator routes app.use('/api/seo', seoRoutes); -// Provider-agnostic worker management routes (replaces /api/dutchie-az/admin/schedules) +// Provider-agnostic worker management routes app.use('/api/workers', workersRoutes); // Monitor routes - aliased from workers for convenience app.use('/api/monitor', workersRoutes); console.log('[Workers] Routes registered at /api/workers and /api/monitor'); -// Market data pipeline routes (provider-agnostic) -app.use('/api/markets', dutchieAZRouter); -// Legacy aliases (deprecated - remove after frontend migration) -app.use('/api/az', dutchieAZRouter); -app.use('/api/dutchie-az', dutchieAZRouter); - -// Phase 3: Analytics Dashboards - price trends, penetration, category growth, etc. 
-try { - const analyticsRouter = createAnalyticsRouter(getPool()); - app.use('/api/markets/analytics', analyticsRouter); - // Legacy alias for backwards compatibility - app.use('/api/az/analytics', analyticsRouter); - console.log('[Analytics] Routes registered at /api/markets/analytics'); -} catch (error) { - console.warn('[Analytics] Failed to register routes:', error); -} - // Phase 3: Analytics V2 - Enhanced analytics with rec/med state segmentation try { const analyticsV2Router = createAnalyticsV2Router(getPool()); @@ -239,43 +214,7 @@ try { } // Platform-specific Discovery Routes -// Uses neutral slugs to avoid trademark issues in URLs: -// dt = Dutchie, jn = Jane, wm = Weedmaps, etc. -// Routes: /api/discovery/platforms/:platformSlug/* -try { - const dtDiscoveryRoutes = createDutchieDiscoveryRoutes(getPool()); - app.use('/api/discovery/platforms/dt', dtDiscoveryRoutes); - console.log('[Discovery] Platform routes registered at /api/discovery/platforms/dt'); -} catch (error) { - console.warn('[Discovery] Failed to register platform routes:', error); -} - -// Orchestrator promotion endpoint (platform-agnostic) -// Route: /api/orchestrator/platforms/:platformSlug/promote/:id -app.post('/api/orchestrator/platforms/:platformSlug/promote/:id', async (req, res) => { - try { - const { platformSlug, id } = req.params; - - // Validate platform slug - const validPlatforms = ['dt']; // dt = Dutchie - if (!validPlatforms.includes(platformSlug)) { - return res.status(400).json({ - success: false, - error: `Invalid platform slug: ${platformSlug}. 
Valid slugs: ${validPlatforms.join(', ')}` - }); - } - - const result = await promoteDiscoveryLocation(getPool(), parseInt(id, 10)); - if (result.success) { - res.json(result); - } else { - res.status(400).json(result); - } - } catch (error: any) { - console.error('[Orchestrator] Promotion error:', error); - res.status(500).json({ success: false, error: error.message }); - } -}); +// TODO: Rebuild with /platforms/dutchie/ module async function startServer() { try { @@ -288,15 +227,6 @@ async function startServer() { // Clean up any orphaned proxy test jobs from previous server runs await cleanupOrphanedJobs(); - // Start the crawl scheduler (checks every minute for jobs to run) - startCrawlScheduler(); - logger.info('system', 'Crawl scheduler started'); - - // Start the Dutchie AZ scheduler (enqueues jobs for workers) - await initializeDefaultSchedules(); - startDutchieAZScheduler(); - logger.info('system', 'Dutchie AZ scheduler started'); - app.listen(PORT, () => { logger.info('system', `Server running on port ${PORT}`); console.log(`🚀 Server running on port ${PORT}`); diff --git a/backend/src/platforms/dutchie/client.ts b/backend/src/platforms/dutchie/client.ts new file mode 100644 index 00000000..65dd028c --- /dev/null +++ b/backend/src/platforms/dutchie/client.ts @@ -0,0 +1,544 @@ +/** + * ============================================================ + * DUTCHIE PLATFORM CLIENT - LOCKED MODULE + * ============================================================ + * + * DO NOT MODIFY THIS FILE WITHOUT EXPLICIT AUTHORIZATION. + * + * This is the canonical HTTP client for all Dutchie communication. + * All Dutchie workers (Alice, Bella, etc.) MUST use this client. 
+ * + * IMPLEMENTATION: + * - Uses curl via child_process.execSync (bypasses TLS fingerprinting) + * - NO Puppeteer, NO axios, NO fetch + * - Fingerprint rotation on 403 + * - Residential IP compatible + * + * USAGE: + * import { curlPost, curlGet, executeGraphQL } from '@dutchie/client'; + * + * ============================================================ + */ + +import { execSync } from 'child_process'; + +// ============================================================ +// TYPES +// ============================================================ + +export interface CurlResponse { + status: number; + data: any; + error?: string; +} + +export interface Fingerprint { + userAgent: string; + acceptLanguage: string; + secChUa?: string; + secChUaPlatform?: string; + secChUaMobile?: string; +} + +// ============================================================ +// CONFIGURATION +// ============================================================ + +export const DUTCHIE_CONFIG = { + graphqlEndpoint: 'https://dutchie.com/api-3/graphql', + baseUrl: 'https://dutchie.com', + timeout: 30000, + maxRetries: 3, + perPage: 100, + maxPages: 200, + pageDelayMs: 500, + modeDelayMs: 2000, +}; + +// ============================================================ +// PROXY SUPPORT +// ============================================================ +// Integrates with the CrawlRotator system from proxy-rotator.ts +// On 403 errors: +// 1. Record failure on current proxy +// 2. Rotate to next proxy +// 3. 
Retry with new proxy +// ============================================================ + +import type { CrawlRotator, Proxy } from '../../services/crawl-rotator'; + +let currentProxy: string | null = null; +let crawlRotator: CrawlRotator | null = null; + +/** + * Set proxy for all Dutchie requests + * Format: http://user:pass@host:port or socks5://host:port + */ +export function setProxy(proxy: string | null): void { + currentProxy = proxy; + if (proxy) { + console.log(`[Dutchie Client] Proxy set: ${proxy.replace(/:[^:@]+@/, ':***@')}`); + } else { + console.log('[Dutchie Client] Proxy disabled (direct connection)'); + } +} + +/** + * Get current proxy URL + */ +export function getProxy(): string | null { + return currentProxy; +} + +/** + * Set CrawlRotator for proxy rotation on 403s + * This enables automatic proxy rotation when blocked + */ +export function setCrawlRotator(rotator: CrawlRotator | null): void { + crawlRotator = rotator; + if (rotator) { + console.log('[Dutchie Client] CrawlRotator attached - proxy rotation enabled'); + // Set initial proxy from rotator + const proxy = rotator.proxy.getCurrent(); + if (proxy) { + currentProxy = rotator.proxy.getProxyUrl(proxy); + console.log(`[Dutchie Client] Initial proxy: ${currentProxy.replace(/:[^:@]+@/, ':***@')}`); + } + } +} + +/** + * Get attached CrawlRotator + */ +export function getCrawlRotator(): CrawlRotator | null { + return crawlRotator; +} + +/** + * Rotate to next proxy (called on 403) + */ +async function rotateProxyOn403(error?: string): Promise<boolean> { + if (!crawlRotator) { + return false; + } + + // Record failure on current proxy + await crawlRotator.recordFailure(error || '403 Forbidden'); + + // Rotate to next proxy + const nextProxy = crawlRotator.rotateProxy(); + if (nextProxy) { + currentProxy = crawlRotator.proxy.getProxyUrl(nextProxy); + console.log(`[Dutchie Client] Rotated proxy: ${currentProxy.replace(/:[^:@]+@/, ':***@')}`); + return true; + } + + console.warn('[Dutchie Client] No more 
proxies available'); + return false; +} + +/** + * Record success on current proxy + */ +async function recordProxySuccess(responseTimeMs?: number): Promise<void> { + if (crawlRotator) { + await crawlRotator.recordSuccess(responseTimeMs); + } +} + +/** + * Build curl proxy argument + */ +function getProxyArg(): string { + if (!currentProxy) return ''; + return `--proxy '${currentProxy}'`; +} + +export const GRAPHQL_HASHES = { + FilteredProducts: 'ee29c060826dc41c527e470e9ae502c9b2c169720faa0a9f5d25e1b9a530a4a0', + GetAddressBasedDispensaryData: '13461f73abf7268770dfd05fe7e10c523084b2bb916a929c08efe3d87531977b', + ConsumerDispensaries: '0a5bfa6ca1d64ae47bcccb7c8077c87147cbc4e6982c17ceec97a2a4948b311b', + DispensaryInfo: '13461f73abf7268770dfd05fe7e10c523084b2bb916a929c08efe3d87531977b', +}; + +// ============================================================ +// FINGERPRINTS - Browser profiles for anti-detect +// ============================================================ + +const FINGERPRINTS: Fingerprint[] = [ + // Chrome Windows (latest) - typical residential user, use first + { + userAgent: 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36', + acceptLanguage: 'en-US,en;q=0.9', + secChUa: '"Google Chrome";v="131", "Chromium";v="131", "Not_A Brand";v="24"', + secChUaPlatform: '"Windows"', + secChUaMobile: '?0', + }, + // Chrome Mac (latest) + { + userAgent: 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36', + acceptLanguage: 'en-US,en;q=0.9', + secChUa: '"Google Chrome";v="131", "Chromium";v="131", "Not_A Brand";v="24"', + secChUaPlatform: '"macOS"', + secChUaMobile: '?0', + }, + // Chrome Windows (120) + { + userAgent: 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36', + acceptLanguage: 'en-US,en;q=0.9', + secChUa: '"Chromium";v="120", "Google Chrome";v="120", 
"Not-A.Brand";v="99"', + secChUaPlatform: '"Windows"', + secChUaMobile: '?0', + }, + // Firefox Windows + { + userAgent: 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:133.0) Gecko/20100101 Firefox/133.0', + acceptLanguage: 'en-US,en;q=0.5', + }, + // Safari Mac + { + userAgent: 'Mozilla/5.0 (Macintosh; Intel Mac OS X 14_2) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.2 Safari/605.1.15', + acceptLanguage: 'en-US,en;q=0.9', + }, + // Edge Windows + { + userAgent: 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36 Edg/131.0.0.0', + acceptLanguage: 'en-US,en;q=0.9', + secChUa: '"Microsoft Edge";v="131", "Chromium";v="131", "Not_A Brand";v="24"', + secChUaPlatform: '"Windows"', + secChUaMobile: '?0', + }, +]; + +let currentFingerprintIndex = 0; + +export function getFingerprint(): Fingerprint { + return FINGERPRINTS[currentFingerprintIndex]; +} + +export function rotateFingerprint(): Fingerprint { + currentFingerprintIndex = (currentFingerprintIndex + 1) % FINGERPRINTS.length; + const fp = FINGERPRINTS[currentFingerprintIndex]; + console.log(`[Dutchie Client] Rotated to fingerprint: ${fp.userAgent.slice(0, 50)}...`); + return fp; +} + +export function resetFingerprint(): void { + currentFingerprintIndex = 0; +} + +// ============================================================ +// CURL HTTP CLIENT +// ============================================================ + +/** + * Build headers for Dutchie requests + */ +export function buildHeaders(refererPath: string, fingerprint?: Fingerprint): Record { + const fp = fingerprint || getFingerprint(); + const refererUrl = `https://dutchie.com${refererPath}`; + + const headers: Record = { + 'accept': 'application/json, text/plain, */*', + 'accept-language': fp.acceptLanguage, + 'content-type': 'application/json', + 'origin': 'https://dutchie.com', + 'referer': refererUrl, + 'user-agent': fp.userAgent, + 'apollographql-client-name': 'Marketplace (production)', + 
}; + + if (fp.secChUa) { + headers['sec-ch-ua'] = fp.secChUa; + headers['sec-ch-ua-mobile'] = fp.secChUaMobile || '?0'; + headers['sec-ch-ua-platform'] = fp.secChUaPlatform || '"Windows"'; + headers['sec-fetch-dest'] = 'empty'; + headers['sec-fetch-mode'] = 'cors'; + headers['sec-fetch-site'] = 'same-site'; + } + + return headers; +} + +/** + * Execute HTTP POST using curl (bypasses TLS fingerprinting) + */ +export function curlPost(url: string, body: any, headers: Record<string, string>, timeout = 30000): CurlResponse { + const filteredHeaders = Object.entries(headers) + .filter(([k]) => k.toLowerCase() !== 'accept-encoding') + .map(([k, v]) => `-H '${k}: ${v}'`) + .join(' '); + + const bodyJson = JSON.stringify(body).replace(/'/g, "'\\''"); + const timeoutSec = Math.ceil(timeout / 1000); + const separator = '___HTTP_STATUS___'; + const proxyArg = getProxyArg(); + const cmd = `curl -s --compressed ${proxyArg} -w '${separator}%{http_code}' --max-time ${timeoutSec} ${filteredHeaders} -d '${bodyJson}' '${url}'`; + + try { + const output = execSync(cmd, { + encoding: 'utf-8', + maxBuffer: 10 * 1024 * 1024, + timeout: timeout + 5000 + }); + + const separatorIndex = output.lastIndexOf(separator); + if (separatorIndex === -1) { + const lines = output.trim().split('\n'); + const statusCode = parseInt(lines.pop() || '0', 10); + const responseBody = lines.join('\n'); + try { + return { status: statusCode, data: JSON.parse(responseBody) }; + } catch { + return { status: statusCode, data: responseBody }; + } + } + + const responseBody = output.slice(0, separatorIndex); + const statusCode = parseInt(output.slice(separatorIndex + separator.length).trim(), 10); + + try { + return { status: statusCode, data: JSON.parse(responseBody) }; + } catch { + return { status: statusCode, data: responseBody }; + } + } catch (error: any) { + return { + status: 0, + data: null, + error: error.message || 'curl request failed' + }; + } +} + +/** + * Execute HTTP GET using curl (bypasses TLS fingerprinting) + * 
Returns HTML or JSON depending on response content-type + */ +export function curlGet(url: string, headers: Record<string, string>, timeout = 30000): CurlResponse { + const filteredHeaders = Object.entries(headers) + .filter(([k]) => k.toLowerCase() !== 'accept-encoding') + .map(([k, v]) => `-H '${k}: ${v}'`) + .join(' '); + + const timeoutSec = Math.ceil(timeout / 1000); + const separator = '___HTTP_STATUS___'; + const proxyArg = getProxyArg(); + const cmd = `curl -s --compressed ${proxyArg} -w '${separator}%{http_code}' --max-time ${timeoutSec} ${filteredHeaders} '${url}'`; + + try { + const output = execSync(cmd, { + encoding: 'utf-8', + maxBuffer: 10 * 1024 * 1024, + timeout: timeout + 5000 + }); + + const separatorIndex = output.lastIndexOf(separator); + if (separatorIndex === -1) { + const lines = output.trim().split('\n'); + const statusCode = parseInt(lines.pop() || '0', 10); + const responseBody = lines.join('\n'); + return { status: statusCode, data: responseBody }; + } + + const responseBody = output.slice(0, separatorIndex); + const statusCode = parseInt(output.slice(separatorIndex + separator.length).trim(), 10); + + // Try to parse as JSON, otherwise return as string (HTML) + try { + return { status: statusCode, data: JSON.parse(responseBody) }; + } catch { + return { status: statusCode, data: responseBody }; + } + } catch (error: any) { + return { + status: 0, + data: null, + error: error.message || 'curl request failed' + }; + } +} + +// ============================================================ +// GRAPHQL EXECUTION +// ============================================================ + +export interface ExecuteGraphQLOptions { + maxRetries?: number; + retryOn403?: boolean; + cName: string; +} + +/** + * Execute GraphQL query with curl (bypasses TLS fingerprinting) + */ +export async function executeGraphQL( + operationName: string, + variables: any, + hash: string, + options: ExecuteGraphQLOptions +): Promise<any> { + const { maxRetries = 3, retryOn403 = true, cName } = 
options; + + const body = { + operationName, + variables, + extensions: { + persistedQuery: { version: 1, sha256Hash: hash }, + }, + }; + + let lastError: Error | null = null; + let attempt = 0; + + while (attempt <= maxRetries) { + const fingerprint = getFingerprint(); + const headers = buildHeaders(`/embedded-menu/${cName}`, fingerprint); + + console.log(`[Dutchie Client] curl POST ${operationName} (attempt ${attempt + 1}/${maxRetries + 1})`); + + const response = curlPost(DUTCHIE_CONFIG.graphqlEndpoint, body, headers, DUTCHIE_CONFIG.timeout); + + console.log(`[Dutchie Client] Response status: ${response.status}`); + + if (response.error) { + console.error(`[Dutchie Client] curl error: ${response.error}`); + lastError = new Error(response.error); + attempt++; + if (attempt <= maxRetries) { + await sleep(1000 * attempt); + } + continue; + } + + if (response.status === 200) { + if (response.data?.errors?.length > 0) { + console.warn(`[Dutchie Client] GraphQL errors: ${JSON.stringify(response.data.errors[0])}`); + } + return response.data; + } + + if (response.status === 403 && retryOn403) { + console.warn(`[Dutchie Client] 403 blocked - rotating fingerprint...`); + rotateFingerprint(); + attempt++; + await sleep(1000 * attempt); + continue; + } + + const bodyPreview = typeof response.data === 'string' + ? 
response.data.slice(0, 200) + : JSON.stringify(response.data).slice(0, 200); + console.error(`[Dutchie Client] HTTP ${response.status}: ${bodyPreview}`); + lastError = new Error(`HTTP ${response.status}`); + + attempt++; + if (attempt <= maxRetries) { + await sleep(1000 * attempt); + } + } + + throw lastError || new Error('Max retries exceeded'); +} + +// ============================================================ +// HTML PAGE FETCHING +// ============================================================ + +export interface FetchPageOptions { + maxRetries?: number; + retryOn403?: boolean; +} + +/** + * Fetch HTML page from Dutchie (for city pages, dispensary pages, etc.) + * Returns raw HTML string + */ +export async function fetchPage( + path: string, + options: FetchPageOptions = {} +): Promise<{ html: string; status: number } | null> { + const { maxRetries = 3, retryOn403 = true } = options; + const url = `${DUTCHIE_CONFIG.baseUrl}${path}`; + + let attempt = 0; + + while (attempt <= maxRetries) { + const fingerprint = getFingerprint(); + const headers: Record<string, string> = { + 'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8', + 'accept-language': fingerprint.acceptLanguage, + 'user-agent': fingerprint.userAgent, + }; + + if (fingerprint.secChUa) { + headers['sec-ch-ua'] = fingerprint.secChUa; + headers['sec-ch-ua-mobile'] = fingerprint.secChUaMobile || '?0'; + headers['sec-ch-ua-platform'] = fingerprint.secChUaPlatform || '"Windows"'; + headers['sec-fetch-dest'] = 'document'; + headers['sec-fetch-mode'] = 'navigate'; + headers['sec-fetch-site'] = 'none'; + headers['sec-fetch-user'] = '?1'; + headers['upgrade-insecure-requests'] = '1'; + } + + console.log(`[Dutchie Client] curl GET ${path} (attempt ${attempt + 1}/${maxRetries + 1})`); + + const response = curlGet(url, headers, DUTCHIE_CONFIG.timeout); + + console.log(`[Dutchie Client] Response status: ${response.status}`); + + if (response.error) { + console.error(`[Dutchie 
Client] curl error: ${response.error}`); + attempt++; + if (attempt <= maxRetries) { + await sleep(1000 * attempt); + } + continue; + } + + if (response.status === 200) { + return { html: response.data, status: response.status }; + } + + if (response.status === 403 && retryOn403) { + console.warn(`[Dutchie Client] 403 blocked - rotating fingerprint...`); + rotateFingerprint(); + attempt++; + await sleep(1000 * attempt); + continue; + } + + console.error(`[Dutchie Client] HTTP ${response.status}`); + attempt++; + if (attempt <= maxRetries) { + await sleep(1000 * attempt); + } + } + + return null; +} + +/** + * Extract __NEXT_DATA__ from HTML page + */ +export function extractNextData(html: string): any | null { + const match = html.match(/ - + +
diff --git a/cannaiq/src/components/Layout.tsx b/cannaiq/src/components/Layout.tsx index edc78d05..2fad5b8b 100755 --- a/cannaiq/src/components/Layout.tsx +++ b/cannaiq/src/components/Layout.tsx @@ -18,7 +18,8 @@ import { MousePointerClick, FileText, Menu, - X + X, + Users } from 'lucide-react'; interface LayoutProps { @@ -148,6 +149,7 @@ export function Layout({ children }: LayoutProps) { } label="Orchestrator" isActive={isActive('/admin/orchestrator')} /> + } label="Workers" isActive={isActive('/workers')} /> } label="SEO Pages" isActive={isActive('/admin/seo')} /> } label="Proxies" isActive={isActive('/proxies')} /> } label="Settings" isActive={isActive('/settings')} /> diff --git a/cannaiq/src/components/WorkerRoleBadge.tsx b/cannaiq/src/components/WorkerRoleBadge.tsx index 237db9dc..d59c0819 100644 --- a/cannaiq/src/components/WorkerRoleBadge.tsx +++ b/cannaiq/src/components/WorkerRoleBadge.tsx @@ -25,6 +25,7 @@ interface RoleConfig { } const roleMapping: Record<string, RoleConfig> = { + // Lowercase snake_case versions product_sync: { label: 'Products', bg: '#dbeafe', @@ -50,6 +51,32 @@ const roleMapping: Record<string, RoleConfig> = { bg: '#ede9fe', color: '#5b21b6', }, + // Human-readable versions from API + 'Analytics Refresh': { + label: 'Analytics', + bg: '#ede9fe', + color: '#5b21b6', + }, + 'Store Discovery': { + label: 'Discovery', + bg: '#d1fae5', + color: '#065f46', + }, + 'ProductSync': { + label: 'Products', + bg: '#dbeafe', + color: '#1e40af', + }, + 'StoreDiscovery': { + label: 'Discovery', + bg: '#d1fae5', + color: '#065f46', + }, + 'General': { + label: 'General', + bg: '#f3f4f6', + color: '#374151', + }, }; const defaultConfig: RoleConfig = { @@ -119,11 +146,6 @@ export function formatScope(metadata: any): string { return metadata.state; } - // Check for description - if (metadata.description) { - return metadata.description; - } - // Check for scope.states if (metadata.scope?.states && Array.isArray(metadata.scope.states)) { if (metadata.scope.states.length <= 5) { @@ -132,6 
+154,18 @@ export function formatScope(metadata: any): string { return `${metadata.scope.states.slice(0, 3).join(', ')} +${metadata.scope.states.length - 3} more`; } + // Check for pricingType (product crawl scope - rec/med) + if (metadata.pricingType) { + return metadata.pricingType === 'rec' ? 'All Rec' : + metadata.pricingType === 'med' ? 'All Med' : + `All ${metadata.pricingType}`; + } + + // Check for description + if (metadata.description) { + return metadata.description; + } + return '-'; } diff --git a/cannaiq/src/pages/WorkersDashboard.tsx b/cannaiq/src/pages/WorkersDashboard.tsx index 638c8a8d..ea7d02ea 100644 --- a/cannaiq/src/pages/WorkersDashboard.tsx +++ b/cannaiq/src/pages/WorkersDashboard.tsx @@ -122,37 +122,39 @@ export function WorkersDashboard() { const fetchData = useCallback(async () => { try { - const [workersRes, summaryRes] = await Promise.all([ - api.get('/api/workers'), - api.get('/api/monitor/summary'), - ]); + // Use the schedules endpoint which has proper worker names + const schedulesRes = await api.get('/api/az/admin/schedules'); - // Map new API response format to component's expected format - const workers = workersRes.data.workers || []; - setSchedules(workers.map((w: any) => ({ - id: w.id, - job_name: w.worker_name, - description: w.description, - worker_name: w.worker_name, - worker_role: w.run_role, - enabled: w.enabled, - base_interval_minutes: w.base_interval_minutes, - jitter_minutes: w.jitter_minutes, - next_run_at: w.next_run_at, - last_run_at: w.last_run_at, - last_status: w.last_status, - job_config: { scope: w.scope }, + // Map schedules API response format to component's expected format + const schedulesList = schedulesRes.data.schedules || []; + setSchedules(schedulesList.map((s: any) => ({ + id: s.id, + job_name: s.jobName, + description: s.description, + worker_name: s.workerName, // Bella, Henry, Oscar, Alice + worker_role: s.workerRole, + enabled: s.enabled, + base_interval_minutes: s.baseIntervalMinutes, + 
jitter_minutes: s.jitterMinutes, + next_run_at: s.nextRunAt, + last_run_at: s.lastRunAt, + last_status: s.lastStatus, + job_config: s.jobConfig || {}, }))); - // Map new summary format - const summary = summaryRes.data.summary; + // Calculate summary from schedules data + const enabledSchedules = schedulesList.filter((s: any) => s.enabled); + const successSchedules = schedulesList.filter((s: any) => s.lastStatus === 'success'); + const failedSchedules = schedulesList.filter((s: any) => s.lastStatus === 'error'); + const runningSchedules = schedulesList.filter((s: any) => s.lastStatus === 'running'); + setSummary({ - running_scheduled_jobs: summary?.jobs_24h?.running || 0, - running_dispensary_crawl_jobs: summary?.active_crawl_jobs || 0, - successful_jobs_24h: summary?.jobs_24h?.success || 0, - failed_jobs_24h: summary?.jobs_24h?.failed || 0, - successful_crawls_24h: summary?.jobs_24h?.success || 0, - failed_crawls_24h: summary?.jobs_24h?.failed || 0, + running_scheduled_jobs: runningSchedules.length, + running_dispensary_crawl_jobs: 0, + successful_jobs_24h: successSchedules.length, + failed_jobs_24h: failedSchedules.length, + successful_crawls_24h: successSchedules.length, + failed_crawls_24h: failedSchedules.length, products_found_24h: 0, snapshots_created_24h: 0, last_job_started: null, @@ -170,7 +172,7 @@ export function WorkersDashboard() { const fetchWorkerLogs = useCallback(async (scheduleId: number) => { setLogsLoading(true); try { - const res = await api.get(`/api/workers/${scheduleId}/logs?limit=20`); + const res = await api.get(`/api/az/admin/schedules/${scheduleId}/logs?limit=20`); setWorkerLogs(res.data.logs || []); } catch (err: any) { console.error('Failed to fetch worker logs:', err); @@ -182,7 +184,7 @@ export function WorkersDashboard() { useEffect(() => { fetchData(); - const interval = setInterval(fetchData, 30000); + const interval = setInterval(fetchData, 5000); // Refresh every 5 seconds return () => clearInterval(interval); }, [fetchData]); 
@@ -205,7 +207,7 @@ export function WorkersDashboard() { const handleTrigger = async (scheduleId: number) => { setTriggering(scheduleId); try { - await api.post(`/api/workers/${scheduleId}/trigger`); + await api.post(`/api/az/admin/schedules/${scheduleId}/trigger`); // Refresh data after trigger setTimeout(fetchData, 1000); } catch (err: any) { diff --git a/k8s/woodpecker-agent.yaml b/k8s/woodpecker-agent.yaml new file mode 100644 index 00000000..e0987331 --- /dev/null +++ b/k8s/woodpecker-agent.yaml @@ -0,0 +1,92 @@ +# Woodpecker CI Agent Deployment +# Runs in the K8s cluster to pick up CI jobs from ci.cannabrands.app +--- +apiVersion: v1 +kind: Namespace +metadata: + name: woodpecker +--- +apiVersion: v1 +kind: Secret +metadata: + name: woodpecker-agent-secret + namespace: woodpecker +type: Opaque +stringData: + WOODPECKER_AGENT_SECRET: "" # Get from CI server admin panel +--- +apiVersion: apps/v1 +kind: Deployment +metadata: + name: woodpecker-agent + namespace: woodpecker + labels: + app: woodpecker-agent +spec: + replicas: 2 + selector: + matchLabels: + app: woodpecker-agent + template: + metadata: + labels: + app: woodpecker-agent + spec: + serviceAccountName: woodpecker-agent + containers: + - name: agent + image: woodpeckerci/woodpecker-agent:latest + env: + - name: WOODPECKER_SERVER + value: "ci.cannabrands.app:443" + - name: WOODPECKER_AGENT_SECRET + valueFrom: + secretKeyRef: + name: woodpecker-agent-secret + key: WOODPECKER_AGENT_SECRET + - name: WOODPECKER_GRPC_SECURE + value: "true" + - name: WOODPECKER_BACKEND + value: "kubernetes" + - name: WOODPECKER_BACKEND_K8S_NAMESPACE + value: "woodpecker" + - name: WOODPECKER_BACKEND_K8S_VOLUME_SIZE + value: "10G" + resources: + limits: + memory: "512Mi" + cpu: "500m" + requests: + memory: "256Mi" + cpu: "100m" +--- +apiVersion: v1 +kind: ServiceAccount +metadata: + name: woodpecker-agent + namespace: woodpecker +--- +apiVersion: rbac.authorization.k8s.io/v1 +kind: ClusterRole +metadata: + name: 
woodpecker-agent +rules: + - apiGroups: [""] + resources: ["pods", "pods/log", "secrets", "configmaps", "persistentvolumeclaims"] + verbs: ["get", "list", "watch", "create", "update", "patch", "delete"] + - apiGroups: [""] + resources: ["events"] + verbs: ["create"] +--- +apiVersion: rbac.authorization.k8s.io/v1 +kind: ClusterRoleBinding +metadata: + name: woodpecker-agent +roleRef: + apiGroup: rbac.authorization.k8s.io + kind: ClusterRole + name: woodpecker-agent +subjects: + - kind: ServiceAccount + name: woodpecker-agent + namespace: woodpecker
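
A note on `curlPost` and `curlGet` in `backend/src/platforms/dutchie/client.ts` above: both ask curl to append the HTTP status via `-w '___HTTP_STATUS___%{http_code}'` and then split the output at the last occurrence of that separator. The sketch below isolates that parsing step as a pure function so it can be unit tested without spawning curl. The names `parseCurlOutput` and `ParsedCurlOutput` are hypothetical and do not appear in the diff, and, as a deliberate simplification, a missing separator is reported as status 0 rather than handled with the newline-split fallback the original uses.

```typescript
// Hypothetical helper, not part of the diff: parses "<body><separator><status>".
interface ParsedCurlOutput {
  status: number; // HTTP status code, or 0 if it could not be determined
  data: unknown;  // parsed JSON when the body is valid JSON, else the raw string
}

const STATUS_SEPARATOR = '___HTTP_STATUS___';

function parseCurlOutput(
  output: string,
  separator: string = STATUS_SEPARATOR
): ParsedCurlOutput {
  // Use the LAST occurrence so a body that happens to contain the
  // separator text does not truncate the response.
  const index = output.lastIndexOf(separator);
  if (index === -1) {
    // Assumption: treat a missing separator as "status unknown".
    return { status: 0, data: output };
  }

  const rawBody = output.slice(0, index);
  const parsedStatus = parseInt(output.slice(index + separator.length).trim(), 10);
  const status = Number.isNaN(parsedStatus) ? 0 : parsedStatus;

  // Try JSON first; fall back to the raw string (for example, an HTML page).
  let data: unknown = rawBody;
  try {
    data = JSON.parse(rawBody);
  } catch {
    // Not JSON: keep the raw body.
  }
  return { status, data };
}
```

With a helper of this shape, the two curl wrappers could share one parsing path, and the 403 and non-JSON cases could be covered by tests that pass in literal strings.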