chore: Clean up deprecated code and docs

- Move deprecated directories to src/_deprecated/:
  - hydration/ (old pipeline approach)
  - scraper-v2/ (old Puppeteer scraper)
  - canonical-hydration/ (merged into tasks)
  - Unused services: availability, crawler-logger, geolocation, etc.
  - Unused utils: age-gate-playwright, HomepageValidator, stealthBrowser
- Archive outdated docs to docs/_archive/:
  - ANALYTICS_RUNBOOK.md
  - ANALYTICS_V2_EXAMPLES.md
  - BRAND_INTELLIGENCE_API.md
  - CRAWL_PIPELINE.md
  - TASK_WORKFLOW_2024-12-10.md
  - WORKER_TASK_ARCHITECTURE.md
  - ORGANIC_SCRAPING_GUIDE.md
- Add docs/CODEBASE_MAP.md as single source of truth
- Add warning files to deprecated/archived directories
- Slim down CLAUDE.md to essential rules only

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
712
backend/docs/_archive/ANALYTICS_RUNBOOK.md
Normal file
@@ -0,0 +1,712 @@
|
||||
# CannaiQ Analytics Runbook
|
||||
|
||||
Phase 3: Analytics Engine - Complete Implementation Guide
|
||||
|
||||
## Overview
|
||||
|
||||
The CannaiQ Analytics Engine provides real-time insights into cannabis market data across price trends, brand penetration, category performance, store changes, and competitive positioning.
|
||||
|
||||
## Architecture
|
||||
|
||||
```
┌─────────────────────────────────────────────────────────────────┐
│                            API Layer                            │
│                       /api/az/analytics/*                       │
└─────────────────────────────────────────────────────────────────┘
                                 │
                                 ▼
┌─────────────────────────────────────────────────────────────────┐
│                       Analytics Services                        │
│  ┌──────────────┐  ┌────────────────┐  ┌──────────────────────┐ │
│  │PriceTrend    │  │Penetration     │  │CategoryAnalytics     │ │
│  │Service       │  │Service         │  │Service               │ │
│  └──────────────┘  └────────────────┘  └──────────────────────┘ │
│  ┌──────────────┐  ┌────────────────┐  ┌──────────────────────┐ │
│  │StoreChange   │  │BrandOpportunity│  │AnalyticsCache        │ │
│  │Service       │  │Service         │  │(15-min TTL)          │ │
│  └──────────────┘  └────────────────┘  └──────────────────────┘ │
└─────────────────────────────────────────────────────────────────┘
                                 │
                                 ▼
┌─────────────────────────────────────────────────────────────────┐
│                        Canonical Tables                         │
│  store_products │ store_product_snapshots │ brands │ categories │
│  dispensaries │ brand_snapshots │ category_snapshots            │
└─────────────────────────────────────────────────────────────────┘
```
|
||||
|
||||
## Services
|
||||
|
||||
### 1. PriceTrendService
|
||||
|
||||
Provides time-series price analytics.
|
||||
|
||||
**Key Methods:**
|
||||
| Method | Description |
|--------|-------------|
| `getProductPriceTrend(productId, storeId?, days)` | Price history for a product |
| `getBrandPriceTrend(brandName, filters)` | Average prices for a brand |
| `getCategoryPriceTrend(category, filters)` | Category-level price trends |
| `getPriceSummary(filters)` | 7d/30d/90d price averages |
| `detectPriceCompression(category, state?)` | Price war detection |
| `getGlobalPriceStats()` | Market-wide pricing overview |
|
||||
|
||||
**Filters:**
|
||||
```typescript
interface PriceFilters {
  storeId?: number;
  brandName?: string;
  category?: string;
  state?: string;
  days?: number; // default: 30
}
```
|
||||
|
||||
**Price Compression Detection:**
|
||||
- Calculates standard deviation of prices within category
|
||||
- Returns compression score 0-100 (higher = more compressed)
|
||||
- Identifies brands converging toward mean price
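
The scoring described in these bullets can be sketched as follows. This is only an illustration: the runbook does not give the actual formula, so the function name `compressionScore` and the mapping from the coefficient of variation to a 0-100 score are assumptions.

```typescript
// Hypothetical sketch; the real PriceTrendService formula is not shown in this runbook.
// Assumption: score = 100 * (1 - stddev / mean), clamped to the range [0, 100].
function compressionScore(prices: number[]): number | null {
  if (prices.length < 2) return null; // not enough data to measure spread
  const mean = prices.reduce((sum, p) => sum + p, 0) / prices.length;
  if (mean <= 0) return null;
  const variance =
    prices.reduce((sum, p) => sum + (p - mean) * (p - mean), 0) / prices.length;
  const coefficientOfVariation = Math.sqrt(variance) / mean;
  const score = 100 * (1 - coefficientOfVariation);
  return Math.max(0, Math.min(100, Math.round(score)));
}
```

Under this assumed mapping, identical prices score 100 and widely scattered prices approach 0, which matches the "higher = more compressed" convention.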
|
||||
|
||||
---
|
||||
|
||||
### 2. PenetrationService
|
||||
|
||||
Tracks brand market presence across stores and states.
|
||||
|
||||
**Key Methods:**
|
||||
| Method | Description |
|--------|-------------|
| `getBrandPenetration(brandName, filters)` | Store count, SKU count, coverage |
| `getTopBrandsByPenetration(limit, filters)` | Leaderboard of dominant brands |
| `getPenetrationTrend(brandName, days)` | Historical penetration growth |
| `getShelfShareByCategory(brandName)` | % of shelf per category |
| `getBrandPresenceByState(brandName)` | Multi-state presence map |
| `getStoresCarryingBrand(brandName)` | List of stores carrying brand |
| `getPenetrationHeatmap(brandName?)` | Geographic distribution |
|
||||
|
||||
**Penetration Calculation:**
|
||||
```
|
||||
Penetration % = (Stores with Brand / Total Stores in Market) × 100
|
||||
```
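
As a worked instance of the formula above, here is a minimal sketch. The function name and the rounding to one decimal place are assumptions, not taken from the service code.

```typescript
// Penetration % = (stores with brand / total stores in market) * 100.
// Rounding to one decimal place is an assumption made for this example.
function penetrationPercent(storesWithBrand: number, totalStores: number): number {
  if (totalStores <= 0) return 0; // avoid division by zero for an empty market
  return Math.round((storesWithBrand / totalStores) * 1000) / 10;
}
```

For example, a brand carried by 50 of 200 stores has a penetration of 25%.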
|
||||
|
||||
---
|
||||
|
||||
### 3. CategoryAnalyticsService
|
||||
|
||||
Analyzes category performance and trends.
|
||||
|
||||
**Key Methods:**
|
||||
| Method | Description |
|--------|-------------|
| `getCategorySummary(category?, filters)` | SKU count, avg price, stores |
| `getCategoryGrowth(days, filters)` | 7d/30d/90d growth rates |
| `getCategoryGrowthTrend(category, days)` | Time-series category growth |
| `getCategoryHeatmap(metric, periods)` | Visual heatmap data |
| `getTopMovers(limit, days)` | Fastest growing/declining categories |
| `getSubcategoryBreakdown(category)` | Drill-down into subcategories |
|
||||
|
||||
**Time Windows:**
|
||||
- 7 days: Short-term volatility
|
||||
- 30 days: Monthly trends
|
||||
- 90 days: Seasonal patterns
|
||||
|
||||
---
|
||||
|
||||
### 4. StoreChangeService
|
||||
|
||||
Tracks product adds/drops, brand changes, and price movements per store.
|
||||
|
||||
**Key Methods:**
|
||||
| Method | Description |
|--------|-------------|
| `getStoreChangeSummary(storeId)` | Overview of recent changes |
| `getStoreChangeEvents(storeId, filters)` | Event log (add, drop, price, OOS) |
| `getNewBrands(storeId, days)` | Brands added to store |
| `getLostBrands(storeId, days)` | Brands dropped from store |
| `getProductChanges(storeId, type, days)` | Filtered product changes |
| `getCategoryLeaderboard(category, limit)` | Top stores for category |
| `getMostActiveStores(days, limit)` | Stores with most changes |
| `compareStores(store1, store2)` | Side-by-side store comparison |
|
||||
|
||||
**Event Types:**
|
||||
- `added` - New product appeared
|
||||
- `discontinued` - Product removed
|
||||
- `price_drop` - Price decreased
|
||||
- `price_increase` - Price increased
|
||||
- `restocked` - OOS → In Stock
|
||||
- `out_of_stock` - In Stock → OOS
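
One way to picture how these event types could be derived is by comparing two consecutive observations of the same product. The sketch below is an assumption about the mechanism, not the service's actual implementation: `ProductState` and `classifyChange` are hypothetical names, and `null` stands for "product not present in that crawl".

```typescript
type StoreEventType =
  | 'added'
  | 'discontinued'
  | 'price_drop'
  | 'price_increase'
  | 'restocked'
  | 'out_of_stock';

// Hypothetical snapshot shape; the real snapshot columns are not listed in this runbook.
interface ProductState {
  price: number | null;
  inStock: boolean;
}

function classifyChange(
  prev: ProductState | null,
  curr: ProductState | null,
): StoreEventType[] {
  if (prev === null && curr === null) return [];
  if (prev === null) return ['added'];
  if (curr === null) return ['discontinued'];

  const events: StoreEventType[] = [];
  // Only compare prices when both observations have one.
  if (prev.price !== null && curr.price !== null) {
    if (curr.price < prev.price) events.push('price_drop');
    else if (curr.price > prev.price) events.push('price_increase');
  }
  if (!prev.inStock && curr.inStock) events.push('restocked');
  if (prev.inStock && !curr.inStock) events.push('out_of_stock');
  return events;
}
```

A single pair of observations can yield more than one event, for example a price drop on a product that also went out of stock.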
|
||||
|
||||
---
|
||||
|
||||
### 5. BrandOpportunityService
|
||||
|
||||
Competitive intelligence and opportunity identification.
|
||||
|
||||
**Key Methods:**
|
||||
| Method | Description |
|--------|-------------|
| `getBrandOpportunity(brandName)` | Full opportunity analysis |
| `getMarketPositionSummary(brandName)` | Market position vs competitors |
| `getAlerts(filters)` | Analytics-generated alerts |
| `markAlertsRead(alertIds)` | Mark alerts as read |
|
||||
|
||||
**Opportunity Analysis Includes:**
|
||||
- White space stores (potential targets)
|
||||
- Competitive threats (brands gaining share)
|
||||
- Pricing opportunities (underpriced vs market)
|
||||
- Missing SKU recommendations
|
||||
|
||||
---
|
||||
|
||||
### 6. AnalyticsCache
|
||||
|
||||
In-memory caching with database fallback.
|
||||
|
||||
**Configuration:**
|
||||
```typescript
const cache = new AnalyticsCache(pool, {
  defaultTtlMinutes: 15,
});
```
|
||||
|
||||
**Usage Pattern:**
|
||||
```typescript
const data = await cache.getOrCompute(cacheKey, async () => {
  // Expensive query here
  return result;
});
```
|
||||
|
||||
**Cache Management:**
|
||||
- `GET /api/az/analytics/cache/stats` - View cache stats
|
||||
- `POST /api/az/analytics/cache/clear?pattern=price*` - Clear by pattern
|
||||
- Auto-cleanup of expired entries every 5 minutes
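
To make the caching behavior concrete, here is a minimal in-memory sketch of a TTL cache with `getOrCompute`, pattern-based clearing, and expiry cleanup. It is not the real `AnalyticsCache`: the database fallback is omitted, the class name `TtlCache` and every method except `getOrCompute` are hypothetical, and the glob semantics for patterns such as `price*` are an assumption.

```typescript
// Escape regex metacharacters so only "*" acts as a wildcard.
function escapeRegExp(text: string): string {
  return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

class TtlCache {
  private entries = new Map<string, { value: unknown; expiresAt: number }>();
  private ttlMs: number;
  private now: () => number;

  // The clock is injectable so expiry can be tested without waiting.
  constructor(ttlMs: number = 15 * 60 * 1000, now: () => number = () => Date.now()) {
    this.ttlMs = ttlMs;
    this.now = now;
  }

  get<T>(key: string): T | undefined {
    const entry = this.entries.get(key);
    if (!entry) return undefined;
    if (entry.expiresAt <= this.now()) {
      this.entries.delete(key); // lazily evict expired entries
      return undefined;
    }
    return entry.value as T;
  }

  set<T>(key: string, value: T): void {
    this.entries.set(key, { value, expiresAt: this.now() + this.ttlMs });
  }

  // Return the cached value, or run the expensive computation and store it.
  // Note: a computed value of undefined is never treated as a cache hit.
  async getOrCompute<T>(key: string, compute: () => Promise<T>): Promise<T> {
    const cached = this.get<T>(key);
    if (cached !== undefined) return cached;
    const value = await compute();
    this.set(key, value);
    return value;
  }

  // Assumed semantics: "*" matches any run of characters; no pattern clears everything.
  clear(pattern?: string): number {
    const matcher =
      pattern === undefined
        ? null
        : new RegExp('^' + pattern.split('*').map(escapeRegExp).join('.*') + '$');
    let removed = 0;
    this.entries.forEach((_entry, key) => {
      if (matcher === null || matcher.test(key)) {
        this.entries.delete(key);
        removed += 1;
      }
    });
    return removed;
  }

  // Drop expired entries; the runbook describes this running every 5 minutes.
  cleanup(): number {
    const current = this.now();
    let removed = 0;
    this.entries.forEach((entry, key) => {
      if (entry.expiresAt <= current) {
        this.entries.delete(key);
        removed += 1;
      }
    });
    return removed;
  }
}
```

In a long-running process, `cleanup()` would typically be scheduled with `setInterval`; injecting the clock keeps the expiry logic testable.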
|
||||
|
||||
---
|
||||
|
||||
## API Endpoints Reference
|
||||
|
||||
### Price Endpoints
|
||||
|
||||
```bash
|
||||
# Product price trend (last 30 days)
|
||||
GET /api/az/analytics/price/product/12345?days=30
|
||||
|
||||
# Brand price trend with filters
|
||||
GET /api/az/analytics/price/brand/Cookies?storeId=101&category=Flower&days=90
|
||||
|
||||
# Category median price
|
||||
GET /api/az/analytics/price/category/Vaporizers?state=AZ
|
||||
|
||||
# Price summary (7d/30d/90d)
|
||||
GET /api/az/analytics/price/summary?brand=Stiiizy&state=AZ
|
||||
|
||||
# Detect price wars
|
||||
GET /api/az/analytics/price/compression/Flower?state=AZ
|
||||
|
||||
# Global stats
|
||||
GET /api/az/analytics/price/global
|
||||
```
|
||||
|
||||
### Penetration Endpoints
|
||||
|
||||
```bash
|
||||
# Brand penetration
|
||||
GET /api/az/analytics/penetration/brand/Cookies
|
||||
|
||||
# Top brands leaderboard
|
||||
GET /api/az/analytics/penetration/top?limit=20&state=AZ&category=Flower
|
||||
|
||||
# Penetration trend
|
||||
GET /api/az/analytics/penetration/trend/Cookies?days=90
|
||||
|
||||
# Shelf share by category
|
||||
GET /api/az/analytics/penetration/shelf-share/Cookies
|
||||
|
||||
# Multi-state presence
|
||||
GET /api/az/analytics/penetration/by-state/Cookies
|
||||
|
||||
# Stores carrying brand
|
||||
GET /api/az/analytics/penetration/stores/Cookies
|
||||
|
||||
# Heatmap data
|
||||
GET /api/az/analytics/penetration/heatmap?brand=Cookies
|
||||
```
|
||||
|
||||
### Category Endpoints
|
||||
|
||||
```bash
|
||||
# Category summary
|
||||
GET /api/az/analytics/category/summary?category=Flower&state=AZ
|
||||
|
||||
# Category growth (7d/30d/90d)
|
||||
GET /api/az/analytics/category/growth?days=30&state=AZ
|
||||
|
||||
# Category trend
|
||||
GET /api/az/analytics/category/trend/Concentrates?days=90
|
||||
|
||||
# Heatmap
|
||||
GET /api/az/analytics/category/heatmap?metric=growth&periods=12
|
||||
|
||||
# Top movers (growing/declining)
|
||||
GET /api/az/analytics/category/top-movers?limit=5&days=30
|
||||
|
||||
# Subcategory breakdown
|
||||
GET /api/az/analytics/category/Edibles/subcategories
|
||||
```
|
||||
|
||||
### Store Endpoints
|
||||
|
||||
```bash
|
||||
# Store change summary
|
||||
GET /api/az/analytics/store/101/summary
|
||||
|
||||
# Event log
|
||||
GET /api/az/analytics/store/101/events?type=price_drop&days=7&limit=50
|
||||
|
||||
# New brands
|
||||
GET /api/az/analytics/store/101/brands/new?days=30
|
||||
|
||||
# Lost brands
|
||||
GET /api/az/analytics/store/101/brands/lost?days=30
|
||||
|
||||
# Product changes by type
|
||||
GET /api/az/analytics/store/101/products/changes?type=added&days=7
|
||||
|
||||
# Category leaderboard
|
||||
GET /api/az/analytics/store/leaderboard/Flower?limit=20
|
||||
|
||||
# Most active stores
|
||||
GET /api/az/analytics/store/most-active?days=7&limit=10
|
||||
|
||||
# Compare two stores
|
||||
GET /api/az/analytics/store/compare?store1=101&store2=102
|
||||
```
|
||||
|
||||
### Brand Opportunity Endpoints
|
||||
|
||||
```bash
|
||||
# Full opportunity analysis
|
||||
GET /api/az/analytics/brand/Cookies/opportunity
|
||||
|
||||
# Market position summary
|
||||
GET /api/az/analytics/brand/Cookies/position
|
||||
|
||||
# Get alerts
|
||||
GET /api/az/analytics/alerts?brand=Cookies&type=competitive&unreadOnly=true
|
||||
|
||||
# Mark alerts read
|
||||
POST /api/az/analytics/alerts/mark-read
|
||||
Body: { "alertIds": [1, 2, 3] }
|
||||
```
|
||||
|
||||
### Maintenance Endpoints
|
||||
|
||||
```bash
|
||||
# Capture daily snapshots (run by scheduler)
|
||||
POST /api/az/analytics/snapshots/capture
|
||||
|
||||
# Cache statistics
|
||||
GET /api/az/analytics/cache/stats
|
||||
|
||||
# Clear cache (admin)
|
||||
POST /api/az/analytics/cache/clear?pattern=price*
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Incremental Computation
|
||||
|
||||
Analytics are designed for real-time queries without full recomputation:
|
||||
|
||||
### Snapshot Strategy
|
||||
|
||||
1. **Raw Data**: `store_products` (current state)
|
||||
2. **Historical**: `store_product_snapshots` (time-series)
|
||||
3. **Aggregated**: `brand_snapshots`, `category_snapshots` (daily rollups)
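
The daily rollup in step 3 can be pictured as a group-by over snapshot rows. The sketch below is an in-memory illustration only: the row shapes and the function name `rollupBrandDaily` are hypothetical, and the real rollups are produced in the database.

```typescript
// Hypothetical row shapes; the real table columns may differ.
interface SnapshotRow {
  capturedAt: string; // ISO timestamp, e.g. "2024-12-01T08:00:00Z"
  brandId: number;
  dispensaryId: number;
  price: number | null;
}

interface BrandDailyMetric {
  date: string;
  brandId: number;
  storeCount: number;
  skuCount: number;
  avgPrice: number | null;
}

interface BrandDayGroup {
  date: string;
  brandId: number;
  stores: Set<number>;
  skuCount: number;
  priceSum: number;
  priceCount: number;
}

function rollupBrandDaily(rows: SnapshotRow[]): BrandDailyMetric[] {
  const groups = new Map<string, BrandDayGroup>();

  for (const row of rows) {
    const date = row.capturedAt.slice(0, 10); // keep the YYYY-MM-DD part
    const key = `${date}|${row.brandId}`;
    let group = groups.get(key);
    if (!group) {
      group = { date, brandId: row.brandId, stores: new Set<number>(), skuCount: 0, priceSum: 0, priceCount: 0 };
      groups.set(key, group);
    }
    group.stores.add(row.dispensaryId); // distinct stores carrying the brand
    group.skuCount += 1;
    if (row.price !== null) {
      // Like SQL AVG, ignore rows without a price.
      group.priceSum += row.price;
      group.priceCount += 1;
    }
  }

  return Array.from(groups.values()).map((g) => ({
    date: g.date,
    brandId: g.brandId,
    storeCount: g.stores.size,
    skuCount: g.skuCount,
    avgPrice: g.priceCount > 0 ? g.priceSum / g.priceCount : null,
  }));
}
```

Because each day's rollup depends only on that day's snapshots, earlier days never need to be recomputed, which is the point of the incremental strategy.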
|
||||
|
||||
### Window Calculations
|
||||
|
||||
```sql
|
||||
-- 7-day window
|
||||
WHERE crawled_at >= NOW() - INTERVAL '7 days'
|
||||
|
||||
-- 30-day window
|
||||
WHERE crawled_at >= NOW() - INTERVAL '30 days'
|
||||
|
||||
-- 90-day window
|
||||
WHERE crawled_at >= NOW() - INTERVAL '90 days'
|
||||
```
|
||||
|
||||
### Materialized Views (Optional)
|
||||
|
||||
For heavy queries, create materialized views:
|
||||
|
||||
```sql
CREATE MATERIALIZED VIEW mv_brand_daily_metrics AS
SELECT
  DATE(sps.captured_at) as date,
  sp.brand_id,
  COUNT(DISTINCT sp.dispensary_id) as store_count,
  COUNT(*) as sku_count,
  AVG(sp.price_rec) as avg_price
FROM store_product_snapshots sps
JOIN store_products sp ON sps.store_product_id = sp.id
WHERE sps.captured_at >= NOW() - INTERVAL '90 days'
GROUP BY DATE(sps.captured_at), sp.brand_id;

-- REFRESH ... CONCURRENTLY requires a unique index on the view
-- (the index name here is illustrative)
CREATE UNIQUE INDEX mv_brand_daily_metrics_date_brand_idx
  ON mv_brand_daily_metrics (date, brand_id);

-- Refresh daily
REFRESH MATERIALIZED VIEW CONCURRENTLY mv_brand_daily_metrics;
```
|
||||
|
||||
---
|
||||
|
||||
## Scheduled Jobs
|
||||
|
||||
### Daily Snapshot Capture
|
||||
|
||||
Trigger via cron or scheduler:
|
||||
|
||||
```bash
|
||||
curl -X POST http://localhost:3010/api/az/analytics/snapshots/capture
|
||||
```
|
||||
|
||||
This calls:
|
||||
- `capture_brand_snapshots()` - Captures brand metrics
|
||||
- `capture_category_snapshots()` - Captures category metrics
|
||||
|
||||
### Cache Cleanup
|
||||
|
||||
Automatic cleanup every 5 minutes via in-memory timer.
|
||||
|
||||
For manual cleanup:
|
||||
```bash
|
||||
curl -X POST http://localhost:3010/api/az/analytics/cache/clear
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Extending Analytics (Future Phases)
|
||||
|
||||
### Phase 6: Intelligence Engine
|
||||
- Automated alert generation
|
||||
- Recommendation engine
|
||||
- Price prediction
|
||||
|
||||
### Phase 7: Orders Integration
|
||||
- Sales velocity analytics
|
||||
- Reorder predictions
|
||||
- Inventory turnover
|
||||
|
||||
### Phase 8: Advanced ML
|
||||
- Demand forecasting
|
||||
- Price elasticity modeling
|
||||
- Customer segmentation
|
||||
|
||||
---
|
||||
|
||||
## Troubleshooting
|
||||
|
||||
### Common Issues
|
||||
|
||||
**1. Slow queries**
|
||||
- Check cache stats: `GET /api/az/analytics/cache/stats`
|
||||
- Increase cache TTL if data doesn't need real-time freshness
|
||||
- Add indexes on frequently filtered columns
|
||||
|
||||
**2. Empty results**
|
||||
- Verify data exists in source tables
|
||||
- Check filter parameters (brand names are case-sensitive)
|
||||
- Verify state codes are valid
|
||||
|
||||
**3. Stale data**
|
||||
- Run snapshot capture: `POST /api/az/analytics/snapshots/capture`
|
||||
- Clear cache: `POST /api/az/analytics/cache/clear`
|
||||
|
||||
### Debugging
|
||||
|
||||
Enable query logging:
|
||||
```typescript
|
||||
// In service constructor
|
||||
this.debug = process.env.ANALYTICS_DEBUG === 'true';
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Data Contracts
|
||||
|
||||
### Price Trend Response
|
||||
```typescript
interface PriceTrend {
  productId?: number;
  storeId?: number;
  brandName?: string;
  category?: string;
  dataPoints: Array<{
    date: string;
    minPrice: number | null;
    maxPrice: number | null;
    avgPrice: number | null;
    wholesalePrice: number | null;
    sampleSize: number;
  }>;
  summary: {
    currentAvg: number | null;
    previousAvg: number | null;
    changePercent: number | null;
    trend: 'up' | 'down' | 'stable';
    volatilityScore: number | null;
  };
}
```
|
||||
|
||||
### Brand Penetration Response
|
||||
```typescript
interface BrandPenetration {
  brandName: string;
  totalStores: number;
  storesWithBrand: number;
  penetrationPercent: number;
  skuCount: number;
  avgPrice: number | null;
  priceRange: { min: number; max: number } | null;
  topCategories: Array<{ category: string; count: number }>;
  stateBreakdown?: Array<{ state: string; storeCount: number }>;
}
```
|
||||
|
||||
### Category Growth Response
|
||||
```typescript
interface CategoryGrowth {
  category: string;
  currentCount: number;
  previousCount: number;
  growthPercent: number;
  growthTrend: 'up' | 'down' | 'stable';
  avgPrice: number | null;
  priceChange: number | null;
  topBrands: Array<{ brandName: string; count: number }>;
}
```
|
||||
|
||||
---
|
||||
|
||||
## Files Reference
|
||||
|
||||
| File | Purpose |
|------|---------|
| `src/dutchie-az/services/analytics/price-trends.ts` | Price analytics |
| `src/dutchie-az/services/analytics/penetration.ts` | Brand penetration |
| `src/dutchie-az/services/analytics/category-analytics.ts` | Category metrics |
| `src/dutchie-az/services/analytics/store-changes.ts` | Store event tracking |
| `src/dutchie-az/services/analytics/brand-opportunity.ts` | Competitive intel |
| `src/dutchie-az/services/analytics/cache.ts` | Caching layer |
| `src/dutchie-az/services/analytics/index.ts` | Module exports |
| `src/dutchie-az/routes/analytics.ts` | API routes (680 LOC) |
| `src/multi-state/state-query-service.ts` | Cross-state analytics |
|
||||
|
||||
---
|
||||
|
||||
|
||||
## Analytics V2: Rec/Med State Segmentation
|
||||
|
||||
Phase 3 enhancement: analytics segmented by recreational vs. medical-only states.
|
||||
|
||||
### V2 API Endpoints
|
||||
|
||||
All V2 endpoints are prefixed with `/api/analytics/v2`
|
||||
|
||||
#### V2 Price Analytics
|
||||
|
||||
```bash
|
||||
# Price trends for a specific product
|
||||
GET /api/analytics/v2/price/product/12345?window=30d
|
||||
|
||||
# Price by category and state (with rec/med segmentation)
|
||||
GET /api/analytics/v2/price/category/Flower?state=AZ
|
||||
|
||||
# Price by brand and state
|
||||
GET /api/analytics/v2/price/brand/Cookies?state=AZ
|
||||
|
||||
# Most volatile products
|
||||
GET /api/analytics/v2/price/volatile?window=30d&limit=50&state=AZ
|
||||
|
||||
# Rec vs Med price comparison by category
|
||||
GET /api/analytics/v2/price/rec-vs-med?category=Flower
|
||||
```
|
||||
|
||||
#### V2 Brand Penetration
|
||||
|
||||
```bash
|
||||
# Brand penetration metrics with state breakdown
|
||||
GET /api/analytics/v2/brand/Cookies/penetration?window=30d
|
||||
|
||||
# Brand market position within categories
|
||||
GET /api/analytics/v2/brand/Cookies/market-position?category=Flower&state=AZ
|
||||
|
||||
# Brand presence in rec vs med-only states
|
||||
GET /api/analytics/v2/brand/Cookies/rec-vs-med
|
||||
|
||||
# Top brands by penetration
|
||||
GET /api/analytics/v2/brand/top?limit=25&state=AZ
|
||||
|
||||
# Brands expanding or contracting
|
||||
GET /api/analytics/v2/brand/expansion-contraction?window=30d&limit=25
|
||||
```
|
||||
|
||||
#### V2 Category Analytics
|
||||
|
||||
```bash
|
||||
# Category growth metrics
|
||||
GET /api/analytics/v2/category/Flower/growth?window=30d
|
||||
|
||||
# Category growth trend over time
|
||||
GET /api/analytics/v2/category/Flower/trend?window=30d
|
||||
|
||||
# Top brands in category
|
||||
GET /api/analytics/v2/category/Flower/top-brands?limit=25&state=AZ
|
||||
|
||||
# All categories with metrics
|
||||
GET /api/analytics/v2/category/all?state=AZ&limit=50
|
||||
|
||||
# Rec vs Med category comparison
|
||||
GET /api/analytics/v2/category/rec-vs-med?category=Flower
|
||||
|
||||
# Fastest growing categories
|
||||
GET /api/analytics/v2/category/fastest-growing?window=30d&limit=25
|
||||
```
|
||||
|
||||
#### V2 Store Analytics
|
||||
|
||||
```bash
|
||||
# Store change summary
|
||||
GET /api/analytics/v2/store/101/summary?window=30d
|
||||
|
||||
# Product change events
|
||||
GET /api/analytics/v2/store/101/events?window=7d&limit=100
|
||||
|
||||
# Store inventory composition
|
||||
GET /api/analytics/v2/store/101/inventory
|
||||
|
||||
# Store price positioning vs market
|
||||
GET /api/analytics/v2/store/101/price-position
|
||||
|
||||
# Most active stores by changes
|
||||
GET /api/analytics/v2/store/most-active?window=7d&limit=25&state=AZ
|
||||
```
|
||||
|
||||
#### V2 State Analytics
|
||||
|
||||
```bash
|
||||
# State market summary
|
||||
GET /api/analytics/v2/state/AZ/summary
|
||||
|
||||
# All states with coverage metrics
|
||||
GET /api/analytics/v2/state/all
|
||||
|
||||
# Legal state breakdown (rec, med-only, no program)
|
||||
GET /api/analytics/v2/state/legal-breakdown
|
||||
|
||||
# Rec vs Med pricing by category
|
||||
GET /api/analytics/v2/state/rec-vs-med-pricing?category=Flower
|
||||
|
||||
# States with coverage gaps
|
||||
GET /api/analytics/v2/state/coverage-gaps
|
||||
|
||||
# Cross-state pricing comparison
|
||||
GET /api/analytics/v2/state/price-comparison
|
||||
```
|
||||
|
||||
### V2 Services Architecture
|
||||
|
||||
```
src/services/analytics/
├── index.ts                      # Exports all V2 services
├── types.ts                      # Shared type definitions
├── PriceAnalyticsService.ts      # Price trends and volatility
├── BrandPenetrationService.ts    # Brand market presence
├── CategoryAnalyticsService.ts   # Category growth analysis
├── StoreAnalyticsService.ts      # Store change tracking
└── StateAnalyticsService.ts      # State-level analytics

src/routes/analytics-v2.ts        # V2 API route handlers
```
|
||||
|
||||
### Key V2 Features
|
||||
|
||||
1. **Rec/Med State Segmentation**: All analytics can be filtered and compared by legal status
|
||||
2. **State Coverage Gaps**: Identify legal states with missing or stale data
|
||||
3. **Cross-State Pricing**: Compare prices across recreational and medical-only markets
|
||||
4. **Brand Footprint Analysis**: Track brand presence in rec vs med states
|
||||
5. **Category Comparison**: Compare category performance by legal status
|
||||
|
||||
### V2 Migration Path
|
||||
|
||||
1. Run migration 052 for state cannabis flags:
```bash
psql "$DATABASE_URL" -f migrations/052_add_state_cannabis_flags.sql
```

2. Run migration 053 for analytics indexes:
```bash
psql "$DATABASE_URL" -f migrations/053_analytics_indexes.sql
```

3. Restart backend to pick up new routes
|
||||
|
||||
### V2 Response Examples
|
||||
|
||||
**Rec vs Med Price Comparison:**
|
||||
```json
|
||||
{
|
||||
"category": "Flower",
|
||||
"recreational": {
|
||||
"state_count": 15,
|
||||
"product_count": 12500,
|
||||
"avg_price": 35.50,
|
||||
"median_price": 32.00
|
||||
},
|
||||
"medical_only": {
|
||||
"state_count": 8,
|
||||
"product_count": 5200,
|
||||
"avg_price": 42.00,
|
||||
"median_price": 40.00
|
||||
},
|
||||
"price_diff_percent": -15.48
|
||||
}
|
||||
```
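
The sample values above are consistent with `price_diff_percent` being the recreational average measured relative to the medical-only average. That reading is inferred from the numbers, not confirmed by the service code, and the helper name below is hypothetical.

```typescript
// Assumed formula: (recAvg - medAvg) / medAvg * 100, rounded to two decimals.
// A negative result means recreational prices are lower than medical-only prices.
function priceDiffPercent(recAvg: number, medAvg: number): number | null {
  if (medAvg === 0) return null; // no meaningful baseline
  return Math.round(((recAvg - medAvg) / medAvg) * 10000) / 100;
}
```

With the sample averages (35.50 recreational, 42.00 medical-only), this reproduces the -15.48 shown in the response.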
|
||||
|
||||
**Legal State Breakdown:**
|
||||
```json
|
||||
{
|
||||
"recreational_states": {
|
||||
"count": 24,
|
||||
"dispensary_count": 850,
|
||||
"product_count": 125000,
|
||||
"states": [
|
||||
{ "code": "CA", "name": "California", "dispensary_count": 250 },
|
||||
{ "code": "CO", "name": "Colorado", "dispensary_count": 150 }
|
||||
]
|
||||
},
|
||||
"medical_only_states": {
|
||||
"count": 18,
|
||||
"dispensary_count": 320,
|
||||
"product_count": 45000,
|
||||
"states": [
|
||||
{ "code": "FL", "name": "Florida", "dispensary_count": 120 }
|
||||
]
|
||||
},
|
||||
"no_program_states": {
|
||||
"count": 9,
|
||||
"states": [
|
||||
{ "code": "ID", "name": "Idaho" }
|
||||
]
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
*Phase 3 Analytics Engine - Fully Implemented*
|
||||
*V2 Rec/Med State Analytics - Added December 2024*
|
||||
594
backend/docs/_archive/ANALYTICS_V2_EXAMPLES.md
Normal file
@@ -0,0 +1,594 @@
|
||||
# Analytics V2 API Examples
|
||||
|
||||
## Overview
|
||||
|
||||
All endpoints are prefixed with `/api/analytics/v2`
|
||||
|
||||
### Filtering Options
|
||||
|
||||
**Time Windows:**
|
||||
- `?window=7d` - Last 7 days
|
||||
- `?window=30d` - Last 30 days (default)
|
||||
- `?window=90d` - Last 90 days
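
A route handler might translate the `window` query parameter into a day count along these lines. This is a sketch: the names `WINDOW_DAYS` and `parseWindow` are hypothetical, and falling back to the 30-day default for unrecognized values (rather than rejecting the request) is an assumption.

```typescript
const WINDOW_DAYS = { '7d': 7, '30d': 30, '90d': 90 } as const;
type TimeWindow = keyof typeof WINDOW_DAYS;

// Own-property check so inherited keys such as "toString" are not accepted.
function isTimeWindow(value: string): value is TimeWindow {
  return Object.prototype.hasOwnProperty.call(WINDOW_DAYS, value);
}

function parseWindow(raw: string | undefined): { window: TimeWindow; days: number } {
  const window: TimeWindow = raw !== undefined && isTimeWindow(raw) ? raw : '30d';
  return { window, days: WINDOW_DAYS[window] };
}
```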
|
||||
|
||||
**Legal Type Filtering:**
|
||||
- `?legalType=recreational` - Recreational states only
|
||||
- `?legalType=medical_only` - Medical-only states (not recreational)
|
||||
- `?legalType=no_program` - States with no cannabis program
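
The three legal types can be derived from two boolean flags per state. The sketch below assumes the flag names match the `recreational_legal` and `medical_legal` fields that appear in the state summary response; the function and type names are hypothetical.

```typescript
type LegalType = 'recreational' | 'medical_only' | 'no_program';

// Field names mirror the legal_status object in the state summary response.
interface LegalFlags {
  recreational_legal: boolean;
  medical_legal: boolean;
}

function legalTypeOf(flags: LegalFlags): LegalType {
  if (flags.recreational_legal) return 'recreational'; // rec takes precedence over med
  if (flags.medical_legal) return 'medical_only';
  return 'no_program';
}

// Keep only the states that match the requested legalType filter.
function filterByLegalType<T extends LegalFlags>(states: T[], legalType: LegalType): T[] {
  return states.filter((state) => legalTypeOf(state) === legalType);
}
```

A state that allows both recreational and medical use is classified as `recreational`, which is why `medical_only` is described as "not recreational".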
|
||||
|
||||
---
|
||||
|
||||
## 1. Price Analytics
|
||||
|
||||
### GET /price/product/:id
|
||||
|
||||
Get price trends for a specific store product.
|
||||
|
||||
**Request:**
|
||||
```bash
|
||||
GET /api/analytics/v2/price/product/12345?window=30d
|
||||
```
|
||||
|
||||
**Response:**
|
||||
```json
|
||||
{
|
||||
"store_product_id": 12345,
|
||||
"product_name": "Blue Dream 3.5g",
|
||||
"brand_name": "Cookies",
|
||||
"category": "Flower",
|
||||
"dispensary_id": 101,
|
||||
"dispensary_name": "Green Leaf Dispensary",
|
||||
"state_code": "AZ",
|
||||
"data_points": [
|
||||
{
|
||||
"date": "2024-11-06",
|
||||
"price_rec": 45.00,
|
||||
"price_med": 40.00,
|
||||
"price_rec_special": null,
|
||||
"price_med_special": null,
|
||||
"is_on_special": false
|
||||
},
|
||||
{
|
||||
"date": "2024-11-07",
|
||||
"price_rec": 42.00,
|
||||
"price_med": 38.00,
|
||||
"price_rec_special": null,
|
||||
"price_med_special": null,
|
||||
"is_on_special": false
|
||||
}
|
||||
],
|
||||
"summary": {
|
||||
"current_price": 42.00,
|
||||
"min_price": 40.00,
|
||||
"max_price": 48.00,
|
||||
"avg_price": 43.50,
|
||||
"price_change_count": 3,
|
||||
"volatility_percent": 8.2
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
### GET /price/rec-vs-med
|
||||
|
||||
Get recreational vs medical-only price comparison by category.
|
||||
|
||||
**Request:**
|
||||
```bash
|
||||
GET /api/analytics/v2/price/rec-vs-med?category=Flower
|
||||
```
|
||||
|
||||
**Response:**
|
||||
```json
|
||||
[
|
||||
{
|
||||
"category": "Flower",
|
||||
"rec_avg": 38.50,
|
||||
"rec_median": 35.00,
|
||||
"med_avg": 42.00,
|
||||
"med_median": 40.00
|
||||
},
|
||||
{
|
||||
"category": "Concentrates",
|
||||
"rec_avg": 45.00,
|
||||
"rec_median": 42.00,
|
||||
"med_avg": 48.00,
|
||||
"med_median": 45.00
|
||||
}
|
||||
]
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## 2. Brand Analytics
|
||||
|
||||
### GET /brand/:name/penetration
|
||||
|
||||
Get brand penetration metrics with state breakdown.
|
||||
|
||||
**Request:**
|
||||
```bash
|
||||
GET /api/analytics/v2/brand/Cookies/penetration?window=30d
|
||||
```
|
||||
|
||||
**Response:**
|
||||
```json
|
||||
{
|
||||
"brand_name": "Cookies",
|
||||
"total_dispensaries": 125,
|
||||
"total_skus": 450,
|
||||
"avg_skus_per_dispensary": 3.6,
|
||||
"states_present": ["AZ", "CA", "CO", "NV", "MI"],
|
||||
"state_breakdown": [
|
||||
{
|
||||
"state_code": "CA",
|
||||
"state_name": "California",
|
||||
"legal_type": "recreational",
|
||||
"dispensary_count": 45,
|
||||
"sku_count": 180,
|
||||
"avg_skus_per_dispensary": 4.0,
|
||||
"market_share_percent": 12.5
|
||||
},
|
||||
{
|
||||
"state_code": "AZ",
|
||||
"state_name": "Arizona",
|
||||
"legal_type": "recreational",
|
||||
"dispensary_count": 32,
|
||||
"sku_count": 128,
|
||||
"avg_skus_per_dispensary": 4.0,
|
||||
"market_share_percent": 15.2
|
||||
}
|
||||
],
|
||||
"penetration_trend": [
|
||||
{
|
||||
"date": "2024-11-01",
|
||||
"dispensary_count": 120,
|
||||
"new_dispensaries": 0,
|
||||
"dropped_dispensaries": 0
|
||||
},
|
||||
{
|
||||
"date": "2024-11-08",
|
||||
"dispensary_count": 123,
|
||||
"new_dispensaries": 3,
|
||||
"dropped_dispensaries": 0
|
||||
},
|
||||
{
|
||||
"date": "2024-11-15",
|
||||
"dispensary_count": 125,
|
||||
"new_dispensaries": 2,
|
||||
"dropped_dispensaries": 0
|
||||
}
|
||||
]
|
||||
}
|
||||
```
|
||||
|
||||
### GET /brand/:name/rec-vs-med
|
||||
|
||||
Get brand presence in recreational vs medical-only states.
|
||||
|
||||
**Request:**
|
||||
```bash
|
||||
GET /api/analytics/v2/brand/Cookies/rec-vs-med
|
||||
```
|
||||
|
||||
**Response:**
|
||||
```json
|
||||
{
|
||||
"brand_name": "Cookies",
|
||||
"rec_states_count": 4,
|
||||
"rec_states": ["AZ", "CA", "CO", "NV"],
|
||||
"rec_dispensary_count": 110,
|
||||
"rec_avg_skus": 3.8,
|
||||
"med_only_states_count": 2,
|
||||
"med_only_states": ["FL", "OH"],
|
||||
"med_only_dispensary_count": 15,
|
||||
"med_only_avg_skus": 2.5
|
||||
}
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## 3. Category Analytics
|
||||
|
||||
### GET /category/:name/growth
|
||||
|
||||
Get category growth metrics with state breakdown.
|
||||
|
||||
**Request:**
|
||||
```bash
|
||||
GET /api/analytics/v2/category/Flower/growth?window=30d
|
||||
```
|
||||
|
||||
**Response:**
|
||||
```json
|
||||
{
|
||||
"category": "Flower",
|
||||
"current_sku_count": 5200,
|
||||
"current_dispensary_count": 320,
|
||||
"avg_price": 38.50,
|
||||
"growth_data": [
|
||||
{
|
||||
"date": "2024-11-01",
|
||||
"sku_count": 4800,
|
||||
"dispensary_count": 310,
|
||||
"avg_price": 39.00
|
||||
},
|
||||
{
|
||||
"date": "2024-11-15",
|
||||
"sku_count": 5000,
|
||||
"dispensary_count": 315,
|
||||
"avg_price": 38.75
|
||||
},
|
||||
{
|
||||
"date": "2024-12-01",
|
||||
"sku_count": 5200,
|
||||
"dispensary_count": 320,
|
||||
"avg_price": 38.50
|
||||
}
|
||||
],
|
||||
"state_breakdown": [
|
||||
{
|
||||
"state_code": "CA",
|
||||
"state_name": "California",
|
||||
"legal_type": "recreational",
|
||||
"sku_count": 2100,
|
||||
"dispensary_count": 145,
|
||||
"avg_price": 36.00
|
||||
},
|
||||
{
|
||||
"state_code": "AZ",
|
||||
"state_name": "Arizona",
|
||||
"legal_type": "recreational",
|
||||
"sku_count": 950,
|
||||
"dispensary_count": 85,
|
||||
"avg_price": 40.00
|
||||
}
|
||||
]
|
||||
}
|
||||
```
|
||||
|
||||
### GET /category/rec-vs-med
|
||||
|
||||
Get category comparison between recreational and medical-only states.
|
||||
|
||||
**Request:**
|
||||
```bash
|
||||
GET /api/analytics/v2/category/rec-vs-med
|
||||
```
|
||||
|
||||
**Response:**
|
||||
```json
|
||||
[
|
||||
{
|
||||
"category": "Flower",
|
||||
"recreational": {
|
||||
"state_count": 15,
|
||||
"dispensary_count": 650,
|
||||
"sku_count": 12500,
|
||||
"avg_price": 35.50,
|
||||
"median_price": 32.00
|
||||
},
|
||||
"medical_only": {
|
||||
"state_count": 8,
|
||||
"dispensary_count": 220,
|
||||
"sku_count": 4200,
|
||||
"avg_price": 42.00,
|
||||
"median_price": 40.00
|
||||
},
|
||||
"price_diff_percent": -15.48
|
||||
},
|
||||
{
|
||||
"category": "Concentrates",
|
||||
"recreational": {
|
||||
"state_count": 15,
|
||||
"dispensary_count": 600,
|
||||
"sku_count": 8500,
|
||||
"avg_price": 42.00,
|
||||
"median_price": 40.00
|
||||
},
|
||||
"medical_only": {
|
||||
"state_count": 8,
|
||||
"dispensary_count": 200,
|
||||
"sku_count": 3100,
|
||||
"avg_price": 48.00,
|
||||
"median_price": 45.00
|
||||
},
|
||||
"price_diff_percent": -12.50
|
||||
}
|
||||
]
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## 4. Store Analytics
|
||||
|
||||
### GET /store/:id/summary
|
||||
|
||||
Get change summary for a store over a time window.
|
||||
|
||||
**Request:**
|
||||
```bash
|
||||
GET /api/analytics/v2/store/101/summary?window=30d
|
||||
```
|
||||
|
||||
**Response:**
|
||||
```json
|
||||
{
|
||||
"dispensary_id": 101,
|
||||
"dispensary_name": "Green Leaf Dispensary",
|
||||
"state_code": "AZ",
|
||||
"window": "30d",
|
||||
"products_added": 45,
|
||||
"products_dropped": 12,
|
||||
"brands_added": ["Alien Labs", "Connected"],
|
||||
"brands_dropped": ["House Brand"],
|
||||
"price_changes": 156,
|
||||
"avg_price_change_percent": 3.2,
|
||||
"stock_in_events": 89,
|
||||
"stock_out_events": 34,
|
||||
"current_product_count": 512,
|
||||
"current_in_stock_count": 478
|
||||
}
|
||||
```
|
||||
|
||||
### GET /store/:id/events
|
||||
|
||||
Get recent product change events for a store.
|
||||
|
||||
**Request:**
|
||||
```bash
|
||||
GET /api/analytics/v2/store/101/events?window=7d&limit=50
|
||||
```
|
||||
|
||||
**Response:**
|
||||
```json
|
||||
[
|
||||
{
|
||||
"store_product_id": 12345,
|
||||
"product_name": "Blue Dream 3.5g",
|
||||
"brand_name": "Cookies",
|
||||
"category": "Flower",
|
||||
"event_type": "price_change",
|
||||
"event_date": "2024-12-05T14:30:00.000Z",
|
||||
"old_value": "45.00",
|
||||
"new_value": "42.00"
|
||||
},
|
||||
{
|
||||
"store_product_id": 12346,
|
||||
"product_name": "OG Kush 1g",
|
||||
"brand_name": "Alien Labs",
|
||||
"category": "Flower",
|
||||
"event_type": "added",
|
||||
"event_date": "2024-12-04T10:00:00.000Z",
|
||||
"old_value": null,
|
||||
"new_value": null
|
||||
},
|
||||
{
|
||||
"store_product_id": 12300,
|
||||
"product_name": "Sour Diesel Cart",
|
||||
"brand_name": "Select",
|
||||
"category": "Vaporizers",
|
||||
"event_type": "stock_out",
|
||||
"event_date": "2024-12-03T16:45:00.000Z",
|
||||
"old_value": "true",
|
||||
"new_value": "false"
|
||||
}
|
||||
]
|
||||
```
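In the sample above, `old_value` and `new_value` arrive as strings (or `null`), so a client has to parse them before doing arithmetic. A minimal sketch of that parsing for `price_change` events follows; `StoreEvent` lists only the fields used here and `priceDelta` is a hypothetical helper.

```typescript
// Subset of one event, as in the sample response above.
interface StoreEvent {
  store_product_id: number;
  product_name: string;
  event_type: string;
  old_value: string | null;
  new_value: string | null;
}

// Returns the signed price delta for price_change events, or null when the
// event is another type or its values are missing or non-numeric.
function priceDelta(event: StoreEvent): number | null {
  if (event.event_type !== 'price_change') return null;
  if (event.old_value === null || event.new_value === null) return null;
  const oldPrice = parseFloat(event.old_value);
  const newPrice = parseFloat(event.new_value);
  if (Number.isNaN(oldPrice) || Number.isNaN(newPrice)) return null;
  return Math.round((newPrice - oldPrice) * 100) / 100;
}

const sampleEvent: StoreEvent = {
  store_product_id: 12345,
  product_name: 'Blue Dream 3.5g',
  event_type: 'price_change',
  old_value: '45.00',
  new_value: '42.00',
};
const sampleDelta = priceDelta(sampleEvent);
```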
|
||||
|
||||
---
|
||||
|
||||
## 5. State Analytics
|
||||
|
||||
### GET /state/:code/summary
|
||||
|
||||
Get market summary for a specific state with rec/med breakdown.
|
||||
|
||||
**Request:**
|
||||
```bash
|
||||
GET /api/analytics/v2/state/AZ/summary
|
||||
```
|
||||
|
||||
**Response:**
|
||||
```json
|
||||
{
|
||||
"state_code": "AZ",
|
||||
"state_name": "Arizona",
|
||||
"legal_status": {
|
||||
"recreational_legal": true,
|
||||
"rec_year": 2020,
|
||||
"medical_legal": true,
|
||||
"med_year": 2010
|
||||
},
|
||||
"coverage": {
|
||||
"dispensary_count": 145,
|
||||
"product_count": 18500,
|
||||
"brand_count": 320,
|
||||
"category_count": 12,
|
||||
"snapshot_count": 2450000,
|
||||
"last_crawl_at": "2024-12-06T02:30:00.000Z"
|
||||
},
|
||||
"pricing": {
|
||||
"avg_price": 42.50,
|
||||
"median_price": 38.00,
|
||||
"min_price": 5.00,
|
||||
"max_price": 250.00
|
||||
},
|
||||
"top_categories": [
|
||||
{ "category": "Flower", "count": 5200 },
|
||||
{ "category": "Concentrates", "count": 3800 },
|
||||
{ "category": "Vaporizers", "count": 2950 },
|
||||
{ "category": "Edibles", "count": 2400 },
|
||||
{ "category": "Pre-Rolls", "count": 1850 }
|
||||
],
|
||||
"top_brands": [
|
||||
{ "brand": "Cookies", "count": 450 },
|
||||
{ "brand": "Alien Labs", "count": 380 },
|
||||
{ "brand": "Connected", "count": 320 },
|
||||
{ "brand": "Stiiizy", "count": 290 },
|
||||
{ "brand": "Raw Garden", "count": 275 }
|
||||
]
|
||||
}
|
||||
```
|
||||
|
||||
### GET /state/legal-breakdown
|
||||
|
||||
Get breakdown by legal status (recreational, medical-only, no program).
|
||||
|
||||
**Request:**
|
||||
```bash
|
||||
GET /api/analytics/v2/state/legal-breakdown
|
||||
```
|
||||
|
||||
**Response:**
|
||||
```json
|
||||
{
|
||||
"recreational_states": {
|
||||
"count": 24,
|
||||
"dispensary_count": 850,
|
||||
"product_count": 125000,
|
||||
"snapshot_count": 15000000,
|
||||
"states": [
|
||||
{ "code": "CA", "name": "California", "dispensary_count": 250 },
|
||||
{ "code": "CO", "name": "Colorado", "dispensary_count": 150 },
|
||||
{ "code": "AZ", "name": "Arizona", "dispensary_count": 145 },
|
||||
{ "code": "MI", "name": "Michigan", "dispensary_count": 120 }
|
||||
]
|
||||
},
|
||||
"medical_only_states": {
|
||||
"count": 18,
|
||||
"dispensary_count": 320,
|
||||
"product_count": 45000,
|
||||
"snapshot_count": 5000000,
|
||||
"states": [
|
||||
{ "code": "FL", "name": "Florida", "dispensary_count": 120 },
|
||||
{ "code": "OH", "name": "Ohio", "dispensary_count": 85 },
|
||||
{ "code": "PA", "name": "Pennsylvania", "dispensary_count": 75 }
|
||||
]
|
||||
},
|
||||
"no_program_states": {
|
||||
"count": 9,
|
||||
"states": [
|
||||
{ "code": "ID", "name": "Idaho" },
|
||||
{ "code": "WY", "name": "Wyoming" },
|
||||
{ "code": "KS", "name": "Kansas" }
|
||||
]
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
### GET /state/recreational
|
||||
|
||||
Get list of recreational state codes.
|
||||
|
||||
**Request:**
|
||||
```bash
|
||||
GET /api/analytics/v2/state/recreational
|
||||
```
|
||||
|
||||
**Response:**
|
||||
```json
|
||||
{
|
||||
"legal_type": "recreational",
|
||||
"states": ["AK", "AZ", "CA", "CO", "CT", "DE", "IL", "MA", "MD", "ME", "MI", "MN", "MO", "MT", "NJ", "NM", "NV", "NY", "OH", "OR", "RI", "VA", "VT", "WA"],
|
||||
"count": 24
|
||||
}
|
||||
```
|
||||
|
||||
### GET /state/medical-only
|
||||
|
||||
Get list of medical-only state codes (not recreational).
|
||||
|
||||
**Request:**
|
||||
```bash
|
||||
GET /api/analytics/v2/state/medical-only
|
||||
```
|
||||
|
||||
**Response:**
|
||||
```json
|
||||
{
|
||||
"legal_type": "medical_only",
|
||||
"states": ["AR", "FL", "HI", "LA", "MS", "ND", "NH", "OK", "PA", "SD", "UT", "WV"],
|
||||
"count": 12
|
||||
}
|
||||
```
|
||||
|
||||
### GET /state/rec-vs-med-pricing
|
||||
|
||||
Get rec vs med price comparison by category.
|
||||
|
||||
**Request:**
|
||||
```bash
|
||||
GET /api/analytics/v2/state/rec-vs-med-pricing?category=Flower
|
||||
```
|
||||
|
||||
**Response:**
|
||||
```json
|
||||
[
|
||||
{
|
||||
"category": "Flower",
|
||||
"recreational": {
|
||||
"state_count": 15,
|
||||
"product_count": 12500,
|
||||
"avg_price": 35.50,
|
||||
"median_price": 32.00
|
||||
},
|
||||
"medical_only": {
|
||||
"state_count": 8,
|
||||
"product_count": 5200,
|
||||
"avg_price": 42.00,
|
||||
"median_price": 40.00
|
||||
},
|
||||
"price_diff_percent": -15.48
|
||||
}
|
||||
]
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## How These Endpoints Support Portals
|
||||
|
||||
### Brand Portal Use Cases
|
||||
|
||||
1. **Track brand penetration**: Use `/brand/:name/penetration` to see how many stores carry the brand
|
||||
2. **Compare rec vs med markets**: Use `/brand/:name/rec-vs-med` to understand footprint by legal status
|
||||
3. **Identify expansion opportunities**: Use `/state/coverage-gaps` to find underserved markets
|
||||
4. **Monitor pricing**: Use `/price/brand/:brand` to track pricing by state
|
||||
|
||||
### Buyer Portal Use Cases
|
||||
|
||||
1. **Compare stores**: Use `/store/:id/summary` to see activity levels
|
||||
2. **Track price changes**: Use `/store/:id/events` to monitor competitor pricing
|
||||
3. **Analyze categories**: Use `/category/:name/growth` to identify trending products
|
||||
4. **State-level insights**: Use `/state/:code/summary` for market overview
|
||||
|
||||
---
|
||||
|
||||
## Time Window Filtering
|
||||
|
||||
All time-based endpoints support the `window` query parameter:
|
||||
|
||||
| Value | Description |
|
||||
|-------|-------------|
|
||||
| `7d` | Last 7 days |
|
||||
| `30d` | Last 30 days (default) |
|
||||
| `90d` | Last 90 days |
|
||||
|
||||
The window affects:
|
||||
- `store_product_snapshots.captured_at` for historical data
|
||||
- `store_products.first_seen_at` / `last_seen_at` for product lifecycle
|
||||
- `crawl_runs.started_at` for crawl-based metrics
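As an illustration of how a `window` value might translate into a timestamp cutoff for the columns listed above, here is a small sketch. The helper name `windowCutoff` and the day-based arithmetic are assumptions; the server may compute the boundary differently.

```typescript
type AnalyticsWindow = '7d' | '30d' | '90d';

const WINDOW_DAYS: Record<AnalyticsWindow, number> = { '7d': 7, '30d': 30, '90d': 90 };

// Hypothetical helper: earliest timestamp included in the window, relative
// to `now`. Defaults to the documented default window of 30d.
function windowCutoff(win: AnalyticsWindow = '30d', now: Date = new Date()): Date {
  const msPerDay = 24 * 60 * 60 * 1000;
  return new Date(now.getTime() - WINDOW_DAYS[win] * msPerDay);
}

const referenceNow = new Date('2024-12-06T00:00:00.000Z');
const sevenDayCutoff = windowCutoff('7d', referenceNow);
const defaultCutoff = windowCutoff(undefined, referenceNow);
```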
|
||||
|
||||
---
|
||||
|
||||
## Rec/Med Segmentation
|
||||
|
||||
All state-level endpoints automatically segment states by legal status:
|
||||
|
||||
- **Recreational**: `states.recreational_legal = TRUE`
|
||||
- **Medical-only**: `states.medical_legal = TRUE AND states.recreational_legal = FALSE`
|
||||
- **No program**: Both flags are FALSE or NULL
|
||||
|
||||
This segmentation appears in:
|
||||
- `legal_type` field in responses
|
||||
- State breakdown arrays
|
||||
- Price comparison endpoints
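The three segmentation rules can be expressed as a single classifier. This is a sketch rather than the service code: the `'no_program'` label is an assumed value (only `recreational` and `medical_only` appear in the sample responses), and a NULL recreational flag is treated like FALSE, which the SQL-style conditions above do not spell out.

```typescript
type LegalType = 'recreational' | 'medical_only' | 'no_program';

interface StateFlags {
  recreational_legal: boolean | null;
  medical_legal: boolean | null;
}

// Recreational wins when both flags are set; a state with only the medical
// flag is medical-only; anything else (FALSE or NULL) is "no program".
function legalType(state: StateFlags): LegalType {
  if (state.recreational_legal === true) return 'recreational';
  if (state.medical_legal === true) return 'medical_only';
  return 'no_program';
}
```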
|
||||
394
backend/docs/_archive/BRAND_INTELLIGENCE_API.md
Normal file
@@ -0,0 +1,394 @@
|
||||
# Brand Intelligence API
|
||||
|
||||
## Endpoint
|
||||
|
||||
```
|
||||
GET /api/analytics/v2/brand/:name/intelligence
|
||||
```
|
||||
|
||||
## Query Parameters
|
||||
|
||||
| Param | Type | Default | Description |
|
||||
|-------|------|---------|-------------|
|
||||
| `window` | `7d\|30d\|90d` | `30d` | Time window for trend calculations |
|
||||
| `state` | string | - | Filter by state code (e.g., `AZ`) |
|
||||
| `category` | string | - | Filter by category (e.g., `Flower`) |
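A request URL for this endpoint can be assembled from the three optional parameters in the table. The helper below is hypothetical; only the path and the parameter names come from this document.

```typescript
interface IntelligenceQuery {
  window?: '7d' | '30d' | '90d';
  state?: string;
  category?: string;
}

// Builds the relative URL, URL-encoding the brand name and filter values.
// Omitted parameters are left off so the server-side defaults apply.
function brandIntelligenceUrl(brand: string, query: IntelligenceQuery = {}): string {
  const params: string[] = [];
  if (query.window) params.push(`window=${query.window}`);
  if (query.state) params.push(`state=${encodeURIComponent(query.state)}`);
  if (query.category) params.push(`category=${encodeURIComponent(query.category)}`);
  const path = `/api/analytics/v2/brand/${encodeURIComponent(brand)}/intelligence`;
  return params.length > 0 ? `${path}?${params.join('&')}` : path;
}

const exampleUrl = brandIntelligenceUrl('Raw Garden', { window: '7d', state: 'AZ' });
```

Encoding the brand name matters because brand names can contain spaces or other reserved characters.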
|
||||
|
||||
## Response Payload Schema
|
||||
|
||||
```typescript
|
||||
interface BrandIntelligenceResult {
|
||||
brand_name: string;
|
||||
window: '7d' | '30d' | '90d';
|
||||
generated_at: string; // ISO timestamp when data was computed
|
||||
|
||||
performance_snapshot: PerformanceSnapshot;
|
||||
alerts: Alerts;
|
||||
sku_performance: SkuPerformance[];
|
||||
retail_footprint: RetailFootprint;
|
||||
competitive_landscape: CompetitiveLandscape;
|
||||
inventory_health: InventoryHealth;
|
||||
promo_performance: PromoPerformance;
|
||||
}
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Section 1: Performance Snapshot
|
||||
|
||||
Summary cards with key brand metrics.
|
||||
|
||||
```typescript
|
||||
interface PerformanceSnapshot {
|
||||
active_skus: number; // Total products in catalog
|
||||
total_revenue_30d: number | null; // Estimated from qty × price
|
||||
total_stores: number; // Active retail partners
|
||||
new_stores_30d: number; // New distribution in window
|
||||
market_share: number | null; // % of category SKUs
|
||||
avg_wholesale_price: number | null;
|
||||
price_position: 'premium' | 'value' | 'competitive';
|
||||
}
|
||||
```
|
||||
|
||||
**UI Label Mapping:**
|
||||
| Field | User-Facing Label | Helper Text |
|
||||
|-------|-------------------|-------------|
|
||||
| `active_skus` | Active Products | X total in catalog |
|
||||
| `total_revenue_30d` | Monthly Revenue | Estimated from sales |
|
||||
| `total_stores` | Retail Distribution | Active retail partners |
|
||||
| `new_stores_30d` | New Opportunities | X new in last 30 days |
|
||||
| `market_share` | Category Position | % of category |
|
||||
| `avg_wholesale_price` | Avg Wholesale | Per unit |
|
||||
| `price_position` | Pricing Tier | Premium/Value/Market Rate |
|
||||
|
||||
---
|
||||
|
||||
## Section 2: Alerts
|
||||
|
||||
Issues requiring attention.
|
||||
|
||||
```typescript
|
||||
interface Alerts {
|
||||
lost_stores_30d_count: number;
|
||||
lost_skus_30d_count: number;
|
||||
competitor_takeover_count: number;
|
||||
avg_oos_duration_days: number | null;
|
||||
avg_reorder_lag_days: number | null;
|
||||
items: AlertItem[];
|
||||
}
|
||||
|
||||
interface AlertItem {
|
||||
type: 'lost_store' | 'delisted_sku' | 'shelf_loss' | 'extended_oos';
|
||||
severity: 'critical' | 'warning';
|
||||
store_name?: string;
|
||||
product_name?: string;
|
||||
competitor_brand?: string;
|
||||
days_since?: number;
|
||||
state_code?: string;
|
||||
}
|
||||
```
|
||||
|
||||
**UI Label Mapping:**
|
||||
| Field | User-Facing Label |
|
||||
|-------|-------------------|
|
||||
| `lost_stores_30d_count` | Accounts at Risk |
|
||||
| `lost_skus_30d_count` | Delisted SKUs |
|
||||
| `competitor_takeover_count` | Shelf Losses |
|
||||
| `avg_oos_duration_days` | Avg Stockout Length |
|
||||
| `avg_reorder_lag_days` | Avg Restock Time |
|
||||
| `severity: critical` | Urgent |
|
||||
| `severity: warning` | Watch |
|
||||
|
||||
---
|
||||
|
||||
## Section 3: SKU Performance (Product Velocity)
|
||||
|
||||
How fast each SKU sells.
|
||||
|
||||
```typescript
|
||||
interface SkuPerformance {
|
||||
store_product_id: number;
|
||||
product_name: string;
|
||||
category: string | null;
|
||||
daily_velocity: number; // Units/day estimate
|
||||
velocity_status: 'hot' | 'steady' | 'slow' | 'stale';
|
||||
retail_price: number | null;
|
||||
on_sale: boolean;
|
||||
stores_carrying: number;
|
||||
stock_status: 'in_stock' | 'low_stock' | 'out_of_stock';
|
||||
}
|
||||
```
|
||||
|
||||
**UI Label Mapping:**
|
||||
| Field | User-Facing Label |
|
||||
|-------|-------------------|
|
||||
| `daily_velocity` | Daily Rate |
|
||||
| `velocity_status` | Momentum |
|
||||
| `velocity_status: hot` | Hot |
|
||||
| `velocity_status: steady` | Steady |
|
||||
| `velocity_status: slow` | Slow |
|
||||
| `velocity_status: stale` | Stale |
|
||||
| `retail_price` | Retail Price |
|
||||
| `on_sale` | Promo (badge) |
|
||||
|
||||
**Velocity Thresholds:**
|
||||
- `hot`: >= 5 units/day
|
||||
- `steady`: >= 1 unit/day
|
||||
- `slow`: >= 0.1 units/day
|
||||
- `stale`: < 0.1 units/day
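The thresholds above map directly onto a small classifier. This sketch mirrors the documented cut-offs; the function name is hypothetical.

```typescript
type VelocityStatus = 'hot' | 'steady' | 'slow' | 'stale';

// Checks thresholds from highest to lowest so each band is half-open:
// hot >= 5, steady >= 1, slow >= 0.1, stale below that.
function velocityStatus(dailyVelocity: number): VelocityStatus {
  if (dailyVelocity >= 5) return 'hot';
  if (dailyVelocity >= 1) return 'steady';
  if (dailyVelocity >= 0.1) return 'slow';
  return 'stale';
}
```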
|
||||
|
||||
---
|
||||
|
||||
## Section 4: Retail Footprint
|
||||
|
||||
Store placement and coverage.
|
||||
|
||||
```typescript
|
||||
interface RetailFootprint {
|
||||
total_stores: number;
|
||||
in_stock_count: number;
|
||||
out_of_stock_count: number;
|
||||
penetration_by_region: RegionPenetration[];
|
||||
whitespace_stores: WhitespaceStore[];
|
||||
}
|
||||
|
||||
interface RegionPenetration {
|
||||
state_code: string;
|
||||
store_count: number;
|
||||
percent_reached: number; // % of state's dispensaries
|
||||
in_stock: number;
|
||||
out_of_stock: number;
|
||||
}
|
||||
|
||||
interface WhitespaceStore {
|
||||
store_id: number;
|
||||
store_name: string;
|
||||
state_code: string;
|
||||
city: string | null;
|
||||
category_fit: number; // How many competing brands they carry
|
||||
competitor_brands: string[];
|
||||
}
|
||||
```
|
||||
|
||||
**UI Label Mapping:**
|
||||
| Field | User-Facing Label |
|
||||
|-------|-------------------|
|
||||
| `penetration_by_region` | Market Coverage by Region |
|
||||
| `percent_reached` | X% reached |
|
||||
| `in_stock` | X stocked |
|
||||
| `out_of_stock` | X out |
|
||||
| `whitespace_stores` | Expansion Opportunities |
|
||||
| `category_fit` | X fit |
|
||||
|
||||
---
|
||||
|
||||
## Section 5: Competitive Landscape
|
||||
|
||||
Market positioning vs competitors.
|
||||
|
||||
```typescript
|
||||
interface CompetitiveLandscape {
|
||||
brand_price_position: 'premium' | 'value' | 'competitive';
|
||||
market_share_trend: MarketSharePoint[];
|
||||
competitors: Competitor[];
|
||||
head_to_head_skus: HeadToHead[];
|
||||
}
|
||||
|
||||
interface MarketSharePoint {
|
||||
date: string;
|
||||
share_percent: number;
|
||||
}
|
||||
|
||||
interface Competitor {
|
||||
brand_name: string;
|
||||
store_overlap_percent: number;
|
||||
price_position: 'premium' | 'value' | 'competitive';
|
||||
avg_price: number | null;
|
||||
sku_count: number;
|
||||
}
|
||||
|
||||
interface HeadToHead {
|
||||
product_name: string;
|
||||
brand_price: number;
|
||||
competitor_brand: string;
|
||||
competitor_price: number;
|
||||
price_diff_percent: number;
|
||||
}
|
||||
```
|
||||
|
||||
**UI Label Mapping:**
|
||||
| Field | User-Facing Label |
|
||||
|-------|-------------------|
|
||||
| `price_position: premium` | Premium Tier |
|
||||
| `price_position: value` | Value Leader |
|
||||
| `price_position: competitive` | Market Rate |
|
||||
| `market_share_trend` | Share of Shelf Trend |
|
||||
| `head_to_head_skus` | Price Comparison |
|
||||
| `store_overlap_percent` | X% store overlap |
|
||||
|
||||
---
|
||||
|
||||
## Section 6: Inventory Health
|
||||
|
||||
Stock projections and risk levels.
|
||||
|
||||
```typescript
|
||||
interface InventoryHealth {
|
||||
critical_count: number; // <7 days stock
|
||||
warning_count: number; // 7-14 days stock
|
||||
healthy_count: number; // 14-90 days stock
|
||||
overstocked_count: number; // >90 days stock
|
||||
skus: InventorySku[];
|
||||
overstock_alert: OverstockItem[];
|
||||
}
|
||||
|
||||
interface InventorySku {
|
||||
store_product_id: number;
|
||||
product_name: string;
|
||||
store_name: string;
|
||||
days_of_stock: number | null;
|
||||
risk_level: 'critical' | 'elevated' | 'moderate' | 'healthy';
|
||||
current_quantity: number | null;
|
||||
daily_sell_rate: number | null;
|
||||
}
|
||||
|
||||
interface OverstockItem {
|
||||
product_name: string;
|
||||
store_name: string;
|
||||
excess_units: number;
|
||||
days_of_stock: number;
|
||||
}
|
||||
```
|
||||
|
||||
**UI Label Mapping:**
|
||||
| Field | User-Facing Label |
|
||||
|-------|-------------------|
|
||||
| `risk_level: critical` | Reorder Now |
|
||||
| `risk_level: elevated` | Low Stock |
|
||||
| `risk_level: moderate` | Monitor |
|
||||
| `risk_level: healthy` | Healthy |
|
||||
| `critical_count` | Urgent (<7 days) |
|
||||
| `warning_count` | Low (7-14 days) |
|
||||
| `overstocked_count` | Excess (>90 days) |
|
||||
| `days_of_stock` | X days remaining |
|
||||
| `overstock_alert` | Overstock Alert |
|
||||
| `excess_units` | X excess units |
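The count fields in `InventoryHealth` are described by day ranges (<7, 7-14, 14-90, >90). A sketch of how a SKU's `days_of_stock` could be bucketed follows. The handling of the exact boundaries (14 counted as healthy, 90 counted as healthy) is an assumption, since the ranges in the schema comments overlap at their edges.

```typescript
type StockBucket = 'critical' | 'warning' | 'healthy' | 'overstocked';

// Returns null when days_of_stock is unknown, so the SKU is not counted
// in any bucket.
function stockBucket(daysOfStock: number | null): StockBucket | null {
  if (daysOfStock === null) return null;
  if (daysOfStock < 7) return 'critical';
  if (daysOfStock < 14) return 'warning';
  if (daysOfStock <= 90) return 'healthy'; // assumed: boundaries 14 and 90 are healthy
  return 'overstocked';
}
```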
|
||||
|
||||
---
|
||||
|
||||
## Section 7: Promotion Effectiveness
|
||||
|
||||
How promotions impact sales.
|
||||
|
||||
```typescript
|
||||
interface PromoPerformance {
|
||||
avg_baseline_velocity: number | null;
|
||||
avg_promo_velocity: number | null;
|
||||
avg_velocity_lift: number | null; // % increase during promo
|
||||
avg_efficiency_score: number | null; // ROI proxy
|
||||
promotions: Promotion[];
|
||||
}
|
||||
|
||||
interface Promotion {
|
||||
product_name: string;
|
||||
store_name: string;
|
||||
status: 'active' | 'scheduled' | 'ended';
|
||||
start_date: string;
|
||||
end_date: string | null;
|
||||
regular_price: number;
|
||||
promo_price: number;
|
||||
discount_percent: number;
|
||||
baseline_velocity: number | null;
|
||||
promo_velocity: number | null;
|
||||
velocity_lift: number | null;
|
||||
efficiency_score: number | null;
|
||||
}
|
||||
```
|
||||
|
||||
**UI Label Mapping:**
|
||||
| Field | User-Facing Label |
|
||||
|-------|-------------------|
|
||||
| `avg_baseline_velocity` | Normal Rate |
|
||||
| `avg_promo_velocity` | During Promos |
|
||||
| `avg_velocity_lift` | Avg Sales Lift |
|
||||
| `avg_efficiency_score` | ROI Score |
|
||||
| `velocity_lift` | Sales Lift |
|
||||
| `efficiency_score` | ROI Score |
|
||||
| `status: active` | Live |
|
||||
| `status: scheduled` | Scheduled |
|
||||
| `status: ended` | Ended |
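The schema describes `avg_velocity_lift` as the percent increase during a promo. One plausible reading, shown here as a sketch, is the percent change of promo velocity over baseline velocity. The formula is an assumption, not the documented computation, and `velocityLift` is a hypothetical helper.

```typescript
// Assumption: lift = (promo - baseline) / baseline * 100.
// Returns null when either input is missing or the baseline is zero.
function velocityLift(baseline: number | null, promo: number | null): number | null {
  if (baseline === null || promo === null || baseline === 0) return null;
  return ((promo - baseline) / baseline) * 100;
}
```

Under this reading, a baseline of 2 units/day and a promo rate of 3 units/day gives a lift of 50.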
|
||||
|
||||
---
|
||||
|
||||
## Example Queries
|
||||
|
||||
### Get full payload
|
||||
```javascript
|
||||
const response = await fetch('/api/analytics/v2/brand/Wyld/intelligence?window=30d');
|
||||
const data = await response.json();
|
||||
```
|
||||
|
||||
### Extract summary cards (flattened)
|
||||
```javascript
|
||||
const { performance_snapshot: ps, alerts } = data;
|
||||
|
||||
const summaryCards = {
|
||||
activeProducts: ps.active_skus,
|
||||
monthlyRevenue: ps.total_revenue_30d,
|
||||
retailDistribution: ps.total_stores,
|
||||
newOpportunities: ps.new_stores_30d,
|
||||
categoryPosition: ps.market_share,
|
||||
avgWholesale: ps.avg_wholesale_price,
|
||||
pricingTier: ps.price_position,
|
||||
accountsAtRisk: alerts.lost_stores_30d_count,
|
||||
delistedSkus: alerts.lost_skus_30d_count,
|
||||
shelfLosses: alerts.competitor_takeover_count,
|
||||
};
|
||||
```
|
||||
|
||||
### Get top 10 fastest-selling SKUs
|
||||
```javascript
|
||||
const topSkus = data.sku_performance
|
||||
.filter(sku => sku.velocity_status === 'hot' || sku.velocity_status === 'steady')
|
||||
.sort((a, b) => b.daily_velocity - a.daily_velocity)
|
||||
.slice(0, 10);
|
||||
```
|
||||
|
||||
### Get critical inventory alerts only
|
||||
```javascript
|
||||
const criticalInventory = data.inventory_health.skus
|
||||
.filter(sku => sku.risk_level === 'critical');
|
||||
```
|
||||
|
||||
### Get states with <50% penetration
|
||||
```javascript
|
||||
const underPenetrated = data.retail_footprint.penetration_by_region
|
||||
.filter(region => region.percent_reached < 50)
|
||||
.sort((a, b) => a.percent_reached - b.percent_reached);
|
||||
```
|
||||
|
||||
### Get active promotions with positive lift
|
||||
```javascript
|
||||
const effectivePromos = data.promo_performance.promotions
|
||||
.filter(p => p.status === 'active' && p.velocity_lift > 0)
|
||||
.sort((a, b) => b.velocity_lift - a.velocity_lift);
|
||||
```
|
||||
|
||||
### Build chart data for market share trend
|
||||
```javascript
|
||||
const chartData = data.competitive_landscape.market_share_trend.map(point => ({
|
||||
x: new Date(point.date),
|
||||
y: point.share_percent,
|
||||
}));
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Notes for Frontend Implementation
|
||||
|
||||
1. **All fields are snake_case** - transform to camelCase if needed
|
||||
2. **Null values are possible** - handle gracefully in UI
|
||||
3. **Arrays may be empty** - show appropriate empty states
|
||||
4. **Timestamps are ISO format** - parse with `new Date()`
|
||||
5. **Percentages are already computed** - no need to multiply by 100
|
||||
6. **The `window` parameter affects trend calculations** - 7d/30d/90d
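For note 1, a frontend that prefers camelCase can convert keys recursively after parsing the response. The sketch below is one way to do it, assuming the payload is plain JSON (objects, arrays, primitives); both helper names are hypothetical.

```typescript
// Converts one snake_case key to camelCase, e.g. total_revenue_30d -> totalRevenue30d.
function snakeToCamel(key: string): string {
  return key.replace(/_([a-z0-9])/g, (_match: string, ch: string) => ch.toUpperCase());
}

// Recursively converts keys of plain objects and arrays; primitives and
// null pass through unchanged.
function camelizeKeys(value: unknown): unknown {
  if (Array.isArray(value)) return value.map((item: unknown) => camelizeKeys(item));
  if (value !== null && typeof value === 'object') {
    const source = value as Record<string, unknown>;
    const result: Record<string, unknown> = {};
    for (const key of Object.keys(source)) {
      result[snakeToCamel(key)] = camelizeKeys(source[key]);
    }
    return result;
  }
  return value;
}
```

Only keys are converted; string values such as `velocity_status: 'hot'` or `price_position: 'premium'` are left as-is.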
|
||||
539
backend/docs/_archive/CRAWL_PIPELINE.md
Normal file
@@ -0,0 +1,539 @@
|
||||
# Crawl Pipeline Documentation
|
||||
|
||||
## Overview
|
||||
|
||||
The crawl pipeline fetches product data from Dutchie dispensary menus and stores it in the canonical database. This document covers the complete flow from task scheduling to data storage.
|
||||
|
||||
---
|
||||
|
||||
## Pipeline Stages
|
||||
|
||||
```
|
||||
┌───────────────────────┐
│ store_discovery       │  Find new dispensaries
└───────────┬───────────┘
            │
            ▼
┌───────────────────────┐
│ entry_point_discovery │  Resolve slug → platform_dispensary_id
└───────────┬───────────┘
            │
            ▼
┌───────────────────────┐
│ product_discovery     │  Initial product crawl
└───────────┬───────────┘
            │
            ▼
┌───────────────────────┐
│ product_resync        │  Recurring crawl (every 4 hours)
└───────────────────────┘
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Stage Details
|
||||
|
||||
### 1. Store Discovery
|
||||
**Purpose:** Find new dispensaries to crawl
|
||||
|
||||
**Handler:** `src/tasks/handlers/store-discovery.ts`
|
||||
|
||||
**Flow:**
|
||||
1. Query Dutchie `ConsumerDispensaries` GraphQL for cities/states
|
||||
2. Extract dispensary info (name, address, menu_url)
|
||||
3. Insert into `dutchie_discovery_locations`
|
||||
4. Queue `entry_point_discovery` for each new location
|
||||
|
||||
---
|
||||
|
||||
### 2. Entry Point Discovery
|
||||
**Purpose:** Resolve menu URL slug to platform_dispensary_id (MongoDB ObjectId)
|
||||
|
||||
**Handler:** `src/tasks/handlers/entry-point-discovery.ts`
|
||||
|
||||
**Flow:**
|
||||
1. Load dispensary from database
|
||||
2. Extract slug from `menu_url`:
|
||||
- `/embedded-menu/<slug>` or `/dispensary/<slug>`
|
||||
3. Start stealth session (fingerprint + proxy)
|
||||
4. Query `resolveDispensaryIdWithDetails(slug)` via GraphQL
|
||||
5. Update dispensary with `platform_dispensary_id`
|
||||
6. Queue `product_discovery` task
|
||||
|
||||
**Example:**
|
||||
```
|
||||
menu_url: https://dutchie.com/embedded-menu/deeply-rooted
|
||||
slug: deeply-rooted
|
||||
platform_dispensary_id: 6405ef617056e8014d79101b
|
||||
```
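Step 2 of the flow (extracting the slug from `menu_url`) can be sketched with a single regular expression covering both URL patterns. This is an illustration, not the handler's actual code; `extractSlug` is a hypothetical name.

```typescript
// Returns the path segment following /embedded-menu/ or /dispensary/,
// or null if the URL matches neither pattern. Query strings and fragments
// are excluded from the slug.
function extractSlug(menuUrl: string): string | null {
  const match = menuUrl.match(/\/(?:embedded-menu|dispensary)\/([^/?#]+)/);
  return match?.[1] ?? null;
}

const exampleSlug = extractSlug('https://dutchie.com/embedded-menu/deeply-rooted');
```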
|
||||
|
||||
---
|
||||
|
||||
### 3. Product Discovery
|
||||
**Purpose:** Initial crawl of a new dispensary
|
||||
|
||||
**Handler:** `src/tasks/handlers/product-discovery.ts`
|
||||
|
||||
Runs the same flow as `product_resync` (described in the next section), but for the first crawl of a newly discovered dispensary.
|
||||
|
||||
---
|
||||
|
||||
### 4. Product Resync
|
||||
**Purpose:** Recurring crawl to capture price/stock changes
|
||||
|
||||
**Handler:** `src/tasks/handlers/product-resync.ts`
|
||||
|
||||
**Flow:**
|
||||
|
||||
#### Step 1: Load Dispensary Info
|
||||
```sql
|
||||
SELECT id, name, platform_dispensary_id, menu_url, state
|
||||
FROM dispensaries
|
||||
WHERE id = $1 AND crawl_enabled = true
|
||||
```
|
||||
|
||||
#### Step 2: Start Stealth Session
|
||||
- Generate random browser fingerprint
|
||||
- Set locale/timezone matching state
|
||||
- Optional proxy rotation
|
||||
|
||||
#### Step 3: Fetch Products via GraphQL
|
||||
**Endpoint:** `https://dutchie.com/api-3/graphql`
|
||||
|
||||
**Variables:**
|
||||
```javascript
|
||||
{
|
||||
includeEnterpriseSpecials: false,
|
||||
productsFilter: {
|
||||
dispensaryId: "<platform_dispensary_id>",
|
||||
pricingType: "rec",
|
||||
Status: "All",
|
||||
types: [],
|
||||
useCache: false,
|
||||
isDefaultSort: true,
|
||||
sortBy: "popularSortIdx",
|
||||
sortDirection: 1,
|
||||
bypassOnlineThresholds: true,
|
||||
isKioskMenu: false,
|
||||
removeProductsBelowOptionThresholds: false
|
||||
},
|
||||
page: 0,
|
||||
perPage: 100
|
||||
}
|
||||
```
|
||||
|
||||
**Key Notes:**
|
||||
- `Status: "All"` returns all products (`Status: "Active"` returns the same count)
|
||||
- `Status: null` returns 0 products (broken)
|
||||
- `pricingType: "rec"` returns BOTH rec and med prices
|
||||
- Paginate until `products.length < perPage` or `allProducts.length >= totalCount`
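The pagination rule in the last note can be sketched as a loop around a page-fetching callback. `fetchPage` is a hypothetical function standing in for the GraphQL request, and `ProductPage` is an assumed shape holding one page of products plus the reported `totalCount`.

```typescript
interface ProductPage<T> {
  products: T[];
  totalCount: number;
}

// True when the stop condition from the notes above is met:
// a short page, or everything reported by totalCount has been collected.
function isLastPage(pageLength: number, perPage: number, collected: number, totalCount: number): boolean {
  return pageLength < perPage || collected >= totalCount;
}

// Fetches pages starting at 0 until isLastPage() says to stop.
async function fetchAllProducts<T>(
  fetchPage: (page: number, perPage: number) => Promise<ProductPage<T>>,
  perPage: number = 100,
): Promise<T[]> {
  const allProducts: T[] = [];
  for (let page = 0; ; page++) {
    const { products, totalCount } = await fetchPage(page, perPage);
    allProducts.push(...products);
    if (isLastPage(products.length, perPage, allProducts.length, totalCount)) break;
  }
  return allProducts;
}
```

Checking both conditions avoids an extra empty request when the total is an exact multiple of `perPage`.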
|
||||
|
||||
#### Step 4: Normalize Data
|
||||
Transform raw Dutchie payload to canonical format via `DutchieNormalizer`.
|
||||
|
||||
#### Step 5: Upsert Products
|
||||
Insert/update `store_products` table with normalized data.
|
||||
|
||||
#### Step 6: Create Snapshots
|
||||
Insert point-in-time record to `store_product_snapshots`.
|
||||
|
||||
#### Step 7: Track Missing Products (OOS Detection)
|
||||
```sql
-- Reset consecutive_misses for products IN the feed
UPDATE store_products
SET consecutive_misses = 0, last_seen_at = NOW()
WHERE dispensary_id = $1
  AND provider = 'dutchie'
  AND provider_product_id = ANY($2);

-- Increment for products NOT in feed
UPDATE store_products
SET consecutive_misses = consecutive_misses + 1
WHERE dispensary_id = $1
  AND provider = 'dutchie'
  AND provider_product_id NOT IN (...)
  AND consecutive_misses < 3;

-- Mark OOS at 3 consecutive misses
UPDATE store_products
SET stock_status = 'oos', is_in_stock = false
WHERE dispensary_id = $1
  AND consecutive_misses >= 3
  AND stock_status != 'oos';
```
|
||||
|
||||
#### Step 8: Download Images
|
||||
For new products, download and store images locally.
|
||||
|
||||
#### Step 9: Update Dispensary
|
||||
```sql
|
||||
UPDATE dispensaries SET last_crawl_at = NOW() WHERE id = $1
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## GraphQL Payload Structure
|
||||
|
||||
### Product Fields (from filteredProducts.products[])
|
||||
|
||||
| Field | Type | Description |
|
||||
|-------|------|-------------|
|
||||
| `_id` / `id` | string | MongoDB ObjectId (24 hex chars) |
|
||||
| `Name` | string | Product display name |
|
||||
| `brandName` | string | Brand name |
|
||||
| `brand.name` | string | Brand name (nested) |
|
||||
| `brand.description` | string | Brand description |
|
||||
| `type` | string | Category (Flower, Edible, Concentrate, etc.) |
|
||||
| `subcategory` | string | Subcategory |
|
||||
| `strainType` | string | Hybrid, Indica, Sativa, N/A |
|
||||
| `Status` | string | Always "Active" in feed |
|
||||
| `Image` | string | Primary image URL |
|
||||
| `images[]` | array | All product images |
|
||||
|
||||
### Pricing Fields
|
||||
|
||||
| Field | Type | Description |
|
||||
|-------|------|-------------|
|
||||
| `Prices[]` | number[] | Rec prices per option |
|
||||
| `recPrices[]` | number[] | Rec prices |
|
||||
| `medicalPrices[]` | number[] | Medical prices |
|
||||
| `recSpecialPrices[]` | number[] | Rec sale prices |
|
||||
| `medicalSpecialPrices[]` | number[] | Medical sale prices |
|
||||
| `Options[]` | string[] | Size options ("1/8oz", "1g", etc.) |
|
||||
| `rawOptions[]` | string[] | Raw weight options ("3.5g") |
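If `Options[]`, `recPrices[]`, and `medicalPrices[]` are parallel arrays aligned by index (an assumption; the table above does not state it outright), pairing each size option with its prices could look like the following sketch. `PricedOption` and `zipOptions` are hypothetical names.

```typescript
interface PricedOption {
  option: string;
  recPrice: number | null;
  medPrice: number | null;
}

// Assumption: the three arrays are aligned by index. A price missing at a
// given index becomes null rather than shifting the remaining entries.
function zipOptions(options: string[], recPrices: number[], medicalPrices: number[]): PricedOption[] {
  return options.map((option: string, i: number) => ({
    option,
    recPrice: recPrices[i] ?? null,
    medPrice: medicalPrices[i] ?? null,
  }));
}
```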
|
||||
|
||||
### Inventory Fields (POSMetaData.children[])
|
||||
|
||||
| Field | Type | Description |
|
||||
|-------|------|-------------|
|
||||
| `quantity` | number | Total inventory count |
|
||||
| `quantityAvailable` | number | Available for online orders |
|
||||
| `kioskQuantityAvailable` | number | Available for kiosk orders |
|
||||
| `option` | string | Which size option this is for |
|
||||
|
||||
### Potency Fields
|
||||
|
||||
| Field | Type | Description |
|
||||
|-------|------|-------------|
|
||||
| `THCContent.range[]` | number[] | THC percentage |
|
||||
| `CBDContent.range[]` | number[] | CBD percentage |
|
||||
| `cannabinoidsV2[]` | array | Detailed cannabinoid breakdown |
|
||||
|
||||
### Specials (specialData.bogoSpecials[])
|
||||
|
||||
| Field | Type | Description |
|
||||
|-------|------|-------------|
|
||||
| `specialName` | string | Deal name |
|
||||
| `specialType` | string | "bogo", "sale", etc. |
|
||||
| `itemsForAPrice.value` | string | Bundle price |
|
||||
| `bogoRewards[].totalQuantity.quantity` | number | Required quantity |
|
||||
|
||||
---
|
||||
|
||||
## OOS Detection Logic
|
||||
|
||||
Products disappear from the Dutchie feed when they go out of stock. We track this via `consecutive_misses`:
|
||||
|
||||
| Scenario | Action |
|
||||
|----------|--------|
|
||||
| Product in feed | `consecutive_misses = 0` |
|
||||
| Product missing 1st time | `consecutive_misses = 1` |
|
||||
| Product missing 2nd time | `consecutive_misses = 2` |
|
||||
| Product missing 3rd time | `consecutive_misses = 3`, mark `stock_status = 'oos'` |
|
||||
| Product returns to feed | `consecutive_misses = 0`, update stock_status |
|
||||
|
||||
**Why 3 misses?**
|
||||
- Protects against false positives from crawl failures
|
||||
- Single bad crawl doesn't trigger mass OOS alerts
|
||||
- Balances detection speed vs accuracy
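The counter behaviour in the table can be simulated in memory. This sketch follows the same rules as the SQL in Step 7 (increment only while below the threshold, mark OOS at 3). It makes one assumption that is labeled in the code: a product that returns to the feed is set to `'in_stock'`, whereas the real pipeline derives `stock_status` from the feed payload.

```typescript
interface TrackedProduct {
  provider_product_id: string;
  consecutive_misses: number;
  stock_status: string;
}

const OOS_THRESHOLD = 3;

// Applies one crawl result to the tracked products, in place.
function applyCrawl(products: TrackedProduct[], idsInFeed: Set<string>): void {
  for (const product of products) {
    if (idsInFeed.has(product.provider_product_id)) {
      product.consecutive_misses = 0;
      product.stock_status = 'in_stock'; // assumed value; really taken from the feed
    } else {
      if (product.consecutive_misses < OOS_THRESHOLD) product.consecutive_misses += 1;
      if (product.consecutive_misses >= OOS_THRESHOLD) product.stock_status = 'oos';
    }
  }
}
```

A single failed crawl only moves the counter to 1, so the product stays in its current status until two more consecutive misses occur.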
|
||||
|
||||
---
|
||||
|
||||
## Database Tables
|
||||
|
||||
### store_products
|
||||
Current state of each product:
|
||||
- `provider_product_id` - Dutchie's MongoDB ObjectId
|
||||
- `name_raw`, `brand_name_raw` - Raw values from feed
|
||||
- `price_rec`, `price_med` - Current prices
|
||||
- `is_in_stock`, `stock_status` - Availability
|
||||
- `consecutive_misses` - OOS detection counter
|
||||
- `last_seen_at` - Last time product was in feed
|
||||
|
||||
### store_product_snapshots
|
||||
Point-in-time records for historical analysis:
|
||||
- One row per product per crawl
|
||||
- Captures price, stock, potency at that moment
|
||||
- Used for price history, analytics
|
||||
|
||||
### dispensaries
|
||||
Store metadata:
|
||||
- `platform_dispensary_id` - MongoDB ObjectId for GraphQL
|
||||
- `menu_url` - Source URL
|
||||
- `last_crawl_at` - Last successful crawl
|
||||
- `crawl_enabled` - Whether to crawl
|
||||
|
||||
---
|
||||
|
||||
## Worker Roles
|
||||
|
||||
Workers pull tasks from the `worker_tasks` queue based on their assigned role.
|
||||
|
||||
| Role | Name | Description | Handler |
|
||||
|------|------|-------------|---------|
|
||||
| `product_resync` | Product Resync | Re-crawl dispensary products for price/stock changes | `handleProductResync` |
|
||||
| `product_discovery` | Product Discovery | Initial product discovery for new dispensaries | `handleProductDiscovery` |
|
||||
| `store_discovery` | Store Discovery | Discover new dispensary locations | `handleStoreDiscovery` |
|
||||
| `entry_point_discovery` | Entry Point Discovery | Resolve platform IDs from menu URLs | `handleEntryPointDiscovery` |
|
||||
| `analytics_refresh` | Analytics Refresh | Refresh materialized views and analytics | `handleAnalyticsRefresh` |
|
||||
|
||||
**API Endpoint:** `GET /api/worker-registry/roles`
|
||||
|
||||
---
|
||||
|
||||
## Scheduling
|
||||
|
||||
Crawls are scheduled via `worker_tasks` table:
|
||||
|
||||
| Role | Frequency | Description |
|
||||
|------|-----------|-------------|
|
||||
| `product_resync` | Every 4 hours | Regular product refresh |
|
||||
| `product_discovery` | On-demand | First crawl for new stores |
|
||||
| `entry_point_discovery` | On-demand | New store setup |
|
||||
| `store_discovery` | Daily | Find new stores |
|
||||
| `analytics_refresh` | Daily | Refresh analytics materialized views |
|
||||
|
||||
---
|
||||
|
||||
## Priority & On-Demand Tasks
|
||||
|
||||
Tasks are claimed by workers in order of **priority DESC, created_at ASC**.
|
||||
|
||||
### Priority Levels
|
||||
|
||||
| Priority | Use Case | Example |
|
||||
|----------|----------|---------|
|
||||
| 0 | Scheduled/batch tasks | Scheduled product_resync generation (every 4 hours) |
|
||||
| 10 | On-demand/chained tasks | entry_point → product_discovery |
|
||||
| Higher | Urgent/manual triggers | Admin-triggered immediate crawl |
|
||||
|
||||
### Task Chaining
|
||||
|
||||
When a task completes, the system automatically creates follow-up tasks:
|
||||
|
||||
```
|
||||
store_discovery (completed)
|
||||
└─► entry_point_discovery (priority: 10) for each new store
|
||||
|
||||
entry_point_discovery (completed, success)
|
||||
└─► product_discovery (priority: 10) for that store
|
||||
|
||||
product_discovery (completed)
|
||||
└─► [no chain] Store enters regular resync schedule
|
||||
```
|
||||
|
||||
### On-Demand Task Creation
|
||||
|
||||
Use the task service to create high-priority tasks:
|
||||
|
||||
```typescript
|
||||
// Create immediate product resync for a store
|
||||
await taskService.createTask({
|
||||
role: 'product_resync',
|
||||
dispensary_id: 123,
|
||||
platform: 'dutchie',
|
||||
priority: 20, // Higher than batch tasks
|
||||
});
|
||||
|
||||
// Convenience methods with default high priority (10)
|
||||
await taskService.createEntryPointTask(dispensaryId, 'dutchie');
|
||||
await taskService.createProductDiscoveryTask(dispensaryId, 'dutchie');
|
||||
await taskService.createStoreDiscoveryTask('dutchie', 'AZ');
|
||||
```
|
||||
|
||||
### Claim Function
|
||||
|
||||
The `claim_task()` SQL function atomically claims tasks:
|
||||
- Respects priority ordering (higher = first)
|
||||
- Uses `FOR UPDATE SKIP LOCKED` for concurrency
|
||||
- Prevents multiple active tasks per store
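The ordering and the per-store guard can be illustrated in memory. This is a minimal TypeScript sketch, not the actual `claim_task()` SQL: the `QueuedTask` shape and the `pickNextTask` helper are hypothetical names introduced for illustration, and it assumes that `claimed` and `running` are the statuses that count as "active". Row locking (`FOR UPDATE SKIP LOCKED`) has no in-memory equivalent and is left out.

```typescript
// Hypothetical in-memory model of the claim ordering; the real logic lives in SQL.
type TaskStatus = 'pending' | 'claimed' | 'running' | 'completed' | 'failed';

interface QueuedTask {
  id: number;
  dispensaryId: number | null;
  priority: number;
  createdAt: number; // epoch milliseconds
  status: TaskStatus;
}

function pickNextTask(tasks: QueuedTask[]): QueuedTask | undefined {
  // Collect stores that already have a claimed or running task.
  const busyStores = new Set<number>();
  for (const t of tasks) {
    if ((t.status === 'claimed' || t.status === 'running') && t.dispensaryId !== null) {
      busyStores.add(t.dispensaryId);
    }
  }
  return tasks
    .filter((t) => t.status === 'pending')
    .filter((t) => t.dispensaryId === null || !busyStores.has(t.dispensaryId))
    // priority DESC, then created_at ASC
    .sort((a, b) => b.priority - a.priority || a.createdAt - b.createdAt)[0];
}
```

A priority-20 task is picked before any priority-10 task, and among equal priorities the oldest task wins.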
|
||||
|
||||
---
|
||||
|
||||
## Image Storage
|
||||
|
||||
Images are downloaded from Dutchie's AWS S3, stored on the local filesystem, and resized on demand.
|
||||
|
||||
### Storage Path
|
||||
```
|
||||
/storage/images/products/<state>/<store>/<brand>/<product_id>/image-<hash>.webp
|
||||
/storage/images/brands/<brand>/logo-<hash>.webp
|
||||
```
|
||||
|
||||
**Example:**
|
||||
```
|
||||
/storage/images/products/az/az-deeply-rooted/bud-bros/6913e3cd444eac3935e928b9/image-ae38b1f9.webp
|
||||
```
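The product path template can be expressed as a small helper. This is a sketch only: `ProductImageRef` and `buildProductImagePath` are hypothetical names, and how the slugs and the short hash are derived is not specified in this document, so they are taken as inputs.

```typescript
// Hypothetical helper that fills in the product image path template.
interface ProductImageRef {
  state: string;     // e.g. 'az'
  store: string;     // store slug, e.g. 'az-deeply-rooted'
  brand: string;     // brand slug, e.g. 'bud-bros'
  productId: string; // platform product ID
  hash: string;      // short hash; its derivation is not specified here
}

function buildProductImagePath(ref: ProductImageRef): string {
  return `/storage/images/products/${ref.state}/${ref.store}/${ref.brand}/${ref.productId}/image-${ref.hash}.webp`;
}
```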
|
||||
|
||||
### Image Proxy API
|
||||
Served via `/img/*` with on-demand resizing using **sharp**:
|
||||
|
||||
```
|
||||
GET /img/products/az/az-deeply-rooted/bud-bros/6913e3cd444eac3935e928b9/image-ae38b1f9.webp?w=200
|
||||
```
|
||||
|
||||
| Param | Description |
|
||||
|-------|-------------|
|
||||
| `w` | Width in pixels (max 4000) |
|
||||
| `h` | Height in pixels (max 4000) |
|
||||
| `q` | Quality 1-100 (default 80) |
|
||||
| `fit` | cover, contain, fill, inside, outside |
|
||||
| `blur` | Blur sigma (0.3-1000) |
|
||||
| `gray` | Grayscale (1 = enabled) |
|
||||
| `format` | webp, jpeg, png, avif (default webp) |
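One way to validate these parameters before handing them to the resizer is sketched below. It is not the code in `src/routes/image-proxy.ts`: `parseImageParams` and `clampNumber` are hypothetical names, and clamping out-of-range values (rather than rejecting the request) is an assumption. The limits and defaults come from the table.

```typescript
// Hypothetical parser for the /img query parameters.
type Fit = 'cover' | 'contain' | 'fill' | 'inside' | 'outside';
type Format = 'webp' | 'jpeg' | 'png' | 'avif';

interface ImageParams {
  width: number | undefined;
  height: number | undefined;
  quality: number;
  fit: Fit | undefined;
  blur: number | undefined;
  grayscale: boolean;
  format: Format;
}

const FITS: readonly Fit[] = ['cover', 'contain', 'fill', 'inside', 'outside'];
const FORMATS: readonly Format[] = ['webp', 'jpeg', 'png', 'avif'];

// Parse a numeric parameter and clamp it to [min, max]; undefined if absent or not numeric.
function clampNumber(raw: string | undefined, min: number, max: number): number | undefined {
  if (raw === undefined) return undefined;
  const n = Number(raw);
  if (!Number.isFinite(n)) return undefined;
  return Math.min(max, Math.max(min, n));
}

function parseImageParams(query: Record<string, string | undefined>): ImageParams {
  const quality = clampNumber(query.q, 1, 100);
  const format = FORMATS.find((f) => f === query.format);
  return {
    width: clampNumber(query.w, 1, 4000),
    height: clampNumber(query.h, 1, 4000),
    quality: quality === undefined ? 80 : quality,    // default 80
    fit: FITS.find((f) => f === query.fit),           // unknown values are ignored
    blur: clampNumber(query.blur, 0.3, 1000),
    grayscale: query.gray === '1',
    format: format === undefined ? 'webp' : format,   // default webp
  };
}
```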
|
||||
|
||||
### Key Files
|
||||
| File | Purpose |
|
||||
|------|---------|
|
||||
| `src/utils/image-storage.ts` | Download & save images to local filesystem |
|
||||
| `src/routes/image-proxy.ts` | On-demand resize/transform at `/img/*` |
|
||||
|
||||
### Download Rules
|
||||
|
||||
| Scenario | Image Action |
|
||||
|----------|--------------|
|
||||
| **New product (first crawl)** | Download if `primaryImageUrl` exists |
|
||||
| **Existing product (refresh)** | Download only if `local_image_path` is NULL (backfill) |
|
||||
| **Product already has local image** | Skip download entirely |
|
||||
|
||||
**Logic:**
|
||||
- Images are downloaded **once** and never re-downloaded on subsequent crawls
|
||||
- `skipIfExists: true` - filesystem check prevents re-download even if queued
|
||||
- First crawl: all products get images
|
||||
- Refresh crawl: only new products or products missing local images
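The three scenarios in the table reduce to a single check, sketched here under the assumption that a new product simply has no `local_image_path` yet. `ImageCandidate` and `shouldDownloadImage` are hypothetical names, not identifiers from `src/utils/image-storage.ts`.

```typescript
// Hypothetical decision helper for the download rules.
interface ImageCandidate {
  primaryImageUrl: string | null; // remote URL from the crawl payload
  localImagePath: string | null;  // value of local_image_path, if any
}

function shouldDownloadImage(product: ImageCandidate): boolean {
  // A product that already has a local image is never re-downloaded.
  if (product.localImagePath !== null) return false;
  // New products and backfills both need a source URL to download from.
  return product.primaryImageUrl !== null && product.primaryImageUrl !== '';
}
```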
|
||||
|
||||
### Storage Rules
|
||||
- **NO MinIO** - local filesystem only (`STORAGE_DRIVER=local`)
|
||||
- Store full resolution, resize on-demand via `/img` proxy
|
||||
- Convert to webp for consistency using **sharp**
|
||||
- Preserve original Dutchie URL as fallback in `image_url` column
|
||||
- Local path stored in `local_image_path` column
|
||||
|
||||
---
|
||||
|
||||
## Stealth & Anti-Detection
|
||||
|
||||
**PROXIES ARE REQUIRED** - Workers will fail to start if no active proxies are available in the database. All HTTP requests to Dutchie go through a proxy.
|
||||
|
||||
Workers automatically initialize anti-detection systems on startup.
|
||||
|
||||
### Components
|
||||
|
||||
| Component | Purpose | Source |
|
||||
|-----------|---------|--------|
|
||||
| **CrawlRotator** | Coordinates proxy + UA rotation | `src/services/crawl-rotator.ts` |
|
||||
| **ProxyRotator** | Round-robin proxy selection, health tracking | `src/services/crawl-rotator.ts` |
|
||||
| **UserAgentRotator** | Cycles through realistic browser fingerprints | `src/services/crawl-rotator.ts` |
|
||||
| **Dutchie Client** | Curl-based HTTP with auto-retry on 403 | `src/platforms/dutchie/client.ts` |
|
||||
|
||||
### Initialization Flow
|
||||
|
||||
```
|
||||
Worker Start
|
||||
│
|
||||
├─► initializeStealth()
|
||||
│ │
|
||||
│ ├─► CrawlRotator.initialize()
|
||||
│ │ └─► Load proxies from `proxies` table
|
||||
│ │
|
||||
│ └─► setCrawlRotator(rotator)
|
||||
│ └─► Wire to Dutchie client
|
||||
│
|
||||
└─► Process tasks...
|
||||
```
|
||||
|
||||
### Stealth Session (per task)
|
||||
|
||||
Each crawl task starts a stealth session:
|
||||
|
||||
```typescript
|
||||
// In product-refresh.ts, entry-point-discovery.ts
|
||||
const session = startSession(dispensary.state || 'AZ', 'America/Phoenix');
|
||||
```
|
||||
|
||||
This creates a new identity with:
|
||||
- **Random fingerprint:** Chrome/Firefox/Safari/Edge on Win/Mac/Linux
|
||||
- **Accept-Language:** Matches timezone (e.g., `America/Phoenix` → `en-US,en;q=0.9`)
|
||||
- **sec-ch-ua headers:** Proper Client Hints for the browser profile
|
||||
|
||||
### On 403 Block
|
||||
|
||||
When Dutchie returns 403, the client automatically:
|
||||
|
||||
1. Records failure on current proxy (increments `failure_count`)
|
||||
2. If proxy has 5+ failures, deactivates it
|
||||
3. Rotates to next healthy proxy
|
||||
4. Rotates fingerprint
|
||||
5. Retries the request
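Steps 1 to 3 can be sketched with an in-memory pool. This is an illustration, not the `ProxyRotator` in `src/services/crawl-rotator.ts`: `ProxyPool`, `ProxyState`, and `handleBlock` are hypothetical names, and the real implementation persists `failure_count` and `is_active` in the `proxies` table. Fingerprint rotation and the retry itself (steps 4 and 5) are left to the caller.

```typescript
// Hypothetical in-memory stand-in for the proxy pool.
interface ProxyState {
  id: number;
  isActive: boolean;
  failureCount: number;
}

const MAX_PROXY_FAILURES = 5;

class ProxyPool {
  private index = 0;
  private readonly proxies: ProxyState[];

  constructor(proxies: ProxyState[]) {
    this.proxies = proxies;
  }

  current(): ProxyState {
    return this.proxies[this.index];
  }

  // Record the failure, deactivate at the threshold, then move to the next active proxy.
  handleBlock(): ProxyState {
    const blocked = this.current();
    blocked.failureCount += 1;
    if (blocked.failureCount >= MAX_PROXY_FAILURES) blocked.isActive = false;
    // Round-robin search; the last step wraps back to the current proxy if it is still active.
    for (let step = 1; step <= this.proxies.length; step++) {
      const candidate = (this.index + step) % this.proxies.length;
      if (this.proxies[candidate].isActive) {
        this.index = candidate;
        return this.proxies[candidate];
      }
    }
    throw new Error('No active proxies available');
  }
}
```

Throwing when no proxy is left mirrors the rule that workers do not run without active proxies.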
|
||||
|
||||
### Proxy Table Schema
|
||||
|
||||
```sql
|
||||
CREATE TABLE proxies (
|
||||
id SERIAL PRIMARY KEY,
|
||||
host VARCHAR(255) NOT NULL,
|
||||
port INTEGER NOT NULL,
|
||||
username VARCHAR(100),
|
||||
password VARCHAR(100),
|
||||
protocol VARCHAR(10) DEFAULT 'http', -- http, https, socks5
|
||||
is_active BOOLEAN DEFAULT true,
|
||||
last_used_at TIMESTAMPTZ,
|
||||
failure_count INTEGER DEFAULT 0,
|
||||
success_count INTEGER DEFAULT 0,
|
||||
avg_response_time_ms INTEGER,
|
||||
last_failure_at TIMESTAMPTZ,
|
||||
last_error TEXT
|
||||
);
|
||||
```
|
||||
|
||||
### Configuration
|
||||
|
||||
Proxies are mandatory. There is no environment variable to disable them. Workers will refuse to start without active proxies in the database.
|
||||
|
||||
### User-Agent Generation
|
||||
|
||||
See `workflow-12102025.md` for full specification.
|
||||
|
||||
**Summary:**
|
||||
- Uses `intoli/user-agents` library (daily-updated market share data)
|
||||
- Device distribution: Mobile 62%, Desktop 36%, Tablet 2%
|
||||
- Browser whitelist: Chrome, Safari, Edge, Firefox only
|
||||
- UA sticks until IP rotates (403 or manual rotation)
|
||||
- Failure = alert admin + stop crawl (no fallback)
|
||||
|
||||
Each fingerprint includes proper `sec-ch-ua`, `sec-ch-ua-platform`, and `sec-ch-ua-mobile` headers.
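The device split can be pictured as a weighted pick. The sketch below does not call the `intoli/user-agents` library, whose filtering API is not described in this document; it only shows the arithmetic behind a 62/36/2 distribution. `pickDeviceCategory` and `DEVICE_WEIGHTS` are hypothetical names.

```typescript
// Hypothetical weighted selection of a device category.
type DeviceCategory = 'mobile' | 'desktop' | 'tablet';

const DEVICE_WEIGHTS: ReadonlyArray<[DeviceCategory, number]> = [
  ['mobile', 0.62],
  ['desktop', 0.36],
  ['tablet', 0.02],
];

// `roll` is a number in [0, 1); pass Math.random() in real use.
function pickDeviceCategory(roll: number): DeviceCategory {
  let cumulative = 0;
  for (const [category, weight] of DEVICE_WEIGHTS) {
    cumulative += weight;
    if (roll < cumulative) return category;
  }
  return 'tablet'; // guards against floating-point rounding at the top of the range
}
```

Taking the random roll as a parameter keeps the function deterministic and easy to test.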
|
||||
|
||||
---
|
||||
|
||||
## Error Handling
|
||||
|
||||
- **GraphQL errors:** Logged, task marked failed, retried later
|
||||
- **Normalization errors:** Logged as warnings, continue with valid products
|
||||
- **Image download errors:** Non-fatal, logged, continue
|
||||
- **Database errors:** Task fails, will be retried
|
||||
- **403 blocks:** Auto-rotate proxy + fingerprint, retry (up to 3 retries)
|
||||
|
||||
---
|
||||
|
||||
## Files
|
||||
|
||||
| File | Purpose |
|
||||
|------|---------|
|
||||
| `src/tasks/handlers/product-resync.ts` | Main crawl handler |
|
||||
| `src/tasks/handlers/entry-point-discovery.ts` | Slug → ID resolution |
|
||||
| `src/platforms/dutchie/index.ts` | GraphQL client, session management |
|
||||
| `src/hydration/normalizers/dutchie.ts` | Payload normalization |
|
||||
| `src/hydration/canonical-upsert.ts` | Database upsert logic |
|
||||
| `src/utils/image-storage.ts` | Image download and local storage |
|
||||
| `src/routes/image-proxy.ts` | On-demand image resizing |
|
||||
| `migrations/075_consecutive_misses.sql` | OOS tracking column |
|
||||
297
backend/docs/_archive/ORGANIC_SCRAPING_GUIDE.md
Normal file
@@ -0,0 +1,297 @@
|
||||
# Organic Browser-Based Scraping Guide
|
||||
|
||||
**Last Updated:** 2025-12-12
|
||||
**Status:** Production-ready proof of concept
|
||||
|
||||
---
|
||||
|
||||
## Overview
|
||||
|
||||
This document describes the "organic" browser-based approach to scraping Dutchie dispensary menus. Unlike direct curl/axios requests, this method uses a real browser session to make API calls, making requests appear natural and reducing detection risk.
|
||||
|
||||
---
|
||||
|
||||
## Why Organic Scraping?
|
||||
|
||||
| Approach | Detection Risk | Speed | Complexity |
|
||||
|----------|---------------|-------|------------|
|
||||
| Direct curl | Higher | Fast | Low |
|
||||
| curl-impersonate | Medium | Fast | Medium |
|
||||
| **Browser-based (organic)** | **Lowest** | Slower | Higher |
|
||||
|
||||
Direct curl requests can be fingerprinted via:
|
||||
- TLS fingerprint (cipher suites, extensions)
|
||||
- Header order and values
|
||||
- Missing cookies/session data
|
||||
- Request patterns
|
||||
|
||||
Browser-based requests inherit:
|
||||
- Real Chrome TLS fingerprint
|
||||
- Session cookies from page visit
|
||||
- Natural header order
|
||||
- JavaScript execution environment
|
||||
|
||||
---
|
||||
|
||||
## Implementation
|
||||
|
||||
### Dependencies
|
||||
|
||||
```bash
|
||||
npm install puppeteer puppeteer-extra puppeteer-extra-plugin-stealth
|
||||
```
|
||||
|
||||
### Core Script: `test-intercept.js`
|
||||
|
||||
Located at: `backend/test-intercept.js`
|
||||
|
||||
```javascript
|
||||
const puppeteer = require('puppeteer-extra');
|
||||
const StealthPlugin = require('puppeteer-extra-plugin-stealth');
|
||||
const fs = require('fs');
|
||||
|
||||
puppeteer.use(StealthPlugin());
|
||||
|
||||
async function capturePayload(config) {
|
||||
const { dispensaryId, platformId, cName, outputPath } = config;
|
||||
|
||||
const browser = await puppeteer.launch({
|
||||
headless: 'new',
|
||||
args: ['--no-sandbox', '--disable-setuid-sandbox']
|
||||
});
|
||||
|
||||
const page = await browser.newPage();
|
||||
|
||||
// STEP 1: Establish session by visiting the menu
|
||||
const embedUrl = `https://dutchie.com/embedded-menu/${cName}?menuType=rec`;
|
||||
await page.goto(embedUrl, { waitUntil: 'networkidle2', timeout: 60000 });
|
||||
|
||||
// STEP 2: Fetch ALL products using GraphQL from browser context
|
||||
const result = await page.evaluate(async (platformId) => {
|
||||
const allProducts = [];
|
||||
let pageNum = 0;
|
||||
const perPage = 100;
|
||||
let totalCount = 0;
|
||||
const sessionId = 'browser-session-' + Date.now();
|
||||
|
||||
while (pageNum < 30) {
|
||||
const variables = {
|
||||
includeEnterpriseSpecials: false,
|
||||
productsFilter: {
|
||||
dispensaryId: platformId,
|
||||
pricingType: 'rec',
|
||||
Status: 'Active', // CRITICAL: Must be 'Active', not null
|
||||
types: [],
|
||||
useCache: true,
|
||||
isDefaultSort: true,
|
||||
sortBy: 'popularSortIdx',
|
||||
sortDirection: 1,
|
||||
bypassOnlineThresholds: true,
|
||||
isKioskMenu: false,
|
||||
removeProductsBelowOptionThresholds: false,
|
||||
},
|
||||
page: pageNum,
|
||||
perPage: perPage,
|
||||
};
|
||||
|
||||
const extensions = {
|
||||
persistedQuery: {
|
||||
version: 1,
|
||||
sha256Hash: 'ee29c060826dc41c527e470e9ae502c9b2c169720faa0a9f5d25e1b9a530a4a0'
|
||||
}
|
||||
};
|
||||
|
||||
const qs = new URLSearchParams({
|
||||
operationName: 'FilteredProducts',
|
||||
variables: JSON.stringify(variables),
|
||||
extensions: JSON.stringify(extensions)
|
||||
});
|
||||
|
||||
const response = await fetch(`https://dutchie.com/api-3/graphql?${qs}`, {
|
||||
method: 'GET',
|
||||
headers: {
|
||||
'Accept': 'application/json',
|
||||
'content-type': 'application/json',
|
||||
'x-dutchie-session': sessionId,
|
||||
'apollographql-client-name': 'Marketplace (production)',
|
||||
},
|
||||
credentials: 'include'
|
||||
});
|
||||
|
||||
const json = await response.json();
|
||||
const data = json?.data?.filteredProducts;
|
||||
if (!data?.products) break;
|
||||
|
||||
allProducts.push(...data.products);
|
||||
if (pageNum === 0) totalCount = data.queryInfo?.totalCount || 0;
|
||||
if (allProducts.length >= totalCount) break;
|
||||
|
||||
pageNum++;
|
||||
await new Promise(r => setTimeout(r, 200)); // Polite delay
|
||||
}
|
||||
|
||||
return { products: allProducts, totalCount };
|
||||
}, platformId);
|
||||
|
||||
await browser.close();
|
||||
|
||||
// STEP 3: Save payload
|
||||
const payload = {
|
||||
dispensaryId,
|
||||
platformId,
|
||||
cName,
|
||||
fetchedAt: new Date().toISOString(),
|
||||
productCount: result.products.length,
|
||||
products: result.products,
|
||||
};
|
||||
|
||||
fs.writeFileSync(outputPath, JSON.stringify(payload, null, 2));
|
||||
return payload;
|
||||
}
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Critical Parameters
|
||||
|
||||
### GraphQL Hash (FilteredProducts)
|
||||
|
||||
```
|
||||
ee29c060826dc41c527e470e9ae502c9b2c169720faa0a9f5d25e1b9a530a4a0
|
||||
```
|
||||
|
||||
**WARNING:** Using the wrong hash returns HTTP 400.
|
||||
|
||||
### Status Parameter
|
||||
|
||||
| Value | Result |
|
||||
|-------|--------|
|
||||
| `'Active'` | Returns in-stock products (1019 in test) |
|
||||
| `null` | Returns 0 products |
|
||||
| `'All'` | Returns HTTP 400 |
|
||||
|
||||
**ALWAYS use `Status: 'Active'`**
|
||||
|
||||
### Required Headers
|
||||
|
||||
```javascript
|
||||
{
|
||||
'Accept': 'application/json',
|
||||
'content-type': 'application/json',
|
||||
'x-dutchie-session': 'unique-session-id',
|
||||
'apollographql-client-name': 'Marketplace (production)',
|
||||
}
|
||||
```
|
||||
|
||||
### Endpoint
|
||||
|
||||
```
|
||||
https://dutchie.com/api-3/graphql
|
||||
```
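Putting the hash, the `Status` value, and the endpoint together, a persisted-query GET URL can be assembled as sketched below. `buildFilteredProductsUrl` is a hypothetical helper, and the `productsFilter` here is abbreviated to the fields discussed in this section; the full script sends additional filter fields.

```typescript
// Hypothetical URL builder for the FilteredProducts persisted query.
const DUTCHIE_GRAPHQL_ENDPOINT = 'https://dutchie.com/api-3/graphql';
const FILTERED_PRODUCTS_HASH = 'ee29c060826dc41c527e470e9ae502c9b2c169720faa0a9f5d25e1b9a530a4a0';

function buildFilteredProductsUrl(platformId: string, page: number, perPage: number): string {
  const variables = {
    includeEnterpriseSpecials: false,
    // Abbreviated filter: Status must be the string 'Active'.
    productsFilter: { dispensaryId: platformId, pricingType: 'rec', Status: 'Active' },
    page,
    perPage,
  };
  const extensions = { persistedQuery: { version: 1, sha256Hash: FILTERED_PRODUCTS_HASH } };
  const params: Array<[string, string]> = [
    ['operationName', 'FilteredProducts'],
    ['variables', JSON.stringify(variables)],
    ['extensions', JSON.stringify(extensions)],
  ];
  const query = params.map(([key, value]) => `${key}=${encodeURIComponent(value)}`).join('&');
  return `${DUTCHIE_GRAPHQL_ENDPOINT}?${query}`;
}
```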
|
||||
|
||||
---
|
||||
|
||||
## Performance Benchmarks
|
||||
|
||||
Test store: AZ-Deeply-Rooted (1019 products)
|
||||
|
||||
| Metric | Value |
|
||||
|--------|-------|
|
||||
| Total products | 1019 |
|
||||
| Time | 18.5 seconds |
|
||||
| Payload size | 11.8 MB |
|
||||
| Pages fetched | 11 (100 per page) |
|
||||
| Success rate | 100% |
|
||||
|
||||
---
|
||||
|
||||
## Payload Format
|
||||
|
||||
The output matches the existing `payload-fetch.ts` handler format:
|
||||
|
||||
```json
|
||||
{
|
||||
"dispensaryId": 123,
|
||||
"platformId": "6405ef617056e8014d79101b",
|
||||
"cName": "AZ-Deeply-Rooted",
|
||||
"fetchedAt": "2025-12-12T05:05:19.837Z",
|
||||
"productCount": 1019,
|
||||
"products": [
|
||||
{
|
||||
"id": "6927508db4851262f629a869",
|
||||
"Name": "Product Name",
|
||||
"brand": { "name": "Brand Name", ... },
|
||||
"type": "Flower",
|
||||
"THC": "25%",
|
||||
"Prices": [...],
|
||||
"Options": [...],
|
||||
...
|
||||
}
|
||||
]
|
||||
}
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Integration Points
|
||||
|
||||
### As a Task Handler
|
||||
|
||||
The organic approach can be integrated as an alternative to curl-based fetching:
|
||||
|
||||
```typescript
|
||||
// In src/tasks/handlers/organic-payload-fetch.ts
|
||||
export async function handleOrganicPayloadFetch(ctx: TaskContext): Promise<TaskResult> {
|
||||
// Use puppeteer-based capture
|
||||
// Save to same payload storage
|
||||
// Queue product_refresh task
|
||||
}
|
||||
```
|
||||
|
||||
### Worker Configuration
|
||||
|
||||
Add to job_schedules:
|
||||
```sql
|
||||
INSERT INTO job_schedules (name, role, cron_expression)
|
||||
VALUES ('organic_product_crawl', 'organic_payload_fetch', '0 */6 * * *');
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Troubleshooting
|
||||
|
||||
### HTTP 400 Bad Request
|
||||
- Check hash is correct: `ee29c060...`
|
||||
- Verify Status is `'Active'` (`'All'` returns HTTP 400)
|
||||
|
||||
### 0 Products Returned
|
||||
- Status was likely `null`; use `'Active'`
|
||||
- Check platformId is valid MongoDB ObjectId
|
||||
|
||||
### Session Not Established
|
||||
- Increase timeout on initial page.goto()
|
||||
- Check cName is valid (matches embedded-menu URL)
|
||||
|
||||
### Detection/Blocking
|
||||
- StealthPlugin should handle most cases
|
||||
- Add random delays between pages
|
||||
- Use headless: 'new' (not true/false)
|
||||
|
||||
---
|
||||
|
||||
## Files Reference
|
||||
|
||||
| File | Purpose |
|
||||
|------|---------|
|
||||
| `backend/test-intercept.js` | Proof of concept script |
|
||||
| `backend/src/platforms/dutchie/client.ts` | GraphQL hashes, curl implementation |
|
||||
| `backend/src/tasks/handlers/payload-fetch.ts` | Current curl-based handler |
|
||||
| `backend/src/utils/payload-storage.ts` | Payload save/load utilities |
|
||||
|
||||
---
|
||||
|
||||
## See Also
|
||||
|
||||
- `DUTCHIE_CRAWL_WORKFLOW.md` - Full crawl pipeline documentation
|
||||
- `TASK_WORKFLOW_2024-12-10.md` - Task system architecture
|
||||
- `CLAUDE.md` - Project rules and constraints
|
||||
25
backend/docs/_archive/README.md
Normal file
@@ -0,0 +1,25 @@
|
||||
# ARCHIVED DOCUMENTATION
|
||||
|
||||
**WARNING: These docs may be outdated or inaccurate.**
|
||||
|
||||
The code has evolved significantly. These docs are kept for historical reference only.
|
||||
|
||||
## What to Use Instead
|
||||
|
||||
**The single source of truth is:**
|
||||
- `CLAUDE.md` (root) - Essential rules and quick reference
|
||||
- `docs/CODEBASE_MAP.md` - Current file/directory reference
|
||||
|
||||
## Why Archive?
|
||||
|
||||
These docs were written during development iterations and may reference:
|
||||
- Old file paths that no longer exist
|
||||
- Deprecated approaches (hydration, scraper-v2)
|
||||
- APIs that have changed
|
||||
- Database schemas that evolved
|
||||
|
||||
## If You Need Details
|
||||
|
||||
1. First check CODEBASE_MAP.md for current file locations
|
||||
2. Then read the actual source code
|
||||
3. Only use archive docs as a last resort for historical context
|
||||
584
backend/docs/_archive/TASK_WORKFLOW_2024-12-10.md
Normal file
@@ -0,0 +1,584 @@
|
||||
# Task Workflow Documentation
|
||||
**Date: 2024-12-10**
|
||||
|
||||
This document describes the complete task/job processing architecture after the 2024-12-10 rewrite.
|
||||
|
||||
---
|
||||
|
||||
## Complete Architecture
|
||||
|
||||
```
|
||||
┌─────────────────────────────────────────────────────────────────────────────────┐
|
||||
│ KUBERNETES CLUSTER │
|
||||
├─────────────────────────────────────────────────────────────────────────────────┤
|
||||
│ │
|
||||
│ ┌─────────────────────────────────────────────────────────────────────────┐ │
|
||||
│ │ API SERVER POD (scraper) │ │
|
||||
│ │ │ │
|
||||
│ │ ┌──────────────────┐ ┌────────────────────────────────────────┐ │ │
|
||||
│ │ │ Express API │ │ TaskScheduler │ │ │
|
||||
│ │ │ │ │ (src/services/task-scheduler.ts) │ │ │
|
||||
│ │ │ /api/job-queue │ │ │ │ │
|
||||
│ │ │ /api/tasks │ │ • Polls every 60s │ │ │
|
||||
│ │ │ /api/schedules │ │ • Checks task_schedules table │ │ │
|
||||
│ │ └────────┬─────────┘ │ • SELECT FOR UPDATE SKIP LOCKED │ │ │
|
||||
│ │ │ │ • Generates tasks when due │ │ │
|
||||
│ │ │ └──────────────────┬─────────────────────┘ │ │
|
||||
│ │ │ │ │ │
|
||||
│ └────────────┼──────────────────────────────────┼──────────────────────────┘ │
|
||||
│ │ │ │
|
||||
│ │ ┌────────────────────────┘ │
|
||||
│ │ │ │
|
||||
│ ▼ ▼ │
|
||||
│ ┌─────────────────────────────────────────────────────────────────────────┐ │
|
||||
│ │ POSTGRESQL DATABASE │ │
|
||||
│ │ │ │
|
||||
│ │ ┌─────────────────────┐ ┌─────────────────────┐ │ │
|
||||
│ │ │ task_schedules │ │ worker_tasks │ │ │
|
||||
│ │ │ │ │ │ │ │
|
||||
│ │ │ • product_refresh │───────►│ • pending tasks │ │ │
|
||||
│ │ │ • store_discovery │ create │ • claimed tasks │ │ │
|
||||
│ │ │ • analytics_refresh │ tasks │ • running tasks │ │ │
|
||||
│ │ │ │ │ • completed tasks │ │ │
|
||||
│ │ │ next_run_at │ │ │ │ │
|
||||
│ │ │ last_run_at │ │ role, dispensary_id │ │ │
|
||||
│ │ │ interval_hours │ │ priority, status │ │ │
|
||||
│ │ └─────────────────────┘ └──────────┬──────────┘ │ │
|
||||
│ │ │ │ │
|
||||
│ └─────────────────────────────────────────────┼────────────────────────────┘ │
|
||||
│ │ │
|
||||
│ ┌──────────────────────┘ │
|
||||
│ │ Workers poll for tasks │
|
||||
│ │ (SELECT FOR UPDATE SKIP LOCKED) │
|
||||
│ ▼ │
|
||||
│ ┌─────────────────────────────────────────────────────────────────────────┐ │
|
||||
│ │ WORKER PODS (StatefulSet: scraper-worker) │ │
|
||||
│ │ │ │
|
||||
│ │ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │ │
|
||||
│ │ │ Worker 0 │ │ Worker 1 │ │ Worker 2 │ │ Worker N │ │ │
|
||||
│ │ │ │ │ │ │ │ │ │ │ │
|
||||
│ │ │ task-worker │ │ task-worker │ │ task-worker │ │ task-worker │ │ │
|
||||
│ │ │ .ts │ │ .ts │ │ .ts │ │ .ts │ │ │
|
||||
│ │ └─────────────┘ └─────────────┘ └─────────────┘ └─────────────┘ │ │
|
||||
│ │ │ │
|
||||
│ └──────────────────────────────────────────────────────────────────────────┘ │
|
||||
│ │
|
||||
└──────────────────────────────────────────────────────────────────────────────────┘
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Startup Sequence
|
||||
|
||||
```
|
||||
┌─────────────────────────────────────────────────────────────────────────────┐
|
||||
│ API SERVER STARTUP │
|
||||
├─────────────────────────────────────────────────────────────────────────────┤
|
||||
│ │
|
||||
│ 1. Express app initializes │
|
||||
│ │ │
|
||||
│ ▼ │
|
||||
│ 2. runAutoMigrations() │
|
||||
│ • Runs pending migrations (including 079_task_schedules.sql) │
|
||||
│ │ │
|
||||
│ ▼ │
|
||||
│ 3. initializeMinio() / initializeImageStorage() │
|
||||
│ │ │
|
||||
│ ▼ │
|
||||
│ 4. cleanupOrphanedJobs() │
|
||||
│ │ │
|
||||
│ ▼ │
|
||||
│ 5. taskScheduler.start() ◄─── NEW (per TASK_WORKFLOW_2024-12-10.md) │
|
||||
│ │ │
|
||||
│ ├── Recover stale tasks (workers that died) │
|
||||
│ ├── Ensure default schedules exist in task_schedules │
|
||||
│ ├── Check and run any due schedules immediately │
|
||||
│ └── Start 60-second poll interval │
|
||||
│ │ │
|
||||
│ ▼ │
|
||||
│ 6. app.listen(PORT) │
|
||||
│ │
|
||||
└─────────────────────────────────────────────────────────────────────────────┘
|
||||
|
||||
┌─────────────────────────────────────────────────────────────────────────────┐
|
||||
│ WORKER POD STARTUP │
|
||||
├─────────────────────────────────────────────────────────────────────────────┤
|
||||
│ │
|
||||
│ 1. K8s starts pod from StatefulSet │
|
||||
│ │ │
|
||||
│ ▼ │
|
||||
│ 2. TaskWorker.constructor() │
|
||||
│ • Create DB pool │
|
||||
│ • Create CrawlRotator │
|
||||
│ │ │
|
||||
│ ▼ │
|
||||
│ 3. initializeStealth() │
|
||||
│ • Load proxies from DB (REQUIRED - fails if none) │
|
||||
│ • Wire rotator to Dutchie client │
|
||||
│ │ │
|
||||
│ ▼ │
|
||||
│ 4. register() with API │
|
||||
│ • Optional - continues if fails │
|
||||
│ │ │
|
||||
│ ▼ │
|
||||
│ 5. startRegistryHeartbeat() every 30s │
|
||||
│ │ │
|
||||
│ ▼ │
|
||||
│ 6. processNextTask() loop │
|
||||
│ │ │
|
||||
│ ├── Poll for pending task (FOR UPDATE SKIP LOCKED) │
|
||||
│ ├── Claim task atomically │
|
||||
│ ├── Execute handler (product_refresh, store_discovery, etc.) │
|
||||
│ ├── Mark complete/failed │
|
||||
│ ├── Chain next task if applicable │
|
||||
│ └── Loop │
|
||||
│ │
|
||||
└─────────────────────────────────────────────────────────────────────────────┘
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Schedule Flow
|
||||
|
||||
```
|
||||
┌─────────────────────────────────────────────────────────────────────────────┐
|
||||
│ SCHEDULER POLL (every 60 seconds) │
|
||||
├─────────────────────────────────────────────────────────────────────────────┤
|
||||
│ │
|
||||
│ BEGIN TRANSACTION │
|
||||
│ │ │
|
||||
│ ▼ │
|
||||
│ SELECT * FROM task_schedules │
|
||||
│ WHERE enabled = true AND next_run_at <= NOW() │
|
||||
│ FOR UPDATE SKIP LOCKED ◄─── Prevents duplicate execution across replicas │
|
||||
│ │ │
|
||||
│ ▼ │
|
||||
│ For each due schedule: │
|
||||
│ │ │
|
||||
│ ├── product_refresh_all │
|
||||
│ │ └─► Query dispensaries needing crawl │
|
||||
│ │ └─► Create product_refresh tasks in worker_tasks │
|
||||
│ │ │
|
||||
│ ├── store_discovery_dutchie │
|
||||
│ │ └─► Create single store_discovery task │
|
||||
│ │ │
|
||||
│ └── analytics_refresh │
|
||||
│ └─► Create single analytics_refresh task │
|
||||
│ │ │
|
||||
│ ▼ │
|
||||
│ UPDATE task_schedules SET │
|
||||
│ last_run_at = NOW(), │
|
||||
│ next_run_at = NOW() + interval_hours │
|
||||
│ │ │
|
||||
│ ▼ │
|
||||
│ COMMIT │
|
||||
│ │
|
||||
└─────────────────────────────────────────────────────────────────────────────┘
|
||||
```
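The due check and the timestamp update in the poll can be sketched in memory. This is an illustration only: `Schedule` and `runDueSchedules` are hypothetical names, and the transaction and `FOR UPDATE SKIP LOCKED` locking that prevent duplicate execution across replicas are not modeled.

```typescript
// Hypothetical in-memory version of one scheduler poll.
interface Schedule {
  name: string;
  enabled: boolean;
  intervalHours: number;
  lastRunAt: Date | null;
  nextRunAt: Date | null;
}

const HOUR_MS = 60 * 60 * 1000;

// Returns the names of the schedules that fired and advances their timestamps.
function runDueSchedules(schedules: Schedule[], now: Date): string[] {
  const fired: string[] = [];
  for (const schedule of schedules) {
    // Mirrors: enabled = true AND next_run_at <= NOW() (a NULL next_run_at never matches).
    const due =
      schedule.enabled &&
      schedule.nextRunAt !== null &&
      schedule.nextRunAt.getTime() <= now.getTime();
    if (!due) continue;
    fired.push(schedule.name); // the real scheduler creates worker_tasks rows here
    schedule.lastRunAt = now;
    schedule.nextRunAt = new Date(now.getTime() + schedule.intervalHours * HOUR_MS);
  }
  return fired;
}
```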
|
||||
|
||||
---
|
||||
|
||||
## Task Lifecycle
|
||||
|
||||
```
|
||||
┌──────────┐
|
||||
│ SCHEDULE │
|
||||
│ DUE │
|
||||
└────┬─────┘
|
||||
│
|
||||
▼
|
||||
┌──────────────┐ claim ┌──────────────┐ start ┌──────────────┐
|
||||
│ PENDING │────────────►│ CLAIMED │────────────►│ RUNNING │
|
||||
└──────────────┘ └──────────────┘ └──────┬───────┘
|
||||
▲ │
|
||||
│ ┌──────────────┼──────────────┐
|
||||
│ retry │ │ │
|
||||
│ (if retries < max) ▼ ▼ ▼
|
||||
│ ┌──────────┐ ┌──────────┐ ┌──────────┐
|
||||
└──────────────────────────────────│ FAILED │ │ COMPLETED│ │ STALE │
|
||||
└──────────┘ └──────────┘ └────┬─────┘
|
||||
│
|
||||
recover_stale_tasks()
|
||||
│
|
||||
▼
|
||||
┌──────────┐
|
||||
│ PENDING │
|
||||
└──────────┘
|
||||
```
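The diagram can be read as a small state machine. The sketch below encodes its transitions; `nextStatus` and the event names are hypothetical, and whether `stale` is a stored status or a condition derived from `last_heartbeat_at` is not stated here, so the sketch treats it as a state. The default of 3 matches `max_retries` in `worker_tasks`.

```typescript
// Hypothetical state machine for the task lifecycle diagram.
type LifecycleStatus = 'pending' | 'claimed' | 'running' | 'completed' | 'failed' | 'stale';
type LifecycleEvent = 'claim' | 'start' | 'complete' | 'fail' | 'heartbeat_lost' | 'retry' | 'recover';

function nextStatus(
  status: LifecycleStatus,
  event: LifecycleEvent,
  retryCount: number = 0,
  maxRetries: number = 3,
): LifecycleStatus {
  switch (status) {
    case 'pending':
      if (event === 'claim') return 'claimed';
      break;
    case 'claimed':
      if (event === 'start') return 'running';
      break;
    case 'running':
      if (event === 'complete') return 'completed';
      if (event === 'fail') return 'failed';
      if (event === 'heartbeat_lost') return 'stale';
      break;
    case 'failed':
      // Retry only while retries remain.
      if (event === 'retry' && retryCount < maxRetries) return 'pending';
      break;
    case 'stale':
      // recover_stale_tasks() returns the task to the queue.
      if (event === 'recover') return 'pending';
      break;
  }
  throw new Error(`Invalid transition: ${status} + ${event}`);
}
```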
|
||||
|
||||
---
|
||||
|
||||
## Database Tables
|
||||
|
||||
### task_schedules (NEW - migration 079)
|
||||
|
||||
Stores schedule definitions. Survives restarts.
|
||||
|
||||
```sql
|
||||
CREATE TABLE task_schedules (
|
||||
id SERIAL PRIMARY KEY,
|
||||
name VARCHAR(100) NOT NULL UNIQUE,
|
||||
role VARCHAR(50) NOT NULL, -- product_refresh, store_discovery, etc.
|
||||
enabled BOOLEAN DEFAULT TRUE,
|
||||
interval_hours INTEGER NOT NULL, -- How often to run
|
||||
priority INTEGER DEFAULT 0, -- Task priority when created
|
||||
state_code VARCHAR(2), -- Optional filter
|
||||
last_run_at TIMESTAMPTZ, -- When it last ran
|
||||
next_run_at TIMESTAMPTZ, -- When it's due next
|
||||
last_task_count INTEGER, -- Tasks created last run
|
||||
last_error TEXT -- Error message if failed
|
||||
);
|
||||
```
|
||||
|
||||
### worker_tasks (migration 074)
|
||||
|
||||
The task queue. Workers pull from here.
|
||||
|
||||
```sql
|
||||
CREATE TABLE worker_tasks (
|
||||
id SERIAL PRIMARY KEY,
|
||||
role task_role NOT NULL, -- What type of work
|
||||
dispensary_id INTEGER, -- Which store (if applicable)
|
||||
platform VARCHAR(50), -- Which platform
|
||||
status task_status DEFAULT 'pending',
|
||||
priority INTEGER DEFAULT 0, -- Higher = process first
|
||||
scheduled_for TIMESTAMP, -- Don't process before this time
|
||||
worker_id VARCHAR(100), -- Which worker claimed it
|
||||
claimed_at TIMESTAMP,
|
||||
started_at TIMESTAMP,
|
||||
completed_at TIMESTAMP,
|
||||
last_heartbeat_at TIMESTAMP, -- For stale detection
|
||||
result JSONB,
|
||||
error_message TEXT,
|
||||
retry_count INTEGER DEFAULT 0,
|
||||
max_retries INTEGER DEFAULT 3
|
||||
);
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Default Schedules
|
||||
|
||||
| Name | Role | Interval | Priority | Description |
|
||||
|------|------|----------|----------|-------------|
|
||||
| `payload_fetch_all` | payload_fetch | 4 hours | 0 | Fetch payloads from Dutchie API (chains to product_refresh) |
|
||||
| `store_discovery_dutchie` | store_discovery | 24 hours | 5 | Find new Dutchie stores |
|
||||
| `analytics_refresh` | analytics_refresh | 6 hours | 0 | Refresh MVs |
|
||||
|
||||
---
|
||||
|
||||
## Task Roles
|
||||
|
||||
| Role | Description | Creates Tasks For |
|
||||
|------|-------------|-------------------|
|
||||
| `payload_fetch` | **NEW** - Fetch from Dutchie API, save to disk | Each dispensary needing crawl |
|
||||
| `product_refresh` | **CHANGED** - Read local payload, normalize, upsert to DB | Chained from payload_fetch |
|
||||
| `store_discovery` | Find new dispensaries, returns newStoreIds[] | Single task per platform |
|
||||
| `entry_point_discovery` | **DEPRECATED** - Resolve platform IDs | No longer used |
|
||||
| `product_discovery` | Initial product fetch for new stores | Chained from store_discovery |
|
||||
| `analytics_refresh` | Refresh MVs | Single global task |
|
||||
|
||||
### Payload/Refresh Separation (2024-12-10)
|
||||
|
||||
The crawl workflow is now split into two phases:
|
||||
|
||||
```
|
||||
payload_fetch (scheduled every 4h)
|
||||
└─► Hit Dutchie GraphQL API
|
||||
└─► Save raw JSON to /storage/payloads/{year}/{month}/{day}/store_{id}_{ts}.json.gz
|
||||
└─► Record metadata in raw_crawl_payloads table
|
||||
└─► Queue product_refresh task with payload_id
|
||||
|
||||
product_refresh (chained from payload_fetch)
|
||||
└─► Load payload from filesystem (NOT from API)
|
||||
└─► Normalize via DutchieNormalizer
|
||||
└─► Upsert to store_products
|
||||
└─► Create snapshots
|
||||
└─► Track missing products
|
||||
└─► Download images
|
||||
```
|
||||
|
||||
**Benefits:**
|
||||
- **Retry-friendly**: If normalize fails, re-run product_refresh without re-crawling
|
||||
- **Replay-able**: Run product_refresh against any historical payload
|
||||
- **Faster refreshes**: Local file read vs network call
|
||||
- **Historical diffs**: Compare payloads to see what changed between crawls
|
||||
- **Less API pressure**: Only payload_fetch hits Dutchie
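The payload path template can be filled in as sketched below. Several details are assumptions, since the document gives only the template: dates are taken in UTC, month and day are zero-padded, and `{ts}` is an epoch-millisecond timestamp. `buildPayloadPath` is a hypothetical name, not a function from `payload-storage.ts`.

```typescript
// Hypothetical builder for /storage/payloads/{year}/{month}/{day}/store_{id}_{ts}.json.gz
function pad2(value: number): string {
  return value < 10 ? '0' + value : String(value);
}

function buildPayloadPath(storeId: number, fetchedAt: Date): string {
  const year = fetchedAt.getUTCFullYear();
  const month = pad2(fetchedAt.getUTCMonth() + 1); // getUTCMonth() is zero-based
  const day = pad2(fetchedAt.getUTCDate());
  const ts = fetchedAt.getTime(); // assumed timestamp format: epoch milliseconds
  return `/storage/payloads/${year}/${month}/${day}/store_${storeId}_${ts}.json.gz`;
}
```

Partitioning by date keeps each directory small and makes it straightforward to locate the payloads needed for a replay or a historical diff.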
|
||||
|
||||
---
|
||||
|
||||
## Task Chaining
|
||||
|
||||
Tasks automatically queue follow-up tasks upon successful completion. This creates two main flows:
|
||||
|
||||
### Discovery Flow (New Stores)
|
||||
|
||||
When `store_discovery` finds new dispensaries, they automatically get their initial product data:
|
||||
|
||||
```
|
||||
store_discovery
|
||||
└─► Discovers new locations via Dutchie GraphQL
|
||||
└─► Auto-promotes valid locations to dispensaries table
|
||||
└─► Collects newDispensaryIds[] from promotions
|
||||
└─► Returns { newStoreIds: [...] } in result
|
||||
|
||||
chainNextTask() detects newStoreIds
|
||||
└─► Creates product_discovery task for each new store
|
||||
|
||||
product_discovery
|
||||
└─► Calls handlePayloadFetch() internally
|
||||
└─► payload_fetch hits Dutchie API
|
||||
└─► Saves raw JSON to /storage/payloads/
|
||||
└─► Queues product_refresh task with payload_id
|
||||
|
||||
product_refresh
|
||||
└─► Loads payload from filesystem
|
||||
└─► Normalizes and upserts to store_products
|
||||
└─► Creates snapshots, downloads images
|
||||
```
|
||||
|
||||
**Complete Discovery Chain:**
|
||||
```
|
||||
store_discovery → product_discovery → payload_fetch → product_refresh
|
||||
(internal call) (queues next)
|
||||
```
|
||||
|
||||
### Scheduled Flow (Existing Stores)

For existing stores, the `payload_fetch_all` schedule runs every 4 hours:

```
TaskScheduler (every 60s)
  └─► Checks task_schedules for due schedules
  └─► payload_fetch_all is due
  └─► Generates payload_fetch task for each dispensary

payload_fetch
  └─► Hits Dutchie GraphQL API
  └─► Saves raw JSON to /storage/payloads/
  └─► Queues product_refresh task with payload_id

product_refresh
  └─► Loads payload from filesystem (NOT API)
  └─► Normalizes via DutchieNormalizer
  └─► Upserts to store_products
  └─► Creates snapshots
```

**Complete Scheduled Chain:**
```
payload_fetch → product_refresh
   (queues)       (reads local)
```

### Chaining Implementation

Task chaining is handled in three places:

1. **Internal chaining (handler calls handler):**
   - `product_discovery` calls `handlePayloadFetch()` directly

2. **External chaining (`chainNextTask()` in `task-service.ts`):**
   - Called after task completion
   - `store_discovery` → queues `product_discovery` for each newStoreId

3. **Queue-based chaining (`taskService.createTask`):**
   - `payload_fetch` queues `product_refresh` with `payload: { payload_id }`

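The dispatch rules listed above can be sketched as a pure function. This is a minimal TypeScript illustration, not the code in `task-service.ts`: the `CompletedTask` and `NewTask` shapes and the `followUpTasks()` name are hypothetical, and external and queue-based chaining are folded into one function for brevity. Internal chaining is a direct function call, so it produces no queue entry here.

```typescript
// Hypothetical shapes for illustration; the real task-service.ts types may differ.
type TaskRole =
  | "store_discovery"
  | "product_discovery"
  | "payload_fetch"
  | "product_refresh";

interface CompletedTask {
  role: TaskRole;
  dispensaryId?: number;
  result: { newStoreIds?: number[]; payloadId?: number };
}

interface NewTask {
  role: TaskRole;
  dispensaryId: number;
  payload?: { payload_id: number };
}

// Returns the tasks that should be queued after `task` completes successfully.
function followUpTasks(task: CompletedTask): NewTask[] {
  switch (task.role) {
    case "store_discovery":
      // External chaining: one product_discovery task per newly promoted store.
      return (task.result.newStoreIds ?? []).map((id) => ({
        role: "product_discovery" as const,
        dispensaryId: id,
      }));
    case "payload_fetch":
      // Queue-based chaining: hand the saved payload to product_refresh.
      if (task.dispensaryId === undefined || task.result.payloadId === undefined) {
        return [];
      }
      return [
        {
          role: "product_refresh",
          dispensaryId: task.dispensaryId,
          payload: { payload_id: task.result.payloadId },
        },
      ];
    default:
      // product_discovery chains internally; product_refresh ends the chain.
      return [];
  }
}
```
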
---

## Payload API Endpoints

Raw crawl payloads can be accessed via the Payloads API:

| Endpoint | Method | Description |
|----------|--------|-------------|
| `GET /api/payloads` | GET | List payload metadata (paginated) |
| `GET /api/payloads/:id` | GET | Get payload metadata by ID |
| `GET /api/payloads/:id/data` | GET | Get full payload JSON (decompressed) |
| `GET /api/payloads/store/:dispensaryId` | GET | List payloads for a store |
| `GET /api/payloads/store/:dispensaryId/latest` | GET | Get latest payload for a store |
| `GET /api/payloads/store/:dispensaryId/diff` | GET | Diff two payloads for changes |

### Payload Diff Response

The diff endpoint returns:
```json
{
  "success": true,
  "from": { "id": 123, "fetchedAt": "...", "productCount": 100 },
  "to": { "id": 456, "fetchedAt": "...", "productCount": 105 },
  "diff": {
    "added": 10,
    "removed": 5,
    "priceChanges": 8,
    "stockChanges": 12
  },
  "details": {
    "added": [...],
    "removed": [...],
    "priceChanges": [...],
    "stockChanges": [...]
  }
}
```

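One way a summary like this could be computed is sketched below in TypeScript. The `Product` shape (`id`, `price`, `inStock`) and the `diffPayloads()` name are assumptions made for illustration; the real payload fields and diff logic may differ. Note that the counts in the sample response are internally consistent: 100 products plus 10 added minus 5 removed gives 105.

```typescript
// Hypothetical product shape; real payloads carry many more fields.
interface Product {
  id: string;
  price: number;
  inStock: boolean;
}

// Compares two product lists keyed by id and reports counts plus details.
function diffPayloads(from: Product[], to: Product[]) {
  const before = new Map(from.map((p) => [p.id, p] as const));
  const after = new Map(to.map((p) => [p.id, p] as const));

  const added = to.filter((p) => !before.has(p.id));
  const removed = from.filter((p) => !after.has(p.id));
  // Products present in both payloads are the only candidates for changes.
  const common = to.filter((p) => before.has(p.id));
  const priceChanges = common.filter((p) => before.get(p.id)!.price !== p.price);
  const stockChanges = common.filter((p) => before.get(p.id)!.inStock !== p.inStock);

  return {
    diff: {
      added: added.length,
      removed: removed.length,
      priceChanges: priceChanges.length,
      stockChanges: stockChanges.length,
    },
    details: { added, removed, priceChanges, stockChanges },
  };
}
```
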
---

## API Endpoints

### Schedules (NEW)

| Endpoint | Method | Description |
|----------|--------|-------------|
| `GET /api/schedules` | GET | List all schedules |
| `PUT /api/schedules/:id` | PUT | Update schedule |
| `POST /api/schedules/:id/trigger` | POST | Run schedule immediately |

### Task Creation (rewired 2024-12-10)

| Endpoint | Method | Description |
|----------|--------|-------------|
| `POST /api/job-queue/enqueue` | POST | Create single task |
| `POST /api/job-queue/enqueue-batch` | POST | Create batch tasks |
| `POST /api/job-queue/enqueue-state` | POST | Create tasks for state |
| `POST /api/tasks` | POST | Direct task creation |

### Task Management

| Endpoint | Method | Description |
|----------|--------|-------------|
| `GET /api/tasks` | GET | List tasks |
| `GET /api/tasks/:id` | GET | Get single task |
| `GET /api/tasks/counts` | GET | Task counts by status |
| `POST /api/tasks/recover-stale` | POST | Recover stale tasks |

---

## Key Files

| File | Purpose |
|------|---------|
| `src/services/task-scheduler.ts` | **NEW** - DB-driven scheduler |
| `src/tasks/task-worker.ts` | Worker that processes tasks |
| `src/tasks/task-service.ts` | Task CRUD operations |
| `src/tasks/handlers/payload-fetch.ts` | **NEW** - Fetches from API, saves to disk |
| `src/tasks/handlers/product-refresh.ts` | **CHANGED** - Reads from disk, processes to DB |
| `src/utils/payload-storage.ts` | **NEW** - Payload save/load utilities |
| `src/routes/tasks.ts` | Task API endpoints |
| `src/routes/job-queue.ts` | Job Queue UI endpoints (rewired) |
| `migrations/079_task_schedules.sql` | Schedule table |
| `migrations/080_raw_crawl_payloads.sql` | Payload metadata table |
| `migrations/081_payload_fetch_columns.sql` | payload, last_fetch_at columns |
| `migrations/074_worker_task_queue.sql` | Task queue table |

---

## Legacy Code (DEPRECATED)

| File | Status | Replacement |
|------|--------|-------------|
| `src/services/scheduler.ts` | DEPRECATED | `task-scheduler.ts` |
| `dispensary_crawl_jobs` table | ORPHANED | `worker_tasks` |
| `job_schedules` table | LEGACY | `task_schedules` |

---

## Dashboard Integration

Both pages remain wired to the dashboard:

| Page | Data Source | Actions |
|------|-------------|---------|
| **Job Queue** | `worker_tasks`, `task_schedules` | Create tasks, view schedules |
| **Task Queue** | `worker_tasks` | View tasks, recover stale |

---

## Multi-Replica Safety

The scheduler uses `SELECT FOR UPDATE SKIP LOCKED` to ensure:

1. **Only one replica** executes a schedule at a time
2. **No duplicate tasks** created
3. **Survives pod restarts** - state in DB, not memory
4. **Self-healing** - recovers stale tasks on startup

```sql
-- This query is atomic across all API server replicas
SELECT * FROM task_schedules
WHERE enabled = true AND next_run_at <= NOW()
FOR UPDATE SKIP LOCKED
```

---

## Worker Scaling (K8s)

Workers run as a StatefulSet in Kubernetes. You can scale from the admin UI or CLI.

### From Admin UI

The Workers page (`/admin/workers`) provides:
- Current replica count display
- Scale up/down buttons
- Target replica input

### API Endpoints

| Endpoint | Method | Description |
|----------|--------|-------------|
| `GET /api/workers/k8s/replicas` | GET | Get current/desired replica counts |
| `POST /api/workers/k8s/scale` | POST | Scale to N replicas (body: `{ replicas: N }`) |

### From CLI

```bash
# View current replicas
kubectl get statefulset scraper-worker -n dispensary-scraper

# Scale to 10 workers
kubectl scale statefulset scraper-worker -n dispensary-scraper --replicas=10

# Scale down to 3 workers
kubectl scale statefulset scraper-worker -n dispensary-scraper --replicas=3
```

### Configuration

Environment variables for the API server:

| Variable | Default | Description |
|----------|---------|-------------|
| `K8S_NAMESPACE` | `dispensary-scraper` | Kubernetes namespace |
| `K8S_WORKER_STATEFULSET` | `scraper-worker` | StatefulSet name |

### RBAC Requirements

The API server pod needs these K8s permissions:

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: worker-scaler
  namespace: dispensary-scraper
rules:
- apiGroups: ["apps"]
  resources: ["statefulsets"]
  verbs: ["get", "patch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: scraper-worker-scaler
  namespace: dispensary-scraper
subjects:
- kind: ServiceAccount
  name: default
  namespace: dispensary-scraper
roleRef:
  kind: Role
  name: worker-scaler
  apiGroup: rbac.authorization.k8s.io
```
542
backend/docs/_archive/WORKER_TASK_ARCHITECTURE.md
Normal file
@@ -0,0 +1,542 @@
# Worker Task Architecture

This document describes the unified task-based worker system that replaces the legacy fragmented job systems.

## Overview

The task worker architecture provides a single, unified system for managing all background work in CannaiQ:

- **Store discovery** - Find new dispensaries on platforms
- **Entry point discovery** - Resolve platform IDs from menu URLs
- **Product discovery** - Initial product fetch for new stores
- **Product resync** - Regular price/stock updates for existing stores
- **Analytics refresh** - Refresh materialized views and analytics

## Architecture

### Database Tables

**`worker_tasks`** - Central task queue
```sql
CREATE TABLE worker_tasks (
  id SERIAL PRIMARY KEY,
  role task_role NOT NULL,              -- What type of work
  dispensary_id INTEGER,                -- Which store (if applicable)
  platform VARCHAR(50),                 -- Which platform (dutchie, etc.)
  status task_status DEFAULT 'pending',
  priority INTEGER DEFAULT 0,           -- Higher = process first
  scheduled_for TIMESTAMP,              -- Don't process before this time
  worker_id VARCHAR(100),               -- Which worker claimed it
  claimed_at TIMESTAMP,
  started_at TIMESTAMP,
  completed_at TIMESTAMP,
  last_heartbeat_at TIMESTAMP,          -- For stale detection
  result JSONB,                         -- Output from handler
  error_message TEXT,
  retry_count INTEGER DEFAULT 0,
  max_retries INTEGER DEFAULT 3,
  created_at TIMESTAMP DEFAULT NOW(),
  updated_at TIMESTAMP DEFAULT NOW()
);
```

**Key indexes:**
- `idx_worker_tasks_pending_priority` - For efficient task claiming
- `idx_worker_tasks_active_dispensary` - Prevents concurrent tasks per store (partial unique index)

### Task Roles

| Role | Purpose | Per-Store | Scheduled |
|------|---------|-----------|-----------|
| `store_discovery` | Find new stores on a platform | No | Daily |
| `entry_point_discovery` | Resolve platform IDs | Yes | On-demand |
| `product_discovery` | Initial product fetch | Yes | After entry_point |
| `product_resync` | Price/stock updates | Yes | Every 4 hours |
| `analytics_refresh` | Refresh MVs | No | Daily |

### Task Lifecycle

```
pending → claimed → running → completed
                       ↓
                    failed
```

1. **pending** - Task is waiting to be picked up
2. **claimed** - Worker has claimed it (atomic via `SELECT FOR UPDATE SKIP LOCKED`)
3. **running** - Worker is actively processing
4. **completed** - Task finished successfully
5. **failed** - Task encountered an error
6. **stale** - Task lost its worker (recovered automatically)

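The lifecycle can be read as a small state machine. The TypeScript sketch below is illustrative only: the `transitions` table and `canTransition()` are hypothetical names, and the edges out of `failed` and `stale` back to `pending` are assumptions based on the retry and recovery behavior described elsewhere in this document, not on the schema itself.

```typescript
type TaskStatus = "pending" | "claimed" | "running" | "completed" | "failed" | "stale";

// Allowed transitions, read off the lifecycle diagram.
// The edges into and out of "stale", and "failed" -> "pending", are assumptions.
const transitions: Record<TaskStatus, TaskStatus[]> = {
  pending: ["claimed"],
  claimed: ["running", "stale"],
  running: ["completed", "failed", "stale"],
  completed: [],
  failed: ["pending"], // manual retry
  stale: ["pending"], // automatic recovery
};

// True when a task may move directly from `from` to `to`.
function canTransition(from: TaskStatus, to: TaskStatus): boolean {
  return transitions[from].indexOf(to) !== -1;
}
```

A guard like this could be used before any status update to reject, for example, a jump from `pending` straight to `running`.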
## Files

### Core Files

| File | Purpose |
|------|---------|
| `src/tasks/task-service.ts` | TaskService - CRUD, claiming, capacity metrics |
| `src/tasks/task-worker.ts` | TaskWorker - Main worker loop |
| `src/tasks/index.ts` | Module exports |
| `src/routes/tasks.ts` | API endpoints |
| `migrations/074_worker_task_queue.sql` | Database schema |

### Task Handlers

| File | Role |
|------|------|
| `src/tasks/handlers/store-discovery.ts` | `store_discovery` |
| `src/tasks/handlers/entry-point-discovery.ts` | `entry_point_discovery` |
| `src/tasks/handlers/product-discovery.ts` | `product_discovery` |
| `src/tasks/handlers/product-resync.ts` | `product_resync` |
| `src/tasks/handlers/analytics-refresh.ts` | `analytics_refresh` |

## Running Workers

### Environment Variables

| Variable | Default | Description |
|----------|---------|-------------|
| `WORKER_ROLE` | (required) | Which task role to process |
| `WORKER_ID` | auto-generated | Custom worker identifier |
| `POLL_INTERVAL_MS` | 5000 | How often to check for tasks |
| `HEARTBEAT_INTERVAL_MS` | 30000 | How often to update heartbeat |

### Starting a Worker

```bash
# Start a product resync worker
WORKER_ROLE=product_resync npx tsx src/tasks/task-worker.ts

# Start with custom ID
WORKER_ROLE=product_resync WORKER_ID=resync-1 npx tsx src/tasks/task-worker.ts

# Start multiple workers for different roles
WORKER_ROLE=store_discovery npx tsx src/tasks/task-worker.ts &
WORKER_ROLE=product_resync npx tsx src/tasks/task-worker.ts &
```

### Kubernetes Deployment

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: task-worker-resync
spec:
  replicas: 3
  template:
    spec:
      containers:
      - name: worker
        image: code.cannabrands.app/creationshop/dispensary-scraper:latest
        command: ["npx", "tsx", "src/tasks/task-worker.ts"]
        env:
        - name: WORKER_ROLE
          value: "product_resync"
```

## API Endpoints

### Task Management

| Endpoint | Method | Description |
|----------|--------|-------------|
| `/api/tasks` | GET | List tasks with filters |
| `/api/tasks` | POST | Create a new task |
| `/api/tasks/:id` | GET | Get task by ID |
| `/api/tasks/counts` | GET | Get counts by status |
| `/api/tasks/capacity` | GET | Get capacity metrics |
| `/api/tasks/capacity/:role` | GET | Get role-specific capacity |
| `/api/tasks/recover-stale` | POST | Recover tasks from dead workers |

### Task Generation

| Endpoint | Method | Description |
|----------|--------|-------------|
| `/api/tasks/generate/resync` | POST | Generate daily resync tasks |
| `/api/tasks/generate/discovery` | POST | Create store discovery task |

### Migration (from legacy systems)

| Endpoint | Method | Description |
|----------|--------|-------------|
| `/api/tasks/migration/status` | GET | Compare old vs new systems |
| `/api/tasks/migration/disable-old-schedules` | POST | Disable job_schedules |
| `/api/tasks/migration/cancel-pending-crawl-jobs` | POST | Cancel old crawl jobs |
| `/api/tasks/migration/create-resync-tasks` | POST | Create tasks for all stores |
| `/api/tasks/migration/full-migrate` | POST | One-click migration |

### Role-Specific Endpoints

| Endpoint | Method | Description |
|----------|--------|-------------|
| `/api/tasks/role/:role/last-completion` | GET | Last completion time |
| `/api/tasks/role/:role/recent` | GET | Recent completions |
| `/api/tasks/store/:id/active` | GET | Check if store has active task |

## Capacity Planning

The `v_worker_capacity` view provides real-time metrics:

```sql
SELECT * FROM v_worker_capacity;
```

Returns:
- `pending_tasks` - Tasks waiting to be claimed
- `ready_tasks` - Tasks ready now (scheduled_for is null or past)
- `claimed_tasks` - Tasks claimed but not started
- `running_tasks` - Tasks actively processing
- `completed_last_hour` - Recent completions
- `failed_last_hour` - Recent failures
- `active_workers` - Workers with recent heartbeats
- `avg_duration_sec` - Average task duration
- `tasks_per_worker_hour` - Throughput estimate
- `estimated_hours_to_drain` - Time to clear queue

### Scaling Recommendations

```javascript
// API: GET /api/tasks/capacity/:role
{
  "role": "product_resync",
  "pending_tasks": 500,
  "active_workers": 3,
  "workers_needed": {
    "for_1_hour": 10,
    "for_4_hours": 3,
    "for_8_hours": 2
  }
}
```

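The `workers_needed` figures in the sample response are consistent with dividing the backlog by per-worker throughput over the target window and rounding up. The TypeScript sketch below shows that arithmetic; `workersNeeded()` is a hypothetical helper, and the rate of 50 tasks per worker per hour is an assumption chosen because it reproduces the sample numbers, not a value stated by the API.

```typescript
// Workers required to drain `pendingTasks` within `hours`,
// given an observed throughput of `tasksPerWorkerHour`.
function workersNeeded(
  pendingTasks: number,
  tasksPerWorkerHour: number,
  hours: number,
): number {
  if (tasksPerWorkerHour <= 0 || hours <= 0) {
    throw new RangeError("tasksPerWorkerHour and hours must be positive");
  }
  return Math.ceil(pendingTasks / (tasksPerWorkerHour * hours));
}

// With 500 pending tasks and an assumed 50 tasks/worker/hour:
const plan = {
  for_1_hour: workersNeeded(500, 50, 1),
  for_4_hours: workersNeeded(500, 50, 4),
  for_8_hours: workersNeeded(500, 50, 8),
};
```

Rounding up matters here: 500 / (50 × 4) is 2.5, and two workers would leave part of the queue unprocessed at the four-hour mark.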
## Task Chaining

Tasks can automatically create follow-up tasks:

```
store_discovery → entry_point_discovery → product_discovery
                                                  ↓
                                  (store has platform_dispensary_id)
                                                  ↓
                                          Daily resync tasks
```

The `chainNextTask()` method handles this automatically.

## Stale Task Recovery

Tasks are considered stale if `last_heartbeat_at` is older than the threshold (default 10 minutes).

```sql
SELECT recover_stale_tasks(10); -- 10 minute threshold
```

Or via API:
```bash
curl -X POST /api/tasks/recover-stale \
  -H 'Content-Type: application/json' \
  -d '{"threshold_minutes": 10}'
```

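The staleness rule itself is a single timestamp comparison. A minimal TypeScript sketch follows, assuming a hypothetical `isStale()` helper; the actual check lives in the `recover_stale_tasks()` SQL function, whose body is not shown in this document.

```typescript
// True when the last heartbeat is older than the threshold.
// The 10-minute default mirrors the documented default.
function isStale(
  lastHeartbeatAt: Date,
  now: Date,
  thresholdMinutes: number = 10,
): boolean {
  const ageMs = now.getTime() - lastHeartbeatAt.getTime();
  return ageMs > thresholdMinutes * 60 * 1000;
}
```

With the default `HEARTBEAT_INTERVAL_MS` of 30000, a healthy worker refreshes its heartbeat about twenty times within a 10-minute window, so a task flagged by this check has missed many consecutive heartbeats.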
## Migration from Legacy Systems

### Legacy Systems Replaced

1. **job_schedules + job_run_logs** - Scheduled job definitions
2. **dispensary_crawl_jobs** - Per-dispensary crawl queue
3. **SyncOrchestrator + HydrationWorker** - Raw payload processing

### Migration Steps

**Option 1: One-Click Migration**
```bash
curl -X POST /api/tasks/migration/full-migrate
```

This will:
1. Disable all job_schedules
2. Cancel pending dispensary_crawl_jobs
3. Generate resync tasks for all stores
4. Create discovery and analytics tasks

**Option 2: Manual Migration**
```bash
# 1. Check current status
curl /api/tasks/migration/status

# 2. Disable old schedules
curl -X POST /api/tasks/migration/disable-old-schedules

# 3. Cancel pending crawl jobs
curl -X POST /api/tasks/migration/cancel-pending-crawl-jobs

# 4. Create resync tasks
curl -X POST /api/tasks/migration/create-resync-tasks \
  -H 'Content-Type: application/json' \
  -d '{"state_code": "AZ"}'

# 5. Generate daily resync schedule
curl -X POST /api/tasks/generate/resync \
  -H 'Content-Type: application/json' \
  -d '{"batches_per_day": 6}'
```

## Per-Store Locking

The system prevents concurrent tasks for the same store using a partial unique index:

```sql
CREATE UNIQUE INDEX idx_worker_tasks_active_dispensary
ON worker_tasks (dispensary_id)
WHERE dispensary_id IS NOT NULL
AND status IN ('claimed', 'running');
```

This ensures only one task can be active per store at any time.

## Task Priority

Tasks are claimed in priority order (higher first), then by creation time:

```sql
ORDER BY priority DESC, created_at ASC
```

Default priorities:
- `store_discovery`: 0
- `entry_point_discovery`: 10 (high - new stores)
- `product_discovery`: 10 (high - new stores)
- `product_resync`: 0
- `analytics_refresh`: 0

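The same ordering can be expressed as a comparator, which is handy for reasoning about which task a worker would pick next. This is a TypeScript sketch with a hypothetical `QueuedTask` shape and `claimOrder()` helper; the real ordering is done by the SQL `ORDER BY` shown above.

```typescript
// Hypothetical in-memory view of a queued task.
interface QueuedTask {
  id: number;
  priority: number;
  createdAt: Date;
}

// Mirrors ORDER BY priority DESC, created_at ASC without mutating the input.
function claimOrder(tasks: QueuedTask[]): QueuedTask[] {
  return [...tasks].sort(
    (a, b) =>
      b.priority - a.priority ||
      a.createdAt.getTime() - b.createdAt.getTime(),
  );
}
```

With the default priorities, a `product_discovery` task (priority 10) created a minute ago is claimed before a `product_resync` task (priority 0) that has been waiting for an hour.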
## Scheduled Tasks

Tasks can be scheduled for future execution:

```javascript
await taskService.createTask({
  role: 'product_resync',
  dispensary_id: 123,
  scheduled_for: new Date('2025-01-10T06:00:00Z'),
});
```

The `generate_resync_tasks()` function creates staggered tasks throughout the day:

```sql
SELECT generate_resync_tasks(6, '2025-01-10'); -- 6 batches = every 4 hours
```

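The "6 batches = every 4 hours" arithmetic is 24 hours divided by the batch count. The TypeScript sketch below computes the batch start times under two assumptions that this document does not confirm: batches start at midnight UTC, and they are evenly spaced. `batchStartTimes()` is a hypothetical helper, not part of the SQL function.

```typescript
// Evenly spaced batch start times for one day, starting at 00:00 UTC (assumed).
function batchStartTimes(batchesPerDay: number, day: string): Date[] {
  if (!Number.isInteger(batchesPerDay) || batchesPerDay < 1) {
    throw new RangeError("batchesPerDay must be a positive integer");
  }
  const dayStartMs = Date.parse(`${day}T00:00:00Z`);
  const stepMs = (24 * 60 * 60 * 1000) / batchesPerDay;
  return Array.from(
    { length: batchesPerDay },
    (_, i) => new Date(dayStartMs + i * stepMs),
  );
}

const batches = batchStartTimes(6, "2025-01-10");
```

Each resulting timestamp could then be used as the `scheduled_for` value for one batch of stores.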
## Dashboard Integration

The admin dashboard shows task queue status in the main overview:

```
Task Queue Summary
------------------
Pending: 45
Running: 3
Completed: 1,234
Failed: 12
```

Full task management is available at `/admin/tasks`.

## Error Handling

Failed tasks include the error message in `error_message` and can be retried:

```sql
-- View failed tasks
SELECT id, role, dispensary_id, error_message, retry_count
FROM worker_tasks
WHERE status = 'failed'
ORDER BY completed_at DESC
LIMIT 20;

-- Retry failed tasks
UPDATE worker_tasks
SET status = 'pending', retry_count = retry_count + 1
WHERE status = 'failed' AND retry_count < max_retries;
```

## Concurrent Task Processing (Added 2024-12)

Workers can now process multiple tasks concurrently within a single worker instance. This improves throughput by utilizing async I/O efficiently.

### Architecture

```
┌───────────────────────────────────────────────────────────┐
│                         Pod (K8s)                         │
│                                                           │
│  ┌─────────────────────────────────────────────────────┐  │
│  │                     TaskWorker                      │  │
│  │                                                     │  │
│  │  ┌─────────┐  ┌─────────┐  ┌─────────┐              │  │
│  │  │ Task 1  │  │ Task 2  │  │ Task 3  │ (concurrent) │  │
│  │  └─────────┘  └─────────┘  └─────────┘              │  │
│  │                                                     │  │
│  │  Resource Monitor                                   │  │
│  │  ├── Memory: 65% (threshold: 85%)                   │  │
│  │  ├── CPU: 45% (threshold: 90%)                      │  │
│  │  └── Status: Normal                                 │  │
│  └─────────────────────────────────────────────────────┘  │
└───────────────────────────────────────────────────────────┘
```

### Environment Variables

| Variable | Default | Description |
|----------|---------|-------------|
| `MAX_CONCURRENT_TASKS` | 3 | Maximum tasks a worker will run concurrently |
| `MEMORY_BACKOFF_THRESHOLD` | 0.85 | Back off when heap memory exceeds 85% |
| `CPU_BACKOFF_THRESHOLD` | 0.90 | Back off when CPU exceeds 90% |
| `BACKOFF_DURATION_MS` | 10000 | How long to wait when backing off (10s) |

### How It Works

1. **Main Loop**: Worker continuously tries to fill up to `MAX_CONCURRENT_TASKS`
2. **Resource Monitoring**: Before claiming a new task, worker checks memory and CPU
3. **Backoff**: If resources exceed thresholds, worker pauses and stops claiming new tasks
4. **Concurrent Execution**: Tasks run concurrently as `Promise`s, so one task waiting on I/O does not block the others
5. **Graceful Shutdown**: On SIGTERM/decommission, worker stops claiming but waits for active tasks

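Steps 1 to 4 can be sketched as a small concurrency-limited loop. This TypeScript sketch is illustrative and is not the `TaskWorker` implementation: `runWithLimit()`, the `Job` type, and all parameter names are hypothetical, jobs are assumed to catch their own errors, `maxConcurrent` is assumed to be at least 1, and graceful shutdown (step 5) is left out.

```typescript
type Job = () => Promise<void>;

// Runs `jobs` with at most `maxConcurrent` in flight and returns the peak
// number of simultaneously active jobs. Claiming pauses while
// `shouldBackOff()` returns true; already-running jobs are not interrupted.
async function runWithLimit(
  jobs: Job[],
  maxConcurrent: number,
  shouldBackOff: () => boolean = () => false,
  backoffMs: number = 10,
): Promise<number> {
  const active = new Set<Promise<void>>();
  let next = 0;
  let peak = 0;

  while (next < jobs.length || active.size > 0) {
    if (next < jobs.length && active.size < maxConcurrent) {
      if (shouldBackOff()) {
        // Stop claiming for a while, then recheck resources.
        await new Promise<void>((resolve) => setTimeout(resolve, backoffMs));
        continue;
      }
      // Claim the next job and free its slot when it settles.
      const running: Promise<void> = jobs[next++]().finally(() => {
        active.delete(running);
      });
      active.add(running);
      peak = Math.max(peak, active.size);
      continue;
    }
    // At capacity, or nothing left to claim: wait for any job to finish.
    await Promise.race(active);
  }
  return peak;
}
```

Because each job is an independent promise, a slow job only occupies its own slot; the loop keeps refilling the remaining slots as other jobs finish.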
### Resource Monitoring

```typescript
// ResourceStats interface
interface ResourceStats {
  memoryPercent: number;   // Current heap usage as decimal (0.0-1.0)
  memoryMb: number;        // Current heap used in MB
  memoryTotalMb: number;   // Total heap available in MB
  cpuPercent: number;      // CPU usage as percentage (0-100)
  isBackingOff: boolean;   // True if worker is in backoff state
  backoffReason: string;   // Why the worker is backing off
}
```

### Heartbeat Data

Workers report the following in their heartbeat:

```json
{
  "worker_id": "worker-abc123",
  "current_task_id": 456,
  "current_task_ids": [456, 457, 458],
  "active_task_count": 3,
  "max_concurrent_tasks": 3,
  "status": "active",
  "resources": {
    "memory_mb": 256,
    "memory_total_mb": 512,
    "memory_rss_mb": 320,
    "memory_percent": 50,
    "cpu_user_ms": 12500,
    "cpu_system_ms": 3200,
    "cpu_percent": 45,
    "is_backing_off": false,
    "backoff_reason": null
  }
}
```

### Backoff Behavior

When resources exceed thresholds:

1. Worker logs the backoff reason:
   ```
   [TaskWorker] MyWorker backing off: Memory at 87.3% (threshold: 85%)
   ```

2. Worker stops claiming new tasks but continues existing tasks

3. After `BACKOFF_DURATION_MS`, worker rechecks resources

4. When resources return to normal:
   ```
   [TaskWorker] MyWorker resuming normal operation
   ```

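The threshold check behind this behavior can be sketched as a pure function. This TypeScript sketch is an assumption-laden illustration, not the `shouldBackOff()` method from `task-worker.ts`: `backoffReason()` and the `Usage` shape are hypothetical, both usage values are taken as fractions between 0.0 and 1.0 so they compare directly against the documented thresholds, and the wording of the CPU message is assumed by analogy with the memory log line shown above.

```typescript
// Both values are fractions in the range 0.0-1.0 (an assumption of this sketch).
interface Usage {
  memory: number;
  cpu: number;
}

// Returns a human-readable reason when a threshold is exceeded, else null.
function backoffReason(
  usage: Usage,
  memoryThreshold: number = 0.85,
  cpuThreshold: number = 0.9,
): string | null {
  const pct = (fraction: number) => (fraction * 100).toFixed(1);
  const limit = (fraction: number) => Math.round(fraction * 100);

  if (usage.memory > memoryThreshold) {
    return `Memory at ${pct(usage.memory)}% (threshold: ${limit(memoryThreshold)}%)`;
  }
  if (usage.cpu > cpuThreshold) {
    return `CPU at ${pct(usage.cpu)}% (threshold: ${limit(cpuThreshold)}%)`;
  }
  return null;
}
```

A worker loop could call this before each claim and skip claiming whenever the result is not `null`, logging the returned string as the backoff reason.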
### UI Display

The Workers Dashboard shows:

- **Tasks Column**: `2/3 tasks` (active/max concurrent)
- **Resources Column**: Memory % and CPU % with color coding
  - Green: < 50%
  - Yellow: 50-74%
  - Amber: 75-89%
  - Red: 90%+
- **Backing Off**: Orange warning badge when worker is in backoff state

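The color bands above map directly onto a threshold function. A minimal TypeScript sketch, assuming a hypothetical `usageColor()` helper that takes a percentage on a 0-100 scale; the actual component in `WorkersDashboard.tsx` may be structured differently.

```typescript
type UsageColor = "green" | "yellow" | "amber" | "red";

// Maps a usage percentage (0-100) to the documented color band.
function usageColor(percent: number): UsageColor {
  if (percent >= 90) return "red";
  if (percent >= 75) return "amber";
  if (percent >= 50) return "yellow";
  return "green";
}
```

Checking the highest band first keeps each boundary value (50, 75, 90) in the band the list assigns it to.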
### Task Count Badge Details

```
┌─────────────────────────────────────────────┐
│ Worker: "MyWorker"                          │
│ Tasks: 2/3 tasks  #456, #457                │
│ Resources: 🧠 65%  💻 45%                   │
│ Status: ● Active                            │
└─────────────────────────────────────────────┘
```

### Best Practices

1. **Start Conservative**: Use `MAX_CONCURRENT_TASKS=3` initially
2. **Monitor Resources**: Watch for frequent backoffs in logs
3. **Tune Per Workload**: I/O-bound tasks benefit from higher concurrency
4. **Scale Horizontally**: Add more pods rather than cranking concurrency too high

### Code References

| File | Purpose |
|------|---------|
| `src/tasks/task-worker.ts:68-71` | Concurrency environment variables |
| `src/tasks/task-worker.ts:104-111` | ResourceStats interface |
| `src/tasks/task-worker.ts:149-179` | getResourceStats() method |
| `src/tasks/task-worker.ts:184-196` | shouldBackOff() method |
| `src/tasks/task-worker.ts:462-516` | mainLoop() with concurrent claiming |
| `src/routes/worker-registry.ts:148-195` | Heartbeat endpoint handling |
| `cannaiq/src/pages/WorkersDashboard.tsx:233-305` | UI components for resources |

## Monitoring

### Logs

Workers log to stdout:
```
[TaskWorker] Starting worker worker-product_resync-a1b2c3d4 for role: product_resync
[TaskWorker] Claimed task 123 (product_resync) for dispensary 456
[TaskWorker] Task 123 completed successfully
```

### Health Check

Check if workers are active:
```sql
SELECT worker_id, role, COUNT(*), MAX(last_heartbeat_at)
FROM worker_tasks
WHERE last_heartbeat_at > NOW() - INTERVAL '5 minutes'
GROUP BY worker_id, role;
```

### Metrics

```sql
-- Tasks by status
SELECT status, COUNT(*) FROM worker_tasks GROUP BY status;

-- Tasks by role
SELECT role, status, COUNT(*) FROM worker_tasks GROUP BY role, status;

-- Average duration by role
SELECT role, AVG(EXTRACT(EPOCH FROM (completed_at - started_at))) as avg_seconds
FROM worker_tasks
WHERE status = 'completed' AND completed_at > NOW() - INTERVAL '24 hours'
GROUP BY role;
```