chore: Clean up deprecated code and docs

- Move deprecated directories to src/_deprecated/:
  - hydration/ (old pipeline approach)
  - scraper-v2/ (old Puppeteer scraper)
  - canonical-hydration/ (merged into tasks)
  - Unused services: availability, crawler-logger, geolocation, etc
  - Unused utils: age-gate-playwright, HomepageValidator, stealthBrowser

- Archive outdated docs to docs/_archive/:
  - ANALYTICS_RUNBOOK.md
  - ANALYTICS_V2_EXAMPLES.md
  - BRAND_INTELLIGENCE_API.md
  - CRAWL_PIPELINE.md
  - TASK_WORKFLOW_2024-12-10.md
  - WORKER_TASK_ARCHITECTURE.md
  - ORGANIC_SCRAPING_GUIDE.md

- Add docs/CODEBASE_MAP.md as single source of truth
- Add warning files to deprecated/archived directories
- Slim down CLAUDE.md to essential rules only

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
This commit is contained in:
Kelly
2025-12-11 22:17:40 -07:00
parent f2864bd2ad
commit a35976b9e9
61 changed files with 856 additions and 1281 deletions

@@ -0,0 +1,712 @@
# CannaiQ Analytics Runbook
Phase 3: Analytics Engine - Complete Implementation Guide
## Overview
The CannaiQ Analytics Engine provides real-time insights into cannabis market data across price trends, brand penetration, category performance, store changes, and competitive positioning.
## Architecture
```
┌─────────────────────────────────────────────────────────────────┐
│                            API Layer                            │
│                       /api/az/analytics/*                       │
└─────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────┐
│                       Analytics Services                        │
│  ┌──────────────┐  ┌────────────────┐  ┌──────────────────────┐ │
│  │PriceTrend    │  │Penetration     │  │CategoryAnalytics     │ │
│  │Service       │  │Service         │  │Service               │ │
│  └──────────────┘  └────────────────┘  └──────────────────────┘ │
│  ┌──────────────┐  ┌────────────────┐  ┌──────────────────────┐ │
│  │StoreChange   │  │BrandOpportunity│  │AnalyticsCache        │ │
│  │Service       │  │Service         │  │(15-min TTL)          │ │
│  └──────────────┘  └────────────────┘  └──────────────────────┘ │
└─────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────┐
│                        Canonical Tables                         │
│ store_products │ store_product_snapshots │ brands │ categories  │
│ dispensaries │ brand_snapshots │ category_snapshots             │
└─────────────────────────────────────────────────────────────────┘
```
## Services
### 1. PriceTrendService
Provides time-series price analytics.
**Key Methods:**
| Method | Description |
|--------|-------------|
| `getProductPriceTrend(productId, storeId?, days)` | Price history for a product |
| `getBrandPriceTrend(brandName, filters)` | Average prices for a brand |
| `getCategoryPriceTrend(category, filters)` | Category-level price trends |
| `getPriceSummary(filters)` | 7d/30d/90d price averages |
| `detectPriceCompression(category, state?)` | Price war detection |
| `getGlobalPriceStats()` | Market-wide pricing overview |
**Filters:**
```typescript
interface PriceFilters {
  storeId?: number;
  brandName?: string;
  category?: string;
  state?: string;
  days?: number; // default: 30
}
```
**Price Compression Detection:**
- Calculates standard deviation of prices within category
- Returns compression score 0-100 (higher = more compressed)
- Identifies brands converging toward mean price
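
The runbook does not spell out the scoring formula, so the following is only a sketch under an assumed definition: the score is `100 × (1 − stddev / mean)` of the category's prices, clamped to 0-100. The function name `compressionScore` and the formula are illustrative assumptions, not the service's actual implementation.

```typescript
// Hypothetical sketch: NOT the actual PriceTrendService implementation.
// Assumed formula: score = 100 * (1 - stddev / mean), clamped to [0, 100].
function compressionScore(prices: number[]): number | null {
  if (prices.length === 0) return null;
  const mean = prices.reduce((sum, p) => sum + p, 0) / prices.length;
  if (mean <= 0) return null;
  // Population variance of the prices within the category.
  const variance =
    prices.reduce((sum, p) => sum + (p - mean) ** 2, 0) / prices.length;
  const stdDev = Math.sqrt(variance);
  const score = 100 * (1 - stdDev / mean);
  return Math.min(100, Math.max(0, score));
}
```

Under this definition, identical prices score 100, and the score falls as prices spread out relative to the mean.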
---
### 2. PenetrationService
Tracks brand market presence across stores and states.
**Key Methods:**
| Method | Description |
|--------|-------------|
| `getBrandPenetration(brandName, filters)` | Store count, SKU count, coverage |
| `getTopBrandsByPenetration(limit, filters)` | Leaderboard of dominant brands |
| `getPenetrationTrend(brandName, days)` | Historical penetration growth |
| `getShelfShareByCategory(brandName)` | % of shelf per category |
| `getBrandPresenceByState(brandName)` | Multi-state presence map |
| `getStoresCarryingBrand(brandName)` | List of stores carrying brand |
| `getPenetrationHeatmap(brandName?)` | Geographic distribution |
**Penetration Calculation:**
```
Penetration % = (Stores with Brand / Total Stores in Market) × 100
```
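
As a minimal sketch of the formula above (the function name and the zero-market guard are assumptions for illustration):

```typescript
// Penetration % = (stores with brand / total stores in market) × 100
function penetrationPercent(storesWithBrand: number, totalStores: number): number {
  if (totalStores <= 0) return 0; // assumption: an empty market reports 0%
  return (storesWithBrand / totalStores) * 100;
}

// Illustrative values: 32 of 145 stores carry the brand.
console.log(penetrationPercent(32, 145).toFixed(1) + "%"); // prints "22.1%"
```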
---
### 3. CategoryAnalyticsService
Analyzes category performance and trends.
**Key Methods:**
| Method | Description |
|--------|-------------|
| `getCategorySummary(category?, filters)` | SKU count, avg price, stores |
| `getCategoryGrowth(days, filters)` | 7d/30d/90d growth rates |
| `getCategoryGrowthTrend(category, days)` | Time-series category growth |
| `getCategoryHeatmap(metric, periods)` | Visual heatmap data |
| `getTopMovers(limit, days)` | Fastest growing/declining categories |
| `getSubcategoryBreakdown(category)` | Drill-down into subcategories |
**Time Windows:**
- 7 days: Short-term volatility
- 30 days: Monthly trends
- 90 days: Seasonal patterns
---
### 4. StoreChangeService
Tracks product adds/drops, brand changes, and price movements per store.
**Key Methods:**
| Method | Description |
|--------|-------------|
| `getStoreChangeSummary(storeId)` | Overview of recent changes |
| `getStoreChangeEvents(storeId, filters)` | Event log (add, drop, price, OOS) |
| `getNewBrands(storeId, days)` | Brands added to store |
| `getLostBrands(storeId, days)` | Brands dropped from store |
| `getProductChanges(storeId, type, days)` | Filtered product changes |
| `getCategoryLeaderboard(category, limit)` | Top stores for category |
| `getMostActiveStores(days, limit)` | Stores with most changes |
| `compareStores(store1, store2)` | Side-by-side store comparison |
**Event Types:**
- `added` - New product appeared
- `discontinued` - Product removed
- `price_drop` - Price decreased
- `price_increase` - Price increased
- `restocked` - OOS → In Stock
- `out_of_stock` - In Stock → OOS
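
To make the event types concrete, here is a hedged sketch of how two consecutive observations of a product could be mapped to them. The `ProductState` shape and `classifyChange` function are hypothetical; the actual service may derive events differently (for example, directly in SQL).

```typescript
type StoreEventType =
  | "added"
  | "discontinued"
  | "price_drop"
  | "price_increase"
  | "restocked"
  | "out_of_stock";

// Hypothetical snapshot shape; null means the product is absent from the crawl.
interface ProductState {
  price: number | null;
  inStock: boolean;
}

function classifyChange(
  prev: ProductState | null,
  curr: ProductState | null
): StoreEventType[] {
  if (prev === null && curr === null) return [];
  if (prev === null) return ["added"];
  if (curr === null) return ["discontinued"];

  // A single product can produce both a price event and a stock event.
  const events: StoreEventType[] = [];
  if (prev.price !== null && curr.price !== null) {
    if (curr.price < prev.price) events.push("price_drop");
    if (curr.price > prev.price) events.push("price_increase");
  }
  if (!prev.inStock && curr.inStock) events.push("restocked");
  if (prev.inStock && !curr.inStock) events.push("out_of_stock");
  return events;
}
```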
---
### 5. BrandOpportunityService
Competitive intelligence and opportunity identification.
**Key Methods:**
| Method | Description |
|--------|-------------|
| `getBrandOpportunity(brandName)` | Full opportunity analysis |
| `getMarketPositionSummary(brandName)` | Market position vs competitors |
| `getAlerts(filters)` | Analytics-generated alerts |
| `markAlertsRead(alertIds)` | Mark alerts as read |
**Opportunity Analysis Includes:**
- White space stores (potential targets)
- Competitive threats (brands gaining share)
- Pricing opportunities (underpriced vs market)
- Missing SKU recommendations
---
### 6. AnalyticsCache
In-memory caching with database fallback.
**Configuration:**
```typescript
const cache = new AnalyticsCache(pool, {
  defaultTtlMinutes: 15,
});
```
**Usage Pattern:**
```typescript
const data = await cache.getOrCompute(cacheKey, async () => {
  // Expensive query here
  return result;
});
```
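
For readers unfamiliar with the get-or-compute pattern, the following is a minimal in-memory sketch of the idea. `TtlCacheSketch` is a hypothetical class written for illustration; it is not the real `AnalyticsCache`, which also takes a connection pool and has a database fallback that this sketch does not model.

```typescript
// Hypothetical in-memory sketch; the database fallback is not modeled here.
interface CacheEntry<T> {
  value: T;
  expiresAt: number; // epoch milliseconds
}

class TtlCacheSketch<T> {
  private entries = new Map<string, CacheEntry<T>>();

  constructor(
    private ttlMinutes: number = 15,
    private now: () => number = () => Date.now() // injectable clock for tests
  ) {}

  async getOrCompute(key: string, compute: () => Promise<T>): Promise<T> {
    const hit = this.entries.get(key);
    if (hit !== undefined && hit.expiresAt > this.now()) {
      return hit.value; // fresh entry: skip the expensive query
    }
    const value = await compute();
    this.entries.set(key, {
      value,
      expiresAt: this.now() + this.ttlMinutes * 60000,
    });
    return value;
  }
}
```

Note that in this simplified version two concurrent misses on the same key both run `compute`; a production cache may want to share the in-flight promise instead.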
**Cache Management:**
- `GET /api/az/analytics/cache/stats` - View cache stats
- `POST /api/az/analytics/cache/clear?pattern=price*` - Clear by pattern
- Auto-cleanup of expired entries every 5 minutes
---
## API Endpoints Reference
### Price Endpoints
```bash
# Product price trend (last 30 days)
GET /api/az/analytics/price/product/12345?days=30
# Brand price trend with filters
GET /api/az/analytics/price/brand/Cookies?storeId=101&category=Flower&days=90
# Category price trend
GET /api/az/analytics/price/category/Vaporizers?state=AZ
# Price summary (7d/30d/90d)
GET /api/az/analytics/price/summary?brand=Stiiizy&state=AZ
# Detect price wars
GET /api/az/analytics/price/compression/Flower?state=AZ
# Global stats
GET /api/az/analytics/price/global
```
### Penetration Endpoints
```bash
# Brand penetration
GET /api/az/analytics/penetration/brand/Cookies
# Top brands leaderboard
GET /api/az/analytics/penetration/top?limit=20&state=AZ&category=Flower
# Penetration trend
GET /api/az/analytics/penetration/trend/Cookies?days=90
# Shelf share by category
GET /api/az/analytics/penetration/shelf-share/Cookies
# Multi-state presence
GET /api/az/analytics/penetration/by-state/Cookies
# Stores carrying brand
GET /api/az/analytics/penetration/stores/Cookies
# Heatmap data
GET /api/az/analytics/penetration/heatmap?brand=Cookies
```
### Category Endpoints
```bash
# Category summary
GET /api/az/analytics/category/summary?category=Flower&state=AZ
# Category growth (7d/30d/90d)
GET /api/az/analytics/category/growth?days=30&state=AZ
# Category trend
GET /api/az/analytics/category/trend/Concentrates?days=90
# Heatmap
GET /api/az/analytics/category/heatmap?metric=growth&periods=12
# Top movers (growing/declining)
GET /api/az/analytics/category/top-movers?limit=5&days=30
# Subcategory breakdown
GET /api/az/analytics/category/Edibles/subcategories
```
### Store Endpoints
```bash
# Store change summary
GET /api/az/analytics/store/101/summary
# Event log
GET /api/az/analytics/store/101/events?type=price_drop&days=7&limit=50
# New brands
GET /api/az/analytics/store/101/brands/new?days=30
# Lost brands
GET /api/az/analytics/store/101/brands/lost?days=30
# Product changes by type
GET /api/az/analytics/store/101/products/changes?type=added&days=7
# Category leaderboard
GET /api/az/analytics/store/leaderboard/Flower?limit=20
# Most active stores
GET /api/az/analytics/store/most-active?days=7&limit=10
# Compare two stores
GET /api/az/analytics/store/compare?store1=101&store2=102
```
### Brand Opportunity Endpoints
```bash
# Full opportunity analysis
GET /api/az/analytics/brand/Cookies/opportunity
# Market position summary
GET /api/az/analytics/brand/Cookies/position
# Get alerts
GET /api/az/analytics/alerts?brand=Cookies&type=competitive&unreadOnly=true
# Mark alerts read
POST /api/az/analytics/alerts/mark-read
Body: { "alertIds": [1, 2, 3] }
```
### Maintenance Endpoints
```bash
# Capture daily snapshots (run by scheduler)
POST /api/az/analytics/snapshots/capture
# Cache statistics
GET /api/az/analytics/cache/stats
# Clear cache (admin)
POST /api/az/analytics/cache/clear?pattern=price*
```
---
## Incremental Computation
Analytics are designed to serve real-time queries without full recomputation.
### Snapshot Strategy
1. **Raw Data**: `store_products` (current state)
2. **Historical**: `store_product_snapshots` (time-series)
3. **Aggregated**: `brand_snapshots`, `category_snapshots` (daily rollups)
### Window Calculations
```sql
-- 7-day window
WHERE crawled_at >= NOW() - INTERVAL '7 days'
-- 30-day window
WHERE crawled_at >= NOW() - INTERVAL '30 days'
-- 90-day window
WHERE crawled_at >= NOW() - INTERVAL '90 days'
```
### Materialized Views (Optional)
For heavy queries, create materialized views:
```sql
CREATE MATERIALIZED VIEW mv_brand_daily_metrics AS
SELECT
  DATE(sps.captured_at) AS date,
  sp.brand_id,
  COUNT(DISTINCT sp.dispensary_id) AS store_count,
  COUNT(*) AS sku_count,
  AVG(sp.price_rec) AS avg_price
FROM store_product_snapshots sps
JOIN store_products sp ON sps.store_product_id = sp.id
WHERE sps.captured_at >= NOW() - INTERVAL '90 days'
GROUP BY DATE(sps.captured_at), sp.brand_id;

-- REFRESH ... CONCURRENTLY requires a unique index on the view
-- (the index name here is only a suggestion)
CREATE UNIQUE INDEX idx_mv_brand_daily_metrics
  ON mv_brand_daily_metrics (date, brand_id);

-- Refresh daily
REFRESH MATERIALIZED VIEW CONCURRENTLY mv_brand_daily_metrics;
```
---
## Scheduled Jobs
### Daily Snapshot Capture
Trigger via cron or scheduler:
```bash
curl -X POST http://localhost:3010/api/az/analytics/snapshots/capture
```
This calls:
- `capture_brand_snapshots()` - Captures brand metrics
- `capture_category_snapshots()` - Captures category metrics
### Cache Cleanup
Automatic cleanup every 5 minutes via in-memory timer.
For manual cleanup:
```bash
curl -X POST http://localhost:3010/api/az/analytics/cache/clear
```
---
## Extending Analytics (Future Phases)
### Phase 6: Intelligence Engine
- Automated alert generation
- Recommendation engine
- Price prediction
### Phase 7: Orders Integration
- Sales velocity analytics
- Reorder predictions
- Inventory turnover
### Phase 8: Advanced ML
- Demand forecasting
- Price elasticity modeling
- Customer segmentation
---
## Troubleshooting
### Common Issues
**1. Slow queries**
- Check cache stats: `GET /api/az/analytics/cache/stats`
- Increase cache TTL if data doesn't need real-time freshness
- Add indexes on frequently filtered columns
**2. Empty results**
- Verify data exists in source tables
- Check filter parameters (case-sensitive brand names)
- Verify state codes are valid
**3. Stale data**
- Run snapshot capture: `POST /api/az/analytics/snapshots/capture`
- Clear cache: `POST /api/az/analytics/cache/clear`
### Debugging
Enable query logging by setting `ANALYTICS_DEBUG=true` in the environment; the flag is read in the service constructor:
```typescript
// In service constructor
this.debug = process.env.ANALYTICS_DEBUG === 'true';
```
---
## Data Contracts
### Price Trend Response
```typescript
interface PriceTrend {
  productId?: number;
  storeId?: number;
  brandName?: string;
  category?: string;
  dataPoints: Array<{
    date: string;
    minPrice: number | null;
    maxPrice: number | null;
    avgPrice: number | null;
    wholesalePrice: number | null;
    sampleSize: number;
  }>;
  summary: {
    currentAvg: number | null;
    previousAvg: number | null;
    changePercent: number | null;
    trend: 'up' | 'down' | 'stable';
    volatilityScore: number | null;
  };
}
```
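
The contract does not say how `changePercent` and `trend` are derived. One plausible reading, sketched below, is a percent change between the two averages with a small "stable" band. The helper name and the ±1% band are assumptions, not documented behavior.

```typescript
type Trend = "up" | "down" | "stable";

// Hypothetical helper; the ±1% "stable" band is an assumed threshold.
function summarizeChange(
  currentAvg: number | null,
  previousAvg: number | null,
  stableBandPercent: number = 1
): { changePercent: number | null; trend: Trend } {
  // Without both averages (or with a zero baseline) no change can be computed.
  if (currentAvg === null || previousAvg === null || previousAvg === 0) {
    return { changePercent: null, trend: "stable" };
  }
  const changePercent = ((currentAvg - previousAvg) / previousAvg) * 100;
  let trend: Trend = "stable";
  if (changePercent > stableBandPercent) trend = "up";
  else if (changePercent < -stableBandPercent) trend = "down";
  return { changePercent, trend };
}
```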
### Brand Penetration Response
```typescript
interface BrandPenetration {
  brandName: string;
  totalStores: number;
  storesWithBrand: number;
  penetrationPercent: number;
  skuCount: number;
  avgPrice: number | null;
  priceRange: { min: number; max: number } | null;
  topCategories: Array<{ category: string; count: number }>;
  stateBreakdown?: Array<{ state: string; storeCount: number }>;
}
```
### Category Growth Response
```typescript
interface CategoryGrowth {
  category: string;
  currentCount: number;
  previousCount: number;
  growthPercent: number;
  growthTrend: 'up' | 'down' | 'stable';
  avgPrice: number | null;
  priceChange: number | null;
  topBrands: Array<{ brandName: string; count: number }>;
}
```
---
## Files Reference
| File | Purpose |
|------|---------|
| `src/dutchie-az/services/analytics/price-trends.ts` | Price analytics |
| `src/dutchie-az/services/analytics/penetration.ts` | Brand penetration |
| `src/dutchie-az/services/analytics/category-analytics.ts` | Category metrics |
| `src/dutchie-az/services/analytics/store-changes.ts` | Store event tracking |
| `src/dutchie-az/services/analytics/brand-opportunity.ts` | Competitive intel |
| `src/dutchie-az/services/analytics/cache.ts` | Caching layer |
| `src/dutchie-az/services/analytics/index.ts` | Module exports |
| `src/dutchie-az/routes/analytics.ts` | API routes (680 LOC) |
| `src/multi-state/state-query-service.ts` | Cross-state analytics |
---
## Analytics V2: Rec/Med State Segmentation
Phase 3 enhancement: analytics segmented by recreational vs. medical-only states.
### V2 API Endpoints
All V2 endpoints are prefixed with `/api/analytics/v2`
#### V2 Price Analytics
```bash
# Price trends for a specific product
GET /api/analytics/v2/price/product/12345?window=30d
# Price by category and state (with rec/med segmentation)
GET /api/analytics/v2/price/category/Flower?state=AZ
# Price by brand and state
GET /api/analytics/v2/price/brand/Cookies?state=AZ
# Most volatile products
GET /api/analytics/v2/price/volatile?window=30d&limit=50&state=AZ
# Rec vs Med price comparison by category
GET /api/analytics/v2/price/rec-vs-med?category=Flower
```
#### V2 Brand Penetration
```bash
# Brand penetration metrics with state breakdown
GET /api/analytics/v2/brand/Cookies/penetration?window=30d
# Brand market position within categories
GET /api/analytics/v2/brand/Cookies/market-position?category=Flower&state=AZ
# Brand presence in rec vs med-only states
GET /api/analytics/v2/brand/Cookies/rec-vs-med
# Top brands by penetration
GET /api/analytics/v2/brand/top?limit=25&state=AZ
# Brands expanding or contracting
GET /api/analytics/v2/brand/expansion-contraction?window=30d&limit=25
```
#### V2 Category Analytics
```bash
# Category growth metrics
GET /api/analytics/v2/category/Flower/growth?window=30d
# Category growth trend over time
GET /api/analytics/v2/category/Flower/trend?window=30d
# Top brands in category
GET /api/analytics/v2/category/Flower/top-brands?limit=25&state=AZ
# All categories with metrics
GET /api/analytics/v2/category/all?state=AZ&limit=50
# Rec vs Med category comparison
GET /api/analytics/v2/category/rec-vs-med?category=Flower
# Fastest growing categories
GET /api/analytics/v2/category/fastest-growing?window=30d&limit=25
```
#### V2 Store Analytics
```bash
# Store change summary
GET /api/analytics/v2/store/101/summary?window=30d
# Product change events
GET /api/analytics/v2/store/101/events?window=7d&limit=100
# Store inventory composition
GET /api/analytics/v2/store/101/inventory
# Store price positioning vs market
GET /api/analytics/v2/store/101/price-position
# Most active stores by changes
GET /api/analytics/v2/store/most-active?window=7d&limit=25&state=AZ
```
#### V2 State Analytics
```bash
# State market summary
GET /api/analytics/v2/state/AZ/summary
# All states with coverage metrics
GET /api/analytics/v2/state/all
# Legal state breakdown (rec, med-only, no program)
GET /api/analytics/v2/state/legal-breakdown
# Rec vs Med pricing by category
GET /api/analytics/v2/state/rec-vs-med-pricing?category=Flower
# States with coverage gaps
GET /api/analytics/v2/state/coverage-gaps
# Cross-state pricing comparison
GET /api/analytics/v2/state/price-comparison
```
### V2 Services Architecture
```
src/services/analytics/
├── index.ts # Exports all V2 services
├── types.ts # Shared type definitions
├── PriceAnalyticsService.ts # Price trends and volatility
├── BrandPenetrationService.ts # Brand market presence
├── CategoryAnalyticsService.ts # Category growth analysis
├── StoreAnalyticsService.ts # Store change tracking
└── StateAnalyticsService.ts # State-level analytics
src/routes/analytics-v2.ts # V2 API route handlers
```
### Key V2 Features
1. **Rec/Med State Segmentation**: All analytics can be filtered and compared by legal status
2. **State Coverage Gaps**: Identify legal states with missing or stale data
3. **Cross-State Pricing**: Compare prices across recreational and medical-only markets
4. **Brand Footprint Analysis**: Track brand presence in rec vs med states
5. **Category Comparison**: Compare category performance by legal status
### V2 Migration Path
1. Run migration 052 for state cannabis flags:
```bash
psql "$DATABASE_URL" -f migrations/052_add_state_cannabis_flags.sql
```
2. Run migration 053 for analytics indexes:
```bash
psql "$DATABASE_URL" -f migrations/053_analytics_indexes.sql
```
3. Restart backend to pick up new routes
### V2 Response Examples
**Rec vs Med Price Comparison:**
```json
{
"category": "Flower",
"recreational": {
"state_count": 15,
"product_count": 12500,
"avg_price": 35.50,
"median_price": 32.00
},
"medical_only": {
"state_count": 8,
"product_count": 5200,
"avg_price": 42.00,
"median_price": 40.00
},
"price_diff_percent": -15.48
}
```
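
The `price_diff_percent` value in the sample payload above is consistent with the recreational average expressed relative to the medical-only average. The sketch below encodes that inferred formula; it is a reading of the sample numbers, not a confirmed definition, and the function name is hypothetical.

```typescript
// Inferred from the sample payloads, not confirmed by the source:
// price_diff_percent = (rec_avg - med_avg) / med_avg * 100, rounded to 2 places.
function priceDiffPercent(recAvg: number, medAvg: number): number | null {
  if (medAvg === 0) return null; // no baseline to compare against
  const diff = ((recAvg - medAvg) / medAvg) * 100;
  return Math.round(diff * 100) / 100;
}

console.log(priceDiffPercent(35.5, 42)); // -15.48
```

A negative value therefore means recreational prices are lower than medical-only prices for that category.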
**Legal State Breakdown:**
```json
{
"recreational_states": {
"count": 24,
"dispensary_count": 850,
"product_count": 125000,
"states": [
{ "code": "CA", "name": "California", "dispensary_count": 250 },
{ "code": "CO", "name": "Colorado", "dispensary_count": 150 }
]
},
"medical_only_states": {
"count": 18,
"dispensary_count": 320,
"product_count": 45000,
"states": [
{ "code": "FL", "name": "Florida", "dispensary_count": 120 }
]
},
"no_program_states": {
"count": 9,
"states": [
{ "code": "ID", "name": "Idaho" }
]
}
}
```
---
*Phase 3 Analytics Engine - Fully Implemented*
*V2 Rec/Med State Analytics - Added December 2024*

@@ -0,0 +1,594 @@
# Analytics V2 API Examples
## Overview
All endpoints are prefixed with `/api/analytics/v2`
### Filtering Options
**Time Windows:**
- `?window=7d` - Last 7 days
- `?window=30d` - Last 30 days (default)
- `?window=90d` - Last 90 days
**Legal Type Filtering:**
- `?legalType=recreational` - Recreational states only
- `?legalType=medical_only` - Medical-only states (not recreational)
- `?legalType=no_program` - States with no cannabis program
---
## 1. Price Analytics
### GET /price/product/:id
Get price trends for a specific store product.
**Request:**
```bash
GET /api/analytics/v2/price/product/12345?window=30d
```
**Response:**
```json
{
"store_product_id": 12345,
"product_name": "Blue Dream 3.5g",
"brand_name": "Cookies",
"category": "Flower",
"dispensary_id": 101,
"dispensary_name": "Green Leaf Dispensary",
"state_code": "AZ",
"data_points": [
{
"date": "2024-11-06",
"price_rec": 45.00,
"price_med": 40.00,
"price_rec_special": null,
"price_med_special": null,
"is_on_special": false
},
{
"date": "2024-11-07",
"price_rec": 42.00,
"price_med": 38.00,
"price_rec_special": null,
"price_med_special": null,
"is_on_special": false
}
],
"summary": {
"current_price": 42.00,
"min_price": 40.00,
"max_price": 48.00,
"avg_price": 43.50,
"price_change_count": 3,
"volatility_percent": 8.2
}
}
```
### GET /price/rec-vs-med
Get recreational vs medical-only price comparison by category.
**Request:**
```bash
GET /api/analytics/v2/price/rec-vs-med?category=Flower
```
**Response:**
```json
[
{
"category": "Flower",
"rec_avg": 38.50,
"rec_median": 35.00,
"med_avg": 42.00,
"med_median": 40.00
},
{
"category": "Concentrates",
"rec_avg": 45.00,
"rec_median": 42.00,
"med_avg": 48.00,
"med_median": 45.00
}
]
```
---
## 2. Brand Analytics
### GET /brand/:name/penetration
Get brand penetration metrics with state breakdown.
**Request:**
```bash
GET /api/analytics/v2/brand/Cookies/penetration?window=30d
```
**Response:**
```json
{
"brand_name": "Cookies",
"total_dispensaries": 125,
"total_skus": 450,
"avg_skus_per_dispensary": 3.6,
"states_present": ["AZ", "CA", "CO", "NV", "MI"],
"state_breakdown": [
{
"state_code": "CA",
"state_name": "California",
"legal_type": "recreational",
"dispensary_count": 45,
"sku_count": 180,
"avg_skus_per_dispensary": 4.0,
"market_share_percent": 12.5
},
{
"state_code": "AZ",
"state_name": "Arizona",
"legal_type": "recreational",
"dispensary_count": 32,
"sku_count": 128,
"avg_skus_per_dispensary": 4.0,
"market_share_percent": 15.2
}
],
"penetration_trend": [
{
"date": "2024-11-01",
"dispensary_count": 120,
"new_dispensaries": 0,
"dropped_dispensaries": 0
},
{
"date": "2024-11-08",
"dispensary_count": 123,
"new_dispensaries": 3,
"dropped_dispensaries": 0
},
{
"date": "2024-11-15",
"dispensary_count": 125,
"new_dispensaries": 2,
"dropped_dispensaries": 0
}
]
}
```
### GET /brand/:name/rec-vs-med
Get brand presence in recreational vs medical-only states.
**Request:**
```bash
GET /api/analytics/v2/brand/Cookies/rec-vs-med
```
**Response:**
```json
{
"brand_name": "Cookies",
"rec_states_count": 4,
"rec_states": ["AZ", "CA", "CO", "NV"],
"rec_dispensary_count": 110,
"rec_avg_skus": 3.8,
"med_only_states_count": 2,
"med_only_states": ["FL", "OH"],
"med_only_dispensary_count": 15,
"med_only_avg_skus": 2.5
}
```
---
## 3. Category Analytics
### GET /category/:name/growth
Get category growth metrics with state breakdown.
**Request:**
```bash
GET /api/analytics/v2/category/Flower/growth?window=30d
```
**Response:**
```json
{
"category": "Flower",
"current_sku_count": 5200,
"current_dispensary_count": 320,
"avg_price": 38.50,
"growth_data": [
{
"date": "2024-11-01",
"sku_count": 4800,
"dispensary_count": 310,
"avg_price": 39.00
},
{
"date": "2024-11-15",
"sku_count": 5000,
"dispensary_count": 315,
"avg_price": 38.75
},
{
"date": "2024-12-01",
"sku_count": 5200,
"dispensary_count": 320,
"avg_price": 38.50
}
],
"state_breakdown": [
{
"state_code": "CA",
"state_name": "California",
"legal_type": "recreational",
"sku_count": 2100,
"dispensary_count": 145,
"avg_price": 36.00
},
{
"state_code": "AZ",
"state_name": "Arizona",
"legal_type": "recreational",
"sku_count": 950,
"dispensary_count": 85,
"avg_price": 40.00
}
]
}
```
### GET /category/rec-vs-med
Get category comparison between recreational and medical-only states.
**Request:**
```bash
GET /api/analytics/v2/category/rec-vs-med
```
**Response:**
```json
[
{
"category": "Flower",
"recreational": {
"state_count": 15,
"dispensary_count": 650,
"sku_count": 12500,
"avg_price": 35.50,
"median_price": 32.00
},
"medical_only": {
"state_count": 8,
"dispensary_count": 220,
"sku_count": 4200,
"avg_price": 42.00,
"median_price": 40.00
},
"price_diff_percent": -15.48
},
{
"category": "Concentrates",
"recreational": {
"state_count": 15,
"dispensary_count": 600,
"sku_count": 8500,
"avg_price": 42.00,
"median_price": 40.00
},
"medical_only": {
"state_count": 8,
"dispensary_count": 200,
"sku_count": 3100,
"avg_price": 48.00,
"median_price": 45.00
},
"price_diff_percent": -12.50
}
]
```
---
## 4. Store Analytics
### GET /store/:id/summary
Get change summary for a store over a time window.
**Request:**
```bash
GET /api/analytics/v2/store/101/summary?window=30d
```
**Response:**
```json
{
"dispensary_id": 101,
"dispensary_name": "Green Leaf Dispensary",
"state_code": "AZ",
"window": "30d",
"products_added": 45,
"products_dropped": 12,
"brands_added": ["Alien Labs", "Connected"],
"brands_dropped": ["House Brand"],
"price_changes": 156,
"avg_price_change_percent": 3.2,
"stock_in_events": 89,
"stock_out_events": 34,
"current_product_count": 512,
"current_in_stock_count": 478
}
```
### GET /store/:id/events
Get recent product change events for a store.
**Request:**
```bash
GET /api/analytics/v2/store/101/events?window=7d&limit=50
```
**Response:**
```json
[
{
"store_product_id": 12345,
"product_name": "Blue Dream 3.5g",
"brand_name": "Cookies",
"category": "Flower",
"event_type": "price_change",
"event_date": "2024-12-05T14:30:00.000Z",
"old_value": "45.00",
"new_value": "42.00"
},
{
"store_product_id": 12346,
"product_name": "OG Kush 1g",
"brand_name": "Alien Labs",
"category": "Flower",
"event_type": "added",
"event_date": "2024-12-04T10:00:00.000Z",
"old_value": null,
"new_value": null
},
{
"store_product_id": 12300,
"product_name": "Sour Diesel Cart",
"brand_name": "Select",
"category": "Vaporizers",
"event_type": "stock_out",
"event_date": "2024-12-03T16:45:00.000Z",
"old_value": "true",
"new_value": "false"
}
]
```
---
## 5. State Analytics
### GET /state/:code/summary
Get market summary for a specific state with rec/med breakdown.
**Request:**
```bash
GET /api/analytics/v2/state/AZ/summary
```
**Response:**
```json
{
"state_code": "AZ",
"state_name": "Arizona",
"legal_status": {
"recreational_legal": true,
"rec_year": 2020,
"medical_legal": true,
"med_year": 2010
},
"coverage": {
"dispensary_count": 145,
"product_count": 18500,
"brand_count": 320,
"category_count": 12,
"snapshot_count": 2450000,
"last_crawl_at": "2024-12-06T02:30:00.000Z"
},
"pricing": {
"avg_price": 42.50,
"median_price": 38.00,
"min_price": 5.00,
"max_price": 250.00
},
"top_categories": [
{ "category": "Flower", "count": 5200 },
{ "category": "Concentrates", "count": 3800 },
{ "category": "Vaporizers", "count": 2950 },
{ "category": "Edibles", "count": 2400 },
{ "category": "Pre-Rolls", "count": 1850 }
],
"top_brands": [
{ "brand": "Cookies", "count": 450 },
{ "brand": "Alien Labs", "count": 380 },
{ "brand": "Connected", "count": 320 },
{ "brand": "Stiiizy", "count": 290 },
{ "brand": "Raw Garden", "count": 275 }
]
}
```
### GET /state/legal-breakdown
Get breakdown by legal status (recreational, medical-only, no program).
**Request:**
```bash
GET /api/analytics/v2/state/legal-breakdown
```
**Response:**
```json
{
"recreational_states": {
"count": 24,
"dispensary_count": 850,
"product_count": 125000,
"snapshot_count": 15000000,
"states": [
{ "code": "CA", "name": "California", "dispensary_count": 250 },
{ "code": "CO", "name": "Colorado", "dispensary_count": 150 },
{ "code": "AZ", "name": "Arizona", "dispensary_count": 145 },
{ "code": "MI", "name": "Michigan", "dispensary_count": 120 }
]
},
"medical_only_states": {
"count": 18,
"dispensary_count": 320,
"product_count": 45000,
"snapshot_count": 5000000,
"states": [
{ "code": "FL", "name": "Florida", "dispensary_count": 120 },
{ "code": "OH", "name": "Ohio", "dispensary_count": 85 },
{ "code": "PA", "name": "Pennsylvania", "dispensary_count": 75 }
]
},
"no_program_states": {
"count": 9,
"states": [
{ "code": "ID", "name": "Idaho" },
{ "code": "WY", "name": "Wyoming" },
{ "code": "KS", "name": "Kansas" }
]
}
}
```
### GET /state/recreational
Get list of recreational state codes.
**Request:**
```bash
GET /api/analytics/v2/state/recreational
```
**Response:**
```json
{
"legal_type": "recreational",
"states": ["AK", "AZ", "CA", "CO", "CT", "DE", "IL", "MA", "MD", "ME", "MI", "MN", "MO", "MT", "NJ", "NM", "NV", "NY", "OH", "OR", "RI", "VA", "VT", "WA"],
"count": 24
}
```
### GET /state/medical-only
Get list of medical-only state codes (not recreational).
**Request:**
```bash
GET /api/analytics/v2/state/medical-only
```
**Response:**
```json
{
"legal_type": "medical_only",
"states": ["AR", "FL", "HI", "LA", "MS", "ND", "NH", "OK", "PA", "SD", "UT", "WV"],
"count": 12
}
```
### GET /state/rec-vs-med-pricing
Get rec vs med price comparison by category.
**Request:**
```bash
GET /api/analytics/v2/state/rec-vs-med-pricing?category=Flower
```
**Response:**
```json
[
{
"category": "Flower",
"recreational": {
"state_count": 15,
"product_count": 12500,
"avg_price": 35.50,
"median_price": 32.00
},
"medical_only": {
"state_count": 8,
"product_count": 5200,
"avg_price": 42.00,
"median_price": 40.00
},
"price_diff_percent": -15.48
}
]
```
---
## How These Endpoints Support Portals
### Brand Portal Use Cases
1. **Track brand penetration**: Use `/brand/:name/penetration` to see how many stores carry the brand
2. **Compare rec vs med markets**: Use `/brand/:name/rec-vs-med` to understand footprint by legal status
3. **Identify expansion opportunities**: Use `/state/coverage-gaps` to find underserved markets
4. **Monitor pricing**: Use `/price/brand/:brand` to track pricing by state
### Buyer Portal Use Cases
1. **Compare stores**: Use `/store/:id/summary` to see activity levels
2. **Track price changes**: Use `/store/:id/events` to monitor competitor pricing
3. **Analyze categories**: Use `/category/:name/growth` to identify trending products
4. **State-level insights**: Use `/state/:code/summary` for market overview
---
## Time Window Filtering
All time-based endpoints support the `window` query parameter:
| Value | Description |
|-------|-------------|
| `7d` | Last 7 days |
| `30d` | Last 30 days (default) |
| `90d` | Last 90 days |
The window affects:
- `store_product_snapshots.captured_at` for historical data
- `store_products.first_seen_at` / `last_seen_at` for product lifecycle
- `crawl_runs.started_at` for crawl-based metrics
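
A minimal sketch of turning the `window` parameter into a cutoff timestamp follows. The helper names are hypothetical, and falling back to `30d` for unrecognized values is an assumption; the real handlers may reject invalid input instead.

```typescript
type AnalyticsWindow = "7d" | "30d" | "90d";

const WINDOW_DAYS: Record<AnalyticsWindow, number> = {
  "7d": 7,
  "30d": 30,
  "90d": 90,
};

// Hypothetical helper: unknown or missing values fall back to the 30d default.
function windowDays(raw: string | undefined): number {
  if (raw === "7d" || raw === "30d" || raw === "90d") return WINDOW_DAYS[raw];
  return WINDOW_DAYS["30d"];
}

// Earliest timestamp included in the window, relative to `now`.
function windowCutoff(raw: string | undefined, now: Date = new Date()): Date {
  const msPerDay = 24 * 60 * 60 * 1000;
  return new Date(now.getTime() - windowDays(raw) * msPerDay);
}
```

The resulting date can then be bound as a query parameter against columns such as `store_product_snapshots.captured_at`.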
---
## Rec/Med Segmentation
All state-level endpoints automatically segment by:
- **Recreational**: `states.recreational_legal = TRUE`
- **Medical-only**: `states.medical_legal = TRUE AND states.recreational_legal = FALSE`
- **No program**: Both flags are FALSE or NULL
This segmentation appears in:
- `legal_type` field in responses
- State breakdown arrays
- Price comparison endpoints
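
The segmentation rules above can be sketched as a small classifier. The `StateLegalFlags` interface and `legalTypeOf` function are hypothetical, and treating NULL as FALSE is an assumption made for this sketch.

```typescript
type LegalType = "recreational" | "medical_only" | "no_program";

// Mirrors the `states` flags named above; the interface itself is hypothetical.
interface StateLegalFlags {
  recreational_legal: boolean | null;
  medical_legal: boolean | null;
}

function legalTypeOf(flags: StateLegalFlags): LegalType {
  // Assumption: NULL is treated as FALSE, so a state with medical_legal = TRUE
  // and recreational_legal = NULL counts as medical-only.
  if (flags.recreational_legal === true) return "recreational";
  if (flags.medical_legal === true) return "medical_only";
  return "no_program";
}
```

This choice matters because in SQL the predicate `recreational_legal = FALSE` does not match rows where the flag is NULL, so a literal translation of the medical-only rule would leave such states in no segment at all.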

@@ -0,0 +1,394 @@
# Brand Intelligence API
## Endpoint
```
GET /api/analytics/v2/brand/:name/intelligence
```
## Query Parameters
| Param | Type | Default | Description |
|-------|------|---------|-------------|
| `window` | `7d\|30d\|90d` | `30d` | Time window for trend calculations |
| `state` | string | - | Filter by state code (e.g., `AZ`) |
| `category` | string | - | Filter by category (e.g., `Flower`) |
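
As an illustration of how a client might assemble a request from these parameters, here is a small sketch. The helper and its `BrandIntelligenceQuery` type are hypothetical; only the path and parameter names come from the endpoint description above.

```typescript
interface BrandIntelligenceQuery {
  window?: "7d" | "30d" | "90d";
  state?: string;
  category?: string;
}

// Hypothetical client-side helper for building the request path.
function brandIntelligenceUrl(
  brandName: string,
  query: BrandIntelligenceQuery = {}
): string {
  const params = new URLSearchParams();
  if (query.window) params.set("window", query.window);
  if (query.state) params.set("state", query.state);
  if (query.category) params.set("category", query.category);

  // Brand names may contain spaces or other reserved characters.
  const path = `/api/analytics/v2/brand/${encodeURIComponent(brandName)}/intelligence`;
  const qs = params.toString();
  return qs ? `${path}?${qs}` : path;
}
```

Omitted parameters are left off the URL so that the server-side defaults (such as `window=30d`) apply.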
## Response Payload Schema
```typescript
interface BrandIntelligenceResult {
  brand_name: string;
  window: '7d' | '30d' | '90d';
  generated_at: string; // ISO timestamp when data was computed
  performance_snapshot: PerformanceSnapshot;
  alerts: Alerts;
  sku_performance: SkuPerformance[];
  retail_footprint: RetailFootprint;
  competitive_landscape: CompetitiveLandscape;
  inventory_health: InventoryHealth;
  promo_performance: PromoPerformance;
}
```
---
## Section 1: Performance Snapshot
Summary cards with key brand metrics.
```typescript
interface PerformanceSnapshot {
  active_skus: number; // Total products in catalog
  total_revenue_30d: number | null; // Estimated from qty × price
  total_stores: number; // Active retail partners
  new_stores_30d: number; // New distribution in window
  market_share: number | null; // % of category SKUs
  avg_wholesale_price: number | null;
  price_position: 'premium' | 'value' | 'competitive';
}
```
**UI Label Mapping:**
| Field | User-Facing Label | Helper Text |
|-------|-------------------|-------------|
| `active_skus` | Active Products | X total in catalog |
| `total_revenue_30d` | Monthly Revenue | Estimated from sales |
| `total_stores` | Retail Distribution | Active retail partners |
| `new_stores_30d` | New Opportunities | X new in last 30 days |
| `market_share` | Category Position | % of category |
| `avg_wholesale_price` | Avg Wholesale | Per unit |
| `price_position` | Pricing Tier | Premium/Value/Market Rate |
---
## Section 2: Alerts
Issues requiring attention.
```typescript
interface Alerts {
  lost_stores_30d_count: number;
  lost_skus_30d_count: number;
  competitor_takeover_count: number;
  avg_oos_duration_days: number | null;
  avg_reorder_lag_days: number | null;
  items: AlertItem[];
}

interface AlertItem {
  type: 'lost_store' | 'delisted_sku' | 'shelf_loss' | 'extended_oos';
  severity: 'critical' | 'warning';
  store_name?: string;
  product_name?: string;
  competitor_brand?: string;
  days_since?: number;
  state_code?: string;
}
```
**UI Label Mapping:**
| Field | User-Facing Label |
|-------|-------------------|
| `lost_stores_30d_count` | Accounts at Risk |
| `lost_skus_30d_count` | Delisted SKUs |
| `competitor_takeover_count` | Shelf Losses |
| `avg_oos_duration_days` | Avg Stockout Length |
| `avg_reorder_lag_days` | Avg Restock Time |
| `severity: critical` | Urgent |
| `severity: warning` | Watch |
---
## Section 3: SKU Performance (Product Velocity)
How fast each SKU sells.
```typescript
interface SkuPerformance {
  store_product_id: number;
  product_name: string;
  category: string | null;
  daily_velocity: number; // Units/day estimate
  velocity_status: 'hot' | 'steady' | 'slow' | 'stale';
  retail_price: number | null;
  on_sale: boolean;
  stores_carrying: number;
  stock_status: 'in_stock' | 'low_stock' | 'out_of_stock';
}
```
**UI Label Mapping:**
| Field | User-Facing Label |
|-------|-------------------|
| `daily_velocity` | Daily Rate |
| `velocity_status` | Momentum |
| `velocity_status: hot` | Hot |
| `velocity_status: steady` | Steady |
| `velocity_status: slow` | Slow |
| `velocity_status: stale` | Stale |
| `retail_price` | Retail Price |
| `on_sale` | Promo (badge) |
**Velocity Thresholds:**
- `hot`: >= 5 units/day
- `steady`: >= 1 unit/day
- `slow`: >= 0.1 units/day
- `stale`: < 0.1 units/day
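The thresholds above translate directly into a classifier. This is a sketch of the mapping, not the service's actual code; the function name is hypothetical.

```typescript
type VelocityStatus = 'hot' | 'steady' | 'slow' | 'stale';

// Maps an estimated units/day figure onto the documented thresholds.
function velocityStatus(unitsPerDay: number): VelocityStatus {
  if (unitsPerDay >= 5) return 'hot';
  if (unitsPerDay >= 1) return 'steady';
  if (unitsPerDay >= 0.1) return 'slow';
  return 'stale';
}
```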
---
## Section 4: Retail Footprint
Store placement and coverage.
```typescript
interface RetailFootprint {
  total_stores: number;
  in_stock_count: number;
  out_of_stock_count: number;
  penetration_by_region: RegionPenetration[];
  whitespace_stores: WhitespaceStore[];
}

interface RegionPenetration {
  state_code: string;
  store_count: number;
  percent_reached: number; // % of state's dispensaries
  in_stock: number;
  out_of_stock: number;
}

interface WhitespaceStore {
  store_id: number;
  store_name: string;
  state_code: string;
  city: string | null;
  category_fit: number; // How many competing brands they carry
  competitor_brands: string[];
}
```
**UI Label Mapping:**
| Field | User-Facing Label |
|-------|-------------------|
| `penetration_by_region` | Market Coverage by Region |
| `percent_reached` | X% reached |
| `in_stock` | X stocked |
| `out_of_stock` | X out |
| `whitespace_stores` | Expansion Opportunities |
| `category_fit` | X fit |
---
## Section 5: Competitive Landscape
Market positioning vs competitors.
```typescript
interface CompetitiveLandscape {
  brand_price_position: 'premium' | 'value' | 'competitive';
  market_share_trend: MarketSharePoint[];
  competitors: Competitor[];
  head_to_head_skus: HeadToHead[];
}

interface MarketSharePoint {
  date: string;
  share_percent: number;
}

interface Competitor {
  brand_name: string;
  store_overlap_percent: number;
  price_position: 'premium' | 'value' | 'competitive';
  avg_price: number | null;
  sku_count: number;
}

interface HeadToHead {
  product_name: string;
  brand_price: number;
  competitor_brand: string;
  competitor_price: number;
  price_diff_percent: number;
}
```
**UI Label Mapping:**
| Field | User-Facing Label |
|-------|-------------------|
| `price_position: premium` | Premium Tier |
| `price_position: value` | Value Leader |
| `price_position: competitive` | Market Rate |
| `market_share_trend` | Share of Shelf Trend |
| `head_to_head_skus` | Price Comparison |
| `store_overlap_percent` | X% store overlap |
---
## Section 6: Inventory Health
Stock projections and risk levels.
```typescript
interface InventoryHealth {
  critical_count: number;    // <7 days stock
  warning_count: number;     // 7-14 days stock
  healthy_count: number;     // 14-90 days stock
  overstocked_count: number; // >90 days stock
  skus: InventorySku[];
  overstock_alert: OverstockItem[];
}

interface InventorySku {
  store_product_id: number;
  product_name: string;
  store_name: string;
  days_of_stock: number | null;
  risk_level: 'critical' | 'elevated' | 'moderate' | 'healthy';
  current_quantity: number | null;
  daily_sell_rate: number | null;
}

interface OverstockItem {
  product_name: string;
  store_name: string;
  excess_units: number;
  days_of_stock: number;
}
```
**UI Label Mapping:**
| Field | User-Facing Label |
|-------|-------------------|
| `risk_level: critical` | Reorder Now |
| `risk_level: elevated` | Low Stock |
| `risk_level: moderate` | Monitor |
| `risk_level: healthy` | Healthy |
| `critical_count` | Urgent (<7 days) |
| `warning_count` | Low (7-14 days) |
| `overstocked_count` | Excess (>90 days) |
| `days_of_stock` | X days remaining |
| `overstock_alert` | Overstock Alert |
| `excess_units` | X excess units |
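One way to derive the four summary counts from `days_of_stock` is sketched below. The ranges in the schema comments overlap at 14 days, so the boundary handling here (exactly 14 days counts as healthy, exactly 90 as healthy) is an assumption, as are the `unknown` bucket for null projections and both function names.

```typescript
type StockBucket = 'critical' | 'warning' | 'healthy' | 'overstocked' | 'unknown';

function stockBucket(daysOfStock: number | null): StockBucket {
  if (daysOfStock === null) return 'unknown'; // no projection available
  if (daysOfStock < 7) return 'critical';
  if (daysOfStock < 14) return 'warning';     // assumption: 14 itself is healthy
  if (daysOfStock <= 90) return 'healthy';
  return 'overstocked';
}

// Tallies how many SKUs fall in each bucket.
function countBuckets(days: Array<number | null>): Record<StockBucket, number> {
  const counts: Record<StockBucket, number> = {
    critical: 0,
    warning: 0,
    healthy: 0,
    overstocked: 0,
    unknown: 0,
  };
  for (const d of days) counts[stockBucket(d)] += 1;
  return counts;
}
```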
---
## Section 7: Promotion Effectiveness
How promotions impact sales.
```typescript
interface PromoPerformance {
  avg_baseline_velocity: number | null;
  avg_promo_velocity: number | null;
  avg_velocity_lift: number | null;    // % increase during promo
  avg_efficiency_score: number | null; // ROI proxy
  promotions: Promotion[];
}

interface Promotion {
  product_name: string;
  store_name: string;
  status: 'active' | 'scheduled' | 'ended';
  start_date: string;
  end_date: string | null;
  regular_price: number;
  promo_price: number;
  discount_percent: number;
  baseline_velocity: number | null;
  promo_velocity: number | null;
  velocity_lift: number | null;
  efficiency_score: number | null;
}
```
**UI Label Mapping:**
| Field | User-Facing Label |
|-------|-------------------|
| `avg_baseline_velocity` | Normal Rate |
| `avg_promo_velocity` | During Promos |
| `avg_velocity_lift` | Avg Sales Lift |
| `avg_efficiency_score` | ROI Score |
| `velocity_lift` | Sales Lift |
| `efficiency_score` | ROI Score |
| `status: active` | Live |
| `status: scheduled` | Scheduled |
| `status: ended` | Ended |
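The schema describes `velocity_lift` as the "% increase during promo" but does not give formulas. The sketch below shows the conventional percent-change calculations as an assumption about how `velocity_lift` and `discount_percent` might be derived; the server may compute them differently, and both function names are hypothetical.

```typescript
// Assumed formula: percent change of promo velocity over baseline velocity.
function velocityLiftPercent(
  baseline: number | null,
  promo: number | null,
): number | null {
  if (baseline === null || promo === null || baseline <= 0) return null;
  return ((promo - baseline) / baseline) * 100;
}

// Assumed formula: discount relative to the regular price.
function discountPercent(regularPrice: number, promoPrice: number): number {
  if (regularPrice <= 0) return 0;
  return ((regularPrice - promoPrice) / regularPrice) * 100;
}
```

Returning `null` when the baseline is missing or zero matches the nullable fields in the schema and avoids a division by zero.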
---
## Example Queries
### Get full payload
```javascript
const response = await fetch('/api/analytics/v2/brand/Wyld/intelligence?window=30d');
const data = await response.json();
```
### Extract summary cards (flattened)
```javascript
const { performance_snapshot: ps, alerts } = data;

const summaryCards = {
  activeProducts: ps.active_skus,
  monthlyRevenue: ps.total_revenue_30d,
  retailDistribution: ps.total_stores,
  newOpportunities: ps.new_stores_30d,
  categoryPosition: ps.market_share,
  avgWholesale: ps.avg_wholesale_price,
  pricingTier: ps.price_position,
  accountsAtRisk: alerts.lost_stores_30d_count,
  delistedSkus: alerts.lost_skus_30d_count,
  shelfLosses: alerts.competitor_takeover_count,
};
```
### Get top 10 fastest selling SKUs
```javascript
const topSkus = data.sku_performance
  .filter(sku => sku.velocity_status === 'hot' || sku.velocity_status === 'steady')
  .sort((a, b) => b.daily_velocity - a.daily_velocity)
  .slice(0, 10);
```
### Get critical inventory alerts only
```javascript
const criticalInventory = data.inventory_health.skus
  .filter(sku => sku.risk_level === 'critical');
```
### Get states with <50% penetration
```javascript
const underPenetrated = data.retail_footprint.penetration_by_region
  .filter(region => region.percent_reached < 50)
  .sort((a, b) => a.percent_reached - b.percent_reached);
```
### Get active promotions with positive lift
```javascript
const effectivePromos = data.promo_performance.promotions
  .filter(p => p.status === 'active' && p.velocity_lift > 0)
  .sort((a, b) => b.velocity_lift - a.velocity_lift);
```
### Build chart data for market share trend
```javascript
const chartData = data.competitive_landscape.market_share_trend.map(point => ({
  x: new Date(point.date),
  y: point.share_percent,
}));
```
---
## Notes for Frontend Implementation
1. **All fields are snake_case** - transform to camelCase if needed
2. **Null values are possible** - handle gracefully in UI
3. **Arrays may be empty** - show appropriate empty states
4. **Timestamps are ISO format** - parse with `new Date()`
5. **Percentages are already computed** - no need to multiply by 100
6. **The `window` parameter affects trend calculations** - 7d/30d/90d
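For note 1, a generic key transformer is one option. The sketch below is a hypothetical helper, not part of the API client; it converts keys recursively and leaves values (including nulls and empty arrays) untouched.

```typescript
// Converts a single snake_case key, e.g. 'total_revenue_30d' -> 'totalRevenue30d'.
function snakeToCamel(key: string): string {
  return key.replace(/_([a-z0-9])/g, (_match, ch: string) => ch.toUpperCase());
}

// Recursively rewrites object keys; arrays are mapped element by element.
function camelizeKeys(value: unknown): unknown {
  if (Array.isArray(value)) return value.map((item) => camelizeKeys(item));
  if (value !== null && typeof value === 'object') {
    const out: Record<string, unknown> = {};
    for (const [k, v] of Object.entries(value as Record<string, unknown>)) {
      out[snakeToCamel(k)] = camelizeKeys(v);
    }
    return out;
  }
  return value; // primitives and null pass through unchanged
}
```

Note that string enum values such as `'out_of_stock'` are values, not keys, so they are not rewritten.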


@@ -0,0 +1,539 @@
# Crawl Pipeline Documentation
## Overview
The crawl pipeline fetches product data from Dutchie dispensary menus and stores it in the canonical database. This document covers the complete flow from task scheduling to data storage.
---
## Pipeline Stages
```
┌───────────────────────┐
│ store_discovery       │  Find new dispensaries
└───────────┬───────────┘
            ▼
┌───────────────────────┐
│ entry_point_discovery │  Resolve slug → platform_dispensary_id
└───────────┬───────────┘
            ▼
┌───────────────────────┐
│ product_discovery     │  Initial product crawl
└───────────┬───────────┘
            ▼
┌───────────────────────┐
│ product_resync        │  Recurring crawl (every 4 hours)
└───────────────────────┘
```
---
## Stage Details
### 1. Store Discovery
**Purpose:** Find new dispensaries to crawl
**Handler:** `src/tasks/handlers/store-discovery.ts`
**Flow:**
1. Query Dutchie `ConsumerDispensaries` GraphQL for cities/states
2. Extract dispensary info (name, address, menu_url)
3. Insert into `dutchie_discovery_locations`
4. Queue `entry_point_discovery` for each new location
---
### 2. Entry Point Discovery
**Purpose:** Resolve menu URL slug to platform_dispensary_id (MongoDB ObjectId)
**Handler:** `src/tasks/handlers/entry-point-discovery.ts`
**Flow:**
1. Load dispensary from database
2. Extract slug from `menu_url`:
- `/embedded-menu/<slug>` or `/dispensary/<slug>`
3. Start stealth session (fingerprint + proxy)
4. Query `resolveDispensaryIdWithDetails(slug)` via GraphQL
5. Update dispensary with `platform_dispensary_id`
6. Queue `product_discovery` task
**Example:**
```
menu_url: https://dutchie.com/embedded-menu/deeply-rooted
slug: deeply-rooted
platform_dispensary_id: 6405ef617056e8014d79101b
```
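Slug extraction for the two URL shapes listed in step 2 can be sketched as follows. The function name is hypothetical, and the handler's real parsing may accept more URL forms than these two.

```typescript
// Returns the slug from /embedded-menu/<slug> or /dispensary/<slug>, else null.
function extractDutchieSlug(menuUrl: string): string | null {
  let pathname: string;
  try {
    pathname = new URL(menuUrl).pathname;
  } catch {
    return null; // not an absolute URL
  }
  const match = pathname.match(/^\/(?:embedded-menu|dispensary)\/([^/]+)/);
  const slug = match?.[1];
  return slug ? decodeURIComponent(slug) : null;
}
```

For the example above, `extractDutchieSlug('https://dutchie.com/embedded-menu/deeply-rooted')` yields `deeply-rooted`, which is then passed to the GraphQL lookup in step 4.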
---
### 3. Product Discovery
**Purpose:** Initial crawl of a new dispensary
**Handler:** `src/tasks/handlers/product-discovery.ts`
Runs the same flow as `product_resync` (described in the next section), but for a dispensary's first crawl.
---
### 4. Product Resync
**Purpose:** Recurring crawl to capture price/stock changes
**Handler:** `src/tasks/handlers/product-resync.ts`
**Flow:**
#### Step 1: Load Dispensary Info
```sql
SELECT id, name, platform_dispensary_id, menu_url, state
FROM dispensaries
WHERE id = $1 AND crawl_enabled = true
```
#### Step 2: Start Stealth Session
- Generate random browser fingerprint
- Set locale/timezone matching state
- Optional proxy rotation
#### Step 3: Fetch Products via GraphQL
**Endpoint:** `https://dutchie.com/api-3/graphql`
**Variables:**
```javascript
{
  includeEnterpriseSpecials: false,
  productsFilter: {
    dispensaryId: "<platform_dispensary_id>",
    pricingType: "rec",
    Status: "All",
    types: [],
    useCache: false,
    isDefaultSort: true,
    sortBy: "popularSortIdx",
    sortDirection: 1,
    bypassOnlineThresholds: true,
    isKioskMenu: false,
    removeProductsBelowOptionThresholds: false
  },
  page: 0,
  perPage: 100
}
```
**Key Notes:**
- `Status: "All"` returns all products (Active returns same count)
- `Status: null` returns 0 products (broken)
- `pricingType: "rec"` returns BOTH rec and med prices
- Paginate until `products.length < perPage` or `allProducts.length >= totalCount`
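The pagination rule in the last note can be sketched as a stop predicate plus a loop. `fetchPage` is a hypothetical callback standing in for the GraphQL request, and the `maxPages` safety cap is an assumption, not something this pipeline is documented to enforce.

```typescript
interface PageResult<T> {
  products: T[];
  totalCount: number;
}

// True when the documented stop condition is met:
// a short page, or everything reported by totalCount has been collected.
function isLastPage(
  pageLength: number,
  perPage: number,
  collected: number,
  totalCount: number,
): boolean {
  return pageLength < perPage || collected >= totalCount;
}

async function fetchAllProducts<T>(
  fetchPage: (page: number, perPage: number) => Promise<PageResult<T>>,
  perPage = 100,
  maxPages = 30, // assumed safety cap against runaway loops
): Promise<T[]> {
  const all: T[] = [];
  for (let page = 0; page < maxPages; page++) {
    const { products, totalCount } = await fetchPage(page, perPage);
    all.push(...products);
    if (isLastPage(products.length, perPage, all.length, totalCount)) break;
  }
  return all;
}
```

Checking both conditions matters when `totalCount` is an exact multiple of `perPage`: the final page is full, so only the collected-count check ends the loop without an extra empty request.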
#### Step 4: Normalize Data
Transform raw Dutchie payload to canonical format via `DutchieNormalizer`.
#### Step 5: Upsert Products
Insert/update `store_products` table with normalized data.
#### Step 6: Create Snapshots
Insert point-in-time record to `store_product_snapshots`.
#### Step 7: Track Missing Products (OOS Detection)
```sql
-- Reset consecutive_misses for products IN the feed
UPDATE store_products
SET consecutive_misses = 0, last_seen_at = NOW()
WHERE dispensary_id = $1
  AND provider = 'dutchie'
  AND provider_product_id = ANY($2);

-- Increment for products NOT in feed
UPDATE store_products
SET consecutive_misses = consecutive_misses + 1
WHERE dispensary_id = $1
  AND provider = 'dutchie'
  AND provider_product_id NOT IN (...)
  AND consecutive_misses < 3;

-- Mark OOS at 3 consecutive misses
UPDATE store_products
SET stock_status = 'oos', is_in_stock = false
WHERE dispensary_id = $1
  AND consecutive_misses >= 3
  AND stock_status != 'oos';
```
#### Step 8: Download Images
For new products, download and store images locally.
#### Step 9: Update Dispensary
```sql
UPDATE dispensaries SET last_crawl_at = NOW() WHERE id = $1
```
---
## GraphQL Payload Structure
### Product Fields (from filteredProducts.products[])
| Field | Type | Description |
|-------|------|-------------|
| `_id` / `id` | string | MongoDB ObjectId (24 hex chars) |
| `Name` | string | Product display name |
| `brandName` | string | Brand name |
| `brand.name` | string | Brand name (nested) |
| `brand.description` | string | Brand description |
| `type` | string | Category (Flower, Edible, Concentrate, etc.) |
| `subcategory` | string | Subcategory |
| `strainType` | string | Hybrid, Indica, Sativa, N/A |
| `Status` | string | Always "Active" in feed |
| `Image` | string | Primary image URL |
| `images[]` | array | All product images |
### Pricing Fields
| Field | Type | Description |
|-------|------|-------------|
| `Prices[]` | number[] | Rec prices per option |
| `recPrices[]` | number[] | Rec prices |
| `medicalPrices[]` | number[] | Medical prices |
| `recSpecialPrices[]` | number[] | Rec sale prices |
| `medicalSpecialPrices[]` | number[] | Medical sale prices |
| `Options[]` | string[] | Size options ("1/8oz", "1g", etc.) |
| `rawOptions[]` | string[] | Raw weight options ("3.5g") |
### Inventory Fields (POSMetaData.children[])
| Field | Type | Description |
|-------|------|-------------|
| `quantity` | number | Total inventory count |
| `quantityAvailable` | number | Available for online orders |
| `kioskQuantityAvailable` | number | Available for kiosk orders |
| `option` | string | Which size option this is for |
### Potency Fields
| Field | Type | Description |
|-------|------|-------------|
| `THCContent.range[]` | number[] | THC percentage |
| `CBDContent.range[]` | number[] | CBD percentage |
| `cannabinoidsV2[]` | array | Detailed cannabinoid breakdown |
### Specials (specialData.bogoSpecials[])
| Field | Type | Description |
|-------|------|-------------|
| `specialName` | string | Deal name |
| `specialType` | string | "bogo", "sale", etc. |
| `itemsForAPrice.value` | string | Bundle price |
| `bogoRewards[].totalQuantity.quantity` | number | Required quantity |
---
## OOS Detection Logic
Products disappear from the Dutchie feed when they go out of stock. We track this via `consecutive_misses`:
| Scenario | Action |
|----------|--------|
| Product in feed | `consecutive_misses = 0` |
| Product missing 1st time | `consecutive_misses = 1` |
| Product missing 2nd time | `consecutive_misses = 2` |
| Product missing 3rd time | `consecutive_misses = 3`, mark `stock_status = 'oos'` |
| Product returns to feed | `consecutive_misses = 0`, update stock_status |
**Why 3 misses?**
- Protects against false positives from crawl failures
- Single bad crawl doesn't trigger mass OOS alerts
- Balances detection speed vs accuracy
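The table above is effectively a small state machine per product. A minimal in-memory sketch follows; the real logic lives in SQL, the `'in_stock'` status string is an assumption (only `'oos'` appears in the crawl queries), and the function name is hypothetical.

```typescript
type StockStatus = 'in_stock' | 'oos';

interface TrackedProduct {
  consecutive_misses: number;
  stock_status: StockStatus;
}

const OOS_MISS_THRESHOLD = 3;

// Applies one crawl result to a product and returns its new state.
function applyCrawl(product: TrackedProduct, inFeed: boolean): TrackedProduct {
  if (inFeed) {
    // Assumption: a product present in the feed is treated as in stock.
    return { consecutive_misses: 0, stock_status: 'in_stock' };
  }
  // The counter is capped at the threshold, mirroring the `< 3` guard in the SQL.
  const misses = Math.min(product.consecutive_misses + 1, OOS_MISS_THRESHOLD);
  return {
    consecutive_misses: misses,
    stock_status: misses >= OOS_MISS_THRESHOLD ? 'oos' : product.stock_status,
  };
}
```

With `product_resync` running every 4 hours, three consecutive misses means a product is flagged roughly 8 to 12 hours after it first disappears from the feed.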
---
## Database Tables
### store_products
Current state of each product:
- `provider_product_id` - Dutchie's MongoDB ObjectId
- `name_raw`, `brand_name_raw` - Raw values from feed
- `price_rec`, `price_med` - Current prices
- `is_in_stock`, `stock_status` - Availability
- `consecutive_misses` - OOS detection counter
- `last_seen_at` - Last time product was in feed
### store_product_snapshots
Point-in-time records for historical analysis:
- One row per product per crawl
- Captures price, stock, potency at that moment
- Used for price history, analytics
### dispensaries
Store metadata:
- `platform_dispensary_id` - MongoDB ObjectId for GraphQL
- `menu_url` - Source URL
- `last_crawl_at` - Last successful crawl
- `crawl_enabled` - Whether to crawl
---
## Worker Roles
Workers pull tasks from the `worker_tasks` queue based on their assigned role.
| Role | Name | Description | Handler |
|------|------|-------------|---------|
| `product_resync` | Product Resync | Re-crawl dispensary products for price/stock changes | `handleProductResync` |
| `product_discovery` | Product Discovery | Initial product discovery for new dispensaries | `handleProductDiscovery` |
| `store_discovery` | Store Discovery | Discover new dispensary locations | `handleStoreDiscovery` |
| `entry_point_discovery` | Entry Point Discovery | Resolve platform IDs from menu URLs | `handleEntryPointDiscovery` |
| `analytics_refresh` | Analytics Refresh | Refresh materialized views and analytics | `handleAnalyticsRefresh` |
**API Endpoint:** `GET /api/worker-registry/roles`
---
## Scheduling
Crawls are scheduled via `worker_tasks` table:
| Role | Frequency | Description |
|------|-----------|-------------|
| `product_resync` | Every 4 hours | Regular product refresh |
| `product_discovery` | On-demand | First crawl for new stores |
| `entry_point_discovery` | On-demand | New store setup |
| `store_discovery` | Daily | Find new stores |
| `analytics_refresh` | Daily | Refresh analytics materialized views |
---
## Priority & On-Demand Tasks
Tasks are claimed by workers in order of **priority DESC, created_at ASC**.
### Priority Levels
| Priority | Use Case | Example |
|----------|----------|---------|
| 0 | Scheduled/batch tasks | Scheduled product_resync generation |
| 10 | On-demand/chained tasks | entry_point → product_discovery |
| Higher | Urgent/manual triggers | Admin-triggered immediate crawl |
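The claim order (**priority DESC, created_at ASC**) can be expressed as a comparator. This is only an illustration of the ordering; actual claiming happens in the database, and the `QueuedTask` shape is a simplified assumption.

```typescript
// Simplified, hypothetical view of a worker_tasks row.
interface QueuedTask {
  id: number;
  priority: number;
  created_at: string; // ISO timestamp
}

// Higher priority first; ties go to the oldest task.
function claimOrder(a: QueuedTask, b: QueuedTask): number {
  if (a.priority !== b.priority) return b.priority - a.priority;
  return Date.parse(a.created_at) - Date.parse(b.created_at);
}
```

Sorting a queue with `tasks.sort(claimOrder)` puts an admin-triggered priority 20 task ahead of chained priority 10 tasks, which in turn run before scheduled priority 0 work.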
### Task Chaining
When a task completes, the system automatically creates follow-up tasks:
```
store_discovery (completed)
    └─► entry_point_discovery (priority: 10) for each new store

entry_point_discovery (completed, success)
    └─► product_discovery (priority: 10) for that store

product_discovery (completed)
    └─► [no chain] Store enters regular resync schedule
```
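The chaining rules in the diagram can be sketched as a lookup from a finished task to its follow-ups. The `FollowUp` shape and function name are hypothetical, and requiring success for every role is an assumption: the diagram states it explicitly only for `entry_point_discovery`.

```typescript
type Role =
  | 'store_discovery'
  | 'entry_point_discovery'
  | 'product_discovery'
  | 'product_resync'
  | 'analytics_refresh';

interface FollowUp {
  role: Role;
  priority: number;
  dispensaryId: number;
}

// dispensaryIds: the new stores found (store_discovery) or the single store processed.
function followUps(role: Role, succeeded: boolean, dispensaryIds: number[]): FollowUp[] {
  if (!succeeded) return [];
  switch (role) {
    case 'store_discovery':
      return dispensaryIds.map((id) => ({
        role: 'entry_point_discovery' as const,
        priority: 10,
        dispensaryId: id,
      }));
    case 'entry_point_discovery':
      return dispensaryIds.map((id) => ({
        role: 'product_discovery' as const,
        priority: 10,
        dispensaryId: id,
      }));
    default:
      return []; // product_discovery and the rest do not chain
  }
}
```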
### On-Demand Task Creation
Use the task service to create high-priority tasks:
```typescript
// Create immediate product resync for a store
await taskService.createTask({
  role: 'product_resync',
  dispensary_id: 123,
  platform: 'dutchie',
  priority: 20, // Higher than batch tasks
});

// Convenience methods with default high priority (10)
await taskService.createEntryPointTask(dispensaryId, 'dutchie');
await taskService.createProductDiscoveryTask(dispensaryId, 'dutchie');
await taskService.createStoreDiscoveryTask('dutchie', 'AZ');
```
### Claim Function
The `claim_task()` SQL function atomically claims tasks:
- Respects priority ordering (higher = first)
- Uses `FOR UPDATE SKIP LOCKED` for concurrency
- Prevents multiple active tasks per store
---
## Image Storage
Images are downloaded from Dutchie's AWS S3 and stored locally with on-demand resizing.
### Storage Path
```
/storage/images/products/<state>/<store>/<brand>/<product_id>/image-<hash>.webp
/storage/images/brands/<brand>/logo-<hash>.webp
```
**Example:**
```
/storage/images/products/az/az-deeply-rooted/bud-bros/6913e3cd444eac3935e928b9/image-ae38b1f9.webp
```
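A sketch of assembling a product image path from its parts. The slug rule (lowercase, non-alphanumerics collapsed to hyphens) is an assumption inferred from the example path, and both function names are hypothetical; `src/utils/image-storage.ts` may normalize differently.

```typescript
// Assumed slug rule: lowercase, runs of non-alphanumerics become a single hyphen.
function slugify(value: string): string {
  return value
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, '-')
    .replace(/^-+|-+$/g, '');
}

function productImagePath(
  state: string,
  store: string,
  brand: string,
  productId: string,
  hash: string,
): string {
  return (
    `/storage/images/products/${slugify(state)}/${slugify(store)}` +
    `/${slugify(brand)}/${productId}/image-${hash}.webp`
  );
}
```

Under these assumptions, `productImagePath('AZ', 'AZ-Deeply-Rooted', 'Bud Bros', '6913e3cd444eac3935e928b9', 'ae38b1f9')` reproduces the example path above.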
### Image Proxy API
Served via `/img/*` with on-demand resizing using **sharp**:
```
GET /img/products/az/az-deeply-rooted/bud-bros/6913e3cd444eac3935e928b9/image-ae38b1f9.webp?w=200
```
| Param | Description |
|-------|-------------|
| `w` | Width in pixels (max 4000) |
| `h` | Height in pixels (max 4000) |
| `q` | Quality 1-100 (default 80) |
| `fit` | cover, contain, fill, inside, outside |
| `blur` | Blur sigma (0.3-1000) |
| `gray` | Grayscale (1 = enabled) |
| `format` | webp, jpeg, png, avif (default webp) |
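A client-side helper for building proxy URLs within the documented limits might look like the sketch below. The helper and its option type are hypothetical, the lower bound of 1 for `w` and `h` is an assumption, and `blur` is left out for brevity.

```typescript
type Fit = 'cover' | 'contain' | 'fill' | 'inside' | 'outside';
type ImageFormat = 'webp' | 'jpeg' | 'png' | 'avif';

interface ImageProxyOptions {
  w?: number;
  h?: number;
  q?: number;
  fit?: Fit;
  gray?: boolean;
  format?: ImageFormat;
}

function clampInt(value: number, min: number, max: number): number {
  return Math.min(max, Math.max(min, Math.round(value)));
}

function imageProxyUrl(relativePath: string, opts: ImageProxyOptions = {}): string {
  const params = new URLSearchParams();
  if (opts.w !== undefined) params.set('w', String(clampInt(opts.w, 1, 4000)));
  if (opts.h !== undefined) params.set('h', String(clampInt(opts.h, 1, 4000)));
  if (opts.q !== undefined) params.set('q', String(clampInt(opts.q, 1, 100)));
  if (opts.fit) params.set('fit', opts.fit);
  if (opts.gray) params.set('gray', '1');
  if (opts.format) params.set('format', opts.format);

  const base = `/img/${relativePath.replace(/^\/+/, '')}`;
  const qs = params.toString();
  return qs ? `${base}?${qs}` : base;
}
```

Clamping on the client keeps requests inside the documented ranges rather than relying on the proxy to reject or correct them.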
### Key Files
| File | Purpose |
|------|---------|
| `src/utils/image-storage.ts` | Download & save images to local filesystem |
| `src/routes/image-proxy.ts` | On-demand resize/transform at `/img/*` |
### Download Rules
| Scenario | Image Action |
|----------|--------------|
| **New product (first crawl)** | Download if `primaryImageUrl` exists |
| **Existing product (refresh)** | Download only if `local_image_path` is NULL (backfill) |
| **Product already has local image** | Skip download entirely |
**Logic:**
- Images are downloaded **once** and never re-downloaded on subsequent crawls
- `skipIfExists: true` - filesystem check prevents re-download even if queued
- First crawl: all products get images
- Refresh crawl: only new products or products missing local images
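The download rules reduce to a single predicate. The sketch below uses a hypothetical `ImageState` shape whose field names echo `primaryImageUrl` and `local_image_path` from this section; the separate filesystem check behind `skipIfExists` is not modeled.

```typescript
interface ImageState {
  primaryImageUrl: string | null; // remote image URL from the feed, if any
  localImagePath: string | null;  // set once the image has been stored locally
}

// Download only when a source URL exists and no local copy is recorded yet.
function shouldDownloadImage(product: ImageState): boolean {
  if (!product.primaryImageUrl) return false;
  return product.localImagePath === null;
}
```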
### Storage Rules
- **NO MinIO** - local filesystem only (`STORAGE_DRIVER=local`)
- Store full resolution, resize on-demand via `/img` proxy
- Convert to webp for consistency using **sharp**
- Preserve original Dutchie URL as fallback in `image_url` column
- Local path stored in `local_image_path` column
---
## Stealth & Anti-Detection
**PROXIES ARE REQUIRED** - Workers will fail to start if no active proxies are available in the database. All HTTP requests to Dutchie go through a proxy.
Workers automatically initialize anti-detection systems on startup.
### Components
| Component | Purpose | Source |
|-----------|---------|--------|
| **CrawlRotator** | Coordinates proxy + UA rotation | `src/services/crawl-rotator.ts` |
| **ProxyRotator** | Round-robin proxy selection, health tracking | `src/services/crawl-rotator.ts` |
| **UserAgentRotator** | Cycles through realistic browser fingerprints | `src/services/crawl-rotator.ts` |
| **Dutchie Client** | Curl-based HTTP with auto-retry on 403 | `src/platforms/dutchie/client.ts` |
### Initialization Flow
```
Worker Start
    │
    ├─► initializeStealth()
    │       │
    │       ├─► CrawlRotator.initialize()
    │       │       └─► Load proxies from `proxies` table
    │       │
    │       └─► setCrawlRotator(rotator)
    │               └─► Wire to Dutchie client
    │
    └─► Process tasks...
```
### Stealth Session (per task)
Each crawl task starts a stealth session:
```typescript
// In product-refresh.ts, entry-point-discovery.ts
const session = startSession(dispensary.state || 'AZ', 'America/Phoenix');
```
This creates a new identity with:
- **Random fingerprint:** Chrome/Firefox/Safari/Edge on Win/Mac/Linux
- **Accept-Language:** Matches timezone (e.g., `America/Phoenix` → `en-US,en;q=0.9`)
- **sec-ch-ua headers:** Proper Client Hints for the browser profile
### On 403 Block
When Dutchie returns 403, the client automatically:
1. Records failure on current proxy (increments `failure_count`)
2. If proxy has 5+ failures, deactivates it
3. Rotates to next healthy proxy
4. Rotates fingerprint
5. Retries the request
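Steps 1 to 3 can be sketched with plain data. This is an in-memory illustration under assumed names (`ProxyRow`, `recordFailure`, `nextHealthyProxy`); the real rotator in `src/services/crawl-rotator.ts` persists these fields to the `proxies` table and may differ in detail.

```typescript
// Subset of the proxies table columns used by the rotation logic.
interface ProxyRow {
  id: number;
  is_active: boolean;
  failure_count: number;
}

const MAX_PROXY_FAILURES = 5;

// Steps 1-2: count the failure and deactivate at 5 or more.
function recordFailure(proxy: ProxyRow): ProxyRow {
  const failure_count = proxy.failure_count + 1;
  return {
    ...proxy,
    failure_count,
    is_active: proxy.is_active && failure_count < MAX_PROXY_FAILURES,
  };
}

// Step 3: round-robin to the next active proxy; -1 when none are healthy.
function nextHealthyProxy(proxies: ProxyRow[], currentIndex: number): number {
  for (let step = 1; step <= proxies.length; step++) {
    const i = (currentIndex + step) % proxies.length;
    if (proxies[i]?.is_active) return i;
  }
  return -1;
}
```

A return value of `-1` corresponds to the "no active proxies" condition, under which workers are documented to refuse to run.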
### Proxy Table Schema
```sql
CREATE TABLE proxies (
  id SERIAL PRIMARY KEY,
  host VARCHAR(255) NOT NULL,
  port INTEGER NOT NULL,
  username VARCHAR(100),
  password VARCHAR(100),
  protocol VARCHAR(10) DEFAULT 'http', -- http, https, socks5
  is_active BOOLEAN DEFAULT true,
  last_used_at TIMESTAMPTZ,
  failure_count INTEGER DEFAULT 0,
  success_count INTEGER DEFAULT 0,
  avg_response_time_ms INTEGER,
  last_failure_at TIMESTAMPTZ,
  last_error TEXT
);
```
### Configuration
Proxies are mandatory. There is no environment variable to disable them. Workers will refuse to start without active proxies in the database.
### User-Agent Generation
See `workflow-12102025.md` for full specification.
**Summary:**
- Uses `intoli/user-agents` library (daily-updated market share data)
- Device distribution: Mobile 62%, Desktop 36%, Tablet 2%
- Browser whitelist: Chrome, Safari, Edge, Firefox only
- UA sticks until IP rotates (403 or manual rotation)
- Failure = alert admin + stop crawl (no fallback)
Each fingerprint includes proper `sec-ch-ua`, `sec-ch-ua-platform`, and `sec-ch-ua-mobile` headers.
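The device distribution above amounts to a weighted random choice. The sketch below shows that idea with plain arithmetic and does not use the `user-agents` library's own API; the function name is hypothetical, and `roll` is expected to be a number in [0, 1) such as `Math.random()`.

```typescript
type DeviceClass = 'mobile' | 'desktop' | 'tablet';

// Weights from the summary: Mobile 62%, Desktop 36%, Tablet 2%.
const DEVICE_WEIGHTS: Array<[DeviceClass, number]> = [
  ['mobile', 0.62],
  ['desktop', 0.36],
  ['tablet', 0.02],
];

function pickDeviceClass(roll: number): DeviceClass {
  let cumulative = 0;
  for (const [device, weight] of DEVICE_WEIGHTS) {
    cumulative += weight;
    if (roll < cumulative) return device;
  }
  return 'tablet'; // guards against floating-point shortfall near 1
}
```

Taking `roll` as a parameter instead of calling `Math.random()` inside keeps the function deterministic and easy to test.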
---
## Error Handling
- **GraphQL errors:** Logged, task marked failed, retried later
- **Normalization errors:** Logged as warnings, continue with valid products
- **Image download errors:** Non-fatal, logged, continue
- **Database errors:** Task fails, will be retried
- **403 blocks:** Auto-rotate proxy + fingerprint, retry (up to 3 retries)
---
## Files
| File | Purpose |
|------|---------|
| `src/tasks/handlers/product-resync.ts` | Main crawl handler |
| `src/tasks/handlers/entry-point-discovery.ts` | Slug → ID resolution |
| `src/platforms/dutchie/index.ts` | GraphQL client, session management |
| `src/hydration/normalizers/dutchie.ts` | Payload normalization |
| `src/hydration/canonical-upsert.ts` | Database upsert logic |
| `src/utils/image-storage.ts` | Image download and local storage |
| `src/routes/image-proxy.ts` | On-demand image resizing |
| `migrations/075_consecutive_misses.sql` | OOS tracking column |


@@ -0,0 +1,297 @@
# Organic Browser-Based Scraping Guide
**Last Updated:** 2025-12-12
**Status:** Production-ready proof of concept
---
## Overview
This document describes the "organic" browser-based approach to scraping Dutchie dispensary menus. Unlike direct curl/axios requests, this method uses a real browser session to make API calls, making requests appear natural and reducing detection risk.
---
## Why Organic Scraping?
| Approach | Detection Risk | Speed | Complexity |
|----------|---------------|-------|------------|
| Direct curl | Higher | Fast | Low |
| curl-impersonate | Medium | Fast | Medium |
| **Browser-based (organic)** | **Lowest** | Slower | Higher |
Direct curl requests can be fingerprinted via:
- TLS fingerprint (cipher suites, extensions)
- Header order and values
- Missing cookies/session data
- Request patterns
Browser-based requests inherit:
- Real Chrome TLS fingerprint
- Session cookies from page visit
- Natural header order
- JavaScript execution environment
---
## Implementation
### Dependencies
```bash
npm install puppeteer puppeteer-extra puppeteer-extra-plugin-stealth
```
### Core Script: `test-intercept.js`
Located at: `backend/test-intercept.js`
```javascript
const puppeteer = require('puppeteer-extra');
const StealthPlugin = require('puppeteer-extra-plugin-stealth');
const fs = require('fs');

puppeteer.use(StealthPlugin());

async function capturePayload(config) {
  const { dispensaryId, platformId, cName, outputPath } = config;

  const browser = await puppeteer.launch({
    headless: 'new',
    args: ['--no-sandbox', '--disable-setuid-sandbox']
  });
  const page = await browser.newPage();

  // STEP 1: Establish session by visiting the menu
  const embedUrl = `https://dutchie.com/embedded-menu/${cName}?menuType=rec`;
  await page.goto(embedUrl, { waitUntil: 'networkidle2', timeout: 60000 });

  // STEP 2: Fetch ALL products using GraphQL from browser context
  const result = await page.evaluate(async (platformId) => {
    const allProducts = [];
    let pageNum = 0;
    const perPage = 100;
    let totalCount = 0;
    const sessionId = 'browser-session-' + Date.now();

    while (pageNum < 30) {
      const variables = {
        includeEnterpriseSpecials: false,
        productsFilter: {
          dispensaryId: platformId,
          pricingType: 'rec',
          Status: 'Active', // CRITICAL: Must be 'Active', not null
          types: [],
          useCache: true,
          isDefaultSort: true,
          sortBy: 'popularSortIdx',
          sortDirection: 1,
          bypassOnlineThresholds: true,
          isKioskMenu: false,
          removeProductsBelowOptionThresholds: false,
        },
        page: pageNum,
        perPage: perPage,
      };

      const extensions = {
        persistedQuery: {
          version: 1,
          sha256Hash: 'ee29c060826dc41c527e470e9ae502c9b2c169720faa0a9f5d25e1b9a530a4a0'
        }
      };

      const qs = new URLSearchParams({
        operationName: 'FilteredProducts',
        variables: JSON.stringify(variables),
        extensions: JSON.stringify(extensions)
      });

      const response = await fetch(`https://dutchie.com/api-3/graphql?${qs}`, {
        method: 'GET',
        headers: {
          'Accept': 'application/json',
          'content-type': 'application/json',
          'x-dutchie-session': sessionId,
          'apollographql-client-name': 'Marketplace (production)',
        },
        credentials: 'include'
      });

      const json = await response.json();
      const data = json?.data?.filteredProducts;
      if (!data?.products) break;

      allProducts.push(...data.products);
      if (pageNum === 0) totalCount = data.queryInfo?.totalCount || 0;
      if (allProducts.length >= totalCount) break;

      pageNum++;
      await new Promise(r => setTimeout(r, 200)); // Polite delay
    }

    return { products: allProducts, totalCount };
  }, platformId);

  await browser.close();

  // STEP 3: Save payload
  const payload = {
    dispensaryId,
    platformId,
    cName,
    fetchedAt: new Date().toISOString(),
    productCount: result.products.length,
    products: result.products,
  };
  fs.writeFileSync(outputPath, JSON.stringify(payload, null, 2));

  return payload;
}
```
---
## Critical Parameters
### GraphQL Hash (FilteredProducts)
```
ee29c060826dc41c527e470e9ae502c9b2c169720faa0a9f5d25e1b9a530a4a0
```
**WARNING:** Using the wrong hash returns HTTP 400.
### Status Parameter
| Value | Result |
|-------|--------|
| `'Active'` | Returns in-stock products (1019 in test) |
| `null` | Returns 0 products |
| `'All'` | Returns HTTP 400 |
**ALWAYS use `Status: 'Active'`**
### Required Headers
```javascript
{
  'Accept': 'application/json',
  'content-type': 'application/json',
  'x-dutchie-session': 'unique-session-id',
  'apollographql-client-name': 'Marketplace (production)',
}
```
### Endpoint
```
https://dutchie.com/api-3/graphql
```
---
## Performance Benchmarks
Test store: AZ-Deeply-Rooted (1019 products)
| Metric | Value |
|--------|-------|
| Total products | 1019 |
| Time | 18.5 seconds |
| Payload size | 11.8 MB |
| Pages fetched | 11 (100 per page) |
| Success rate | 100% |
---
## Payload Format
The output matches the existing `payload-fetch.ts` handler format:
```json
{
  "dispensaryId": 123,
  "platformId": "6405ef617056e8014d79101b",
  "cName": "AZ-Deeply-Rooted",
  "fetchedAt": "2025-12-12T05:05:19.837Z",
  "productCount": 1019,
  "products": [
    {
      "id": "6927508db4851262f629a869",
      "Name": "Product Name",
      "brand": { "name": "Brand Name", ... },
      "type": "Flower",
      "THC": "25%",
      "Prices": [...],
      "Options": [...],
      ...
    }
  ]
}
```
---
## Integration Points
### As a Task Handler
The organic approach can be integrated as an alternative to curl-based fetching:
```typescript
// In src/tasks/handlers/organic-payload-fetch.ts
export async function handleOrganicPayloadFetch(ctx: TaskContext): Promise<TaskResult> {
  // Use puppeteer-based capture
  // Save to same payload storage
  // Queue product_refresh task
}
```
### Worker Configuration
Add to job_schedules:
```sql
INSERT INTO job_schedules (name, role, cron_expression)
VALUES ('organic_product_crawl', 'organic_payload_fetch', '0 */6 * * *');
```
---
## Troubleshooting
### HTTP 400 Bad Request
- Check hash is correct: `ee29c060...`
- Verify Status is `'Active'` (string, not null)
### 0 Products Returned
- Status was likely `null`; use `'Active'` (per the Status Parameter table, `'All'` fails with HTTP 400 instead)
- Check platformId is valid MongoDB ObjectId
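A quick format check for `platformId` before issuing requests. A MongoDB ObjectId is 24 hexadecimal characters; the helper name is hypothetical, and passing this check only confirms the format, not that the dispensary exists.

```typescript
// True when the value has the shape of a MongoDB ObjectId (24 hex characters).
function isObjectId(value: string): boolean {
  return /^[0-9a-f]{24}$/i.test(value);
}
```

For example, `6405ef617056e8014d79101b` passes, while a menu slug such as `deeply-rooted` does not.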
### Session Not Established
- Increase timeout on initial page.goto()
- Check cName is valid (matches embedded-menu URL)
### Detection/Blocking
- StealthPlugin should handle most cases
- Add random delays between pages
- Use headless: 'new' (not true/false)
---
## Files Reference
| File | Purpose |
|------|---------|
| `backend/test-intercept.js` | Proof of concept script |
| `backend/src/platforms/dutchie/client.ts` | GraphQL hashes, curl implementation |
| `backend/src/tasks/handlers/payload-fetch.ts` | Current curl-based handler |
| `backend/src/utils/payload-storage.ts` | Payload save/load utilities |
---
## See Also
- `DUTCHIE_CRAWL_WORKFLOW.md` - Full crawl pipeline documentation
- `TASK_WORKFLOW_2024-12-10.md` - Task system architecture
- `CLAUDE.md` - Project rules and constraints


@@ -0,0 +1,25 @@
# ARCHIVED DOCUMENTATION
**WARNING: These docs may be outdated or inaccurate.**
The code has evolved significantly. These docs are kept for historical reference only.
## What to Use Instead
**The single source of truth is:**
- `CLAUDE.md` (root) - Essential rules and quick reference
- `docs/CODEBASE_MAP.md` - Current file/directory reference
## Why Archive?
These docs were written during development iterations and may reference:
- Old file paths that no longer exist
- Deprecated approaches (hydration, scraper-v2)
- APIs that have changed
- Database schemas that evolved
## If You Need Details
1. First check CODEBASE_MAP.md for current file locations
2. Then read the actual source code
3. Only use archive docs as a last resort for historical context

@@ -0,0 +1,584 @@
# Task Workflow Documentation
**Date: 2024-12-10**
This document describes the complete task/job processing architecture after the 2024-12-10 rewrite.
---
## Complete Architecture
```
┌─────────────────────────────────────────────────────────────────────────────────┐
│ KUBERNETES CLUSTER │
├─────────────────────────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────────────────────────────────────────────────────────────────┐ │
│ │ API SERVER POD (scraper) │ │
│ │ │ │
│ │ ┌──────────────────┐ ┌────────────────────────────────────────┐ │ │
│ │ │ Express API │ │ TaskScheduler │ │ │
│ │ │ │ │ (src/services/task-scheduler.ts) │ │ │
│ │ │ /api/job-queue │ │ │ │ │
│ │ │ /api/tasks │ │ • Polls every 60s │ │ │
│ │ │ /api/schedules │ │ • Checks task_schedules table │ │ │
│ │ └────────┬─────────┘ │ • SELECT FOR UPDATE SKIP LOCKED │ │ │
│ │ │ │ • Generates tasks when due │ │ │
│ │ │ └──────────────────┬─────────────────────┘ │ │
│ │ │ │ │ │
│ └────────────┼──────────────────────────────────┼──────────────────────────┘ │
│ │ │ │
│ │ ┌────────────────────────┘ │
│ │ │ │
│ ▼ ▼ │
│ ┌─────────────────────────────────────────────────────────────────────────┐ │
│ │ POSTGRESQL DATABASE │ │
│ │ │ │
│ │ ┌─────────────────────┐ ┌─────────────────────┐ │ │
│ │ │ task_schedules │ │ worker_tasks │ │ │
│ │ │ │ │ │ │ │
│ │ │ • payload_fetch │───────►│ • pending tasks │ │ │
│ │ │ • store_discovery │ create │ • claimed tasks │ │ │
│ │ │ • analytics_refresh │ tasks │ • running tasks │ │ │
│ │ │ │ │ • completed tasks │ │ │
│ │ │ next_run_at │ │ │ │ │
│ │ │ last_run_at │ │ role, dispensary_id │ │ │
│ │ │ interval_hours │ │ priority, status │ │ │
│ │ └─────────────────────┘ └──────────┬──────────┘ │ │
│ │ │ │ │
│ └─────────────────────────────────────────────┼────────────────────────────┘ │
│ │ │
│ ┌──────────────────────┘ │
│ │ Workers poll for tasks │
│ │ (SELECT FOR UPDATE SKIP LOCKED) │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────────────────┐ │
│ │ WORKER PODS (StatefulSet: scraper-worker) │ │
│ │ │ │
│ │ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │ │
│ │ │ Worker 0 │ │ Worker 1 │ │ Worker 2 │ │ Worker N │ │ │
│ │ │ │ │ │ │ │ │ │ │ │
│ │ │ task-worker │ │ task-worker │ │ task-worker │ │ task-worker │ │ │
│ │ │ .ts │ │ .ts │ │ .ts │ │ .ts │ │ │
│ │ └─────────────┘ └─────────────┘ └─────────────┘ └─────────────┘ │ │
│ │ │ │
│ └──────────────────────────────────────────────────────────────────────────┘ │
│ │
└──────────────────────────────────────────────────────────────────────────────────┘
```
---
## Startup Sequence
```
┌─────────────────────────────────────────────────────────────────────────────┐
│ API SERVER STARTUP │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ 1. Express app initializes │
│ │ │
│ ▼ │
│ 2. runAutoMigrations() │
│ • Runs pending migrations (including 079_task_schedules.sql) │
│ │ │
│ ▼ │
│ 3. initializeMinio() / initializeImageStorage() │
│ │ │
│ ▼ │
│ 4. cleanupOrphanedJobs() │
│ │ │
│ ▼ │
│ 5. taskScheduler.start() ◄─── NEW (per TASK_WORKFLOW_2024-12-10.md) │
│ │ │
│ ├── Recover stale tasks (workers that died) │
│ ├── Ensure default schedules exist in task_schedules │
│ ├── Check and run any due schedules immediately │
│ └── Start 60-second poll interval │
│ │ │
│ ▼ │
│ 6. app.listen(PORT) │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────────────────┐
│ WORKER POD STARTUP │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ 1. K8s starts pod from StatefulSet │
│ │ │
│ ▼ │
│ 2. TaskWorker.constructor() │
│ • Create DB pool │
│ • Create CrawlRotator │
│ │ │
│ ▼ │
│ 3. initializeStealth() │
│ • Load proxies from DB (REQUIRED - fails if none) │
│ • Wire rotator to Dutchie client │
│ │ │
│ ▼ │
│ 4. register() with API │
│ • Optional - continues if fails │
│ │ │
│ ▼ │
│ 5. startRegistryHeartbeat() every 30s │
│ │ │
│ ▼ │
│ 6. processNextTask() loop │
│ │ │
│ ├── Poll for pending task (FOR UPDATE SKIP LOCKED) │
│ ├── Claim task atomically │
│ ├── Execute handler (product_refresh, store_discovery, etc.) │
│ ├── Mark complete/failed │
│ ├── Chain next task if applicable │
│ └── Loop │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
```
---
## Schedule Flow
```
┌─────────────────────────────────────────────────────────────────────────────┐
│ SCHEDULER POLL (every 60 seconds) │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ BEGIN TRANSACTION │
│ │ │
│ ▼ │
│ SELECT * FROM task_schedules │
│ WHERE enabled = true AND next_run_at <= NOW() │
│ FOR UPDATE SKIP LOCKED ◄─── Prevents duplicate execution across replicas │
│ │ │
│ ▼ │
│ For each due schedule: │
│ │ │
│ ├── payload_fetch_all │
│ │ └─► Query dispensaries needing crawl │
│ │ └─► Create payload_fetch tasks in worker_tasks │
│ │ │
│ ├── store_discovery_dutchie │
│ │ └─► Create single store_discovery task │
│ │ │
│ └── analytics_refresh │
│ └─► Create single analytics_refresh task │
│ │ │
│ ▼ │
│ UPDATE task_schedules SET │
│ last_run_at = NOW(), │
│ next_run_at = NOW() + interval_hours │
│ │ │
│ ▼ │
│ COMMIT │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
```
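
The due check and the `next_run_at` update from the poll above can be sketched in memory as follows. This is an illustration under assumptions, not the scheduler's actual code: `ScheduleRow`, `isDue` and `markRun` are hypothetical names, and the row type only covers the columns the poll touches. Note that in real PostgreSQL the integer `interval_hours` would need converting before it is added to a timestamp, for example `interval_hours * INTERVAL '1 hour'`.

```typescript
// Hypothetical in-memory model of a task_schedules row (subset of columns).
interface ScheduleRow {
  name: string;
  enabled: boolean;
  intervalHours: number;
  lastRunAt: Date | null;
  nextRunAt: Date | null;
}

const HOUR_MS = 60 * 60 * 1000;

// Mirrors: WHERE enabled = true AND next_run_at <= NOW()
// A NULL next_run_at never matches, as in SQL.
function isDue(row: ScheduleRow, now: Date): boolean {
  return (
    row.enabled &&
    row.nextRunAt !== null &&
    row.nextRunAt.getTime() <= now.getTime()
  );
}

// Mirrors the UPDATE: last_run_at = now, next_run_at = now + interval.
function markRun(row: ScheduleRow, now: Date): ScheduleRow {
  return {
    ...row,
    lastRunAt: now,
    nextRunAt: new Date(now.getTime() + row.intervalHours * HOUR_MS),
  };
}
```

Because `markRun` moves `next_run_at` forward inside the same transaction that holds the row lock, a second poll sees the schedule as not due until the interval has elapsed.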
---
## Task Lifecycle
```
  ┌──────────┐
  │ SCHEDULE │
  │   DUE    │
  └────┬─────┘
       │
       ▼
┌──────────────┐    claim    ┌──────────────┐    start    ┌──────────────┐
│   PENDING    │────────────►│   CLAIMED    │────────────►│   RUNNING    │
└──────────────┘             └──────────────┘             └──────┬───────┘
          ▲                                                      │
          │                                       ┌──────────────┼──────────────┐
          │ retry                                 │              │              │
          │ (if retries < max)                    ▼              ▼              ▼
          │                                  ┌──────────┐   ┌──────────┐   ┌──────────┐
          └──────────────────────────────────│  FAILED  │   │ COMPLETED│   │  STALE   │
                                             └──────────┘   └──────────┘   └────┬─────┘
                                                                                │
                                                                      recover_stale_tasks()
                                                                                │
                                                                                ▼
                                                                           ┌──────────┐
                                                                           │ PENDING  │
                                                                           └──────────┘
```
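
The lifecycle can also be read as a small state machine. The sketch below encodes only the transitions drawn in the diagram. `TaskStatus`, `ALLOWED_TRANSITIONS` and `canTransition` are hypothetical names for illustration, and the default of 3 retries is taken from the `max_retries` column default.

```typescript
type TaskStatus =
  | "pending"
  | "claimed"
  | "running"
  | "completed"
  | "failed"
  | "stale";

// Transitions shown in the lifecycle diagram (illustrative, not the real code).
const ALLOWED_TRANSITIONS: Record<TaskStatus, TaskStatus[]> = {
  pending: ["claimed"],
  claimed: ["running"],
  running: ["completed", "failed", "stale"],
  failed: ["pending"], // retry, only while retries remain
  stale: ["pending"], // via recover_stale_tasks()
  completed: [],
};

function canTransition(
  from: TaskStatus,
  to: TaskStatus,
  retryCount: number = 0,
  maxRetries: number = 3
): boolean {
  if (ALLOWED_TRANSITIONS[from].indexOf(to) === -1) return false;
  // A failed task only returns to pending if retries < max.
  if (from === "failed") return retryCount < maxRetries;
  return true;
}
```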
---
## Database Tables
### task_schedules (NEW - migration 079)
Stores schedule definitions. Survives restarts.
```sql
CREATE TABLE task_schedules (
id SERIAL PRIMARY KEY,
name VARCHAR(100) NOT NULL UNIQUE,
role VARCHAR(50) NOT NULL, -- product_refresh, store_discovery, etc.
enabled BOOLEAN DEFAULT TRUE,
interval_hours INTEGER NOT NULL, -- How often to run
priority INTEGER DEFAULT 0, -- Task priority when created
state_code VARCHAR(2), -- Optional filter
last_run_at TIMESTAMPTZ, -- When it last ran
next_run_at TIMESTAMPTZ, -- When it's due next
last_task_count INTEGER, -- Tasks created last run
last_error TEXT -- Error message if failed
);
```
### worker_tasks (migration 074)
The task queue. Workers pull from here.
```sql
CREATE TABLE worker_tasks (
id SERIAL PRIMARY KEY,
role task_role NOT NULL, -- What type of work
dispensary_id INTEGER, -- Which store (if applicable)
platform VARCHAR(50), -- Which platform
status task_status DEFAULT 'pending',
priority INTEGER DEFAULT 0, -- Higher = process first
scheduled_for TIMESTAMP, -- Don't process before this time
worker_id VARCHAR(100), -- Which worker claimed it
claimed_at TIMESTAMP,
started_at TIMESTAMP,
completed_at TIMESTAMP,
last_heartbeat_at TIMESTAMP, -- For stale detection
result JSONB,
error_message TEXT,
retry_count INTEGER DEFAULT 0,
max_retries INTEGER DEFAULT 3
);
```
---
## Default Schedules
| Name | Role | Interval | Priority | Description |
|------|------|----------|----------|-------------|
| `payload_fetch_all` | payload_fetch | 4 hours | 0 | Fetch payloads from Dutchie API (chains to product_refresh) |
| `store_discovery_dutchie` | store_discovery | 24 hours | 5 | Find new Dutchie stores |
| `analytics_refresh` | analytics_refresh | 6 hours | 0 | Refresh MVs |
---
## Task Roles
| Role | Description | Creates Tasks For |
|------|-------------|-------------------|
| `payload_fetch` | **NEW** - Fetch from Dutchie API, save to disk | Each dispensary needing crawl |
| `product_refresh` | **CHANGED** - Read local payload, normalize, upsert to DB | Chained from payload_fetch |
| `store_discovery` | Find new dispensaries, returns newStoreIds[] | Single task per platform |
| `entry_point_discovery` | **DEPRECATED** - Resolve platform IDs | No longer used |
| `product_discovery` | Initial product fetch for new stores | Chained from store_discovery |
| `analytics_refresh` | Refresh MVs | Single global task |
### Payload/Refresh Separation (2024-12-10)
The crawl workflow is now split into two phases:
```
payload_fetch (scheduled every 4h)
└─► Hit Dutchie GraphQL API
└─► Save raw JSON to /storage/payloads/{year}/{month}/{day}/store_{id}_{ts}.json.gz
└─► Record metadata in raw_crawl_payloads table
└─► Queue product_refresh task with payload_id
product_refresh (chained from payload_fetch)
└─► Load payload from filesystem (NOT from API)
└─► Normalize via DutchieNormalizer
└─► Upsert to store_products
└─► Create snapshots
└─► Track missing products
└─► Download images
```
**Benefits:**
- **Retry-friendly**: If normalize fails, re-run product_refresh without re-crawling
- **Replay-able**: Run product_refresh against any historical payload
- **Faster refreshes**: Local file read vs network call
- **Historical diffs**: Compare payloads to see what changed between crawls
- **Less API pressure**: Only payload_fetch hits Dutchie
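
As an illustration of the storage layout, the sketch below builds a payload path of the form `/storage/payloads/{year}/{month}/{day}/store_{id}_{ts}.json.gz`. Several details are assumptions that the text does not confirm: UTC dates, zero-padded month and day, and `{ts}` as epoch milliseconds. `payloadPath` is a hypothetical name.

```typescript
// Zero-pad a number to two digits (assumed formatting for month/day).
function pad2(n: number): string {
  return n < 10 ? "0" + n : String(n);
}

// Hypothetical path builder; assumes UTC and epoch-millisecond timestamps.
function payloadPath(
  dispensaryId: number,
  fetchedAt: Date,
  root: string = "/storage/payloads"
): string {
  const year = fetchedAt.getUTCFullYear();
  const month = pad2(fetchedAt.getUTCMonth() + 1);
  const day = pad2(fetchedAt.getUTCDate());
  const ts = fetchedAt.getTime();
  return `${root}/${year}/${month}/${day}/store_${dispensaryId}_${ts}.json.gz`;
}
```

Keeping the date in the directory structure makes it cheap to list or prune one day's payloads, which fits the replay and historical-diff benefits above.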
---
## Task Chaining
Tasks automatically queue follow-up tasks upon successful completion. This creates two main flows:
### Discovery Flow (New Stores)
When `store_discovery` finds new dispensaries, they automatically get their initial product data:
```
store_discovery
└─► Discovers new locations via Dutchie GraphQL
└─► Auto-promotes valid locations to dispensaries table
└─► Collects newDispensaryIds[] from promotions
└─► Returns { newStoreIds: [...] } in result
chainNextTask() detects newStoreIds
└─► Creates product_discovery task for each new store
product_discovery
└─► Calls handlePayloadFetch() internally
└─► payload_fetch hits Dutchie API
└─► Saves raw JSON to /storage/payloads/
└─► Queues product_refresh task with payload_id
product_refresh
└─► Loads payload from filesystem
└─► Normalizes and upserts to store_products
└─► Creates snapshots, downloads images
```
**Complete Discovery Chain:**
```
store_discovery → product_discovery → payload_fetch → product_refresh
(internal call) (queues next)
```
### Scheduled Flow (Existing Stores)
For existing stores, `payload_fetch_all` schedule runs every 4 hours:
```
TaskScheduler (every 60s)
└─► Checks task_schedules for due schedules
└─► payload_fetch_all is due
└─► Generates payload_fetch task for each dispensary
payload_fetch
└─► Hits Dutchie GraphQL API
└─► Saves raw JSON to /storage/payloads/
└─► Queues product_refresh task with payload_id
product_refresh
└─► Loads payload from filesystem (NOT API)
└─► Normalizes via DutchieNormalizer
└─► Upserts to store_products
└─► Creates snapshots
```
**Complete Scheduled Chain:**
```
payload_fetch → product_refresh
(queues) (reads local)
```
### Chaining Implementation
Task chaining is handled in three places:
1. **Internal chaining (handler calls handler):**
- `product_discovery` calls `handlePayloadFetch()` directly
2. **External chaining (chainNextTask() in task-service.ts):**
- Called after task completion
- `store_discovery` → queues `product_discovery` for each newStoreId
3. **Queue-based chaining (taskService.createTask):**
- `payload_fetch` queues `product_refresh` with `payload: { payload_id }`
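
The queued follow-ups can be summarized as a pure function from a completed task to the tasks it creates. This sketch folds the external and queue-based cases into one function for readability. In the real code they live in different places, as listed above. `CompletedTask`, `FollowUpTask` and `followUpTasks` are hypothetical names, and the `payloadId` result field is an assumption.

```typescript
type ChainRole =
  | "store_discovery"
  | "product_discovery"
  | "payload_fetch"
  | "product_refresh";

interface CompletedTask {
  role: ChainRole;
  dispensaryId?: number;
  result: { newStoreIds?: number[]; payloadId?: number };
}

interface FollowUpTask {
  role: ChainRole;
  dispensaryId: number;
  payload?: { payload_id: number };
}

function followUpTasks(task: CompletedTask): FollowUpTask[] {
  switch (task.role) {
    case "store_discovery":
      // One product_discovery task per newly promoted store.
      return (task.result.newStoreIds ?? []).map(
        (id): FollowUpTask => ({ role: "product_discovery", dispensaryId: id })
      );
    case "payload_fetch":
      // product_refresh reads the saved payload, identified by payload_id.
      if (task.dispensaryId === undefined || task.result.payloadId === undefined) {
        return [];
      }
      return [
        {
          role: "product_refresh",
          dispensaryId: task.dispensaryId,
          payload: { payload_id: task.result.payloadId },
        },
      ];
    default:
      // product_discovery chains internally; product_refresh ends the chain.
      return [];
  }
}
```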
---
## Payload API Endpoints
Raw crawl payloads can be accessed via the Payloads API:
| Endpoint | Method | Description |
|----------|--------|-------------|
| `GET /api/payloads` | GET | List payload metadata (paginated) |
| `GET /api/payloads/:id` | GET | Get payload metadata by ID |
| `GET /api/payloads/:id/data` | GET | Get full payload JSON (decompressed) |
| `GET /api/payloads/store/:dispensaryId` | GET | List payloads for a store |
| `GET /api/payloads/store/:dispensaryId/latest` | GET | Get latest payload for a store |
| `GET /api/payloads/store/:dispensaryId/diff` | GET | Diff two payloads for changes |
### Payload Diff Response
The diff endpoint returns:
```json
{
"success": true,
"from": { "id": 123, "fetchedAt": "...", "productCount": 100 },
"to": { "id": 456, "fetchedAt": "...", "productCount": 105 },
"diff": {
"added": 10,
"removed": 5,
"priceChanges": 8,
"stockChanges": 12
},
"details": {
"added": [...],
"removed": [...],
"priceChanges": [...],
"stockChanges": [...]
}
}
```
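
One way such a diff could be computed is sketched below. It is not the endpoint's actual implementation: the product shape (`id`, `price`, `inStock`) and the names `ProductLite` and `diffPayloads` are assumptions for illustration, and the `details` arrays hold only product IDs here.

```typescript
// Hypothetical minimal product shape; real payloads carry many more fields.
interface ProductLite {
  id: string;
  price: number;
  inStock: boolean;
}

interface PayloadDiffResult {
  diff: { added: number; removed: number; priceChanges: number; stockChanges: number };
  details: { added: string[]; removed: string[]; priceChanges: string[]; stockChanges: string[] };
}

function indexById(products: ProductLite[]): Map<string, ProductLite> {
  const byId = new Map<string, ProductLite>();
  for (const p of products) byId.set(p.id, p);
  return byId;
}

function diffPayloads(from: ProductLite[], to: ProductLite[]): PayloadDiffResult {
  const before = indexById(from);
  const after = indexById(to);
  const details: PayloadDiffResult["details"] = {
    added: [], removed: [], priceChanges: [], stockChanges: [],
  };

  after.forEach((next, id) => {
    const prev = before.get(id);
    if (prev === undefined) {
      details.added.push(id);
      return;
    }
    if (prev.price !== next.price) details.priceChanges.push(id);
    if (prev.inStock !== next.inStock) details.stockChanges.push(id);
  });
  before.forEach((_prev, id) => {
    if (!after.has(id)) details.removed.push(id);
  });

  return {
    diff: {
      added: details.added.length,
      removed: details.removed.length,
      priceChanges: details.priceChanges.length,
      stockChanges: details.stockChanges.length,
    },
    details,
  };
}
```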
---
## API Endpoints
### Schedules (NEW)
| Endpoint | Method | Description |
|----------|--------|-------------|
| `GET /api/schedules` | GET | List all schedules |
| `PUT /api/schedules/:id` | PUT | Update schedule |
| `POST /api/schedules/:id/trigger` | POST | Run schedule immediately |
### Task Creation (rewired 2024-12-10)
| Endpoint | Method | Description |
|----------|--------|-------------|
| `POST /api/job-queue/enqueue` | POST | Create single task |
| `POST /api/job-queue/enqueue-batch` | POST | Create batch tasks |
| `POST /api/job-queue/enqueue-state` | POST | Create tasks for state |
| `POST /api/tasks` | POST | Direct task creation |
### Task Management
| Endpoint | Method | Description |
|----------|--------|-------------|
| `GET /api/tasks` | GET | List tasks |
| `GET /api/tasks/:id` | GET | Get single task |
| `GET /api/tasks/counts` | GET | Task counts by status |
| `POST /api/tasks/recover-stale` | POST | Recover stale tasks |
---
## Key Files
| File | Purpose |
|------|---------|
| `src/services/task-scheduler.ts` | **NEW** - DB-driven scheduler |
| `src/tasks/task-worker.ts` | Worker that processes tasks |
| `src/tasks/task-service.ts` | Task CRUD operations |
| `src/tasks/handlers/payload-fetch.ts` | **NEW** - Fetches from API, saves to disk |
| `src/tasks/handlers/product-refresh.ts` | **CHANGED** - Reads from disk, processes to DB |
| `src/utils/payload-storage.ts` | **NEW** - Payload save/load utilities |
| `src/routes/tasks.ts` | Task API endpoints |
| `src/routes/job-queue.ts` | Job Queue UI endpoints (rewired) |
| `migrations/079_task_schedules.sql` | Schedule table |
| `migrations/080_raw_crawl_payloads.sql` | Payload metadata table |
| `migrations/081_payload_fetch_columns.sql` | payload, last_fetch_at columns |
| `migrations/074_worker_task_queue.sql` | Task queue table |
---
## Legacy Code (DEPRECATED)
| File | Status | Replacement |
|------|--------|-------------|
| `src/services/scheduler.ts` | DEPRECATED | `task-scheduler.ts` |
| `dispensary_crawl_jobs` table | ORPHANED | `worker_tasks` |
| `job_schedules` table | LEGACY | `task_schedules` |
---
## Dashboard Integration
Both pages remain wired to the dashboard:
| Page | Data Source | Actions |
|------|-------------|---------|
| **Job Queue** | `worker_tasks`, `task_schedules` | Create tasks, view schedules |
| **Task Queue** | `worker_tasks` | View tasks, recover stale |
---
## Multi-Replica Safety
The scheduler stays safe when several API server replicas run at once:
1. **Only one replica** executes a given schedule at a time (via `SELECT FOR UPDATE SKIP LOCKED`)
2. **No duplicate tasks** are created for the same due schedule
3. **Survives pod restarts** - state lives in the DB, not in memory
4. **Self-healing** - stale tasks are recovered on startup
```sql
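-- Rows locked by one replica's open transaction are skipped by the others,
-- so each due schedule is handled by at most one API server replica at a time
SELECT * FROM task_schedules
WHERE enabled = true AND next_run_at <= NOW()
FOR UPDATE SKIP LOCKED;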
```
---
## Worker Scaling (K8s)
Workers run as a StatefulSet in Kubernetes. You can scale from the admin UI or CLI.
### From Admin UI
The Workers page (`/admin/workers`) provides:
- Current replica count display
- Scale up/down buttons
- Target replica input
### API Endpoints
| Endpoint | Method | Description |
|----------|--------|-------------|
| `GET /api/workers/k8s/replicas` | GET | Get current/desired replica counts |
| `POST /api/workers/k8s/scale` | POST | Scale to N replicas (body: `{ replicas: N }`) |
### From CLI
```bash
# View current replicas
kubectl get statefulset scraper-worker -n dispensary-scraper
# Scale to 10 workers
kubectl scale statefulset scraper-worker -n dispensary-scraper --replicas=10
# Scale down to 3 workers
kubectl scale statefulset scraper-worker -n dispensary-scraper --replicas=3
```
### Configuration
Environment variables for the API server:
| Variable | Default | Description |
|----------|---------|-------------|
| `K8S_NAMESPACE` | `dispensary-scraper` | Kubernetes namespace |
| `K8S_WORKER_STATEFULSET` | `scraper-worker` | StatefulSet name |
### RBAC Requirements
The API server pod needs these K8s permissions:
```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
name: worker-scaler
namespace: dispensary-scraper
rules:
- apiGroups: ["apps"]
resources: ["statefulsets"]
verbs: ["get", "patch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
name: scraper-worker-scaler
namespace: dispensary-scraper
subjects:
- kind: ServiceAccount
name: default
namespace: dispensary-scraper
roleRef:
kind: Role
name: worker-scaler
apiGroup: rbac.authorization.k8s.io
```

@@ -0,0 +1,542 @@
# Worker Task Architecture
This document describes the unified task-based worker system that replaces the legacy fragmented job systems.
## Overview
The task worker architecture provides a single, unified system for managing all background work in CannaiQ:
- **Store discovery** - Find new dispensaries on platforms
- **Entry point discovery** - Resolve platform IDs from menu URLs
- **Product discovery** - Initial product fetch for new stores
- **Product resync** - Regular price/stock updates for existing stores
- **Analytics refresh** - Refresh materialized views and analytics
## Architecture
### Database Tables
**`worker_tasks`** - Central task queue
```sql
CREATE TABLE worker_tasks (
id SERIAL PRIMARY KEY,
role task_role NOT NULL, -- What type of work
dispensary_id INTEGER, -- Which store (if applicable)
platform VARCHAR(50), -- Which platform (dutchie, etc.)
status task_status DEFAULT 'pending',
priority INTEGER DEFAULT 0, -- Higher = process first
scheduled_for TIMESTAMP, -- Don't process before this time
worker_id VARCHAR(100), -- Which worker claimed it
claimed_at TIMESTAMP,
started_at TIMESTAMP,
completed_at TIMESTAMP,
last_heartbeat_at TIMESTAMP, -- For stale detection
result JSONB, -- Output from handler
error_message TEXT,
retry_count INTEGER DEFAULT 0,
max_retries INTEGER DEFAULT 3,
created_at TIMESTAMP DEFAULT NOW(),
updated_at TIMESTAMP DEFAULT NOW()
);
```
**Key indexes:**
- `idx_worker_tasks_pending_priority` - For efficient task claiming
- `idx_worker_tasks_active_dispensary` - Prevents concurrent tasks per store (partial unique index)
### Task Roles
| Role | Purpose | Per-Store | Scheduled |
|------|---------|-----------|-----------|
| `store_discovery` | Find new stores on a platform | No | Daily |
| `entry_point_discovery` | Resolve platform IDs | Yes | On-demand |
| `product_discovery` | Initial product fetch | Yes | After entry_point |
| `product_resync` | Price/stock updates | Yes | Every 4 hours |
| `analytics_refresh` | Refresh MVs | No | Daily |
### Task Lifecycle
```
pending → claimed → running → completed
                       ↓
                    failed
```
1. **pending** - Task is waiting to be picked up
2. **claimed** - Worker has claimed it (atomic via SELECT FOR UPDATE SKIP LOCKED)
3. **running** - Worker is actively processing
4. **completed** - Task finished successfully
5. **failed** - Task encountered an error
6. **stale** - Task lost its worker (recovered automatically)
## Files
### Core Files
| File | Purpose |
|------|---------|
| `src/tasks/task-service.ts` | TaskService - CRUD, claiming, capacity metrics |
| `src/tasks/task-worker.ts` | TaskWorker - Main worker loop |
| `src/tasks/index.ts` | Module exports |
| `src/routes/tasks.ts` | API endpoints |
| `migrations/074_worker_task_queue.sql` | Database schema |
### Task Handlers
| File | Role |
|------|------|
| `src/tasks/handlers/store-discovery.ts` | `store_discovery` |
| `src/tasks/handlers/entry-point-discovery.ts` | `entry_point_discovery` |
| `src/tasks/handlers/product-discovery.ts` | `product_discovery` |
| `src/tasks/handlers/product-resync.ts` | `product_resync` |
| `src/tasks/handlers/analytics-refresh.ts` | `analytics_refresh` |
## Running Workers
### Environment Variables
| Variable | Default | Description |
|----------|---------|-------------|
| `WORKER_ROLE` | (required) | Which task role to process |
| `WORKER_ID` | auto-generated | Custom worker identifier |
| `POLL_INTERVAL_MS` | 5000 | How often to check for tasks |
| `HEARTBEAT_INTERVAL_MS` | 30000 | How often to update heartbeat |
### Starting a Worker
```bash
# Start a product resync worker
WORKER_ROLE=product_resync npx tsx src/tasks/task-worker.ts
# Start with custom ID
WORKER_ROLE=product_resync WORKER_ID=resync-1 npx tsx src/tasks/task-worker.ts
# Start multiple workers for different roles
WORKER_ROLE=store_discovery npx tsx src/tasks/task-worker.ts &
WORKER_ROLE=product_resync npx tsx src/tasks/task-worker.ts &
```
### Kubernetes Deployment
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: task-worker-resync
spec:
replicas: 3
template:
spec:
containers:
- name: worker
image: code.cannabrands.app/creationshop/dispensary-scraper:latest
command: ["npx", "tsx", "src/tasks/task-worker.ts"]
env:
- name: WORKER_ROLE
value: "product_resync"
```
## API Endpoints
### Task Management
| Endpoint | Method | Description |
|----------|--------|-------------|
| `/api/tasks` | GET | List tasks with filters |
| `/api/tasks` | POST | Create a new task |
| `/api/tasks/:id` | GET | Get task by ID |
| `/api/tasks/counts` | GET | Get counts by status |
| `/api/tasks/capacity` | GET | Get capacity metrics |
| `/api/tasks/capacity/:role` | GET | Get role-specific capacity |
| `/api/tasks/recover-stale` | POST | Recover tasks from dead workers |
### Task Generation
| Endpoint | Method | Description |
|----------|--------|-------------|
| `/api/tasks/generate/resync` | POST | Generate daily resync tasks |
| `/api/tasks/generate/discovery` | POST | Create store discovery task |
### Migration (from legacy systems)
| Endpoint | Method | Description |
|----------|--------|-------------|
| `/api/tasks/migration/status` | GET | Compare old vs new systems |
| `/api/tasks/migration/disable-old-schedules` | POST | Disable job_schedules |
| `/api/tasks/migration/cancel-pending-crawl-jobs` | POST | Cancel old crawl jobs |
| `/api/tasks/migration/create-resync-tasks` | POST | Create tasks for all stores |
| `/api/tasks/migration/full-migrate` | POST | One-click migration |
### Role-Specific Endpoints
| Endpoint | Method | Description |
|----------|--------|-------------|
| `/api/tasks/role/:role/last-completion` | GET | Last completion time |
| `/api/tasks/role/:role/recent` | GET | Recent completions |
| `/api/tasks/store/:id/active` | GET | Check if store has active task |
## Capacity Planning
The `v_worker_capacity` view provides real-time metrics:
```sql
SELECT * FROM v_worker_capacity;
```
Returns:
- `pending_tasks` - Tasks waiting to be claimed
- `ready_tasks` - Tasks ready now (scheduled_for is null or past)
- `claimed_tasks` - Tasks claimed but not started
- `running_tasks` - Tasks actively processing
- `completed_last_hour` - Recent completions
- `failed_last_hour` - Recent failures
- `active_workers` - Workers with recent heartbeats
- `avg_duration_sec` - Average task duration
- `tasks_per_worker_hour` - Throughput estimate
- `estimated_hours_to_drain` - Time to clear queue
### Scaling Recommendations
```javascript
// API: GET /api/tasks/capacity/:role
{
"role": "product_resync",
"pending_tasks": 500,
"active_workers": 3,
"workers_needed": {
"for_1_hour": 10,
"for_4_hours": 3,
"for_8_hours": 2
}
}
```
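
The `workers_needed` figures are consistent with a simple ceiling formula, sketched below. The formula itself is an assumption, since the text does not state how the endpoint computes it, and `workersNeeded` is a hypothetical name. A throughput of 50 tasks per worker-hour is chosen only because it reproduces the example numbers.

```typescript
// Workers required to drain `pendingTasks` within `hours`,
// given an observed throughput per worker (tasks_per_worker_hour).
function workersNeeded(
  pendingTasks: number,
  tasksPerWorkerHour: number,
  hours: number
): number {
  if (tasksPerWorkerHour <= 0 || hours <= 0) {
    throw new RangeError("throughput and hours must be positive");
  }
  return Math.ceil(pendingTasks / (tasksPerWorkerHour * hours));
}

// With 500 pending tasks and an assumed 50 tasks per worker-hour:
const neededFor1h = workersNeeded(500, 50, 1); // 10
const neededFor4h = workersNeeded(500, 50, 4); // ceil(2.5) = 3
const neededFor8h = workersNeeded(500, 50, 8); // ceil(1.25) = 2
```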
## Task Chaining
Tasks can automatically create follow-up tasks:
```
store_discovery → entry_point_discovery → product_discovery
                                                  ↓
                                 (store has platform_dispensary_id)
                                                  ↓
                                         Daily resync tasks
```
The `chainNextTask()` method handles this automatically.
## Stale Task Recovery
Tasks are considered stale if `last_heartbeat_at` is older than the threshold (default 10 minutes).
```sql
SELECT recover_stale_tasks(10); -- 10 minute threshold
```
Or via API:
```bash
curl -X POST /api/tasks/recover-stale \
-H 'Content-Type: application/json' \
-d '{"threshold_minutes": 10}'
```
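
The staleness rule itself is a simple time comparison. The sketch below shows it outside the database. `isStale` is a hypothetical name, and treating a task with no heartbeat yet as not stale is an assumption about behavior the text does not specify.

```typescript
// True when the last heartbeat is older than the threshold.
function isStale(
  lastHeartbeatAt: Date | null,
  now: Date,
  thresholdMinutes: number = 10
): boolean {
  // Assumption: a task that has never sent a heartbeat is left alone here.
  if (lastHeartbeatAt === null) return false;
  const ageMs = now.getTime() - lastHeartbeatAt.getTime();
  return ageMs > thresholdMinutes * 60 * 1000;
}
```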
## Migration from Legacy Systems
### Legacy Systems Replaced
1. **job_schedules + job_run_logs** - Scheduled job definitions
2. **dispensary_crawl_jobs** - Per-dispensary crawl queue
3. **SyncOrchestrator + HydrationWorker** - Raw payload processing
### Migration Steps
**Option 1: One-Click Migration**
```bash
curl -X POST /api/tasks/migration/full-migrate
```
This will:
1. Disable all job_schedules
2. Cancel pending dispensary_crawl_jobs
3. Generate resync tasks for all stores
4. Create discovery and analytics tasks
**Option 2: Manual Migration**
```bash
# 1. Check current status
curl /api/tasks/migration/status
# 2. Disable old schedules
curl -X POST /api/tasks/migration/disable-old-schedules
# 3. Cancel pending crawl jobs
curl -X POST /api/tasks/migration/cancel-pending-crawl-jobs
# 4. Create resync tasks
curl -X POST /api/tasks/migration/create-resync-tasks \
-H 'Content-Type: application/json' \
-d '{"state_code": "AZ"}'
# 5. Generate daily resync schedule
curl -X POST /api/tasks/generate/resync \
-H 'Content-Type: application/json' \
-d '{"batches_per_day": 6}'
```
## Per-Store Locking
The system prevents concurrent tasks for the same store using a partial unique index:
```sql
CREATE UNIQUE INDEX idx_worker_tasks_active_dispensary
ON worker_tasks (dispensary_id)
WHERE dispensary_id IS NOT NULL
AND status IN ('claimed', 'running');
```
This ensures only one task can be active per store at any time.
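
The effect of the partial unique index can be mimicked in memory as a check before a task becomes active. This is only a model of the constraint, with hypothetical names (`QueueTask`, `canActivate`). In the real system the database enforces it and a violating update fails.

```typescript
interface QueueTask {
  id: number;
  dispensaryId: number | null;
  status: "pending" | "claimed" | "running" | "completed" | "failed";
}

// Mirrors the index predicate: dispensary_id IS NOT NULL
// AND status IN ('claimed', 'running').
function isActiveForStore(task: QueueTask): boolean {
  return (
    task.dispensaryId !== null &&
    (task.status === "claimed" || task.status === "running")
  );
}

// A candidate may become active only if no other active task
// exists for the same dispensary.
function canActivate(tasks: QueueTask[], candidate: QueueTask): boolean {
  if (candidate.dispensaryId === null) return true;
  return !tasks.some(
    (t) =>
      t.id !== candidate.id &&
      t.dispensaryId === candidate.dispensaryId &&
      isActiveForStore(t)
  );
}
```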
## Task Priority
Tasks are claimed in priority order (higher first), then by creation time:
```sql
ORDER BY priority DESC, created_at ASC
```
Default priorities:
- `store_discovery`: 0
- `entry_point_discovery`: 10 (high - new stores)
- `product_discovery`: 10 (high - new stores)
- `product_resync`: 0
- `analytics_refresh`: 0
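
The same ordering expressed as a comparator, as a minimal sketch (`PendingTask` and `claimOrder` are hypothetical names):

```typescript
interface PendingTask {
  id: number;
  priority: number;
  createdAt: Date;
}

// Equivalent of: ORDER BY priority DESC, created_at ASC
function claimOrder(tasks: PendingTask[]): PendingTask[] {
  return tasks.slice().sort(
    (a, b) =>
      b.priority - a.priority ||
      a.createdAt.getTime() - b.createdAt.getTime()
  );
}
```

With the default priorities above, a `product_discovery` task (priority 10) is claimed before any older `product_resync` task (priority 0).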
## Scheduled Tasks
Tasks can be scheduled for future execution:
```javascript
await taskService.createTask({
role: 'product_resync',
dispensary_id: 123,
scheduled_for: new Date('2025-01-10T06:00:00Z'),
});
```
The `generate_resync_tasks()` function creates staggered tasks throughout the day:
```sql
SELECT generate_resync_tasks(6, '2025-01-10'); -- 6 batches = every 4 hours
```
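
The staggering arithmetic can be sketched as follows: 6 batches per day gives one batch every 24 / 6 = 4 hours. How `generate_resync_tasks()` actually assigns stores to batches is not described here, so this sketch only computes the batch start times. `batchStartTimes` is a hypothetical name, and interpreting the date as UTC midnight is an assumption.

```typescript
const DAY_MS = 24 * 60 * 60 * 1000;

// Evenly spaced start times across one day, beginning at UTC midnight.
function batchStartTimes(batchesPerDay: number, day: string): Date[] {
  if (!Number.isInteger(batchesPerDay) || batchesPerDay < 1) {
    throw new RangeError("batchesPerDay must be a positive integer");
  }
  const start = Date.parse(day + "T00:00:00Z");
  if (Number.isNaN(start)) {
    throw new RangeError("day must look like YYYY-MM-DD");
  }
  const stepMs = DAY_MS / batchesPerDay;
  const times: Date[] = [];
  for (let i = 0; i < batchesPerDay; i++) {
    times.push(new Date(start + i * stepMs));
  }
  return times;
}
```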
## Dashboard Integration
The admin dashboard shows task queue status in the main overview:
```
Task Queue Summary
------------------
Pending: 45
Running: 3
Completed: 1,234
Failed: 12
```
Full task management is available at `/admin/tasks`.
## Error Handling
Failed tasks include the error message in `error_message` and can be retried:
```sql
-- View failed tasks
SELECT id, role, dispensary_id, error_message, retry_count
FROM worker_tasks
WHERE status = 'failed'
ORDER BY completed_at DESC
LIMIT 20;
-- Retry failed tasks
UPDATE worker_tasks
SET status = 'pending', retry_count = retry_count + 1
WHERE status = 'failed' AND retry_count < max_retries;
```
## Concurrent Task Processing (Added 2024-12)
Workers can now process multiple tasks concurrently within a single worker instance. This improves throughput by utilizing async I/O efficiently.
### Architecture
```
┌─────────────────────────────────────────────────────────────┐
│ Pod (K8s) │
│ │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ TaskWorker │ │
│ │ │ │
│ │ ┌─────────┐ ┌─────────┐ ┌─────────┐ │ │
│ │ │ Task 1 │ │ Task 2 │ │ Task 3 │ (concurrent)│ │
│ │ └─────────┘ └─────────┘ └─────────┘ │ │
│ │ │ │
│ │ Resource Monitor │ │
│ │ ├── Memory: 65% (threshold: 85%) │ │
│ │ ├── CPU: 45% (threshold: 90%) │ │
│ │ └── Status: Normal │ │
│ └─────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────┘
```
### Environment Variables
| Variable | Default | Description |
|----------|---------|-------------|
| `MAX_CONCURRENT_TASKS` | 3 | Maximum tasks a worker will run concurrently |
| `MEMORY_BACKOFF_THRESHOLD` | 0.85 | Back off when heap memory exceeds 85% |
| `CPU_BACKOFF_THRESHOLD` | 0.90 | Back off when CPU exceeds 90% |
| `BACKOFF_DURATION_MS` | 10000 | How long to wait when backing off (10s) |
### How It Works
1. **Main Loop**: Worker continuously tries to fill up to `MAX_CONCURRENT_TASKS`
2. **Resource Monitoring**: Before claiming a new task, worker checks memory and CPU
3. **Backoff**: If resources exceed thresholds, worker pauses and stops claiming new tasks
4. **Concurrent Execution**: Each task runs as its own `Promise`, so one task waiting on network or disk I/O doesn't block the others
5. **Graceful Shutdown**: On SIGTERM/decommission, worker stops claiming but waits for active tasks
### Resource Monitoring
```typescript
// ResourceStats interface
interface ResourceStats {
memoryPercent: number; // Current heap usage as decimal (0.0-1.0)
memoryMb: number; // Current heap used in MB
memoryTotalMb: number; // Total heap available in MB
cpuPercent: number; // CPU usage as percentage (0-100)
isBackingOff: boolean; // True if worker is in backoff state
backoffReason: string; // Why the worker is backing off
}
```
### Heartbeat Data
Workers report the following in their heartbeat:
```json
{
"worker_id": "worker-abc123",
"current_task_id": 456,
"current_task_ids": [456, 457, 458],
"active_task_count": 3,
"max_concurrent_tasks": 3,
"status": "active",
"resources": {
"memory_mb": 256,
"memory_total_mb": 512,
"memory_rss_mb": 320,
"memory_percent": 50,
"cpu_user_ms": 12500,
"cpu_system_ms": 3200,
"cpu_percent": 45,
"is_backing_off": false,
"backoff_reason": null
}
}
```
### Backoff Behavior
When resources exceed thresholds:
1. Worker logs the backoff reason:
```
[TaskWorker] MyWorker backing off: Memory at 87.3% (threshold: 85%)
```
2. Worker stops claiming new tasks but continues existing tasks
3. After `BACKOFF_DURATION_MS`, worker rechecks resources
4. When resources return to normal:
```
[TaskWorker] MyWorker resuming normal operation
```
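
The backoff decision can be sketched as a pure function over a resource sample. This is not the worker's actual `shouldBackOff()` code. The names are hypothetical, the memory message format is copied from the log line above, the CPU message format is an assumption, and CPU is assumed to arrive as 0-100 while both thresholds are decimals as in the environment variables.

```typescript
interface ResourceSample {
  memoryPercent: number; // heap usage as a decimal, 0.0-1.0
  cpuPercent: number; // CPU usage as a percentage, 0-100
}

interface BackoffThresholds {
  memory: number; // e.g. 0.85 (MEMORY_BACKOFF_THRESHOLD)
  cpu: number; // e.g. 0.90 (CPU_BACKOFF_THRESHOLD)
}

// Returns a human-readable reason when the worker should back off, else null.
function backoffReason(
  sample: ResourceSample,
  thresholds: BackoffThresholds = { memory: 0.85, cpu: 0.9 }
): string | null {
  if (sample.memoryPercent > thresholds.memory) {
    const used = (sample.memoryPercent * 100).toFixed(1);
    const limit = Math.round(thresholds.memory * 100);
    return `Memory at ${used}% (threshold: ${limit}%)`;
  }
  if (sample.cpuPercent / 100 > thresholds.cpu) {
    const limit = Math.round(thresholds.cpu * 100);
    return `CPU at ${sample.cpuPercent.toFixed(1)}% (threshold: ${limit}%)`;
  }
  return null;
}
```

A worker loop would call this before each claim and, on a non-null result, log the reason and wait `BACKOFF_DURATION_MS` before checking again.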
### UI Display
The Workers Dashboard shows:
- **Tasks Column**: `2/3 tasks` (active/max concurrent)
- **Resources Column**: Memory % and CPU % with color coding
- Green: < 50%
- Yellow: 50-74%
- Amber: 75-89%
- Red: 90%+
- **Backing Off**: Orange warning badge when worker is in backoff state
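
The color bands map to a simple threshold function. The sketch below is illustrative rather than the dashboard's actual component code. `usageColor` is a hypothetical name, and fractional values between bands (for example 74.5) are assumed to fall into the lower band.

```typescript
type UsageColor = "green" | "yellow" | "amber" | "red";

// Green: < 50, Yellow: 50-74, Amber: 75-89, Red: 90+
function usageColor(percent: number): UsageColor {
  if (percent >= 90) return "red";
  if (percent >= 75) return "amber";
  if (percent >= 50) return "yellow";
  return "green";
}
```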
### Task Count Badge Details
```
┌─────────────────────────────────────────────┐
│ Worker: "MyWorker" │
│ Tasks: 2/3 tasks #456, #457 │
│ Resources: 🧠 65% 💻 45% │
│ Status: ● Active │
└─────────────────────────────────────────────┘
```
### Best Practices
1. **Start Conservative**: Use `MAX_CONCURRENT_TASKS=3` initially
2. **Monitor Resources**: Watch for frequent backoffs in logs
3. **Tune Per Workload**: I/O-bound tasks benefit from higher concurrency
4. **Scale Horizontally**: Add more pods rather than cranking concurrency too high
### Code References
| File | Purpose |
|------|---------|
| `src/tasks/task-worker.ts:68-71` | Concurrency environment variables |
| `src/tasks/task-worker.ts:104-111` | ResourceStats interface |
| `src/tasks/task-worker.ts:149-179` | getResourceStats() method |
| `src/tasks/task-worker.ts:184-196` | shouldBackOff() method |
| `src/tasks/task-worker.ts:462-516` | mainLoop() with concurrent claiming |
| `src/routes/worker-registry.ts:148-195` | Heartbeat endpoint handling |
| `cannaiq/src/pages/WorkersDashboard.tsx:233-305` | UI components for resources |
## Monitoring
### Logs
Workers log to stdout:
```
[TaskWorker] Starting worker worker-product_resync-a1b2c3d4 for role: product_resync
[TaskWorker] Claimed task 123 (product_resync) for dispensary 456
[TaskWorker] Task 123 completed successfully
```
### Health Check
Check if workers are active:
```sql
SELECT worker_id, role, COUNT(*), MAX(last_heartbeat_at)
FROM worker_tasks
WHERE last_heartbeat_at > NOW() - INTERVAL '5 minutes'
GROUP BY worker_id, role;
```
### Metrics
```sql
-- Tasks by status
SELECT status, COUNT(*) FROM worker_tasks GROUP BY status;
-- Tasks by role
SELECT role, status, COUNT(*) FROM worker_tasks GROUP BY role, status;
-- Average duration by role
SELECT role, AVG(EXTRACT(EPOCH FROM (completed_at - started_at))) as avg_seconds
FROM worker_tasks
WHERE status = 'completed' AND completed_at > NOW() - INTERVAL '24 hours'
GROUP BY role;
```