feat: Stealth worker system with mandatory proxy rotation

## Worker System
- Role-agnostic workers that can handle any task type
- Pod-based architecture with StatefulSet (5-15 pods, 5 workers each)
- Custom pod names (Aethelgard, Xylos, Kryll, etc.)
- Worker registry with friendly names and resource monitoring
- Hub-and-spoke visualization on JobQueue page

## Stealth & Anti-Detection (REQUIRED)
- Proxies are MANDATORY - workers fail to start without active proxies
- CrawlRotator initializes on worker startup
- Loads proxies from `proxies` table
- Auto-rotates proxy + fingerprint on 403 errors
- 12 browser fingerprints (Chrome, Firefox, Safari, Edge)
- Locale/timezone matching for geographic consistency

## Task System
- Renamed product_resync → product_refresh
- Task chaining: store_discovery → entry_point → product_discovery
- Priority-based claiming with FOR UPDATE SKIP LOCKED
- Heartbeat and stale task recovery

## UI Updates
- JobQueue: Pod visualization, resource monitoring on hover
- WorkersDashboard: Simplified worker list
- Removed unused filters from task list

## Other
- IP2Location service for visitor analytics
- Findagram consumer features scaffolding
- Documentation updates

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Kelly
2025-12-10 00:44:59 -07:00
parent 0295637ed6
commit 56cc171287
61 changed files with 8591 additions and 2076 deletions

docs/CRAWL_SYSTEM_V2.md (new file, 353 lines)

@@ -0,0 +1,353 @@
# CannaiQ Crawl System V2
## Overview
The CannaiQ Crawl System is a GraphQL-based data pipeline that discovers and monitors cannabis dispensaries using the Dutchie platform. It operates in two phases:
1. **Phase 1: Store Discovery** - Weekly discovery of Dutchie-powered dispensaries
2. **Phase 2: Product Crawling** - Regular product/price/stock updates (documented separately)
---
## Phase 1: Store Discovery
### Purpose
Automatically discover and maintain a database of dispensaries that use Dutchie menus across all US states.
### Schedule
- **Frequency**: Weekly (typically Sunday night)
- **Duration**: ~2-4 hours for full US coverage
### Flow Diagram
```
┌─────────────────────────────────────────────────────────────────────┐
│ PHASE 1: STORE DISCOVERY │
└─────────────────────────────────────────────────────────────────────┘
1. IDENTITY SETUP
┌──────────────────┐
│ getRandomProxy() │ ──► Random IP from proxy pool
└──────────────────┘
┌──────────────────┐
│ startSession() │ ──► Random UA + fingerprint + locale matching proxy location
└──────────────────┘
2. CITY DISCOVERY (per state)
┌──────────────────────────────┐
│ GraphQL: getAllCitiesByState │ ──► Returns cities with active dispensaries
└──────────────────────────────┘
┌──────────────────────────────┐
│ Upsert dutchie_discovery_ │
│ cities table │
└──────────────────────────────┘
3. STORE DISCOVERY (per city)
┌───────────────────────────────┐
│ GraphQL: ConsumerDispensaries │ ──► Returns store data for city
└───────────────────────────────┘
┌───────────────────────────────┐
│ Upsert dutchie_discovery_ │
│ locations table │
└───────────────────────────────┘
4. VALIDATION & PROMOTION
┌──────────────────────────┐
│ validateForPromotion() │ ──► Check required fields
└──────────────────────────┘
┌──────────────────────────┐
│ promoteLocation() │ ──► Upsert to dispensaries table
└──────────────────────────┘
┌──────────────────────────┐
│ ensureCrawlerProfile() │ ──► Create profile with status='sandbox'
└──────────────────────────┘
5. DROPPED STORE DETECTION
┌──────────────────────────┐
│ detectDroppedStores() │ ──► Find stores missing from discovery
└──────────────────────────┘
┌──────────────────────────┐
│ Mark status='dropped' │ ──► Dashboard alert for review
└──────────────────────────┘
```
---
## Key Files
| File | Purpose |
|------|---------|
| `backend/src/platforms/dutchie/client.ts` | HTTP client with proxy/fingerprint rotation |
| `backend/src/discovery/discovery-crawler.ts` | Main discovery orchestrator |
| `backend/src/discovery/location-discovery.ts` | City/store GraphQL fetching |
| `backend/src/discovery/promotion.ts` | Validation and promotion logic |
| `backend/src/scripts/run-discovery.ts` | CLI entry point |
---
## Identity Masking
Before any GraphQL queries, the system establishes a masked identity:
### 1. Proxy Selection
```typescript
// backend/src/platforms/dutchie/client.ts
// Get random proxy from active pool (NOT state-specific)
const proxy = await getRandomProxy();
setProxy(proxy.url);
```
The proxy is selected randomly from the active proxy pool. It is NOT geo-targeted to the state being crawled.
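A minimal sketch of what that selection could look like, assuming a `proxies` table with `url` and `is_active` columns (the real schema and helper in the backend may differ):
```typescript
// Hypothetical sketch of getRandomProxy(); table/column names are assumptions.
import { Pool } from 'pg';

const pool = new Pool();

export async function getRandomProxy(): Promise<{ url: string }> {
  const { rows } = await pool.query(
    'SELECT url FROM proxies WHERE is_active = true ORDER BY random() LIMIT 1'
  );
  if (rows.length === 0) {
    // Proxies are mandatory: refuse to continue without one.
    throw new Error('No active proxies available');
  }
  return rows[0];
}
```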
### 2. Fingerprint + Locale Harmonization
```typescript
// backend/src/platforms/dutchie/client.ts
function startSession(stateCode: string, timezone: string) {
  // 1. Random browser fingerprint (Chrome/Firefox/Safari/Edge variants)
  const fingerprint = getRandomFingerprint();
  // 2. Match Accept-Language to proxy's timezone/location
  const locale = getLocaleForTimezone(timezone);
  // 3. Set headers for this session
  currentSession = {
    userAgent: fingerprint.ua,
    acceptLanguage: locale,
    secChUa: fingerprint.secChUa,
    // ... other fingerprint headers
  };
}
```
### Fingerprint Pool
6 browser fingerprints rotate on each session and on 403 errors:
| Browser | Version | Platform |
|---------|---------|----------|
| Chrome | 120 | Windows |
| Chrome | 120 | macOS |
| Firefox | 121 | Windows |
| Firefox | 121 | macOS |
| Safari | 17.2 | macOS |
| Edge | 120 | Windows |
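How the pool might be represented in code, with placeholder values for illustration (the actual fingerprint entries live in `client.ts` and may differ):
```typescript
// Illustrative only: shape of the fingerprint pool and random selection.
interface Fingerprint {
  ua: string;
  secChUa: string;
  platform: string;
}

const FINGERPRINTS: Fingerprint[] = [
  {
    ua: 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
    secChUa: '"Not_A Brand";v="8", "Chromium";v="120", "Google Chrome";v="120"',
    platform: 'Windows',
  },
  // ... one entry per row in the table above
];

export function getRandomFingerprint(): Fingerprint {
  return FINGERPRINTS[Math.floor(Math.random() * FINGERPRINTS.length)];
}
```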
### Timezone → Locale Mapping
```typescript
const TIMEZONE_TO_LOCALE: Record<string, string> = {
  'America/New_York': 'en-US,en;q=0.9',
  'America/Chicago': 'en-US,en;q=0.9',
  'America/Denver': 'en-US,en;q=0.9',
  'America/Los_Angeles': 'en-US,en;q=0.9',
  'America/Phoenix': 'en-US,en;q=0.9',
  // ...
};
```
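The locale lookup itself can be a simple map access with a default; a sketch using the map above (the fallback value is an assumption):
```typescript
// Resolve Accept-Language for the proxy's timezone; fall back to a generic
// US English locale when the timezone is not in the map (assumed default).
function getLocaleForTimezone(timezone: string): string {
  return TIMEZONE_TO_LOCALE[timezone] ?? 'en-US,en;q=0.9';
}
```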
---
## GraphQL Queries
### 1. getAllCitiesByState
Fetches cities with active dispensaries for a state.
```typescript
// backend/src/discovery/location-discovery.ts
const response = await executeGraphQL({
  operationName: 'getAllCitiesByState',
  variables: {
    state: 'AZ',
    countryCode: 'US'
  }
});
// Returns: { cities: [{ name: 'Phoenix', slug: 'phoenix' }, ...] }
```
**Hash**: `ae547a0466ace5a48f91e55bf6699eacd87e3a42841560f0c0eabed5a0a920e6`
### 2. ConsumerDispensaries
Fetches store data for a city/state.
```typescript
// backend/src/discovery/location-discovery.ts
const response = await executeGraphQL({
  operationName: 'ConsumerDispensaries',
  variables: {
    dispensaryFilter: {
      city: 'Phoenix',
      state: 'AZ',
      activeOnly: true
    }
  }
});
// Returns: [{ id, name, address, coords, menuUrl, ... }, ...]
```
**Hash**: `0a5bfa6ca1d64ae47bcccb7c8077c87147cbc4e6982c17ceec97a2a4948b311b`
---
## Database Tables
### Discovery Tables (Staging)
| Table | Purpose |
|-------|---------|
| `dutchie_discovery_cities` | Cities known to have dispensaries |
| `dutchie_discovery_locations` | Raw discovered store data |
### Canonical Tables
| Table | Purpose |
|-------|---------|
| `dispensaries` | Promoted stores ready for crawling |
| `dispensary_crawler_profiles` | Crawler configuration per store |
| `dutchie_promotion_log` | Audit trail for all discovery actions |
---
## Validation Rules
A discovery location must have these fields to be promoted:
| Field | Requirement |
|-------|-------------|
| `platform_location_id` | MongoDB ObjectId (24 hex chars) |
| `name` | Non-empty string |
| `city` | Non-empty string |
| `state_code` | Non-empty string |
| `platform_menu_url` | Valid URL |
Invalid records are marked `status='rejected'` with errors logged.
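A hedged sketch of what `validateForPromotion()` checks, based only on the table above (the real types and logic live in `promotion.ts`):
```typescript
// Sketch of the validation rules above; field names mirror the discovery table
// columns, and the return shape (list of error strings) is an assumption.
interface DiscoveryLocation {
  platform_location_id: string | null;
  name: string | null;
  city: string | null;
  state_code: string | null;
  platform_menu_url: string | null;
}

export function validateForPromotion(loc: DiscoveryLocation): string[] {
  const errors: string[] = [];
  if (!loc.platform_location_id || !/^[0-9a-f]{24}$/i.test(loc.platform_location_id)) {
    errors.push('platform_location_id must be a 24-char MongoDB ObjectId');
  }
  if (!loc.name?.trim()) errors.push('name is required');
  if (!loc.city?.trim()) errors.push('city is required');
  if (!loc.state_code?.trim()) errors.push('state_code is required');
  try {
    new URL(loc.platform_menu_url ?? '');
  } catch {
    errors.push('platform_menu_url must be a valid URL');
  }
  return errors; // empty array => eligible for promotion
}
```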
---
## Dropped Store Detection
After discovery, the system identifies stores that may have left the Dutchie platform:
### Detection Criteria
A store is marked as "dropped" if:
1. It has a `platform_dispensary_id` (was previously verified)
2. It's currently `status='open'` and `crawl_enabled=true`
3. It was NOT seen in the latest discovery (no row in `dutchie_discovery_locations` with `last_seen_at` within the last 24 hours)
### Implementation
```typescript
// backend/src/discovery/discovery-crawler.ts
export async function detectDroppedStores(pool: Pool, stateCode?: string) {
  // 1. Find dispensaries not in recent discovery
  // 2. Mark status='dropped'
  // 3. Log to dutchie_promotion_log
  // 4. Return list for dashboard alert
}
```
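The underlying query could look roughly like the following; table and column names are taken from this document, not from the actual implementation:
```typescript
// Hedged sketch of the dropped-store detection query described above.
import { Pool } from 'pg';

export async function findDroppedStoreIds(pool: Pool, stateCode?: string): Promise<number[]> {
  const { rows } = await pool.query(
    `SELECT d.id
       FROM dispensaries d
      WHERE d.platform_dispensary_id IS NOT NULL
        AND d.status = 'open'
        AND d.crawl_enabled = true
        AND ($1::text IS NULL OR d.state = $1)
        AND NOT EXISTS (
              SELECT 1
                FROM dutchie_discovery_locations l
               WHERE l.platform_location_id = d.platform_dispensary_id
                 AND l.last_seen_at > NOW() - INTERVAL '24 hours'
            )`,
    [stateCode ?? null]
  );
  return rows.map((r) => r.id);
}
```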
### Admin UI
- **Dashboard**: Red alert banner when dropped stores exist
- **Dispensaries page**: Filter by `status=dropped` to review
---
## CLI Usage
```bash
# Discover all stores in a state
npx tsx src/scripts/run-discovery.ts discover:state AZ
# Discover all US states
npx tsx src/scripts/run-discovery.ts discover:all
# Dry run (no DB writes)
npx tsx src/scripts/run-discovery.ts discover:state CA --dry-run
# Check stats
npx tsx src/scripts/run-discovery.ts stats
```
---
## Rate Limiting
- **2 seconds** between city requests
- **Exponential backoff** on 429/403 responses
- **Fingerprint rotation** on 403 errors
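A minimal sketch of the backoff behavior described above (delay values and the retry count are assumptions, not the real client code):
```typescript
// Sketch only: exponential backoff on 403 (with fingerprint rotation) and a
// fixed 30s wait on 429. Values are illustrative.
const sleep = (ms: number) => new Promise((resolve) => setTimeout(resolve, ms));

async function requestWithBackoff<T>(fn: () => Promise<T>, maxRetries = 3): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (err: any) {
      if (attempt >= maxRetries) throw err;
      if (err?.status === 403) {
        // rotate fingerprint here (the real rotation hook lives in client.ts)
        await sleep(2_000 * 2 ** attempt);
      } else if (err?.status === 429) {
        await sleep(30_000);
      } else {
        throw err; // non-retryable error
      }
    }
  }
}
```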
---
## Error Handling
| Error | Action |
|-------|--------|
| 403 Forbidden | Rotate fingerprint, retry |
| 429 Rate Limited | Wait 30s, retry |
| Network timeout | Retry up to 3 times |
| GraphQL error | Log and continue to next city |
---
## Monitoring
### Logs
Discovery progress is logged to stdout:
```
[Discovery] Starting discovery for state: AZ
[Discovery] Step 1: Initializing proxy...
[Discovery] Step 2: Fetching cities...
[Discovery] Found 45 cities for AZ
[Discovery] Step 3: Discovering locations...
[Discovery] City 1/45: Phoenix - found 28 stores
...
[Discovery] Step 4: Auto-promoting discovered locations...
[Discovery] Created: 5 new dispensaries
[Discovery] Updated: 40 existing dispensaries
[Discovery] Step 5: Detecting dropped stores...
[Discovery] Found 2 dropped stores
```
### Audit Log
All actions logged to `dutchie_promotion_log`:
| Action | Description |
|--------|-------------|
| `promoted_create` | New dispensary created |
| `promoted_update` | Existing dispensary updated |
| `rejected` | Validation failed |
| `dropped` | Store not found in discovery |
---
## Next: Phase 2
See `docs/PRODUCT_CRAWL_V2.md` for the product crawling phase (coming next).

docs/WORKER_SYSTEM.md (new file, 408 lines)

@@ -0,0 +1,408 @@
# CannaiQ Worker System
## Overview
The Worker System is a role-based task queue that processes background jobs. All tasks go into a single pool, and workers claim tasks based on their assigned role.
---
## Design Pattern: Single Pool, Role-Based Claiming
```
┌─────────────────────────────────────────┐
│ TASK POOL (worker_tasks) │
│ │
│ ┌─────────────────────────────────┐ │
│ │ role=store_discovery pending │ │
│ │ role=product_resync pending │ │
│ │ role=product_resync pending │ │
│ │ role=product_resync pending │ │
│ │ role=analytics_refresh pending │ │
│ │ role=entry_point_disc pending │ │
│ └─────────────────────────────────┘ │
└─────────────────────────────────────────┘
┌────────────────────────────┼────────────────────────────┐
│ │ │
▼ ▼ ▼
┌──────────────────┐ ┌──────────────────┐ ┌──────────────────┐
│ WORKER │ │ WORKER │ │ WORKER │
│ role=product_ │ │ role=product_ │ │ role=store_ │
│ resync │ │ resync │ │ discovery │
│ │ │ │ │ │
│ Claims ONLY │ │ Claims ONLY │ │ Claims ONLY │
│ product_resync │ │ product_resync │ │ store_discovery │
│ tasks │ │ tasks │ │ tasks │
└──────────────────┘ └──────────────────┘ └──────────────────┘
```
**Key Points:**
- All tasks go into ONE table (`worker_tasks`)
- Each worker is assigned ONE role at startup
- Workers only claim tasks matching their role
- Multiple workers can share the same role (horizontal scaling)
---
## Worker Roles
| Role | Purpose | Per-Store? | Schedule |
|------|---------|------------|----------|
| `store_discovery` | Find new dispensaries via GraphQL | No | Weekly |
| `entry_point_discovery` | Resolve platform IDs from menu URLs | Yes | On-demand |
| `product_discovery` | Initial product fetch for new stores | Yes | On-demand |
| `product_resync` | Regular price/stock updates | Yes | Every 4 hours |
| `analytics_refresh` | Refresh materialized views | No | Daily |
---
## Task Lifecycle
```
pending → claimed → running → completed
                        └──→ failed
                             (retry if < max_retries)
```
| Status | Meaning |
|--------|---------|
| `pending` | Waiting to be claimed |
| `claimed` | Worker has claimed, not yet started |
| `running` | Worker is actively processing |
| `completed` | Successfully finished |
| `failed` | Error occurred |
| `stale` | Worker died (heartbeat timeout) |
---
## Task Chaining
Tasks automatically create follow-up tasks:
```
store_discovery (finds new stores)
  └─ Returns newStoreIds[] in result
        ▼
entry_point_discovery (for each new store)
  └─ Resolves platform_dispensary_id
        ▼
product_discovery (initial crawl)
        ▼
(store enters regular schedule)
        ▼
product_resync (every 4 hours)
```
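As an illustration of the first link in that chain, a handler could queue follow-up tasks directly into `worker_tasks`; a hedged sketch (the real handlers may go through a task-service helper instead):
```typescript
// Hedged sketch: after store_discovery returns newStoreIds[], queue one
// entry_point_discovery task per new store. The priority value is illustrative.
import { Pool } from 'pg';

export async function chainEntryPointTasks(pool: Pool, newStoreIds: number[]): Promise<void> {
  for (const dispensaryId of newStoreIds) {
    await pool.query(
      `INSERT INTO worker_tasks (role, dispensary_id, status, priority)
       VALUES ('entry_point_discovery', $1, 'pending', 10)`,
      [dispensaryId]
    );
  }
}
```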
---
## How Claiming Works
### 1. Worker starts with a role
```bash
WORKER_ROLE=product_resync npx tsx src/tasks/task-worker.ts
```
### 2. Worker loop polls for tasks
```typescript
// Simplified worker loop
while (running) {
  const task = await claimTask(role, workerId);
  if (!task) {
    await sleep(5000); // No tasks, wait 5 seconds
    continue;
  }
  await processTask(task);
}
```
### 3. SQL function claims atomically
```sql
-- claim_task(role, worker_id)
UPDATE worker_tasks
SET status = 'claimed', worker_id = $2, claimed_at = NOW()
WHERE id = (
  SELECT id FROM worker_tasks
  WHERE role = $1                                          -- Filter by worker's role
    AND status = 'pending'
    AND (scheduled_for IS NULL OR scheduled_for <= NOW())
    AND (dispensary_id IS NULL OR dispensary_id NOT IN (   -- Per-store locking
          SELECT dispensary_id FROM worker_tasks
          WHERE status IN ('claimed', 'running')
            AND dispensary_id IS NOT NULL
        ))
  ORDER BY priority DESC, created_at ASC                   -- Priority ordering
  LIMIT 1
  FOR UPDATE SKIP LOCKED                                    -- Atomic, no race conditions
)
RETURNING *;
```
**Key Features:**
- `FOR UPDATE SKIP LOCKED` - Prevents race conditions between workers
- Role filtering - Worker only sees tasks for its role
- Per-store locking - Only one active task per dispensary
- Priority ordering - Higher priority tasks first
- Scheduled tasks - Respects `scheduled_for` timestamp
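On the Node side, `claimTask` can be a thin wrapper around this SQL; a hedged sketch, assuming `claim_task` is exposed as a set-returning function (the real wrapper lives in `src/tasks/task-service.ts`):
```typescript
// Hedged sketch of the claimTask helper used by the worker loop above.
import { Pool } from 'pg';

const pool = new Pool();

export interface WorkerTask {
  id: number;
  role: string;
  dispensary_id: number | null;
  status: string;
}

export async function claimTask(role: string, workerId: string): Promise<WorkerTask | null> {
  const { rows } = await pool.query('SELECT * FROM claim_task($1, $2)', [role, workerId]);
  return rows.length > 0 ? (rows[0] as WorkerTask) : null;
}
```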
---
## Heartbeat & Stale Recovery
Workers send heartbeats every 30 seconds while processing:
```typescript
// During task processing
setInterval(async () => {
  await pool.query(
    'UPDATE worker_tasks SET last_heartbeat_at = NOW() WHERE id = $1',
    [taskId]
  );
}, 30000);
```
If a worker dies, its tasks are recovered:
```sql
-- recover_stale_tasks(threshold_minutes)
UPDATE worker_tasks
SET status = 'pending', worker_id = NULL, retry_count = retry_count + 1
WHERE status IN ('claimed', 'running')
AND last_heartbeat_at < NOW() - INTERVAL '10 minutes'
AND retry_count < max_retries;
```
---
## Scheduling
### Daily Resync Generation
```sql
SELECT generate_resync_tasks(6, CURRENT_DATE); -- 6 batches = every 4 hours
```
Creates staggered tasks:
| Batch | Time | Stores |
|-------|------|--------|
| 1 | 00:00 | 1-50 |
| 2 | 04:00 | 51-100 |
| 3 | 08:00 | 101-150 |
| 4 | 12:00 | 151-200 |
| 5 | 16:00 | 201-250 |
| 6 | 20:00 | 251-300 |
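The staggering amounts to assigning each store a batch index and offsetting its `scheduled_for` by 24 hours divided by the batch count; a hedged sketch of that arithmetic (the real logic is the SQL function from migration 074, and the 50-store batch size is an assumption from the table above):
```typescript
// Illustrative arithmetic behind the staggered schedule above.
function scheduledFor(storeIndex: number, batches: number, dayStart: Date): Date {
  const batch = Math.floor(storeIndex / 50) % batches;  // which batch this store falls into
  const offsetHours = batch * (24 / batches);           // 6 batches => 4-hour spacing
  return new Date(dayStart.getTime() + offsetHours * 3_600_000);
}

// Store #120 of the day, 6 batches, midnight start => 08:00 slot
console.log(scheduledFor(120, 6, new Date('2025-12-10T00:00:00Z')));
```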
---
## Files
### Core
| File | Purpose |
|------|---------|
| `src/tasks/task-service.ts` | Task CRUD, claiming, capacity metrics |
| `src/tasks/task-worker.ts` | Worker loop, heartbeat, handler dispatch |
| `src/routes/tasks.ts` | REST API endpoints |
| `migrations/074_worker_task_queue.sql` | Database schema + SQL functions |
### Handlers
| File | Role |
|------|------|
| `src/tasks/handlers/store-discovery.ts` | `store_discovery` |
| `src/tasks/handlers/entry-point-discovery.ts` | `entry_point_discovery` |
| `src/tasks/handlers/product-discovery.ts` | `product_discovery` |
| `src/tasks/handlers/product-resync.ts` | `product_resync` |
| `src/tasks/handlers/analytics-refresh.ts` | `analytics_refresh` |
---
## Running Workers
### Local Development
```bash
# Start a single worker
WORKER_ROLE=product_resync npx tsx src/tasks/task-worker.ts
# Start multiple workers (different terminals)
WORKER_ROLE=product_resync WORKER_ID=resync-1 npx tsx src/tasks/task-worker.ts
WORKER_ROLE=product_resync WORKER_ID=resync-2 npx tsx src/tasks/task-worker.ts
WORKER_ROLE=store_discovery npx tsx src/tasks/task-worker.ts
```
### Environment Variables
| Variable | Default | Description |
|----------|---------|-------------|
| `WORKER_ROLE` | (required) | Which task role to process |
| `WORKER_ID` | auto-generated | Custom worker identifier |
| `POLL_INTERVAL_MS` | 5000 | How often to check for tasks |
| `HEARTBEAT_INTERVAL_MS` | 30000 | How often to update heartbeat |
### Kubernetes
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: task-worker-resync
spec:
  replicas: 5  # Scale horizontally
  template:
    spec:
      containers:
        - name: worker
          image: code.cannabrands.app/creationshop/dispensary-scraper:latest
          command: ["npx", "tsx", "src/tasks/task-worker.ts"]
          env:
            - name: WORKER_ROLE
              value: "product_resync"
```
---
## API Endpoints
### Task Management
| Method | Endpoint | Description |
|--------|----------|-------------|
| GET | `/api/tasks` | List tasks (with filters) |
| POST | `/api/tasks` | Create a task |
| GET | `/api/tasks/:id` | Get task by ID |
| GET | `/api/tasks/counts` | Counts by status |
| GET | `/api/tasks/capacity` | Capacity metrics |
| POST | `/api/tasks/recover-stale` | Recover dead worker tasks |
### Task Generation
| Method | Endpoint | Description |
|--------|----------|-------------|
| POST | `/api/tasks/generate/resync` | Generate daily resync batch |
| POST | `/api/tasks/generate/discovery` | Create store discovery task |
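For example, a script could trigger a discovery run through the API like this (base URL and request body shape are assumptions; see `src/routes/tasks.ts` for the actual contract):
```typescript
// Hedged example: kick off a store discovery task via the REST API.
const res = await fetch('http://localhost:3000/api/tasks/generate/discovery', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({ state: 'AZ' }),
});
console.log(res.status, await res.json());
```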
---
## Capacity Planning
The `v_worker_capacity` view provides metrics:
```sql
SELECT * FROM v_worker_capacity;
```
| Metric | Description |
|--------|-------------|
| `pending_tasks` | Tasks waiting |
| `ready_tasks` | Tasks ready now (scheduled_for passed) |
| `running_tasks` | Tasks being processed |
| `active_workers` | Workers with recent heartbeat |
| `tasks_per_worker_hour` | Throughput estimate |
| `estimated_hours_to_drain` | Time to clear queue |
### Scaling API
```bash
GET /api/tasks/capacity/product_resync
```
```json
{
  "pending_tasks": 500,
  "active_workers": 3,
  "workers_needed": {
    "for_1_hour": 10,
    "for_4_hours": 3,
    "for_8_hours": 2
  }
}
```
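The `workers_needed` figures follow from dividing the ready backlog by per-worker throughput and the target drain window; a hedged sketch of that arithmetic (the throughput below is illustrative, not a measured value):
```typescript
// Sketch: workers needed to drain a backlog within a target number of hours.
function workersNeeded(pendingTasks: number, tasksPerWorkerHour: number, hours: number): number {
  return Math.ceil(pendingTasks / (tasksPerWorkerHour * hours));
}

// With 500 pending tasks and ~50 tasks/worker/hour (illustrative throughput):
console.log(workersNeeded(500, 50, 1)); // 10 — matches "for_1_hour" above
console.log(workersNeeded(500, 50, 4)); // 3  — matches "for_4_hours"
console.log(workersNeeded(500, 50, 8)); // 2  — matches "for_8_hours"
```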
---
## Database Schema
### worker_tasks
```sql
CREATE TABLE worker_tasks (
  id SERIAL PRIMARY KEY,

  -- Task identification
  role VARCHAR(50) NOT NULL,
  dispensary_id INTEGER REFERENCES dispensaries(id),
  platform VARCHAR(20),

  -- State
  status VARCHAR(20) DEFAULT 'pending',
  priority INTEGER DEFAULT 0,
  scheduled_for TIMESTAMPTZ,

  -- Ownership
  worker_id VARCHAR(100),
  claimed_at TIMESTAMPTZ,
  started_at TIMESTAMPTZ,
  completed_at TIMESTAMPTZ,
  last_heartbeat_at TIMESTAMPTZ,

  -- Results
  result JSONB,
  error_message TEXT,
  retry_count INTEGER DEFAULT 0,
  max_retries INTEGER DEFAULT 3,

  created_at TIMESTAMPTZ DEFAULT NOW(),
  updated_at TIMESTAMPTZ DEFAULT NOW()
);
```
### Key Indexes
```sql
-- Fast claiming by role
CREATE INDEX idx_worker_tasks_pending
ON worker_tasks(role, priority DESC, created_at ASC)
WHERE status = 'pending';
-- Prevent duplicate active tasks per store
CREATE UNIQUE INDEX idx_worker_tasks_unique_active_store
ON worker_tasks(dispensary_id)
WHERE status IN ('claimed', 'running') AND dispensary_id IS NOT NULL;
```
---
## Monitoring
### Logs
```
[TaskWorker] Starting worker worker-product_resync-a1b2c3d4 for role: product_resync
[TaskWorker] Claimed task 123 (product_resync) for dispensary 456
[TaskWorker] Task 123 completed successfully
```
### Health Check
```sql
-- Active workers
SELECT worker_id, role, COUNT(*), MAX(last_heartbeat_at)
FROM worker_tasks
WHERE last_heartbeat_at > NOW() - INTERVAL '5 minutes'
GROUP BY worker_id, role;
-- Task counts by role/status
SELECT role, status, COUNT(*)
FROM worker_tasks
GROUP BY role, status;
```


@@ -33,8 +33,8 @@ or overwrites of existing data.
| Table | Purpose | Key Columns |
|-------|---------|-------------|
| `dispensaries` | Store locations | id, name, slug, city, state, platform_dispensary_id |
-| `dutchie_products` | Canonical products | id, dispensary_id, external_product_id, name, brand_name, stock_status |
-| `dutchie_product_snapshots` | Historical snapshots | dutchie_product_id, crawled_at, rec_min_price_cents |
+| `store_products` | Canonical products | id, dispensary_id, external_product_id, name, brand_name, stock_status |
+| `store_product_snapshots` | Historical snapshots | store_product_id, crawled_at, rec_min_price_cents |
| `brands` (view: v_brands) | Derived from products | brand_name, brand_id, product_count |
| `categories` (view: v_categories) | Derived from products | type, subcategory, product_count |
@@ -147,12 +147,10 @@ CREATE TABLE IF NOT EXISTS products_from_legacy (
---
-### 3. Dutchie Products
+### 3. Products (Legacy dutchie_products)
**Source:** `dutchie_legacy.dutchie_products`
-**Target:** `cannaiq.dutchie_products`
-These tables have nearly identical schemas. The mapping is direct:
+**Target:** `cannaiq.store_products`
| Legacy Column | Canonical Column | Notes |
|---------------|------------------|-------|
@@ -180,15 +178,15 @@ ON CONFLICT (dispensary_id, external_product_id) DO NOTHING
---
-### 4. Dutchie Product Snapshots
+### 4. Product Snapshots (Legacy dutchie_product_snapshots)
**Source:** `dutchie_legacy.dutchie_product_snapshots`
-**Target:** `cannaiq.dutchie_product_snapshots`
+**Target:** `cannaiq.store_product_snapshots`
| Legacy Column | Canonical Column | Notes |
|---------------|------------------|-------|
| id | - | Generate new |
-| dutchie_product_id | dutchie_product_id | Map via product lookup |
+| dutchie_product_id | store_product_id | Map via product lookup |
| dispensary_id | dispensary_id | Map via dispensary lookup |
| crawled_at | crawled_at | Direct |
| rec_min_price_cents | rec_min_price_cents | Direct |
@@ -201,7 +199,7 @@ ON CONFLICT (dispensary_id, external_product_id) DO NOTHING
```sql
-- No unique constraint on snapshots - all are historical records
-- Just INSERT, no conflict handling needed
-INSERT INTO dutchie_product_snapshots (...) VALUES (...)
+INSERT INTO store_product_snapshots (...) VALUES (...)
```
---