# Legacy Data Mapping: dutchie_legacy → CannaiQ

## Overview

This document describes the ETL mapping from the legacy `dutchie_legacy` database
to the canonical CannaiQ schema. All imports are **INSERT-ONLY** with no deletions
or overwrites of existing data.

## Database Locations

| Database | Host | Purpose |
|----------|------|---------|
| `cannaiq` | localhost:54320 | Main CannaiQ application schema |
| `dutchie_legacy` | localhost:54320 | Imported historical data from the old `dutchie_menus` |

## Schema Comparison

### Legacy Tables (dutchie_legacy)

| Table | Purpose | Key Columns |
|-------|---------|-------------|
| `dispensaries` | Store locations | id, name, slug, city, state, menu_url, menu_provider, product_provider |
| `products` | Legacy product records | id, dispensary_id, dutchie_product_id, name, brand, price, thc_percentage |
| `dutchie_products` | Dutchie-specific products | id, dispensary_id, external_product_id, name, brand_name, type, stock_status |
| `dutchie_product_snapshots` | Historical price/stock snapshots | dutchie_product_id, crawled_at, rec_min_price_cents, stock_status |
| `brands` | Brand entities | id, store_id, name, dispensary_id |
| `categories` | Product categories | id, store_id, name, slug |
| `price_history` | Legacy price tracking | product_id, price, recorded_at |
| `specials` | Deals/promotions | id, dispensary_id, name, discount_type |

### CannaiQ Canonical Tables

| Table | Purpose | Key Columns |
|-------|---------|-------------|
| `dispensaries` | Store locations | id, name, slug, city, state, platform_dispensary_id |
| `store_products` | Canonical products | id, dispensary_id, external_product_id, name, brand_name, stock_status |
| `store_product_snapshots` | Historical snapshots | store_product_id, crawled_at, rec_min_price_cents |
| `brands` (view: v_brands) | Derived from products | brand_name, brand_id, product_count |
| `categories` (view: v_categories) | Derived from products | type, subcategory, product_count |


---

## Mapping Plan

### 1. Dispensaries

**Source:** `dutchie_legacy.dispensaries`
**Target:** `cannaiq.dispensaries`

| Legacy Column | Canonical Column | Notes |
|---------------|------------------|-------|
| id | - | Generate new ID, store legacy_id |
| name | name | Direct map |
| slug | slug | Direct map |
| city | city | Direct map |
| state | state | Direct map |
| address | address | Direct map |
| zip | postal_code | Rename |
| latitude | latitude | Direct map |
| longitude | longitude | Direct map |
| menu_url | menu_url | Direct map |
| menu_provider | - | Store in raw_metadata |
| product_provider | - | Store in raw_metadata |
| website | website | Direct map |
| dba_name | - | Store in raw_metadata |
| - | platform | Set to 'dutchie' |
| - | legacy_id | New column: original ID from legacy |

**Conflict Resolution:**

- `ON CONFLICT (slug, city, state) DO NOTHING`
- Match on the slug+city+state combination
- Never overwrite existing dispensary data

**Staging Table:** `dispensaries_from_legacy`
```sql
CREATE TABLE IF NOT EXISTS dispensaries_from_legacy (
  id SERIAL PRIMARY KEY,
  legacy_id INTEGER NOT NULL,
  name VARCHAR(255) NOT NULL,
  slug VARCHAR(255) NOT NULL,
  city VARCHAR(100) NOT NULL,
  state VARCHAR(10) NOT NULL,
  postal_code VARCHAR(20),
  address TEXT,
  latitude DECIMAL(10,7),
  longitude DECIMAL(10,7),
  menu_url TEXT,
  website TEXT,
  legacy_metadata JSONB, -- All other legacy fields
  imported_at TIMESTAMPTZ DEFAULT NOW(),
  UNIQUE(legacy_id)
);
```

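
As a rough illustration of the column mapping, the sketch below converts one legacy dispensary row into the `dispensaries_from_legacy` shape. The `LegacyDispensary` and `StagedDispensary` interfaces and the `toStagedDispensary` helper are hypothetical names, not taken from the ETL script. Placing `menu_provider`, `product_provider`, and `dba_name` in `legacy_metadata` is an assumption based on the "All other legacy fields" comment in the staging table.

```typescript
// Hypothetical row shapes; column names follow the mapping table above.
interface LegacyDispensary {
  id: number;
  name: string;
  slug: string;
  city: string;
  state: string;
  zip?: string | null;
  address?: string | null;
  latitude?: number | null;
  longitude?: number | null;
  menu_url?: string | null;
  website?: string | null;
  menu_provider?: string | null;
  product_provider?: string | null;
  dba_name?: string | null;
}

interface StagedDispensary {
  legacy_id: number;
  name: string;
  slug: string;
  city: string;
  state: string;
  postal_code: string | null;
  address: string | null;
  latitude: number | null;
  longitude: number | null;
  menu_url: string | null;
  website: string | null;
  legacy_metadata: Record<string, string | null>;
}

function toStagedDispensary(row: LegacyDispensary): StagedDispensary {
  return {
    legacy_id: row.id, // original ID is preserved, not reused as the new key
    name: row.name,
    slug: row.slug,
    city: row.city,
    state: row.state,
    postal_code: row.zip ?? null, // zip → postal_code rename
    address: row.address ?? null,
    latitude: row.latitude ?? null,
    longitude: row.longitude ?? null,
    menu_url: row.menu_url ?? null,
    website: row.website ?? null,
    // Assumption: fields without a direct target column go into the JSONB column.
    legacy_metadata: {
      menu_provider: row.menu_provider ?? null,
      product_provider: row.product_provider ?? null,
      dba_name: row.dba_name ?? null,
    },
  };
}
```
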

---

### 2. Products (Legacy products table)

**Source:** `dutchie_legacy.products`
**Target:** `cannaiq.products_from_legacy` (new staging table)

| Legacy Column | Canonical Column | Notes |
|---------------|------------------|-------|
| id | legacy_product_id | Original ID |
| dispensary_id | legacy_dispensary_id | FK to legacy dispensary |
| dutchie_product_id | external_product_id | Dutchie's _id |
| name | name | Direct map |
| brand | brand_name | Direct map |
| price | price_cents | Multiply by 100 |
| original_price | original_price_cents | Multiply by 100 |
| thc_percentage | thc | Direct map |
| cbd_percentage | cbd | Direct map |
| strain_type | strain_type | Direct map |
| weight | weight | Direct map |
| image_url | primary_image_url | Direct map |
| in_stock | stock_status | Map: true→'in_stock', false→'out_of_stock' |
| first_seen_at | first_seen_at | Direct map |
| last_seen_at | last_seen_at | Direct map |
| raw_data | legacy_raw_payload | Direct map |

**Staging Table:** `products_from_legacy`
```sql
CREATE TABLE IF NOT EXISTS products_from_legacy (
  id SERIAL PRIMARY KEY,
  legacy_product_id INTEGER NOT NULL,
  legacy_dispensary_id INTEGER,
  external_product_id VARCHAR(255),
  name VARCHAR(500) NOT NULL,
  brand_name VARCHAR(255),
  type VARCHAR(100),
  subcategory VARCHAR(100),
  strain_type VARCHAR(50),
  thc DECIMAL(10,4),
  cbd DECIMAL(10,4),
  price_cents INTEGER,
  original_price_cents INTEGER,
  stock_status VARCHAR(20),
  weight VARCHAR(100),
  primary_image_url TEXT,
  first_seen_at TIMESTAMPTZ,
  last_seen_at TIMESTAMPTZ,
  legacy_raw_payload JSONB,
  imported_at TIMESTAMPTZ DEFAULT NOW(),
  UNIQUE(legacy_product_id)
);
```

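
The two non-trivial conversions in the mapping above ("Multiply by 100" and the `in_stock` mapping) could be sketched as follows. The helper names `toCents` and `toStockStatus` are hypothetical. The sketch assumes the legacy `price` arrives as a JavaScript number in dollars, and that a missing value stays `null` rather than being guessed; neither assumption is confirmed by the source.

```typescript
type StockStatus = "in_stock" | "out_of_stock";

// Convert a decimal dollar amount to integer cents.
// Math.round guards against float artifacts: 19.99 * 100 evaluates to
// 1998.9999999999998, which would become 1998 if truncated.
function toCents(price: number | null | undefined): number | null {
  if (price === null || price === undefined || Number.isNaN(price)) {
    return null;
  }
  return Math.round(price * 100);
}

// Map the legacy boolean to the stock_status string from the mapping table.
function toStockStatus(inStock: boolean | null | undefined): StockStatus | null {
  if (inStock === null || inStock === undefined) {
    return null; // assumption: unknown stays unknown
  }
  return inStock ? "in_stock" : "out_of_stock";
}
```
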

---

### 3. Products (Legacy dutchie_products)

**Source:** `dutchie_legacy.dutchie_products`
**Target:** `cannaiq.store_products`

| Legacy Column | Canonical Column | Notes |
|---------------|------------------|-------|
| id | - | Generate new, store as legacy_dutchie_product_id |
| dispensary_id | dispensary_id | Map via dispensary slug lookup |
| external_product_id | external_product_id | Direct (Dutchie _id) |
| platform_dispensary_id | platform_dispensary_id | Direct |
| name | name | Direct |
| brand_name | brand_name | Direct |
| type | type | Direct |
| subcategory | subcategory | Direct |
| strain_type | strain_type | Direct |
| thc/thc_content | thc/thc_content | Direct |
| cbd/cbd_content | cbd/cbd_content | Direct |
| stock_status | stock_status | Direct |
| images | images | Direct (JSONB) |
| latest_raw_payload | latest_raw_payload | Direct |

**Conflict Resolution:**
```sql
ON CONFLICT (dispensary_id, external_product_id) DO NOTHING
```
- Never overwrite existing products
- Skip duplicates silently

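
To show how the conflict clause fits into a full statement, here is a minimal sketch that builds a parameterized multi-row `INSERT ... ON CONFLICT ... DO NOTHING` using PostgreSQL's numbered `$1, $2, ...` placeholders. The `buildInsertIgnore` helper is a hypothetical name, and it assumes the table and column names come from trusted constants, since they are interpolated into the SQL text.

```typescript
// Build an INSERT that skips rows violating the given unique key
// instead of overwriting them. Values are bound separately via $n placeholders.
function buildInsertIgnore(
  table: string,
  columns: string[],
  conflictColumns: string[],
  rowCount: number,
): string {
  const tuples: string[] = [];
  for (let r = 0; r < rowCount; r++) {
    // Row r uses placeholders r*width+1 .. r*width+width.
    const placeholders = columns.map((_, c) => `$${r * columns.length + c + 1}`);
    tuples.push(`(${placeholders.join(", ")})`);
  }
  return (
    `INSERT INTO ${table} (${columns.join(", ")}) VALUES ${tuples.join(", ")} ` +
    `ON CONFLICT (${conflictColumns.join(", ")}) DO NOTHING`
  );
}
```
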

---

### 4. Product Snapshots (Legacy dutchie_product_snapshots)

**Source:** `dutchie_legacy.dutchie_product_snapshots`
**Target:** `cannaiq.store_product_snapshots`

| Legacy Column | Canonical Column | Notes |
|---------------|------------------|-------|
| id | - | Generate new |
| dutchie_product_id | store_product_id | Map via product lookup |
| dispensary_id | dispensary_id | Map via dispensary lookup |
| crawled_at | crawled_at | Direct |
| rec_min_price_cents | rec_min_price_cents | Direct |
| rec_max_price_cents | rec_max_price_cents | Direct |
| stock_status | stock_status | Direct |
| options | options | Direct (JSONB) |
| raw_payload | raw_payload | Direct (JSONB) |

**Conflict Resolution:**
```sql
-- No unique constraint on snapshots - all are historical records
-- Just INSERT, no conflict handling needed
INSERT INTO store_product_snapshots (...) VALUES (...);
```

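
The "Map via product lookup" step could look roughly like the sketch below, which rewrites `dutchie_product_id` to `store_product_id` using an ID map built beforehand. The interfaces are trimmed to a few columns and all names are hypothetical. Skipping and counting snapshots whose product has no canonical ID is an assumption; the source does not say how such rows are handled.

```typescript
// Hypothetical, trimmed-down row shapes; only a few snapshot columns are shown.
interface LegacySnapshot {
  dutchie_product_id: number;
  crawled_at: string;
  rec_min_price_cents: number | null;
  stock_status: string | null;
}

interface CanonicalSnapshot {
  store_product_id: number;
  crawled_at: string;
  rec_min_price_cents: number | null;
  stock_status: string | null;
}

interface RemapResult {
  mapped: CanonicalSnapshot[];
  unmapped: number; // snapshots whose product has no canonical ID yet
}

function remapSnapshots(
  snapshots: LegacySnapshot[],
  productIdMap: Map<number, number>,
): RemapResult {
  const result: RemapResult = { mapped: [], unmapped: 0 };
  for (const snap of snapshots) {
    const storeProductId = productIdMap.get(snap.dutchie_product_id);
    if (storeProductId === undefined) {
      result.unmapped += 1; // assumption: skip and count rather than fail
      continue;
    }
    result.mapped.push({
      store_product_id: storeProductId,
      crawled_at: snap.crawled_at,
      rec_min_price_cents: snap.rec_min_price_cents,
      stock_status: snap.stock_status,
    });
  }
  return result;
}
```
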

---

### 5. Price History

**Source:** `dutchie_legacy.price_history`
**Target:** `cannaiq.price_history_legacy` (new staging table)

```sql
CREATE TABLE IF NOT EXISTS price_history_legacy (
  id SERIAL PRIMARY KEY,
  legacy_product_id INTEGER NOT NULL,
  price_cents INTEGER,
  recorded_at TIMESTAMPTZ,
  imported_at TIMESTAMPTZ DEFAULT NOW()
);
```


---

## ETL Process

### Phase 1: Staging Tables (INSERT-ONLY)

1. Create staging tables with `_from_legacy` or `_legacy` suffix
2. Read from `dutchie_legacy.*` tables in batches
3. INSERT into staging tables with `ON CONFLICT DO NOTHING`
4. Log counts: read, inserted, skipped

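
The batch loop and the read/inserted/skipped counters from the steps above can be sketched as below. `importInBatches` and the `insertBatch` callback are hypothetical; the callback stands in for whatever executes the `INSERT ... ON CONFLICT DO NOTHING` and returns the number of rows actually inserted. It is synchronous here only to keep the sketch short, whereas a real database call would be asynchronous.

```typescript
interface BatchStats {
  read: number;
  inserted: number;
  skipped: number;
}

// Split rows into consecutive batches of at most `size` rows.
function chunk<T>(rows: T[], size: number): T[][] {
  if (size <= 0) {
    throw new Error("batch size must be positive");
  }
  const batches: T[][] = [];
  for (let i = 0; i < rows.length; i += size) {
    batches.push(rows.slice(i, i + size));
  }
  return batches;
}

// Feed each batch to insertBatch and accumulate counts.
// Rows that were read but not inserted are counted as skipped duplicates.
function importInBatches<T>(
  rows: T[],
  size: number,
  insertBatch: (batch: T[]) => number,
): BatchStats {
  const stats: BatchStats = { read: 0, inserted: 0, skipped: 0 };
  for (const batch of chunk(rows, size)) {
    const inserted = insertBatch(batch);
    stats.read += batch.length;
    stats.inserted += inserted;
    stats.skipped += batch.length - inserted;
  }
  return stats;
}
```

With 2,500 source rows and a batch size of 1,000, the callback runs three times (1,000, 1,000, and 500 rows), and the returned stats give the three counts to log.
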
### Phase 2: ID Mapping

1. Build ID mapping tables:
   - `legacy_dispensary_id` → `canonical_dispensary_id`
   - `legacy_product_id` → `canonical_product_id`
2. Match on unique keys (slug+city+state for dispensaries, `external_product_id` for products)

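
A minimal in-memory sketch of the dispensary mapping is shown below. `buildDispensaryIdMap` and `naturalKey` are hypothetical names. The sketch assumes exact, case-sensitive equality on slug, city, and state; whether the real ETL normalizes these values is not stated in the source.

```typescript
interface DispensaryKeyed {
  id: number;
  slug: string;
  city: string;
  state: string;
}

// Encode the composite key as JSON so that no delimiter can collide
// with characters inside slug, city, or state.
function naturalKey(d: DispensaryKeyed): string {
  return JSON.stringify([d.slug, d.city, d.state]);
}

// Return a map of legacy dispensary ID → canonical dispensary ID.
// Legacy rows without a canonical match are simply absent from the map.
function buildDispensaryIdMap(
  legacy: DispensaryKeyed[],
  canonical: DispensaryKeyed[],
): Map<number, number> {
  const canonicalByKey = new Map<string, number>();
  for (const c of canonical) {
    canonicalByKey.set(naturalKey(c), c.id);
  }
  const idMap = new Map<number, number>();
  for (const l of legacy) {
    const canonicalId = canonicalByKey.get(naturalKey(l));
    if (canonicalId !== undefined) {
      idMap.set(l.id, canonicalId);
    }
  }
  return idMap;
}
```

The same pattern would apply to products, keyed on `external_product_id` instead of the three-column dispensary key.
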
### Phase 3: Canonical Merge (Optional, User-Approved)

Only if explicitly requested:
1. INSERT new records into canonical tables
2. Never UPDATE existing records
3. Never DELETE any records


---

## Safety Rules

1. **INSERT-ONLY**: No UPDATE, no DELETE, no TRUNCATE
2. **ON CONFLICT DO NOTHING**: Skip duplicates, never overwrite
3. **Batch Processing**: 500-1000 rows per batch to avoid memory issues
4. **Manual Invocation Only**: ETL script requires explicit user execution
5. **Logging**: Record all operations with counts and timestamps
6. **Dry Run Mode**: Support `--dry-run` flag to preview without writes


---

## Validation Queries

After import, verify with:

```sql
-- Count imported dispensaries
SELECT COUNT(*) FROM dispensaries_from_legacy;

-- Count imported products
SELECT COUNT(*) FROM products_from_legacy;

-- Compare legacy and imported counts; a difference indicates skipped duplicates.
-- Note: the dutchie_legacy.* references assume the legacy tables are reachable
-- from the cannaiq connection as a schema (a plain PostgreSQL connection cannot
-- query a different database without postgres_fdw or dblink).
SELECT
  (SELECT COUNT(*) FROM dutchie_legacy.dispensaries) AS legacy_count,
  (SELECT COUNT(*) FROM dispensaries_from_legacy) AS imported_count;

-- Verify no data loss: list up to 10 legacy dispensaries with no canonical match
SELECT
  l.id AS legacy_id,
  l.name AS legacy_name,
  c.id AS canonical_id
FROM dutchie_legacy.dispensaries l
LEFT JOIN dispensaries c ON c.slug = l.slug AND c.city = l.city AND c.state = l.state
WHERE c.id IS NULL
LIMIT 10;
```


---

## Invocation

```bash
# From backend directory
npx tsx src/scripts/etl/legacy-import.ts

# With dry-run
npx tsx src/scripts/etl/legacy-import.ts --dry-run

# Import specific tables only
npx tsx src/scripts/etl/legacy-import.ts --tables=dispensaries,products
```

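
For illustration, the two flags shown above could be parsed along these lines. This is a sketch only: `parseArgs` and `CliOptions` are hypothetical names, and rejecting unknown arguments is an assumption rather than documented behavior of `legacy-import.ts`.

```typescript
interface CliOptions {
  dryRun: boolean;
  tables: string[] | null; // null means "import all tables"
}

const TABLES_PREFIX = "--tables=";

// Example call in a script: parseArgs(process.argv.slice(2))
function parseArgs(argv: string[]): CliOptions {
  const options: CliOptions = { dryRun: false, tables: null };
  for (const arg of argv) {
    if (arg === "--dry-run") {
      options.dryRun = true;
    } else if (arg.indexOf(TABLES_PREFIX) === 0) {
      // Split the comma-separated list and drop empty entries.
      options.tables = arg
        .slice(TABLES_PREFIX.length)
        .split(",")
        .map((t) => t.trim())
        .filter((t) => t.length > 0);
    } else {
      // Assumption: fail loudly on unknown flags rather than ignore them.
      throw new Error(`Unknown argument: ${arg}`);
    }
  }
  return options;
}
```
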

---

## Environment Variables

The ETL script expects these environment variables, which the user must configure:

```bash
# Connection to cannaiq-postgres (same host, different databases)
CANNAIQ_DB_HOST=localhost
CANNAIQ_DB_PORT=54320
CANNAIQ_DB_USER=cannaiq
CANNAIQ_DB_PASSWORD=<password>
CANNAIQ_DB_NAME=cannaiq

# Legacy database (same host, different database)
LEGACY_DB_HOST=localhost
LEGACY_DB_PORT=54320
LEGACY_DB_USER=dutchie
LEGACY_DB_PASSWORD=<password>
LEGACY_DB_NAME=dutchie_legacy
```

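
One way the script might read these variables is sketched below. `readDbConfig` and `DbConfig` are hypothetical names, and failing fast on a missing or empty variable is an assumption. The environment is passed in as a plain object so the function can be exercised without touching the real process environment.

```typescript
interface DbConfig {
  host: string;
  port: number;
  user: string;
  password: string;
  database: string;
}

type Env = Record<string, string | undefined>;

// Read the five <PREFIX>_DB_* variables listed above for one database.
function readDbConfig(env: Env, prefix: "CANNAIQ" | "LEGACY"): DbConfig {
  const get = (suffix: string): string => {
    const key = `${prefix}_DB_${suffix}`;
    const value = env[key];
    if (value === undefined || value === "") {
      throw new Error(`Missing environment variable ${key}`);
    }
    return value;
  };
  const port = Number(get("PORT"));
  if (!Number.isInteger(port) || port <= 0) {
    throw new Error(`${prefix}_DB_PORT must be a positive integer`);
  }
  return {
    host: get("HOST"),
    port,
    user: get("USER"),
    password: get("PASSWORD"),
    database: get("NAME"),
  };
}
```

In a script this would typically be called twice, once with `"CANNAIQ"` and once with `"LEGACY"`, passing `process.env` as the first argument.
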