Major additions: - Multi-state expansion: states table, StateSelector, NationalDashboard, StateHeatmap, CrossStateCompare - Orchestrator services: trace service, error taxonomy, retry manager, proxy rotator - Discovery system: dutchie discovery service, geo validation, city seeding scripts - Analytics infrastructure: analytics v2 routes, brand/pricing/stores intelligence pages - Local development: setup-local.sh starts all 5 services (postgres, backend, cannaiq, findadispo, findagram) - Migrations 037-056: crawler profiles, states, analytics indexes, worker metadata Frontend pages added: - Discovery, ChainsDashboard, IntelligenceBrands, IntelligencePricing, IntelligenceStores - StateHeatmap, CrossStateCompare, SyncInfoPanel Components added: - StateSelector, OrchestratorTraceModal, WorkflowStepper 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
9.8 KiB
Legacy Data Mapping: dutchie_legacy → CannaiQ
Overview
This document describes the ETL mapping from the legacy dutchie_legacy database
to the canonical CannaiQ schema. All imports are INSERT-ONLY with no deletions
or overwrites of existing data.
Database Locations
| Database | Host | Purpose |
|---|---|---|
cannaiq |
localhost:54320 | Main CannaiQ application schema |
dutchie_legacy |
localhost:54320 | Imported historical data from old dutchie_menus |
Schema Comparison
Legacy Tables (dutchie_legacy)
| Table | Row Purpose | Key Columns |
|---|---|---|
dispensaries |
Store locations | id, name, slug, city, state, menu_url, menu_provider, product_provider |
products |
Legacy product records | id, dispensary_id, dutchie_product_id, name, brand, price, thc_percentage |
dutchie_products |
Dutchie-specific products | id, dispensary_id, external_product_id, name, brand_name, type, stock_status |
dutchie_product_snapshots |
Historical price/stock snapshots | dutchie_product_id, crawled_at, rec_min_price_cents, stock_status |
brands |
Brand entities | id, store_id, name, dispensary_id |
categories |
Product categories | id, store_id, name, slug |
price_history |
Legacy price tracking | product_id, price, recorded_at |
specials |
Deals/promotions | id, dispensary_id, name, discount_type |
CannaiQ Canonical Tables
| Table | Purpose | Key Columns |
|---|---|---|
dispensaries |
Store locations | id, name, slug, city, state, platform_dispensary_id |
dutchie_products |
Canonical products | id, dispensary_id, external_product_id, name, brand_name, stock_status |
dutchie_product_snapshots |
Historical snapshots | dutchie_product_id, crawled_at, rec_min_price_cents |
brands (view: v_brands) |
Derived from products | brand_name, brand_id, product_count |
categories (view: v_categories) |
Derived from products | type, subcategory, product_count |
Mapping Plan
1. Dispensaries
Source: dutchie_legacy.dispensaries
Target: cannaiq.dispensaries
| Legacy Column | Canonical Column | Notes |
|---|---|---|
| id | - | Generate new ID, store legacy_id |
| name | name | Direct map |
| slug | slug | Direct map |
| city | city | Direct map |
| state | state | Direct map |
| address | address | Direct map |
| zip | postal_code | Rename |
| latitude | latitude | Direct map |
| longitude | longitude | Direct map |
| menu_url | menu_url | Direct map |
| menu_provider | - | Store in raw_metadata |
| product_provider | - | Store in raw_metadata |
| website | website | Direct map |
| dba_name | - | Store in raw_metadata |
| - | platform | Set to 'dutchie' |
| - | legacy_id | New column: original ID from legacy |
Conflict Resolution:
- ON CONFLICT (slug, city, state) DO NOTHING
- Match on slug+city+state combination
- Never overwrite existing dispensary data
Staging Table: dispensaries_from_legacy
CREATE TABLE IF NOT EXISTS dispensaries_from_legacy (
id SERIAL PRIMARY KEY,
legacy_id INTEGER NOT NULL,
name VARCHAR(255) NOT NULL,
slug VARCHAR(255) NOT NULL,
city VARCHAR(100) NOT NULL,
state VARCHAR(10) NOT NULL,
postal_code VARCHAR(20),
address TEXT,
latitude DECIMAL(10,7),
longitude DECIMAL(10,7),
menu_url TEXT,
website TEXT,
legacy_metadata JSONB, -- All other legacy fields
imported_at TIMESTAMPTZ DEFAULT NOW(),
UNIQUE(legacy_id)
);
2. Products (Legacy products table)
Source: dutchie_legacy.products
Target: cannaiq.products_from_legacy (new staging table)
| Legacy Column | Canonical Column | Notes |
|---|---|---|
| id | legacy_product_id | Original ID |
| dispensary_id | legacy_dispensary_id | FK to legacy dispensary |
| dutchie_product_id | external_product_id | Dutchie's _id |
| name | name | Direct map |
| brand | brand_name | Direct map |
| price | price_cents | Multiply by 100 |
| original_price | original_price_cents | Multiply by 100 |
| thc_percentage | thc | Direct map |
| cbd_percentage | cbd | Direct map |
| strain_type | strain_type | Direct map |
| weight | weight | Direct map |
| image_url | primary_image_url | Direct map |
| in_stock | stock_status | Map: true→'in_stock', false→'out_of_stock' |
| first_seen_at | first_seen_at | Direct map |
| last_seen_at | last_seen_at | Direct map |
| raw_data | latest_raw_payload | Direct map |
Staging Table: products_from_legacy
CREATE TABLE IF NOT EXISTS products_from_legacy (
id SERIAL PRIMARY KEY,
legacy_product_id INTEGER NOT NULL,
legacy_dispensary_id INTEGER,
external_product_id VARCHAR(255),
name VARCHAR(500) NOT NULL,
brand_name VARCHAR(255),
type VARCHAR(100),
subcategory VARCHAR(100),
strain_type VARCHAR(50),
thc DECIMAL(10,4),
cbd DECIMAL(10,4),
price_cents INTEGER,
original_price_cents INTEGER,
stock_status VARCHAR(20),
weight VARCHAR(100),
primary_image_url TEXT,
first_seen_at TIMESTAMPTZ,
last_seen_at TIMESTAMPTZ,
legacy_raw_payload JSONB,
imported_at TIMESTAMPTZ DEFAULT NOW(),
UNIQUE(legacy_product_id)
);
3. Dutchie Products
Source: dutchie_legacy.dutchie_products
Target: cannaiq.dutchie_products
These tables have nearly identical schemas. The mapping is direct:
| Legacy Column | Canonical Column | Notes |
|---|---|---|
| id | - | Generate new, store as legacy_dutchie_product_id |
| dispensary_id | dispensary_id | Map via dispensary slug lookup |
| external_product_id | external_product_id | Direct (Dutchie _id) |
| platform_dispensary_id | platform_dispensary_id | Direct |
| name | name | Direct |
| brand_name | brand_name | Direct |
| type | type | Direct |
| subcategory | subcategory | Direct |
| strain_type | strain_type | Direct |
| thc/thc_content | thc/thc_content | Direct |
| cbd/cbd_content | cbd/cbd_content | Direct |
| stock_status | stock_status | Direct |
| images | images | Direct (JSONB) |
| latest_raw_payload | latest_raw_payload | Direct |
Conflict Resolution:
ON CONFLICT (dispensary_id, external_product_id) DO NOTHING
- Never overwrite existing products
- Skip duplicates silently
4. Dutchie Product Snapshots
Source: dutchie_legacy.dutchie_product_snapshots
Target: cannaiq.dutchie_product_snapshots
| Legacy Column | Canonical Column | Notes |
|---|---|---|
| id | - | Generate new |
| dutchie_product_id | dutchie_product_id | Map via product lookup |
| dispensary_id | dispensary_id | Map via dispensary lookup |
| crawled_at | crawled_at | Direct |
| rec_min_price_cents | rec_min_price_cents | Direct |
| rec_max_price_cents | rec_max_price_cents | Direct |
| stock_status | stock_status | Direct |
| options | options | Direct (JSONB) |
| raw_payload | raw_payload | Direct (JSONB) |
Conflict Resolution:
-- No unique constraint on snapshots - all are historical records
-- Just INSERT, no conflict handling needed
INSERT INTO dutchie_product_snapshots (...) VALUES (...)
5. Price History
Source: dutchie_legacy.price_history
Target: cannaiq.price_history_legacy (new staging table)
CREATE TABLE IF NOT EXISTS price_history_legacy (
id SERIAL PRIMARY KEY,
legacy_product_id INTEGER NOT NULL,
price_cents INTEGER,
recorded_at TIMESTAMPTZ,
imported_at TIMESTAMPTZ DEFAULT NOW()
);
ETL Process
Phase 1: Staging Tables (INSERT-ONLY)
- Create staging tables with
_from_legacyor_legacysuffix - Read from
dutchie_legacy.*tables in batches - INSERT into staging tables with ON CONFLICT DO NOTHING
- Log counts: read, inserted, skipped
Phase 2: ID Mapping
- Build ID mapping tables:
legacy_dispensary_id→canonical_dispensary_idlegacy_product_id→canonical_product_id
- Match on unique keys (slug+city+state for dispensaries, external_product_id for products)
Phase 3: Canonical Merge (Optional, User-Approved)
Only if explicitly requested:
- INSERT new records into canonical tables
- Never UPDATE existing records
- Never DELETE any records
Safety Rules
- INSERT-ONLY: No UPDATE, no DELETE, no TRUNCATE
- ON CONFLICT DO NOTHING: Skip duplicates, never overwrite
- Batch Processing: 500-1000 rows per batch to avoid memory issues
- Manual Invocation Only: ETL script requires explicit user execution
- Logging: Record all operations with counts and timestamps
- Dry Run Mode: Support
--dry-runflag to preview without writes
Validation Queries
After import, verify with:
-- Count imported dispensaries
SELECT COUNT(*) FROM dispensaries_from_legacy;
-- Count imported products
SELECT COUNT(*) FROM products_from_legacy;
-- Check for duplicates that were skipped
SELECT
(SELECT COUNT(*) FROM dutchie_legacy.dispensaries) as legacy_count,
(SELECT COUNT(*) FROM dispensaries_from_legacy) as imported_count;
-- Verify no data loss
SELECT
l.id as legacy_id,
l.name as legacy_name,
c.id as canonical_id
FROM dutchie_legacy.dispensaries l
LEFT JOIN dispensaries c ON c.slug = l.slug AND c.city = l.city AND c.state = l.state
WHERE c.id IS NULL
LIMIT 10;
Invocation
# From backend directory
npx tsx src/scripts/etl/legacy-import.ts
# With dry-run
npx tsx src/scripts/etl/legacy-import.ts --dry-run
# Import specific tables only
npx tsx src/scripts/etl/legacy-import.ts --tables=dispensaries,products
Environment Variables
The ETL script expects these environment variables (user configures):
# Connection to cannaiq-postgres (same host, different databases)
CANNAIQ_DB_HOST=localhost
CANNAIQ_DB_PORT=54320
CANNAIQ_DB_USER=cannaiq
CANNAIQ_DB_PASSWORD=<password>
CANNAIQ_DB_NAME=cannaiq
# Legacy database (same host, different database)
LEGACY_DB_HOST=localhost
LEGACY_DB_PORT=54320
LEGACY_DB_USER=dutchie
LEGACY_DB_PASSWORD=<password>
LEGACY_DB_NAME=dutchie_legacy