# Legacy Data Mapping: dutchie_legacy → CannaiQ

## Overview

This document describes the ETL mapping from the legacy `dutchie_legacy` database to the canonical CannaiQ schema. All imports are **INSERT-ONLY** with no deletions or overwrites of existing data.

## Database Locations

| Database | Host | Purpose |
|----------|------|---------|
| `cannaiq` | localhost:54320 | Main CannaiQ application schema |
| `dutchie_legacy` | localhost:54320 | Imported historical data from old dutchie_menus |

## Schema Comparison

### Legacy Tables (dutchie_legacy)

| Table | Purpose | Key Columns |
|-------|---------|-------------|
| `dispensaries` | Store locations | id, name, slug, city, state, menu_url, menu_provider, product_provider |
| `products` | Legacy product records | id, dispensary_id, dutchie_product_id, name, brand, price, thc_percentage |
| `dutchie_products` | Dutchie-specific products | id, dispensary_id, external_product_id, name, brand_name, type, stock_status |
| `dutchie_product_snapshots` | Historical price/stock snapshots | dutchie_product_id, crawled_at, rec_min_price_cents, stock_status |
| `brands` | Brand entities | id, store_id, name, dispensary_id |
| `categories` | Product categories | id, store_id, name, slug |
| `price_history` | Legacy price tracking | product_id, price, recorded_at |
| `specials` | Deals/promotions | id, dispensary_id, name, discount_type |

### CannaiQ Canonical Tables

| Table | Purpose | Key Columns |
|-------|---------|-------------|
| `dispensaries` | Store locations | id, name, slug, city, state, platform_dispensary_id |
| `store_products` | Canonical products | id, dispensary_id, external_product_id, name, brand_name, stock_status |
| `store_product_snapshots` | Historical snapshots | store_product_id, crawled_at, rec_min_price_cents |
| `brands` (view: v_brands) | Derived from products | brand_name, brand_id, product_count |
| `categories` (view: v_categories) | Derived from products | type, subcategory, product_count |


---

## Mapping Plan

### 1. Dispensaries

**Source:** `dutchie_legacy.dispensaries`
**Target:** `cannaiq.dispensaries`

| Legacy Column | Canonical Column | Notes |
|---------------|------------------|-------|
| id | - | Generate new ID, store legacy_id |
| name | name | Direct map |
| slug | slug | Direct map |
| city | city | Direct map |
| state | state | Direct map |
| address | address | Direct map |
| zip | postal_code | Rename |
| latitude | latitude | Direct map |
| longitude | longitude | Direct map |
| menu_url | menu_url | Direct map |
| menu_provider | - | Store in raw_metadata |
| product_provider | - | Store in raw_metadata |
| website | website | Direct map |
| dba_name | - | Store in raw_metadata |
| - | platform | Set to 'dutchie' |
| - | legacy_id | New column: original ID from legacy |

**Conflict Resolution:**

- `ON CONFLICT (slug, city, state) DO NOTHING`
- Match on slug+city+state combination
- Never overwrite existing dispensary data

**Staging Table:** `dispensaries_from_legacy`

```sql
CREATE TABLE IF NOT EXISTS dispensaries_from_legacy (
  id SERIAL PRIMARY KEY,
  legacy_id INTEGER NOT NULL,
  name VARCHAR(255) NOT NULL,
  slug VARCHAR(255) NOT NULL,
  city VARCHAR(100) NOT NULL,
  state VARCHAR(10) NOT NULL,
  postal_code VARCHAR(20),
  address TEXT,
  latitude DECIMAL(10,7),
  longitude DECIMAL(10,7),
  menu_url TEXT,
  website TEXT,
  legacy_metadata JSONB,  -- All other legacy fields
  imported_at TIMESTAMPTZ DEFAULT NOW(),
  UNIQUE(legacy_id)
);
```

---

### 2. Products (Legacy products table)

**Source:** `dutchie_legacy.products`
**Target:** `cannaiq.products_from_legacy` (new staging table)

| Legacy Column | Canonical Column | Notes |
|---------------|------------------|-------|
| id | legacy_product_id | Original ID |
| dispensary_id | legacy_dispensary_id | FK to legacy dispensary |
| dutchie_product_id | external_product_id | Dutchie's _id |
| name | name | Direct map |
| brand | brand_name | Direct map |
| price | price_cents | Multiply by 100, round to integer cents |
| original_price | original_price_cents | Multiply by 100, round to integer cents |
| thc_percentage | thc | Direct map |
| cbd_percentage | cbd | Direct map |
| strain_type | strain_type | Direct map |
| weight | weight | Direct map |
| image_url | primary_image_url | Direct map |
| in_stock | stock_status | Map: true→'in_stock', false→'out_of_stock' |
| first_seen_at | first_seen_at | Direct map |
| last_seen_at | last_seen_at | Direct map |
| raw_data | legacy_raw_payload | Direct map (JSONB) |

**Staging Table:** `products_from_legacy`

```sql
CREATE TABLE IF NOT EXISTS products_from_legacy (
  id SERIAL PRIMARY KEY,
  legacy_product_id INTEGER NOT NULL,
  legacy_dispensary_id INTEGER,
  external_product_id VARCHAR(255),
  name VARCHAR(500) NOT NULL,
  brand_name VARCHAR(255),
  type VARCHAR(100),
  subcategory VARCHAR(100),
  strain_type VARCHAR(50),
  thc DECIMAL(10,4),
  cbd DECIMAL(10,4),
  price_cents INTEGER,
  original_price_cents INTEGER,
  stock_status VARCHAR(20),
  weight VARCHAR(100),
  primary_image_url TEXT,
  first_seen_at TIMESTAMPTZ,
  last_seen_at TIMESTAMPTZ,
  legacy_raw_payload JSONB,
  imported_at TIMESTAMPTZ DEFAULT NOW(),
  UNIQUE(legacy_product_id)
);
```

---

### 3. Products (Legacy dutchie_products)

**Source:** `dutchie_legacy.dutchie_products`
**Target:** `cannaiq.store_products`

| Legacy Column | Canonical Column | Notes |
|---------------|------------------|-------|
| id | - | Generate new, store as legacy_dutchie_product_id |
| dispensary_id | dispensary_id | Map via dispensary slug lookup |
| external_product_id | external_product_id | Direct (Dutchie _id) |
| platform_dispensary_id | platform_dispensary_id | Direct |
| name | name | Direct |
| brand_name | brand_name | Direct |
| type | type | Direct |
| subcategory | subcategory | Direct |
| strain_type | strain_type | Direct |
| thc/thc_content | thc/thc_content | Direct |
| cbd/cbd_content | cbd/cbd_content | Direct |
| stock_status | stock_status | Direct |
| images | images | Direct (JSONB) |
| latest_raw_payload | latest_raw_payload | Direct |

**Conflict Resolution:**

```sql
ON CONFLICT (dispensary_id, external_product_id) DO NOTHING
```

- Never overwrite existing products
- Skip duplicates silently

---

### 4. Product Snapshots (Legacy dutchie_product_snapshots)

**Source:** `dutchie_legacy.dutchie_product_snapshots`
**Target:** `cannaiq.store_product_snapshots`

| Legacy Column | Canonical Column | Notes |
|---------------|------------------|-------|
| id | - | Generate new |
| dutchie_product_id | store_product_id | Map via product lookup |
| dispensary_id | dispensary_id | Map via dispensary lookup |
| crawled_at | crawled_at | Direct |
| rec_min_price_cents | rec_min_price_cents | Direct |
| rec_max_price_cents | rec_max_price_cents | Direct |
| stock_status | stock_status | Direct |
| options | options | Direct (JSONB) |
| raw_payload | raw_payload | Direct (JSONB) |

**Conflict Resolution:**

```sql
-- No unique constraint on snapshots: all are historical records
-- Just INSERT, no conflict handling needed
INSERT INTO store_product_snapshots (...) VALUES (...);
```

---

### 5. Price History

**Source:** `dutchie_legacy.price_history`
**Target:** `cannaiq.price_history_legacy` (new staging table)

```sql
CREATE TABLE IF NOT EXISTS price_history_legacy (
  id SERIAL PRIMARY KEY,
  legacy_product_id INTEGER NOT NULL,
  price_cents INTEGER,
  recorded_at TIMESTAMPTZ,
  imported_at TIMESTAMPTZ DEFAULT NOW()
);
```

---

## ETL Process

### Phase 1: Staging Tables (INSERT-ONLY)

1. Create staging tables with `_from_legacy` or `_legacy` suffix
2. Read from `dutchie_legacy.*` tables in batches
3. INSERT into staging tables with ON CONFLICT DO NOTHING
4. Log counts: read, inserted, skipped

### Phase 2: ID Mapping

1. Build ID mapping tables:
   - `legacy_dispensary_id` → `canonical_dispensary_id`
   - `legacy_product_id` → `canonical_product_id`
2. Match on unique keys (slug+city+state for dispensaries, external_product_id for products)

### Phase 3: Canonical Merge (Optional, User-Approved)

Only if explicitly requested:

1. INSERT new records into canonical tables
2. Never UPDATE existing records
3. Never DELETE any records

---

## Safety Rules

1. **INSERT-ONLY**: No UPDATE, no DELETE, no TRUNCATE
2. **ON CONFLICT DO NOTHING**: Skip duplicates, never overwrite
3. **Batch Processing**: 500-1000 rows per batch to avoid memory issues
4. **Manual Invocation Only**: ETL script requires explicit user execution
5. **Logging**: Record all operations with counts and timestamps
6. **Dry Run Mode**: Support `--dry-run` flag to preview without writes

---

## Validation Queries

After import, verify with the queries below. The queries that reference `dutchie_legacy.*` assume the legacy tables are reachable from the `cannaiq` connection (PostgreSQL cannot query across databases without `postgres_fdw` or `dblink`); otherwise run the counts against each database separately.

```sql
-- Count imported dispensaries
SELECT COUNT(*) FROM dispensaries_from_legacy;

-- Count imported products
SELECT COUNT(*) FROM products_from_legacy;

-- Compare legacy vs. imported counts (a gap means rows were skipped)
SELECT
  (SELECT COUNT(*) FROM dutchie_legacy.dispensaries) AS legacy_count,
  (SELECT COUNT(*) FROM dispensaries_from_legacy) AS imported_count;

-- Verify no data loss: legacy dispensaries with no canonical match (sample of 10)
SELECT l.id AS legacy_id, l.name AS legacy_name, c.id AS canonical_id
FROM dutchie_legacy.dispensaries l
LEFT JOIN dispensaries c
  ON c.slug = l.slug AND c.city = l.city AND c.state = l.state
WHERE c.id IS NULL
LIMIT 10;
```

---

## Invocation

```bash
# From backend directory
npx tsx src/scripts/etl/legacy-import.ts

# With dry-run
npx tsx src/scripts/etl/legacy-import.ts --dry-run

# Import specific tables only
npx tsx src/scripts/etl/legacy-import.ts --tables=dispensaries,products
```

---

## Environment Variables

The ETL script expects these environment variables (user configures):

```bash
# Connection to cannaiq-postgres (same host, different databases)
CANNAIQ_DB_HOST=localhost
CANNAIQ_DB_PORT=54320
CANNAIQ_DB_USER=cannaiq
CANNAIQ_DB_PASSWORD=
CANNAIQ_DB_NAME=cannaiq

# Legacy database (same host, different database)
LEGACY_DB_HOST=localhost
LEGACY_DB_PORT=54320
LEGACY_DB_USER=dutchie
LEGACY_DB_PASSWORD=
LEGACY_DB_NAME=dutchie_legacy
```
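
As an illustration of how a script might consume the variables above, here is a minimal sketch that reads one database's settings by prefix and fails fast when a required value is missing. The `DbConfig` type and `loadDbConfig` helper are hypothetical names introduced for this example; they are not taken from `legacy-import.ts`, and treating the password as optional is an assumption.

```typescript
// Hypothetical helper (not from legacy-import.ts): collect the connection
// settings for one database from variables sharing a prefix,
// e.g. "CANNAIQ_DB" or "LEGACY_DB".
interface DbConfig {
  host: string;
  port: number;
  user: string;
  password: string;
  database: string;
}

function loadDbConfig(
  prefix: string,
  env: Record<string, string | undefined>,
): DbConfig {
  // Look up a required variable and throw a descriptive error if it is unset.
  const required = (key: string): string => {
    const value = env[`${prefix}_${key}`];
    if (value === undefined || value === "") {
      throw new Error(`Missing environment variable ${prefix}_${key}`);
    }
    return value;
  };

  const port = parseInt(required("PORT"), 10);
  if (isNaN(port)) {
    throw new Error(`${prefix}_PORT must be an integer`);
  }

  return {
    host: required("HOST"),
    port,
    user: required("USER"),
    // Assumption: an empty password is allowed (e.g. local trust auth).
    password: env[`${prefix}_PASSWORD`] ?? "",
    database: required("NAME"),
  };
}
```

Under these assumptions, a script would call `loadDbConfig("CANNAIQ_DB", process.env)` and `loadDbConfig("LEGACY_DB", process.env)` once at startup, so a misconfigured environment is reported before any rows are read.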