Files
cannaiq/docs/legacy_mapping.md
Kelly b4a2fb7d03 feat: Add v2 architecture with multi-state support and orchestrator services
Major additions:
- Multi-state expansion: states table, StateSelector, NationalDashboard, StateHeatmap, CrossStateCompare
- Orchestrator services: trace service, error taxonomy, retry manager, proxy rotator
- Discovery system: dutchie discovery service, geo validation, city seeding scripts
- Analytics infrastructure: analytics v2 routes, brand/pricing/stores intelligence pages
- Local development: setup-local.sh starts all 5 services (postgres, backend, cannaiq, findadispo, findagram)
- Migrations 037-056: crawler profiles, states, analytics indexes, worker metadata

Frontend pages added:
- Discovery, ChainsDashboard, IntelligenceBrands, IntelligencePricing, IntelligenceStores
- StateHeatmap, CrossStateCompare, SyncInfoPanel

Components added:
- StateSelector, OrchestratorTraceModal, WorkflowStepper

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-07 11:30:57 -07:00

9.8 KiB

Legacy Data Mapping: dutchie_legacy → CannaiQ

Overview

This document describes the ETL mapping from the legacy dutchie_legacy database to the canonical CannaiQ schema. All imports are INSERT-ONLY with no deletions or overwrites of existing data.

Database Locations

Database Host Purpose
cannaiq localhost:54320 Main CannaiQ application schema
dutchie_legacy localhost:54320 Imported historical data from old dutchie_menus

Schema Comparison

Legacy Tables (dutchie_legacy)

Table Row Purpose Key Columns
dispensaries Store locations id, name, slug, city, state, menu_url, menu_provider, product_provider
products Legacy product records id, dispensary_id, dutchie_product_id, name, brand, price, thc_percentage
dutchie_products Dutchie-specific products id, dispensary_id, external_product_id, name, brand_name, type, stock_status
dutchie_product_snapshots Historical price/stock snapshots dutchie_product_id, crawled_at, rec_min_price_cents, stock_status
brands Brand entities id, store_id, name, dispensary_id
categories Product categories id, store_id, name, slug
price_history Legacy price tracking product_id, price, recorded_at
specials Deals/promotions id, dispensary_id, name, discount_type

CannaiQ Canonical Tables

Table Purpose Key Columns
dispensaries Store locations id, name, slug, city, state, platform_dispensary_id
dutchie_products Canonical products id, dispensary_id, external_product_id, name, brand_name, stock_status
dutchie_product_snapshots Historical snapshots dutchie_product_id, crawled_at, rec_min_price_cents
brands (view: v_brands) Derived from products brand_name, brand_id, product_count
categories (view: v_categories) Derived from products type, subcategory, product_count

Mapping Plan

1. Dispensaries

Source: dutchie_legacy.dispensaries Target: cannaiq.dispensaries

Legacy Column Canonical Column Notes
id - Generate new ID, store legacy_id
name name Direct map
slug slug Direct map
city city Direct map
state state Direct map
address address Direct map
zip postal_code Rename
latitude latitude Direct map
longitude longitude Direct map
menu_url menu_url Direct map
menu_provider - Store in raw_metadata
product_provider - Store in raw_metadata
website website Direct map
dba_name - Store in raw_metadata
- platform Set to 'dutchie'
- legacy_id New column: original ID from legacy

Conflict Resolution:

  • ON CONFLICT (slug, city, state) DO NOTHING
  • Match on slug+city+state combination
  • Never overwrite existing dispensary data

Staging Table: dispensaries_from_legacy

CREATE TABLE IF NOT EXISTS dispensaries_from_legacy (
  id SERIAL PRIMARY KEY,
  legacy_id INTEGER NOT NULL,
  name VARCHAR(255) NOT NULL,
  slug VARCHAR(255) NOT NULL,
  city VARCHAR(100) NOT NULL,
  state VARCHAR(10) NOT NULL,
  postal_code VARCHAR(20),
  address TEXT,
  latitude DECIMAL(10,7),
  longitude DECIMAL(10,7),
  menu_url TEXT,
  website TEXT,
  legacy_metadata JSONB,  -- All other legacy fields
  imported_at TIMESTAMPTZ DEFAULT NOW(),
  UNIQUE(legacy_id)
);

2. Products (Legacy products table)

Source: dutchie_legacy.products Target: cannaiq.products_from_legacy (new staging table)

Legacy Column Canonical Column Notes
id legacy_product_id Original ID
dispensary_id legacy_dispensary_id FK to legacy dispensary
dutchie_product_id external_product_id Dutchie's _id
name name Direct map
brand brand_name Direct map
price price_cents Multiply by 100
original_price original_price_cents Multiply by 100
thc_percentage thc Direct map
cbd_percentage cbd Direct map
strain_type strain_type Direct map
weight weight Direct map
image_url primary_image_url Direct map
in_stock stock_status Map: true→'in_stock', false→'out_of_stock'
first_seen_at first_seen_at Direct map
last_seen_at last_seen_at Direct map
raw_data latest_raw_payload Direct map

Staging Table: products_from_legacy

CREATE TABLE IF NOT EXISTS products_from_legacy (
  id SERIAL PRIMARY KEY,
  legacy_product_id INTEGER NOT NULL,
  legacy_dispensary_id INTEGER,
  external_product_id VARCHAR(255),
  name VARCHAR(500) NOT NULL,
  brand_name VARCHAR(255),
  type VARCHAR(100),
  subcategory VARCHAR(100),
  strain_type VARCHAR(50),
  thc DECIMAL(10,4),
  cbd DECIMAL(10,4),
  price_cents INTEGER,
  original_price_cents INTEGER,
  stock_status VARCHAR(20),
  weight VARCHAR(100),
  primary_image_url TEXT,
  first_seen_at TIMESTAMPTZ,
  last_seen_at TIMESTAMPTZ,
  legacy_raw_payload JSONB,
  imported_at TIMESTAMPTZ DEFAULT NOW(),
  UNIQUE(legacy_product_id)
);

3. Dutchie Products

Source: dutchie_legacy.dutchie_products Target: cannaiq.dutchie_products

These tables have nearly identical schemas. The mapping is direct:

Legacy Column Canonical Column Notes
id - Generate new, store as legacy_dutchie_product_id
dispensary_id dispensary_id Map via dispensary slug lookup
external_product_id external_product_id Direct (Dutchie _id)
platform_dispensary_id platform_dispensary_id Direct
name name Direct
brand_name brand_name Direct
type type Direct
subcategory subcategory Direct
strain_type strain_type Direct
thc/thc_content thc/thc_content Direct
cbd/cbd_content cbd/cbd_content Direct
stock_status stock_status Direct
images images Direct (JSONB)
latest_raw_payload latest_raw_payload Direct

Conflict Resolution:

ON CONFLICT (dispensary_id, external_product_id) DO NOTHING
  • Never overwrite existing products
  • Skip duplicates silently

4. Dutchie Product Snapshots

Source: dutchie_legacy.dutchie_product_snapshots Target: cannaiq.dutchie_product_snapshots

Legacy Column Canonical Column Notes
id - Generate new
dutchie_product_id dutchie_product_id Map via product lookup
dispensary_id dispensary_id Map via dispensary lookup
crawled_at crawled_at Direct
rec_min_price_cents rec_min_price_cents Direct
rec_max_price_cents rec_max_price_cents Direct
stock_status stock_status Direct
options options Direct (JSONB)
raw_payload raw_payload Direct (JSONB)

Conflict Resolution:

-- No unique constraint on snapshots - all are historical records
-- Just INSERT, no conflict handling needed
INSERT INTO dutchie_product_snapshots (...) VALUES (...)

5. Price History

Source: dutchie_legacy.price_history Target: cannaiq.price_history_legacy (new staging table)

CREATE TABLE IF NOT EXISTS price_history_legacy (
  id SERIAL PRIMARY KEY,
  legacy_product_id INTEGER NOT NULL,
  price_cents INTEGER,
  recorded_at TIMESTAMPTZ,
  imported_at TIMESTAMPTZ DEFAULT NOW()
);

ETL Process

Phase 1: Staging Tables (INSERT-ONLY)

  1. Create staging tables with _from_legacy or _legacy suffix
  2. Read from dutchie_legacy.* tables in batches
  3. INSERT into staging tables with ON CONFLICT DO NOTHING
  4. Log counts: read, inserted, skipped

Phase 2: ID Mapping

  1. Build ID mapping tables:
    • legacy_dispensary_idcanonical_dispensary_id
    • legacy_product_idcanonical_product_id
  2. Match on unique keys (slug+city+state for dispensaries, external_product_id for products)

Phase 3: Canonical Merge (Optional, User-Approved)

Only if explicitly requested:

  1. INSERT new records into canonical tables
  2. Never UPDATE existing records
  3. Never DELETE any records

Safety Rules

  1. INSERT-ONLY: No UPDATE, no DELETE, no TRUNCATE
  2. ON CONFLICT DO NOTHING: Skip duplicates, never overwrite
  3. Batch Processing: 500-1000 rows per batch to avoid memory issues
  4. Manual Invocation Only: ETL script requires explicit user execution
  5. Logging: Record all operations with counts and timestamps
  6. Dry Run Mode: Support --dry-run flag to preview without writes

Validation Queries

After import, verify with:

-- Count imported dispensaries
SELECT COUNT(*) FROM dispensaries_from_legacy;

-- Count imported products
SELECT COUNT(*) FROM products_from_legacy;

-- Check for duplicates that were skipped
SELECT
  (SELECT COUNT(*) FROM dutchie_legacy.dispensaries) as legacy_count,
  (SELECT COUNT(*) FROM dispensaries_from_legacy) as imported_count;

-- Verify no data loss
SELECT
  l.id as legacy_id,
  l.name as legacy_name,
  c.id as canonical_id
FROM dutchie_legacy.dispensaries l
LEFT JOIN dispensaries c ON c.slug = l.slug AND c.city = l.city AND c.state = l.state
WHERE c.id IS NULL
LIMIT 10;

Invocation

# From backend directory
npx tsx src/scripts/etl/legacy-import.ts

# With dry-run
npx tsx src/scripts/etl/legacy-import.ts --dry-run

# Import specific tables only
npx tsx src/scripts/etl/legacy-import.ts --tables=dispensaries,products

Environment Variables

The ETL script expects these environment variables (user configures):

# Connection to cannaiq-postgres (same host, different databases)
CANNAIQ_DB_HOST=localhost
CANNAIQ_DB_PORT=54320
CANNAIQ_DB_USER=cannaiq
CANNAIQ_DB_PASSWORD=<password>
CANNAIQ_DB_NAME=cannaiq

# Legacy database (same host, different database)
LEGACY_DB_HOST=localhost
LEGACY_DB_PORT=54320
LEGACY_DB_USER=dutchie
LEGACY_DB_PASSWORD=<password>
LEGACY_DB_NAME=dutchie_legacy