Files
cannaiq/docs/legacy_mapping.md
Kelly b4a2fb7d03 feat: Add v2 architecture with multi-state support and orchestrator services
Major additions:
- Multi-state expansion: states table, StateSelector, NationalDashboard, StateHeatmap, CrossStateCompare
- Orchestrator services: trace service, error taxonomy, retry manager, proxy rotator
- Discovery system: dutchie discovery service, geo validation, city seeding scripts
- Analytics infrastructure: analytics v2 routes, brand/pricing/stores intelligence pages
- Local development: setup-local.sh starts all 5 services (postgres, backend, cannaiq, findadispo, findagram)
- Migrations 037-056: crawler profiles, states, analytics indexes, worker metadata

Frontend pages added:
- Discovery, ChainsDashboard, IntelligenceBrands, IntelligencePricing, IntelligenceStores
- StateHeatmap, CrossStateCompare, SyncInfoPanel

Components added:
- StateSelector, OrchestratorTraceModal, WorkflowStepper

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-07 11:30:57 -07:00

325 lines
9.8 KiB
Markdown

# Legacy Data Mapping: dutchie_legacy → CannaiQ
## Overview
This document describes the ETL mapping from the legacy `dutchie_legacy` database
to the canonical CannaiQ schema. All imports are **INSERT-ONLY** with no deletions
or overwrites of existing data.
## Database Locations
| Database | Host | Purpose |
|----------|------|---------|
| `cannaiq` | localhost:54320 | Main CannaiQ application schema |
| `dutchie_legacy` | localhost:54320 | Imported historical data from old dutchie_menus |
## Schema Comparison
### Legacy Tables (dutchie_legacy)
| Table | Row Purpose | Key Columns |
|-------|-------------|-------------|
| `dispensaries` | Store locations | id, name, slug, city, state, menu_url, menu_provider, product_provider |
| `products` | Legacy product records | id, dispensary_id, dutchie_product_id, name, brand, price, thc_percentage |
| `dutchie_products` | Dutchie-specific products | id, dispensary_id, external_product_id, name, brand_name, type, stock_status |
| `dutchie_product_snapshots` | Historical price/stock snapshots | dutchie_product_id, crawled_at, rec_min_price_cents, stock_status |
| `brands` | Brand entities | id, store_id, name, dispensary_id |
| `categories` | Product categories | id, store_id, name, slug |
| `price_history` | Legacy price tracking | product_id, price, recorded_at |
| `specials` | Deals/promotions | id, dispensary_id, name, discount_type |
### CannaiQ Canonical Tables
| Table | Purpose | Key Columns |
|-------|---------|-------------|
| `dispensaries` | Store locations | id, name, slug, city, state, platform_dispensary_id |
| `dutchie_products` | Canonical products | id, dispensary_id, external_product_id, name, brand_name, stock_status |
| `dutchie_product_snapshots` | Historical snapshots | dutchie_product_id, crawled_at, rec_min_price_cents |
| `brands` (view: v_brands) | Derived from products | brand_name, brand_id, product_count |
| `categories` (view: v_categories) | Derived from products | type, subcategory, product_count |
---
## Mapping Plan
### 1. Dispensaries
**Source:** `dutchie_legacy.dispensaries`
**Target:** `cannaiq.dispensaries`
| Legacy Column | Canonical Column | Notes |
|---------------|------------------|-------|
| id | - | Generate new ID, store legacy_id |
| name | name | Direct map |
| slug | slug | Direct map |
| city | city | Direct map |
| state | state | Direct map |
| address | address | Direct map |
| zip | postal_code | Rename |
| latitude | latitude | Direct map |
| longitude | longitude | Direct map |
| menu_url | menu_url | Direct map |
| menu_provider | - | Store in raw_metadata |
| product_provider | - | Store in raw_metadata |
| website | website | Direct map |
| dba_name | - | Store in raw_metadata |
| - | platform | Set to 'dutchie' |
| - | legacy_id | New column: original ID from legacy |
**Conflict Resolution:**
- ON CONFLICT (slug, city, state) DO NOTHING
- Match on slug+city+state combination
- Never overwrite existing dispensary data
**Staging Table:** `dispensaries_from_legacy`
```sql
CREATE TABLE IF NOT EXISTS dispensaries_from_legacy (
id SERIAL PRIMARY KEY,
legacy_id INTEGER NOT NULL,
name VARCHAR(255) NOT NULL,
slug VARCHAR(255) NOT NULL,
city VARCHAR(100) NOT NULL,
state VARCHAR(10) NOT NULL,
postal_code VARCHAR(20),
address TEXT,
latitude DECIMAL(10,7),
longitude DECIMAL(10,7),
menu_url TEXT,
website TEXT,
legacy_metadata JSONB, -- All other legacy fields
imported_at TIMESTAMPTZ DEFAULT NOW(),
UNIQUE(legacy_id)
);
```
---
### 2. Products (Legacy products table)
**Source:** `dutchie_legacy.products`
**Target:** `cannaiq.products_from_legacy` (new staging table)
| Legacy Column | Canonical Column | Notes |
|---------------|------------------|-------|
| id | legacy_product_id | Original ID |
| dispensary_id | legacy_dispensary_id | FK to legacy dispensary |
| dutchie_product_id | external_product_id | Dutchie's _id |
| name | name | Direct map |
| brand | brand_name | Direct map |
| price | price_cents | Multiply by 100 |
| original_price | original_price_cents | Multiply by 100 |
| thc_percentage | thc | Direct map |
| cbd_percentage | cbd | Direct map |
| strain_type | strain_type | Direct map |
| weight | weight | Direct map |
| image_url | primary_image_url | Direct map |
| in_stock | stock_status | Map: true→'in_stock', false→'out_of_stock' |
| first_seen_at | first_seen_at | Direct map |
| last_seen_at | last_seen_at | Direct map |
| raw_data | latest_raw_payload | Direct map |
**Staging Table:** `products_from_legacy`
```sql
CREATE TABLE IF NOT EXISTS products_from_legacy (
id SERIAL PRIMARY KEY,
legacy_product_id INTEGER NOT NULL,
legacy_dispensary_id INTEGER,
external_product_id VARCHAR(255),
name VARCHAR(500) NOT NULL,
brand_name VARCHAR(255),
type VARCHAR(100),
subcategory VARCHAR(100),
strain_type VARCHAR(50),
thc DECIMAL(10,4),
cbd DECIMAL(10,4),
price_cents INTEGER,
original_price_cents INTEGER,
stock_status VARCHAR(20),
weight VARCHAR(100),
primary_image_url TEXT,
first_seen_at TIMESTAMPTZ,
last_seen_at TIMESTAMPTZ,
legacy_raw_payload JSONB,
imported_at TIMESTAMPTZ DEFAULT NOW(),
UNIQUE(legacy_product_id)
);
```
---
### 3. Dutchie Products
**Source:** `dutchie_legacy.dutchie_products`
**Target:** `cannaiq.dutchie_products`
These tables have nearly identical schemas. The mapping is direct:
| Legacy Column | Canonical Column | Notes |
|---------------|------------------|-------|
| id | - | Generate new, store as legacy_dutchie_product_id |
| dispensary_id | dispensary_id | Map via dispensary slug lookup |
| external_product_id | external_product_id | Direct (Dutchie _id) |
| platform_dispensary_id | platform_dispensary_id | Direct |
| name | name | Direct |
| brand_name | brand_name | Direct |
| type | type | Direct |
| subcategory | subcategory | Direct |
| strain_type | strain_type | Direct |
| thc/thc_content | thc/thc_content | Direct |
| cbd/cbd_content | cbd/cbd_content | Direct |
| stock_status | stock_status | Direct |
| images | images | Direct (JSONB) |
| latest_raw_payload | latest_raw_payload | Direct |
**Conflict Resolution:**
```sql
ON CONFLICT (dispensary_id, external_product_id) DO NOTHING
```
- Never overwrite existing products
- Skip duplicates silently
---
### 4. Dutchie Product Snapshots
**Source:** `dutchie_legacy.dutchie_product_snapshots`
**Target:** `cannaiq.dutchie_product_snapshots`
| Legacy Column | Canonical Column | Notes |
|---------------|------------------|-------|
| id | - | Generate new |
| dutchie_product_id | dutchie_product_id | Map via product lookup |
| dispensary_id | dispensary_id | Map via dispensary lookup |
| crawled_at | crawled_at | Direct |
| rec_min_price_cents | rec_min_price_cents | Direct |
| rec_max_price_cents | rec_max_price_cents | Direct |
| stock_status | stock_status | Direct |
| options | options | Direct (JSONB) |
| raw_payload | raw_payload | Direct (JSONB) |
**Conflict Resolution:**
```sql
-- No unique constraint on snapshots - all are historical records
-- Just INSERT, no conflict handling needed
INSERT INTO dutchie_product_snapshots (...) VALUES (...)
```
---
### 5. Price History
**Source:** `dutchie_legacy.price_history`
**Target:** `cannaiq.price_history_legacy` (new staging table)
```sql
CREATE TABLE IF NOT EXISTS price_history_legacy (
id SERIAL PRIMARY KEY,
legacy_product_id INTEGER NOT NULL,
price_cents INTEGER,
recorded_at TIMESTAMPTZ,
imported_at TIMESTAMPTZ DEFAULT NOW()
);
```
---
## ETL Process
### Phase 1: Staging Tables (INSERT-ONLY)
1. Create staging tables with `_from_legacy` or `_legacy` suffix
2. Read from `dutchie_legacy.*` tables in batches
3. INSERT into staging tables with ON CONFLICT DO NOTHING
4. Log counts: read, inserted, skipped
### Phase 2: ID Mapping
1. Build ID mapping tables:
- `legacy_dispensary_id``canonical_dispensary_id`
- `legacy_product_id``canonical_product_id`
2. Match on unique keys (slug+city+state for dispensaries, external_product_id for products)
### Phase 3: Canonical Merge (Optional, User-Approved)
Only if explicitly requested:
1. INSERT new records into canonical tables
2. Never UPDATE existing records
3. Never DELETE any records
---
## Safety Rules
1. **INSERT-ONLY**: No UPDATE, no DELETE, no TRUNCATE
2. **ON CONFLICT DO NOTHING**: Skip duplicates, never overwrite
3. **Batch Processing**: 500-1000 rows per batch to avoid memory issues
4. **Manual Invocation Only**: ETL script requires explicit user execution
5. **Logging**: Record all operations with counts and timestamps
6. **Dry Run Mode**: Support `--dry-run` flag to preview without writes
---
## Validation Queries
After import, verify with:
```sql
-- Count imported dispensaries
SELECT COUNT(*) FROM dispensaries_from_legacy;
-- Count imported products
SELECT COUNT(*) FROM products_from_legacy;
-- Check for duplicates that were skipped
SELECT
(SELECT COUNT(*) FROM dutchie_legacy.dispensaries) as legacy_count,
(SELECT COUNT(*) FROM dispensaries_from_legacy) as imported_count;
-- Verify no data loss
SELECT
l.id as legacy_id,
l.name as legacy_name,
c.id as canonical_id
FROM dutchie_legacy.dispensaries l
LEFT JOIN dispensaries c ON c.slug = l.slug AND c.city = l.city AND c.state = l.state
WHERE c.id IS NULL
LIMIT 10;
```
---
## Invocation
```bash
# From backend directory
npx tsx src/scripts/etl/legacy-import.ts
# With dry-run
npx tsx src/scripts/etl/legacy-import.ts --dry-run
# Import specific tables only
npx tsx src/scripts/etl/legacy-import.ts --tables=dispensaries,products
```
---
## Environment Variables
The ETL script expects these environment variables (user configures):
```bash
# Connection to cannaiq-postgres (same host, different databases)
CANNAIQ_DB_HOST=localhost
CANNAIQ_DB_PORT=54320
CANNAIQ_DB_USER=cannaiq
CANNAIQ_DB_PASSWORD=<password>
CANNAIQ_DB_NAME=cannaiq
# Legacy database (same host, different database)
LEGACY_DB_HOST=localhost
LEGACY_DB_PORT=54320
LEGACY_DB_USER=dutchie
LEGACY_DB_PASSWORD=<password>
LEGACY_DB_NAME=dutchie_legacy
```