Files
cannaiq/docs/legacy_mapping.md
Kelly 56cc171287 feat: Stealth worker system with mandatory proxy rotation
## Worker System
- Role-agnostic workers that can handle any task type
- Pod-based architecture with StatefulSet (5-15 pods, 5 workers each)
- Custom pod names (Aethelgard, Xylos, Kryll, etc.)
- Worker registry with friendly names and resource monitoring
- Hub-and-spoke visualization on JobQueue page

## Stealth & Anti-Detection (REQUIRED)
- Proxies are MANDATORY - workers fail to start without active proxies
- CrawlRotator initializes on worker startup
- Loads proxies from `proxies` table
- Auto-rotates proxy + fingerprint on 403 errors
- 12 browser fingerprints (Chrome, Firefox, Safari, Edge)
- Locale/timezone matching for geographic consistency

## Task System
- Renamed product_resync → product_refresh
- Task chaining: store_discovery → entry_point → product_discovery
- Priority-based claiming with FOR UPDATE SKIP LOCKED
- Heartbeat and stale task recovery

## UI Updates
- JobQueue: Pod visualization, resource monitoring on hover
- WorkersDashboard: Simplified worker list
- Removed unused filters from task list

## Other
- IP2Location service for visitor analytics
- Findagram consumer features scaffolding
- Documentation updates

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-10 00:44:59 -07:00

9.8 KiB

Legacy Data Mapping: dutchie_legacy → CannaiQ

Overview

This document describes the ETL mapping from the legacy dutchie_legacy database to the canonical CannaiQ schema. All imports are INSERT-ONLY with no deletions or overwrites of existing data.

Database Locations

Database Host Purpose
cannaiq localhost:54320 Main CannaiQ application schema
dutchie_legacy localhost:54320 Imported historical data from old dutchie_menus

Schema Comparison

Legacy Tables (dutchie_legacy)

Table Row Purpose Key Columns
dispensaries Store locations id, name, slug, city, state, menu_url, menu_provider, product_provider
products Legacy product records id, dispensary_id, dutchie_product_id, name, brand, price, thc_percentage
dutchie_products Dutchie-specific products id, dispensary_id, external_product_id, name, brand_name, type, stock_status
dutchie_product_snapshots Historical price/stock snapshots dutchie_product_id, crawled_at, rec_min_price_cents, stock_status
brands Brand entities id, store_id, name, dispensary_id
categories Product categories id, store_id, name, slug
price_history Legacy price tracking product_id, price, recorded_at
specials Deals/promotions id, dispensary_id, name, discount_type

CannaiQ Canonical Tables

Table Purpose Key Columns
dispensaries Store locations id, name, slug, city, state, platform_dispensary_id
store_products Canonical products id, dispensary_id, external_product_id, name, brand_name, stock_status
store_product_snapshots Historical snapshots store_product_id, crawled_at, rec_min_price_cents
brands (view: v_brands) Derived from products brand_name, brand_id, product_count
categories (view: v_categories) Derived from products type, subcategory, product_count

Mapping Plan

1. Dispensaries

Source: dutchie_legacy.dispensaries Target: cannaiq.dispensaries

Legacy Column Canonical Column Notes
id - Generate new ID, store legacy_id
name name Direct map
slug slug Direct map
city city Direct map
state state Direct map
address address Direct map
zip postal_code Rename
latitude latitude Direct map
longitude longitude Direct map
menu_url menu_url Direct map
menu_provider - Store in raw_metadata
product_provider - Store in raw_metadata
website website Direct map
dba_name - Store in raw_metadata
- platform Set to 'dutchie'
- legacy_id New column: original ID from legacy

Conflict Resolution:

  • ON CONFLICT (slug, city, state) DO NOTHING
  • Match on slug+city+state combination
  • Never overwrite existing dispensary data

Staging Table: dispensaries_from_legacy

CREATE TABLE IF NOT EXISTS dispensaries_from_legacy (
  id SERIAL PRIMARY KEY,
  legacy_id INTEGER NOT NULL,
  name VARCHAR(255) NOT NULL,
  slug VARCHAR(255) NOT NULL,
  city VARCHAR(100) NOT NULL,
  state VARCHAR(10) NOT NULL,
  postal_code VARCHAR(20),
  address TEXT,
  latitude DECIMAL(10,7),
  longitude DECIMAL(10,7),
  menu_url TEXT,
  website TEXT,
  legacy_metadata JSONB,  -- All other legacy fields
  imported_at TIMESTAMPTZ DEFAULT NOW(),
  UNIQUE(legacy_id)
);

2. Products (Legacy products table)

Source: dutchie_legacy.products Target: cannaiq.products_from_legacy (new staging table)

Legacy Column Canonical Column Notes
id legacy_product_id Original ID
dispensary_id legacy_dispensary_id FK to legacy dispensary
dutchie_product_id external_product_id Dutchie's _id
name name Direct map
brand brand_name Direct map
price price_cents Multiply by 100
original_price original_price_cents Multiply by 100
thc_percentage thc Direct map
cbd_percentage cbd Direct map
strain_type strain_type Direct map
weight weight Direct map
image_url primary_image_url Direct map
in_stock stock_status Map: true→'in_stock', false→'out_of_stock'
first_seen_at first_seen_at Direct map
last_seen_at last_seen_at Direct map
raw_data latest_raw_payload Direct map

Staging Table: products_from_legacy

CREATE TABLE IF NOT EXISTS products_from_legacy (
  id SERIAL PRIMARY KEY,
  legacy_product_id INTEGER NOT NULL,
  legacy_dispensary_id INTEGER,
  external_product_id VARCHAR(255),
  name VARCHAR(500) NOT NULL,
  brand_name VARCHAR(255),
  type VARCHAR(100),
  subcategory VARCHAR(100),
  strain_type VARCHAR(50),
  thc DECIMAL(10,4),
  cbd DECIMAL(10,4),
  price_cents INTEGER,
  original_price_cents INTEGER,
  stock_status VARCHAR(20),
  weight VARCHAR(100),
  primary_image_url TEXT,
  first_seen_at TIMESTAMPTZ,
  last_seen_at TIMESTAMPTZ,
  legacy_raw_payload JSONB,
  imported_at TIMESTAMPTZ DEFAULT NOW(),
  UNIQUE(legacy_product_id)
);

3. Products (Legacy dutchie_products)

Source: dutchie_legacy.dutchie_products Target: cannaiq.store_products

Legacy Column Canonical Column Notes
id - Generate new, store as legacy_dutchie_product_id
dispensary_id dispensary_id Map via dispensary slug lookup
external_product_id external_product_id Direct (Dutchie _id)
platform_dispensary_id platform_dispensary_id Direct
name name Direct
brand_name brand_name Direct
type type Direct
subcategory subcategory Direct
strain_type strain_type Direct
thc/thc_content thc/thc_content Direct
cbd/cbd_content cbd/cbd_content Direct
stock_status stock_status Direct
images images Direct (JSONB)
latest_raw_payload latest_raw_payload Direct

Conflict Resolution:

ON CONFLICT (dispensary_id, external_product_id) DO NOTHING
  • Never overwrite existing products
  • Skip duplicates silently

4. Product Snapshots (Legacy dutchie_product_snapshots)

Source: dutchie_legacy.dutchie_product_snapshots Target: cannaiq.store_product_snapshots

Legacy Column Canonical Column Notes
id - Generate new
dutchie_product_id store_product_id Map via product lookup
dispensary_id dispensary_id Map via dispensary lookup
crawled_at crawled_at Direct
rec_min_price_cents rec_min_price_cents Direct
rec_max_price_cents rec_max_price_cents Direct
stock_status stock_status Direct
options options Direct (JSONB)
raw_payload raw_payload Direct (JSONB)

Conflict Resolution:

-- No unique constraint on snapshots - all are historical records
-- Just INSERT, no conflict handling needed
INSERT INTO store_product_snapshots (...) VALUES (...)

5. Price History

Source: dutchie_legacy.price_history Target: cannaiq.price_history_legacy (new staging table)

CREATE TABLE IF NOT EXISTS price_history_legacy (
  id SERIAL PRIMARY KEY,
  legacy_product_id INTEGER NOT NULL,
  price_cents INTEGER,
  recorded_at TIMESTAMPTZ,
  imported_at TIMESTAMPTZ DEFAULT NOW()
);

ETL Process

Phase 1: Staging Tables (INSERT-ONLY)

  1. Create staging tables with _from_legacy or _legacy suffix
  2. Read from dutchie_legacy.* tables in batches
  3. INSERT into staging tables with ON CONFLICT DO NOTHING
  4. Log counts: read, inserted, skipped

Phase 2: ID Mapping

  1. Build ID mapping tables:
    • legacy_dispensary_idcanonical_dispensary_id
    • legacy_product_idcanonical_product_id
  2. Match on unique keys (slug+city+state for dispensaries, external_product_id for products)

Phase 3: Canonical Merge (Optional, User-Approved)

Only if explicitly requested:

  1. INSERT new records into canonical tables
  2. Never UPDATE existing records
  3. Never DELETE any records

Safety Rules

  1. INSERT-ONLY: No UPDATE, no DELETE, no TRUNCATE
  2. ON CONFLICT DO NOTHING: Skip duplicates, never overwrite
  3. Batch Processing: 500-1000 rows per batch to avoid memory issues
  4. Manual Invocation Only: ETL script requires explicit user execution
  5. Logging: Record all operations with counts and timestamps
  6. Dry Run Mode: Support --dry-run flag to preview without writes

Validation Queries

After import, verify with:

-- Count imported dispensaries
SELECT COUNT(*) FROM dispensaries_from_legacy;

-- Count imported products
SELECT COUNT(*) FROM products_from_legacy;

-- Check for duplicates that were skipped
SELECT
  (SELECT COUNT(*) FROM dutchie_legacy.dispensaries) as legacy_count,
  (SELECT COUNT(*) FROM dispensaries_from_legacy) as imported_count;

-- Verify no data loss
SELECT
  l.id as legacy_id,
  l.name as legacy_name,
  c.id as canonical_id
FROM dutchie_legacy.dispensaries l
LEFT JOIN dispensaries c ON c.slug = l.slug AND c.city = l.city AND c.state = l.state
WHERE c.id IS NULL
LIMIT 10;

Invocation

# From backend directory
npx tsx src/scripts/etl/legacy-import.ts

# With dry-run
npx tsx src/scripts/etl/legacy-import.ts --dry-run

# Import specific tables only
npx tsx src/scripts/etl/legacy-import.ts --tables=dispensaries,products

Environment Variables

The ETL script expects these environment variables (user configures):

# Connection to cannaiq-postgres (same host, different databases)
CANNAIQ_DB_HOST=localhost
CANNAIQ_DB_PORT=54320
CANNAIQ_DB_USER=cannaiq
CANNAIQ_DB_PASSWORD=<password>
CANNAIQ_DB_NAME=cannaiq

# Legacy database (same host, different database)
LEGACY_DB_HOST=localhost
LEGACY_DB_PORT=54320
LEGACY_DB_USER=dutchie
LEGACY_DB_PASSWORD=<password>
LEGACY_DB_NAME=dutchie_legacy