cannaiq/backend/docs/CODEBASE_MAP.md
Kelly 55b26e9153 feat: Auto-healing entry_point_discovery with browser-first transport
- Rewrote entry_point_discovery with auto-healing scheme:
  1. Check dutchie_discovery_locations for existing platform_location_id
  2. Browser-based GraphQL with 5x network retries
  3. Mark as needs_investigation on hard failure
- Browser (Puppeteer) is now DEFAULT transport - curl only when explicit
- Added migration 091 for tracking columns:
  - last_store_discovery_at: When store_discovery updated record
  - last_payload_at: When last product payload was saved
- Updated CODEBASE_MAP.md with transport rules documentation

🤖 Generated with [Claude Code](https://claude.com/claude-code)
2025-12-12 22:55:21 -07:00


CannaiQ Backend Codebase Map

Last Updated: 2025-12-12

Purpose: Help Claude and developers understand which code is current vs deprecated


Quick Reference: What to Use

For Crawling/Scraping

| Task | Use This | NOT This |
| --- | --- | --- |
| Fetch products | src/tasks/handlers/payload-fetch.ts | src/hydration/* |
| Process products | src/tasks/handlers/product-refresh.ts | src/scraper-v2/* |
| GraphQL client | src/platforms/dutchie/client.ts | src/dutchie-az/services/graphql-client.ts |
| Worker system | src/tasks/task-worker.ts | src/dutchie-az/services/worker.ts |

For Database

| Task | Use This | NOT This |
| --- | --- | --- |
| Get DB pool | src/db/pool.ts | src/dutchie-az/db/connection.ts |
| Run migrations | src/db/migrate.ts (CLI only) | Never import at runtime |
| Query products | store_products table | products, dutchie_products |
| Query stores | dispensaries table | stores table |
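
A minimal sketch of how the table rules above could be enforced in a helper. Everything here is illustrative: `Queryable`, `assertCanonicalTable` and `countRows` are hypothetical names, and the `query(text, values)` shape is an assumption modeled on how node-postgres pools are usually called, not a confirmed description of what src/db/pool.ts exports.

```typescript
// Assumed minimal shape of a query-capable pool (hypothetical interface).
interface Queryable {
  query(
    text: string,
    values?: unknown[],
  ): Promise<{ rows: Array<Record<string, unknown>> }>;
}

// Canonical tables named in the table above.
const CANONICAL_TABLES = ['store_products', 'dispensaries'] as const;
type CanonicalTable = (typeof CANONICAL_TABLES)[number];

// Legacy table -> canonical replacement, also taken from the table above.
const LEGACY_TABLES = new Map<string, CanonicalTable>([
  ['products', 'store_products'],
  ['dutchie_products', 'store_products'],
  ['stores', 'dispensaries'],
]);

// Throws for legacy or unknown table names, returns the name otherwise.
function assertCanonicalTable(table: string): CanonicalTable {
  const replacement = LEGACY_TABLES.get(table);
  if (replacement !== undefined) {
    throw new Error(`"${table}" is a legacy table; query "${replacement}" instead`);
  }
  if (!CANONICAL_TABLES.some((t) => t === table)) {
    throw new Error(`unknown table "${table}"`);
  }
  return table as CanonicalTable;
}

// Counts rows in a canonical table. The table name comes from the fixed
// allow-list above, so interpolating it into the SQL text is safe here.
async function countRows(db: Queryable, table: string): Promise<number> {
  const safe = assertCanonicalTable(table);
  const { rows } = await db.query(`SELECT COUNT(*) AS n FROM ${safe}`);
  // COUNT(*) may arrive as a string, so coerce it.
  return Number(rows[0]?.n ?? 0);
}
```

In real code the `db` argument would be whatever src/db/pool.ts provides; passing it in explicitly keeps the sketch independent of that module's actual exports.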

For Discovery

| Task | Use This |
| --- | --- |
| Discover stores | src/discovery/*.ts |
| Run discovery | npx tsx src/scripts/run-discovery.ts |

Directory Status

ACTIVE DIRECTORIES (Use These)

src/
├── auth/               # JWT/session auth, middleware
├── db/                 # Database pool, migrations
├── discovery/          # Dutchie store discovery pipeline
├── middleware/         # Express middleware
├── multi-state/        # Multi-state query support
├── platforms/          # Platform-specific clients (Dutchie, Jane, etc)
│   └── dutchie/        # THE Dutchie client - use this one
├── routes/             # Express API routes
├── services/           # Core services (logger, scheduler, etc)
├── tasks/              # Task system (workers, handlers, scheduler)
│   └── handlers/       # Task handlers (payload_fetch, product_refresh, etc)
├── types/              # TypeScript types
└── utils/              # Utilities (storage, image processing)

DEPRECATED DIRECTORIES (DO NOT USE)

src/
├── hydration/          # DEPRECATED - Old pipeline approach
├── scraper-v2/         # DEPRECATED - Old scraper engine
├── canonical-hydration/# DEPRECATED - Merged into tasks/handlers
├── dutchie-az/         # PARTIAL - Some parts deprecated, some active
│   ├── db/             # DEPRECATED - Use src/db/pool.ts
│   └── services/       # PARTIAL - worker.ts still runs, graphql-client.ts deprecated
├── portals/            # FUTURE - Not yet implemented
├── seo/                # PARTIAL - Settings work, templates WIP
└── system/             # DEPRECATED - Old orchestration system

DEPRECATED FILES (DO NOT USE)

src/dutchie-az/db/connection.ts      # Use src/db/pool.ts instead
src/dutchie-az/services/graphql-client.ts  # Use src/platforms/dutchie/client.ts
src/hydration/*.ts                   # Entire directory deprecated
src/scraper-v2/*.ts                  # Entire directory deprecated
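
As a rough illustration, the list above can be turned into a small lookup that maps a deprecated path to its replacement. This is a sketch only: `DEPRECATED` and `replacementFor` are hypothetical names, and the replacement targets are the ones given in the Quick Reference tables of this document.

```typescript
// Deprecated path (or directory prefix) -> what to use instead.
// Entries mirror the deprecated-files list and the Quick Reference tables.
const DEPRECATED: Array<[string, string]> = [
  ['src/dutchie-az/db/connection.ts', 'src/db/pool.ts'],
  ['src/dutchie-az/services/graphql-client.ts', 'src/platforms/dutchie/client.ts'],
  ['src/hydration/', 'src/tasks/handlers/payload-fetch.ts'],
  ['src/scraper-v2/', 'src/tasks/handlers/product-refresh.ts'],
];

// Returns the suggested replacement for a deprecated path, or null when the
// path is not on the list.
function replacementFor(path: string): string | null {
  // Normalize Windows separators and a leading "./" before matching.
  const normalized = path.replace(/\\/g, '/').replace(/^\.\//, '');
  for (const [prefix, useInstead] of DEPRECATED) {
    if (normalized.startsWith(prefix)) return useInstead;
  }
  return null;
}
```

A check like this could run over import paths in a lint step or pre-commit hook; note that src/dutchie-az/services/worker.ts is deliberately not listed, since the document marks it as still running.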

Key Files Reference

Entry Points

| File | Purpose | Status |
| --- | --- | --- |
| src/index.ts | Main Express server | ACTIVE |
| src/dutchie-az/services/worker.ts | Worker process entry | ACTIVE |
| src/tasks/task-worker.ts | Task worker (new system) | ACTIVE |

Dutchie Integration

| File | Purpose | Status |
| --- | --- | --- |
| src/platforms/dutchie/client.ts | GraphQL client, hashes, curl | PRIMARY |
| src/platforms/dutchie/queries.ts | High-level query functions | ACTIVE |
| src/platforms/dutchie/index.ts | Re-exports | ACTIVE |

Task Handlers

| File | Purpose | Status |
| --- | --- | --- |
| src/tasks/handlers/payload-fetch.ts | Fetch products from Dutchie | PRIMARY |
| src/tasks/handlers/product-refresh.ts | Process payload into DB | PRIMARY |
| src/tasks/handlers/entry-point-discovery.ts | Resolve platform IDs (auto-healing) | PRIMARY |
| src/tasks/handlers/menu-detection.ts | Detect menu type | ACTIVE |
| src/tasks/handlers/id-resolution.ts | Resolve platform IDs (legacy) | LEGACY |
| src/tasks/handlers/image-download.ts | Download product images | ACTIVE |

Transport Rules (CRITICAL)

Browser-based (Puppeteer) is the DEFAULT transport. curl is ONLY allowed when explicitly specified.

Transport Selection

| task.method | Transport Used | Notes |
| --- | --- | --- |
| null | Browser (Puppeteer) | DEFAULT - use this for most tasks |
| 'http' | Browser (Puppeteer) | Explicit browser request |
| 'curl' | curl-impersonate | ONLY when explicitly needed |
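
The selection rule in the table can be sketched as a single function. The names `TaskMethod`, `Transport` and `resolveTransport` are hypothetical and chosen for illustration; the actual handlers may structure this differently.

```typescript
// Values task.method can take according to the table above. `undefined` is
// included on the assumption that a missing field behaves like null.
type TaskMethod = 'http' | 'curl' | null | undefined;
type Transport = 'browser' | 'curl';

// Picks the transport for a task: curl only on explicit request,
// browser (Puppeteer) in every other case.
function resolveTransport(method: TaskMethod): Transport {
  if (method === 'curl') return 'curl';
  // null, undefined and 'http' all fall through to the browser default.
  return 'browser';
}
```

Writing the curl case as the only special branch means any new or unexpected method value would fall back to the browser, which matches the browser-first rule.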

Why Browser-First?

  1. Anti-detection: Puppeteer with StealthPlugin evades bot detection
  2. Session cookies: Browser maintains session state automatically
  3. Fingerprinting: Real browser fingerprint (TLS, headers, etc.)
  4. Age gates: Browser can click through age verification

Entry Point Discovery Auto-Healing

The entry_point_discovery handler uses a healing strategy:

1. FIRST: Check dutchie_discovery_locations for existing platform_location_id
   - By linked dutchie_discovery_id
   - By slug match in discovery data
   → If found, NO network call needed

2. SECOND: Browser-based GraphQL (Puppeteer)
   - 5x retries for network/proxy failures
   - On HTTP 403: rotate proxy and retry
   - On HTTP 404 after 2 attempts: mark as 'removed'

3. HARD FAILURE: After exhausting options → 'needs_investigation'
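
The three steps above can be sketched as follows. This is an illustration under stated assumptions, not the handler's actual code: all type and function names (`HealingDeps`, `resolvePlatformId`, and so on) are hypothetical, "5x retries" is read as five attempts in total, and "404 after 2 attempts" is read as two 404 responses.

```typescript
type Resolution =
  | { status: 'resolved'; platformId: string; source: 'discovery' | 'graphql' }
  | { status: 'removed' }
  | { status: 'needs_investigation' };

type FetchResult =
  | { kind: 'ok'; platformId: string }
  | { kind: 'http_error'; code: number }
  | { kind: 'network_error' };

// Injected collaborators (all hypothetical), so the flow can run without a
// database, a browser, or a proxy pool.
interface HealingDeps {
  lookupDiscovery(slug: string): Promise<string | null>;
  fetchViaBrowser(slug: string): Promise<FetchResult>;
  rotateProxy(): Promise<void>;
}

const MAX_ATTEMPTS = 5; // assumed meaning of "5x retries"
const MAX_NOT_FOUND = 2; // assumed meaning of "404 after 2 attempts"

async function resolvePlatformId(slug: string, deps: HealingDeps): Promise<Resolution> {
  // Step 1: reuse an ID that discovery already stored; no network call.
  const known = await deps.lookupDiscovery(slug);
  if (known) return { status: 'resolved', platformId: known, source: 'discovery' };

  // Step 2: browser-based GraphQL with bounded retries.
  let notFound = 0;
  for (let attempt = 1; attempt <= MAX_ATTEMPTS; attempt++) {
    const res = await deps.fetchViaBrowser(slug);
    if (res.kind === 'ok') {
      return { status: 'resolved', platformId: res.platformId, source: 'graphql' };
    }
    if (res.kind === 'http_error' && res.code === 404) {
      notFound++;
      if (notFound >= MAX_NOT_FOUND) return { status: 'removed' };
      continue;
    }
    if (res.kind === 'http_error' && res.code === 403) {
      await deps.rotateProxy(); // blocked: switch proxy before retrying
    }
    // Network errors and other HTTP errors simply retry.
  }

  // Step 3: every option exhausted.
  return { status: 'needs_investigation' };
}
```

Passing the lookup, fetch and proxy rotation in as dependencies is a design choice for the sketch: it keeps the retry logic testable on its own and leaves the Puppeteer details out of scope.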

DO NOT Use curl Unless:

  • Task explicitly has method = 'curl'
  • You're testing curl-impersonate binaries
  • The API explicitly requires curl fingerprinting

Files

| File | Transport | Purpose |
| --- | --- | --- |
| src/services/puppeteer-preflight.ts | Browser | Preflight check |
| src/services/curl-preflight.ts | curl | Preflight check |
| src/tasks/handlers/entry-point-discovery.ts | Browser | Platform ID resolution |
| src/tasks/handlers/payload-fetch.ts | Both | Product fetching |

Database

| File | Purpose | Status |
| --- | --- | --- |
| src/db/pool.ts | Canonical DB pool | PRIMARY |
| src/db/migrate.ts | Migration runner (CLI only) | CLI ONLY |
| src/db/auto-migrate.ts | Auto-run migrations on startup | ACTIVE |

Configuration

| File | Purpose | Status |
| --- | --- | --- |
| .env | Environment variables | ACTIVE |
| package.json | Dependencies | ACTIVE |
| tsconfig.json | TypeScript config | ACTIVE |

GraphQL Hashes (CRITICAL)

The correct hashes are in src/platforms/dutchie/client.ts:

export const GRAPHQL_HASHES = {
  FilteredProducts: 'ee29c060826dc41c527e470e9ae502c9b2c169720faa0a9f5d25e1b9a530a4a0',
  GetAddressBasedDispensaryData: '13461f73abf7268770dfd05fe7e10c523084b2bb916a929c08efe3d87531977b',
  ConsumerDispensaries: '0a5bfa6ca1d64ae47bcccb7c8077c87147cbc4e6982c17ceec97a2a4948b311b',
  GetAllCitiesByState: 'ae547a0466ace5a48f91e55bf6699eacd87e3a42841560f0c0eabed5a0a920e6',
};

ALWAYS use Status: 'Active' for FilteredProducts (not null or 'All').
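
A hedged sketch of how a hash and the Status rule might come together in a request payload. It assumes the API follows the usual persisted-query convention (an `extensions.persistedQuery` object carrying `version` and `sha256Hash`); the helper names and the `productsFilter` / `dispensaryId` variable shape are assumptions for illustration and should be checked against src/platforms/dutchie/client.ts.

```typescript
// In real code, import GRAPHQL_HASHES from src/platforms/dutchie/client.ts.
// One entry is copied here only to keep the sketch self-contained.
const GRAPHQL_HASHES = {
  FilteredProducts: 'ee29c060826dc41c527e470e9ae502c9b2c169720faa0a9f5d25e1b9a530a4a0',
} as const;

type OperationName = keyof typeof GRAPHQL_HASHES;

// Builds a persisted-query payload for the given operation (hypothetical helper).
function buildPersistedQuery<V>(operationName: OperationName, variables: V) {
  return {
    operationName,
    variables,
    extensions: {
      persistedQuery: { version: 1, sha256Hash: GRAPHQL_HASHES[operationName] },
    },
  };
}

// Variables for FilteredProducts. Only Status: 'Active' comes from this
// document; the surrounding field names are assumed.
function filteredProductsVariables(dispensaryId: string) {
  return {
    productsFilter: { dispensaryId, Status: 'Active' as const },
  };
}

const payload = buildPersistedQuery(
  'FilteredProducts',
  filteredProductsVariables('example-platform-id'),
);
```

Typing `operationName` as a key of `GRAPHQL_HASHES` means a misspelled operation or a hand-pasted hash fails at compile time instead of producing an empty response.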


Scripts Reference

Useful Scripts (in src/scripts/)

| Script | Purpose |
| --- | --- |
| run-discovery.ts | Run Dutchie discovery |
| crawl-single-store.ts | Test crawl a single store |
| test-dutchie-graphql.ts | Test GraphQL queries |

One-Off Scripts (you probably don't need these)

| Script | Purpose |
| --- | --- |
| harmonize-az-dispensaries.ts | One-time data cleanup |
| bootstrap-stores-for-dispensaries.ts | One-time migration |
| backfill-*.ts | Historical backfill scripts |

API Routes

Active Routes (in src/routes/)

| Route File | Mount Point | Purpose |
| --- | --- | --- |
| auth.ts | /api/auth | Login/logout/session |
| stores.ts | /api/stores | Store CRUD |
| dashboard.ts | /api/dashboard | Dashboard stats |
| workers.ts | /api/workers | Worker monitoring |
| pipeline.ts | /api/pipeline | Crawl triggers |
| discovery.ts | /api/discovery | Discovery management |
| analytics.ts | /api/analytics | Analytics queries |
| wordpress.ts | /api/v1/wordpress | WordPress plugin API |

Documentation Files

Current Docs (in backend/docs/)

| Doc | Purpose | Currency |
| --- | --- | --- |
| TASK_WORKFLOW_2024-12-10.md | Task system architecture | CURRENT |
| WORKER_TASK_ARCHITECTURE.md | Worker/task design | CURRENT |
| CRAWL_PIPELINE.md | Crawl pipeline overview | CURRENT |
| ORGANIC_SCRAPING_GUIDE.md | Browser-based scraping | CURRENT |
| CODEBASE_MAP.md | This file | CURRENT |
| ANALYTICS_V2_EXAMPLES.md | Analytics API examples | CURRENT |
| BRAND_INTELLIGENCE_API.md | Brand API docs | CURRENT |

Root Docs

| Doc | Purpose | Currency |
| --- | --- | --- |
| CLAUDE.md | Claude instructions | PRIMARY |
| README.md | Project overview | NEEDS UPDATE |

Common Mistakes to Avoid

  1. Don't use src/hydration/ - It's an old approach that was superseded by the task system

  2. Don't use src/dutchie-az/db/connection.ts - Use src/db/pool.ts instead

  3. Don't import src/db/migrate.ts at runtime - It will crash. Only use for CLI migrations.

  4. Don't query the stores table - It's empty. Use dispensaries.

  5. Don't query the products table - It's empty. Use store_products.

  6. Don't use the wrong GraphQL hash - Always get the hash from GRAPHQL_HASHES in src/platforms/dutchie/client.ts

  7. Don't use Status: null - It returns 0 products. Use Status: 'Active'.


When in Doubt

  1. Check if the file is imported in src/index.ts - if not, it may be deprecated
  2. Check the last modified date - older files may be stale
  3. Look for DEPRECATED comments in the code
  4. Ask: "Is there a newer version of this in src/tasks/ or src/platforms/?"
  5. Read the relevant doc in docs/ before modifying code