# CannaiQ Crawl System V2

## Overview

The CannaiQ Crawl System is a GraphQL-based data pipeline that discovers and monitors cannabis dispensaries that use the Dutchie platform. It operates in two phases:

1. **Phase 1: Store Discovery** - Weekly discovery of Dutchie-powered dispensaries
2. **Phase 2: Product Crawling** - Regular product/price/stock updates (documented separately)

---

## Phase 1: Store Discovery

### Purpose

Automatically discover and maintain a database of dispensaries that use Dutchie menus across all US states.

### Schedule

- **Frequency**: Weekly (typically Sunday night)
- **Duration**: ~2-4 hours for full US coverage

### Flow Diagram

```
┌─────────────────────────────────────────────────────────────────────┐
│                      PHASE 1: STORE DISCOVERY                       │
└─────────────────────────────────────────────────────────────────────┘

1. IDENTITY SETUP
   ┌──────────────────┐
   │ getRandomProxy() │ ──► Random IP from proxy pool
   └──────────────────┘
            │
            ▼
   ┌──────────────────┐
   │ startSession()   │ ──► Random UA + fingerprint + locale matching proxy location
   └──────────────────┘

2. CITY DISCOVERY (per state)
   ┌──────────────────────────────┐
   │ GraphQL: getAllCitiesByState │ ──► Returns cities with active dispensaries
   └──────────────────────────────┘
            │
            ▼
   ┌──────────────────────────────┐
   │ Upsert dutchie_discovery_    │
   │ cities table                 │
   └──────────────────────────────┘

3. STORE DISCOVERY (per city)
   ┌───────────────────────────────┐
   │ GraphQL: ConsumerDispensaries │ ──► Returns store data for city
   └───────────────────────────────┘
            │
            ▼
   ┌───────────────────────────────┐
   │ Upsert dutchie_discovery_     │
   │ locations table               │
   └───────────────────────────────┘

4. VALIDATION & PROMOTION
   ┌──────────────────────────┐
   │ validateForPromotion()   │ ──► Check required fields
   └──────────────────────────┘
            │
            ▼
   ┌──────────────────────────┐
   │ promoteLocation()        │ ──► Upsert to dispensaries table
   └──────────────────────────┘
            │
            ▼
   ┌──────────────────────────┐
   │ ensureCrawlerProfile()   │ ──► Create profile with status='sandbox'
   └──────────────────────────┘

5. DROPPED STORE DETECTION
   ┌──────────────────────────┐
   │ detectDroppedStores()    │ ──► Find stores missing from discovery
   └──────────────────────────┘
            │
            ▼
   ┌──────────────────────────┐
   │ Mark status='dropped'    │ ──► Dashboard alert for review
   └──────────────────────────┘
```

---

## Key Files

| File | Purpose |
|------|---------|
| `backend/src/platforms/dutchie/client.ts` | HTTP client with proxy/fingerprint rotation |
| `backend/src/discovery/discovery-crawler.ts` | Main discovery orchestrator |
| `backend/src/discovery/location-discovery.ts` | City/store GraphQL fetching |
| `backend/src/discovery/promotion.ts` | Validation and promotion logic |
| `backend/src/scripts/run-discovery.ts` | CLI entry point |

---

## Identity Masking

Before any GraphQL queries are sent, the system establishes a masked identity:

### 1. Proxy Selection

```typescript
// backend/src/platforms/dutchie/client.ts

// Get random proxy from active pool (NOT state-specific)
const proxy = await getRandomProxy();
setProxy(proxy.url);
```

The proxy is selected randomly from the active proxy pool. It is NOT geo-targeted to the state being crawled.

### 2. Fingerprint + Locale Harmonization

```typescript
// backend/src/platforms/dutchie/client.ts

function startSession(stateCode: string, timezone: string) {
  // 1. Random browser fingerprint (Chrome/Firefox/Safari/Edge variants)
  const fingerprint = getRandomFingerprint();

  // 2. Match Accept-Language to proxy's timezone/location
  const locale = getLocaleForTimezone(timezone);

  // 3. Set headers for this session
  currentSession = {
    userAgent: fingerprint.ua,
    acceptLanguage: locale,
    secChUa: fingerprint.secChUa,
    // ... other fingerprint headers
  };
}
```

### Fingerprint Pool

Six browser fingerprints are in the pool. A new one is picked at the start of each session and again after any 403 error:

| Browser | Version | Platform |
|---------|---------|----------|
| Chrome | 120 | Windows |
| Chrome | 120 | macOS |
| Firefox | 121 | Windows |
| Firefox | 121 | macOS |
| Safari | 17.2 | macOS |
| Edge | 120 | Windows |

### Timezone → Locale Mapping

```typescript
const TIMEZONE_TO_LOCALE: Record<string, string> = {
  'America/New_York': 'en-US,en;q=0.9',
  'America/Chicago': 'en-US,en;q=0.9',
  'America/Denver': 'en-US,en;q=0.9',
  'America/Los_Angeles': 'en-US,en;q=0.9',
  'America/Phoenix': 'en-US,en;q=0.9',
  // ...
};
```

---

## GraphQL Queries

### 1. getAllCitiesByState

Fetches the cities in a state that have active dispensaries.

```typescript
// backend/src/discovery/location-discovery.ts

const response = await executeGraphQL({
  operationName: 'getAllCitiesByState',
  variables: {
    state: 'AZ',
    countryCode: 'US'
  }
});
// Returns: { cities: [{ name: 'Phoenix', slug: 'phoenix' }, ...] }
```

**Hash**: `ae547a0466ace5a48f91e55bf6699eacd87e3a42841560f0c0eabed5a0a920e6`

### 2. ConsumerDispensaries

Fetches store data for a city/state pair.

```typescript
// backend/src/discovery/location-discovery.ts

const response = await executeGraphQL({
  operationName: 'ConsumerDispensaries',
  variables: {
    dispensaryFilter: {
      city: 'Phoenix',
      state: 'AZ',
      activeOnly: true
    }
  }
});
// Returns: [{ id, name, address, coords, menuUrl, ... }, ...]
```

**Hash**: `0a5bfa6ca1d64ae47bcccb7c8077c87147cbc4e6982c17ceec97a2a4948b311b`

---

## Database Tables

### Discovery Tables (Staging)

| Table | Purpose |
|-------|---------|
| `dutchie_discovery_cities` | Cities known to have dispensaries |
| `dutchie_discovery_locations` | Raw discovered store data |

### Canonical Tables

| Table | Purpose |
|-------|---------|
| `dispensaries` | Promoted stores ready for crawling |
| `dispensary_crawler_profiles` | Crawler configuration per store |
| `dutchie_promotion_log` | Audit trail for all discovery actions |

---

## Validation Rules

A discovery location must satisfy all of the following to be promoted:

| Field | Requirement |
|-------|-------------|
| `platform_location_id` | MongoDB ObjectId (24 hex chars) |
| `name` | Non-empty string |
| `city` | Non-empty string |
| `state_code` | Non-empty string |
| `platform_menu_url` | Valid URL |

Records that fail validation are marked `status='rejected'`, and the validation errors are logged.

---

## Dropped Store Detection

After discovery, the system identifies stores that may have left the Dutchie platform.

### Detection Criteria

A store is marked as "dropped" if all of the following hold:

1. It has a `platform_dispensary_id` (it was previously verified)
2. It currently has `status='open'` and `crawl_enabled=true`
3. It was NOT seen in the latest discovery (no row in `dutchie_discovery_locations` with `last_seen_at` within the last 24 hours)

### Implementation

```typescript
// backend/src/discovery/discovery-crawler.ts

export async function detectDroppedStores(pool: Pool, stateCode?: string) {
  // 1. Find dispensaries not in recent discovery
  // 2. Mark status='dropped'
  // 3. Log to dutchie_promotion_log
  // 4. Return list for dashboard alert
}
```

### Admin UI

- **Dashboard**: Red alert banner when dropped stores exist
- **Dispensaries page**: Filter by `status=dropped` to review

---

## CLI Usage

```bash
# Discover all stores in a state
npx tsx src/scripts/run-discovery.ts discover:state AZ

# Discover all US states
npx tsx src/scripts/run-discovery.ts discover:all

# Dry run (no DB writes)
npx tsx src/scripts/run-discovery.ts discover:state CA --dry-run

# Check stats
npx tsx src/scripts/run-discovery.ts stats
```

---

## Rate Limiting

- **2 seconds** between city requests
- **Exponential backoff** on 429/403 responses
- **Fingerprint rotation** on 403 errors

---

## Error Handling

| Error | Action |
|-------|--------|
| 403 Forbidden | Rotate fingerprint, retry |
| 429 Rate Limited | Wait 30s, retry |
| Network timeout | Retry up to 3 times |
| GraphQL error | Log and continue to next city |

---

## Monitoring

### Logs

Discovery progress is logged to stdout:

```
[Discovery] Starting discovery for state: AZ
[Discovery] Step 1: Initializing proxy...
[Discovery] Step 2: Fetching cities...
[Discovery] Found 45 cities for AZ
[Discovery] Step 3: Discovering locations...
[Discovery] City 1/45: Phoenix - found 28 stores
...
[Discovery] Step 4: Auto-promoting discovered locations...
[Discovery] Created: 5 new dispensaries
[Discovery] Updated: 40 existing dispensaries
[Discovery] Step 5: Detecting dropped stores...
[Discovery] Found 2 dropped stores
```

### Audit Log

All actions are logged to `dutchie_promotion_log`:

| Action | Description |
|--------|-------------|
| `promoted_create` | New dispensary created |
| `promoted_update` | Existing dispensary updated |
| `rejected` | Validation failed |
| `dropped` | Store not found in discovery |

---

## Next: Phase 2

See `docs/PRODUCT_CRAWL_V2.md` for the product crawling phase (coming next).
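
---

## Appendix: Validation Sketch

The rules in the Validation Rules table can be illustrated with a small standalone check. This is a minimal sketch, not the actual `validateForPromotion()` in `promotion.ts`: the `DiscoveryLocation` interface, the `validateForPromotionSketch` name, the helper functions, the error strings, and the choice to accept only `http`/`https` URLs are all assumptions made for illustration.

```typescript
// Hypothetical shape of a staged row; field names follow the Validation Rules table.
interface DiscoveryLocation {
  platform_location_id?: string | null;
  name?: string | null;
  city?: string | null;
  state_code?: string | null;
  platform_menu_url?: string | null;
}

interface ValidationResult {
  valid: boolean;
  errors: string[];
}

// A MongoDB ObjectId is written as exactly 24 hexadecimal characters.
const OBJECT_ID_PATTERN = /^[0-9a-fA-F]{24}$/;

// True only for strings that contain at least one non-whitespace character.
function isNonEmpty(value: string | null | undefined): value is string {
  return typeof value === 'string' && value.trim().length > 0;
}

// Assumption: "valid URL" means parseable by the standard URL class
// and using the http or https scheme.
function isValidHttpUrl(value: string | null | undefined): boolean {
  if (!isNonEmpty(value)) return false;
  try {
    const url = new URL(value);
    return url.protocol === 'http:' || url.protocol === 'https:';
  } catch {
    return false;
  }
}

// Collects every failed rule so a rejected record can log all of its problems at once.
function validateForPromotionSketch(loc: DiscoveryLocation): ValidationResult {
  const errors: string[] = [];

  if (!OBJECT_ID_PATTERN.test(loc.platform_location_id ?? '')) {
    errors.push('platform_location_id must be a 24-character hex ObjectId');
  }
  if (!isNonEmpty(loc.name)) errors.push('name must be a non-empty string');
  if (!isNonEmpty(loc.city)) errors.push('city must be a non-empty string');
  if (!isNonEmpty(loc.state_code)) errors.push('state_code must be a non-empty string');
  if (!isValidHttpUrl(loc.platform_menu_url)) errors.push('platform_menu_url must be a valid URL');

  return { valid: errors.length === 0, errors };
}
```

Under this sketch, a caller would promote a location when `valid` is true and otherwise mark it `status='rejected'` and log the `errors` array. Returning every failed rule, not just the first, keeps the logged rejection complete.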