CannaiQ Crawl System V2
Overview
The CannaiQ Crawl System is a GraphQL-based data pipeline that discovers and monitors cannabis dispensaries that use the Dutchie platform. It operates in two phases:
- Phase 1: Store Discovery - Weekly discovery of Dutchie-powered dispensaries
- Phase 2: Product Crawling - Regular product/price/stock updates (documented separately)
Phase 1: Store Discovery
Purpose
Automatically discover and maintain a database of dispensaries that use Dutchie menus across all US states.
Schedule
- Frequency: Weekly (typically Sunday night)
- Duration: ~2-4 hours for full US coverage
Flow Diagram
┌─────────────────────────────────────────────────────────────────────┐
│ PHASE 1: STORE DISCOVERY │
└─────────────────────────────────────────────────────────────────────┘
1. IDENTITY SETUP
┌──────────────────┐
│ getRandomProxy() │ ──► Random IP from proxy pool
└──────────────────┘
│
▼
┌──────────────────┐
│ startSession() │ ──► Random UA + fingerprint + locale matching proxy location
└──────────────────┘
2. CITY DISCOVERY (per state)
┌──────────────────────────────┐
│ GraphQL: getAllCitiesByState │ ──► Returns cities with active dispensaries
└──────────────────────────────┘
│
▼
┌──────────────────────────────┐
│ Upsert dutchie_discovery_ │
│ cities table │
└──────────────────────────────┘
3. STORE DISCOVERY (per city)
┌───────────────────────────────┐
│ GraphQL: ConsumerDispensaries │ ──► Returns store data for city
└───────────────────────────────┘
│
▼
┌───────────────────────────────┐
│ Upsert dutchie_discovery_ │
│ locations table │
└───────────────────────────────┘
4. VALIDATION & PROMOTION
┌──────────────────────────┐
│ validateForPromotion() │ ──► Check required fields
└──────────────────────────┘
│
▼
┌──────────────────────────┐
│ promoteLocation() │ ──► Upsert to dispensaries table
└──────────────────────────┘
│
▼
┌──────────────────────────┐
│ ensureCrawlerProfile() │ ──► Create profile with status='sandbox'
└──────────────────────────┘
5. DROPPED STORE DETECTION
┌──────────────────────────┐
│ detectDroppedStores() │ ──► Find stores missing from discovery
└──────────────────────────┘
│
▼
┌──────────────────────────┐
│ Mark status='dropped' │ ──► Dashboard alert for review
└──────────────────────────┘
Key Files
| File | Purpose |
|---|---|
| backend/src/platforms/dutchie/client.ts | HTTP client with proxy/fingerprint rotation |
| backend/src/discovery/discovery-crawler.ts | Main discovery orchestrator |
| backend/src/discovery/location-discovery.ts | City/store GraphQL fetching |
| backend/src/discovery/promotion.ts | Validation and promotion logic |
| backend/src/scripts/run-discovery.ts | CLI entry point |
Identity Masking
Before any GraphQL queries, the system establishes a masked identity:
1. Proxy Selection
// backend/src/platforms/dutchie/client.ts
// Get random proxy from active pool (NOT state-specific)
const proxy = await getRandomProxy();
setProxy(proxy.url);
The proxy is selected randomly from the active proxy pool. It is NOT geo-targeted to the state being crawled.
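As a minimal sketch of what uniform random selection from an active pool could look like: the `Proxy` shape, the `pickRandomProxy` name, and the example URLs below are illustrative assumptions, not the actual `getRandomProxy()` implementation or the real `proxies` schema.

```typescript
// Hypothetical shape of a proxy record; the real schema is not shown in this document.
interface Proxy {
  url: string;
  active: boolean;
}

// Pick one active proxy uniformly at random, ignoring the state being crawled.
// `rng` is injectable so the choice can be made deterministic in tests.
function pickRandomProxy(pool: Proxy[], rng: () => number = Math.random): Proxy {
  const active = pool.filter((p) => p.active);
  if (active.length === 0) {
    throw new Error('No active proxies available');
  }
  const index = Math.floor(rng() * active.length);
  return active[index];
}

// Example pool with placeholder URLs.
const examplePool: Proxy[] = [
  { url: 'http://proxy-a.example:8080', active: true },
  { url: 'http://proxy-b.example:8080', active: false },
  { url: 'http://proxy-c.example:8080', active: true },
];

const chosen = pickRandomProxy(examplePool);
console.log(`Using proxy ${chosen.url}`);
```

Throwing when the pool is empty keeps a crawl from silently falling back to a direct, unmasked connection.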
2. Fingerprint + Locale Harmonization
// backend/src/platforms/dutchie/client.ts
function startSession(stateCode: string, timezone: string) {
  // 1. Random browser fingerprint (Chrome/Firefox/Safari/Edge variants)
  const fingerprint = getRandomFingerprint();
  // 2. Match Accept-Language to proxy's timezone/location
  const locale = getLocaleForTimezone(timezone);
  // 3. Set headers for this session
  currentSession = {
    userAgent: fingerprint.ua,
    acceptLanguage: locale,
    secChUa: fingerprint.secChUa,
    // ... other fingerprint headers
  };
}
Fingerprint Pool
6 browser fingerprints rotate on each session and on 403 errors:
| Browser | Version | Platform |
|---|---|---|
| Chrome | 120 | Windows |
| Chrome | 120 | macOS |
| Firefox | 121 | Windows |
| Firefox | 121 | macOS |
| Safari | 17.2 | macOS |
| Edge | 120 | Windows |
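One way rotation over this pool might work is sketched below. The `Fingerprint` shape and `rotateFingerprint` helper are hypothetical, and excluding the current fingerprint from the candidates is an assumption; the document only says that fingerprints rotate per session and on 403 errors.

```typescript
// Simplified fingerprint record built from the table above.
interface Fingerprint {
  browser: string;
  version: string;
  platform: string;
}

const FINGERPRINTS: Fingerprint[] = [
  { browser: 'Chrome', version: '120', platform: 'Windows' },
  { browser: 'Chrome', version: '120', platform: 'macOS' },
  { browser: 'Firefox', version: '121', platform: 'Windows' },
  { browser: 'Firefox', version: '121', platform: 'macOS' },
  { browser: 'Safari', version: '17.2', platform: 'macOS' },
  { browser: 'Edge', version: '120', platform: 'Windows' },
];

// Pick a random fingerprint that differs from the one currently in use
// (assumption: a rotation should never return the same identity again).
// Pass `null` when no session has been started yet.
function rotateFingerprint(
  current: Fingerprint | null,
  rng: () => number = Math.random,
): Fingerprint {
  const candidates = FINGERPRINTS.filter((f) => f !== current);
  return candidates[Math.floor(rng() * candidates.length)];
}

const first = rotateFingerprint(null);
const afterBlock = rotateFingerprint(first); // e.g. after a 403 response
console.log(`${first.browser}/${first.platform} -> ${afterBlock.browser}/${afterBlock.platform}`);
```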
Timezone → Locale Mapping
const TIMEZONE_TO_LOCALE: Record<string, string> = {
  'America/New_York': 'en-US,en;q=0.9',
  'America/Chicago': 'en-US,en;q=0.9',
  'America/Denver': 'en-US,en;q=0.9',
  'America/Los_Angeles': 'en-US,en;q=0.9',
  'America/Phoenix': 'en-US,en;q=0.9',
  // ...
};
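A lookup over this map could be as small as the sketch below. The fallback value `DEFAULT_LOCALE` is an assumption; the document does not say what `getLocaleForTimezone` returns for a timezone missing from the map.

```typescript
// Entries copied from the mapping above; the real map has more (elided) entries.
const TIMEZONE_TO_LOCALE: Record<string, string> = {
  'America/New_York': 'en-US,en;q=0.9',
  'America/Chicago': 'en-US,en;q=0.9',
  'America/Denver': 'en-US,en;q=0.9',
  'America/Los_Angeles': 'en-US,en;q=0.9',
  'America/Phoenix': 'en-US,en;q=0.9',
};

// Assumed fallback for timezones that are not in the map.
const DEFAULT_LOCALE = 'en-US,en;q=0.9';

// Return the Accept-Language value for a timezone, falling back to the default.
// hasOwnProperty avoids matching inherited keys such as 'toString'.
function getLocaleForTimezone(timezone: string): string {
  return Object.prototype.hasOwnProperty.call(TIMEZONE_TO_LOCALE, timezone)
    ? TIMEZONE_TO_LOCALE[timezone]
    : DEFAULT_LOCALE;
}

console.log(getLocaleForTimezone('America/Phoenix'));
```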
GraphQL Queries
1. getAllCitiesByState
Fetches cities with active dispensaries for a state.
// backend/src/discovery/location-discovery.ts
const response = await executeGraphQL({
  operationName: 'getAllCitiesByState',
  variables: {
    state: 'AZ',
    countryCode: 'US'
  }
});
// Returns: { cities: [{ name: 'Phoenix', slug: 'phoenix' }, ...] }
Hash: ae547a0466ace5a48f91e55bf6699eacd87e3a42841560f0c0eabed5a0a920e6
2. ConsumerDispensaries
Fetches store data for a city/state.
// backend/src/discovery/location-discovery.ts
const response = await executeGraphQL({
  operationName: 'ConsumerDispensaries',
  variables: {
    dispensaryFilter: {
      city: 'Phoenix',
      state: 'AZ',
      activeOnly: true
    }
  }
});
// Returns: [{ id, name, address, coords, menuUrl, ... }, ...]
Hash: 0a5bfa6ca1d64ae47bcccb7c8077c87147cbc4e6982c17ceec97a2a4948b311b
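The two hashes look like SHA-256 identifiers for persisted queries. Assuming they are sent in the Apollo-style `extensions.persistedQuery` envelope (an assumption; the internals of `executeGraphQL` are not shown here), a request body could be assembled roughly as follows. `buildPersistedQuery` and `QUERY_HASHES` are hypothetical names.

```typescript
// Request body following the Apollo persisted-query convention.
interface PersistedQueryRequest {
  operationName: string;
  variables: Record<string, unknown>;
  extensions: { persistedQuery: { version: 1; sha256Hash: string } };
}

// Hashes as listed in this document for the two discovery operations.
const QUERY_HASHES: Record<string, string> = {
  getAllCitiesByState: 'ae547a0466ace5a48f91e55bf6699eacd87e3a42841560f0c0eabed5a0a920e6',
  ConsumerDispensaries: '0a5bfa6ca1d64ae47bcccb7c8077c87147cbc4e6982c17ceec97a2a4948b311b',
};

// Build the body for a persisted query; fails fast for unregistered operations.
function buildPersistedQuery(
  operationName: string,
  variables: Record<string, unknown>,
): PersistedQueryRequest {
  if (!Object.prototype.hasOwnProperty.call(QUERY_HASHES, operationName)) {
    throw new Error(`No hash registered for operation ${operationName}`);
  }
  return {
    operationName,
    variables,
    extensions: {
      persistedQuery: { version: 1, sha256Hash: QUERY_HASHES[operationName] },
    },
  };
}

const citiesRequest = buildPersistedQuery('getAllCitiesByState', {
  state: 'AZ',
  countryCode: 'US',
});
console.log(JSON.stringify(citiesRequest, null, 2));
```

Keeping the hashes in one table means a hash change on the platform side only needs a single update.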
Database Tables
Discovery Tables (Staging)
| Table | Purpose |
|---|---|
| dutchie_discovery_cities | Cities known to have dispensaries |
| dutchie_discovery_locations | Raw discovered store data |
Canonical Tables
| Table | Purpose |
|---|---|
| dispensaries | Promoted stores ready for crawling |
| dispensary_crawler_profiles | Crawler configuration per store |
| dutchie_promotion_log | Audit trail for all discovery actions |
Validation Rules
A discovery location must have these fields to be promoted:
| Field | Requirement |
|---|---|
| platform_location_id | MongoDB ObjectId (24 hex chars) |
| name | Non-empty string |
| city | Non-empty string |
| state_code | Non-empty string |
| platform_menu_url | Valid URL |
Invalid records are marked status='rejected' with errors logged.
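The rules above can be sketched as a pure function. This is an illustration only: the `DiscoveryLocation` and `ValidationResult` shapes and the error messages are assumptions, and the real `validateForPromotion()` in promotion.ts may differ.

```typescript
// Subset of a discovery row covering only the fields named in the rules above.
interface DiscoveryLocation {
  platform_location_id?: string | null;
  name?: string | null;
  city?: string | null;
  state_code?: string | null;
  platform_menu_url?: string | null;
}

interface ValidationResult {
  valid: boolean;
  errors: string[];
}

// A MongoDB ObjectId is rendered as 24 hexadecimal characters.
const OBJECT_ID_PATTERN = /^[0-9a-f]{24}$/i;

function isNonEmpty(value: string | null | undefined): boolean {
  return typeof value === 'string' && value.trim().length > 0;
}

// Accepts anything the WHATWG URL parser accepts as an absolute URL.
function isValidUrl(value: string | null | undefined): boolean {
  if (!value) return false;
  try {
    new URL(value);
    return true;
  } catch (e) {
    return false;
  }
}

// Collect every failed rule so a rejected record can be logged with all its errors.
function validateForPromotion(loc: DiscoveryLocation): ValidationResult {
  const errors: string[] = [];
  if (!loc.platform_location_id || !OBJECT_ID_PATTERN.test(loc.platform_location_id)) {
    errors.push('platform_location_id must be a 24-character hex string');
  }
  if (!isNonEmpty(loc.name)) errors.push('name must be a non-empty string');
  if (!isNonEmpty(loc.city)) errors.push('city must be a non-empty string');
  if (!isNonEmpty(loc.state_code)) errors.push('state_code must be a non-empty string');
  if (!isValidUrl(loc.platform_menu_url)) errors.push('platform_menu_url must be a valid URL');
  return { valid: errors.length === 0, errors };
}

const result = validateForPromotion({
  platform_location_id: 'not-an-id',
  name: '',
  city: 'Phoenix',
  state_code: 'AZ',
  platform_menu_url: 'not a url',
});
console.log(result.valid, result.errors);
```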
Dropped Store Detection
After discovery, the system identifies stores that may have left the Dutchie platform:
Detection Criteria
A store is marked as "dropped" if all of the following hold:
- It has a platform_dispensary_id (was previously verified)
- It's currently status='open' and crawl_enabled=true
- It was NOT seen in the latest discovery (not in dutchie_discovery_locations with last_seen_at in last 24 hours)
Implementation
// backend/src/discovery/discovery-crawler.ts
export async function detectDroppedStores(pool: Pool, stateCode?: string) {
  // 1. Find dispensaries not in recent discovery
  // 2. Mark status='dropped'
  // 3. Log to dutchie_promotion_log
  // 4. Return list for dashboard alert
}
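The detection criteria can be illustrated with an in-memory filter. This is a sketch, not the SQL the real function runs: the `Dispensary` shape, the `findDroppedStores` name, and the assumption that discovery rows can be keyed by the same platform ID as `platform_dispensary_id` are all hypothetical.

```typescript
// Minimal view of a dispensaries row with only the columns the criteria mention.
interface Dispensary {
  id: number;
  platform_dispensary_id: string | null;
  status: string;
  crawl_enabled: boolean;
}

const DAY_MS = 24 * 60 * 60 * 1000;

// `lastSeenAt` maps a platform ID to the last_seen_at of its discovery row
// (assumed join key). A store counts as dropped when it is verified, open,
// crawl-enabled, and either missing from discovery or last seen over 24 hours ago.
function findDroppedStores(
  dispensaries: Dispensary[],
  lastSeenAt: Record<string, Date>,
  now: Date,
): Dispensary[] {
  const cutoff = now.getTime() - DAY_MS;
  return dispensaries.filter((d) => {
    if (d.platform_dispensary_id === null) return false;
    if (d.status !== 'open' || !d.crawl_enabled) return false;
    const seen: Date | undefined = lastSeenAt[d.platform_dispensary_id];
    return seen === undefined || seen.getTime() < cutoff;
  });
}

const now = new Date('2025-01-08T00:00:00Z');
const seenIndex: Record<string, Date> = {
  a: new Date('2025-01-07T12:00:00Z'), // seen 12 hours ago
  b: new Date('2025-01-01T00:00:00Z'), // seen a week ago
};
const stores: Dispensary[] = [
  { id: 1, platform_dispensary_id: 'a', status: 'open', crawl_enabled: true },
  { id: 2, platform_dispensary_id: 'b', status: 'open', crawl_enabled: true },
  { id: 3, platform_dispensary_id: 'c', status: 'open', crawl_enabled: true },
  { id: 4, platform_dispensary_id: null, status: 'open', crawl_enabled: true },
  { id: 5, platform_dispensary_id: 'd', status: 'closed', crawl_enabled: true },
  { id: 6, platform_dispensary_id: 'e', status: 'open', crawl_enabled: false },
];
const droppedIds = findDroppedStores(stores, seenIndex, now).map((d) => d.id);
console.log(droppedIds); // stores 2 (stale) and 3 (never seen)
```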
Admin UI
- Dashboard: Red alert banner when dropped stores exist
- Dispensaries page: Filter by status=dropped to review
CLI Usage
# Discover all stores in a state
npx tsx src/scripts/run-discovery.ts discover:state AZ
# Discover all US states
npx tsx src/scripts/run-discovery.ts discover:all
# Dry run (no DB writes)
npx tsx src/scripts/run-discovery.ts discover:state CA --dry-run
# Check stats
npx tsx src/scripts/run-discovery.ts stats
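For readers wiring up a similar entry point, the commands above could be parsed along these lines. This is a hypothetical sketch, not the contents of run-discovery.ts; in particular, accepting `--dry-run` on `discover:all` is an assumption, since the document only shows it with `discover:state`.

```typescript
// Parsed form of the commands shown above.
type Command =
  | { name: 'discover:state'; stateCode: string; dryRun: boolean }
  | { name: 'discover:all'; dryRun: boolean }
  | { name: 'stats' };

// Parse the arguments that follow the script path.
function parseDiscoveryArgs(argv: string[]): Command {
  const dryRun = argv.indexOf('--dry-run') !== -1;
  // Positional arguments are everything that is not a --flag.
  const positional = argv.filter((arg) => arg.indexOf('--') !== 0);
  const name = positional[0];

  if (name === 'discover:state') {
    const stateCode = positional[1];
    if (!stateCode || !/^[A-Za-z]{2}$/.test(stateCode)) {
      throw new Error('discover:state requires a two-letter state code');
    }
    return { name: 'discover:state', stateCode: stateCode.toUpperCase(), dryRun };
  }
  if (name === 'discover:all') {
    return { name: 'discover:all', dryRun };
  }
  if (name === 'stats') {
    return { name: 'stats' };
  }
  throw new Error(`Unknown command: ${name}`);
}

console.log(parseDiscoveryArgs(['discover:state', 'CA', '--dry-run']));
```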
Rate Limiting
- 2 seconds between city requests
- Exponential backoff on 429/403 responses
- Fingerprint rotation on 403 errors
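Exponential backoff can be computed as in the sketch below. The 2-second base mirrors the spacing between city requests, but using it as the backoff base, the doubling factor, and the 60-second cap are all assumptions; the document does not state the actual parameters.

```typescript
// Assumed parameters; the real values are not given in this document.
const BASE_DELAY_MS = 2000;
const MAX_DELAY_MS = 60000;

// Delay before retry number `attempt` (0-based): 2s, 4s, 8s, ... capped at the maximum.
function backoffDelayMs(attempt: number): number {
  return Math.min(BASE_DELAY_MS * Math.pow(2, attempt), MAX_DELAY_MS);
}

for (let attempt = 0; attempt < 6; attempt++) {
  console.log(`attempt ${attempt}: wait ${backoffDelayMs(attempt)} ms`);
}
```

The cap keeps a long run of blocked requests from stalling a state-level crawl indefinitely.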
Error Handling
| Error | Action |
|---|---|
| 403 Forbidden | Rotate fingerprint, retry |
| 429 Rate Limited | Wait 30s, retry |
| Network timeout | Retry up to 3 times |
| GraphQL error | Log and continue to next city |
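The table can be read as a small decision function. The sketch below encodes it directly; the `CrawlError` and `Action` types and the `decideAction` name are hypothetical, and giving up on other HTTP statuses or after the third timeout retry is an assumption about cases the table does not cover.

```typescript
// Error categories taken from the table above.
type CrawlError =
  | { kind: 'http'; status: number }
  | { kind: 'timeout' }
  | { kind: 'graphql'; message: string };

type Action =
  | { type: 'retry'; rotateFingerprint: boolean; waitMs: number }
  | { type: 'skip_city' }
  | { type: 'give_up' };

const MAX_TIMEOUT_RETRIES = 3;
const RATE_LIMIT_WAIT_MS = 30000;

// `attempt` is the number of retries already made for this request (0-based).
function decideAction(error: CrawlError, attempt: number): Action {
  if (error.kind === 'http') {
    if (error.status === 403) {
      return { type: 'retry', rotateFingerprint: true, waitMs: 0 };
    }
    if (error.status === 429) {
      return { type: 'retry', rotateFingerprint: false, waitMs: RATE_LIMIT_WAIT_MS };
    }
    return { type: 'give_up' }; // assumption: other statuses are not retried
  }
  if (error.kind === 'timeout') {
    return attempt < MAX_TIMEOUT_RETRIES
      ? { type: 'retry', rotateFingerprint: false, waitMs: 0 }
      : { type: 'give_up' };
  }
  // GraphQL errors are logged by the caller, which then moves on to the next city.
  return { type: 'skip_city' };
}

console.log(decideAction({ kind: 'http', status: 403 }, 0));
```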
Monitoring
Logs
Discovery progress is logged to stdout:
[Discovery] Starting discovery for state: AZ
[Discovery] Step 1: Initializing proxy...
[Discovery] Step 2: Fetching cities...
[Discovery] Found 45 cities for AZ
[Discovery] Step 3: Discovering locations...
[Discovery] City 1/45: Phoenix - found 28 stores
...
[Discovery] Step 4: Auto-promoting discovered locations...
[Discovery] Created: 5 new dispensaries
[Discovery] Updated: 40 existing dispensaries
[Discovery] Step 5: Detecting dropped stores...
[Discovery] Found 2 dropped stores
Audit Log
All actions logged to dutchie_promotion_log:
| Action | Description |
|---|---|
| promoted_create | New dispensary created |
| promoted_update | Existing dispensary updated |
| rejected | Validation failed |
| dropped | Store not found in discovery |
Next: Phase 2
See docs/PRODUCT_CRAWL_V2.md for the product crawling phase (coming next).