# CannaiQ Crawl System V2

## Overview

The CannaiQ Crawl System is a GraphQL-based data pipeline that discovers and monitors cannabis dispensaries that run on the Dutchie platform. It operates in two phases:

1. **Phase 1: Store Discovery** - Weekly discovery of Dutchie-powered dispensaries
2. **Phase 2: Product Crawling** - Regular product/price/stock updates (documented separately)

---

## Phase 1: Store Discovery

### Purpose

Automatically discover and maintain a database of dispensaries that use Dutchie menus across all US states.

### Schedule

- **Frequency**: Weekly (typically Sunday night)
- **Duration**: ~2-4 hours for full US coverage

### Flow Diagram

```
┌─────────────────────────────────────────────────────────────────────┐
│                      PHASE 1: STORE DISCOVERY                       │
└─────────────────────────────────────────────────────────────────────┘

1. IDENTITY SETUP
   ┌──────────────────┐
   │ getRandomProxy() │ ──► Random IP from proxy pool
   └──────────────────┘
            │
            ▼
   ┌──────────────────┐
   │ startSession()   │ ──► Random UA + fingerprint + locale matching proxy location
   └──────────────────┘

2. CITY DISCOVERY (per state)
   ┌──────────────────────────────┐
   │ GraphQL: getAllCitiesByState │ ──► Returns cities with active dispensaries
   └──────────────────────────────┘
            │
            ▼
   ┌──────────────────────────────┐
   │ Upsert dutchie_discovery_    │
   │ cities table                 │
   └──────────────────────────────┘

3. STORE DISCOVERY (per city)
   ┌───────────────────────────────┐
   │ GraphQL: ConsumerDispensaries │ ──► Returns store data for city
   └───────────────────────────────┘
            │
            ▼
   ┌───────────────────────────────┐
   │ Upsert dutchie_discovery_     │
   │ locations table               │
   └───────────────────────────────┘

4. VALIDATION & PROMOTION
   ┌──────────────────────────┐
   │ validateForPromotion()   │ ──► Check required fields
   └──────────────────────────┘
            │
            ▼
   ┌──────────────────────────┐
   │ promoteLocation()        │ ──► Upsert to dispensaries table
   └──────────────────────────┘
            │
            ▼
   ┌──────────────────────────┐
   │ ensureCrawlerProfile()   │ ──► Create profile with status='sandbox'
   └──────────────────────────┘

5. DROPPED STORE DETECTION
   ┌──────────────────────────┐
   │ detectDroppedStores()    │ ──► Find stores missing from discovery
   └──────────────────────────┘
            │
            ▼
   ┌──────────────────────────┐
   │ Mark status='dropped'    │ ──► Dashboard alert for review
   └──────────────────────────┘
```

---

## Key Files

| File | Purpose |
|------|---------|
| `backend/src/platforms/dutchie/client.ts` | HTTP client with proxy/fingerprint rotation |
| `backend/src/discovery/discovery-crawler.ts` | Main discovery orchestrator |
| `backend/src/discovery/location-discovery.ts` | City/store GraphQL fetching |
| `backend/src/discovery/promotion.ts` | Validation and promotion logic |
| `backend/src/scripts/run-discovery.ts` | CLI entry point |

---

## Identity Masking

Before any GraphQL queries, the system establishes a masked identity:

### 1. Proxy Selection

```typescript
// backend/src/platforms/dutchie/client.ts

// Get random proxy from active pool (NOT state-specific)
const proxy = await getRandomProxy();
setProxy(proxy.url);
```

The proxy is selected randomly from the active proxy pool. It is NOT geo-targeted to the state being crawled.
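
As a rough illustration of the random, non-geo-targeted selection described above, the sketch below picks uniformly from the active entries of an in-memory pool. The `ProxyRecord` shape, the `active` flag, and the `pickRandomProxy` name are assumptions made for this example; the real `getRandomProxy()` is async and its internals are not shown in this document.

```typescript
// Hypothetical shape of a proxy pool entry (not taken from the codebase).
interface ProxyRecord {
  url: string;
  active: boolean;
}

// Picks one active proxy uniformly at random, ignoring the target state.
// `rng` is injectable so the choice can be made deterministic in tests.
function pickRandomProxy(
  pool: ProxyRecord[],
  rng: () => number = Math.random,
): ProxyRecord {
  const active = pool.filter((p) => p.active);
  if (active.length === 0) {
    throw new Error('No active proxies available');
  }
  const index = Math.floor(rng() * active.length);
  return active[index];
}
```

Throwing on an empty pool is a design choice of this sketch: it stops the crawl from silently falling back to a direct connection.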

### 2. Fingerprint + Locale Harmonization

```typescript
// backend/src/platforms/dutchie/client.ts

function startSession(stateCode: string, timezone: string) {
  // 1. Random browser fingerprint (Chrome/Firefox/Safari/Edge variants)
  const fingerprint = getRandomFingerprint();

  // 2. Match Accept-Language to proxy's timezone/location
  const locale = getLocaleForTimezone(timezone);

  // 3. Set headers for this session
  currentSession = {
    userAgent: fingerprint.ua,
    acceptLanguage: locale,
    secChUa: fingerprint.secChUa,
    // ... other fingerprint headers
  };
}
```

### Fingerprint Pool

Six browser fingerprints are available. A new one is chosen at the start of each session and again after any 403 error:

| Browser | Version | Platform |
|---------|---------|----------|
| Chrome | 120 | Windows |
| Chrome | 120 | macOS |
| Firefox | 121 | Windows |
| Firefox | 121 | macOS |
| Safari | 17.2 | macOS |
| Edge | 120 | Windows |
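
One way to rotate after a 403 is to exclude the fingerprint that was just blocked, so the retry always presents a different browser identity. The sketch below assumes that behaviour; the `Fingerprint` type, `FINGERPRINT_POOL`, and `rotateFingerprint` are hypothetical names, and only the six table rows come from this document.

```typescript
// Hypothetical fingerprint record mirroring the table columns.
interface Fingerprint {
  browser: string;
  version: string;
  platform: string;
}

const FINGERPRINT_POOL: Fingerprint[] = [
  { browser: 'Chrome', version: '120', platform: 'Windows' },
  { browser: 'Chrome', version: '120', platform: 'macOS' },
  { browser: 'Firefox', version: '121', platform: 'Windows' },
  { browser: 'Firefox', version: '121', platform: 'macOS' },
  { browser: 'Safari', version: '17.2', platform: 'macOS' },
  { browser: 'Edge', version: '120', platform: 'Windows' },
];

// Returns a random fingerprint that differs from `current`.
// Passing null (no session yet) selects from the whole pool.
function rotateFingerprint(
  current: Fingerprint | null,
  rng: () => number = Math.random,
): Fingerprint {
  const candidates = FINGERPRINT_POOL.filter((fp) => fp !== current);
  return candidates[Math.floor(rng() * candidates.length)];
}
```

Whether the real rotation excludes the current fingerprint or simply re-rolls is not stated here; excluding it avoids retrying with the identity that was just rejected.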

### Timezone → Locale Mapping

```typescript
const TIMEZONE_TO_LOCALE: Record<string, string> = {
  'America/New_York': 'en-US,en;q=0.9',
  'America/Chicago': 'en-US,en;q=0.9',
  'America/Denver': 'en-US,en;q=0.9',
  'America/Los_Angeles': 'en-US,en;q=0.9',
  'America/Phoenix': 'en-US,en;q=0.9',
  // ...
};
```
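
A lookup over this kind of map needs a fallback for timezones that are not listed. The sketch below is one plausible shape, not the actual `getLocaleForTimezone` implementation: `lookupLocale` and `DEFAULT_LOCALE` are hypothetical names, and the map repeats only the five entries shown above.

```typescript
// Subset of the mapping shown above, repeated so the example is self-contained.
const LOCALE_BY_TIMEZONE: Record<string, string> = {
  'America/New_York': 'en-US,en;q=0.9',
  'America/Chicago': 'en-US,en;q=0.9',
  'America/Denver': 'en-US,en;q=0.9',
  'America/Los_Angeles': 'en-US,en;q=0.9',
  'America/Phoenix': 'en-US,en;q=0.9',
};

// Assumed fallback for unmapped timezones (not confirmed by the source).
const DEFAULT_LOCALE = 'en-US,en;q=0.9';

// Returns the Accept-Language value for a timezone, or the default if unmapped.
function lookupLocale(timezone: string): string {
  return LOCALE_BY_TIMEZONE[timezone] ?? DEFAULT_LOCALE;
}
```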

---

## GraphQL Queries

### 1. getAllCitiesByState

Fetches cities with active dispensaries for a state.

```typescript
// backend/src/discovery/location-discovery.ts

const response = await executeGraphQL({
  operationName: 'getAllCitiesByState',
  variables: {
    state: 'AZ',
    countryCode: 'US'
  }
});
// Returns: { cities: [{ name: 'Phoenix', slug: 'phoenix' }, ...] }
```

**Hash**: `ae547a0466ace5a48f91e55bf6699eacd87e3a42841560f0c0eabed5a0a920e6`

### 2. ConsumerDispensaries

Fetches store data for a city/state.

```typescript
// backend/src/discovery/location-discovery.ts

const response = await executeGraphQL({
  operationName: 'ConsumerDispensaries',
  variables: {
    dispensaryFilter: {
      city: 'Phoenix',
      state: 'AZ',
      activeOnly: true
    }
  }
});
// Returns: [{ id, name, address, coords, menuUrl, ... }, ...]
```

**Hash**: `0a5bfa6ca1d64ae47bcccb7c8077c87147cbc4e6982c17ceec97a2a4948b311b`
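
The 64-character hex hashes above look like SHA-256 identifiers for Apollo-style persisted queries. Assuming that is how they are used, the sketch below shows how an operation name, its variables, and its hash could be combined into a GET URL. `buildPersistedQueryUrl`, `PersistedQueryRequest`, and the endpoint passed in are hypothetical; the transport that `executeGraphQL` actually uses is not shown in this document.

```typescript
// Hypothetical request description for a persisted query.
interface PersistedQueryRequest {
  operationName: string;
  variables: Record<string, unknown>;
  sha256Hash: string;
}

// Builds a GET URL in the Apollo persisted-query style: the query text is
// replaced by its SHA-256 hash inside the `extensions` parameter.
function buildPersistedQueryUrl(
  endpoint: string,
  req: PersistedQueryRequest,
): string {
  const extensions = {
    persistedQuery: { version: 1, sha256Hash: req.sha256Hash },
  };
  const params: Array<[string, string]> = [
    ['operationName', req.operationName],
    ['variables', JSON.stringify(req.variables)],
    ['extensions', JSON.stringify(extensions)],
  ];
  const query = params
    .map(([key, value]) => `${key}=${encodeURIComponent(value)}`)
    .join('&');
  return `${endpoint}?${query}`;
}

// Example: the city lookup for Arizona against a placeholder endpoint.
const citiesUrl = buildPersistedQueryUrl('https://example.invalid/graphql', {
  operationName: 'getAllCitiesByState',
  variables: { state: 'AZ', countryCode: 'US' },
  sha256Hash: 'ae547a0466ace5a48f91e55bf6699eacd87e3a42841560f0c0eabed5a0a920e6',
});
```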

---

## Database Tables

### Discovery Tables (Staging)

| Table | Purpose |
|-------|---------|
| `dutchie_discovery_cities` | Cities known to have dispensaries |
| `dutchie_discovery_locations` | Raw discovered store data |

### Canonical Tables

| Table | Purpose |
|-------|---------|
| `dispensaries` | Promoted stores ready for crawling |
| `dispensary_crawler_profiles` | Crawler configuration per store |
| `dutchie_promotion_log` | Audit trail for all discovery actions |

---

## Validation Rules

A discovery location must have these fields to be promoted:

| Field | Requirement |
|-------|-------------|
| `platform_location_id` | MongoDB ObjectId (24 hex chars) |
| `name` | Non-empty string |
| `city` | Non-empty string |
| `state_code` | Non-empty string |
| `platform_menu_url` | Valid URL |

Invalid records are marked `status='rejected'` with errors logged.
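
The rules in the table can be expressed as a function that returns a list of error messages, where an empty list means the record may be promoted. This is a minimal sketch under stated assumptions, not the actual `validateForPromotion()` code: the `DiscoveryLocation` type, the error strings, and the simplified "starts with http(s)://" URL check are all illustrative.

```typescript
// Hypothetical row shape containing only the fields the table requires.
interface DiscoveryLocation {
  platform_location_id: string | null;
  name: string | null;
  city: string | null;
  state_code: string | null;
  platform_menu_url: string | null;
}

const OBJECT_ID_PATTERN = /^[0-9a-f]{24}$/i; // 24 hex characters
const HTTP_URL_PATTERN = /^https?:\/\/[^\s/]+/i; // simplified URL check (assumption)

function isNonEmpty(value: string | null): boolean {
  return value !== null && value.trim().length > 0;
}

// Returns one message per failed rule; an empty array means "promotable".
function collectValidationErrors(loc: DiscoveryLocation): string[] {
  const errors: string[] = [];
  if (loc.platform_location_id === null || !OBJECT_ID_PATTERN.test(loc.platform_location_id)) {
    errors.push('platform_location_id must be a 24-character hex ObjectId');
  }
  if (!isNonEmpty(loc.name)) errors.push('name is required');
  if (!isNonEmpty(loc.city)) errors.push('city is required');
  if (!isNonEmpty(loc.state_code)) errors.push('state_code is required');
  if (loc.platform_menu_url === null || !HTTP_URL_PATTERN.test(loc.platform_menu_url)) {
    errors.push('platform_menu_url must be a valid URL');
  }
  return errors;
}
```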

---

## Dropped Store Detection

After discovery, the system identifies stores that may have left the Dutchie platform:

### Detection Criteria

A store is marked as "dropped" if:

1. It has a `platform_dispensary_id` (was previously verified)
2. It currently has `status='open'` and `crawl_enabled=true`
3. It was NOT seen in the latest discovery (no `dutchie_discovery_locations` row with `last_seen_at` within the last 24 hours)

### Implementation

```typescript
// backend/src/discovery/discovery-crawler.ts

export async function detectDroppedStores(pool: Pool, stateCode?: string) {
  // 1. Find dispensaries not in recent discovery
  // 2. Mark status='dropped'
  // 3. Log to dutchie_promotion_log
  // 4. Return list for dashboard alert
}
```
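
The three detection criteria can be illustrated with an in-memory filter. The real function works against the database through `pool`; this sketch only models the selection logic. The row types, `findDroppedStores`, and the assumption that `platform_dispensary_id` on a dispensary matches `platform_location_id` on a discovery row are all illustrative.

```typescript
// Hypothetical, trimmed-down row shapes.
interface DispensaryRow {
  id: number;
  platform_dispensary_id: string | null;
  status: string;
  crawl_enabled: boolean;
}

interface DiscoverySighting {
  platform_location_id: string;
  last_seen_at: Date;
}

const DAY_MS = 24 * 60 * 60 * 1000;

// Returns dispensaries that meet all three "dropped" criteria as of `now`.
function findDroppedStores(
  dispensaries: DispensaryRow[],
  sightings: DiscoverySighting[],
  now: Date,
): DispensaryRow[] {
  const cutoff = now.getTime() - DAY_MS;
  // IDs seen by discovery within the last 24 hours.
  const recentlySeen = new Set(
    sightings
      .filter((s) => s.last_seen_at.getTime() >= cutoff)
      .map((s) => s.platform_location_id),
  );
  return dispensaries.filter(
    (d) =>
      d.platform_dispensary_id !== null && // criterion 1: previously verified
      d.status === 'open' && // criterion 2: currently open...
      d.crawl_enabled && // ...and enabled for crawling
      !recentlySeen.has(d.platform_dispensary_id), // criterion 3: not seen recently
  );
}
```

In a SQL-backed version the same conditions would typically become a `WHERE` clause with a `NOT EXISTS` subquery over `dutchie_discovery_locations`.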

### Admin UI

- **Dashboard**: Red alert banner when dropped stores exist
- **Dispensaries page**: Filter by `status=dropped` to review

---

## CLI Usage

```bash
# Discover all stores in a state
npx tsx src/scripts/run-discovery.ts discover:state AZ

# Discover all US states
npx tsx src/scripts/run-discovery.ts discover:all

# Dry run (no DB writes)
npx tsx src/scripts/run-discovery.ts discover:state CA --dry-run

# Check stats
npx tsx src/scripts/run-discovery.ts stats
```

---

## Rate Limiting

- **2 seconds** between city requests
- **Exponential backoff** on 429/403 responses
- **Fingerprint rotation** on 403 errors
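
Exponential backoff doubles the wait after each consecutive failure, up to a ceiling. The sketch below shows the arithmetic. The 2-second base matches the city request interval above, but using it as the backoff base, the 60-second cap, and the `backoffDelayMs` name are assumptions for illustration.

```typescript
// Delay before retry number `attempt` (0-based): base * 2^attempt, capped at maxMs.
// Default values are assumptions, not confirmed settings.
function backoffDelayMs(
  attempt: number,
  baseMs: number = 2000,
  maxMs: number = 60000,
): number {
  return Math.min(maxMs, baseMs * Math.pow(2, attempt));
}

// Delays for the first four retries: 2s, 4s, 8s, 16s.
const firstDelays = [0, 1, 2, 3].map((attempt) => backoffDelayMs(attempt));
```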

---

## Error Handling

| Error | Action |
|-------|--------|
| 403 Forbidden | Rotate fingerprint, retry |
| 429 Rate Limited | Wait 30s, retry |
| Network timeout | Retry up to 3 times |
| GraphQL error | Log and continue to next city |
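
The table can be read as a function from a failure to a recovery action. The sketch below encodes the four rows. The type names, `decideRecovery`, and the handling of HTTP statuses not listed in the table (treated here as "skip the city") are assumptions, not documented behaviour.

```typescript
// Hypothetical failure and action types for illustration.
type CrawlFailure =
  | { kind: 'http'; status: number }
  | { kind: 'timeout' }
  | { kind: 'graphql'; message: string };

type RecoveryAction =
  | { type: 'rotate_fingerprint_and_retry' }
  | { type: 'wait_and_retry'; waitMs: number }
  | { type: 'retry' }
  | { type: 'skip_city' };

const MAX_TIMEOUT_RETRIES = 3;
const RATE_LIMIT_WAIT_MS = 30000;

// Maps a failure to the action from the table; `retriesSoFar` counts
// retries already made for the current request.
function decideRecovery(failure: CrawlFailure, retriesSoFar: number): RecoveryAction {
  switch (failure.kind) {
    case 'http':
      if (failure.status === 403) return { type: 'rotate_fingerprint_and_retry' };
      if (failure.status === 429) return { type: 'wait_and_retry', waitMs: RATE_LIMIT_WAIT_MS };
      return { type: 'skip_city' }; // assumption: statuses not in the table are skipped
    case 'timeout':
      return retriesSoFar < MAX_TIMEOUT_RETRIES ? { type: 'retry' } : { type: 'skip_city' };
    case 'graphql':
      return { type: 'skip_city' }; // log and continue to next city
  }
}
```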

---

## Monitoring

### Logs

Discovery progress is logged to stdout:

```
[Discovery] Starting discovery for state: AZ
[Discovery] Step 1: Initializing proxy...
[Discovery] Step 2: Fetching cities...
[Discovery] Found 45 cities for AZ
[Discovery] Step 3: Discovering locations...
[Discovery] City 1/45: Phoenix - found 28 stores
...
[Discovery] Step 4: Auto-promoting discovered locations...
[Discovery] Created: 5 new dispensaries
[Discovery] Updated: 40 existing dispensaries
[Discovery] Step 5: Detecting dropped stores...
[Discovery] Found 2 dropped stores
```

### Audit Log

All actions are logged to `dutchie_promotion_log`:

| Action | Description |
|--------|-------------|
| `promoted_create` | New dispensary created |
| `promoted_update` | Existing dispensary updated |
| `rejected` | Validation failed |
| `dropped` | Store not found in discovery |

---

## Next: Phase 2

See `docs/PRODUCT_CRAWL_V2.md` for the product crawling phase (coming next).