cannaiq/docs/CRAWL_SYSTEM_V2.md
Kelly 56cc171287 feat: Stealth worker system with mandatory proxy rotation
## Worker System
- Role-agnostic workers that can handle any task type
- Pod-based architecture with StatefulSet (5-15 pods, 5 workers each)
- Custom pod names (Aethelgard, Xylos, Kryll, etc.)
- Worker registry with friendly names and resource monitoring
- Hub-and-spoke visualization on JobQueue page

## Stealth & Anti-Detection (REQUIRED)
- Proxies are MANDATORY - workers fail to start without active proxies
- CrawlRotator initializes on worker startup
- Loads proxies from `proxies` table
- Auto-rotates proxy + fingerprint on 403 errors
- 12 browser fingerprints (Chrome, Firefox, Safari, Edge)
- Locale/timezone matching for geographic consistency

## Task System
- Renamed product_resync → product_refresh
- Task chaining: store_discovery → entry_point → product_discovery
- Priority-based claiming with FOR UPDATE SKIP LOCKED
- Heartbeat and stale task recovery

## UI Updates
- JobQueue: Pod visualization, resource monitoring on hover
- WorkersDashboard: Simplified worker list
- Removed unused filters from task list

## Other
- IP2Location service for visitor analytics
- Findagram consumer features scaffolding
- Documentation updates

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-10 00:44:59 -07:00


# CannaiQ Crawl System V2
## Overview
The CannaiQ Crawl System is a GraphQL-based data pipeline that discovers and monitors cannabis dispensaries whose menus run on the Dutchie platform. It operates in two phases:
1. **Phase 1: Store Discovery** - Weekly discovery of Dutchie-powered dispensaries
2. **Phase 2: Product Crawling** - Regular product/price/stock updates (documented separately)
---
## Phase 1: Store Discovery
### Purpose
Automatically discover and maintain a database of dispensaries that use Dutchie menus across all US states.
### Schedule
- **Frequency**: Weekly (typically Sunday night)
- **Duration**: ~2-4 hours for full US coverage
### Flow Diagram
```
┌─────────────────────────────────────────────────────────────────────┐
│ PHASE 1: STORE DISCOVERY │
└─────────────────────────────────────────────────────────────────────┘
1. IDENTITY SETUP
┌──────────────────┐
│ getRandomProxy() │ ──► Random IP from proxy pool
└──────────────────┘
┌──────────────────┐
│ startSession() │ ──► Random UA + fingerprint + locale matching proxy location
└──────────────────┘
2. CITY DISCOVERY (per state)
┌──────────────────────────────┐
│ GraphQL: getAllCitiesByState │ ──► Returns cities with active dispensaries
└──────────────────────────────┘
┌──────────────────────────────┐
│ Upsert dutchie_discovery_ │
│ cities table │
└──────────────────────────────┘
3. STORE DISCOVERY (per city)
┌───────────────────────────────┐
│ GraphQL: ConsumerDispensaries │ ──► Returns store data for city
└───────────────────────────────┘
┌───────────────────────────────┐
│ Upsert dutchie_discovery_ │
│ locations table │
└───────────────────────────────┘
4. VALIDATION & PROMOTION
┌──────────────────────────┐
│ validateForPromotion() │ ──► Check required fields
└──────────────────────────┘
┌──────────────────────────┐
│ promoteLocation() │ ──► Upsert to dispensaries table
└──────────────────────────┘
┌──────────────────────────┐
│ ensureCrawlerProfile() │ ──► Create profile with status='sandbox'
└──────────────────────────┘
5. DROPPED STORE DETECTION
┌──────────────────────────┐
│ detectDroppedStores() │ ──► Find stores missing from discovery
└──────────────────────────┘
┌──────────────────────────┐
│ Mark status='dropped' │ ──► Dashboard alert for review
└──────────────────────────┘
```
---
## Key Files
| File | Purpose |
|------|---------|
| `backend/src/platforms/dutchie/client.ts` | HTTP client with proxy/fingerprint rotation |
| `backend/src/discovery/discovery-crawler.ts` | Main discovery orchestrator |
| `backend/src/discovery/location-discovery.ts` | City/store GraphQL fetching |
| `backend/src/discovery/promotion.ts` | Validation and promotion logic |
| `backend/src/scripts/run-discovery.ts` | CLI entry point |
---
## Identity Masking
Before issuing any GraphQL queries, the system establishes a masked identity:
### 1. Proxy Selection
```typescript
// backend/src/platforms/dutchie/client.ts
// Get random proxy from active pool (NOT state-specific)
const proxy = await getRandomProxy();
setProxy(proxy.url);
```
The proxy is selected randomly from the active proxy pool. It is NOT geo-targeted to the state being crawled.
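
As a rough illustration of random, non-geo-targeted selection, the sketch below picks uniformly from the active entries of an in-memory pool. The `ProxyRecord` shape and the `pickRandomProxy` name are assumptions made for this example and are not taken from `client.ts`; the real `getRandomProxy()` is asynchronous and may behave differently.

```typescript
// Hypothetical shape of one proxy pool entry; field names are assumptions.
interface ProxyRecord {
  url: string;
  active: boolean;
}

// Pick one active proxy uniformly at random.
// `random` is injectable so the choice can be made deterministic in tests.
function pickRandomProxy(
  pool: ProxyRecord[],
  random: () => number = Math.random,
): ProxyRecord {
  const active = pool.filter((p) => p.active);
  if (active.length === 0) {
    // Mirrors the "proxies are mandatory" rule: no proxy, no crawl.
    throw new Error("No active proxies available");
  }
  const chosen = active[Math.floor(random() * active.length)];
  if (chosen === undefined) {
    // Only reachable if `random` returns a value outside [0, 1).
    throw new Error("Random index out of range");
  }
  return chosen;
}
```

Failing loudly on an empty pool keeps a crawl from silently falling back to the host's own IP.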
### 2. Fingerprint + Locale Harmonization
```typescript
// backend/src/platforms/dutchie/client.ts
function startSession(stateCode: string, timezone: string) {
  // 1. Random browser fingerprint (Chrome/Firefox/Safari/Edge variants)
  const fingerprint = getRandomFingerprint();
  // 2. Match Accept-Language to proxy's timezone/location
  const locale = getLocaleForTimezone(timezone);
  // 3. Set headers for this session
  currentSession = {
    userAgent: fingerprint.ua,
    acceptLanguage: locale,
    secChUa: fingerprint.secChUa,
    // ... other fingerprint headers
  };
}
```
### Fingerprint Pool
The pool contains 6 browser fingerprints. A fingerprint is chosen at the start of each session and replaced whenever a request returns 403:
| Browser | Version | Platform |
|---------|---------|----------|
| Chrome | 120 | Windows |
| Chrome | 120 | macOS |
| Firefox | 121 | Windows |
| Firefox | 121 | macOS |
| Safari | 17.2 | macOS |
| Edge | 120 | Windows |
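
A minimal sketch of rotation over the pool in the table above. The `Fingerprint` fields and the `nextFingerprint` name are assumptions, and excluding the fingerprint currently in use is a design choice of this example rather than documented behavior.

```typescript
// Entries mirror the fingerprint table; the field names are assumptions.
interface Fingerprint {
  browser: string;
  version: string;
  platform: string;
}

const FINGERPRINT_POOL: readonly Fingerprint[] = [
  { browser: "Chrome", version: "120", platform: "Windows" },
  { browser: "Chrome", version: "120", platform: "macOS" },
  { browser: "Firefox", version: "121", platform: "Windows" },
  { browser: "Firefox", version: "121", platform: "macOS" },
  { browser: "Safari", version: "17.2", platform: "macOS" },
  { browser: "Edge", version: "120", platform: "Windows" },
];

// Choose a fingerprint at random, excluding the one currently in use so that
// a rotation after a 403 always changes the presented identity.
function nextFingerprint(
  current: Fingerprint | null,
  random: () => number = Math.random,
): Fingerprint {
  const candidates = FINGERPRINT_POOL.filter((f) => f !== current);
  const chosen = candidates[Math.floor(random() * candidates.length)];
  if (chosen === undefined) {
    // Only reachable if `random` returns a value outside [0, 1).
    throw new Error("No fingerprint candidate available");
  }
  return chosen;
}
```

Passing `null` selects from the whole pool (session start); passing the current fingerprint guarantees a different one (403 rotation).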
### Timezone → Locale Mapping
```typescript
const TIMEZONE_TO_LOCALE: Record<string, string> = {
  'America/New_York': 'en-US,en;q=0.9',
  'America/Chicago': 'en-US,en;q=0.9',
  'America/Denver': 'en-US,en;q=0.9',
  'America/Los_Angeles': 'en-US,en;q=0.9',
  'America/Phoenix': 'en-US,en;q=0.9',
  // ...
};
```
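
One way a lookup over such a mapping might handle timezones that are not listed is sketched below. The `resolveAcceptLanguage` name, the use of a `Map`, and the fallback value are assumptions for illustration; only a few of the entries above are repeated here.

```typescript
// A subset of the timezone mapping, stored in a Map so that lookups never
// hit inherited object properties such as "constructor".
const ACCEPT_LANGUAGE_BY_TIMEZONE = new Map<string, string>([
  ["America/New_York", "en-US,en;q=0.9"],
  ["America/Chicago", "en-US,en;q=0.9"],
  ["America/Phoenix", "en-US,en;q=0.9"],
]);

// Fallback used when the proxy's timezone is not in the map (an assumption).
const DEFAULT_ACCEPT_LANGUAGE = "en-US,en;q=0.9";

// Return the Accept-Language header value for a given IANA timezone name.
function resolveAcceptLanguage(timezone: string): string {
  return ACCEPT_LANGUAGE_BY_TIMEZONE.get(timezone) ?? DEFAULT_ACCEPT_LANGUAGE;
}
```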
---
## GraphQL Queries
### 1. getAllCitiesByState
Fetches cities with active dispensaries for a state.
```typescript
// backend/src/discovery/location-discovery.ts
const response = await executeGraphQL({
  operationName: 'getAllCitiesByState',
  variables: {
    state: 'AZ',
    countryCode: 'US'
  }
});
// Returns: { cities: [{ name: 'Phoenix', slug: 'phoenix' }, ...] }
```
**Hash**: `ae547a0466ace5a48f91e55bf6699eacd87e3a42841560f0c0eabed5a0a920e6`
### 2. ConsumerDispensaries
Fetches store data for a city/state.
```typescript
// backend/src/discovery/location-discovery.ts
const response = await executeGraphQL({
  operationName: 'ConsumerDispensaries',
  variables: {
    dispensaryFilter: {
      city: 'Phoenix',
      state: 'AZ',
      activeOnly: true
    }
  }
});
// Returns: [{ id, name, address, coords, menuUrl, ... }, ...]
```
**Hash**: `0a5bfa6ca1d64ae47bcccb7c8077c87147cbc4e6982c17ceec97a2a4948b311b`
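
The hashes above resemble persisted-query identifiers. Assuming the endpoint follows the Apollo automatic persisted queries convention (an assumption this document does not confirm), the request parameters for either operation could be assembled roughly as follows. `buildPersistedQueryParams` and `PersistedQueryParams` are hypothetical names, and how `executeGraphQL` actually encodes its requests is not shown in this document.

```typescript
// Query-string parameters for a persisted-query request. Under the Apollo
// convention, `variables` and `extensions` are sent as JSON-encoded strings.
interface PersistedQueryParams {
  operationName: string;
  variables: string;
  extensions: string;
}

// Build the parameters for one operation from its name, variables, and the
// SHA-256 hash that identifies the stored query text on the server.
function buildPersistedQueryParams(
  operationName: string,
  variables: Record<string, unknown>,
  sha256Hash: string,
): PersistedQueryParams {
  return {
    operationName,
    variables: JSON.stringify(variables),
    extensions: JSON.stringify({
      persistedQuery: { version: 1, sha256Hash },
    }),
  };
}

// Example: parameters for the city lookup shown earlier in this section.
const cityParams = buildPersistedQueryParams(
  "getAllCitiesByState",
  { state: "AZ", countryCode: "US" },
  "ae547a0466ace5a48f91e55bf6699eacd87e3a42841560f0c0eabed5a0a920e6",
);
```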

---
## Database Tables
### Discovery Tables (Staging)
| Table | Purpose |
|-------|---------|
| `dutchie_discovery_cities` | Cities known to have dispensaries |
| `dutchie_discovery_locations` | Raw discovered store data |
### Canonical Tables
| Table | Purpose |
|-------|---------|
| `dispensaries` | Promoted stores ready for crawling |
| `dispensary_crawler_profiles` | Crawler configuration per store |
| `dutchie_promotion_log` | Audit trail for all discovery actions |
---
## Validation Rules
A discovery location must have these fields to be promoted:
| Field | Requirement |
|-------|-------------|
| `platform_location_id` | MongoDB ObjectId (24 hex chars) |
| `name` | Non-empty string |
| `city` | Non-empty string |
| `state_code` | Non-empty string |
| `platform_menu_url` | Valid URL |
Invalid records are marked `status='rejected'` with errors logged.
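
The rules in the table can be sketched as a pure check that collects one error per failed field. `DiscoveryLocation`, `ValidationResult`, and `checkLocation` are hypothetical names; the actual `validateForPromotion()` in `promotion.ts` may differ, and treating only absolute http(s) URLs as valid is an assumption of this example.

```typescript
// Hypothetical row shape; column names follow the validation table.
interface DiscoveryLocation {
  platform_location_id: string | null;
  name: string | null;
  city: string | null;
  state_code: string | null;
  platform_menu_url: string | null;
}

interface ValidationResult {
  valid: boolean;
  errors: string[];
}

// A MongoDB ObjectId is rendered as exactly 24 hexadecimal characters.
const OBJECT_ID = /^[0-9a-fA-F]{24}$/;

function isNonEmpty(value: string | null): boolean {
  return value !== null && value.trim().length > 0;
}

// Accept only absolute http(s) URLs.
function isHttpUrl(value: string | null): boolean {
  if (value === null) return false;
  try {
    const { protocol } = new URL(value);
    return protocol === "http:" || protocol === "https:";
  } catch {
    return false; // new URL() throws on unparseable input
  }
}

function checkLocation(loc: DiscoveryLocation): ValidationResult {
  const errors: string[] = [];
  if (loc.platform_location_id === null || !OBJECT_ID.test(loc.platform_location_id)) {
    errors.push("platform_location_id must be a 24-character hex ObjectId");
  }
  if (!isNonEmpty(loc.name)) errors.push("name is required");
  if (!isNonEmpty(loc.city)) errors.push("city is required");
  if (!isNonEmpty(loc.state_code)) errors.push("state_code is required");
  if (!isHttpUrl(loc.platform_menu_url)) errors.push("platform_menu_url must be a valid URL");
  return { valid: errors.length === 0, errors };
}
```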

---
## Dropped Store Detection
After discovery, the system identifies stores that may have left the Dutchie platform:
### Detection Criteria
A store is marked as "dropped" if:
1. It has a `platform_dispensary_id` (it was previously verified)
2. It currently has `status='open'` and `crawl_enabled=true`
3. It was NOT seen in the latest discovery run (no row in `dutchie_discovery_locations` with `last_seen_at` within the last 24 hours)
### Implementation
```typescript
// backend/src/discovery/discovery-crawler.ts
export async function detectDroppedStores(pool: Pool, stateCode?: string) {
  // 1. Find dispensaries not in recent discovery
  // 2. Mark status='dropped'
  // 3. Log to dutchie_promotion_log
  // 4. Return list for dashboard alert
}
```
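
The three detection criteria can also be expressed as an in-memory filter, which is easier to unit test than a database query. This is only a sketch: the `DispensaryRow` shape, the `findDroppedStores` name, and the assumption that discovery rows are keyed by the same platform ID as `platform_dispensary_id` are all hypothetical.

```typescript
// Hypothetical subset of a `dispensaries` row.
interface DispensaryRow {
  id: number;
  platform_dispensary_id: string | null;
  status: string;
  crawl_enabled: boolean;
}

const DAY_MS = 24 * 60 * 60 * 1000;

// `lastSeenAt` maps a platform ID to the most recent `last_seen_at` value
// found in the discovery table (assumed join key).
function findDroppedStores(
  dispensaries: DispensaryRow[],
  lastSeenAt: Map<string, Date>,
  now: Date,
  windowMs: number = DAY_MS,
): DispensaryRow[] {
  return dispensaries.filter((d) => {
    // Criterion 1: must have been verified previously.
    if (d.platform_dispensary_id === null) return false;
    // Criterion 2: must currently be open and enabled for crawling.
    if (d.status !== "open" || !d.crawl_enabled) return false;
    // Criterion 3: never seen, or last seen outside the window.
    const seen = lastSeenAt.get(d.platform_dispensary_id);
    return seen === undefined || now.getTime() - seen.getTime() > windowMs;
  });
}
```

Passing `now` explicitly keeps the 24-hour window deterministic in tests.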
### Admin UI
- **Dashboard**: Red alert banner when dropped stores exist
- **Dispensaries page**: Filter by `status=dropped` to review
---
## CLI Usage
```bash
# Discover all stores in a state
npx tsx src/scripts/run-discovery.ts discover:state AZ
# Discover all US states
npx tsx src/scripts/run-discovery.ts discover:all
# Dry run (no DB writes)
npx tsx src/scripts/run-discovery.ts discover:state CA --dry-run
# Check stats
npx tsx src/scripts/run-discovery.ts stats
```
---
## Rate Limiting
- **2 seconds** between city requests
- **Exponential backoff** on 429/403 responses
- **Fingerprint rotation** on 403 errors
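
For illustration, an exponential backoff schedule can be computed as below. The 2000 ms base echoes the 2-second pacing between city requests, but the base, the 60-second cap, and the function names are assumptions of this sketch; the crawler's actual parameters are not stated in this document.

```typescript
// Delay before retry number `attempt` (0-based): baseMs * 2^attempt,
// capped at maxMs so repeated failures do not stall the crawl indefinitely.
function backoffDelayMs(
  attempt: number,
  baseMs: number = 2000,
  maxMs: number = 60000,
): number {
  if (!Number.isInteger(attempt) || attempt < 0) {
    throw new RangeError("attempt must be a non-negative integer");
  }
  return Math.min(baseMs * Math.pow(2, attempt), maxMs);
}

// The full list of delays for a given number of retries.
function backoffSchedule(retries: number): number[] {
  const delays: number[] = [];
  for (let attempt = 0; attempt < retries; attempt++) {
    delays.push(backoffDelayMs(attempt));
  }
  return delays;
}
```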
---
## Error Handling
| Error | Action |
|-------|--------|
| 403 Forbidden | Rotate fingerprint, retry |
| 429 Rate Limited | Wait 30s, retry |
| Network timeout | Retry up to 3 times |
| GraphQL error | Log and continue to next city |
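
The table can be read as a small decision function. The sketch below encodes it with hypothetical `CrawlFailure` and `Decision` types. Two behaviors are assumptions because the table does not cover them: HTTP statuses other than 403 and 429 skip the city, and so does a timeout once its three retries are used up.

```typescript
type CrawlFailure =
  | { kind: "http"; status: number }
  | { kind: "timeout" }
  | { kind: "graphql"; message: string };

type Decision =
  | { action: "rotate_fingerprint_and_retry" }
  | { action: "wait_and_retry"; waitMs: number }
  | { action: "retry" }
  | { action: "skip_city" };

const RATE_LIMIT_WAIT_MS = 30000; // "Wait 30s" from the table
const MAX_TIMEOUT_RETRIES = 3; // "Retry up to 3 times" from the table

// `retriesSoFar` counts retries already made for the current request.
function decide(failure: CrawlFailure, retriesSoFar: number): Decision {
  switch (failure.kind) {
    case "http":
      if (failure.status === 403) {
        return { action: "rotate_fingerprint_and_retry" };
      }
      if (failure.status === 429) {
        return { action: "wait_and_retry", waitMs: RATE_LIMIT_WAIT_MS };
      }
      // Not covered by the table; skipping is an assumption.
      return { action: "skip_city" };
    case "timeout":
      if (retriesSoFar < MAX_TIMEOUT_RETRIES) {
        return { action: "retry" };
      }
      // Retries exhausted; skipping is an assumption.
      return { action: "skip_city" };
    case "graphql":
      // The caller is expected to log the message before moving on.
      return { action: "skip_city" };
  }
}
```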
---
## Monitoring
### Logs
Discovery progress is logged to stdout:
```
[Discovery] Starting discovery for state: AZ
[Discovery] Step 1: Initializing proxy...
[Discovery] Step 2: Fetching cities...
[Discovery] Found 45 cities for AZ
[Discovery] Step 3: Discovering locations...
[Discovery] City 1/45: Phoenix - found 28 stores
...
[Discovery] Step 4: Auto-promoting discovered locations...
[Discovery] Created: 5 new dispensaries
[Discovery] Updated: 40 existing dispensaries
[Discovery] Step 5: Detecting dropped stores...
[Discovery] Found 2 dropped stores
```
### Audit Log
All actions are logged to `dutchie_promotion_log`:
| Action | Description |
|--------|-------------|
| `promoted_create` | New dispensary created |
| `promoted_update` | Existing dispensary updated |
| `rejected` | Validation failed |
| `dropped` | Store not found in discovery |
---
## Next: Phase 2
See `docs/PRODUCT_CRAWL_V2.md` for the product crawling phase (coming next).