fix: Add curl to Docker, add active flag to worker_tasks
- Install curl in Docker container for Dutchie HTTP requests
- Add 'active' column to worker_tasks (default false) to prevent accidental task execution on startup
- Update task-service to only claim tasks where active=true

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
365
workflow-12102025.md
Normal file
@@ -0,0 +1,365 @@
# Workflow Documentation - December 10, 2025

## Purpose

This document captures the intended behavior for the CannaiQ crawl system, specifically around proxy rotation, fingerprinting, and anti-detection.

---

## Stealth & Anti-Detection Requirements

### 1. Task Determines Work, Proxy Determines Identity

The task payload contains:
- `dispensary_id` - which store to crawl
- `role` - what type of work (product_resync, entry_point_discovery, etc.)

The **proxy** determines the session identity:
- Proxy location (city, state, timezone) → sets Accept-Language and timezone headers
- Language is always English (`en-US`)

**Flow:**
```
Task claimed
    │
    └─► Get proxy from rotation
            │
            └─► Proxy has location (city, state, timezone)
                    │
                    └─► Build headers using proxy's timezone
                        - Accept-Language: en-US,en;q=0.9
                        - Timezone-consistent behavior
```
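
The flow above can be sketched as a small pure function. This is only an illustration under assumptions: `ProxyLocation`, `SessionIdentity`, and `buildSessionIdentity` are hypothetical names, and the real `Proxy` interface in `crawl-rotator.ts` may be shaped differently.

```typescript
// Hypothetical shapes for illustration; the real Proxy interface may differ.
interface ProxyLocation {
  city?: string;
  state?: string;
  timezone?: string; // IANA name such as "America/Phoenix"
}

interface SessionIdentity {
  timezone: string | null;         // null when the proxy row has no timezone
  headers: Record<string, string>; // identity-related headers only
}

// Language is fixed per this spec; only the timezone varies with the proxy.
const ACCEPT_LANGUAGE = 'en-US,en;q=0.9';

// The task never passes state/timezone in; identity comes from the proxy alone.
function buildSessionIdentity(location: ProxyLocation): SessionIdentity {
  return {
    timezone: location.timezone ?? null,
    headers: { 'Accept-Language': ACCEPT_LANGUAGE },
  };
}

const identity = buildSessionIdentity({
  city: 'Phoenix',
  state: 'AZ',
  timezone: 'America/Phoenix',
});
console.log(identity.timezone, identity.headers['Accept-Language']);
```

Keeping this step free of task data makes it easy to confirm that two tasks routed through the same proxy present the same identity.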

### 2. On 403 Block - Immediate Backoff

When a 403 is received:

1. **Immediately** stop using current IP
2. Get a new proxy (new IP)
3. Get a new UA/fingerprint
4. Retry the request

**Per-proxy failure tracking:**
- Track UA rotation attempts per proxy
- After 3 UA/fingerprint rotations on the same proxy → disable that proxy
- This means: if we rotate UA 3 times and still get 403, the proxy is burned

### 3. Fingerprint Rotation Rules

Each request uses:
- Proxy (IP)
- User-Agent
- sec-ch-ua headers (Client Hints)
- Accept-Language (from proxy location)

On 403:
1. Record failure on current proxy
2. Rotate to new proxy
3. Pick new random fingerprint
4. If same proxy fails 3 times with different fingerprints → disable proxy
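
The 403 steps above could look roughly like the retry loop below. It is a minimal sketch, not the code in `client.ts`: the `Rotator` interface is a hypothetical stand-in that borrows the method names `recordBlock()` and `rotateBoth()` from the implementation notes later in this document, with assumed signatures, and `fetchWith403Rotation` and its parameters are invented for illustration.

```typescript
// Hypothetical minimal interface; signatures are assumptions.
interface Rotator {
  recordBlock(): void;   // count a 403 against the current proxy
  rotateBoth(): boolean; // switch to a new proxy + fingerprint; false if none left
}

interface FetchResult {
  status: number;
  body: string;
}

async function fetchWith403Rotation(
  fetchOnce: () => Promise<FetchResult>, // performs one request with the current identity
  rotator: Rotator,
  maxAttempts: number = 3,
  backoffMs: number = 500,
): Promise<FetchResult> {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const res = await fetchOnce();
    if (res.status !== 403) {
      return res;
    }
    // 1. Record the failure on the proxy that was just blocked.
    rotator.recordBlock();
    // 2-3. Stop using this IP: rotate proxy and fingerprint together.
    if (!rotator.rotateBoth()) {
      throw new Error('No proxies available after 403');
    }
    // Short fixed pause before step 4 (retry).
    await new Promise<void>((resolve) => setTimeout(resolve, backoffMs));
  }
  throw new Error(`Still blocked after ${maxAttempts} attempts`);
}
```

Passing `fetchOnce` as a closure keeps the loop independent of how the request is actually issued, so the same wrapper could serve both GraphQL calls and page fetches.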

### 4. Proxy Table Schema

```sql
CREATE TABLE proxies (
    id SERIAL PRIMARY KEY,
    host VARCHAR(255) NOT NULL,
    port INTEGER NOT NULL,
    username VARCHAR(100),
    password VARCHAR(100),
    protocol VARCHAR(10) DEFAULT 'http',
    active BOOLEAN DEFAULT true,

    -- Location (determines session headers)
    city VARCHAR(100),
    state VARCHAR(50),
    country VARCHAR(100),
    country_code VARCHAR(10),
    timezone VARCHAR(50),

    -- Health tracking
    failure_count INTEGER DEFAULT 0,
    consecutive_403_count INTEGER DEFAULT 0,  -- Track 403s specifically
    last_used_at TIMESTAMPTZ,
    last_failure_at TIMESTAMPTZ,
    last_error TEXT,

    -- Performance
    response_time_ms INTEGER,
    max_connections INTEGER DEFAULT 1
);
```

### 5. Failure Threshold

- **3 consecutive 403s** with different fingerprints → disable proxy
- Reset `consecutive_403_count` to 0 on successful request
- General `failure_count` tracks all errors (timeouts, connection errors, etc.)
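
The threshold rules can be expressed as three small state transitions. The sketch below is an in-memory illustration only: `ProxyHealth` is a hypothetical stand-in for one row of the `proxies` table, and whether a 403 should also increment `failure_count` is not stated in this document, so the sketch leaves it untouched.

```typescript
const MAX_CONSECUTIVE_403 = 3; // threshold from this spec

// Hypothetical in-memory stand-in for one row of the proxies table.
interface ProxyHealth {
  active: boolean;
  failureCount: number;
  consecutive403Count: number;
}

// A 403 bumps the consecutive counter and disables the proxy at the threshold.
// Assumption: failureCount is left alone here; the spec does not say either way.
function markBlocked(p: ProxyHealth): ProxyHealth {
  const consecutive403Count = p.consecutive403Count + 1;
  return {
    ...p,
    consecutive403Count,
    active: p.active && consecutive403Count < MAX_CONSECUTIVE_403,
  };
}

// Any successful request resets the 403 streak.
function markSuccess(p: ProxyHealth): ProxyHealth {
  return { ...p, consecutive403Count: 0 };
}

// Timeouts, connection errors, etc. only touch the general counter.
function markFailure(p: ProxyHealth): ProxyHealth {
  return { ...p, failureCount: p.failureCount + 1 };
}

const fresh: ProxyHealth = { active: true, failureCount: 0, consecutive403Count: 0 };
const burned = markBlocked(markBlocked(markBlocked(fresh)));
console.log(burned.active, burned.consecutive403Count);
```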

---

## Implementation Status

### COMPLETED - December 10, 2025

All code changes have been implemented per this specification:

#### 1. crawl-rotator.ts ✅

- [x] Added `consecutive403Count` to Proxy interface
- [x] Added `markBlocked()` method that increments `consecutive_403_count` and disables proxy at 3
- [x] Added `getProxyTimezone()` to return current proxy's timezone
- [x] `markSuccess()` now resets `consecutive_403_count` to 0
- [x] Replaced hardcoded UA list with `intoli/user-agents` library for realistic fingerprints
- [x] `BrowserFingerprint` interface includes full fingerprint data (UA, platform, screen size, viewport, sec-ch-ua headers)

#### 2. client.ts ✅

- [x] `startSession()` no longer takes state/timezone params
- [x] `startSession()` gets identity from proxy via `crawlRotator.getProxyLocation()`
- [x] Added `handle403Block()` that:
  - Calls `crawlRotator.recordBlock()` (tracks consecutive 403s)
  - Immediately rotates both proxy and fingerprint via `rotateBoth()`
  - Returns false if no more proxies available
- [x] `executeGraphQL()` calls `handle403Block()` on 403 (not `rotateProxyOn403`)
- [x] `fetchPage()` uses same 403 handling
- [x] Fixed 500ms backoff after rotation (not a linearly increasing delay)

#### 3. Task Handlers ✅

- [x] `entry-point-discovery.ts`: `startSession()` called with no params
- [x] `product-refresh.ts`: `startSession()` called with no params

#### 4. Dependencies ✅

- [x] Added `user-agents` npm package for realistic UA generation

---

## Files Changed

| File | Changes |
|------|---------|
| `backend/src/services/crawl-rotator.ts` | Complete rewrite with `consecutive403Count`, `markBlocked()`, `intoli/user-agents` |
| `backend/src/platforms/dutchie/client.ts` | `startSession()` uses proxy location, `handle403Block()` for 403 handling |
| `backend/src/tasks/handlers/entry-point-discovery.ts` | `startSession()` no params |
| `backend/src/tasks/handlers/product-refresh.ts` | `startSession()` no params |
| `backend/package.json` | Added `user-agents` dependency |

---

## Migration Required

The `proxies` table needs `consecutive_403_count` column if not already present:

```sql
ALTER TABLE proxies ADD COLUMN IF NOT EXISTS consecutive_403_count INTEGER DEFAULT 0;
```

---

## Key Behaviors Summary

| Behavior | Implementation |
|----------|----------------|
| Session identity | From proxy location (`getProxyLocation()`) |
| Language | Always `en-US,en;q=0.9` |
| 403 handling | `handle403Block()` → `recordBlock()` → `rotateBoth()` |
| Proxy disable | After 3 consecutive 403s (`consecutive403Count >= 3`) |
| Success reset | `markSuccess()` resets `consecutive403Count` to 0 |
| UA generation | `intoli/user-agents` library (daily updated, realistic fingerprints) |
| Fingerprint data | Full: UA, platform, screen size, viewport, sec-ch-ua headers |

---

## User-Agent Generation

### Data Source

The `user-agents` npm package (GitHub: `intoli/user-agents`) provides daily-updated market share data collected from Intoli's residential proxy network (millions of real users). New versions are released to npm automatically each day.

### Device Category Distribution (hardcoded)

| Category | Share |
|----------|-------|
| Mobile | 62% |
| Desktop | 36% |
| Tablet | 2% |

### Browser Filter (whitelist only)

Only these browsers are allowed:
- Chrome (67%)
- Safari (20%)
- Edge (6%)
- Firefox (3%)

Samsung Internet, Opera, and other niche browsers are filtered out.

### Desktop OS Distribution (from library)

| OS | Share |
|----|-------|
| Windows | 72% |
| macOS | 17% |
| Linux | 4% |

### UA Lifecycle

1. **Session start** (new proxy IP obtained) → Roll device category (62/36/2) → Generate UA filtered to device + top 4 browsers → Store on session
2. **UA sticks** until IP rotates (403 block or manual rotation)
3. **IP rotation** triggers new UA generation

### Failure Handling

- If UA generation fails → Alert admin dashboard, **stop crawl immediately**
- No fallback to static UA list
- This forces investigation rather than silent degradation

### Session Logging

Each session logs:
- Device category (mobile/desktop/tablet)
- Full UA string
- Browser name (Chrome/Safari/Edge/Firefox)
- IP address (from proxy)
- Session start timestamp

Logs are rotated monthly.

### Implementation

Located in `backend/src/services/crawl-rotator.ts`:

```typescript
// Per workflow-12102025.md: Device category distribution
const DEVICE_WEIGHTS = { mobile: 62, desktop: 36, tablet: 2 };

// Per workflow-12102025.md: Browser whitelist
const ALLOWED_BROWSERS = ['Chrome', 'Safari', 'Edge', 'Firefox'];
```
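
For illustration, the "roll device category (62/36/2)" step could be a weighted pick like the one below. This is a sketch under assumptions rather than the code in `crawl-rotator.ts`: `rollDeviceCategory`, `DeviceCategory`, and `CATEGORY_ORDER` are hypothetical names, and the random source is injectable only so the roll can be checked deterministically.

```typescript
type DeviceCategory = 'mobile' | 'desktop' | 'tablet';

// Same weights as the constants above.
const DEVICE_WEIGHTS: Record<DeviceCategory, number> = { mobile: 62, desktop: 36, tablet: 2 };

// Fixed iteration order so the roll does not depend on object key order.
const CATEGORY_ORDER: DeviceCategory[] = ['mobile', 'desktop', 'tablet'];

// Weighted random pick; rng must return a number in [0, 1).
function rollDeviceCategory(rng: () => number = Math.random): DeviceCategory {
  let total = 0;
  for (const category of CATEGORY_ORDER) {
    total += DEVICE_WEIGHTS[category];
  }
  let roll = rng() * total;
  for (const category of CATEGORY_ORDER) {
    if (roll < DEVICE_WEIGHTS[category]) {
      return category;
    }
    roll -= DEVICE_WEIGHTS[category];
  }
  // Guard against floating-point edge cases at the top of the range.
  return CATEGORY_ORDER[CATEGORY_ORDER.length - 1];
}

console.log(rollDeviceCategory());
```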

---

## HTTP Fingerprinting

### Goal

Make HTTP requests indistinguishable from real browser traffic. No repeatable footprint.

### Components

1. **Full Header Set** - All headers a real browser sends
2. **Header Ordering** - Browser-specific order (Chrome vs Firefox vs Safari)
3. **TLS Fingerprint** - Use `curl-impersonate` to match browser TLS signature
4. **Dynamic Referer** - Set per dispensary being crawled
5. **Natural Randomization** - Vary optional headers like real users

### Required Headers

| Header | Chrome | Firefox | Safari | Notes |
|--------|--------|---------|--------|-------|
| `User-Agent` | ✅ | ✅ | ✅ | From UA generation |
| `Accept` | ✅ | ✅ | ✅ | Content types |
| `Accept-Language` | ✅ | ✅ | ✅ | Always `en-US,en;q=0.9` |
| `Accept-Encoding` | ✅ | ✅ | ✅ | `gzip, deflate, br` |
| `Connection` | ✅ | ✅ | ✅ | `keep-alive` |
| `Origin` | ✅ | ✅ | ✅ | `https://dutchie.com` (POST only) |
| `Referer` | ✅ | ✅ | ✅ | Dynamic per dispensary |
| `sec-ch-ua` | ✅ | ❌ | ❌ | Chromium only |
| `sec-ch-ua-mobile` | ✅ | ❌ | ❌ | Chromium only |
| `sec-ch-ua-platform` | ✅ | ❌ | ❌ | Chromium only |
| `sec-fetch-dest` | ✅ | ✅ | ❌ | `empty` for XHR |
| `sec-fetch-mode` | ✅ | ✅ | ❌ | `cors` for XHR |
| `sec-fetch-site` | ✅ | ✅ | ❌ | `same-origin` |
| `Upgrade-Insecure-Requests` | ✅ | ✅ | ✅ | `1` (page loads only) |
| `DNT` | ~30% | ~30% | ~30% | Randomized per session |

### Header Ordering

Each browser sends headers in a specific order. Fingerprinting services detect mismatches.

**Chrome order (GraphQL request):**
1. Host
2. Connection
3. Content-Length (POST)
4. sec-ch-ua
5. DNT (if enabled)
6. sec-ch-ua-mobile
7. User-Agent
8. sec-ch-ua-platform
9. Content-Type (POST)
10. Accept
11. Origin (POST)
12. sec-fetch-site
13. sec-fetch-mode
14. sec-fetch-dest
15. Referer
16. Accept-Encoding
17. Accept-Language

**Firefox order (GraphQL request):**
1. Host
2. User-Agent
3. Accept
4. Accept-Language
5. Accept-Encoding
6. Content-Type (POST)
7. Content-Length (POST)
8. Origin (POST)
9. DNT (if enabled)
10. Connection
11. Referer
12. sec-fetch-dest
13. sec-fetch-mode
14. sec-fetch-site

**Safari order (GraphQL request):**
1. Host
2. Connection
3. Content-Length (POST)
4. Accept
5. User-Agent
6. Content-Type (POST)
7. Origin (POST)
8. Referer
9. Accept-Encoding
10. Accept-Language
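
One way to apply the orderings above is to keep each list as data and sort an unordered header map against it. The sketch below assumes hypothetical names (`HEADER_ORDER`, `orderHeaders`) and is not the contents of `http-fingerprint.ts`; appending unlisted headers at the end is also an assumption of this sketch.

```typescript
type Browser = 'chrome' | 'firefox' | 'safari';

// Orderings copied from the lists above (GraphQL request).
const HEADER_ORDER: Record<Browser, string[]> = {
  chrome: [
    'Host', 'Connection', 'Content-Length', 'sec-ch-ua', 'DNT', 'sec-ch-ua-mobile',
    'User-Agent', 'sec-ch-ua-platform', 'Content-Type', 'Accept', 'Origin',
    'sec-fetch-site', 'sec-fetch-mode', 'sec-fetch-dest', 'Referer',
    'Accept-Encoding', 'Accept-Language',
  ],
  firefox: [
    'Host', 'User-Agent', 'Accept', 'Accept-Language', 'Accept-Encoding',
    'Content-Type', 'Content-Length', 'Origin', 'DNT', 'Connection', 'Referer',
    'sec-fetch-dest', 'sec-fetch-mode', 'sec-fetch-site',
  ],
  safari: [
    'Host', 'Connection', 'Content-Length', 'Accept', 'User-Agent', 'Content-Type',
    'Origin', 'Referer', 'Accept-Encoding', 'Accept-Language',
  ],
};

// Returns [name, value] pairs in the browser's order. Headers that are not
// set (e.g. DNT when disabled) are skipped; headers missing from the list
// are appended at the end in the caller's order (an assumption of this sketch).
function orderHeaders(browser: Browser, headers: Record<string, string>): Array<[string, string]> {
  const remaining: Record<string, string> = {};
  for (const name of Object.keys(headers)) {
    remaining[name.toLowerCase()] = headers[name];
  }
  const ordered: Array<[string, string]> = [];
  for (const name of HEADER_ORDER[browser]) {
    const key = name.toLowerCase();
    if (key in remaining) {
      ordered.push([name, remaining[key]]);
      delete remaining[key];
    }
  }
  for (const key of Object.keys(remaining)) {
    ordered.push([key, remaining[key]]);
  }
  return ordered;
}

const example = orderHeaders('firefox', {
  Referer: 'https://dutchie.com/dispensary/harvest-of-tempe',
  'User-Agent': 'example-ua',
  Accept: '*/*',
  Host: 'dutchie.com',
});
console.log(example.map(([name]) => name).join(', '));
```

Returning an array of pairs rather than an object keeps the order explicit, which matters when the pairs are later turned into repeated `-H` arguments for a curl-style client.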

### TLS Fingerprinting

Use `curl-impersonate` instead of standard curl:
- `curl_chrome131` - Mimics Chrome 131 TLS handshake
- `curl_ff133` - Mimics Firefox 133 TLS handshake
- `curl_safari17` - Mimics Safari 17 TLS handshake

Match TLS binary to browser in UA.
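
Matching the binary to the UA might be done with a lookup such as the one below. This is a hedged sketch: the binary names are copied from the list above and the exact names depend on the installed `curl-impersonate` build; `browserFromUserAgent` and `impersonateBinaryFor` are hypothetical helpers; and reusing the Chrome binary for Edge is an assumption based on Edge being Chromium-based.

```typescript
type BrowserName = 'Chrome' | 'Safari' | 'Edge' | 'Firefox';

// Binary names as listed in this document; verify against the installed build.
const IMPERSONATE_BINARY: Record<BrowserName, string> = {
  Chrome: 'curl_chrome131',
  Edge: 'curl_chrome131', // assumption: Edge is Chromium-based, reuse the Chrome profile
  Firefox: 'curl_ff133',
  Safari: 'curl_safari17',
};

// Rough UA sniffing. Order matters: Edge UAs also contain "Chrome/" and
// "Safari/", and Chrome UAs also contain "Safari/".
function browserFromUserAgent(ua: string): BrowserName {
  if (ua.indexOf('Edg/') !== -1) return 'Edge';
  if (ua.indexOf('Firefox/') !== -1) return 'Firefox';
  if (ua.indexOf('Chrome/') !== -1) return 'Chrome';
  if (ua.indexOf('Safari/') !== -1) return 'Safari';
  throw new Error(`Browser not in whitelist for UA: ${ua}`);
}

function impersonateBinaryFor(ua: string): string {
  return IMPERSONATE_BINARY[browserFromUserAgent(ua)];
}

const chromeUa =
  'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36';
console.log(impersonateBinaryFor(chromeUa));
```

Throwing on an unrecognized UA follows the same "stop rather than degrade silently" stance this document takes for UA generation failures.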

### Dynamic Referer

Set Referer to the dispensary's actual page URL:

```
Crawling "harvest-of-tempe" → Referer: https://dutchie.com/dispensary/harvest-of-tempe
Crawling "zen-leaf-mesa"    → Referer: https://dutchie.com/dispensary/zen-leaf-mesa
```

Derived from dispensary's `menu_url` field.
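
As a sketch of that derivation, the helper below pulls the slug out of a `menu_url` and rebuilds the Referer. It assumes `menu_url` contains a `/dispensary/<slug>` path segment, which this document does not confirm for every store; `refererFromMenuUrl` is a hypothetical name.

```typescript
// Assumption: menu_url looks like https://dutchie.com/dispensary/<slug>[/...].
// Other URL shapes are rejected instead of guessed at.
function refererFromMenuUrl(menuUrl: string): string {
  const match = /\/dispensary\/([^/?#]+)/.exec(menuUrl);
  if (!match) {
    throw new Error(`Cannot derive dispensary slug from menu_url: ${menuUrl}`);
  }
  return `https://dutchie.com/dispensary/${match[1]}`;
}

console.log(refererFromMenuUrl('https://dutchie.com/dispensary/harvest-of-tempe/products'));
```

Failing loudly on an unexpected URL avoids sending a Referer that does not match the store being crawled.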

### Natural Randomization

Per-session randomization (set once when session starts, consistent for session):

| Feature | Distribution | Implementation |
|---------|--------------|----------------|
| DNT header | 30% have it | `Math.random() < 0.30` |
| Accept quality values | Slight variation | `q=0.9` vs `q=0.8` |
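
The "set once, consistent for session" rule for the DNT row can be sketched by rolling the trait at session start and reusing the stored result for every request. `SessionTraits`, `rollSessionTraits`, and `applySessionTraits` are hypothetical names, and the sketch covers only the DNT row of the table.

```typescript
interface SessionTraits {
  sendDnt: boolean; // decided once per session, never re-rolled per request
}

// rng is injectable so the 30% roll can be tested deterministically.
function rollSessionTraits(rng: () => number = Math.random): SessionTraits {
  return { sendDnt: rng() < 0.30 };
}

// Returns a new header map; the input is not mutated.
function applySessionTraits(
  headers: Record<string, string>,
  traits: SessionTraits,
): Record<string, string> {
  return traits.sendDnt ? { ...headers, DNT: '1' } : { ...headers };
}

// Roll once at session start, then reuse the same traits for each request.
const traits = rollSessionTraits();
const first = applySessionTraits({ Accept: '*/*' }, traits);
const second = applySessionTraits({ Accept: '*/*' }, traits);
console.log('DNT' in first, 'DNT' in second);
```

Both printed values always agree, since the decision lives on the session rather than being re-rolled per request.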

### Implementation Files

| File | Purpose |
|------|---------|
| `src/services/crawl-rotator.ts` | `BrowserFingerprint` includes full header config |
| `src/platforms/dutchie/client.ts` | Build headers from fingerprint, use curl-impersonate |
| `src/services/http-fingerprint.ts` | Header ordering per browser (NEW) |