feat: Auto-healing entry_point_discovery with browser-first transport

- Rewrote entry_point_discovery with auto-healing scheme:
  1. Check dutchie_discovery_locations for existing platform_location_id
  2. Browser-based GraphQL with 5x network retries
  3. Mark as needs_investigation on hard failure
- Browser (Puppeteer) is now the DEFAULT transport; curl is used only when explicitly requested
- Added migration 091 for tracking columns:
  - last_store_discovery_at: when store_discovery last updated the record
  - last_payload_at: when the last product payload was saved
- Updated CODEBASE_MAP.md with transport rules documentation

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Kelly
2025-12-12 22:55:21 -07:00
parent 97bfdb9618
commit 55b26e9153
3 changed files with 468 additions and 111 deletions

@@ -99,10 +99,60 @@ src/scraper-v2/*.ts # Entire directory deprecated
|------|---------|--------|
| `src/tasks/handlers/payload-fetch.ts` | Fetch products from Dutchie | **PRIMARY** |
| `src/tasks/handlers/product-refresh.ts` | Process payload into DB | **PRIMARY** |
| `src/tasks/handlers/entry-point-discovery.ts` | Resolve platform IDs (auto-healing) | **PRIMARY** |
| `src/tasks/handlers/menu-detection.ts` | Detect menu type | ACTIVE |
| `src/tasks/handlers/id-resolution.ts` | Resolve platform IDs | ACTIVE |
| `src/tasks/handlers/id-resolution.ts` | Resolve platform IDs (legacy) | LEGACY |
| `src/tasks/handlers/image-download.ts` | Download product images | ACTIVE |
---
## Transport Rules (CRITICAL)
**Browser-based (Puppeteer) is the DEFAULT transport. curl is ONLY allowed when explicitly specified.**
### Transport Selection
| `task.method` | Transport Used | Notes |
|---------------|----------------|-------|
| `null` | Browser (Puppeteer) | DEFAULT - use this for most tasks |
| `'http'` | Browser (Puppeteer) | Explicit browser request |
| `'curl'` | curl-impersonate | ONLY when explicitly needed |
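
The selection table above can be read as a single rule: anything other than an explicit `'curl'` resolves to the browser. A minimal sketch of that rule follows; the `TaskMethod` and `Transport` types and the `selectTransport` function are hypothetical names used for illustration, not identifiers confirmed by the codebase.

```typescript
// Hypothetical types for illustration; the real task shape may differ.
type TaskMethod = 'http' | 'curl' | null | undefined;
type Transport = 'browser' | 'curl';

// Browser (Puppeteer) is the default. curl is chosen only when the task
// explicitly sets method = 'curl'; null and 'http' both map to the browser.
function selectTransport(method: TaskMethod): Transport {
  return method === 'curl' ? 'curl' : 'browser';
}
```

Keeping the rule as an explicit opt-in for curl means a task with a missing or unexpected `method` falls back to the browser rather than to curl.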
### Why Browser-First?
1. **Anti-detection**: Puppeteer with StealthPlugin reduces the chance of bot detection
2. **Session cookies**: Browser maintains session state automatically
3. **Fingerprinting**: Real browser fingerprint (TLS, headers, etc.)
4. **Age gates**: Browser can click through age verification
### Entry Point Discovery Auto-Healing
The `entry_point_discovery` handler uses a healing strategy:
```
1. FIRST: Check dutchie_discovery_locations for existing platform_location_id
- By linked dutchie_discovery_id
- By slug match in discovery data
→ If found, NO network call needed
2. SECOND: Browser-based GraphQL (Puppeteer)
- 5x retries for network/proxy failures
- On HTTP 403: rotate proxy and retry
- On HTTP 404 after 2 attempts: mark as 'removed'
3. HARD FAILURE: After exhausting options → 'needs_investigation'
```
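
The healing strategy above could be sketched roughly as follows. This is not the handler's actual code: every name here (`HealingDeps`, `discoverEntryPoint`, the outcome shapes) is hypothetical, the database lookup and browser fetch are injected as plain functions, and two readings of the steps are assumed: that "5x retries" means five attempts in total shared across all failure types, and that "404 after 2 attempts" means two 404 responses.

```typescript
// All names below are hypothetical and exist only for this sketch.
type DiscoveryOutcome =
  | { status: 'resolved'; platformLocationId: string; source: 'discovery_table' | 'graphql' }
  | { status: 'removed' }
  | { status: 'needs_investigation'; reason: string };

interface FetchResult {
  httpStatus: number;
  platformLocationId?: string;
}

interface HealingDeps {
  // Step 1: look up an existing platform_location_id (no network call).
  lookupExisting: () => Promise<string | null>;
  // Step 2: browser-based GraphQL call; throws on network/proxy failure.
  fetchViaBrowser: () => Promise<FetchResult>;
  rotateProxy: () => Promise<void>;
}

const MAX_ATTEMPTS = 5; // assumption: 5 attempts in total
const MAX_404S = 2;     // assumption: two 404 responses mean 'removed'

async function discoverEntryPoint(deps: HealingDeps): Promise<DiscoveryOutcome> {
  // FIRST: reuse an ID that discovery already recorded.
  const existing = await deps.lookupExisting();
  if (existing) {
    return { status: 'resolved', platformLocationId: existing, source: 'discovery_table' };
  }

  // SECOND: browser-based GraphQL with bounded retries.
  let notFoundCount = 0;
  for (let attempt = 1; attempt <= MAX_ATTEMPTS; attempt++) {
    let result: FetchResult;
    try {
      result = await deps.fetchViaBrowser();
    } catch {
      continue; // network/proxy failure: try again
    }
    if (result.httpStatus === 200 && result.platformLocationId) {
      return { status: 'resolved', platformLocationId: result.platformLocationId, source: 'graphql' };
    }
    if (result.httpStatus === 403) {
      await deps.rotateProxy(); // blocked: switch proxy before the next attempt
      continue;
    }
    if (result.httpStatus === 404 && ++notFoundCount >= MAX_404S) {
      return { status: 'removed' };
    }
  }

  // HARD FAILURE: nothing worked within the attempt budget.
  return { status: 'needs_investigation', reason: `no platform ID after ${MAX_ATTEMPTS} attempts` };
}
```

Injecting the lookup and fetch functions keeps the control flow testable without a database or a browser; a real handler would presumably wire these to the `dutchie_discovery_locations` query and the Puppeteer transport.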
### DO NOT Use curl Unless:
- Task explicitly has `method = 'curl'`
- You're testing curl-impersonate binaries
- The API explicitly requires curl fingerprinting
### Files
| File | Transport | Purpose |
|------|-----------|---------|
| `src/services/puppeteer-preflight.ts` | Browser | Preflight check |
| `src/services/curl-preflight.ts` | curl | Preflight check |
| `src/tasks/handlers/entry-point-discovery.ts` | Browser | Platform ID resolution |
| `src/tasks/handlers/payload-fetch.ts` | Both | Product fetching |
### Database
| File | Purpose | Status |
|------|---------|--------|