Compare commits

..

16 Commits

Author SHA1 Message Date
Kelly
6bcadd9e71 fix(preflight): Correct parameter order and add IP/fingerprint reporting
- Fix update_worker_preflight call to use correct parameter order:
  (worker_id, transport, status, ip, response_ms, error, fingerprint)
- Add proxyIp to both curl and http preflight reports
- Add fingerprint JSONB with timezone, location, and bot detection data
- Log HTTP IP and timezone after preflight completes

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-12 00:32:45 -07:00
kelly
a77bf8611a Merge pull request 'feat(workers): Preflight phase 1 - Schema, StatefulSet, and timezone matching' (#55) from feat/preflight-phase1-schema into master 2025-12-12 07:30:53 +00:00
Kelly
33feca3138 fix(antidetect): Match browser timezone to proxy IP location
- Add IP geolocation lookup via ip-api.com to get timezone from proxy IP
- Use ipify.org API for reliable proxy IP detection (replaces unreliable fingerprint.com scraping)
- Set browser timezone via CDP Emulation.setTimezoneOverride to match proxy location
- Add detectedTimezone and detectedLocation to preflight result
- Add /api/worker-registry/preflight-test endpoint for smoke testing

Fixes timezone mismatch where browser showed America/Phoenix while proxy was in America/New_York

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-12 00:25:39 -07:00
kelly
7d85a97b63 Merge pull request 'feat: Preflight schema and StatefulSet' (#54) from feat/preflight-phase1-schema into master
Reviewed-on: https://code.cannabrands.app/Creationshop/dispensary-scraper/pulls/54
2025-12-12 07:14:40 +00:00
Kelly
ce081effd4 feat(workers): Add preflight schema and StatefulSet
- Migration 085: Add curl_ip, http_ip, fingerprint_data, preflight_status,
  preflight_at columns to worker_registry
- StatefulSet manifest for 8 persistent workers with OnDelete update strategy

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-11 23:45:04 -07:00
kelly
2ed088b4d8 Merge pull request 'feat(api): Add preflight columns to worker registry API response' (#50) from feat/preflight-api-fields into master 2025-12-12 06:22:06 +00:00
Kelly
d3c49fa246 feat(api): Add preflight columns to worker registry API response
Exposes curl_ip, http_ip, preflight_status, preflight_at, and fingerprint_data
in the /api/worker-registry/workers response.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-11 23:12:55 -07:00
kelly
52cb5014fd Merge pull request 'feat: Dual-transport preflight system for worker fingerprinting' (#49) from fix/ci-and-preflight-enforcement into master 2025-12-12 06:08:13 +00:00
Kelly
50654be910 fix: Restore hydration and product_refresh for store updates
- Moved hydration module back from _deprecated (needed for product_refresh)
- Restored product_refresh handler for processing stored payloads
- Restored geolocation service for findadispo/findagram
- Stubbed system routes that depend on deprecated SyncOrchestrator
- Removed crawler-sandbox route (deprecated)
- Fixed all TypeScript compilation errors

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-11 23:03:39 -07:00
Kelly
cdab71a1ee feat(workers): Add dual-transport preflight system
Workers now run both curl and http (Puppeteer) preflights on startup:
- curl-preflight.ts: Tests axios + proxy via httpbin.org
- puppeteer-preflight.ts: Tests browser + StealthPlugin via fingerprint.com
  (with amiunique.org fallback)
- Migration 084: Adds preflight columns to worker_registry and method
  column to worker_tasks
- Workers report preflight status, IP, fingerprint, and response time
- Tasks can require specific transport method (curl/http)
- Dashboard shows Transport column with preflight status badges

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-11 22:47:52 -07:00
Kelly
a35976b9e9 chore: Clean up deprecated code and docs
- Move deprecated directories to src/_deprecated/:
  - hydration/ (old pipeline approach)
  - scraper-v2/ (old Puppeteer scraper)
  - canonical-hydration/ (merged into tasks)
  - Unused services: availability, crawler-logger, geolocation, etc
  - Unused utils: age-gate-playwright, HomepageValidator, stealthBrowser

- Archive outdated docs to docs/_archive/:
  - ANALYTICS_RUNBOOK.md
  - ANALYTICS_V2_EXAMPLES.md
  - BRAND_INTELLIGENCE_API.md
  - CRAWL_PIPELINE.md
  - TASK_WORKFLOW_2024-12-10.md
  - WORKER_TASK_ARCHITECTURE.md
  - ORGANIC_SCRAPING_GUIDE.md

- Add docs/CODEBASE_MAP.md as single source of truth
- Add warning files to deprecated/archived directories
- Slim down CLAUDE.md to essential rules only

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-11 22:17:40 -07:00
kelly
c68210c485 Merge pull request 'fix(ci): Remove buildx cache and add preflight enforcement' (#48) from fix/ci-and-preflight-enforcement into master
Reviewed-on: https://code.cannabrands.app/Creationshop/dispensary-scraper/pulls/48
2025-12-12 04:53:20 +00:00
Kelly
f2864bd2ad fix(ci): Remove buildx cache and add preflight enforcement
- Remove cache_from/cache_to from CI (plugin bug splitting commas)
- Add preflight() method to CrawlRotator - tests proxy + anti-detect
- Add pre-task preflight check - workers MUST pass before executing
- Add releaseTask() to release tasks back to pending on preflight fail
- Rename proxy_test task to whoami for clarity

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-11 21:37:22 -07:00
kelly
eca9e85242 Merge pull request 'fix(ci): Fix buildx cache syntax and add proxy_test task' (#47) from feat/ui-polish-and-ci-caching into master
Reviewed-on: https://code.cannabrands.app/Creationshop/dispensary-scraper/pulls/47
2025-12-12 04:30:21 +00:00
kelly
e1c67dcee5 Merge pull request 'feat: UI polish and CI caching improvements' (#46) from feat/ui-polish-and-ci-caching into master
Reviewed-on: https://code.cannabrands.app/Creationshop/dispensary-scraper/pulls/46
2025-12-12 03:47:46 +00:00
kelly
34c8a8cc67 Merge pull request 'feat(cannaiq): Add clickable logo, favicon, and remove state selector' (#45) from feat/cannaiq-ui-polish into master
Reviewed-on: https://code.cannabrands.app/Creationshop/dispensary-scraper/pulls/45
2025-12-12 03:36:04 +00:00
73 changed files with 3180 additions and 1905 deletions

View File

@@ -69,6 +69,7 @@ steps:
# =========================================== # ===========================================
# MASTER DEPLOY: Parallel Docker builds # MASTER DEPLOY: Parallel Docker builds
# NOTE: cache_from/cache_to removed due to plugin bug splitting on commas
# =========================================== # ===========================================
docker-backend: docker-backend:
image: woodpeckerci/plugin-docker-buildx image: woodpeckerci/plugin-docker-buildx
@@ -86,10 +87,6 @@ steps:
from_secret: registry_password from_secret: registry_password
platforms: linux/amd64 platforms: linux/amd64
provenance: false provenance: false
cache_from:
- "type=registry,ref=code.cannabrands.app/creationshop/dispensary-scraper:cache"
cache_to:
- "type=registry,ref=code.cannabrands.app/creationshop/dispensary-scraper:cache,mode=max"
build_args: build_args:
APP_BUILD_VERSION: ${CI_COMMIT_SHA:0:8} APP_BUILD_VERSION: ${CI_COMMIT_SHA:0:8}
APP_GIT_SHA: ${CI_COMMIT_SHA} APP_GIT_SHA: ${CI_COMMIT_SHA}
@@ -116,10 +113,6 @@ steps:
from_secret: registry_password from_secret: registry_password
platforms: linux/amd64 platforms: linux/amd64
provenance: false provenance: false
cache_from:
- "type=registry,ref=code.cannabrands.app/creationshop/cannaiq-frontend:cache"
cache_to:
- "type=registry,ref=code.cannabrands.app/creationshop/cannaiq-frontend:cache,mode=max"
depends_on: [] depends_on: []
when: when:
branch: master branch: master
@@ -141,10 +134,6 @@ steps:
from_secret: registry_password from_secret: registry_password
platforms: linux/amd64 platforms: linux/amd64
provenance: false provenance: false
cache_from:
- "type=registry,ref=code.cannabrands.app/creationshop/findadispo-frontend:cache"
cache_to:
- "type=registry,ref=code.cannabrands.app/creationshop/findadispo-frontend:cache,mode=max"
depends_on: [] depends_on: []
when: when:
branch: master branch: master
@@ -166,10 +155,6 @@ steps:
from_secret: registry_password from_secret: registry_password
platforms: linux/amd64 platforms: linux/amd64
provenance: false provenance: false
cache_from:
- "type=registry,ref=code.cannabrands.app/creationshop/findagram-frontend:cache"
cache_to:
- "type=registry,ref=code.cannabrands.app/creationshop/findagram-frontend:cache,mode=max"
depends_on: [] depends_on: []
when: when:
branch: master branch: master

1423
CLAUDE.md

File diff suppressed because it is too large Load Diff

View File

@@ -0,0 +1,218 @@
# CannaiQ Backend Codebase Map
**Last Updated:** 2025-12-12
**Purpose:** Help Claude and developers understand which code is current vs deprecated
---
## Quick Reference: What to Use
### For Crawling/Scraping
| Task | Use This | NOT This |
|------|----------|----------|
| Fetch products | `src/tasks/handlers/payload-fetch.ts` | `src/hydration/*` |
| Process products | `src/tasks/handlers/product-refresh.ts` | `src/scraper-v2/*` |
| GraphQL client | `src/platforms/dutchie/client.ts` | `src/dutchie-az/services/graphql-client.ts` |
| Worker system | `src/tasks/task-worker.ts` | `src/dutchie-az/services/worker.ts` |
### For Database
| Task | Use This | NOT This |
|------|----------|----------|
| Get DB pool | `src/db/pool.ts` | `src/dutchie-az/db/connection.ts` |
| Run migrations | `src/db/migrate.ts` (CLI only) | Never import at runtime |
| Query products | `store_products` table | `products`, `dutchie_products` |
| Query stores | `dispensaries` table | `stores` table |
### For Discovery
| Task | Use This |
|------|----------|
| Discover stores | `src/discovery/*.ts` |
| Run discovery | `npx tsx src/scripts/run-discovery.ts` |
---
## Directory Status
### ACTIVE DIRECTORIES (Use These)
```
src/
├── auth/ # JWT/session auth, middleware
├── db/ # Database pool, migrations
├── discovery/ # Dutchie store discovery pipeline
├── middleware/ # Express middleware
├── multi-state/ # Multi-state query support
├── platforms/ # Platform-specific clients (Dutchie, Jane, etc)
│ └── dutchie/ # THE Dutchie client - use this one
├── routes/ # Express API routes
├── services/ # Core services (logger, scheduler, etc)
├── tasks/ # Task system (workers, handlers, scheduler)
│ └── handlers/ # Task handlers (payload_fetch, product_refresh, etc)
├── types/ # TypeScript types
└── utils/ # Utilities (storage, image processing)
```
### DEPRECATED DIRECTORIES (DO NOT USE)
```
src/
├── hydration/ # DEPRECATED - Old pipeline approach
├── scraper-v2/ # DEPRECATED - Old scraper engine
├── canonical-hydration/# DEPRECATED - Merged into tasks/handlers
├── dutchie-az/ # PARTIAL - Some parts deprecated, some active
│ ├── db/ # DEPRECATED - Use src/db/pool.ts
│ └── services/ # PARTIAL - worker.ts still runs, graphql-client.ts deprecated
├── portals/ # FUTURE - Not yet implemented
├── seo/ # PARTIAL - Settings work, templates WIP
└── system/ # DEPRECATED - Old orchestration system
```
### DEPRECATED FILES (DO NOT USE)
```
src/dutchie-az/db/connection.ts # Use src/db/pool.ts instead
src/dutchie-az/services/graphql-client.ts # Use src/platforms/dutchie/client.ts
src/hydration/*.ts # Entire directory deprecated
src/scraper-v2/*.ts # Entire directory deprecated
```
---
## Key Files Reference
### Entry Points
| File | Purpose | Status |
|------|---------|--------|
| `src/index.ts` | Main Express server | ACTIVE |
| `src/dutchie-az/services/worker.ts` | Worker process entry | ACTIVE |
| `src/tasks/task-worker.ts` | Task worker (new system) | ACTIVE |
### Dutchie Integration
| File | Purpose | Status |
|------|---------|--------|
| `src/platforms/dutchie/client.ts` | GraphQL client, hashes, curl | **PRIMARY** |
| `src/platforms/dutchie/queries.ts` | High-level query functions | ACTIVE |
| `src/platforms/dutchie/index.ts` | Re-exports | ACTIVE |
### Task Handlers
| File | Purpose | Status |
|------|---------|--------|
| `src/tasks/handlers/payload-fetch.ts` | Fetch products from Dutchie | **PRIMARY** |
| `src/tasks/handlers/product-refresh.ts` | Process payload into DB | **PRIMARY** |
| `src/tasks/handlers/menu-detection.ts` | Detect menu type | ACTIVE |
| `src/tasks/handlers/id-resolution.ts` | Resolve platform IDs | ACTIVE |
| `src/tasks/handlers/image-download.ts` | Download product images | ACTIVE |
### Database
| File | Purpose | Status |
|------|---------|--------|
| `src/db/pool.ts` | Canonical DB pool | **PRIMARY** |
| `src/db/migrate.ts` | Migration runner (CLI only) | CLI ONLY |
| `src/db/auto-migrate.ts` | Auto-run migrations on startup | ACTIVE |
### Configuration
| File | Purpose | Status |
|------|---------|--------|
| `.env` | Environment variables | ACTIVE |
| `package.json` | Dependencies | ACTIVE |
| `tsconfig.json` | TypeScript config | ACTIVE |
---
## GraphQL Hashes (CRITICAL)
The correct hashes are in `src/platforms/dutchie/client.ts`:
```typescript
export const GRAPHQL_HASHES = {
FilteredProducts: 'ee29c060826dc41c527e470e9ae502c9b2c169720faa0a9f5d25e1b9a530a4a0',
GetAddressBasedDispensaryData: '13461f73abf7268770dfd05fe7e10c523084b2bb916a929c08efe3d87531977b',
ConsumerDispensaries: '0a5bfa6ca1d64ae47bcccb7c8077c87147cbc4e6982c17ceec97a2a4948b311b',
GetAllCitiesByState: 'ae547a0466ace5a48f91e55bf6699eacd87e3a42841560f0c0eabed5a0a920e6',
};
```
**ALWAYS** use `Status: 'Active'` for FilteredProducts (not `null` or `'All'`).
---
## Scripts Reference
### Useful Scripts (in `src/scripts/`)
| Script | Purpose |
|--------|---------|
| `run-discovery.ts` | Run Dutchie discovery |
| `crawl-single-store.ts` | Test crawl a single store |
| `test-dutchie-graphql.ts` | Test GraphQL queries |
### One-Off Scripts (probably don't need)
| Script | Purpose |
|--------|---------|
| `harmonize-az-dispensaries.ts` | One-time data cleanup |
| `bootstrap-stores-for-dispensaries.ts` | One-time migration |
| `backfill-*.ts` | Historical backfill scripts |
---
## API Routes
### Active Routes (in `src/routes/`)
| Route File | Mount Point | Purpose |
|------------|-------------|---------|
| `auth.ts` | `/api/auth` | Login/logout/session |
| `stores.ts` | `/api/stores` | Store CRUD |
| `dashboard.ts` | `/api/dashboard` | Dashboard stats |
| `workers.ts` | `/api/workers` | Worker monitoring |
| `pipeline.ts` | `/api/pipeline` | Crawl triggers |
| `discovery.ts` | `/api/discovery` | Discovery management |
| `analytics.ts` | `/api/analytics` | Analytics queries |
| `wordpress.ts` | `/api/v1/wordpress` | WordPress plugin API |
---
## Documentation Files
### Current Docs (in `backend/docs/`)
| Doc | Purpose | Currency |
|-----|---------|----------|
| `TASK_WORKFLOW_2024-12-10.md` | Task system architecture | CURRENT |
| `WORKER_TASK_ARCHITECTURE.md` | Worker/task design | CURRENT |
| `CRAWL_PIPELINE.md` | Crawl pipeline overview | CURRENT |
| `ORGANIC_SCRAPING_GUIDE.md` | Browser-based scraping | CURRENT |
| `CODEBASE_MAP.md` | This file | CURRENT |
| `ANALYTICS_V2_EXAMPLES.md` | Analytics API examples | CURRENT |
| `BRAND_INTELLIGENCE_API.md` | Brand API docs | CURRENT |
### Root Docs
| Doc | Purpose | Currency |
|-----|---------|----------|
| `CLAUDE.md` | Claude instructions | **PRIMARY** |
| `README.md` | Project overview | NEEDS UPDATE |
---
## Common Mistakes to Avoid
1. **Don't use `src/hydration/`** - It's an old approach that was superseded by the task system
2. **Don't use `src/dutchie-az/db/connection.ts`** - Use `src/db/pool.ts` instead
3. **Don't import `src/db/migrate.ts` at runtime** - It will crash. Only use for CLI migrations.
4. **Don't query `stores` table** - It's empty. Use `dispensaries`.
5. **Don't query `products` table** - It's empty. Use `store_products`.
6. **Don't use wrong GraphQL hash** - Always get hash from `GRAPHQL_HASHES` in client.ts
7. **Don't use `Status: null`** - It returns 0 products. Use `Status: 'Active'`.
---
## When in Doubt
1. Check if the file is imported in `src/index.ts` - if not, it may be deprecated
2. Check the last modified date - older files may be stale
3. Look for `DEPRECATED` comments in the code
4. Ask: "Is there a newer version of this in `src/tasks/` or `src/platforms/`?"
5. Read the relevant doc in `docs/` before modifying code

View File

@@ -0,0 +1,297 @@
# Organic Browser-Based Scraping Guide
**Last Updated:** 2025-12-12
**Status:** Production-ready proof of concept
---
## Overview
This document describes the "organic" browser-based approach to scraping Dutchie dispensary menus. Unlike direct curl/axios requests, this method uses a real browser session to make API calls, making requests appear natural and reducing detection risk.
---
## Why Organic Scraping?
| Approach | Detection Risk | Speed | Complexity |
|----------|---------------|-------|------------|
| Direct curl | Higher | Fast | Low |
| curl-impersonate | Medium | Fast | Medium |
| **Browser-based (organic)** | **Lowest** | Slower | Higher |
Direct curl requests can be fingerprinted via:
- TLS fingerprint (cipher suites, extensions)
- Header order and values
- Missing cookies/session data
- Request patterns
Browser-based requests inherit:
- Real Chrome TLS fingerprint
- Session cookies from page visit
- Natural header order
- JavaScript execution environment
---
## Implementation
### Dependencies
```bash
npm install puppeteer puppeteer-extra puppeteer-extra-plugin-stealth
```
### Core Script: `test-intercept.js`
Located at: `backend/test-intercept.js`
```javascript
const puppeteer = require('puppeteer-extra');
const StealthPlugin = require('puppeteer-extra-plugin-stealth');
const fs = require('fs');
puppeteer.use(StealthPlugin());
async function capturePayload(config) {
const { dispensaryId, platformId, cName, outputPath } = config;
const browser = await puppeteer.launch({
headless: 'new',
args: ['--no-sandbox', '--disable-setuid-sandbox']
});
const page = await browser.newPage();
// STEP 1: Establish session by visiting the menu
const embedUrl = `https://dutchie.com/embedded-menu/${cName}?menuType=rec`;
await page.goto(embedUrl, { waitUntil: 'networkidle2', timeout: 60000 });
// STEP 2: Fetch ALL products using GraphQL from browser context
const result = await page.evaluate(async (platformId) => {
const allProducts = [];
let pageNum = 0;
const perPage = 100;
let totalCount = 0;
const sessionId = 'browser-session-' + Date.now();
while (pageNum < 30) {
const variables = {
includeEnterpriseSpecials: false,
productsFilter: {
dispensaryId: platformId,
pricingType: 'rec',
Status: 'Active', // CRITICAL: Must be 'Active', not null
types: [],
useCache: true,
isDefaultSort: true,
sortBy: 'popularSortIdx',
sortDirection: 1,
bypassOnlineThresholds: true,
isKioskMenu: false,
removeProductsBelowOptionThresholds: false,
},
page: pageNum,
perPage: perPage,
};
const extensions = {
persistedQuery: {
version: 1,
sha256Hash: 'ee29c060826dc41c527e470e9ae502c9b2c169720faa0a9f5d25e1b9a530a4a0'
}
};
const qs = new URLSearchParams({
operationName: 'FilteredProducts',
variables: JSON.stringify(variables),
extensions: JSON.stringify(extensions)
});
const response = await fetch(`https://dutchie.com/api-3/graphql?${qs}`, {
method: 'GET',
headers: {
'Accept': 'application/json',
'content-type': 'application/json',
'x-dutchie-session': sessionId,
'apollographql-client-name': 'Marketplace (production)',
},
credentials: 'include'
});
const json = await response.json();
const data = json?.data?.filteredProducts;
if (!data?.products) break;
allProducts.push(...data.products);
if (pageNum === 0) totalCount = data.queryInfo?.totalCount || 0;
if (allProducts.length >= totalCount) break;
pageNum++;
await new Promise(r => setTimeout(r, 200)); // Polite delay
}
return { products: allProducts, totalCount };
}, platformId);
await browser.close();
// STEP 3: Save payload
const payload = {
dispensaryId,
platformId,
cName,
fetchedAt: new Date().toISOString(),
productCount: result.products.length,
products: result.products,
};
fs.writeFileSync(outputPath, JSON.stringify(payload, null, 2));
return payload;
}
```
---
## Critical Parameters
### GraphQL Hash (FilteredProducts)
```
ee29c060826dc41c527e470e9ae502c9b2c169720faa0a9f5d25e1b9a530a4a0
```
**WARNING:** Using the wrong hash returns HTTP 400.
### Status Parameter
| Value | Result |
|-------|--------|
| `'Active'` | Returns in-stock products (1019 in test) |
| `null` | Returns 0 products |
| `'All'` | Returns HTTP 400 |
**ALWAYS use `Status: 'Active'`**
### Required Headers
```javascript
{
'Accept': 'application/json',
'content-type': 'application/json',
'x-dutchie-session': 'unique-session-id',
'apollographql-client-name': 'Marketplace (production)',
}
```
### Endpoint
```
https://dutchie.com/api-3/graphql
```
---
## Performance Benchmarks
Test store: AZ-Deeply-Rooted (1019 products)
| Metric | Value |
|--------|-------|
| Total products | 1019 |
| Time | 18.5 seconds |
| Payload size | 11.8 MB |
| Pages fetched | 11 (100 per page) |
| Success rate | 100% |
---
## Payload Format
The output matches the existing `payload-fetch.ts` handler format:
```json
{
"dispensaryId": 123,
"platformId": "6405ef617056e8014d79101b",
"cName": "AZ-Deeply-Rooted",
"fetchedAt": "2025-12-12T05:05:19.837Z",
"productCount": 1019,
"products": [
{
"id": "6927508db4851262f629a869",
"Name": "Product Name",
"brand": { "name": "Brand Name", ... },
"type": "Flower",
"THC": "25%",
"Prices": [...],
"Options": [...],
...
}
]
}
```
---
## Integration Points
### As a Task Handler
The organic approach can be integrated as an alternative to curl-based fetching:
```typescript
// In src/tasks/handlers/organic-payload-fetch.ts
export async function handleOrganicPayloadFetch(ctx: TaskContext): Promise<TaskResult> {
// Use puppeteer-based capture
// Save to same payload storage
// Queue product_refresh task
}
```
### Worker Configuration
Add to job_schedules:
```sql
INSERT INTO job_schedules (name, role, cron_expression)
VALUES ('organic_product_crawl', 'organic_payload_fetch', '0 */6 * * *');
```
---
## Troubleshooting
### HTTP 400 Bad Request
- Check hash is correct: `ee29c060...`
- Verify Status is `'Active'` (string, not null)
### 0 Products Returned
- Status was likely `null` or `'All'` - use `'Active'`
- Check platformId is valid MongoDB ObjectId
### Session Not Established
- Increase timeout on initial page.goto()
- Check cName is valid (matches embedded-menu URL)
### Detection/Blocking
- StealthPlugin should handle most cases
- Add random delays between pages
- Use headless: 'new' (not true/false)
---
## Files Reference
| File | Purpose |
|------|---------|
| `backend/test-intercept.js` | Proof of concept script |
| `backend/src/platforms/dutchie/client.ts` | GraphQL hashes, curl implementation |
| `backend/src/tasks/handlers/payload-fetch.ts` | Current curl-based handler |
| `backend/src/utils/payload-storage.ts` | Payload save/load utilities |
---
## See Also
- `DUTCHIE_CRAWL_WORKFLOW.md` - Full crawl pipeline documentation
- `TASK_WORKFLOW_2024-12-10.md` - Task system architecture
- `CLAUDE.md` - Project rules and constraints

View File

@@ -0,0 +1,25 @@
# ARCHIVED DOCUMENTATION
**WARNING: These docs may be outdated or inaccurate.**
The code has evolved significantly. These docs are kept for historical reference only.
## What to Use Instead
**The single source of truth is:**
- `CLAUDE.md` (root) - Essential rules and quick reference
- `docs/CODEBASE_MAP.md` - Current file/directory reference
## Why Archive?
These docs were written during development iterations and may reference:
- Old file paths that no longer exist
- Deprecated approaches (hydration, scraper-v2)
- APIs that have changed
- Database schemas that evolved
## If You Need Details
1. First check CODEBASE_MAP.md for current file locations
2. Then read the actual source code
3. Only use archive docs as a last resort for historical context

View File

@@ -0,0 +1,77 @@
apiVersion: v1
kind: Service
metadata:
name: scraper-worker
namespace: dispensary-scraper
labels:
app: scraper-worker
spec:
clusterIP: None # Headless service required for StatefulSet
selector:
app: scraper-worker
ports:
- port: 3010
name: http
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
name: scraper-worker
namespace: dispensary-scraper
spec:
serviceName: scraper-worker
replicas: 8
podManagementPolicy: Parallel # Start all pods at once
updateStrategy:
type: OnDelete # Pods only update when manually deleted - no automatic restarts
selector:
matchLabels:
app: scraper-worker
template:
metadata:
labels:
app: scraper-worker
spec:
terminationGracePeriodSeconds: 60
imagePullSecrets:
- name: regcred
containers:
- name: worker
image: code.cannabrands.app/creationshop/dispensary-scraper:latest
imagePullPolicy: Always
command: ["node"]
args: ["dist/tasks/task-worker.js"]
env:
- name: WORKER_MODE
value: "true"
- name: POD_NAME
valueFrom:
fieldRef:
fieldPath: metadata.name
- name: MAX_CONCURRENT_TASKS
value: "50"
- name: API_BASE_URL
value: http://scraper
- name: NODE_OPTIONS
value: --max-old-space-size=1500
envFrom:
- configMapRef:
name: scraper-config
- secretRef:
name: scraper-secrets
resources:
requests:
cpu: 100m
memory: 1Gi
limits:
cpu: 500m
memory: 2Gi
livenessProbe:
exec:
command:
- /bin/sh
- -c
- pgrep -f 'task-worker' > /dev/null
initialDelaySeconds: 10
periodSeconds: 30
failureThreshold: 3

View File

@@ -0,0 +1,253 @@
-- Migration 084: Dual Transport Preflight System
-- Workers run both curl and http (Puppeteer) preflights on startup
-- Tasks can require a specific transport method
-- ===================================================================
-- PART 1: Add preflight columns to worker_registry
-- ===================================================================
-- Preflight status for curl/axios transport (proxy-based)
ALTER TABLE worker_registry
ADD COLUMN IF NOT EXISTS preflight_curl_status VARCHAR(20) DEFAULT 'pending';
-- Preflight status for http/Puppeteer transport (browser-based)
ALTER TABLE worker_registry
ADD COLUMN IF NOT EXISTS preflight_http_status VARCHAR(20) DEFAULT 'pending';
-- Timestamps for when each preflight completed
ALTER TABLE worker_registry
ADD COLUMN IF NOT EXISTS preflight_curl_at TIMESTAMPTZ;
ALTER TABLE worker_registry
ADD COLUMN IF NOT EXISTS preflight_http_at TIMESTAMPTZ;
-- Error messages for failed preflights
ALTER TABLE worker_registry
ADD COLUMN IF NOT EXISTS preflight_curl_error TEXT;
ALTER TABLE worker_registry
ADD COLUMN IF NOT EXISTS preflight_http_error TEXT;
-- Response time for successful preflights (ms)
ALTER TABLE worker_registry
ADD COLUMN IF NOT EXISTS preflight_curl_ms INTEGER;
ALTER TABLE worker_registry
ADD COLUMN IF NOT EXISTS preflight_http_ms INTEGER;
-- Constraints for preflight status values
ALTER TABLE worker_registry
DROP CONSTRAINT IF EXISTS valid_preflight_curl_status;
ALTER TABLE worker_registry
ADD CONSTRAINT valid_preflight_curl_status
CHECK (preflight_curl_status IN ('pending', 'passed', 'failed', 'skipped'));
ALTER TABLE worker_registry
DROP CONSTRAINT IF EXISTS valid_preflight_http_status;
ALTER TABLE worker_registry
ADD CONSTRAINT valid_preflight_http_status
CHECK (preflight_http_status IN ('pending', 'passed', 'failed', 'skipped'));
-- ===================================================================
-- PART 2: Add method column to worker_tasks
-- ===================================================================
-- Transport method requirement for the task
-- NULL = no preference (any worker can claim)
-- 'curl' = requires curl/axios transport (proxy-based, fast)
-- 'http' = requires http/Puppeteer transport (browser-based, anti-detect)
ALTER TABLE worker_tasks
ADD COLUMN IF NOT EXISTS method VARCHAR(10);
-- Constraint for valid method values
ALTER TABLE worker_tasks
DROP CONSTRAINT IF EXISTS valid_task_method;
ALTER TABLE worker_tasks
ADD CONSTRAINT valid_task_method
CHECK (method IS NULL OR method IN ('curl', 'http'));
-- Index for method-based task claiming
CREATE INDEX IF NOT EXISTS idx_worker_tasks_method
ON worker_tasks(method)
WHERE status = 'pending';
-- Set default method for all existing pending tasks to 'http'
-- ALL current tasks require Puppeteer/browser-based transport
UPDATE worker_tasks
SET method = 'http'
WHERE method IS NULL;
-- ===================================================================
-- PART 3: Update claim_task function for method compatibility
-- ===================================================================
CREATE OR REPLACE FUNCTION claim_task(
p_role VARCHAR(50),
p_worker_id VARCHAR(100),
p_curl_passed BOOLEAN DEFAULT TRUE,
p_http_passed BOOLEAN DEFAULT FALSE
) RETURNS worker_tasks AS $$
DECLARE
claimed_task worker_tasks;
BEGIN
UPDATE worker_tasks
SET
status = 'claimed',
worker_id = p_worker_id,
claimed_at = NOW(),
updated_at = NOW()
WHERE id = (
SELECT id FROM worker_tasks
WHERE role = p_role
AND status = 'pending'
AND (scheduled_for IS NULL OR scheduled_for <= NOW())
-- Method compatibility: worker must have passed the required preflight
AND (
method IS NULL -- No preference, any worker can claim
OR (method = 'curl' AND p_curl_passed = TRUE)
OR (method = 'http' AND p_http_passed = TRUE)
)
-- Exclude stores that already have an active task
AND (dispensary_id IS NULL OR dispensary_id NOT IN (
SELECT dispensary_id FROM worker_tasks
WHERE status IN ('claimed', 'running')
AND dispensary_id IS NOT NULL
))
ORDER BY priority DESC, created_at ASC
LIMIT 1
FOR UPDATE SKIP LOCKED
)
RETURNING * INTO claimed_task;
RETURN claimed_task;
END;
$$ LANGUAGE plpgsql;
-- ===================================================================
-- PART 4: Update v_active_workers view
-- ===================================================================
DROP VIEW IF EXISTS v_active_workers;
CREATE VIEW v_active_workers AS
SELECT
wr.id,
wr.worker_id,
wr.friendly_name,
wr.role,
wr.status,
wr.pod_name,
wr.hostname,
wr.started_at,
wr.last_heartbeat_at,
wr.last_task_at,
wr.tasks_completed,
wr.tasks_failed,
wr.current_task_id,
-- Preflight status
wr.preflight_curl_status,
wr.preflight_http_status,
wr.preflight_curl_at,
wr.preflight_http_at,
wr.preflight_curl_error,
wr.preflight_http_error,
wr.preflight_curl_ms,
wr.preflight_http_ms,
-- Computed fields
EXTRACT(EPOCH FROM (NOW() - wr.last_heartbeat_at)) as seconds_since_heartbeat,
CASE
WHEN wr.status = 'offline' THEN 'offline'
WHEN wr.last_heartbeat_at < NOW() - INTERVAL '2 minutes' THEN 'stale'
WHEN wr.current_task_id IS NOT NULL THEN 'busy'
ELSE 'ready'
END as health_status,
-- Capability flags (can this worker handle curl/http tasks?)
(wr.preflight_curl_status = 'passed') as can_curl,
(wr.preflight_http_status = 'passed') as can_http
FROM worker_registry wr
WHERE wr.status != 'terminated'
ORDER BY wr.status = 'active' DESC, wr.last_heartbeat_at DESC;
-- ===================================================================
-- PART 5: View for task queue with method info
-- ===================================================================
DROP VIEW IF EXISTS v_task_history;
CREATE VIEW v_task_history AS
SELECT
t.id,
t.role,
t.dispensary_id,
d.name as dispensary_name,
t.platform,
t.status,
t.priority,
t.method,
t.worker_id,
t.scheduled_for,
t.claimed_at,
t.started_at,
t.completed_at,
t.error_message,
t.retry_count,
t.created_at,
EXTRACT(EPOCH FROM (t.completed_at - t.started_at)) as duration_sec
FROM worker_tasks t
LEFT JOIN dispensaries d ON d.id = t.dispensary_id
ORDER BY t.created_at DESC;
-- ===================================================================
-- PART 6: Helper function to update worker preflight status
-- ===================================================================
CREATE OR REPLACE FUNCTION update_worker_preflight(
p_worker_id VARCHAR(100),
p_transport VARCHAR(10), -- 'curl' or 'http'
p_status VARCHAR(20), -- 'passed', 'failed', 'skipped'
p_response_ms INTEGER DEFAULT NULL,
p_error TEXT DEFAULT NULL
) RETURNS VOID AS $$
BEGIN
IF p_transport = 'curl' THEN
UPDATE worker_registry
SET
preflight_curl_status = p_status,
preflight_curl_at = NOW(),
preflight_curl_ms = p_response_ms,
preflight_curl_error = p_error,
updated_at = NOW()
WHERE worker_id = p_worker_id;
ELSIF p_transport = 'http' THEN
UPDATE worker_registry
SET
preflight_http_status = p_status,
preflight_http_at = NOW(),
preflight_http_ms = p_response_ms,
preflight_http_error = p_error,
updated_at = NOW()
WHERE worker_id = p_worker_id;
END IF;
END;
$$ LANGUAGE plpgsql;
-- ===================================================================
-- Comments
-- ===================================================================
COMMENT ON COLUMN worker_registry.preflight_curl_status IS 'Status of curl/axios preflight: pending, passed, failed, skipped';
COMMENT ON COLUMN worker_registry.preflight_http_status IS 'Status of http/Puppeteer preflight: pending, passed, failed, skipped';
COMMENT ON COLUMN worker_registry.preflight_curl_at IS 'When curl preflight completed';
COMMENT ON COLUMN worker_registry.preflight_http_at IS 'When http preflight completed';
COMMENT ON COLUMN worker_registry.preflight_curl_error IS 'Error message if curl preflight failed';
COMMENT ON COLUMN worker_registry.preflight_http_error IS 'Error message if http preflight failed';
COMMENT ON COLUMN worker_registry.preflight_curl_ms IS 'Response time of successful curl preflight (ms)';
COMMENT ON COLUMN worker_registry.preflight_http_ms IS 'Response time of successful http preflight (ms)';
COMMENT ON COLUMN worker_tasks.method IS 'Transport method required: NULL=any, curl=proxy-based, http=browser-based';
COMMENT ON FUNCTION claim_task IS 'Atomically claim a task, respecting method requirements and per-store locking';
COMMENT ON FUNCTION update_worker_preflight IS 'Update a workers preflight status for a given transport';

View File

@@ -0,0 +1,168 @@
-- Migration 085: Add IP and fingerprint columns for preflight reporting
-- These columns were missing from migration 084
-- ===================================================================
-- PART 1: Add IP address columns to worker_registry
-- ===================================================================
-- IP address detected during curl/axios preflight
ALTER TABLE worker_registry
ADD COLUMN IF NOT EXISTS curl_ip VARCHAR(45);
-- IP address detected during http/Puppeteer preflight
ALTER TABLE worker_registry
ADD COLUMN IF NOT EXISTS http_ip VARCHAR(45);
-- ===================================================================
-- PART 2: Add fingerprint data column
-- ===================================================================
-- Browser fingerprint data captured during Puppeteer preflight
ALTER TABLE worker_registry
ADD COLUMN IF NOT EXISTS fingerprint_data JSONB;
-- ===================================================================
-- PART 3: Add combined preflight status/timestamp for convenience
-- ===================================================================
-- Overall preflight status (computed from both transports)
-- Values: 'pending', 'passed', 'partial', 'failed'
-- - 'pending': neither transport tested
-- - 'passed': both transports passed (or http passed for browser-only)
-- - 'partial': at least one passed
-- - 'failed': no transport passed
ALTER TABLE worker_registry
ADD COLUMN IF NOT EXISTS preflight_status VARCHAR(20) DEFAULT 'pending';
-- Most recent preflight completion timestamp
ALTER TABLE worker_registry
ADD COLUMN IF NOT EXISTS preflight_at TIMESTAMPTZ;
-- ===================================================================
-- PART 4: Update function to set preflight status
-- ===================================================================
CREATE OR REPLACE FUNCTION update_worker_preflight(
p_worker_id VARCHAR(100),
p_transport VARCHAR(10), -- 'curl' or 'http'
p_status VARCHAR(20), -- 'passed', 'failed', 'skipped'
p_ip VARCHAR(45) DEFAULT NULL,
p_response_ms INTEGER DEFAULT NULL,
p_error TEXT DEFAULT NULL,
p_fingerprint JSONB DEFAULT NULL
) RETURNS VOID AS $$
DECLARE
v_curl_status VARCHAR(20);
v_http_status VARCHAR(20);
v_overall_status VARCHAR(20);
BEGIN
IF p_transport = 'curl' THEN
UPDATE worker_registry
SET
preflight_curl_status = p_status,
preflight_curl_at = NOW(),
preflight_curl_ms = p_response_ms,
preflight_curl_error = p_error,
curl_ip = p_ip,
updated_at = NOW()
WHERE worker_id = p_worker_id;
ELSIF p_transport = 'http' THEN
UPDATE worker_registry
SET
preflight_http_status = p_status,
preflight_http_at = NOW(),
preflight_http_ms = p_response_ms,
preflight_http_error = p_error,
http_ip = p_ip,
fingerprint_data = COALESCE(p_fingerprint, fingerprint_data),
updated_at = NOW()
WHERE worker_id = p_worker_id;
END IF;
-- Update overall preflight status
SELECT preflight_curl_status, preflight_http_status
INTO v_curl_status, v_http_status
FROM worker_registry
WHERE worker_id = p_worker_id;
-- Compute overall status
IF v_curl_status = 'passed' AND v_http_status = 'passed' THEN
v_overall_status := 'passed';
ELSIF v_curl_status = 'passed' OR v_http_status = 'passed' THEN
v_overall_status := 'partial';
ELSIF v_curl_status = 'failed' OR v_http_status = 'failed' THEN
v_overall_status := 'failed';
ELSE
v_overall_status := 'pending';
END IF;
UPDATE worker_registry
SET
preflight_status = v_overall_status,
preflight_at = NOW()
WHERE worker_id = p_worker_id;
END;
$$ LANGUAGE plpgsql;
-- ===================================================================
-- PART 5: Update v_active_workers view
-- ===================================================================
DROP VIEW IF EXISTS v_active_workers;
CREATE VIEW v_active_workers AS
SELECT
wr.id,
wr.worker_id,
wr.friendly_name,
wr.role,
wr.status,
wr.pod_name,
wr.hostname,
wr.started_at,
wr.last_heartbeat_at,
wr.last_task_at,
wr.tasks_completed,
wr.tasks_failed,
wr.current_task_id,
-- IP addresses from preflights
wr.curl_ip,
wr.http_ip,
-- Combined preflight status
wr.preflight_status,
wr.preflight_at,
-- Detailed preflight status per transport
wr.preflight_curl_status,
wr.preflight_http_status,
wr.preflight_curl_at,
wr.preflight_http_at,
wr.preflight_curl_error,
wr.preflight_http_error,
wr.preflight_curl_ms,
wr.preflight_http_ms,
-- Fingerprint data
wr.fingerprint_data,
-- Computed fields
EXTRACT(EPOCH FROM (NOW() - wr.last_heartbeat_at)) as seconds_since_heartbeat,
CASE
WHEN wr.status = 'offline' THEN 'offline'
WHEN wr.last_heartbeat_at < NOW() - INTERVAL '2 minutes' THEN 'stale'
WHEN wr.current_task_id IS NOT NULL THEN 'busy'
ELSE 'ready'
END as health_status,
-- Capability flags (can this worker handle curl/http tasks?)
(wr.preflight_curl_status = 'passed') as can_curl,
(wr.preflight_http_status = 'passed') as can_http
FROM worker_registry wr
WHERE wr.status != 'terminated'
ORDER BY wr.status = 'active' DESC, wr.last_heartbeat_at DESC;
-- ===================================================================
-- Comments
-- ===================================================================
COMMENT ON COLUMN worker_registry.curl_ip IS 'IP address detected during curl/axios preflight';
COMMENT ON COLUMN worker_registry.http_ip IS 'IP address detected during Puppeteer preflight';
COMMENT ON COLUMN worker_registry.fingerprint_data IS 'Browser fingerprint captured during Puppeteer preflight';
COMMENT ON COLUMN worker_registry.preflight_status IS 'Overall preflight status: pending, passed, partial, failed';
COMMENT ON COLUMN worker_registry.preflight_at IS 'Most recent preflight completion timestamp';

View File

@@ -0,0 +1,46 @@
# DEPRECATED CODE - DO NOT USE
**These directories contain OLD, ABANDONED code.**
## What's Here
| Directory | What It Was | Why Deprecated |
|-----------|-------------|----------------|
| `hydration/` | Old pipeline for processing crawl data | Replaced by `src/tasks/handlers/` |
| `scraper-v2/` | Old Puppeteer-based scraper engine | Replaced by curl-based `src/platforms/dutchie/client.ts` |
| `canonical-hydration/` | Intermediate step toward canonical schema | Merged into task handlers |
## What to Use Instead
| Old (DONT USE) | New (USE THIS) |
|----------------|----------------|
| `hydration/normalizers/dutchie.ts` | `src/tasks/handlers/product-refresh.ts` |
| `hydration/producer.ts` | `src/tasks/handlers/payload-fetch.ts` |
| `scraper-v2/engine.ts` | `src/platforms/dutchie/client.ts` |
| `scraper-v2/scheduler.ts` | `src/services/task-scheduler.ts` |
## Why Keep This Code?
- Historical reference only
- Some patterns may be useful for debugging
- Will be deleted once confirmed not needed
## Claude Instructions
**IF YOU ARE CLAUDE:**
1. NEVER import from `src/_deprecated/`
2. NEVER reference these files as examples
3. NEVER try to "fix" or "update" code in here
4. If you see imports from these directories, suggest replacing them
**Correct imports:**
```typescript
// GOOD
import { executeGraphQL } from '../platforms/dutchie/client';
import { pool } from '../db/pool';
// BAD - DO NOT USE
import { something } from '../_deprecated/hydration/...';
import { something } from '../_deprecated/scraper-v2/...';
```

View File

@@ -0,0 +1,584 @@
/**
* System API Routes
*
* Provides REST API endpoints for system monitoring and control:
* - /api/system/sync/* - Sync orchestrator
* - /api/system/dlq/* - Dead-letter queue
* - /api/system/integrity/* - Integrity checks
* - /api/system/fix/* - Auto-fix routines
* - /api/system/alerts/* - System alerts
* - /metrics - Prometheus metrics
*
* Phase 5: Full Production Sync + Monitoring
*/
import { Router, Request, Response } from 'express';
import { Pool } from 'pg';
import {
SyncOrchestrator,
MetricsService,
DLQService,
AlertService,
IntegrityService,
AutoFixService,
} from '../services';
export function createSystemRouter(pool: Pool): Router {
const router = Router();
// Initialize services
const metrics = new MetricsService(pool);
const dlq = new DLQService(pool);
const alerts = new AlertService(pool);
const integrity = new IntegrityService(pool, alerts);
const autoFix = new AutoFixService(pool, alerts);
const orchestrator = new SyncOrchestrator(pool, metrics, dlq, alerts);
// ============================================================
// SYNC ORCHESTRATOR ENDPOINTS
// ============================================================
/**
* GET /api/system/sync/status
* Get current sync status
*/
router.get('/sync/status', async (_req: Request, res: Response) => {
try {
const status = await orchestrator.getStatus();
res.json(status);
} catch (error) {
console.error('[System] Sync status error:', error);
res.status(500).json({ error: 'Failed to get sync status' });
}
});
/**
* POST /api/system/sync/run
* Trigger a sync run
*/
router.post('/sync/run', async (req: Request, res: Response) => {
try {
const triggeredBy = req.body.triggeredBy || 'api';
const result = await orchestrator.runSync();
res.json({
success: true,
triggeredBy,
metrics: result,
});
} catch (error) {
console.error('[System] Sync run error:', error);
res.status(500).json({
success: false,
error: error instanceof Error ? error.message : 'Sync run failed',
});
}
});
/**
* GET /api/system/sync/queue-depth
* Get queue depth information
*/
router.get('/sync/queue-depth', async (_req: Request, res: Response) => {
try {
const depth = await orchestrator.getQueueDepth();
res.json(depth);
} catch (error) {
console.error('[System] Queue depth error:', error);
res.status(500).json({ error: 'Failed to get queue depth' });
}
});
/**
* GET /api/system/sync/health
* Get sync health status
*/
router.get('/sync/health', async (_req: Request, res: Response) => {
try {
const health = await orchestrator.getHealth();
res.status(health.healthy ? 200 : 503).json(health);
} catch (error) {
console.error('[System] Health check error:', error);
res.status(500).json({ healthy: false, error: 'Health check failed' });
}
});
/**
* POST /api/system/sync/pause
* Pause the orchestrator
*/
router.post('/sync/pause', async (req: Request, res: Response) => {
try {
const reason = req.body.reason || 'Manual pause';
await orchestrator.pause(reason);
res.json({ success: true, message: 'Orchestrator paused' });
} catch (error) {
console.error('[System] Pause error:', error);
res.status(500).json({ error: 'Failed to pause orchestrator' });
}
});
/**
* POST /api/system/sync/resume
* Resume the orchestrator
*/
router.post('/sync/resume', async (_req: Request, res: Response) => {
try {
await orchestrator.resume();
res.json({ success: true, message: 'Orchestrator resumed' });
} catch (error) {
console.error('[System] Resume error:', error);
res.status(500).json({ error: 'Failed to resume orchestrator' });
}
});
// ============================================================
// DLQ ENDPOINTS
// ============================================================
/**
* GET /api/system/dlq
* List DLQ payloads
*/
router.get('/dlq', async (req: Request, res: Response) => {
try {
const options = {
status: req.query.status as string,
errorType: req.query.errorType as string,
dispensaryId: req.query.dispensaryId ? parseInt(req.query.dispensaryId as string) : undefined,
limit: req.query.limit ? parseInt(req.query.limit as string) : 50,
offset: req.query.offset ? parseInt(req.query.offset as string) : 0,
};
const result = await dlq.listPayloads(options);
res.json(result);
} catch (error) {
console.error('[System] DLQ list error:', error);
res.status(500).json({ error: 'Failed to list DLQ payloads' });
}
});
/**
* GET /api/system/dlq/stats
* Get DLQ statistics
*/
router.get('/dlq/stats', async (_req: Request, res: Response) => {
try {
const stats = await dlq.getStats();
res.json(stats);
} catch (error) {
console.error('[System] DLQ stats error:', error);
res.status(500).json({ error: 'Failed to get DLQ stats' });
}
});
/**
* GET /api/system/dlq/summary
* Get DLQ summary by error type
*/
router.get('/dlq/summary', async (_req: Request, res: Response) => {
try {
const summary = await dlq.getSummary();
res.json(summary);
} catch (error) {
console.error('[System] DLQ summary error:', error);
res.status(500).json({ error: 'Failed to get DLQ summary' });
}
});
/**
* GET /api/system/dlq/:id
* Get a specific DLQ payload
*/
router.get('/dlq/:id', async (req: Request, res: Response) => {
try {
const payload = await dlq.getPayload(req.params.id);
if (!payload) {
return res.status(404).json({ error: 'Payload not found' });
}
res.json(payload);
} catch (error) {
console.error('[System] DLQ get error:', error);
res.status(500).json({ error: 'Failed to get DLQ payload' });
}
});
/**
* POST /api/system/dlq/:id/retry
* Retry a DLQ payload
*/
router.post('/dlq/:id/retry', async (req: Request, res: Response) => {
try {
const result = await dlq.retryPayload(req.params.id);
if (result.success) {
res.json(result);
} else {
res.status(400).json(result);
}
} catch (error) {
console.error('[System] DLQ retry error:', error);
res.status(500).json({ error: 'Failed to retry payload' });
}
});
/**
* POST /api/system/dlq/:id/abandon
* Abandon a DLQ payload
*/
router.post('/dlq/:id/abandon', async (req: Request, res: Response) => {
try {
const reason = req.body.reason || 'Manually abandoned';
const abandonedBy = req.body.abandonedBy || 'api';
const success = await dlq.abandonPayload(req.params.id, reason, abandonedBy);
res.json({ success });
} catch (error) {
console.error('[System] DLQ abandon error:', error);
res.status(500).json({ error: 'Failed to abandon payload' });
}
});
/**
* POST /api/system/dlq/bulk-retry
* Bulk retry payloads by error type
*/
router.post('/dlq/bulk-retry', async (req: Request, res: Response) => {
try {
const { errorType } = req.body;
if (!errorType) {
return res.status(400).json({ error: 'errorType is required' });
}
const result = await dlq.bulkRetryByErrorType(errorType);
res.json(result);
} catch (error) {
console.error('[System] DLQ bulk retry error:', error);
res.status(500).json({ error: 'Failed to bulk retry' });
}
});
// ============================================================
// INTEGRITY CHECK ENDPOINTS
// ============================================================
/**
* POST /api/system/integrity/run
* Run all integrity checks
*/
router.post('/integrity/run', async (req: Request, res: Response) => {
try {
const triggeredBy = req.body.triggeredBy || 'api';
const result = await integrity.runAllChecks(triggeredBy);
res.json(result);
} catch (error) {
console.error('[System] Integrity run error:', error);
res.status(500).json({ error: 'Failed to run integrity checks' });
}
});
/**
* GET /api/system/integrity/runs
* Get recent integrity check runs
*/
router.get('/integrity/runs', async (req: Request, res: Response) => {
try {
const limit = req.query.limit ? parseInt(req.query.limit as string) : 10;
const runs = await integrity.getRecentRuns(limit);
res.json(runs);
} catch (error) {
console.error('[System] Integrity runs error:', error);
res.status(500).json({ error: 'Failed to get integrity runs' });
}
});
/**
* GET /api/system/integrity/runs/:runId
* Get results for a specific integrity run
*/
router.get('/integrity/runs/:runId', async (req: Request, res: Response) => {
try {
const results = await integrity.getRunResults(req.params.runId);
res.json(results);
} catch (error) {
console.error('[System] Integrity run results error:', error);
res.status(500).json({ error: 'Failed to get run results' });
}
});
// ============================================================
// AUTO-FIX ENDPOINTS
// ============================================================
/**
* GET /api/system/fix/routines
* Get available fix routines
*/
router.get('/fix/routines', (_req: Request, res: Response) => {
try {
const routines = autoFix.getAvailableRoutines();
res.json(routines);
} catch (error) {
console.error('[System] Get routines error:', error);
res.status(500).json({ error: 'Failed to get routines' });
}
});
/**
* POST /api/system/fix/:routine
* Run a fix routine
*/
router.post('/fix/:routine', async (req: Request, res: Response) => {
try {
const routineName = req.params.routine;
const dryRun = req.body.dryRun === true;
const triggeredBy = req.body.triggeredBy || 'api';
const result = await autoFix.runRoutine(routineName as any, triggeredBy, { dryRun });
res.json(result);
} catch (error) {
console.error('[System] Fix routine error:', error);
res.status(500).json({ error: 'Failed to run fix routine' });
}
});
/**
* GET /api/system/fix/runs
* Get recent fix runs
*/
router.get('/fix/runs', async (req: Request, res: Response) => {
try {
const limit = req.query.limit ? parseInt(req.query.limit as string) : 20;
const runs = await autoFix.getRecentRuns(limit);
res.json(runs);
} catch (error) {
console.error('[System] Fix runs error:', error);
res.status(500).json({ error: 'Failed to get fix runs' });
}
});
// ============================================================
// ALERTS ENDPOINTS
// ============================================================
/**
* GET /api/system/alerts
* List alerts
*/
router.get('/alerts', async (req: Request, res: Response) => {
try {
const options = {
status: req.query.status as any,
severity: req.query.severity as any,
type: req.query.type as string,
limit: req.query.limit ? parseInt(req.query.limit as string) : 50,
offset: req.query.offset ? parseInt(req.query.offset as string) : 0,
};
const result = await alerts.listAlerts(options);
res.json(result);
} catch (error) {
console.error('[System] Alerts list error:', error);
res.status(500).json({ error: 'Failed to list alerts' });
}
});
/**
* GET /api/system/alerts/active
* Get active alerts
*/
router.get('/alerts/active', async (_req: Request, res: Response) => {
try {
const activeAlerts = await alerts.getActiveAlerts();
res.json(activeAlerts);
} catch (error) {
console.error('[System] Active alerts error:', error);
res.status(500).json({ error: 'Failed to get active alerts' });
}
});
/**
* GET /api/system/alerts/summary
* Get alert summary
*/
router.get('/alerts/summary', async (_req: Request, res: Response) => {
try {
const summary = await alerts.getSummary();
res.json(summary);
} catch (error) {
console.error('[System] Alerts summary error:', error);
res.status(500).json({ error: 'Failed to get alerts summary' });
}
});
/**
* POST /api/system/alerts/:id/acknowledge
* Acknowledge an alert
*/
router.post('/alerts/:id/acknowledge', async (req: Request, res: Response) => {
try {
const alertId = parseInt(req.params.id);
const acknowledgedBy = req.body.acknowledgedBy || 'api';
const success = await alerts.acknowledgeAlert(alertId, acknowledgedBy);
res.json({ success });
} catch (error) {
console.error('[System] Acknowledge alert error:', error);
res.status(500).json({ error: 'Failed to acknowledge alert' });
}
});
/**
* POST /api/system/alerts/:id/resolve
* Resolve an alert
*/
router.post('/alerts/:id/resolve', async (req: Request, res: Response) => {
try {
const alertId = parseInt(req.params.id);
const resolvedBy = req.body.resolvedBy || 'api';
const success = await alerts.resolveAlert(alertId, resolvedBy);
res.json({ success });
} catch (error) {
console.error('[System] Resolve alert error:', error);
res.status(500).json({ error: 'Failed to resolve alert' });
}
});
/**
* POST /api/system/alerts/bulk-acknowledge
* Bulk acknowledge alerts
*/
router.post('/alerts/bulk-acknowledge', async (req: Request, res: Response) => {
try {
const { ids, acknowledgedBy } = req.body;
if (!ids || !Array.isArray(ids)) {
return res.status(400).json({ error: 'ids array is required' });
}
const count = await alerts.bulkAcknowledge(ids, acknowledgedBy || 'api');
res.json({ acknowledged: count });
} catch (error) {
console.error('[System] Bulk acknowledge error:', error);
res.status(500).json({ error: 'Failed to bulk acknowledge' });
}
});
// ============================================================
// METRICS ENDPOINTS
// ============================================================
/**
* GET /api/system/metrics
* Get all current metrics
*/
router.get('/metrics', async (_req: Request, res: Response) => {
try {
const allMetrics = await metrics.getAllMetrics();
res.json(allMetrics);
} catch (error) {
console.error('[System] Metrics error:', error);
res.status(500).json({ error: 'Failed to get metrics' });
}
});
/**
* GET /api/system/metrics/:name
* Get a specific metric
*/
router.get('/metrics/:name', async (req: Request, res: Response) => {
try {
const metric = await metrics.getMetric(req.params.name);
if (!metric) {
return res.status(404).json({ error: 'Metric not found' });
}
res.json(metric);
} catch (error) {
console.error('[System] Metric error:', error);
res.status(500).json({ error: 'Failed to get metric' });
}
});
/**
* GET /api/system/metrics/:name/history
* Get metric time series
*/
router.get('/metrics/:name/history', async (req: Request, res: Response) => {
try {
const hours = req.query.hours ? parseInt(req.query.hours as string) : 24;
const history = await metrics.getMetricHistory(req.params.name, hours);
res.json(history);
} catch (error) {
console.error('[System] Metric history error:', error);
res.status(500).json({ error: 'Failed to get metric history' });
}
});
/**
* GET /api/system/errors
* Get error summary
*/
router.get('/errors', async (_req: Request, res: Response) => {
try {
const summary = await metrics.getErrorSummary();
res.json(summary);
} catch (error) {
console.error('[System] Error summary error:', error);
res.status(500).json({ error: 'Failed to get error summary' });
}
});
/**
* GET /api/system/errors/recent
* Get recent errors
*/
router.get('/errors/recent', async (req: Request, res: Response) => {
try {
const limit = req.query.limit ? parseInt(req.query.limit as string) : 50;
const errorType = req.query.type as string;
const errors = await metrics.getRecentErrors(limit, errorType);
res.json(errors);
} catch (error) {
console.error('[System] Recent errors error:', error);
res.status(500).json({ error: 'Failed to get recent errors' });
}
});
/**
* POST /api/system/errors/acknowledge
* Acknowledge errors
*/
router.post('/errors/acknowledge', async (req: Request, res: Response) => {
try {
const { ids, acknowledgedBy } = req.body;
if (!ids || !Array.isArray(ids)) {
return res.status(400).json({ error: 'ids array is required' });
}
const count = await metrics.acknowledgeErrors(ids, acknowledgedBy || 'api');
res.json({ acknowledged: count });
} catch (error) {
console.error('[System] Acknowledge errors error:', error);
res.status(500).json({ error: 'Failed to acknowledge errors' });
}
});
return router;
}
/**
* Create Prometheus metrics endpoint (standalone)
*/
export function createPrometheusRouter(pool: Pool): Router {
const router = Router();
const metrics = new MetricsService(pool);
/**
* GET /metrics
* Prometheus-compatible metrics endpoint
*/
router.get('/', async (_req: Request, res: Response) => {
try {
const prometheusOutput = await metrics.getPrometheusMetrics();
res.set('Content-Type', 'text/plain; version=0.0.4');
res.send(prometheusOutput);
} catch (error) {
console.error('[Prometheus] Metrics error:', error);
res.status(500).send('# Error generating metrics');
}
});
return router;
}

View File

@@ -109,7 +109,7 @@ import scraperMonitorRoutes from './routes/scraper-monitor';
import apiTokensRoutes from './routes/api-tokens'; import apiTokensRoutes from './routes/api-tokens';
import apiPermissionsRoutes from './routes/api-permissions'; import apiPermissionsRoutes from './routes/api-permissions';
import parallelScrapeRoutes from './routes/parallel-scrape'; import parallelScrapeRoutes from './routes/parallel-scrape';
import crawlerSandboxRoutes from './routes/crawler-sandbox'; // crawler-sandbox moved to _deprecated
import versionRoutes from './routes/version'; import versionRoutes from './routes/version';
import deployStatusRoutes from './routes/deploy-status'; import deployStatusRoutes from './routes/deploy-status';
import publicApiRoutes from './routes/public-api'; import publicApiRoutes from './routes/public-api';
@@ -187,7 +187,7 @@ app.use('/api/scraper-monitor', scraperMonitorRoutes);
app.use('/api/api-tokens', apiTokensRoutes); app.use('/api/api-tokens', apiTokensRoutes);
app.use('/api/api-permissions', apiPermissionsRoutes); app.use('/api/api-permissions', apiPermissionsRoutes);
app.use('/api/parallel-scrape', parallelScrapeRoutes); app.use('/api/parallel-scrape', parallelScrapeRoutes);
app.use('/api/crawler-sandbox', crawlerSandboxRoutes); // crawler-sandbox moved to _deprecated
app.use('/api/version', versionRoutes); app.use('/api/version', versionRoutes);
app.use('/api/admin/deploy-status', deployStatusRoutes); app.use('/api/admin/deploy-status', deployStatusRoutes);
console.log('[DeployStatus] Routes registered at /api/admin/deploy-status'); console.log('[DeployStatus] Routes registered at /api/admin/deploy-status');

View File

@@ -278,7 +278,7 @@ router.post('/update-locations', requireRole('superadmin', 'admin'), async (req,
// Run in background // Run in background
updateAllProxyLocations().catch(err => { updateAllProxyLocations().catch(err => {
console.error('Location update failed:', err); console.error('Location update failed:', err);
}); });
res.json({ message: 'Location update job started' }); res.json({ message: 'Location update job started' });

View File

@@ -23,6 +23,8 @@
import { Router, Request, Response } from 'express'; import { Router, Request, Response } from 'express';
import { pool } from '../db/pool'; import { pool } from '../db/pool';
import os from 'os'; import os from 'os';
import { runPuppeteerPreflightWithRetry } from '../services/puppeteer-preflight';
import { CrawlRotator } from '../services/crawl-rotator';
const router = Router(); const router = Router();
@@ -355,6 +357,12 @@ router.get('/workers', async (req: Request, res: Response) => {
-- Decommission fields -- Decommission fields
COALESCE(decommission_requested, false) as decommission_requested, COALESCE(decommission_requested, false) as decommission_requested,
decommission_reason, decommission_reason,
-- Preflight fields (dual-transport verification)
curl_ip,
http_ip,
preflight_status,
preflight_at,
fingerprint_data,
-- Full metadata for resources -- Full metadata for resources
metadata, metadata,
EXTRACT(EPOCH FROM (NOW() - last_heartbeat_at)) as seconds_since_heartbeat, EXTRACT(EPOCH FROM (NOW() - last_heartbeat_at)) as seconds_since_heartbeat,
@@ -858,4 +866,58 @@ router.get('/pods', async (_req: Request, res: Response) => {
} }
}); });
// ============================================================
// PREFLIGHT SMOKE TEST
// ============================================================
/**
* POST /api/worker-registry/preflight-test
* Run an HTTP (Puppeteer) preflight test and return results
*
* This is a smoke test endpoint to verify the preflight system works.
* Returns IP, fingerprint data, bot detection results, and products fetched.
*/
router.post('/preflight-test', async (_req: Request, res: Response) => {
try {
console.log('[PreflightTest] Starting HTTP preflight smoke test...');
// Create a temporary CrawlRotator for the test
const crawlRotator = new CrawlRotator();
// Run the Puppeteer preflight (with 1 retry)
const startTime = Date.now();
const result = await runPuppeteerPreflightWithRetry(crawlRotator, 1);
const duration = Date.now() - startTime;
console.log(`[PreflightTest] Completed in ${duration}ms - passed: ${result.passed}`);
res.json({
success: true,
test: 'http_preflight',
duration_ms: duration,
result: {
passed: result.passed,
proxy_ip: result.proxyIp,
fingerprint: result.fingerprint,
bot_detection: result.botDetection,
products_returned: result.productsReturned,
browser_user_agent: result.browserUserAgent,
ip_verified: result.ipVerified,
proxy_available: result.proxyAvailable,
proxy_connected: result.proxyConnected,
antidetect_ready: result.antidetectReady,
response_time_ms: result.responseTimeMs,
error: result.error
}
});
} catch (error: any) {
console.error('[PreflightTest] Error:', error.message);
res.status(500).json({
success: false,
test: 'http_preflight',
error: error.message
});
}
});
export default router; export default router;

View File

@@ -683,6 +683,118 @@ export class CrawlRotator {
const current = this.proxy.getCurrent(); const current = this.proxy.getCurrent();
return current?.timezone; return current?.timezone;
} }
/**
* Preflight check - verifies proxy and anti-detect are working
* MUST be called before any task execution to ensure anonymity.
*
* Tests:
* 1. Proxy available - a proxy must be loaded and active
* 2. Proxy connectivity - makes HTTP request through proxy to verify connection
* 3. Anti-detect headers - verifies fingerprint is set with required headers
*
* @returns Promise<PreflightResult> with pass/fail status and details
*/
async preflight(): Promise<PreflightResult> {
const result: PreflightResult = {
passed: false,
proxyAvailable: false,
proxyConnected: false,
antidetectReady: false,
proxyIp: null,
fingerprint: null,
error: null,
responseTimeMs: null,
};
// Step 1: Check proxy is available
const currentProxy = this.proxy.getCurrent();
if (!currentProxy) {
result.error = 'No proxy available';
console.log('[Preflight] FAILED - No proxy available');
return result;
}
result.proxyAvailable = true;
result.proxyIp = currentProxy.host;
// Step 2: Check fingerprint/anti-detect is ready
const fingerprint = this.userAgent.getCurrent();
if (!fingerprint || !fingerprint.userAgent) {
result.error = 'Anti-detect fingerprint not initialized';
console.log('[Preflight] FAILED - No fingerprint');
return result;
}
result.antidetectReady = true;
result.fingerprint = {
userAgent: fingerprint.userAgent,
browserName: fingerprint.browserName,
deviceCategory: fingerprint.deviceCategory,
};
// Step 3: Test proxy connectivity with an actual HTTP request
// Use httpbin.org/ip to verify request goes through proxy
const proxyUrl = this.proxy.getProxyUrl(currentProxy);
const testUrl = 'https://httpbin.org/ip';
try {
const { default: axios } = await import('axios');
const { HttpsProxyAgent } = await import('https-proxy-agent');
const agent = new HttpsProxyAgent(proxyUrl);
const startTime = Date.now();
const response = await axios.get(testUrl, {
httpsAgent: agent,
timeout: 15000, // 15 second timeout
headers: {
'User-Agent': fingerprint.userAgent,
'Accept-Language': fingerprint.acceptLanguage,
...(fingerprint.secChUa && { 'sec-ch-ua': fingerprint.secChUa }),
...(fingerprint.secChUaPlatform && { 'sec-ch-ua-platform': fingerprint.secChUaPlatform }),
...(fingerprint.secChUaMobile && { 'sec-ch-ua-mobile': fingerprint.secChUaMobile }),
},
});
result.responseTimeMs = Date.now() - startTime;
result.proxyConnected = true;
result.passed = true;
// Mark success on proxy stats
await this.proxy.markSuccess(currentProxy.id, result.responseTimeMs);
console.log(`[Preflight] PASSED - Proxy ${currentProxy.host} connected (${result.responseTimeMs}ms), UA: ${fingerprint.browserName}/${fingerprint.deviceCategory}`);
} catch (err: any) {
result.error = `Proxy connection failed: ${err.message || 'Unknown error'}`;
console.log(`[Preflight] FAILED - Proxy connection error: ${err.message}`);
// Mark failure on proxy stats
await this.proxy.markFailed(currentProxy.id, err.message);
}
return result;
}
}
/**
* Result from preflight check
*/
export interface PreflightResult {
/** Overall pass/fail */
passed: boolean;
/** Step 1: Is a proxy loaded? */
proxyAvailable: boolean;
/** Step 2: Did HTTP request through proxy succeed? */
proxyConnected: boolean;
/** Step 3: Is fingerprint/anti-detect ready? */
antidetectReady: boolean;
/** Current proxy IP */
proxyIp: string | null;
/** Fingerprint summary */
fingerprint: { userAgent: string; browserName: string; deviceCategory: string } | null;
/** Error message if failed */
error: string | null;
/** Proxy response time in ms */
responseTimeMs: number | null;
} }
// ============================================================ // ============================================================

View File

@@ -0,0 +1,100 @@
/**
* Curl Preflight - Verify curl/axios transport works through proxy
*
* Tests:
* 1. Proxy is available and active
* 2. HTTP request through proxy succeeds
* 3. Anti-detect headers are properly set
*
* Use case: Fast, simple API requests that don't need browser fingerprint
*/
import axios from 'axios';
import { HttpsProxyAgent } from 'https-proxy-agent';
import { CrawlRotator, PreflightResult } from './crawl-rotator';
export interface CurlPreflightResult extends PreflightResult {
method: 'curl';
}
/**
* Run curl preflight check
* Tests proxy connectivity using axios/curl through the proxy
*/
export async function runCurlPreflight(
crawlRotator: CrawlRotator
): Promise<CurlPreflightResult> {
const result: CurlPreflightResult = {
method: 'curl',
passed: false,
proxyAvailable: false,
proxyConnected: false,
antidetectReady: false,
proxyIp: null,
fingerprint: null,
error: null,
responseTimeMs: null,
};
// Step 1: Check proxy is available
const currentProxy = crawlRotator.proxy.getCurrent();
if (!currentProxy) {
result.error = 'No proxy available';
console.log('[CurlPreflight] FAILED - No proxy available');
return result;
}
result.proxyAvailable = true;
result.proxyIp = currentProxy.host;
// Step 2: Check fingerprint/anti-detect is ready
const fingerprint = crawlRotator.userAgent.getCurrent();
if (!fingerprint || !fingerprint.userAgent) {
result.error = 'Anti-detect fingerprint not initialized';
console.log('[CurlPreflight] FAILED - No fingerprint');
return result;
}
result.antidetectReady = true;
result.fingerprint = {
userAgent: fingerprint.userAgent,
browserName: fingerprint.browserName,
deviceCategory: fingerprint.deviceCategory,
};
// Step 3: Test proxy connectivity with an actual HTTP request
const proxyUrl = crawlRotator.proxy.getProxyUrl(currentProxy);
const testUrl = 'https://httpbin.org/ip';
try {
const agent = new HttpsProxyAgent(proxyUrl);
const startTime = Date.now();
const response = await axios.get(testUrl, {
httpsAgent: agent,
timeout: 15000, // 15 second timeout
headers: {
'User-Agent': fingerprint.userAgent,
'Accept-Language': fingerprint.acceptLanguage,
...(fingerprint.secChUa && { 'sec-ch-ua': fingerprint.secChUa }),
...(fingerprint.secChUaPlatform && { 'sec-ch-ua-platform': fingerprint.secChUaPlatform }),
...(fingerprint.secChUaMobile && { 'sec-ch-ua-mobile': fingerprint.secChUaMobile }),
},
});
result.responseTimeMs = Date.now() - startTime;
result.proxyConnected = true;
result.passed = true;
// Mark success on proxy stats
await crawlRotator.proxy.markSuccess(currentProxy.id, result.responseTimeMs);
console.log(`[CurlPreflight] PASSED - Proxy ${currentProxy.host} connected (${result.responseTimeMs}ms), UA: ${fingerprint.browserName}/${fingerprint.deviceCategory}`);
} catch (err: any) {
result.error = `Proxy connection failed: ${err.message || 'Unknown error'}`;
console.log(`[CurlPreflight] FAILED - Proxy connection error: ${err.message}`);
// Mark failure on proxy stats
await crawlRotator.proxy.markFailed(currentProxy.id, err.message);
}
return result;
}

View File

@@ -0,0 +1,481 @@
/**
* Puppeteer Preflight - Verify browser-based transport works with anti-detect
*
* Uses Puppeteer + StealthPlugin to:
* 1. Launch headless browser with stealth mode + PROXY
* 2. Visit fingerprint.com demo to verify anti-detect and confirm proxy IP
* 3. Establish session by visiting Dutchie embedded menu
* 4. Make GraphQL request from browser context
* 5. Verify we get a valid response (not blocked)
*
* Use case: Anti-detect scraping that needs real browser fingerprint through proxy
*
* Based on test-intercept.js which successfully captures 1000+ products
*/
import { PreflightResult, CrawlRotator } from './crawl-rotator';
// GraphQL hash for FilteredProducts query - MUST match CLAUDE.md
const FILTERED_PRODUCTS_HASH = 'ee29c060826dc41c527e470e9ae502c9b2c169720faa0a9f5d25e1b9a530a4a0';
// Test dispensary - AZ-Deeply-Rooted (known working)
const TEST_CNAME = 'AZ-Deeply-Rooted';
const TEST_PLATFORM_ID = '6405ef617056e8014d79101b';
// Anti-detect verification sites (primary + fallback)
const FINGERPRINT_DEMO_URL = 'https://demo.fingerprint.com/';
const AMIUNIQUE_URL = 'https://amiunique.org/fingerprint';
// IP geolocation API for timezone lookup (free, no key required)
const IP_API_URL = 'http://ip-api.com/json';
/**
* Look up timezone from IP address using ip-api.com
* Returns IANA timezone (e.g., 'America/New_York') or null on failure
*/
async function getTimezoneFromIp(ip: string): Promise<{ timezone: string; city?: string; region?: string } | null> {
try {
const axios = require('axios');
const response = await axios.get(`${IP_API_URL}/${ip}?fields=status,timezone,city,regionName`, {
timeout: 5000,
});
if (response.data?.status === 'success' && response.data?.timezone) {
return {
timezone: response.data.timezone,
city: response.data.city,
region: response.data.regionName,
};
}
return null;
} catch (err: any) {
console.log(`[PuppeteerPreflight] IP geolocation lookup failed: ${err.message}`);
return null;
}
}
export interface PuppeteerPreflightResult extends PreflightResult {
method: 'http';
/** Number of products returned (proves API access) */
productsReturned?: number;
/** Browser user agent used */
browserUserAgent?: string;
/** Bot detection result from fingerprint.com */
botDetection?: {
detected: boolean;
probability?: number;
type?: string;
};
/** Expected proxy IP (from pool) */
expectedProxyIp?: string;
/** Whether IP verification passed (detected IP matches proxy) */
ipVerified?: boolean;
/** Detected timezone from IP geolocation */
detectedTimezone?: string;
/** Detected location from IP geolocation */
detectedLocation?: {
city?: string;
region?: string;
};
}
/**
* Run Puppeteer preflight check with proxy
* Tests browser-based access with anti-detect verification via fingerprint.com
*
* @param crawlRotator - CrawlRotator instance to get proxy from pool
*/
export async function runPuppeteerPreflight(
crawlRotator?: CrawlRotator
): Promise<PuppeteerPreflightResult> {
const result: PuppeteerPreflightResult = {
method: 'http',
passed: false,
proxyAvailable: false,
proxyConnected: false,
antidetectReady: false,
proxyIp: null,
fingerprint: null,
error: null,
responseTimeMs: null,
productsReturned: 0,
ipVerified: false,
};
let browser: any = null;
try {
// Step 0: Get a proxy from the pool
let proxyUrl: string | null = null;
let expectedProxyHost: string | null = null;
if (crawlRotator) {
const currentProxy = crawlRotator.proxy.getCurrent();
if (currentProxy) {
result.proxyAvailable = true;
proxyUrl = crawlRotator.proxy.getProxyUrl(currentProxy);
expectedProxyHost = currentProxy.host;
result.expectedProxyIp = expectedProxyHost;
console.log(`[PuppeteerPreflight] Using proxy: ${currentProxy.host}:${currentProxy.port}`);
} else {
result.error = 'No proxy available from pool';
console.log(`[PuppeteerPreflight] FAILED - No proxy available`);
return result;
}
} else {
console.log(`[PuppeteerPreflight] WARNING: No CrawlRotator provided - using direct connection`);
result.proxyAvailable = true; // No proxy needed for direct
}
// Dynamic imports to avoid loading Puppeteer unless needed
const puppeteer = require('puppeteer-extra');
const StealthPlugin = require('puppeteer-extra-plugin-stealth');
puppeteer.use(StealthPlugin());
const startTime = Date.now();
// Build browser args
const browserArgs = ['--no-sandbox', '--disable-setuid-sandbox'];
if (proxyUrl) {
// Extract host:port for Puppeteer (it handles auth separately)
const proxyUrlParsed = new URL(proxyUrl);
browserArgs.push(`--proxy-server=${proxyUrlParsed.host}`);
}
// Launch browser with stealth + proxy
browser = await puppeteer.launch({
headless: 'new',
args: browserArgs,
});
const page = await browser.newPage();
// If proxy has auth, set it up
if (proxyUrl) {
const proxyUrlParsed = new URL(proxyUrl);
if (proxyUrlParsed.username && proxyUrlParsed.password) {
await page.authenticate({
username: decodeURIComponent(proxyUrlParsed.username),
password: decodeURIComponent(proxyUrlParsed.password),
});
}
}
// Get browser user agent
const userAgent = await page.evaluate(() => navigator.userAgent);
result.browserUserAgent = userAgent;
result.fingerprint = {
userAgent,
browserName: 'Chrome (Puppeteer)',
deviceCategory: 'desktop',
};
// =========================================================================
// STEP 1a: Get IP address directly via simple API (more reliable than scraping)
// =========================================================================
console.log(`[PuppeteerPreflight] Getting proxy IP address...`);
try {
const ipApiResponse = await page.evaluate(async () => {
try {
const response = await fetch('https://api.ipify.org?format=json');
const data = await response.json();
return { ip: data.ip, error: null };
} catch (err: any) {
return { ip: null, error: err.message };
}
});
if (ipApiResponse.ip) {
result.proxyIp = ipApiResponse.ip;
result.proxyConnected = true;
console.log(`[PuppeteerPreflight] Detected proxy IP: ${ipApiResponse.ip}`);
// Look up timezone from IP
const geoData = await getTimezoneFromIp(ipApiResponse.ip);
if (geoData) {
result.detectedTimezone = geoData.timezone;
result.detectedLocation = { city: geoData.city, region: geoData.region };
console.log(`[PuppeteerPreflight] IP Geolocation: ${geoData.city}, ${geoData.region} (${geoData.timezone})`);
// Set browser timezone to match proxy location via CDP
try {
const client = await page.target().createCDPSession();
await client.send('Emulation.setTimezoneOverride', { timezoneId: geoData.timezone });
console.log(`[PuppeteerPreflight] Browser timezone set to: ${geoData.timezone}`);
} catch (tzErr: any) {
console.log(`[PuppeteerPreflight] Failed to set browser timezone: ${tzErr.message}`);
}
} else {
console.log(`[PuppeteerPreflight] WARNING: Could not determine timezone from IP - timezone mismatch possible`);
}
} else {
console.log(`[PuppeteerPreflight] IP lookup failed: ${ipApiResponse.error || 'unknown error'}`);
}
} catch (ipErr: any) {
console.log(`[PuppeteerPreflight] IP API error: ${ipErr.message}`);
}
// =========================================================================
// STEP 1b: Visit fingerprint.com demo to verify anti-detect
// =========================================================================
console.log(`[PuppeteerPreflight] Testing anti-detect at ${FINGERPRINT_DEMO_URL}...`);
try {
await page.goto(FINGERPRINT_DEMO_URL, {
waitUntil: 'networkidle2',
timeout: 30000,
});
result.proxyConnected = true; // If we got here, proxy is working
// Wait for fingerprint results to load
await page.waitForSelector('[data-test="visitor-id"]', { timeout: 10000 }).catch(() => {});
// Extract fingerprint data from the page
const fingerprintData = await page.evaluate(() => {
// Try to find the IP address displayed on the page
const ipElement = document.querySelector('[data-test="ip-address"]');
const ip = ipElement?.textContent?.trim() || null;
// Try to find bot detection info
const botElement = document.querySelector('[data-test="bot-detected"]');
const botDetected = botElement?.textContent?.toLowerCase().includes('true') || false;
// Try to find visitor ID (proves fingerprinting worked)
const visitorIdElement = document.querySelector('[data-test="visitor-id"]');
const visitorId = visitorIdElement?.textContent?.trim() || null;
// Alternative: look for common UI patterns if data-test attrs not present
let detectedIp = ip;
if (!detectedIp) {
// Look for IP in any element containing IP-like pattern
const allText = document.body.innerText;
const ipMatch = allText.match(/\b(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})\b/);
detectedIp = ipMatch ? ipMatch[1] : null;
}
return {
ip: detectedIp,
botDetected,
visitorId,
pageLoaded: !!document.body,
};
});
if (fingerprintData.ip) {
result.proxyIp = fingerprintData.ip;
console.log(`[PuppeteerPreflight] Detected IP: ${fingerprintData.ip}`);
// Verify IP matches expected proxy
if (expectedProxyHost) {
// Check if detected IP contains the proxy host (or is close match)
if (fingerprintData.ip === expectedProxyHost ||
expectedProxyHost.includes(fingerprintData.ip) ||
fingerprintData.ip.includes(expectedProxyHost.split('.').slice(0, 3).join('.'))) {
result.ipVerified = true;
console.log(`[PuppeteerPreflight] IP VERIFIED - matches proxy`);
} else {
console.log(`[PuppeteerPreflight] IP mismatch: expected ${expectedProxyHost}, got ${fingerprintData.ip}`);
// Don't fail - residential proxies often show different egress IPs
}
}
// Note: Timezone already set earlier via ipify.org IP lookup
}
if (fingerprintData.visitorId) {
console.log(`[PuppeteerPreflight] Fingerprint visitor ID: ${fingerprintData.visitorId}`);
}
result.botDetection = {
detected: fingerprintData.botDetected,
};
if (fingerprintData.botDetected) {
console.log(`[PuppeteerPreflight] WARNING: Bot detection triggered!`);
} else {
console.log(`[PuppeteerPreflight] Anti-detect check: NOT detected as bot`);
result.antidetectReady = true;
}
} catch (fpErr: any) {
// Could mean proxy connection failed
console.log(`[PuppeteerPreflight] Fingerprint.com check failed: ${fpErr.message}`);
if (fpErr.message.includes('net::ERR_PROXY') || fpErr.message.includes('ECONNREFUSED')) {
result.error = `Proxy connection failed: ${fpErr.message}`;
return result;
}
// Try fallback: amiunique.org
console.log(`[PuppeteerPreflight] Trying fallback: ${AMIUNIQUE_URL}...`);
try {
await page.goto(AMIUNIQUE_URL, {
waitUntil: 'networkidle2',
timeout: 30000,
});
result.proxyConnected = true;
// Extract IP from amiunique.org page
const amiData = await page.evaluate(() => {
const allText = document.body.innerText;
const ipMatch = allText.match(/\b(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})\b/);
return {
ip: ipMatch ? ipMatch[1] : null,
pageLoaded: !!document.body,
};
});
if (amiData.ip) {
result.proxyIp = amiData.ip;
console.log(`[PuppeteerPreflight] Detected IP via amiunique.org: ${amiData.ip}`);
}
result.antidetectReady = true;
console.log(`[PuppeteerPreflight] amiunique.org fallback succeeded`);
} catch (amiErr: any) {
console.log(`[PuppeteerPreflight] amiunique.org fallback also failed: ${amiErr.message}`);
// Continue with Dutchie test anyway
result.proxyConnected = true;
result.antidetectReady = true;
}
}
// =========================================================================
// STEP 2: Test Dutchie API access (the real test)
// =========================================================================
const embedUrl = `https://dutchie.com/embedded-menu/${TEST_CNAME}?menuType=rec`;
console.log(`[PuppeteerPreflight] Establishing session at ${embedUrl}...`);
await page.goto(embedUrl, {
waitUntil: 'networkidle2',
timeout: 30000,
});
// Make GraphQL request from browser context
const graphqlResult = await page.evaluate(
async (platformId: string, hash: string) => {
try {
const variables = {
includeEnterpriseSpecials: false,
productsFilter: {
dispensaryId: platformId,
pricingType: 'rec',
Status: 'Active', // CRITICAL: Must be 'Active' per CLAUDE.md
types: [],
useCache: true,
isDefaultSort: true,
sortBy: 'popularSortIdx',
sortDirection: 1,
bypassOnlineThresholds: true,
isKioskMenu: false,
removeProductsBelowOptionThresholds: false,
},
page: 0,
perPage: 10, // Just need a few to prove it works
};
const extensions = {
persistedQuery: {
version: 1,
sha256Hash: hash,
},
};
const qs = new URLSearchParams({
operationName: 'FilteredProducts',
variables: JSON.stringify(variables),
extensions: JSON.stringify(extensions),
});
const url = `https://dutchie.com/api-3/graphql?${qs.toString()}`;
const sessionId = 'preflight-' + Date.now();
const response = await fetch(url, {
method: 'GET',
headers: {
Accept: 'application/json',
'content-type': 'application/json',
'x-dutchie-session': sessionId,
'apollographql-client-name': 'Marketplace (production)',
},
credentials: 'include',
});
if (!response.ok) {
return { error: `HTTP ${response.status}`, products: 0 };
}
const json = await response.json();
if (json.errors) {
return { error: JSON.stringify(json.errors).slice(0, 200), products: 0 };
}
const products = json?.data?.filteredProducts?.products || [];
return { error: null, products: products.length };
} catch (err: any) {
return { error: err.message || 'Unknown error', products: 0 };
}
},
TEST_PLATFORM_ID,
FILTERED_PRODUCTS_HASH
);
result.responseTimeMs = Date.now() - startTime;
if (graphqlResult.error) {
result.error = `GraphQL error: ${graphqlResult.error}`;
console.log(`[PuppeteerPreflight] FAILED - ${result.error}`);
} else if (graphqlResult.products === 0) {
result.error = 'GraphQL returned 0 products';
console.log(`[PuppeteerPreflight] FAILED - No products returned`);
} else {
result.passed = true;
result.productsReturned = graphqlResult.products;
console.log(
`[PuppeteerPreflight] PASSED - Got ${graphqlResult.products} products in ${result.responseTimeMs}ms`
);
if (result.proxyIp) {
console.log(`[PuppeteerPreflight] Browser IP via proxy: ${result.proxyIp}`);
}
}
} catch (err: any) {
result.error = `Browser error: ${err.message || 'Unknown error'}`;
console.log(`[PuppeteerPreflight] FAILED - ${result.error}`);
} finally {
if (browser) {
await browser.close().catch(() => {});
}
}
return result;
}
/**
* Run Puppeteer preflight with retry
* Retries once on failure to handle transient issues
*
* @param crawlRotator - CrawlRotator instance to get proxy from pool
* @param maxRetries - Number of retry attempts (default 1)
*/
export async function runPuppeteerPreflightWithRetry(
crawlRotator?: CrawlRotator,
maxRetries: number = 1
): Promise<PuppeteerPreflightResult> {
let lastResult: PuppeteerPreflightResult | null = null;
for (let attempt = 0; attempt <= maxRetries; attempt++) {
if (attempt > 0) {
console.log(`[PuppeteerPreflight] Retry attempt ${attempt}/${maxRetries}...`);
await new Promise((r) => setTimeout(r, 5000)); // Wait 5s between retries
}
lastResult = await runPuppeteerPreflight(crawlRotator);
if (lastResult.passed) {
return lastResult;
}
}
return lastResult!;
}

View File

@@ -1,566 +1,30 @@
/** /**
* System API Routes * System API Routes (Stub)
* *
* Provides REST API endpoints for system monitoring and control: * The full system routes depend on SyncOrchestrator which was moved to _deprecated.
* - /api/system/sync/* - Sync orchestrator * This stub provides empty routers to maintain backward compatibility.
* - /api/system/dlq/* - Dead-letter queue
* - /api/system/integrity/* - Integrity checks
* - /api/system/fix/* - Auto-fix routines
* - /api/system/alerts/* - System alerts
* - /metrics - Prometheus metrics
* *
* Phase 5: Full Production Sync + Monitoring * Full implementation available at: src/_deprecated/system/routes/index.ts
*/ */
import { Router, Request, Response } from 'express'; import { Router, Request, Response } from 'express';
import { Pool } from 'pg'; import { Pool } from 'pg';
import { import { MetricsService } from '../services';
SyncOrchestrator,
MetricsService,
DLQService,
AlertService,
IntegrityService,
AutoFixService,
} from '../services';
export function createSystemRouter(pool: Pool): Router { export function createSystemRouter(_pool: Pool): Router {
const router = Router(); const router = Router();
// Initialize services // Stub - full sync/dlq/integrity/fix/alerts routes moved to _deprecated
const metrics = new MetricsService(pool); router.get('/status', (_req: Request, res: Response) => {
const dlq = new DLQService(pool);
const alerts = new AlertService(pool);
const integrity = new IntegrityService(pool, alerts);
const autoFix = new AutoFixService(pool, alerts);
const orchestrator = new SyncOrchestrator(pool, metrics, dlq, alerts);
// ============================================================
// SYNC ORCHESTRATOR ENDPOINTS
// ============================================================
/**
* GET /api/system/sync/status
* Get current sync status
*/
router.get('/sync/status', async (_req: Request, res: Response) => {
try {
const status = await orchestrator.getStatus();
res.json(status);
} catch (error) {
console.error('[System] Sync status error:', error);
res.status(500).json({ error: 'Failed to get sync status' });
}
});
/**
* POST /api/system/sync/run
* Trigger a sync run
*/
router.post('/sync/run', async (req: Request, res: Response) => {
try {
const triggeredBy = req.body.triggeredBy || 'api';
const result = await orchestrator.runSync();
res.json({ res.json({
success: true, message: 'System routes temporarily disabled - see _deprecated/system/routes',
triggeredBy, status: 'stub',
metrics: result,
}); });
} catch (error) {
console.error('[System] Sync run error:', error);
res.status(500).json({
success: false,
error: error instanceof Error ? error.message : 'Sync run failed',
});
}
});
/**
* GET /api/system/sync/queue-depth
* Get queue depth information
*/
router.get('/sync/queue-depth', async (_req: Request, res: Response) => {
try {
const depth = await orchestrator.getQueueDepth();
res.json(depth);
} catch (error) {
console.error('[System] Queue depth error:', error);
res.status(500).json({ error: 'Failed to get queue depth' });
}
});
/**
* GET /api/system/sync/health
* Get sync health status
*/
router.get('/sync/health', async (_req: Request, res: Response) => {
try {
const health = await orchestrator.getHealth();
res.status(health.healthy ? 200 : 503).json(health);
} catch (error) {
console.error('[System] Health check error:', error);
res.status(500).json({ healthy: false, error: 'Health check failed' });
}
});
/**
* POST /api/system/sync/pause
* Pause the orchestrator
*/
router.post('/sync/pause', async (req: Request, res: Response) => {
try {
const reason = req.body.reason || 'Manual pause';
await orchestrator.pause(reason);
res.json({ success: true, message: 'Orchestrator paused' });
} catch (error) {
console.error('[System] Pause error:', error);
res.status(500).json({ error: 'Failed to pause orchestrator' });
}
});
/**
* POST /api/system/sync/resume
* Resume the orchestrator
*/
router.post('/sync/resume', async (_req: Request, res: Response) => {
try {
await orchestrator.resume();
res.json({ success: true, message: 'Orchestrator resumed' });
} catch (error) {
console.error('[System] Resume error:', error);
res.status(500).json({ error: 'Failed to resume orchestrator' });
}
});
// ============================================================
// DLQ ENDPOINTS
// ============================================================
/**
* GET /api/system/dlq
* List DLQ payloads
*/
router.get('/dlq', async (req: Request, res: Response) => {
try {
const options = {
status: req.query.status as string,
errorType: req.query.errorType as string,
dispensaryId: req.query.dispensaryId ? parseInt(req.query.dispensaryId as string) : undefined,
limit: req.query.limit ? parseInt(req.query.limit as string) : 50,
offset: req.query.offset ? parseInt(req.query.offset as string) : 0,
};
const result = await dlq.listPayloads(options);
res.json(result);
} catch (error) {
console.error('[System] DLQ list error:', error);
res.status(500).json({ error: 'Failed to list DLQ payloads' });
}
});
/**
* GET /api/system/dlq/stats
* Get DLQ statistics
*/
router.get('/dlq/stats', async (_req: Request, res: Response) => {
try {
const stats = await dlq.getStats();
res.json(stats);
} catch (error) {
console.error('[System] DLQ stats error:', error);
res.status(500).json({ error: 'Failed to get DLQ stats' });
}
});
/**
* GET /api/system/dlq/summary
* Get DLQ summary by error type
*/
router.get('/dlq/summary', async (_req: Request, res: Response) => {
try {
const summary = await dlq.getSummary();
res.json(summary);
} catch (error) {
console.error('[System] DLQ summary error:', error);
res.status(500).json({ error: 'Failed to get DLQ summary' });
}
});
/**
* GET /api/system/dlq/:id
* Get a specific DLQ payload
*/
router.get('/dlq/:id', async (req: Request, res: Response) => {
try {
const payload = await dlq.getPayload(req.params.id);
if (!payload) {
return res.status(404).json({ error: 'Payload not found' });
}
res.json(payload);
} catch (error) {
console.error('[System] DLQ get error:', error);
res.status(500).json({ error: 'Failed to get DLQ payload' });
}
});
/**
* POST /api/system/dlq/:id/retry
* Retry a DLQ payload
*/
router.post('/dlq/:id/retry', async (req: Request, res: Response) => {
try {
const result = await dlq.retryPayload(req.params.id);
if (result.success) {
res.json(result);
} else {
res.status(400).json(result);
}
} catch (error) {
console.error('[System] DLQ retry error:', error);
res.status(500).json({ error: 'Failed to retry payload' });
}
});
/**
* POST /api/system/dlq/:id/abandon
* Abandon a DLQ payload
*/
router.post('/dlq/:id/abandon', async (req: Request, res: Response) => {
try {
const reason = req.body.reason || 'Manually abandoned';
const abandonedBy = req.body.abandonedBy || 'api';
const success = await dlq.abandonPayload(req.params.id, reason, abandonedBy);
res.json({ success });
} catch (error) {
console.error('[System] DLQ abandon error:', error);
res.status(500).json({ error: 'Failed to abandon payload' });
}
});
/**
* POST /api/system/dlq/bulk-retry
* Bulk retry payloads by error type
*/
router.post('/dlq/bulk-retry', async (req: Request, res: Response) => {
try {
const { errorType } = req.body;
if (!errorType) {
return res.status(400).json({ error: 'errorType is required' });
}
const result = await dlq.bulkRetryByErrorType(errorType);
res.json(result);
} catch (error) {
console.error('[System] DLQ bulk retry error:', error);
res.status(500).json({ error: 'Failed to bulk retry' });
}
});
// ============================================================
// INTEGRITY CHECK ENDPOINTS
// ============================================================
/**
* POST /api/system/integrity/run
* Run all integrity checks
*/
router.post('/integrity/run', async (req: Request, res: Response) => {
try {
const triggeredBy = req.body.triggeredBy || 'api';
const result = await integrity.runAllChecks(triggeredBy);
res.json(result);
} catch (error) {
console.error('[System] Integrity run error:', error);
res.status(500).json({ error: 'Failed to run integrity checks' });
}
});
/**
* GET /api/system/integrity/runs
* Get recent integrity check runs
*/
router.get('/integrity/runs', async (req: Request, res: Response) => {
try {
const limit = req.query.limit ? parseInt(req.query.limit as string) : 10;
const runs = await integrity.getRecentRuns(limit);
res.json(runs);
} catch (error) {
console.error('[System] Integrity runs error:', error);
res.status(500).json({ error: 'Failed to get integrity runs' });
}
});
/**
* GET /api/system/integrity/runs/:runId
* Get results for a specific integrity run
*/
router.get('/integrity/runs/:runId', async (req: Request, res: Response) => {
try {
const results = await integrity.getRunResults(req.params.runId);
res.json(results);
} catch (error) {
console.error('[System] Integrity run results error:', error);
res.status(500).json({ error: 'Failed to get run results' });
}
});
// ============================================================
// AUTO-FIX ENDPOINTS
// ============================================================
/**
* GET /api/system/fix/routines
* Get available fix routines
*/
router.get('/fix/routines', (_req: Request, res: Response) => {
try {
const routines = autoFix.getAvailableRoutines();
res.json(routines);
} catch (error) {
console.error('[System] Get routines error:', error);
res.status(500).json({ error: 'Failed to get routines' });
}
});
/**
* POST /api/system/fix/:routine
* Run a fix routine
*/
router.post('/fix/:routine', async (req: Request, res: Response) => {
try {
const routineName = req.params.routine;
const dryRun = req.body.dryRun === true;
const triggeredBy = req.body.triggeredBy || 'api';
const result = await autoFix.runRoutine(routineName as any, triggeredBy, { dryRun });
res.json(result);
} catch (error) {
console.error('[System] Fix routine error:', error);
res.status(500).json({ error: 'Failed to run fix routine' });
}
});
/**
* GET /api/system/fix/runs
* Get recent fix runs
*/
router.get('/fix/runs', async (req: Request, res: Response) => {
try {
const limit = req.query.limit ? parseInt(req.query.limit as string) : 20;
const runs = await autoFix.getRecentRuns(limit);
res.json(runs);
} catch (error) {
console.error('[System] Fix runs error:', error);
res.status(500).json({ error: 'Failed to get fix runs' });
}
});
// ============================================================
// ALERTS ENDPOINTS
// ============================================================
/**
* GET /api/system/alerts
* List alerts
*/
router.get('/alerts', async (req: Request, res: Response) => {
try {
const options = {
status: req.query.status as any,
severity: req.query.severity as any,
type: req.query.type as string,
limit: req.query.limit ? parseInt(req.query.limit as string) : 50,
offset: req.query.offset ? parseInt(req.query.offset as string) : 0,
};
const result = await alerts.listAlerts(options);
res.json(result);
} catch (error) {
console.error('[System] Alerts list error:', error);
res.status(500).json({ error: 'Failed to list alerts' });
}
});
/**
* GET /api/system/alerts/active
* Get active alerts
*/
router.get('/alerts/active', async (_req: Request, res: Response) => {
try {
const activeAlerts = await alerts.getActiveAlerts();
res.json(activeAlerts);
} catch (error) {
console.error('[System] Active alerts error:', error);
res.status(500).json({ error: 'Failed to get active alerts' });
}
});
/**
* GET /api/system/alerts/summary
* Get alert summary
*/
router.get('/alerts/summary', async (_req: Request, res: Response) => {
try {
const summary = await alerts.getSummary();
res.json(summary);
} catch (error) {
console.error('[System] Alerts summary error:', error);
res.status(500).json({ error: 'Failed to get alerts summary' });
}
});
/**
* POST /api/system/alerts/:id/acknowledge
* Acknowledge an alert
*/
router.post('/alerts/:id/acknowledge', async (req: Request, res: Response) => {
try {
const alertId = parseInt(req.params.id);
const acknowledgedBy = req.body.acknowledgedBy || 'api';
const success = await alerts.acknowledgeAlert(alertId, acknowledgedBy);
res.json({ success });
} catch (error) {
console.error('[System] Acknowledge alert error:', error);
res.status(500).json({ error: 'Failed to acknowledge alert' });
}
});
/**
* POST /api/system/alerts/:id/resolve
* Resolve an alert
*/
router.post('/alerts/:id/resolve', async (req: Request, res: Response) => {
try {
const alertId = parseInt(req.params.id);
const resolvedBy = req.body.resolvedBy || 'api';
const success = await alerts.resolveAlert(alertId, resolvedBy);
res.json({ success });
} catch (error) {
console.error('[System] Resolve alert error:', error);
res.status(500).json({ error: 'Failed to resolve alert' });
}
});
/**
* POST /api/system/alerts/bulk-acknowledge
* Bulk acknowledge alerts
*/
router.post('/alerts/bulk-acknowledge', async (req: Request, res: Response) => {
try {
const { ids, acknowledgedBy } = req.body;
if (!ids || !Array.isArray(ids)) {
return res.status(400).json({ error: 'ids array is required' });
}
const count = await alerts.bulkAcknowledge(ids, acknowledgedBy || 'api');
res.json({ acknowledged: count });
} catch (error) {
console.error('[System] Bulk acknowledge error:', error);
res.status(500).json({ error: 'Failed to bulk acknowledge' });
}
});
// ============================================================
// METRICS ENDPOINTS
// ============================================================
/**
* GET /api/system/metrics
* Get all current metrics
*/
router.get('/metrics', async (_req: Request, res: Response) => {
try {
const allMetrics = await metrics.getAllMetrics();
res.json(allMetrics);
} catch (error) {
console.error('[System] Metrics error:', error);
res.status(500).json({ error: 'Failed to get metrics' });
}
});
/**
* GET /api/system/metrics/:name
* Get a specific metric
*/
router.get('/metrics/:name', async (req: Request, res: Response) => {
try {
const metric = await metrics.getMetric(req.params.name);
if (!metric) {
return res.status(404).json({ error: 'Metric not found' });
}
res.json(metric);
} catch (error) {
console.error('[System] Metric error:', error);
res.status(500).json({ error: 'Failed to get metric' });
}
});
/**
* GET /api/system/metrics/:name/history
* Get metric time series
*/
router.get('/metrics/:name/history', async (req: Request, res: Response) => {
try {
const hours = req.query.hours ? parseInt(req.query.hours as string) : 24;
const history = await metrics.getMetricHistory(req.params.name, hours);
res.json(history);
} catch (error) {
console.error('[System] Metric history error:', error);
res.status(500).json({ error: 'Failed to get metric history' });
}
});
/**
* GET /api/system/errors
* Get error summary
*/
router.get('/errors', async (_req: Request, res: Response) => {
try {
const summary = await metrics.getErrorSummary();
res.json(summary);
} catch (error) {
console.error('[System] Error summary error:', error);
res.status(500).json({ error: 'Failed to get error summary' });
}
});
/**
* GET /api/system/errors/recent
* Get recent errors
*/
router.get('/errors/recent', async (req: Request, res: Response) => {
try {
const limit = req.query.limit ? parseInt(req.query.limit as string) : 50;
const errorType = req.query.type as string;
const errors = await metrics.getRecentErrors(limit, errorType);
res.json(errors);
} catch (error) {
console.error('[System] Recent errors error:', error);
res.status(500).json({ error: 'Failed to get recent errors' });
}
});
/**
* POST /api/system/errors/acknowledge
* Acknowledge errors
*/
router.post('/errors/acknowledge', async (req: Request, res: Response) => {
try {
const { ids, acknowledgedBy } = req.body;
if (!ids || !Array.isArray(ids)) {
return res.status(400).json({ error: 'ids array is required' });
}
const count = await metrics.acknowledgeErrors(ids, acknowledgedBy || 'api');
res.json({ acknowledged: count });
} catch (error) {
console.error('[System] Acknowledge errors error:', error);
res.status(500).json({ error: 'Failed to acknowledge errors' });
}
}); });
return router; return router;
} }
/**
* Create Prometheus metrics endpoint (standalone)
*/
export function createPrometheusRouter(pool: Pool): Router { export function createPrometheusRouter(pool: Pool): Router {
const router = Router(); const router = Router();
const metrics = new MetricsService(pool); const metrics = new MetricsService(pool);

View File

@@ -4,7 +4,7 @@
* Phase 5: Full Production Sync + Monitoring * Phase 5: Full Production Sync + Monitoring
*/ */
export { SyncOrchestrator, type SyncStatus, type QueueDepth, type SyncRunMetrics, type OrchestratorStatus } from './sync-orchestrator'; // SyncOrchestrator moved to _deprecated (depends on hydration module)
export { MetricsService, ERROR_TYPES, type Metric, type MetricTimeSeries, type ErrorBucket, type ErrorType } from './metrics'; export { MetricsService, ERROR_TYPES, type Metric, type MetricTimeSeries, type ErrorBucket, type ErrorType } from './metrics';
export { DLQService, type DLQPayload, type DLQStats } from './dlq'; export { DLQService, type DLQPayload, type DLQStats } from './dlq';
export { AlertService, type SystemAlert, type AlertSummary, type AlertSeverity, type AlertStatus } from './alerts'; export { AlertService, type SystemAlert, type AlertSummary, type AlertSeverity, type AlertStatus } from './alerts';

View File

@@ -4,9 +4,9 @@
* Exports all task handlers for the task worker. * Exports all task handlers for the task worker.
*/ */
export { handleProductRefresh } from './product-refresh';
export { handleProductDiscovery } from './product-discovery'; export { handleProductDiscovery } from './product-discovery';
export { handleProductRefresh } from './product-refresh';
export { handleStoreDiscovery } from './store-discovery'; export { handleStoreDiscovery } from './store-discovery';
export { handleEntryPointDiscovery } from './entry-point-discovery'; export { handleEntryPointDiscovery } from './entry-point-discovery';
export { handleAnalyticsRefresh } from './analytics-refresh'; export { handleAnalyticsRefresh } from './analytics-refresh';
export { handleProxyTest } from './proxy-test'; export { handleWhoami } from './whoami';

View File

@@ -1,51 +0,0 @@
/**
* Proxy Test Handler
* Tests proxy connectivity by fetching public IP via ipify
*/
import { TaskContext, TaskResult } from '../task-worker';
import { execSync } from 'child_process';
export async function handleProxyTest(ctx: TaskContext): Promise<TaskResult> {
const { pool } = ctx;
console.log('[ProxyTest] Testing proxy connection...');
try {
// Get active proxy from DB
const proxyResult = await pool.query(`
SELECT host, port, username, password
FROM proxies
WHERE is_active = true
LIMIT 1
`);
if (proxyResult.rows.length === 0) {
return { success: false, error: 'No active proxy configured' };
}
const p = proxyResult.rows[0];
const proxyUrl = p.username
? `http://${p.username}:${p.password}@${p.host}:${p.port}`
: `http://${p.host}:${p.port}`;
console.log(`[ProxyTest] Using proxy: ${p.host}:${p.port}`);
// Fetch IP via proxy
const cmd = `curl -s --proxy '${proxyUrl}' 'https://api.ipify.org?format=json'`;
const output = execSync(cmd, { timeout: 30000 }).toString().trim();
const data = JSON.parse(output);
console.log(`[ProxyTest] Proxy IP: ${data.ip}`);
return {
success: true,
proxyIp: data.ip,
proxyHost: p.host,
proxyPort: p.port,
};
} catch (error: any) {
console.error('[ProxyTest] Error:', error.message);
return { success: false, error: error.message };
}
}

View File

@@ -0,0 +1,80 @@
/**
* WhoAmI Handler
* Tests proxy connectivity and anti-detect by fetching public IP
* Reports: proxy IP, fingerprint info, and connection status
*/
import { TaskContext, TaskResult } from '../task-worker';
import { execSync } from 'child_process';
export async function handleWhoami(ctx: TaskContext): Promise<TaskResult> {
const { pool, crawlRotator } = ctx;
console.log('[WhoAmI] Testing proxy and anti-detect...');
try {
// Use the preflight check which tests proxy + anti-detect
if (crawlRotator) {
const preflight = await crawlRotator.preflight();
if (!preflight.passed) {
return {
success: false,
error: preflight.error || 'Preflight check failed',
proxyAvailable: preflight.proxyAvailable,
proxyConnected: preflight.proxyConnected,
antidetectReady: preflight.antidetectReady,
};
}
console.log(`[WhoAmI] Proxy IP: ${preflight.proxyIp}, Response: ${preflight.responseTimeMs}ms`);
console.log(`[WhoAmI] Fingerprint: ${preflight.fingerprint?.browserName}/${preflight.fingerprint?.deviceCategory}`);
return {
success: true,
proxyIp: preflight.proxyIp,
responseTimeMs: preflight.responseTimeMs,
fingerprint: preflight.fingerprint,
proxyAvailable: preflight.proxyAvailable,
proxyConnected: preflight.proxyConnected,
antidetectReady: preflight.antidetectReady,
};
}
// Fallback: Direct proxy test without CrawlRotator
const proxyResult = await pool.query(`
SELECT host, port, username, password
FROM proxies
WHERE is_active = true
LIMIT 1
`);
if (proxyResult.rows.length === 0) {
return { success: false, error: 'No active proxy configured' };
}
const p = proxyResult.rows[0];
const proxyUrl = p.username
? `http://${p.username}:${p.password}@${p.host}:${p.port}`
: `http://${p.host}:${p.port}`;
console.log(`[WhoAmI] Using proxy: ${p.host}:${p.port}`);
// Fetch IP via proxy
const cmd = `curl -s --proxy '${proxyUrl}' 'https://api.ipify.org?format=json'`;
const output = execSync(cmd, { timeout: 30000 }).toString().trim();
const data = JSON.parse(output);
console.log(`[WhoAmI] Proxy IP: ${data.ip}`);
return {
success: true,
proxyIp: data.ip,
proxyHost: p.host,
proxyPort: p.port,
};
} catch (error: any) {
console.error('[WhoAmI] Error:', error.message);
return { success: false, error: error.message };
}
}

View File

@@ -17,8 +17,8 @@ export {
export { TaskWorker, TaskContext, TaskResult } from './task-worker'; export { TaskWorker, TaskContext, TaskResult } from './task-worker';
export { export {
handleProductRefresh,
handleProductDiscovery, handleProductDiscovery,
handleProductRefresh,
handleStoreDiscovery, handleStoreDiscovery,
handleEntryPointDiscovery, handleEntryPointDiscovery,
handleAnalyticsRefresh, handleAnalyticsRefresh,

View File

@@ -24,15 +24,16 @@ async function tableExists(tableName: string): Promise<boolean> {
// Per TASK_WORKFLOW_2024-12-10.md: Task roles // Per TASK_WORKFLOW_2024-12-10.md: Task roles
// payload_fetch: Hits Dutchie API, saves raw payload to filesystem // payload_fetch: Hits Dutchie API, saves raw payload to filesystem
// product_refresh: Reads local payload, normalizes, upserts to DB // product_discovery: Main product crawl handler
// product_refresh: Legacy role (deprecated but kept for compatibility)
export type TaskRole = export type TaskRole =
| 'store_discovery' | 'store_discovery'
| 'entry_point_discovery' | 'entry_point_discovery'
| 'product_discovery' | 'product_discovery'
| 'payload_fetch' // NEW: Fetches from API, saves to disk | 'payload_fetch' // Fetches from API, saves to disk
| 'product_refresh' // CHANGED: Now reads from local payload | 'product_refresh' // DEPRECATED: Use product_discovery instead
| 'analytics_refresh' | 'analytics_refresh'
| 'proxy_test'; // Tests proxy connectivity via ipify | 'whoami'; // Tests proxy + anti-detect connectivity
export type TaskStatus = export type TaskStatus =
| 'pending' | 'pending'
@@ -51,6 +52,7 @@ export interface WorkerTask {
platform: string | null; platform: string | null;
status: TaskStatus; status: TaskStatus;
priority: number; priority: number;
method: 'curl' | 'http' | null; // Transport method: curl=axios/proxy, http=Puppeteer/browser
scheduled_for: Date | null; scheduled_for: Date | null;
worker_id: string | null; worker_id: string | null;
claimed_at: Date | null; claimed_at: Date | null;
@@ -152,23 +154,33 @@ class TaskService {
* Claim a task atomically for a worker * Claim a task atomically for a worker
* If role is null, claims ANY available task (role-agnostic worker) * If role is null, claims ANY available task (role-agnostic worker)
* Returns null if task pool is paused. * Returns null if task pool is paused.
*
* @param role - Task role to claim, or null for any task
* @param workerId - Worker ID claiming the task
* @param curlPassed - Whether worker passed curl preflight (default true for backward compat)
* @param httpPassed - Whether worker passed http/Puppeteer preflight (default false)
*/ */
async claimTask(role: TaskRole | null, workerId: string): Promise<WorkerTask | null> { async claimTask(
role: TaskRole | null,
workerId: string,
curlPassed: boolean = true,
httpPassed: boolean = false
): Promise<WorkerTask | null> {
// Check if task pool is paused - don't claim any tasks // Check if task pool is paused - don't claim any tasks
if (isTaskPoolPaused()) { if (isTaskPoolPaused()) {
return null; return null;
} }
if (role) { if (role) {
// Role-specific claiming - use the SQL function // Role-specific claiming - use the SQL function with preflight capabilities
const result = await pool.query( const result = await pool.query(
`SELECT * FROM claim_task($1, $2)`, `SELECT * FROM claim_task($1, $2, $3, $4)`,
[role, workerId] [role, workerId, curlPassed, httpPassed]
); );
return (result.rows[0] as WorkerTask) || null; return (result.rows[0] as WorkerTask) || null;
} }
// Role-agnostic claiming - claim ANY pending task // Role-agnostic claiming - claim ANY pending task matching worker capabilities
const result = await pool.query(` const result = await pool.query(`
UPDATE worker_tasks UPDATE worker_tasks
SET SET
@@ -179,6 +191,12 @@ class TaskService {
SELECT id FROM worker_tasks SELECT id FROM worker_tasks
WHERE status = 'pending' WHERE status = 'pending'
AND (scheduled_for IS NULL OR scheduled_for <= NOW()) AND (scheduled_for IS NULL OR scheduled_for <= NOW())
-- Method compatibility: worker must have passed the required preflight
AND (
method IS NULL -- No preference, any worker can claim
OR (method = 'curl' AND $2 = TRUE)
OR (method = 'http' AND $3 = TRUE)
)
-- Exclude stores that already have an active task -- Exclude stores that already have an active task
AND (dispensary_id IS NULL OR dispensary_id NOT IN ( AND (dispensary_id IS NULL OR dispensary_id NOT IN (
SELECT dispensary_id FROM worker_tasks SELECT dispensary_id FROM worker_tasks
@@ -190,7 +208,7 @@ class TaskService {
FOR UPDATE SKIP LOCKED FOR UPDATE SKIP LOCKED
) )
RETURNING * RETURNING *
`, [workerId]); `, [workerId, curlPassed, httpPassed]);
return (result.rows[0] as WorkerTask) || null; return (result.rows[0] as WorkerTask) || null;
} }
@@ -231,6 +249,24 @@ class TaskService {
); );
} }
/**
* Release a claimed task back to pending (e.g., when preflight fails)
* This allows another worker to pick it up.
*/
async releaseTask(taskId: number): Promise<void> {
await pool.query(
`UPDATE worker_tasks
SET status = 'pending',
worker_id = NULL,
claimed_at = NULL,
started_at = NULL,
updated_at = NOW()
WHERE id = $1 AND status IN ('claimed', 'running')`,
[taskId]
);
console.log(`[TaskService] Task ${taskId} released back to pending`);
}
/** /**
* Mark a task as failed, with auto-retry if under max_retries * Mark a task as failed, with auto-retry if under max_retries
* Returns true if task was re-queued for retry, false if permanently failed * Returns true if task was re-queued for retry, false if permanently failed

View File

@@ -51,6 +51,10 @@ import os from 'os';
import { CrawlRotator } from '../services/crawl-rotator'; import { CrawlRotator } from '../services/crawl-rotator';
import { setCrawlRotator } from '../platforms/dutchie'; import { setCrawlRotator } from '../platforms/dutchie';
// Dual-transport preflight system
import { runCurlPreflight, CurlPreflightResult } from '../services/curl-preflight';
import { runPuppeteerPreflightWithRetry, PuppeteerPreflightResult } from '../services/puppeteer-preflight';
// Task handlers by role // Task handlers by role
// Per TASK_WORKFLOW_2024-12-10.md: payload_fetch and product_refresh are now separate // Per TASK_WORKFLOW_2024-12-10.md: payload_fetch and product_refresh are now separate
import { handlePayloadFetch } from './handlers/payload-fetch'; import { handlePayloadFetch } from './handlers/payload-fetch';
@@ -59,7 +63,7 @@ import { handleProductDiscovery } from './handlers/product-discovery';
import { handleStoreDiscovery } from './handlers/store-discovery'; import { handleStoreDiscovery } from './handlers/store-discovery';
import { handleEntryPointDiscovery } from './handlers/entry-point-discovery'; import { handleEntryPointDiscovery } from './handlers/entry-point-discovery';
import { handleAnalyticsRefresh } from './handlers/analytics-refresh'; import { handleAnalyticsRefresh } from './handlers/analytics-refresh';
import { handleProxyTest } from './handlers/proxy-test'; import { handleWhoami } from './handlers/whoami';
const POLL_INTERVAL_MS = parseInt(process.env.POLL_INTERVAL_MS || '5000'); const POLL_INTERVAL_MS = parseInt(process.env.POLL_INTERVAL_MS || '5000');
const HEARTBEAT_INTERVAL_MS = parseInt(process.env.HEARTBEAT_INTERVAL_MS || '30000'); const HEARTBEAT_INTERVAL_MS = parseInt(process.env.HEARTBEAT_INTERVAL_MS || '30000');
@@ -111,6 +115,7 @@ export interface TaskContext {
workerId: string; workerId: string;
task: WorkerTask; task: WorkerTask;
heartbeat: () => Promise<void>; heartbeat: () => Promise<void>;
crawlRotator?: CrawlRotator;
} }
export interface TaskResult { export interface TaskResult {
@@ -125,16 +130,17 @@ export interface TaskResult {
type TaskHandler = (ctx: TaskContext) => Promise<TaskResult>; type TaskHandler = (ctx: TaskContext) => Promise<TaskResult>;
// Per TASK_WORKFLOW_2024-12-10.md: Handler registry // Per TASK_WORKFLOW_2024-12-10.md: Handler registry
// payload_fetch: Fetches from Dutchie API, saves to disk, chains to product_refresh // payload_fetch: Fetches from Dutchie API, saves to disk
// product_refresh: Reads local payload, normalizes, upserts to DB // product_refresh: Reads local payload, normalizes, upserts to DB
// product_discovery: Main handler for product crawling
const TASK_HANDLERS: Record<TaskRole, TaskHandler> = { const TASK_HANDLERS: Record<TaskRole, TaskHandler> = {
payload_fetch: handlePayloadFetch, // NEW: API fetch -> disk payload_fetch: handlePayloadFetch, // API fetch -> disk
product_refresh: handleProductRefresh, // CHANGED: disk -> DB product_refresh: handleProductRefresh, // disk -> DB
product_discovery: handleProductDiscovery, product_discovery: handleProductDiscovery,
store_discovery: handleStoreDiscovery, store_discovery: handleStoreDiscovery,
entry_point_discovery: handleEntryPointDiscovery, entry_point_discovery: handleEntryPointDiscovery,
analytics_refresh: handleAnalyticsRefresh, analytics_refresh: handleAnalyticsRefresh,
proxy_test: handleProxyTest, // Tests proxy via ipify whoami: handleWhoami, // Tests proxy + anti-detect
}; };
/** /**
@@ -188,6 +194,21 @@ export class TaskWorker {
private isBackingOff: boolean = false; private isBackingOff: boolean = false;
private backoffReason: string | null = null; private backoffReason: string | null = null;
// ==========================================================================
// DUAL-TRANSPORT PREFLIGHT STATUS
// ==========================================================================
// Workers run BOTH preflights on startup:
// - curl: axios/proxy transport - fast, for simple API calls
// - http: Puppeteer/browser transport - anti-detect, for Dutchie GraphQL
//
// Task claiming checks method compatibility - worker must have passed
// the preflight for the task's required method.
// ==========================================================================
private preflightCurlPassed: boolean = false;
private preflightHttpPassed: boolean = false;
private preflightCurlResult: CurlPreflightResult | null = null;
private preflightHttpResult: PuppeteerPreflightResult | null = null;
constructor(role: TaskRole | null = null, workerId?: string) { constructor(role: TaskRole | null = null, workerId?: string) {
this.pool = getPool(); this.pool = getPool();
this.role = role; this.role = role;
@@ -350,6 +371,117 @@ export class TaskWorker {
} }
} }
/**
* Run dual-transport preflights on startup
* Tests both curl (axios/proxy) and http (Puppeteer/browser) transport methods.
* Results are reported to worker_registry and used for task claiming.
*
* NOTE: All current tasks require 'http' method, so http preflight must pass
* for the worker to claim any tasks. Curl preflight is for future use.
*/
private async runDualPreflights(): Promise<void> {
console.log(`[TaskWorker] Running dual-transport preflights...`);
// Run both preflights in parallel for efficiency
const [curlResult, httpResult] = await Promise.all([
runCurlPreflight(this.crawlRotator).catch((err): CurlPreflightResult => ({
method: 'curl',
passed: false,
proxyAvailable: false,
proxyConnected: false,
antidetectReady: false,
proxyIp: null,
fingerprint: null,
error: `Preflight error: ${err.message}`,
responseTimeMs: null,
})),
runPuppeteerPreflightWithRetry(this.crawlRotator, 1).catch((err): PuppeteerPreflightResult => ({
method: 'http',
passed: false,
proxyAvailable: false,
proxyConnected: false,
antidetectReady: false,
proxyIp: null,
fingerprint: null,
error: `Preflight error: ${err.message}`,
responseTimeMs: null,
productsReturned: 0,
})),
]);
// Store results
this.preflightCurlResult = curlResult;
this.preflightHttpResult = httpResult;
this.preflightCurlPassed = curlResult.passed;
this.preflightHttpPassed = httpResult.passed;
// Log results
console.log(`[TaskWorker] CURL preflight: ${curlResult.passed ? 'PASSED' : 'FAILED'}${curlResult.error ? ` - ${curlResult.error}` : ''}`);
console.log(`[TaskWorker] HTTP preflight: ${httpResult.passed ? 'PASSED' : 'FAILED'}${httpResult.error ? ` - ${httpResult.error}` : ''}`);
if (httpResult.passed && httpResult.productsReturned) {
console.log(`[TaskWorker] HTTP preflight returned ${httpResult.productsReturned} products from test store`);
}
// Report to worker_registry via API
await this.reportPreflightStatus();
// Since all tasks require 'http', warn if http preflight failed
if (!this.preflightHttpPassed) {
console.warn(`[TaskWorker] WARNING: HTTP preflight failed - this worker cannot claim any tasks!`);
console.warn(`[TaskWorker] Error: ${httpResult.error}`);
}
}
/**
* Report preflight status to worker_registry
* Function signature: update_worker_preflight(worker_id, transport, status, ip, response_ms, error, fingerprint)
*/
private async reportPreflightStatus(): Promise<void> {
try {
// Update worker_registry directly via SQL (more reliable than API)
// CURL preflight - includes IP address
await this.pool.query(`
SELECT update_worker_preflight($1, 'curl', $2, $3, $4, $5, $6)
`, [
this.workerId,
this.preflightCurlPassed ? 'passed' : 'failed',
this.preflightCurlResult?.proxyIp || null,
this.preflightCurlResult?.responseTimeMs || null,
this.preflightCurlResult?.error || null,
null, // No fingerprint for curl
]);
// HTTP preflight - includes IP, fingerprint, and timezone data
const httpFingerprint = this.preflightHttpResult ? {
...this.preflightHttpResult.fingerprint,
detectedTimezone: (this.preflightHttpResult as any).detectedTimezone,
detectedLocation: (this.preflightHttpResult as any).detectedLocation,
productsReturned: this.preflightHttpResult.productsReturned,
botDetection: (this.preflightHttpResult as any).botDetection,
} : null;
await this.pool.query(`
SELECT update_worker_preflight($1, 'http', $2, $3, $4, $5, $6)
`, [
this.workerId,
this.preflightHttpPassed ? 'passed' : 'failed',
this.preflightHttpResult?.proxyIp || null,
this.preflightHttpResult?.responseTimeMs || null,
this.preflightHttpResult?.error || null,
httpFingerprint ? JSON.stringify(httpFingerprint) : null,
]);
console.log(`[TaskWorker] Preflight status reported to worker_registry`);
if (this.preflightHttpResult?.proxyIp) {
console.log(`[TaskWorker] HTTP IP: ${this.preflightHttpResult.proxyIp}, Timezone: ${(this.preflightHttpResult as any).detectedTimezone || 'unknown'}`);
}
} catch (err: any) {
// Non-fatal - worker can still function
console.warn(`[TaskWorker] Could not report preflight status: ${err.message}`);
}
}
/** /**
* Register worker with the registry (get friendly name) * Register worker with the registry (get friendly name)
*/ */
@@ -493,11 +625,15 @@ export class TaskWorker {
// Register with the API to get a friendly name // Register with the API to get a friendly name
await this.register(); await this.register();
// Run dual-transport preflights
await this.runDualPreflights();
// Start registry heartbeat // Start registry heartbeat
this.startRegistryHeartbeat(); this.startRegistryHeartbeat();
const roleMsg = this.role ? `for role: ${this.role}` : '(role-agnostic - any task)'; const roleMsg = this.role ? `for role: ${this.role}` : '(role-agnostic - any task)';
console.log(`[TaskWorker] ${this.friendlyName} starting ${roleMsg} (max ${this.maxConcurrentTasks} concurrent tasks)`); const preflightMsg = `curl=${this.preflightCurlPassed ? '✓' : '✗'} http=${this.preflightHttpPassed ? '✓' : '✗'}`;
console.log(`[TaskWorker] ${this.friendlyName} starting ${roleMsg} (${preflightMsg}, max ${this.maxConcurrentTasks} concurrent tasks)`);
while (this.isRunning) { while (this.isRunning) {
try { try {
@@ -551,10 +687,36 @@ export class TaskWorker {
// Try to claim more tasks if we have capacity // Try to claim more tasks if we have capacity
if (this.canAcceptMoreTasks()) { if (this.canAcceptMoreTasks()) {
const task = await taskService.claimTask(this.role, this.workerId); // Pass preflight capabilities to only claim compatible tasks
const task = await taskService.claimTask(
this.role,
this.workerId,
this.preflightCurlPassed,
this.preflightHttpPassed
);
if (task) { if (task) {
console.log(`[TaskWorker] ${this.friendlyName} claimed task ${task.id} (${task.role}) [${this.activeTasks.size + 1}/${this.maxConcurrentTasks}]`); console.log(`[TaskWorker] ${this.friendlyName} claimed task ${task.id} (${task.role}) [${this.activeTasks.size + 1}/${this.maxConcurrentTasks}]`);
// =================================================================
// PREFLIGHT CHECK - CRITICAL: Worker MUST pass before task execution
// Verifies: 1) Proxy available 2) Proxy connected 3) Anti-detect ready
// =================================================================
const preflight = await this.crawlRotator.preflight();
if (!preflight.passed) {
console.log(`[TaskWorker] ${this.friendlyName} PREFLIGHT FAILED for task ${task.id}: ${preflight.error}`);
console.log(`[TaskWorker] Releasing task ${task.id} back to pending - worker cannot proceed without proxy/anti-detect`);
// Release task back to pending so another worker can pick it up
await taskService.releaseTask(task.id);
// Wait before trying again - give proxies time to recover
await this.sleep(30000); // 30 second wait on preflight failure
return;
}
console.log(`[TaskWorker] ${this.friendlyName} preflight PASSED for task ${task.id} (proxy: ${preflight.proxyIp}, ${preflight.responseTimeMs}ms)`);
this.activeTasks.set(task.id, task); this.activeTasks.set(task.id, task);
// Start task in background (don't await) // Start task in background (don't await)
@@ -611,6 +773,7 @@ export class TaskWorker {
heartbeat: async () => { heartbeat: async () => {
await taskService.heartbeat(task.id); await taskService.heartbeat(task.id);
}, },
crawlRotator: this.crawlRotator,
}; };
// Execute the task // Execute the task
@@ -716,6 +879,8 @@ export class TaskWorker {
maxConcurrentTasks: number; maxConcurrentTasks: number;
isBackingOff: boolean; isBackingOff: boolean;
backoffReason: string | null; backoffReason: string | null;
preflightCurlPassed: boolean;
preflightHttpPassed: boolean;
} { } {
return { return {
workerId: this.workerId, workerId: this.workerId,
@@ -726,6 +891,8 @@ export class TaskWorker {
maxConcurrentTasks: this.maxConcurrentTasks, maxConcurrentTasks: this.maxConcurrentTasks,
isBackingOff: this.isBackingOff, isBackingOff: this.isBackingOff,
backoffReason: this.backoffReason, backoffReason: this.backoffReason,
preflightCurlPassed: this.preflightCurlPassed,
preflightHttpPassed: this.preflightHttpPassed,
}; };
} }
} }
@@ -742,8 +909,8 @@ async function main(): Promise<void> {
'store_discovery', 'store_discovery',
'entry_point_discovery', 'entry_point_discovery',
'product_discovery', 'product_discovery',
'payload_fetch', // NEW: Fetches from API, saves to disk 'payload_fetch', // Fetches from API, saves to disk
'product_refresh', // CHANGED: Reads from disk, processes to DB 'product_refresh', // Reads from disk, processes to DB
'analytics_refresh', 'analytics_refresh',
]; ];

180
backend/test-intercept.js Normal file
View File

@@ -0,0 +1,180 @@
/**
* Stealth Browser Payload Capture - Direct GraphQL Injection
*
* Uses the browser session to make GraphQL requests that look organic.
* Adds proper headers matching what Dutchie's frontend sends.
*/
const puppeteer = require('puppeteer-extra');
const StealthPlugin = require('puppeteer-extra-plugin-stealth');
const fs = require('fs');
puppeteer.use(StealthPlugin());
async function capturePayload(config) {
const {
dispensaryId = null,
platformId,
cName,
outputPath = `/tmp/payload_${cName}_${Date.now()}.json`,
} = config;
const browser = await puppeteer.launch({
headless: 'new',
args: ['--no-sandbox', '--disable-setuid-sandbox']
});
const page = await browser.newPage();
// Establish session by visiting the embedded menu
const embedUrl = `https://dutchie.com/embedded-menu/${cName}?menuType=rec`;
console.log(`[Capture] Establishing session at ${embedUrl}...`);
await page.goto(embedUrl, {
waitUntil: 'networkidle2',
timeout: 60000
});
console.log('[Capture] Session established, fetching ALL products...');
// Fetch all products using GET requests with proper headers
const result = await page.evaluate(async (platformId, cName) => {
const allProducts = [];
const logs = [];
let pageNum = 0;
const perPage = 100;
let totalCount = 0;
const sessionId = 'browser-session-' + Date.now();
try {
while (pageNum < 30) { // Max 30 pages = 3000 products
const variables = {
includeEnterpriseSpecials: false,
productsFilter: {
dispensaryId: platformId,
pricingType: 'rec',
Status: 'Active', // 'Active' for in-stock products per CLAUDE.md
types: [],
useCache: true,
isDefaultSort: true,
sortBy: 'popularSortIdx',
sortDirection: 1,
bypassOnlineThresholds: true,
isKioskMenu: false,
removeProductsBelowOptionThresholds: false,
},
page: pageNum,
perPage: perPage,
};
const extensions = {
persistedQuery: {
version: 1,
sha256Hash: 'ee29c060826dc41c527e470e9ae502c9b2c169720faa0a9f5d25e1b9a530a4a0'
}
};
// Build GET URL like the browser does
const qs = new URLSearchParams({
operationName: 'FilteredProducts',
variables: JSON.stringify(variables),
extensions: JSON.stringify(extensions)
});
const url = `https://dutchie.com/api-3/graphql?${qs.toString()}`;
const response = await fetch(url, {
method: 'GET',
headers: {
'Accept': 'application/json',
'content-type': 'application/json',
'x-dutchie-session': sessionId,
'apollographql-client-name': 'Marketplace (production)',
},
credentials: 'include'
});
logs.push(`Page ${pageNum}: HTTP ${response.status}`);
if (!response.ok) {
const text = await response.text();
logs.push(`HTTP error: ${response.status} - ${text.slice(0, 200)}`);
break;
}
const json = await response.json();
if (json.errors) {
logs.push(`GraphQL error: ${JSON.stringify(json.errors).slice(0, 200)}`);
break;
}
const data = json?.data?.filteredProducts;
if (!data || !data.products) {
logs.push('No products in response');
break;
}
const products = data.products;
allProducts.push(...products);
if (pageNum === 0) {
totalCount = data.queryInfo?.totalCount || 0;
logs.push(`Total reported: ${totalCount}`);
}
logs.push(`Got ${products.length} products (total: ${allProducts.length}/${totalCount})`);
if (allProducts.length >= totalCount || products.length < perPage) {
break;
}
pageNum++;
// Small delay between pages to be polite
await new Promise(r => setTimeout(r, 200));
}
} catch (err) {
logs.push(`Error: ${err.message}`);
}
return { products: allProducts, totalCount, logs };
}, platformId, cName);
await browser.close();
// Print logs from browser context
result.logs.forEach(log => console.log(`[Browser] ${log}`));
console.log(`[Capture] Got ${result.products.length} products (API reported ${result.totalCount})`);
const payload = {
dispensaryId: dispensaryId,
platformId: platformId,
cName,
fetchedAt: new Date().toISOString(),
productCount: result.products.length,
products: result.products,
};
fs.writeFileSync(outputPath, JSON.stringify(payload, null, 2));
console.log(`\n=== Capture Complete ===`);
console.log(`Total products: ${result.products.length}`);
console.log(`Saved to: ${outputPath}`);
console.log(`File size: ${(fs.statSync(outputPath).size / 1024).toFixed(1)} KB`);
return payload;
}
// Run
(async () => {
const payload = await capturePayload({
cName: 'AZ-Deeply-Rooted',
platformId: '6405ef617056e8014d79101b',
});
if (payload.products.length > 0) {
const sample = payload.products[0];
console.log(`\nSample: ${sample.Name || sample.name} - ${sample.brand?.name || sample.brandName}`);
}
})().catch(console.error);

View File

@@ -14,5 +14,5 @@
"allowSyntheticDefaultImports": true "allowSyntheticDefaultImports": true
}, },
"include": ["src/**/*"], "include": ["src/**/*"],
"exclude": ["node_modules", "dist", "src/**/*.test.ts", "src/**/__tests__/**"] "exclude": ["node_modules", "dist", "src/**/*.test.ts", "src/**/__tests__/**", "src/_deprecated/**"]
} }

View File

@@ -48,6 +48,17 @@ interface Worker {
seconds_since_heartbeat: number; seconds_since_heartbeat: number;
decommission_requested?: boolean; decommission_requested?: boolean;
decommission_reason?: string; decommission_reason?: string;
// Dual-transport preflight status
preflight_curl_status?: 'pending' | 'passed' | 'failed' | 'skipped';
preflight_http_status?: 'pending' | 'passed' | 'failed' | 'skipped';
preflight_curl_at?: string;
preflight_http_at?: string;
preflight_curl_error?: string;
preflight_http_error?: string;
preflight_curl_ms?: number;
preflight_http_ms?: number;
can_curl?: boolean;
can_http?: boolean;
metadata: { metadata: {
cpu?: number; cpu?: number;
memory?: number; memory?: number;
@@ -277,6 +288,67 @@ function ResourceBadge({ worker }: { worker: Worker }) {
); );
} }
// Transport capability badge showing curl/http preflight status
function TransportBadge({ worker }: { worker: Worker }) {
const curlStatus = worker.preflight_curl_status || 'pending';
const httpStatus = worker.preflight_http_status || 'pending';
const getStatusConfig = (status: string, label: string, ms?: number, error?: string) => {
switch (status) {
case 'passed':
return {
bg: 'bg-emerald-100',
text: 'text-emerald-700',
icon: <CheckCircle className="w-3 h-3" />,
tooltip: ms ? `${label}: Passed (${ms}ms)` : `${label}: Passed`,
};
case 'failed':
return {
bg: 'bg-red-100',
text: 'text-red-700',
icon: <XCircle className="w-3 h-3" />,
tooltip: error ? `${label}: Failed - ${error}` : `${label}: Failed`,
};
case 'skipped':
return {
bg: 'bg-gray-100',
text: 'text-gray-500',
icon: <Clock className="w-3 h-3" />,
tooltip: `${label}: Skipped`,
};
default:
return {
bg: 'bg-yellow-100',
text: 'text-yellow-700',
icon: <Clock className="w-3 h-3 animate-pulse" />,
tooltip: `${label}: Pending`,
};
}
};
const curlConfig = getStatusConfig(curlStatus, 'CURL', worker.preflight_curl_ms, worker.preflight_curl_error);
const httpConfig = getStatusConfig(httpStatus, 'HTTP', worker.preflight_http_ms, worker.preflight_http_error);
return (
<div className="flex flex-col gap-1">
<div
className={`inline-flex items-center gap-1 px-1.5 py-0.5 rounded text-xs font-medium ${curlConfig.bg} ${curlConfig.text}`}
title={curlConfig.tooltip}
>
{curlConfig.icon}
<span>curl</span>
</div>
<div
className={`inline-flex items-center gap-1 px-1.5 py-0.5 rounded text-xs font-medium ${httpConfig.bg} ${httpConfig.text}`}
title={httpConfig.tooltip}
>
{httpConfig.icon}
<span>http</span>
</div>
</div>
);
}
// Task count badge showing active/max concurrent tasks // Task count badge showing active/max concurrent tasks
function TaskCountBadge({ worker, tasks }: { worker: Worker; tasks: Task[] }) { function TaskCountBadge({ worker, tasks }: { worker: Worker; tasks: Task[] }) {
const activeCount = worker.active_task_count ?? (worker.current_task_id ? 1 : 0); const activeCount = worker.active_task_count ?? (worker.current_task_id ? 1 : 0);
@@ -883,6 +955,7 @@ export function WorkersDashboard() {
<th className="px-4 py-3 text-left text-xs font-medium text-gray-500 uppercase">Worker</th> <th className="px-4 py-3 text-left text-xs font-medium text-gray-500 uppercase">Worker</th>
<th className="px-4 py-3 text-left text-xs font-medium text-gray-500 uppercase">Role</th> <th className="px-4 py-3 text-left text-xs font-medium text-gray-500 uppercase">Role</th>
<th className="px-4 py-3 text-left text-xs font-medium text-gray-500 uppercase">Status</th> <th className="px-4 py-3 text-left text-xs font-medium text-gray-500 uppercase">Status</th>
<th className="px-4 py-3 text-left text-xs font-medium text-gray-500 uppercase">Transport</th>
<th className="px-4 py-3 text-left text-xs font-medium text-gray-500 uppercase">Resources</th> <th className="px-4 py-3 text-left text-xs font-medium text-gray-500 uppercase">Resources</th>
<th className="px-4 py-3 text-left text-xs font-medium text-gray-500 uppercase">Tasks</th> <th className="px-4 py-3 text-left text-xs font-medium text-gray-500 uppercase">Tasks</th>
<th className="px-4 py-3 text-left text-xs font-medium text-gray-500 uppercase">Duration</th> <th className="px-4 py-3 text-left text-xs font-medium text-gray-500 uppercase">Duration</th>
@@ -934,6 +1007,9 @@ export function WorkersDashboard() {
<td className="px-4 py-3"> <td className="px-4 py-3">
<HealthBadge status={worker.status} healthStatus={worker.health_status} /> <HealthBadge status={worker.status} healthStatus={worker.health_status} />
</td> </td>
<td className="px-4 py-3">
<TransportBadge worker={worker} />
</td>
<td className="px-4 py-3"> <td className="px-4 py-3">
<ResourceBadge worker={worker} /> <ResourceBadge worker={worker} />
</td> </td>

View File

@@ -0,0 +1,18 @@
# Woodpecker Agent Docker Compose
# Path: /opt/woodpecker/docker-compose.yml
# Deploy: cd /opt/woodpecker && docker compose up -d
version: '3.8'
services:
woodpecker-agent:
image: woodpeckerci/woodpecker-agent:latest
container_name: woodpecker-agent
restart: always
volumes:
- /var/run/docker.sock:/var/run/docker.sock
environment:
- WOODPECKER_SERVER=localhost:9000
- WOODPECKER_AGENT_SECRET=${WOODPECKER_AGENT_SECRET}
- WOODPECKER_MAX_WORKFLOWS=5
- WOODPECKER_HEALTHCHECK=true
- WOODPECKER_LOG_LEVEL=info