chore: Clean up deprecated code and docs
- Move deprecated directories to src/_deprecated/:
  - hydration/ (old pipeline approach)
  - scraper-v2/ (old Puppeteer scraper)
  - canonical-hydration/ (merged into tasks)
  - Unused services: availability, crawler-logger, geolocation, etc
  - Unused utils: age-gate-playwright, HomepageValidator, stealthBrowser
- Archive outdated docs to docs/_archive/:
  - ANALYTICS_RUNBOOK.md
  - ANALYTICS_V2_EXAMPLES.md
  - BRAND_INTELLIGENCE_API.md
  - CRAWL_PIPELINE.md
  - TASK_WORKFLOW_2024-12-10.md
  - WORKER_TASK_ARCHITECTURE.md
  - ORGANIC_SCRAPING_GUIDE.md
- Add docs/CODEBASE_MAP.md as single source of truth
- Add warning files to deprecated/archived directories
- Slim down CLAUDE.md to essential rules only

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
218
backend/docs/CODEBASE_MAP.md
Normal file
@@ -0,0 +1,218 @@
# CannaiQ Backend Codebase Map

**Last Updated:** 2025-12-12
**Purpose:** Help Claude and developers understand which code is current vs deprecated

---

## Quick Reference: What to Use

### For Crawling/Scraping
| Task | Use This | NOT This |
|------|----------|----------|
| Fetch products | `src/tasks/handlers/payload-fetch.ts` | `src/hydration/*` |
| Process products | `src/tasks/handlers/product-refresh.ts` | `src/scraper-v2/*` |
| GraphQL client | `src/platforms/dutchie/client.ts` | `src/dutchie-az/services/graphql-client.ts` |
| Worker system | `src/tasks/task-worker.ts` | `src/dutchie-az/services/worker.ts` |

### For Database
| Task | Use This | NOT This |
|------|----------|----------|
| Get DB pool | `src/db/pool.ts` | `src/dutchie-az/db/connection.ts` |
| Run migrations | `src/db/migrate.ts` (CLI only) | Never import at runtime |
| Query products | `store_products` table | `products`, `dutchie_products` |
| Query stores | `dispensaries` table | `stores` table |

### For Discovery
| Task | Use This |
|------|----------|
| Discover stores | `src/discovery/*.ts` |
| Run discovery | `npx tsx src/scripts/run-discovery.ts` |

---

## Directory Status

### ACTIVE DIRECTORIES (Use These)

```
src/
├── auth/              # JWT/session auth, middleware
├── db/                # Database pool, migrations
├── discovery/         # Dutchie store discovery pipeline
├── middleware/        # Express middleware
├── multi-state/       # Multi-state query support
├── platforms/         # Platform-specific clients (Dutchie, Jane, etc)
│   └── dutchie/       # THE Dutchie client - use this one
├── routes/            # Express API routes
├── services/          # Core services (logger, scheduler, etc)
├── tasks/             # Task system (workers, handlers, scheduler)
│   └── handlers/      # Task handlers (payload_fetch, product_refresh, etc)
├── types/             # TypeScript types
└── utils/             # Utilities (storage, image processing)
```

### DEPRECATED DIRECTORIES (DO NOT USE)

```
src/
├── hydration/          # DEPRECATED - Old pipeline approach
├── scraper-v2/         # DEPRECATED - Old scraper engine
├── canonical-hydration/# DEPRECATED - Merged into tasks/handlers
├── dutchie-az/         # PARTIAL - Some parts deprecated, some active
│   ├── db/             # DEPRECATED - Use src/db/pool.ts
│   └── services/       # PARTIAL - worker.ts still runs, graphql-client.ts deprecated
├── portals/            # FUTURE - Not yet implemented
├── seo/                # PARTIAL - Settings work, templates WIP
└── system/             # DEPRECATED - Old orchestration system
```

### DEPRECATED FILES (DO NOT USE)

```
src/dutchie-az/db/connection.ts           # Use src/db/pool.ts instead
src/dutchie-az/services/graphql-client.ts # Use src/platforms/dutchie/client.ts
src/hydration/*.ts                        # Entire directory deprecated
src/scraper-v2/*.ts                       # Entire directory deprecated
```

---

## Key Files Reference

### Entry Points
| File | Purpose | Status |
|------|---------|--------|
| `src/index.ts` | Main Express server | ACTIVE |
| `src/dutchie-az/services/worker.ts` | Worker process entry | ACTIVE |
| `src/tasks/task-worker.ts` | Task worker (new system) | ACTIVE |

### Dutchie Integration
| File | Purpose | Status |
|------|---------|--------|
| `src/platforms/dutchie/client.ts` | GraphQL client, hashes, curl | **PRIMARY** |
| `src/platforms/dutchie/queries.ts` | High-level query functions | ACTIVE |
| `src/platforms/dutchie/index.ts` | Re-exports | ACTIVE |

### Task Handlers
| File | Purpose | Status |
|------|---------|--------|
| `src/tasks/handlers/payload-fetch.ts` | Fetch products from Dutchie | **PRIMARY** |
| `src/tasks/handlers/product-refresh.ts` | Process payload into DB | **PRIMARY** |
| `src/tasks/handlers/menu-detection.ts` | Detect menu type | ACTIVE |
| `src/tasks/handlers/id-resolution.ts` | Resolve platform IDs | ACTIVE |
| `src/tasks/handlers/image-download.ts` | Download product images | ACTIVE |

### Database
| File | Purpose | Status |
|------|---------|--------|
| `src/db/pool.ts` | Canonical DB pool | **PRIMARY** |
| `src/db/migrate.ts` | Migration runner (CLI only) | CLI ONLY |
| `src/db/auto-migrate.ts` | Auto-run migrations on startup | ACTIVE |

### Configuration
| File | Purpose | Status |
|------|---------|--------|
| `.env` | Environment variables | ACTIVE |
| `package.json` | Dependencies | ACTIVE |
| `tsconfig.json` | TypeScript config | ACTIVE |

---

## GraphQL Hashes (CRITICAL)

The correct hashes are in `src/platforms/dutchie/client.ts`:

```typescript
export const GRAPHQL_HASHES = {
  FilteredProducts: 'ee29c060826dc41c527e470e9ae502c9b2c169720faa0a9f5d25e1b9a530a4a0',
  GetAddressBasedDispensaryData: '13461f73abf7268770dfd05fe7e10c523084b2bb916a929c08efe3d87531977b',
  ConsumerDispensaries: '0a5bfa6ca1d64ae47bcccb7c8077c87147cbc4e6982c17ceec97a2a4948b311b',
  GetAllCitiesByState: 'ae547a0466ace5a48f91e55bf6699eacd87e3a42841560f0c0eabed5a0a920e6',
};
```

**ALWAYS** use `Status: 'Active'` for FilteredProducts (not `null` or `'All'`).
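
A minimal sketch of how a persisted-query URL could be assembled from these hashes. `buildPersistedQueryUrl` is a hypothetical helper, not a documented export of `client.ts`; the default endpoint is the one used by `backend/test-intercept.js` in this commit, and the hash table is redeclared here only to keep the example self-contained. Looking the hash up by operation name, rather than pasting it at the call site, is one way to avoid the wrong-hash mistake.

```typescript
// Values copied from GRAPHQL_HASHES as listed in this file. Real code would
// import the constant from src/platforms/dutchie/client.ts instead.
const GRAPHQL_HASHES = {
  FilteredProducts: 'ee29c060826dc41c527e470e9ae502c9b2c169720faa0a9f5d25e1b9a530a4a0',
  GetAddressBasedDispensaryData: '13461f73abf7268770dfd05fe7e10c523084b2bb916a929c08efe3d87531977b',
  ConsumerDispensaries: '0a5bfa6ca1d64ae47bcccb7c8077c87147cbc4e6982c17ceec97a2a4948b311b',
  GetAllCitiesByState: 'ae547a0466ace5a48f91e55bf6699eacd87e3a42841560f0c0eabed5a0a920e6',
} as const;

type OperationName = keyof typeof GRAPHQL_HASHES;

// Hypothetical helper: builds the GET URL for a persisted query.
function buildPersistedQueryUrl(
  operationName: OperationName,
  variables: Record<string, unknown>,
  endpoint: string = 'https://dutchie.com/api-3/graphql',
): string {
  const extensions = {
    persistedQuery: { version: 1, sha256Hash: GRAPHQL_HASHES[operationName] },
  };
  const qs = new URLSearchParams({
    operationName,
    variables: JSON.stringify(variables),
    extensions: JSON.stringify(extensions),
  });
  return `${endpoint}?${qs.toString()}`;
}

// Example: first page of active products (the filter is trimmed for brevity).
const exampleProductsUrl = buildPersistedQueryUrl('FilteredProducts', {
  productsFilter: { Status: 'Active' },
  page: 0,
  perPage: 100,
});
```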

---

## Scripts Reference

### Useful Scripts (in `src/scripts/`)
| Script | Purpose |
|--------|---------|
| `run-discovery.ts` | Run Dutchie discovery |
| `crawl-single-store.ts` | Test crawl a single store |
| `test-dutchie-graphql.ts` | Test GraphQL queries |

### One-Off Scripts (probably don't need)
| Script | Purpose |
|--------|---------|
| `harmonize-az-dispensaries.ts` | One-time data cleanup |
| `bootstrap-stores-for-dispensaries.ts` | One-time migration |
| `backfill-*.ts` | Historical backfill scripts |

---

## API Routes

### Active Routes (in `src/routes/`)
| Route File | Mount Point | Purpose |
|------------|-------------|---------|
| `auth.ts` | `/api/auth` | Login/logout/session |
| `stores.ts` | `/api/stores` | Store CRUD |
| `dashboard.ts` | `/api/dashboard` | Dashboard stats |
| `workers.ts` | `/api/workers` | Worker monitoring |
| `pipeline.ts` | `/api/pipeline` | Crawl triggers |
| `discovery.ts` | `/api/discovery` | Discovery management |
| `analytics.ts` | `/api/analytics` | Analytics queries |
| `wordpress.ts` | `/api/v1/wordpress` | WordPress plugin API |

---

## Documentation Files

### Current Docs (in `backend/docs/`)
| Doc | Purpose | Currency |
|-----|---------|----------|
| `TASK_WORKFLOW_2024-12-10.md` | Task system architecture | CURRENT |
| `WORKER_TASK_ARCHITECTURE.md` | Worker/task design | CURRENT |
| `CRAWL_PIPELINE.md` | Crawl pipeline overview | CURRENT |
| `ORGANIC_SCRAPING_GUIDE.md` | Browser-based scraping | CURRENT |
| `CODEBASE_MAP.md` | This file | CURRENT |
| `ANALYTICS_V2_EXAMPLES.md` | Analytics API examples | CURRENT |
| `BRAND_INTELLIGENCE_API.md` | Brand API docs | CURRENT |

### Root Docs
| Doc | Purpose | Currency |
|-----|---------|----------|
| `CLAUDE.md` | Claude instructions | **PRIMARY** |
| `README.md` | Project overview | NEEDS UPDATE |

---

## Common Mistakes to Avoid

1. **Don't use `src/hydration/`** - It's an old approach that was superseded by the task system

2. **Don't use `src/dutchie-az/db/connection.ts`** - Use `src/db/pool.ts` instead

3. **Don't import `src/db/migrate.ts` at runtime** - It will crash. Only use for CLI migrations.

4. **Don't query `stores` table** - It's empty. Use `dispensaries`.

5. **Don't query `products` table** - It's empty. Use `store_products`.

6. **Don't use wrong GraphQL hash** - Always get hash from `GRAPHQL_HASHES` in client.ts

7. **Don't use `Status: null`** - It returns 0 products. Use `Status: 'Active'`.

---

## When in Doubt

1. Check if the file is imported in `src/index.ts` - if not, it may be deprecated
2. Check the last modified date - older files may be stale
3. Look for `DEPRECATED` comments in the code
4. Ask: "Is there a newer version of this in `src/tasks/` or `src/platforms/`?"
5. Read the relevant doc in `docs/` before modifying code
297
backend/docs/_archive/ORGANIC_SCRAPING_GUIDE.md
Normal file
@@ -0,0 +1,297 @@
# Organic Browser-Based Scraping Guide

**Last Updated:** 2025-12-12
**Status:** Production-ready proof of concept

---

## Overview

This document describes the "organic" browser-based approach to scraping Dutchie dispensary menus. Unlike direct curl/axios requests, this method uses a real browser session to make API calls, making requests appear natural and reducing detection risk.

---

## Why Organic Scraping?

| Approach | Detection Risk | Speed | Complexity |
|----------|---------------|-------|------------|
| Direct curl | Higher | Fast | Low |
| curl-impersonate | Medium | Fast | Medium |
| **Browser-based (organic)** | **Lowest** | Slower | Higher |

Direct curl requests can be fingerprinted via:
- TLS fingerprint (cipher suites, extensions)
- Header order and values
- Missing cookies/session data
- Request patterns

Browser-based requests inherit:
- Real Chrome TLS fingerprint
- Session cookies from page visit
- Natural header order
- JavaScript execution environment

---

## Implementation

### Dependencies

```bash
npm install puppeteer puppeteer-extra puppeteer-extra-plugin-stealth
```

### Core Script: `test-intercept.js`

Located at: `backend/test-intercept.js`

```javascript
const puppeteer = require('puppeteer-extra');
const StealthPlugin = require('puppeteer-extra-plugin-stealth');
const fs = require('fs');

puppeteer.use(StealthPlugin());

async function capturePayload(config) {
  const { dispensaryId, platformId, cName, outputPath } = config;

  const browser = await puppeteer.launch({
    headless: 'new',
    args: ['--no-sandbox', '--disable-setuid-sandbox']
  });

  const page = await browser.newPage();

  // STEP 1: Establish session by visiting the menu
  const embedUrl = `https://dutchie.com/embedded-menu/${cName}?menuType=rec`;
  await page.goto(embedUrl, { waitUntil: 'networkidle2', timeout: 60000 });

  // STEP 2: Fetch ALL products using GraphQL from browser context
  const result = await page.evaluate(async (platformId) => {
    const allProducts = [];
    let pageNum = 0;
    const perPage = 100;
    let totalCount = 0;
    const sessionId = 'browser-session-' + Date.now();

    while (pageNum < 30) {
      const variables = {
        includeEnterpriseSpecials: false,
        productsFilter: {
          dispensaryId: platformId,
          pricingType: 'rec',
          Status: 'Active', // CRITICAL: Must be 'Active', not null
          types: [],
          useCache: true,
          isDefaultSort: true,
          sortBy: 'popularSortIdx',
          sortDirection: 1,
          bypassOnlineThresholds: true,
          isKioskMenu: false,
          removeProductsBelowOptionThresholds: false,
        },
        page: pageNum,
        perPage: perPage,
      };

      const extensions = {
        persistedQuery: {
          version: 1,
          sha256Hash: 'ee29c060826dc41c527e470e9ae502c9b2c169720faa0a9f5d25e1b9a530a4a0'
        }
      };

      const qs = new URLSearchParams({
        operationName: 'FilteredProducts',
        variables: JSON.stringify(variables),
        extensions: JSON.stringify(extensions)
      });

      const response = await fetch(`https://dutchie.com/api-3/graphql?${qs}`, {
        method: 'GET',
        headers: {
          'Accept': 'application/json',
          'content-type': 'application/json',
          'x-dutchie-session': sessionId,
          'apollographql-client-name': 'Marketplace (production)',
        },
        credentials: 'include'
      });

      const json = await response.json();
      const data = json?.data?.filteredProducts;
      if (!data?.products) break;

      allProducts.push(...data.products);
      if (pageNum === 0) totalCount = data.queryInfo?.totalCount || 0;
      if (allProducts.length >= totalCount) break;

      pageNum++;
      await new Promise(r => setTimeout(r, 200)); // Polite delay
    }

    return { products: allProducts, totalCount };
  }, platformId);

  await browser.close();

  // STEP 3: Save payload
  const payload = {
    dispensaryId,
    platformId,
    cName,
    fetchedAt: new Date().toISOString(),
    productCount: result.products.length,
    products: result.products,
  };

  fs.writeFileSync(outputPath, JSON.stringify(payload, null, 2));
  return payload;
}
```
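
The pagination logic in STEP 2 can be separated from the browser so that its stop conditions are testable on their own. The sketch below is an illustration under assumptions, not the script's actual structure: `fetchAllPages` and the `FetchPage` callback are hypothetical names, and the callback stands in for the in-page `fetch` call. The defaults (100 per page, 30 pages maximum, 200 ms delay) mirror the script above. Unlike that script, this version also stops on a short page, which guards against a `totalCount` that is reported too high.

```typescript
interface ProductPage<T> {
  products: T[];
  totalCount: number;
}

// Hypothetical callback: resolves one zero-based page, or null when the
// response carried no product data.
type FetchPage<T> = (page: number, perPage: number) => Promise<ProductPage<T> | null>;

async function fetchAllPages<T>(
  fetchPage: FetchPage<T>,
  perPage: number = 100,
  maxPages: number = 30,
  delayMs: number = 200,
): Promise<{ products: T[]; totalCount: number }> {
  const products: T[] = [];
  let totalCount = 0;

  for (let page = 0; page < maxPages; page++) {
    const result = await fetchPage(page, perPage);
    if (!result) break;

    products.push(...result.products);
    // The total is read once, from the first page only.
    if (page === 0) totalCount = result.totalCount;

    // Stop when everything reported has been collected, or on a short page.
    if (products.length >= totalCount || result.products.length < perPage) break;

    // Polite delay between pages; skipped after the final page.
    await new Promise<void>((resolve) => setTimeout(resolve, delayMs));
  }

  return { products, totalCount };
}
```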

---

## Critical Parameters

### GraphQL Hash (FilteredProducts)

```
ee29c060826dc41c527e470e9ae502c9b2c169720faa0a9f5d25e1b9a530a4a0
```

**WARNING:** Using the wrong hash returns HTTP 400.

### Status Parameter

| Value | Result |
|-------|--------|
| `'Active'` | Returns in-stock products (1019 in test) |
| `null` | Returns 0 products |
| `'All'` | Returns HTTP 400 |

**ALWAYS use `Status: 'Active'`**

### Required Headers

```javascript
{
  'Accept': 'application/json',
  'content-type': 'application/json',
  'x-dutchie-session': 'unique-session-id',
  'apollographql-client-name': 'Marketplace (production)',
}
```

### Endpoint

```
https://dutchie.com/api-3/graphql
```

---

## Performance Benchmarks

Test store: AZ-Deeply-Rooted (1019 products)

| Metric | Value |
|--------|-------|
| Total products | 1019 |
| Time | 18.5 seconds |
| Payload size | 11.8 MB |
| Pages fetched | 11 (100 per page) |
| Success rate | 100% |
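
The page count in the table follows directly from the product total and the page size, and the same arithmetic shows how much of the run is deliberate waiting. The helper names below are hypothetical; the 200 ms figure is the inter-page delay used in `test-intercept.js`. Under these assumptions about 2 of the 18.5 seconds are polite delay, and the rest is browser launch, the initial page load, and the requests themselves.

```typescript
// Number of requests needed to page through `totalProducts` items.
function pagesNeeded(totalProducts: number, perPage: number): number {
  return Math.ceil(totalProducts / perPage);
}

// The delay runs between pages, so it applies (pages - 1) times.
function totalPoliteDelayMs(pages: number, delayPerPageMs: number): number {
  return Math.max(0, pages - 1) * delayPerPageMs;
}

const benchmarkPages = pagesNeeded(1019, 100);                      // 11
const benchmarkDelayMs = totalPoliteDelayMs(benchmarkPages, 200);   // 2000
```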

---

## Payload Format

The output matches the existing `payload-fetch.ts` handler format:

```json
{
  "dispensaryId": 123,
  "platformId": "6405ef617056e8014d79101b",
  "cName": "AZ-Deeply-Rooted",
  "fetchedAt": "2025-12-12T05:05:19.837Z",
  "productCount": 1019,
  "products": [
    {
      "id": "6927508db4851262f629a869",
      "Name": "Product Name",
      "brand": { "name": "Brand Name", ... },
      "type": "Flower",
      "THC": "25%",
      "Prices": [...],
      "Options": [...],
      ...
    }
  ]
}
```
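
A hedged sketch of how the envelope above could be typed and checked before it is handed to downstream processing. `CapturedPayload` and `isCapturedPayload` are hypothetical names, not types from the codebase. Product fields are left open because the example above elides most of them, and `dispensaryId` is allowed to be `null` because `test-intercept.js` defaults it to `null`.

```typescript
interface CapturedPayload {
  dispensaryId: number | null;
  platformId: string;
  cName: string;
  fetchedAt: string; // ISO 8601 timestamp
  productCount: number;
  products: Array<Record<string, unknown>>;
}

// Runtime check that a parsed JSON value has the envelope shape shown above.
function isCapturedPayload(value: unknown): value is CapturedPayload {
  if (typeof value !== 'object' || value === null) return false;
  const { dispensaryId, platformId, cName, fetchedAt, productCount, products } =
    value as Record<string, unknown>;

  return (
    (typeof dispensaryId === 'number' || dispensaryId === null) &&
    typeof platformId === 'string' &&
    typeof cName === 'string' &&
    typeof fetchedAt === 'string' &&
    !isNaN(Date.parse(fetchedAt)) &&
    typeof productCount === 'number' &&
    Array.isArray(products) &&
    products.every((p) => typeof p === 'object' && p !== null) &&
    // productCount is written as products.length, so the two must agree.
    products.length === productCount
  );
}
```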

---

## Integration Points

### As a Task Handler

The organic approach can be integrated as an alternative to curl-based fetching:

```typescript
// In src/tasks/handlers/organic-payload-fetch.ts
export async function handleOrganicPayloadFetch(ctx: TaskContext): Promise<TaskResult> {
  // Use puppeteer-based capture
  // Save to same payload storage
  // Queue product_refresh task
}
```
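
One way the three commented steps could be filled in is sketched below. Everything here is an assumption made for illustration: the `TaskContext` and `TaskResult` shapes are stand-ins (the real types in the task system may differ), and `capture`, `savePayload`, and `queueTask` are hypothetical injected dependencies rather than existing functions. Injecting them keeps the example runnable without a browser or a database.

```typescript
// Hypothetical stand-ins for the task system's real types.
interface TaskContext {
  dispensaryId: number;
  platformId: string;
  cName: string;
}

interface TaskResult {
  success: boolean;
  productCount?: number;
  error?: string;
}

// Hypothetical dependencies, injected so the handler can be tested with fakes.
interface OrganicFetchDeps {
  capture: (ctx: TaskContext) => Promise<{ products: unknown[] }>;
  savePayload: (dispensaryId: number, payload: unknown) => Promise<string>;
  queueTask: (role: string, dispensaryId: number, payloadRef: string) => Promise<void>;
}

async function handleOrganicPayloadFetch(
  ctx: TaskContext,
  deps: OrganicFetchDeps,
): Promise<TaskResult> {
  try {
    // 1. Use puppeteer-based capture.
    const captured = await deps.capture(ctx);
    if (captured.products.length === 0) {
      return { success: false, error: 'No products captured' };
    }

    // 2. Save to the same payload storage.
    const payloadRef = await deps.savePayload(ctx.dispensaryId, captured);

    // 3. Queue the product_refresh task.
    await deps.queueTask('product_refresh', ctx.dispensaryId, payloadRef);

    return { success: true, productCount: captured.products.length };
  } catch (err) {
    return { success: false, error: err instanceof Error ? err.message : String(err) };
  }
}
```

Returning a failed result on an empty capture, instead of queueing a refresh, is a design choice of this sketch: it keeps an empty payload from being processed as if the store had no products.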

### Worker Configuration

Add to job_schedules:
```sql
INSERT INTO job_schedules (name, role, cron_expression)
VALUES ('organic_product_crawl', 'organic_payload_fetch', '0 */6 * * *');
```
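
In standard five-field cron syntax, `0 */6 * * *` fires at minute 0 of every sixth hour, i.e. four times a day. The sketch below expands the first two fields to make that concrete. `expandCronField` is a hypothetical helper that handles only `*`, `*/n`, and a single number, not full cron syntax, and the timezone in which the schedule fires depends on the scheduler, which this document does not specify.

```typescript
// Expands one cron field into the values it matches within [min, max].
// Supports only "*", "*/n", and a plain number.
function expandCronField(field: string, min: number, max: number): number[] {
  const values: number[] = [];

  if (field === '*') {
    for (let v = min; v <= max; v++) values.push(v);
    return values;
  }

  const stepMatch = /^\*\/(\d+)$/.exec(field);
  if (stepMatch) {
    const step = parseInt(stepMatch[1] ?? '', 10);
    if (!(step >= 1)) throw new RangeError(`Invalid step in cron field: ${field}`);
    for (let v = min; v <= max; v += step) values.push(v);
    return values;
  }

  if (/^\d+$/.test(field)) {
    const n = parseInt(field, 10);
    if (n < min || n > max) throw new RangeError(`Value out of range: ${field}`);
    return [n];
  }

  throw new Error(`Unsupported cron field: ${field}`);
}

const cronFields = '0 */6 * * *'.split(' ');
const fireMinutes = expandCronField(cronFields[0] ?? '*', 0, 59); // [0]
const fireHours = expandCronField(cronFields[1] ?? '*', 0, 23);   // [0, 6, 12, 18]
```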

---

## Troubleshooting

### HTTP 400 Bad Request
- Check the hash is correct: `ee29c060...`
- Verify Status is `'Active'`; `'All'` returns HTTP 400

### 0 Products Returned
- Status was likely `null` - use `'Active'`
- Check platformId is a valid MongoDB ObjectId

### Session Not Established
- Increase the timeout on the initial `page.goto()`
- Check cName is valid (matches the embedded-menu URL)

### Detection/Blocking
- StealthPlugin should handle most cases
- Add random delays between pages
- Use `headless: 'new'` (not true/false)
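
Several of the checks above can be run before any request is sent. The sketch below is illustrative only: `preflightErrors` and `jitteredDelayMs` are hypothetical helpers. The ObjectId test relies on the standard 24-hex-character string form of a MongoDB ObjectId, the Status messages restate the results listed under Status Parameter, and the jitter range is an arbitrary choice for the example.

```typescript
// A MongoDB ObjectId is written as 24 hexadecimal characters.
const OBJECT_ID_PATTERN = /^[0-9a-f]{24}$/i;

// Returns a list of human-readable problems; an empty list means "looks OK".
function preflightErrors(platformId: string, statusFilter: unknown): string[] {
  const problems: string[] = [];
  if (!OBJECT_ID_PATTERN.test(platformId)) {
    problems.push(`platformId "${platformId}" is not a 24-character hex ObjectId`);
  }
  if (statusFilter === null) {
    problems.push("Status is null, which returns 0 products; use 'Active'");
  } else if (statusFilter !== 'Active') {
    problems.push(`Status ${JSON.stringify(statusFilter)} is not 'Active' ('All' returns HTTP 400)`);
  }
  return problems;
}

// Randomized delay between pages: baseMs plus up to jitterMs extra.
// `random` is injectable so the result can be made deterministic in tests.
function jitteredDelayMs(
  baseMs: number,
  jitterMs: number,
  random: () => number = Math.random,
): number {
  return baseMs + Math.floor(random() * jitterMs);
}
```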

---

## Files Reference

| File | Purpose |
|------|---------|
| `backend/test-intercept.js` | Proof of concept script |
| `backend/src/platforms/dutchie/client.ts` | GraphQL hashes, curl implementation |
| `backend/src/tasks/handlers/payload-fetch.ts` | Current curl-based handler |
| `backend/src/utils/payload-storage.ts` | Payload save/load utilities |

---

## See Also

- `DUTCHIE_CRAWL_WORKFLOW.md` - Full crawl pipeline documentation
- `TASK_WORKFLOW_2024-12-10.md` - Task system architecture
- `CLAUDE.md` - Project rules and constraints
25
backend/docs/_archive/README.md
Normal file
@@ -0,0 +1,25 @@
# ARCHIVED DOCUMENTATION

**WARNING: These docs may be outdated or inaccurate.**

The code has evolved significantly. These docs are kept for historical reference only.

## What to Use Instead

**The single source of truth is:**
- `CLAUDE.md` (root) - Essential rules and quick reference
- `docs/CODEBASE_MAP.md` - Current file/directory reference

## Why Archive?

These docs were written during development iterations and may reference:
- Old file paths that no longer exist
- Deprecated approaches (hydration, scraper-v2)
- APIs that have changed
- Database schemas that evolved

## If You Need Details

1. First check CODEBASE_MAP.md for current file locations
2. Then read the actual source code
3. Only use archive docs as a last resort for historical context
46
backend/src/_deprecated/DONT_USE.md
Normal file
@@ -0,0 +1,46 @@
# DEPRECATED CODE - DO NOT USE

**These directories contain OLD, ABANDONED code.**

## What's Here

| Directory | What It Was | Why Deprecated |
|-----------|-------------|----------------|
| `hydration/` | Old pipeline for processing crawl data | Replaced by `src/tasks/handlers/` |
| `scraper-v2/` | Old Puppeteer-based scraper engine | Replaced by curl-based `src/platforms/dutchie/client.ts` |
| `canonical-hydration/` | Intermediate step toward canonical schema | Merged into task handlers |

## What to Use Instead

| Old (DONT USE) | New (USE THIS) |
|----------------|----------------|
| `hydration/normalizers/dutchie.ts` | `src/tasks/handlers/product-refresh.ts` |
| `hydration/producer.ts` | `src/tasks/handlers/payload-fetch.ts` |
| `scraper-v2/engine.ts` | `src/platforms/dutchie/client.ts` |
| `scraper-v2/scheduler.ts` | `src/services/task-scheduler.ts` |

## Why Keep This Code?

- Historical reference only
- Some patterns may be useful for debugging
- Will be deleted once confirmed not needed

## Claude Instructions

**IF YOU ARE CLAUDE:**

1. NEVER import from `src/_deprecated/`
2. NEVER reference these files as examples
3. NEVER try to "fix" or "update" code in here
4. If you see imports from these directories, suggest replacing them

**Correct imports:**
```typescript
// GOOD
import { executeGraphQL } from '../platforms/dutchie/client';
import { pool } from '../db/pool';

// BAD - DO NOT USE
import { something } from '../_deprecated/hydration/...';
import { something } from '../_deprecated/scraper-v2/...';
```
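
The import rule could also be checked mechanically. The following is a minimal sketch under assumptions, not an existing tool in this repository: `findDeprecatedImports` is a hypothetical helper that scans source text for `from '...'` specifiers and reports any whose path contains a `_deprecated` segment. It uses a regular expression rather than a TypeScript parser, so it would miss dynamic `import()` and `require()` calls.

```typescript
// Returns every import specifier in `source` that passes through a
// `_deprecated` directory.
function findDeprecatedImports(source: string): string[] {
  const importPattern = /\bfrom\s+['"]([^'"]+)['"]/g;
  const hits: string[] = [];
  let match: RegExpExecArray | null;

  while ((match = importPattern.exec(source)) !== null) {
    const specifier = match[1] ?? '';
    // Compare whole path segments so a name like "not_deprecated" is not flagged.
    if (specifier.split('/').indexOf('_deprecated') !== -1) {
      hits.push(specifier);
    }
  }
  return hits;
}
```

A check like this could run in CI or a pre-commit hook and fail when the returned list is non-empty.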
180
backend/test-intercept.js
Normal file
@@ -0,0 +1,180 @@
/**
 * Stealth Browser Payload Capture - Direct GraphQL Injection
 *
 * Uses the browser session to make GraphQL requests that look organic.
 * Adds proper headers matching what Dutchie's frontend sends.
 */

const puppeteer = require('puppeteer-extra');
const StealthPlugin = require('puppeteer-extra-plugin-stealth');
const fs = require('fs');

puppeteer.use(StealthPlugin());

async function capturePayload(config) {
  const {
    dispensaryId = null,
    platformId,
    cName,
    outputPath = `/tmp/payload_${cName}_${Date.now()}.json`,
  } = config;

  const browser = await puppeteer.launch({
    headless: 'new',
    args: ['--no-sandbox', '--disable-setuid-sandbox']
  });

  const page = await browser.newPage();

  // Establish session by visiting the embedded menu
  const embedUrl = `https://dutchie.com/embedded-menu/${cName}?menuType=rec`;
  console.log(`[Capture] Establishing session at ${embedUrl}...`);

  await page.goto(embedUrl, {
    waitUntil: 'networkidle2',
    timeout: 60000
  });

  console.log('[Capture] Session established, fetching ALL products...');

  // Fetch all products using GET requests with proper headers
  const result = await page.evaluate(async (platformId, cName) => {
    const allProducts = [];
    const logs = [];
    let pageNum = 0;
    const perPage = 100;
    let totalCount = 0;
    const sessionId = 'browser-session-' + Date.now();

    try {
      while (pageNum < 30) { // Max 30 pages = 3000 products
        const variables = {
          includeEnterpriseSpecials: false,
          productsFilter: {
            dispensaryId: platformId,
            pricingType: 'rec',
            Status: 'Active', // 'Active' for in-stock products per CLAUDE.md
            types: [],
            useCache: true,
            isDefaultSort: true,
            sortBy: 'popularSortIdx',
            sortDirection: 1,
            bypassOnlineThresholds: true,
            isKioskMenu: false,
            removeProductsBelowOptionThresholds: false,
          },
          page: pageNum,
          perPage: perPage,
        };

        const extensions = {
          persistedQuery: {
            version: 1,
            sha256Hash: 'ee29c060826dc41c527e470e9ae502c9b2c169720faa0a9f5d25e1b9a530a4a0'
          }
        };

        // Build GET URL like the browser does
        const qs = new URLSearchParams({
          operationName: 'FilteredProducts',
          variables: JSON.stringify(variables),
          extensions: JSON.stringify(extensions)
        });
        const url = `https://dutchie.com/api-3/graphql?${qs.toString()}`;

        const response = await fetch(url, {
          method: 'GET',
          headers: {
            'Accept': 'application/json',
            'content-type': 'application/json',
            'x-dutchie-session': sessionId,
            'apollographql-client-name': 'Marketplace (production)',
          },
          credentials: 'include'
        });

        logs.push(`Page ${pageNum}: HTTP ${response.status}`);

        if (!response.ok) {
          const text = await response.text();
          logs.push(`HTTP error: ${response.status} - ${text.slice(0, 200)}`);
          break;
        }

        const json = await response.json();

        if (json.errors) {
          logs.push(`GraphQL error: ${JSON.stringify(json.errors).slice(0, 200)}`);
          break;
        }

        const data = json?.data?.filteredProducts;
        if (!data || !data.products) {
          logs.push('No products in response');
          break;
        }

        const products = data.products;
        allProducts.push(...products);

        if (pageNum === 0) {
          totalCount = data.queryInfo?.totalCount || 0;
          logs.push(`Total reported: ${totalCount}`);
        }

        logs.push(`Got ${products.length} products (total: ${allProducts.length}/${totalCount})`);

        if (allProducts.length >= totalCount || products.length < perPage) {
          break;
        }

        pageNum++;

        // Small delay between pages to be polite
        await new Promise(r => setTimeout(r, 200));
      }
    } catch (err) {
      logs.push(`Error: ${err.message}`);
    }

    return { products: allProducts, totalCount, logs };
  }, platformId, cName);

  await browser.close();

  // Print logs from browser context
  result.logs.forEach(log => console.log(`[Browser] ${log}`));

  console.log(`[Capture] Got ${result.products.length} products (API reported ${result.totalCount})`);

  const payload = {
    dispensaryId: dispensaryId,
    platformId: platformId,
    cName,
    fetchedAt: new Date().toISOString(),
    productCount: result.products.length,
    products: result.products,
  };

  fs.writeFileSync(outputPath, JSON.stringify(payload, null, 2));

  console.log(`\n=== Capture Complete ===`);
  console.log(`Total products: ${result.products.length}`);
  console.log(`Saved to: ${outputPath}`);
  console.log(`File size: ${(fs.statSync(outputPath).size / 1024).toFixed(1)} KB`);

  return payload;
}

// Run
(async () => {
  const payload = await capturePayload({
    cName: 'AZ-Deeply-Rooted',
    platformId: '6405ef617056e8014d79101b',
  });

  if (payload.products.length > 0) {
    const sample = payload.products[0];
    console.log(`\nSample: ${sample.Name || sample.name} - ${sample.brand?.name || sample.brandName}`);
  }
})().catch(console.error);
18
k8s/woodpecker-agent-compose.yml
Normal file
@@ -0,0 +1,18 @@
# Woodpecker Agent Docker Compose
# Path: /opt/woodpecker/docker-compose.yml
# Deploy: cd /opt/woodpecker && docker compose up -d
version: '3.8'

services:
  woodpecker-agent:
    image: woodpeckerci/woodpecker-agent:latest
    container_name: woodpecker-agent
    restart: always
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock
    environment:
      - WOODPECKER_SERVER=localhost:9000
      - WOODPECKER_AGENT_SECRET=${WOODPECKER_AGENT_SECRET}
      - WOODPECKER_MAX_WORKFLOWS=5
      - WOODPECKER_HEALTHCHECK=true
      - WOODPECKER_LOG_LEVEL=info