Organic Browser-Based Scraping Guide
Last Updated: 2025-12-12 Status: Production-ready proof of concept
Overview
This document describes the "organic" browser-based approach to scraping Dutchie dispensary menus. Unlike direct curl/axios requests, this method uses a real browser session to make API calls, making requests appear natural and reducing detection risk.
Why Organic Scraping?
| Approach | Detection Risk | Speed | Complexity |
|---|---|---|---|
| Direct curl | Higher | Fast | Low |
| curl-impersonate | Medium | Fast | Medium |
| Browser-based (organic) | Lowest | Slower | Higher |
Direct curl requests can be fingerprinted via:
- TLS fingerprint (cipher suites, extensions)
- Header order and values
- Missing cookies/session data
- Request patterns
Browser-based requests inherit:
- Real Chrome TLS fingerprint
- Session cookies from page visit
- Natural header order
- JavaScript execution environment
Implementation
Dependencies
npm install puppeteer puppeteer-extra puppeteer-extra-plugin-stealth
Core Script: test-intercept.js
Located at: backend/test-intercept.js
const puppeteer = require('puppeteer-extra');
const StealthPlugin = require('puppeteer-extra-plugin-stealth');
const fs = require('fs');
puppeteer.use(StealthPlugin());
async function capturePayload(config) {
const { dispensaryId, platformId, cName, outputPath } = config;
const browser = await puppeteer.launch({
headless: 'new',
args: ['--no-sandbox', '--disable-setuid-sandbox']
});
const page = await browser.newPage();
// STEP 1: Establish session by visiting the menu
const embedUrl = `https://dutchie.com/embedded-menu/${cName}?menuType=rec`;
await page.goto(embedUrl, { waitUntil: 'networkidle2', timeout: 60000 });
// STEP 2: Fetch ALL products using GraphQL from browser context
const result = await page.evaluate(async (platformId) => {
const allProducts = [];
let pageNum = 0;
const perPage = 100;
let totalCount = 0;
const sessionId = 'browser-session-' + Date.now();
while (pageNum < 30) {
const variables = {
includeEnterpriseSpecials: false,
productsFilter: {
dispensaryId: platformId,
pricingType: 'rec',
Status: 'Active', // CRITICAL: Must be 'Active', not null
types: [],
useCache: true,
isDefaultSort: true,
sortBy: 'popularSortIdx',
sortDirection: 1,
bypassOnlineThresholds: true,
isKioskMenu: false,
removeProductsBelowOptionThresholds: false,
},
page: pageNum,
perPage: perPage,
};
const extensions = {
persistedQuery: {
version: 1,
sha256Hash: 'ee29c060826dc41c527e470e9ae502c9b2c169720faa0a9f5d25e1b9a530a4a0'
}
};
const qs = new URLSearchParams({
operationName: 'FilteredProducts',
variables: JSON.stringify(variables),
extensions: JSON.stringify(extensions)
});
const response = await fetch(`https://dutchie.com/api-3/graphql?${qs}`, {
method: 'GET',
headers: {
'Accept': 'application/json',
'content-type': 'application/json',
'x-dutchie-session': sessionId,
'apollographql-client-name': 'Marketplace (production)',
},
credentials: 'include'
});
const json = await response.json();
const data = json?.data?.filteredProducts;
if (!data?.products) break;
allProducts.push(...data.products);
if (pageNum === 0) totalCount = data.queryInfo?.totalCount || 0;
if (allProducts.length >= totalCount) break;
pageNum++;
await new Promise(r => setTimeout(r, 200)); // Polite delay
}
return { products: allProducts, totalCount };
}, platformId);
await browser.close();
// STEP 3: Save payload
const payload = {
dispensaryId,
platformId,
cName,
fetchedAt: new Date().toISOString(),
productCount: result.products.length,
products: result.products,
};
fs.writeFileSync(outputPath, JSON.stringify(payload, null, 2));
return payload;
}
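For reference, a minimal invocation of the function above might look like the following. The IDs are taken from the payload example later in this document; the output path is a placeholder, not a real project path.

// Hypothetical invocation of capturePayload(); IDs and path are illustrative only.
capturePayload({
  dispensaryId: 123,
  platformId: '6405ef617056e8014d79101b',
  cName: 'AZ-Deeply-Rooted',
  outputPath: './payloads/az-deeply-rooted.json',
})
  .then((payload) => console.log(`Captured ${payload.productCount} products`))
  .catch((err) => {
    console.error('Capture failed:', err);
    process.exit(1);
  });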
Critical Parameters
GraphQL Hash (FilteredProducts)
ee29c060826dc41c527e470e9ae502c9b2c169720faa0a9f5d25e1b9a530a4a0
WARNING: Using the wrong hash returns HTTP 400.
Status Parameter
| Value | Result |
|---|---|
| 'Active' | Returns in-stock products (1019 in test) |
| null | Returns 0 products |
| 'All' | Returns HTTP 400 |
ALWAYS use Status: 'Active'
Required Headers
{
'Accept': 'application/json',
'content-type': 'application/json',
'x-dutchie-session': 'unique-session-id',
'apollographql-client-name': 'Marketplace (production)',
}
Endpoint
https://dutchie.com/api-3/graphql
Performance Benchmarks
Test store: AZ-Deeply-Rooted (1019 products)
| Metric | Value |
|---|---|
| Total products | 1019 |
| Time | 18.5 seconds |
| Payload size | 11.8 MB |
| Pages fetched | 11 (100 per page) |
| Success rate | 100% |
Payload Format
The output matches the existing payload-fetch.ts handler format:
{
"dispensaryId": 123,
"platformId": "6405ef617056e8014d79101b",
"cName": "AZ-Deeply-Rooted",
"fetchedAt": "2025-12-12T05:05:19.837Z",
"productCount": 1019,
"products": [
{
"id": "6927508db4851262f629a869",
"Name": "Product Name",
"brand": { "name": "Brand Name", ... },
"type": "Flower",
"THC": "25%",
"Prices": [...],
"Options": [...],
...
}
]
}
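For use in the TypeScript codebase, the payload shape can be described roughly as below. This is a sketch inferred from the JSON example above, not a definitive schema: only the product fields shown are listed, and real Dutchie product objects carry many more.

// Sketch of the payload envelope, inferred from the example above (assumption, not a confirmed schema).
interface DutchieProduct {
  id: string;
  Name: string;
  brand?: { name: string };
  type?: string;
  THC?: string;
  Prices?: unknown[];   // price entries; element shape not shown in the example
  Options?: unknown[];  // weight/option entries; element shape not shown
  [key: string]: unknown; // remaining Dutchie fields, passed through untouched
}

interface OrganicPayload {
  dispensaryId: number;
  platformId: string;   // Dutchie's MongoDB ObjectId for the dispensary
  cName: string;        // slug used in the embedded-menu URL
  fetchedAt: string;    // ISO timestamp
  productCount: number;
  products: DutchieProduct[];
}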
Integration Points
As a Task Handler
The organic approach can be integrated as an alternative to curl-based fetching:
// In src/tasks/handlers/organic-payload-fetch.ts
export async function handleOrganicPayloadFetch(ctx: TaskContext): Promise<TaskResult> {
// Use puppeteer-based capture
// Save to same payload storage
// Queue product_refresh task
}
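A fuller sketch of that handler is shown below. It assumes capturePayload() has been extracted from the proof-of-concept script into a shared module and refactored to return the payload rather than write a local file, and that payload-storage.ts exposes a save helper; the module paths, savePayload(), ctx.task.params, and ctx.enqueueTask() are illustrative assumptions, not the project's actual API.

// src/tasks/handlers/organic-payload-fetch.ts (sketch; names below are assumptions)
import { capturePayload } from '../../utils/organic-capture';   // hypothetical extracted module
import { savePayload } from '../../utils/payload-storage';      // assumed save helper

export async function handleOrganicPayloadFetch(ctx: TaskContext): Promise<TaskResult> {
  const { dispensaryId, platformId, cName } = ctx.task.params;  // assumed task parameter shape

  // Reuse the puppeteer-based capture from the proof of concept
  const payload = await capturePayload({ dispensaryId, platformId, cName });

  // Persist to the same payload storage the curl-based handler uses
  const storedPath = await savePayload(dispensaryId, payload);

  // Queue the downstream product_refresh task so the products get ingested
  await ctx.enqueueTask('product_refresh', { dispensaryId, payloadPath: storedPath });

  return { success: true, productCount: payload.productCount };
}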
Worker Configuration
Add to job_schedules:
INSERT INTO job_schedules (name, role, cron_expression)
VALUES ('organic_product_crawl', 'organic_payload_fetch', '0 */6 * * *');
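The role string in the schedule also has to be mapped to the handler wherever the worker dispatches tasks. A hypothetical registration is sketched below; registerHandler and its module path are assumptions about the task system, not a confirmed API.

// Wherever task handlers are registered (hypothetical registry API)
import { registerHandler } from './task-registry';              // assumed helper
import { handleOrganicPayloadFetch } from './handlers/organic-payload-fetch';

registerHandler('organic_payload_fetch', handleOrganicPayloadFetch);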
Troubleshooting
HTTP 400 Bad Request
- Check hash is correct: ee29c060...
- Verify Status is 'Active' (string, not null)
0 Products Returned
- Status was likely null or 'All' - use 'Active'
- Check platformId is a valid MongoDB ObjectId
Session Not Established
- Increase timeout on initial page.goto()
- Check cName is valid (matches embedded-menu URL)
Detection/Blocking
- StealthPlugin should handle most cases
- Add random delays between pages (see the sketch after this list)
- Use headless: 'new' (not true/false)
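One way to add the random delays mentioned above is a small jitter helper used in place of the fixed 200 ms pause in the pagination loop. This is a minimal sketch; the bounds are arbitrary starting points, not tested thresholds.

// Minimal jitter helper; bounds are arbitrary defaults, tune to your own tolerance.
function randomDelay(minMs = 150, maxMs = 600): Promise<void> {
  const ms = minMs + Math.random() * (maxMs - minMs);
  return new Promise((resolve) => setTimeout(resolve, ms));
}

// Inside the pagination loop, replace the fixed pause with:
// await randomDelay();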
Files Reference
| File | Purpose |
|---|---|
| backend/test-intercept.js | Proof of concept script |
| backend/src/platforms/dutchie/client.ts | GraphQL hashes, curl implementation |
| backend/src/tasks/handlers/payload-fetch.ts | Current curl-based handler |
| backend/src/utils/payload-storage.ts | Payload save/load utilities |
See Also
- DUTCHIE_CRAWL_WORKFLOW.md - Full crawl pipeline documentation
- TASK_WORKFLOW_2024-12-10.md - Task system architecture
- CLAUDE.md - Project rules and constraints