# Organic Browser-Based Scraping Guide

**Last Updated:** 2025-12-12
**Status:** Production-ready proof of concept

---

## Overview

This document describes the "organic" browser-based approach to scraping Dutchie dispensary menus. Unlike direct curl/axios requests, this method makes its API calls from inside a real browser session, so requests carry the browser's own TLS fingerprint, cookies, and headers, which reduces detection risk.

---

## Why Organic Scraping?

| Approach | Detection Risk | Speed | Complexity |
|----------|---------------|-------|------------|
| Direct curl | Higher | Fast | Low |
| curl-impersonate | Medium | Fast | Medium |
| **Browser-based (organic)** | **Lowest** | Slower | Higher |

Direct curl requests can be fingerprinted via:
- TLS fingerprint (cipher suites, extensions)
- Header order and values
- Missing cookies/session data
- Request patterns

Browser-based requests inherit:
- Real Chrome TLS fingerprint
- Session cookies from page visit
- Natural header order
- JavaScript execution environment

---

## Implementation

### Dependencies

```bash
npm install puppeteer puppeteer-extra puppeteer-extra-plugin-stealth
```

### Core Script: `test-intercept.js`

Located at: `backend/test-intercept.js`

```javascript
const puppeteer = require('puppeteer-extra');
const StealthPlugin = require('puppeteer-extra-plugin-stealth');
const fs = require('fs');

puppeteer.use(StealthPlugin());

async function capturePayload(config) {
  const { dispensaryId, platformId, cName, outputPath } = config;

  const browser = await puppeteer.launch({
    headless: 'new',
    args: ['--no-sandbox', '--disable-setuid-sandbox']
  });

  let result;
  try {
    const page = await browser.newPage();

    // STEP 1: Establish session by visiting the menu
    const embedUrl = `https://dutchie.com/embedded-menu/${cName}?menuType=rec`;
    await page.goto(embedUrl, { waitUntil: 'networkidle2', timeout: 60000 });

    // STEP 2: Fetch ALL products using GraphQL from browser context
    result = await page.evaluate(async (platformId) => {
      const allProducts = [];
      let pageNum = 0;
      const perPage = 100;
      let totalCount = 0;
      const sessionId = 'browser-session-' + Date.now();

      // Hard cap of 30 pages (3000 products) guards against an endless loop
      while (pageNum < 30) {
        const variables = {
          includeEnterpriseSpecials: false,
          productsFilter: {
            dispensaryId: platformId,
            pricingType: 'rec',
            Status: 'Active', // CRITICAL: Must be 'Active', not null
            types: [],
            useCache: true,
            isDefaultSort: true,
            sortBy: 'popularSortIdx',
            sortDirection: 1,
            bypassOnlineThresholds: true,
            isKioskMenu: false,
            removeProductsBelowOptionThresholds: false,
          },
          page: pageNum,
          perPage: perPage,
        };

        const extensions = {
          persistedQuery: {
            version: 1,
            sha256Hash: 'ee29c060826dc41c527e470e9ae502c9b2c169720faa0a9f5d25e1b9a530a4a0'
          }
        };

        const qs = new URLSearchParams({
          operationName: 'FilteredProducts',
          variables: JSON.stringify(variables),
          extensions: JSON.stringify(extensions)
        });

        const response = await fetch(`https://dutchie.com/api-3/graphql?${qs}`, {
          method: 'GET',
          headers: {
            'Accept': 'application/json',
            'content-type': 'application/json',
            'x-dutchie-session': sessionId,
            'apollographql-client-name': 'Marketplace (production)',
          },
          credentials: 'include'
        });

        const json = await response.json();
        const data = json?.data?.filteredProducts;
        if (!data?.products) break;

        allProducts.push(...data.products);

        // totalCount is read once, from the first page
        if (pageNum === 0) totalCount = data.queryInfo?.totalCount || 0;
        if (allProducts.length >= totalCount) break;

        pageNum++;
        await new Promise(r => setTimeout(r, 200)); // Polite delay
      }

      return { products: allProducts, totalCount };
    }, platformId);
  } finally {
    // Close the browser even if navigation or the fetch loop throws
    await browser.close();
  }

  // STEP 3: Save payload
  const payload = {
    dispensaryId,
    platformId,
    cName,
    fetchedAt: new Date().toISOString(),
    productCount: result.products.length,
    products: result.products,
  };

  fs.writeFileSync(outputPath, JSON.stringify(payload, null, 2));
  return payload;
}
```

---

## Critical Parameters

### GraphQL Hash (FilteredProducts)

```
ee29c060826dc41c527e470e9ae502c9b2c169720faa0a9f5d25e1b9a530a4a0
```

**WARNING:** Using the wrong hash returns HTTP 400.

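
Because a wrong hash only shows up as an HTTP 400 at request time, it can help to build the request URL in one place and sanity-check the hash before sending anything. The following is a minimal sketch of that idea, not code from the repo: the helper names `buildFilteredProductsUrl` and `explainStatus` are hypothetical, while the hash, endpoint, and `operationName` are copied from this guide. The format check only catches truncated or mistyped hashes; a well-formed but outdated hash would still be rejected by the server.

```javascript
// Values copied from this guide; the helper names below are hypothetical.
const FILTERED_PRODUCTS_HASH =
  'ee29c060826dc41c527e470e9ae502c9b2c169720faa0a9f5d25e1b9a530a4a0';
const GRAPHQL_ENDPOINT = 'https://dutchie.com/api-3/graphql';

// Build the GET URL for the FilteredProducts persisted query.
// Throws early if the hash is not a 64-character lowercase hex string.
function buildFilteredProductsUrl(variables, hash = FILTERED_PRODUCTS_HASH) {
  if (!/^[0-9a-f]{64}$/.test(hash)) {
    throw new Error(`Persisted query hash is not a 64-char hex SHA-256: ${hash}`);
  }
  const qs = new URLSearchParams({
    operationName: 'FilteredProducts',
    variables: JSON.stringify(variables),
    extensions: JSON.stringify({
      persistedQuery: { version: 1, sha256Hash: hash },
    }),
  });
  return `${GRAPHQL_ENDPOINT}?${qs}`;
}

// Map an HTTP status to a hint, based on the failure modes listed in this guide.
function explainStatus(status) {
  if (status === 400) {
    return 'HTTP 400: check the persisted-query hash and the Status filter';
  }
  if (status >= 200 && status < 300) return 'ok';
  return `unexpected HTTP ${status}`;
}
```

Inside `page.evaluate`, the loop could call `fetch(buildFilteredProductsUrl(variables), ...)` and log `explainStatus(response.status)` before parsing the body, provided both helpers are defined inside the evaluated function, since code running in the page cannot see Node-side scope.
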
### Status Parameter

| Value | Result |
|-------|--------|
| `'Active'` | Returns in-stock products (1019 in test) |
| `null` | Returns 0 products |
| `'All'` | Returns HTTP 400 |

**ALWAYS use `Status: 'Active'`**

### Required Headers

```javascript
{
  'Accept': 'application/json',
  'content-type': 'application/json',
  'x-dutchie-session': 'unique-session-id',
  'apollographql-client-name': 'Marketplace (production)',
}
```

### Endpoint

```
https://dutchie.com/api-3/graphql
```

---

## Performance Benchmarks

Test store: AZ-Deeply-Rooted (1019 products)

| Metric | Value |
|--------|-------|
| Total products | 1019 |
| Time | 18.5 seconds |
| Payload size | 11.8 MB |
| Pages fetched | 11 (100 per page) |
| Success rate | 100% |

---

## Payload Format

The output matches the existing `payload-fetch.ts` handler format:

```json
{
  "dispensaryId": 123,
  "platformId": "6405ef617056e8014d79101b",
  "cName": "AZ-Deeply-Rooted",
  "fetchedAt": "2025-12-12T05:05:19.837Z",
  "productCount": 1019,
  "products": [
    {
      "id": "6927508db4851262f629a869",
      "Name": "Product Name",
      "brand": { "name": "Brand Name", ... },
      "type": "Flower",
      "THC": "25%",
      "Prices": [...],
      "Options": [...],
      ...
    }
  ]
}
```

---

## Integration Points

### As a Task Handler

The organic approach can be integrated as an alternative to curl-based fetching:

```typescript
// In src/tasks/handlers/organic-payload-fetch.ts
export async function handleOrganicPayloadFetch(ctx: TaskContext): Promise<void> {
  // Use puppeteer-based capture
  // Save to same payload storage
  // Queue product_refresh task
}
```

### Worker Configuration

Add to job_schedules:

```sql
INSERT INTO job_schedules (name, role, cron_expression)
VALUES ('organic_product_crawl', 'organic_payload_fetch', '0 */6 * * *');
```

---

## Troubleshooting

### HTTP 400 Bad Request
- Check the hash is correct: `ee29c060...`
- Verify Status is the string `'Active'`; `'All'` returns HTTP 400

### 0 Products Returned
- Status was likely `null`; use `'Active'`
- Check that platformId is a valid MongoDB ObjectId

### Session Not Established
- Increase the timeout on the initial page.goto()
- Check that cName is valid (matches the embedded-menu URL)

### Detection/Blocking
- StealthPlugin should handle most cases
- Add random delays between pages
- Use headless: 'new' (not true/false)

---

## Files Reference

| File | Purpose |
|------|---------|
| `backend/test-intercept.js` | Proof of concept script |
| `backend/src/platforms/dutchie/client.ts` | GraphQL hashes, curl implementation |
| `backend/src/tasks/handlers/payload-fetch.ts` | Current curl-based handler |
| `backend/src/utils/payload-storage.ts` | Payload save/load utilities |

---

## See Also

- `DUTCHIE_CRAWL_WORKFLOW.md` - Full crawl pipeline documentation
- `TASK_WORKFLOW_2024-12-10.md` - Task system architecture
- `CLAUDE.md` - Project rules and constraints