cannaiq/backend/docs/_archive/ORGANIC_SCRAPING_GUIDE.md


Organic Browser-Based Scraping Guide

Last Updated: 2025-12-12
Status: Production-ready proof of concept


Overview

This document describes the "organic" browser-based approach to scraping Dutchie dispensary menus. Unlike direct curl/axios requests, this method uses a real browser session to make API calls, making requests appear natural and reducing detection risk.


Why Organic Scraping?

| Approach | Detection Risk | Speed | Complexity |
|---|---|---|---|
| Direct curl | Higher | Fast | Low |
| curl-impersonate | Medium | Fast | Medium |
| Browser-based (organic) | Lowest | Slower | Higher |

Direct curl requests can be fingerprinted via:

  • TLS fingerprint (cipher suites, extensions)
  • Header order and values
  • Missing cookies/session data
  • Request patterns

Browser-based requests inherit:

  • Real Chrome TLS fingerprint
  • Session cookies from page visit
  • Natural header order
  • JavaScript execution environment

Implementation

Dependencies

npm install puppeteer puppeteer-extra puppeteer-extra-plugin-stealth

Core Script: test-intercept.js

Located at: backend/test-intercept.js

const puppeteer = require('puppeteer-extra');
const StealthPlugin = require('puppeteer-extra-plugin-stealth');
const fs = require('fs');

puppeteer.use(StealthPlugin());

async function capturePayload(config) {
  const { dispensaryId, platformId, cName, outputPath } = config;

  const browser = await puppeteer.launch({
    headless: 'new',
    args: ['--no-sandbox', '--disable-setuid-sandbox']
  });

  const page = await browser.newPage();

  // STEP 1: Establish session by visiting the menu
  const embedUrl = `https://dutchie.com/embedded-menu/${cName}?menuType=rec`;
  await page.goto(embedUrl, { waitUntil: 'networkidle2', timeout: 60000 });

  // STEP 2: Fetch ALL products using GraphQL from browser context
  const result = await page.evaluate(async (platformId) => {
    const allProducts = [];
    let pageNum = 0;
    const perPage = 100;
    let totalCount = 0;
    const sessionId = 'browser-session-' + Date.now();

    while (pageNum < 30) {
      const variables = {
        includeEnterpriseSpecials: false,
        productsFilter: {
          dispensaryId: platformId,
          pricingType: 'rec',
          Status: 'Active',  // CRITICAL: Must be 'Active', not null
          types: [],
          useCache: true,
          isDefaultSort: true,
          sortBy: 'popularSortIdx',
          sortDirection: 1,
          bypassOnlineThresholds: true,
          isKioskMenu: false,
          removeProductsBelowOptionThresholds: false,
        },
        page: pageNum,
        perPage: perPage,
      };

      const extensions = {
        persistedQuery: {
          version: 1,
          sha256Hash: 'ee29c060826dc41c527e470e9ae502c9b2c169720faa0a9f5d25e1b9a530a4a0'
        }
      };

      const qs = new URLSearchParams({
        operationName: 'FilteredProducts',
        variables: JSON.stringify(variables),
        extensions: JSON.stringify(extensions)
      });

      const response = await fetch(`https://dutchie.com/api-3/graphql?${qs}`, {
        method: 'GET',
        headers: {
          'Accept': 'application/json',
          'content-type': 'application/json',
          'x-dutchie-session': sessionId,
          'apollographql-client-name': 'Marketplace (production)',
        },
        credentials: 'include'
      });

      const json = await response.json();
      const data = json?.data?.filteredProducts;
      if (!data?.products?.length) break; // stop on a missing or empty page

      allProducts.push(...data.products);
      if (pageNum === 0) totalCount = data.queryInfo?.totalCount || 0;
      if (allProducts.length >= totalCount) break;

      pageNum++;
      await new Promise(r => setTimeout(r, 200)); // Polite delay
    }

    return { products: allProducts, totalCount };
  }, platformId);

  await browser.close();

  // STEP 3: Save payload
  const payload = {
    dispensaryId,
    platformId,
    cName,
    fetchedAt: new Date().toISOString(),
    productCount: result.products.length,
    products: result.products,
  };

  fs.writeFileSync(outputPath, JSON.stringify(payload, null, 2));
  return payload;
}

Critical Parameters

GraphQL Hash (FilteredProducts)

ee29c060826dc41c527e470e9ae502c9b2c169720faa0a9f5d25e1b9a530a4a0

WARNING: Using the wrong hash returns HTTP 400.

Status Parameter

| Value | Result |
|---|---|
| 'Active' | Returns in-stock products (1019 in test) |
| null | Returns 0 products |
| 'All' | Returns HTTP 400 |

ALWAYS use Status: 'Active'

Required Headers

{
  'Accept': 'application/json',
  'content-type': 'application/json',
  'x-dutchie-session': 'unique-session-id',
  'apollographql-client-name': 'Marketplace (production)',
}

Endpoint

https://dutchie.com/api-3/graphql
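
Taken together, the hash, Status value, headers and endpoint above describe a single GET request. The sketch below assembles that request as plain data so it can be inspected or logged before it is sent from the browser context. The function and constant names are illustrative and not part of the codebase; every value is copied from this guide.

```javascript
// Values copied from this guide; the names below are illustrative only.
const GRAPHQL_ENDPOINT = 'https://dutchie.com/api-3/graphql';
const FILTERED_PRODUCTS_HASH =
  'ee29c060826dc41c527e470e9ae502c9b2c169720faa0a9f5d25e1b9a530a4a0';

// Builds the URL and fetch options for one page of FilteredProducts.
function buildFilteredProductsRequest(platformId, pageNum, sessionId, perPage = 100) {
  const variables = {
    includeEnterpriseSpecials: false,
    productsFilter: {
      dispensaryId: platformId,
      pricingType: 'rec',
      Status: 'Active', // must be the string 'Active'
      types: [],
      useCache: true,
      isDefaultSort: true,
      sortBy: 'popularSortIdx',
      sortDirection: 1,
      bypassOnlineThresholds: true,
      isKioskMenu: false,
      removeProductsBelowOptionThresholds: false,
    },
    page: pageNum,
    perPage,
  };
  const extensions = {
    persistedQuery: { version: 1, sha256Hash: FILTERED_PRODUCTS_HASH },
  };
  const qs = new URLSearchParams({
    operationName: 'FilteredProducts',
    variables: JSON.stringify(variables),
    extensions: JSON.stringify(extensions),
  });
  return {
    url: `${GRAPHQL_ENDPOINT}?${qs}`,
    options: {
      method: 'GET',
      headers: {
        'Accept': 'application/json',
        'content-type': 'application/json',
        'x-dutchie-session': sessionId,
        'apollographql-client-name': 'Marketplace (production)',
      },
      credentials: 'include',
    },
  };
}
```

Because the script runs its fetch inside page.evaluate(), a helper like this would have to be defined inside that callback (or its result passed in as an argument), since functions in the Node process are not visible to the page.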

Performance Benchmarks

Test store: AZ-Deeply-Rooted (1019 products)

| Metric | Value |
|---|---|
| Total products | 1019 |
| Time | 18.5 seconds |
| Payload size | 11.8 MB |
| Pages fetched | 11 (100 per page) |
| Success rate | 100% |
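
The page count in the benchmark follows directly from the product total and the page size. The short sketch below reproduces that arithmetic and also checks a store against the 30-page loop bound used in the script. The helper names are illustrative.

```javascript
// Number of pages needed to fetch totalCount products at perPage per request.
function pagesNeeded(totalCount, perPage) {
  return Math.ceil(totalCount / perPage);
}

// True when every product fits under the loop bound (pageNum < pageCap).
function fitsWithinCap(totalCount, perPage, pageCap) {
  return pagesNeeded(totalCount, perPage) <= pageCap;
}

const pages = pagesNeeded(1019, 100);           // 11, matching the benchmark
const withinCap = fitsWithinCap(1019, 100, 30); // true
console.log({ pages, withinCap });
```

With 100 products per page and a 30-page bound, the script as written collects at most 3000 products. A larger menu would be cut off without an error, so the bound may need raising for large stores.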

Payload Format

The output matches the existing payload-fetch.ts handler format:

{
  "dispensaryId": 123,
  "platformId": "6405ef617056e8014d79101b",
  "cName": "AZ-Deeply-Rooted",
  "fetchedAt": "2025-12-12T05:05:19.837Z",
  "productCount": 1019,
  "products": [
    {
      "id": "6927508db4851262f629a869",
      "Name": "Product Name",
      "brand": { "name": "Brand Name", ... },
      "type": "Flower",
      "THC": "25%",
      "Prices": [...],
      "Options": [...],
      ...
    }
  ]
}
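
Before a saved payload is handed to downstream tasks, a quick structural check can catch truncated or mismatched files. The following is a minimal sketch that checks only the top-level fields shown above. The function name is hypothetical, and the rules (for example, that productCount must equal products.length) are assumptions based on how the script builds the payload, not a documented schema.

```javascript
// Returns a list of problems found in a payload; an empty list means it looks sane.
function validatePayload(payload) {
  if (payload === null || typeof payload !== 'object') {
    return ['payload is not an object'];
  }
  const errors = [];
  if (payload.dispensaryId === undefined || payload.dispensaryId === null) {
    errors.push('missing dispensaryId');
  }
  for (const key of ['platformId', 'cName', 'fetchedAt']) {
    if (typeof payload[key] !== 'string' || payload[key] === '') {
      errors.push(`missing ${key}`);
    }
  }
  if (typeof payload.fetchedAt === 'string' && Number.isNaN(Date.parse(payload.fetchedAt))) {
    errors.push('fetchedAt is not a parseable date');
  }
  if (!Array.isArray(payload.products)) {
    errors.push('products is not an array');
  } else if (payload.productCount !== payload.products.length) {
    errors.push(
      `productCount ${payload.productCount} does not match products.length ${payload.products.length}`
    );
  }
  return errors;
}
```

A caller might refuse to queue follow-up work when the returned list is non-empty and log the messages instead.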

Integration Points

As a Task Handler

The organic approach can be integrated as an alternative to curl-based fetching:

// In src/tasks/handlers/organic-payload-fetch.ts
export async function handleOrganicPayloadFetch(ctx: TaskContext): Promise<TaskResult> {
  // Use puppeteer-based capture
  // Save to same payload storage
  // Queue product_refresh task
}
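
The three commented steps in the stub could be wired together roughly as follows. This is a JavaScript sketch under stated assumptions, not the project's handler: the injected dependencies (`capturePayload`, `savePayload`, `enqueueTask`), the task fields and the result shape are all hypothetical stand-ins, since the real TaskContext and TaskResult types are not shown in this guide.

```javascript
// Hypothetical wiring: dependencies are injected so the flow can be exercised
// without a browser, storage layer, or task queue.
function createOrganicPayloadFetchHandler({ capturePayload, savePayload, enqueueTask }) {
  return async function handleOrganicPayloadFetch(task) {
    // 1. Use the puppeteer-based capture.
    const payload = await capturePayload({
      dispensaryId: task.dispensaryId,
      platformId: task.platformId,
      cName: task.cName,
    });
    if (!payload || payload.productCount === 0) {
      return { success: false, error: 'no products captured' };
    }

    // 2. Save to the same payload storage as the curl-based handler.
    const payloadRef = await savePayload(payload);

    // 3. Queue the product_refresh task for downstream processing.
    await enqueueTask('product_refresh', {
      dispensaryId: payload.dispensaryId,
      payloadRef,
    });

    return { success: true, productCount: payload.productCount };
  };
}
```

Injecting the three dependencies keeps the browser automation out of the handler's control flow, which makes the success and empty-result paths easy to test with stubs.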

Worker Configuration

Add to job_schedules:

INSERT INTO job_schedules (name, role, cron_expression)
VALUES ('organic_product_crawl', 'organic_payload_fetch', '0 */6 * * *');

Troubleshooting

HTTP 400 Bad Request

  • Check hash is correct: ee29c060...
  • Verify Status is the string 'Active'; 'All' returns HTTP 400

0 Products Returned

  • Status was likely null - use 'Active' ('All' returns HTTP 400 rather than an empty result)
  • Check platformId is valid MongoDB ObjectId
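
Both of these failure modes can be caught before launching a browser. The sketch below is a hypothetical pre-flight check: it validates that platformId has the 24-character hexadecimal form of a MongoDB ObjectId and that Status is exactly 'Active'. The function names and message wording are illustrative.

```javascript
// A MongoDB ObjectId in string form is 24 hexadecimal characters.
const OBJECT_ID_PATTERN = /^[0-9a-f]{24}$/i;

function isValidObjectId(id) {
  return typeof id === 'string' && OBJECT_ID_PATTERN.test(id);
}

// Returns human-readable problems for a planned request; empty means none found.
function preflightProblems({ platformId, status }) {
  const problems = [];
  if (status === null || status === undefined) {
    problems.push("Status is null: the guide's test returned 0 products");
  } else if (status !== 'Active') {
    problems.push(`Status must be 'Active', got '${status}' ('All' returned HTTP 400 in the guide's test)`);
  }
  if (!isValidObjectId(platformId)) {
    problems.push('platformId is not a 24-character hex ObjectId');
  }
  return problems;
}

console.log(preflightProblems({ platformId: '6405ef617056e8014d79101b', status: 'Active' }));
```

A format check cannot tell whether the ID belongs to a real dispensary, so a well-formed but wrong platformId can still yield an empty result.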

Session Not Established

  • Increase timeout on initial page.goto()
  • Check cName is valid (matches embedded-menu URL)
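
One way to apply the timeout advice without hard-coding a single large value is to retry the initial navigation with progressively longer timeouts. This is a minimal sketch, assuming `page` is a Puppeteer Page; the helper name and the specific timeout values are illustrative.

```javascript
// Tries page.goto() with each timeout in turn and rethrows the last error
// if every attempt fails.
async function gotoWithRetry(page, url, timeouts = [60000, 90000, 120000]) {
  let lastError;
  for (const timeout of timeouts) {
    try {
      return await page.goto(url, { waitUntil: 'networkidle2', timeout });
    } catch (err) {
      lastError = err; // keep the most recent failure for the caller
    }
  }
  throw lastError;
}
```

If every attempt fails with the same cName, the more likely cause is an invalid embedded-menu URL rather than a slow page.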

Detection/Blocking

  • StealthPlugin should handle most cases
  • Add random delays between pages
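  • Use headless: 'new' (not true/false) on Puppeteer releases that accept it; from Puppeteer v22 onward, headless: true selects the same new headless mode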
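
The random-delay suggestion can be sketched as a small jitter helper that replaces the fixed 200 ms pause between pages. The names and the 200 to 800 ms range are illustrative choices, not values taken from the codebase.

```javascript
// Returns an integer delay in [minMs, maxMs); `random` is injectable for testing.
function jitteredDelayMs(minMs, maxMs, random = Math.random) {
  return Math.floor(minMs + random() * (maxMs - minMs));
}

// Resolves after the given number of milliseconds.
function sleep(ms) {
  return new Promise((resolve) => setTimeout(resolve, ms));
}

// Example: pause for a random 200-800 ms between page requests.
async function politePause() {
  await sleep(jitteredDelayMs(200, 800));
}
```

Since the pagination loop runs inside page.evaluate(), these helpers would need to be defined inside that callback, or the delay inlined as `200 + Math.random() * 600`.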

Files Reference

| File | Purpose |
|---|---|
| backend/test-intercept.js | Proof of concept script |
| backend/src/platforms/dutchie/client.ts | GraphQL hashes, curl implementation |
| backend/src/tasks/handlers/payload-fetch.ts | Current curl-based handler |
| backend/src/utils/payload-storage.ts | Payload save/load utilities |

See Also

  • DUTCHIE_CRAWL_WORKFLOW.md - Full crawl pipeline documentation
  • TASK_WORKFLOW_2024-12-10.md - Task system architecture
  • CLAUDE.md - Project rules and constraints