cannaiq/CLAUDE.md
Kelly bd65674f3a fix(menu-detection): remove non-existent platform_dispensary_id_resolved_at column
The UPDATE query was trying to set a column that doesn't exist in the database
schema, causing platform ID resolution to fail silently. Now stores the
resolved_at timestamp in provider_detection_data JSONB instead.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-12-03 19:40:22 -07:00


Claude Guidelines for this Project

Core Rules Summary

  • DB: Use the single consolidated DB (CRAWLSY_DATABASE_URL → DATABASE_URL); no dual pools; schema_migrations must exist; apply migrations 031/032/033.
  • Images: No MinIO. Save to local /images/products/<dispensary_id>/<product_id>-<hash>.webp (and brands); preserve original URL; serve via backend static.
  • Dutchie GraphQL: Endpoint https://dutchie.com/api-3/graphql. Variables must use productsFilter.dispensaryId (platform_dispensary_id). Mode A: Status="Active". Mode B: Status=null/activeOnly:false. No dispensaryFilter.cNameOrID.
  • cName/slug: Derive cName from each store's menu_url (/embedded-menu/ or /dispensary/). No hardcoded defaults. Each location must have its own valid menu_url and platform_dispensary_id; do not reuse IDs across locations. If slug is invalid/missing, mark not crawlable and log; resolve ID before crawling.
  • Dual-mode always: useBothModes:true to get pricing (Mode A) + full coverage (Mode B).
  • Batch DB writes: Chunk products/snapshots/missing (100–200) to avoid OOM.
  • OOS/missing: Include inactive/OOS in Mode B. Union A+B, dedupe by external_product_id+dispensary_id. Insert snapshots with stock_status; if absent from both modes, insert missing_from_feed. Do not filter OOS by default.
  • API/Frontend: Use /api/az/... endpoints (stores/products/brands/categories/summary/dashboard). Rebuild frontend with VITE_API_URL pointing to the backend.
  • Scheduling: Crawl only menu_type='dutchie' AND platform_dispensary_id IS NOT NULL. 4-hour crawl with jitter; detection job to set menu_type and resolve platform IDs.
  • Monitor: /scraper-monitor (and /az-schedule) should show active/recent jobs from job_run_logs/crawl_jobs, with auto-refresh.
  • No slug guessing: Never use defaults like "AZ-Deeply-Rooted." Always derive per store from menu_url and resolve platform IDs per location.

Detailed Rules

  1. Use the consolidated DB everywhere

    • Preferred env: CRAWLSY_DATABASE_URL (fallback DATABASE_URL).
    • Do NOT create dutchie tables in the legacy DB. Apply migrations 031/032/033 to the consolidated DB and restart.
  2. Dispensary vs Store

    • Dutchie pipeline uses dispensaries (not legacy stores). For dutchie crawls, always work with dispensary ID.
    • Ignore legacy fields like dutchie_plus_id and slug guessing. Use the record's menu_url and platform_dispensary_id.
  3. Menu detection and platform IDs

    • Set menu_type from menu_url detection; resolve platform_dispensary_id for menu_type='dutchie'.
    • Admin should have "refresh detection" and "resolve ID" actions; schedule/crawl only when menu_type='dutchie' AND platform_dispensary_id is set.
  4. Queries and mapping

    • The DB returns snake_case; code expects camelCase. Always alias/map:
      • platform_dispensary_id AS "platformDispensaryId"
      • Map via mapDbRowToDispensary when loading dispensaries (scheduler, crawler, admin crawl).
    • Avoid SELECT *; explicitly select and/or map fields.
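    • A minimal sketch of this mapping, in the spirit of mapDbRowToDispensary (the field list here is illustrative; align it with the real dispensaries schema):

```typescript
// Row shape as returned by Postgres (snake_case).
interface DispensaryRow {
  id: number;
  name: string;
  menu_url: string | null;
  menu_type: string | null;
  platform_dispensary_id: string | null;
}

// Shape the rest of the code expects (camelCase).
interface Dispensary {
  id: number;
  name: string;
  menuUrl: string | null;
  menuType: string | null;
  platformDispensaryId: string | null;
}

// Explicit field-by-field mapping; no SELECT * pass-through.
function mapDbRowToDispensary(row: DispensaryRow): Dispensary {
  return {
    id: row.id,
    name: row.name,
    menuUrl: row.menu_url,
    menuType: row.menu_type,
    platformDispensaryId: row.platform_dispensary_id,
  };
}
```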
  5. Scheduling

    • /scraper-schedule should accept filters/search (All vs AZ-only, name).
    • "Run Now"/scheduler must skip or warn if menu_type!='dutchie' or platform_dispensary_id missing.
    • Use dispensary_crawl_status view; show reason when not crawlable.
  6. Crawling

    • Trigger dutchie crawls by dispensary ID (e.g., /api/az/admin/crawl/:id or runDispensaryOrchestrator(id)).
    • Update existing products (by stable product ID), append snapshots for history (every 4h cadence), download images locally (/images/...), store local URLs.
    • Use dutchie GraphQL pipeline only for menu_type='dutchie'.
  7. Frontend

    • Forward-facing URLs: /api/az, /az, /az-schedule; no vendor names.
    • /scraper-schedule: add filters/search, keep as master view for all schedules; reflect platform ID/menu_type status and controls (resolve ID, run now, enable/disable/delete).
  8. No slug guessing

    • Do not guess slugs; use the DB record's menu_url and ID. Resolve platform ID from the URL/cName; if set, crawl directly by ID.
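    • A hypothetical helper for this derivation (the function name is ours, not from the codebase). It returns null when the URL does not match a known Dutchie path, so callers can mark the store not crawlable instead of falling back to a default slug:

```typescript
// Derive cName from a store's menu_url; never guess or hardcode a default.
// Handles both /embedded-menu/<cName> and /dispensary/<cName> paths.
function deriveCName(menuUrl: string | null): string | null {
  if (!menuUrl) return null;
  const match = menuUrl.match(/\/(?:embedded-menu|dispensary)\/([^/?#]+)/);
  return match ? match[1] : null;
}
```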
  9. Verify locally before pushing

    • Apply migrations, restart backend, ensure auth (users table) exists, run dutchie crawl for a known dispensary (e.g., Deeply Rooted), check /api/az/dashboard, /api/az/stores/:id/products, /az, /scraper-schedule.
  10. Image storage (no MinIO)

    • Save images to local filesystem only. Do not create or use MinIO in Docker.
    • Product images: /images/products/<dispensary_id>/<product_id>-<hash>.webp (+medium/+thumb).
    • Brand images: /images/brands/<brand_slug_or_sku>-<hash>.webp.
    • Store local URLs in DB fields (keep original URLs as fallback only).
    • Serve /images via backend static middleware.
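    • Illustrative path builders matching the layout above; the variant suffix and hashing scheme are assumptions (any stable content hash works):

```typescript
// Build the local product image path; variant covers the +medium/+thumb sizes
// (suffix convention assumed here).
function productImagePath(
  dispensaryId: number,
  productId: number,
  hash: string,
  variant?: "medium" | "thumb"
): string {
  const suffix = variant ? `-${variant}` : "";
  return `/images/products/${dispensaryId}/${productId}-${hash}${suffix}.webp`;
}

// Build the local brand image path from the brand slug (or SKU) and hash.
function brandImagePath(brandSlug: string, hash: string): string {
  return `/images/brands/${brandSlug}-${hash}.webp`;
}
```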
  11. Dutchie GraphQL fetch rules

    • Endpoint: https://dutchie.com/api-3/graphql (NOT api-gw.dutchie.com which no longer exists).
    • Variables: Use productsFilter.dispensaryId = platform_dispensary_id (MongoDB ObjectId, e.g., 6405ef617056e8014d79101b).
    • Do NOT use dispensaryFilter.cNameOrID - that's outdated.
    • cName (e.g., AZ-Deeply-Rooted) is only for Referer/Origin headers and Puppeteer session bootstrapping.
    • Mode A: Status: "Active" - returns active products with pricing
    • Mode B: Status: null / activeOnly: false - returns all products including OOS/inactive
    • Example payload:
      {
        "operationName": "FilteredProducts",
        "variables": {
          "productsFilter": {
            "dispensaryId": "6405ef617056e8014d79101b",
            "pricingType": "rec",
            "Status": "Active"
          }
        },
        "extensions": {
          "persistedQuery": { "version": 1, "sha256Hash": "<hash>" }
        }
      }
      
    • Headers (server-side axios only): Chrome UA, Origin: https://dutchie.com, Referer: https://dutchie.com/embedded-menu/<cName>, accept: application/json, content-type: application/json.
    • If local DNS can't resolve, run fetch from an environment that can (K8s pod/remote host), not from browser.
    • Use server-side axios with embedded-menu headers; include CF/session cookie from Puppeteer if needed.
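    • A sketch of that header set for server-side axios requests; the UA string is a placeholder for a current Chrome UA, and the optional cookie is the CF/session cookie captured by Puppeteer when needed:

```typescript
// Build headers for a server-side Dutchie GraphQL request. cName is only used
// for the Referer; the query itself must use productsFilter.dispensaryId.
function buildDutchieHeaders(cName: string, cookie?: string): Record<string, string> {
  const headers: Record<string, string> = {
    "user-agent":
      "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    origin: "https://dutchie.com",
    referer: `https://dutchie.com/embedded-menu/${cName}`,
    accept: "application/json",
    "content-type": "application/json",
  };
  // Attach the Puppeteer-captured cookie (e.g., cf_clearance) when provided.
  if (cookie) headers.cookie = cookie;
  return headers;
}
```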
  12. Stop over-prep; run the crawl

    • To seed/refresh a store, run a one-off crawl by dispensary ID (example for Deeply Rooted):
      DATABASE_URL="postgresql://dutchie:dutchie_local_pass@localhost:54320/dutchie_menus" \
      npx tsx -e "const { crawlDispensaryProducts } = require('./src/dutchie-az/services/product-crawler'); const d={id:112,name:'Deeply Rooted',platform:'dutchie',platformDispensaryId:'6405ef617056e8014d79101b',menuType:'dutchie'}; crawlDispensaryProducts(d,'rec',{useBothModes:true}).then(r=>{console.log(r);process.exit(0);}).catch(e=>{console.error(e);process.exit(1);});"
      
      If local DNS is blocked, run the same command inside the scraper pod via kubectl exec ... -- bash -lc '...'.
    • After crawl, verify counts via dutchie_products, dutchie_product_snapshots, and dispensaries.last_crawl_at. Do not inspect the legacy products table for Dutchie.
  13. Fetch troubleshooting

    • If 403 or empty data: log status + first GraphQL error; include cf_clearance/session cookie from Puppeteer; ensure headers match a real Chrome request; ensure variables use productsFilter.dispensaryId.
    • If DNS fails locally, do NOT debug DNS—run the fetch from an environment that resolves (K8s/remote) or via Puppeteer-captured headers/cookies. No browser/CORS attempts.
  14. Views and metrics

    • Keep v_brands/v_categories/v_brand_history based on dutchie_products and preserve brand_count metrics. Do not drop brand_count.
  15. Batch DB writes to avoid OOM

    • Do NOT build one giant upsert/insert payload for products/snapshots/missing marks.
    • Chunk arrays (e.g., 100–200 items) and upsert/insert in a loop; drop references after each chunk.
    • Apply to products, product snapshots, and any "mark missing" logic to keep memory low during crawls.
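    • A minimal chunking sketch: split a large array into fixed-size slices and write each slice in its own statement, so only one chunk's payload is alive at a time (the insertSnapshots call in the usage comment is hypothetical):

```typescript
// Split items into slices of at most `size` elements (default inside the
// 100-200 range the rule prescribes).
function chunk<T>(items: T[], size = 150): T[][] {
  const out: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    out.push(items.slice(i, i + size));
  }
  return out;
}

// Usage sketch:
//   for (const batch of chunk(snapshots)) await insertSnapshots(batch);
```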
  16. Use dual-mode crawls by default

    • Always run with useBothModes:true to combine:
      • Mode A (active feed with pricing/stock)
      • Mode B (max coverage including OOS/inactive)
    • Union/dedupe by product ID so you keep full coverage and pricing in one run.
    • If you only run Mode B, prices will be null; dual-mode fills pricing while retaining OOS items.
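    • The union/dedupe step can be sketched as follows: key by external_product_id + dispensary_id and let Mode A rows (which carry pricing) win over Mode B duplicates. Field names are assumptions based on the column names above:

```typescript
// Minimal product shape for the union; the real rows carry more fields.
interface FeedProduct {
  externalProductId: string;
  dispensaryId: number;
  price: number | null;
}

// Union Mode A and Mode B results, deduping on the composite key.
function unionModes(modeA: FeedProduct[], modeB: FeedProduct[]): FeedProduct[] {
  const byKey = new Map<string, FeedProduct>();
  // Insert Mode B first, then Mode A, so priced Mode A rows overwrite
  // unpriced Mode B duplicates while Mode B-only (OOS/inactive) rows survive.
  for (const p of [...modeB, ...modeA]) {
    byKey.set(`${p.externalProductId}:${p.dispensaryId}`, p);
  }
  return [...byKey.values()];
}
```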
  17. Capture OOS and missing items

    • GraphQL variables must include inactive/OOS (Status: All / activeOnly:false). Mode B already returns OOS/inactive; union with Mode A to keep pricing.
    • After unioning Mode A/B, upsert products and insert snapshots with stock_status from the feed. If an existing product is absent from both Mode A and Mode B for the run, insert a snapshot with is_present_in_feed=false and stock_status='missing_from_feed'.
    • Do not filter out OOS/missing in the API; only filter when the user requests (e.g., stockStatus=in_stock). Expose stock_status/in_stock from the latest snapshot (fallback to product).
    • Verify with /api/az/stores/:id/products?stockStatus=out_of_stock and ?stockStatus=missing_from_feed.
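    • The missing-from-feed check reduces to a set difference. A sketch, assuming product IDs are compared as strings: given the IDs seen in this run's A+B union and the store's known product IDs, return the IDs that need a snapshot with is_present_in_feed=false and stock_status='missing_from_feed':

```typescript
// IDs of known products that appeared in neither Mode A nor Mode B this run.
function findMissingIds(knownIds: string[], seenThisRun: Set<string>): string[] {
  return knownIds.filter((id) => !seenThisRun.has(id));
}
```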
  18. Menu discovery must crawl the website when menu_url is null

    • For dispensaries with no menu_url or unknown menu_type, crawl the dispensary.website (if present) to find provider links (dutchie, treez, jane, weedmaps, leafly, etc.). Follow “menu/order/shop” links up to a shallow depth with timeouts/rate limits.
    • If a provider link is found, set menu_url, set menu_type, and store detection metadata; if dutchie, derive cName from menu_url and resolve platform_dispensary_id; store resolved_at and detection details.
    • Do NOT mark a dispensary not_crawlable solely because menu_url is null; only mark not_crawlable if the website crawl fails to find a menu or returns 403/404/invalid. Log the reason in provider_detection_data and crawl_status_reason.
    • Keep this as the menu discovery job (separate from product crawls); log successes/errors to job_run_logs. Only schedule product crawls for stores with menu_type='dutchie' AND platform_dispensary_id IS NOT NULL.
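    • An illustrative provider check for the discovery crawl; the hostname list covers the providers named above (the jane domain is an assumption) and is easy to extend:

```typescript
// Map a discovered link to a known menu provider, or null if it's not one.
function detectProvider(linkUrl: string): string | null {
  const providers: Array<[string, string]> = [
    ["dutchie.com", "dutchie"],
    ["treez.io", "treez"],
    ["iheartjane.com", "jane"], // assumed domain for Jane menus
    ["weedmaps.com", "weedmaps"],
    ["leafly.com", "leafly"],
  ];
  let host: string;
  try {
    host = new URL(linkUrl).hostname;
  } catch {
    return null; // not a parseable absolute URL
  }
  const hit = providers.find(
    ([domain]) => host === domain || host.endsWith(`.${domain}`)
  );
  return hit ? hit[1] : null;
}
```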
  19. Preserve all stock statuses (including unknown)

    • Do not filter or drop stock_status values in API/UI; pass through whatever is stored on the latest snapshot/product. Expected values include: in_stock, out_of_stock (if exposed), missing_from_feed, unknown. Only apply filters when explicitly requested by the user.
  20. Never delete or overwrite historical data

    • Do not delete products/snapshots or overwrite historical records. Always append snapshots for changes (price/stock/qty), and mark missing_from_feed instead of removing records. Historical data must remain intact for analytics.
  21. Per-location cName and platform_dispensary_id resolution

    • For each dispensary, menu_url and cName must be valid for that exact location; no hardcoded defaults and no sharing platform_dispensary_id across locations.
    • Derive cName from menu_url per store: /embedded-menu/<cName> or /dispensary/<cName>.
    • Resolve platform_dispensary_id from that cName using GraphQL GetAddressBasedDispensaryData.
    • If the slug is invalid/missing, mark the store not crawlable and log it; do not crawl with a mismatched cName/ID. Store the error in provider_detection_data.resolution_error.
    • Before crawling, validate that the cName from menu_url matches the resolved platform ID; if mismatched, re-resolve before proceeding.
  22. API endpoints (AZ pipeline)

    • Use /api/az/... endpoints: stores, products, brands, categories, summary, dashboard
    • Rebuild frontend with VITE_API_URL pointing to the backend
    • Dispensary Detail and analytics must use AZ endpoints
  23. Monitoring and logging

    • /scraper-monitor (and /az-schedule) should show active/recent jobs from job_run_logs/crawl_jobs
    • Auto-refresh every 30 seconds
    • System Logs page should show real log data, not just startup messages
  24. Dashboard Architecture - CRITICAL

    • Frontend: If you see old labels like "Active Proxies" or "Active Stores", it means the old dashboard bundle is being served. Rebuild the frontend with VITE_API_URL pointing to the correct backend and redeploy. Clear browser cache. Confirm new labels show up.
    • Backend: /api/dashboard/stats MUST use the consolidated DB (same pool as dutchie-az module). Use the correct tables: dutchie_products, dispensaries, and views like v_dashboard_stats, v_latest_snapshots. Do NOT use a separate legacy connection. Do NOT query az_products (doesn't exist) or legacy stores/products tables.
    • DB Connectivity: Use the proper DB host/role. Errors like role "dutchie" does not exist mean you're exec'ing into the wrong Postgres pod or using wrong credentials. Confirm the correct DATABASE_URL and test: kubectl exec deployment/scraper -n dispensary-scraper -- psql $DATABASE_URL -c '\dt'
    • After fixing: Dashboard should show real data (e.g., 777 products) instead of zeros. Do NOT revert to legacy tables; point dashboard queries to the consolidated DB/views.
    • Checklist:
      1. Rebuild/redeploy frontend with correct API URL, clear cache
      2. Fix /api/dashboard/* to use the consolidated DB pool and dutchie views/tables
      3. Test /api/dashboard/stats from the scraper pod; then reload the UI
  25. Deployment (Gitea + Kubernetes)

    • Registry: Gitea at code.cannabrands.app/creationshop/dispensary-scraper
    • Build and push (from backend directory):
      # Login to Gitea container registry
      docker login code.cannabrands.app
      
      # Build the image
      cd backend
      docker build -t code.cannabrands.app/creationshop/dispensary-scraper:latest .
      
      # Push to registry
      docker push code.cannabrands.app/creationshop/dispensary-scraper:latest
      
    • Deploy to Kubernetes:
      # Restart deployments to pull new image
      kubectl rollout restart deployment/scraper -n dispensary-scraper
      kubectl rollout restart deployment/scraper-worker -n dispensary-scraper
      
      # Watch rollout status
      kubectl rollout status deployment/scraper -n dispensary-scraper
      kubectl rollout status deployment/scraper-worker -n dispensary-scraper
      
    • Check pods:
      kubectl get pods -n dispensary-scraper
      kubectl logs -f deployment/scraper -n dispensary-scraper
      kubectl logs -f deployment/scraper-worker -n dispensary-scraper
      
    • K8s manifests are in /k8s/ folder (scraper.yaml, scraper-worker.yaml, etc.)
    • imagePullSecrets use regcred secret for Gitea registry auth
  26. Crawler Architecture

    • Scraper pod (1 replica): Runs the Express API server + scheduler. The scheduler enqueues detection and crawl jobs to the database queue (crawl_jobs table).
    • Scraper-worker pods (5 replicas): Each worker runs dist/dutchie-az/services/worker.js, polling the job queue and processing jobs.
    • Job types processed by workers:
      • menu_detection / menu_detection_single: Detect menu provider type and resolve platform_dispensary_id from menu_url
      • dutchie_product_crawl: Crawl products from Dutchie GraphQL API for dispensaries with valid platform IDs
    • Job schedules (managed in job_schedules table):
      • dutchie_az_menu_detection: Runs daily with 60-min jitter, detects menu type for dispensaries with unknown menu_type
      • dutchie_az_product_crawl: Runs every 4 hours with 30-min jitter, crawls products from all detected Dutchie dispensaries
    • Trigger schedules manually: curl -X POST /api/az/admin/schedules/{id}/trigger
    • Check schedule status: curl /api/az/admin/schedules
    • Worker logs: kubectl logs -f deployment/scraper-worker -n dispensary-scraper