cannaiq/CLAUDE.md
Kelly bd65674f3a fix(menu-detection): remove non-existent platform_dispensary_id_resolved_at column
The UPDATE query was trying to set a column that doesn't exist in the database
schema, causing platform ID resolution to fail silently. Now stores the
resolved_at timestamp in provider_detection_data JSONB instead.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-12-03 19:40:22 -07:00


Claude Guidelines for this Project

Core Rules Summary

  • DB: Use the single consolidated DB (CRAWLSY_DATABASE_URL → DATABASE_URL); no dual pools; schema_migrations must exist; apply migrations 031/032/033.
  • Images: No MinIO. Save to local /images/products/<dispensary_id>/<product_id>-<hash>.webp (and brands); preserve original URL; serve via backend static.
  • Dutchie GraphQL: Endpoint https://dutchie.com/api-3/graphql. Variables must use productsFilter.dispensaryId (platform_dispensary_id). Mode A: Status="Active". Mode B: Status=null/activeOnly:false. No dispensaryFilter.cNameOrID.
  • cName/slug: Derive cName from each store's menu_url (/embedded-menu/ or /dispensary/). No hardcoded defaults. Each location must have its own valid menu_url and platform_dispensary_id; do not reuse IDs across locations. If slug is invalid/missing, mark not crawlable and log; resolve ID before crawling.
  • Dual-mode always: useBothModes:true to get pricing (Mode A) + full coverage (Mode B).
  • Batch DB writes: Chunk products/snapshots/missing (100–200) to avoid OOM.
  • OOS/missing: Include inactive/OOS in Mode B. Union A+B, dedupe by external_product_id+dispensary_id. Insert snapshots with stock_status; if absent from both modes, insert missing_from_feed. Do not filter OOS by default.
  • API/Frontend: Use /api/az/... endpoints (stores/products/brands/categories/summary/dashboard). Rebuild frontend with VITE_API_URL pointing to the backend.
  • Scheduling: Crawl only menu_type='dutchie' AND platform_dispensary_id IS NOT NULL. 4-hour crawl with jitter; detection job to set menu_type and resolve platform IDs.
  • Monitor: /scraper-monitor (and /az-schedule) should show active/recent jobs from job_run_logs/crawl_jobs, with auto-refresh.
  • No slug guessing: Never use defaults like "AZ-Deeply-Rooted." Always derive per store from menu_url and resolve platform IDs per location.

Detailed Rules

  1. Use the consolidated DB everywhere

    • Preferred env: CRAWLSY_DATABASE_URL (fallback DATABASE_URL).
    • Do NOT create dutchie tables in the legacy DB. Apply migrations 031/032/033 to the consolidated DB and restart.
  2. Dispensary vs Store

    • Dutchie pipeline uses dispensaries (not legacy stores). For dutchie crawls, always work with dispensary ID.
    • Ignore legacy fields like dutchie_plus_id and slug guessing. Use the record's menu_url and platform_dispensary_id.
  3. Menu detection and platform IDs

    • Set menu_type from menu_url detection; resolve platform_dispensary_id for menu_type='dutchie'.
    • Admin should have "refresh detection" and "resolve ID" actions; schedule/crawl only when menu_type='dutchie' AND platform_dispensary_id is set.
  4. Queries and mapping

    • The DB returns snake_case; code expects camelCase. Always alias/map:
      • platform_dispensary_id AS "platformDispensaryId"
      • Map via mapDbRowToDispensary when loading dispensaries (scheduler, crawler, admin crawl).
    • Avoid SELECT *; explicitly select and/or map fields.
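    • A minimal sketch of this mapping, in the spirit of mapDbRowToDispensary (the field list here is illustrative; align it with the real dispensaries schema):

```typescript
// Row shape as returned by Postgres (snake_case).
interface DispensaryRow {
  id: number;
  name: string;
  menu_url: string | null;
  menu_type: string | null;
  platform_dispensary_id: string | null;
}

// Shape the rest of the code expects (camelCase).
interface Dispensary {
  id: number;
  name: string;
  menuUrl: string | null;
  menuType: string | null;
  platformDispensaryId: string | null;
}

// Explicit field-by-field mapping; no SELECT * pass-through.
function mapDbRowToDispensary(row: DispensaryRow): Dispensary {
  return {
    id: row.id,
    name: row.name,
    menuUrl: row.menu_url,
    menuType: row.menu_type,
    platformDispensaryId: row.platform_dispensary_id,
  };
}
```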
  5. Scheduling

    • /scraper-schedule should accept filters/search (All vs AZ-only, name).
    • "Run Now"/scheduler must skip or warn if menu_type!='dutchie' or platform_dispensary_id missing.
    • Use dispensary_crawl_status view; show reason when not crawlable.
  6. Crawling

    • Trigger dutchie crawls by dispensary ID (e.g., /api/az/admin/crawl/:id or runDispensaryOrchestrator(id)).
    • Update existing products (by stable product ID), append snapshots for history (every 4h cadence), download images locally (/images/...), store local URLs.
    • Use dutchie GraphQL pipeline only for menu_type='dutchie'.
  7. Frontend

    • Forward-facing URLs: /api/az, /az, /az-schedule; no vendor names.
    • /scraper-schedule: add filters/search, keep as master view for all schedules; reflect platform ID/menu_type status and controls (resolve ID, run now, enable/disable/delete).
  8. No slug guessing

    • Do not guess slugs; use the DB record's menu_url and ID. Resolve platform ID from the URL/cName; if set, crawl directly by ID.
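    • A hypothetical helper for this derivation (the function name is ours, not from the codebase). It returns null when the URL does not match a known Dutchie path, so callers can mark the store not crawlable instead of falling back to a default slug:

```typescript
// Derive cName from a store's menu_url; never guess or hardcode a default.
// Handles both /embedded-menu/<cName> and /dispensary/<cName> paths.
function deriveCName(menuUrl: string | null): string | null {
  if (!menuUrl) return null;
  const match = menuUrl.match(/\/(?:embedded-menu|dispensary)\/([^/?#]+)/);
  return match ? match[1] : null;
}
```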
  9. Verify locally before pushing

    • Apply migrations, restart backend, ensure auth (users table) exists, run dutchie crawl for a known dispensary (e.g., Deeply Rooted), check /api/az/dashboard, /api/az/stores/:id/products, /az, /scraper-schedule.
  10. Image storage (no MinIO)

    • Save images to local filesystem only. Do not create or use MinIO in Docker.
    • Product images: /images/products/<dispensary_id>/<product_id>-<hash>.webp (+medium/+thumb).
    • Brand images: /images/brands/<brand_slug_or_sku>-<hash>.webp.
    • Store local URLs in DB fields (keep original URLs as fallback only).
    • Serve /images via backend static middleware.
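    • Illustrative path builders matching the layout above; the variant suffix and hashing scheme are assumptions (any stable content hash works):

```typescript
// Build the local product image path; variant covers the +medium/+thumb sizes
// (suffix convention assumed here).
function productImagePath(
  dispensaryId: number,
  productId: number,
  hash: string,
  variant?: "medium" | "thumb"
): string {
  const suffix = variant ? `-${variant}` : "";
  return `/images/products/${dispensaryId}/${productId}-${hash}${suffix}.webp`;
}

// Build the local brand image path from the brand slug (or SKU) and hash.
function brandImagePath(brandSlug: string, hash: string): string {
  return `/images/brands/${brandSlug}-${hash}.webp`;
}
```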
  11. Dutchie GraphQL fetch rules

    • Endpoint: https://dutchie.com/api-3/graphql (NOT api-gw.dutchie.com which no longer exists).
    • Variables: Use productsFilter.dispensaryId = platform_dispensary_id (MongoDB ObjectId, e.g., 6405ef617056e8014d79101b).
    • Do NOT use dispensaryFilter.cNameOrID - that's outdated.
    • cName (e.g., AZ-Deeply-Rooted) is only for Referer/Origin headers and Puppeteer session bootstrapping.
    • Mode A: Status: "Active" - returns active products with pricing
    • Mode B: Status: null / activeOnly: false - returns all products including OOS/inactive
    • Example payload:
      {
        "operationName": "FilteredProducts",
        "variables": {
          "productsFilter": {
            "dispensaryId": "6405ef617056e8014d79101b",
            "pricingType": "rec",
            "Status": "Active"
          }
        },
        "extensions": {
          "persistedQuery": { "version": 1, "sha256Hash": "<hash>" }
        }
      }
      
    • Headers (server-side axios only): Chrome UA, Origin: https://dutchie.com, Referer: https://dutchie.com/embedded-menu/<cName>, accept: application/json, content-type: application/json.
    • If local DNS can't resolve, run fetch from an environment that can (K8s pod/remote host), not from browser.
    • Use server-side axios with embedded-menu headers; include CF/session cookie from Puppeteer if needed.
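    • A sketch of that header set for server-side axios requests; the UA string is a placeholder for a current Chrome UA, and the optional cookie is the CF/session cookie captured by Puppeteer when needed:

```typescript
// Build headers for a server-side Dutchie GraphQL request. cName is only used
// for the Referer; the query itself must use productsFilter.dispensaryId.
function buildDutchieHeaders(cName: string, cookie?: string): Record<string, string> {
  const headers: Record<string, string> = {
    "user-agent":
      "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    origin: "https://dutchie.com",
    referer: `https://dutchie.com/embedded-menu/${cName}`,
    accept: "application/json",
    "content-type": "application/json",
  };
  // Attach the Puppeteer-captured cookie (e.g., cf_clearance) when provided.
  if (cookie) headers.cookie = cookie;
  return headers;
}
```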
  12. Stop over-prep; run the crawl

    • To seed/refresh a store, run a one-off crawl by dispensary ID (example for Deeply Rooted):
      DATABASE_URL="postgresql://dutchie:dutchie_local_pass@localhost:54320/dutchie_menus" \
      npx tsx -e "const { crawlDispensaryProducts } = require('./src/dutchie-az/services/product-crawler'); const d={id:112,name:'Deeply Rooted',platform:'dutchie',platformDispensaryId:'6405ef617056e8014d79101b',menuType:'dutchie'}; crawlDispensaryProducts(d,'rec',{useBothModes:true}).then(r=>{console.log(r);process.exit(0);}).catch(e=>{console.error(e);process.exit(1);});"
      
      If local DNS is blocked, run the same command inside the scraper pod via kubectl exec ... -- bash -lc '...'.
    • After crawl, verify counts via dutchie_products, dutchie_product_snapshots, and dispensaries.last_crawl_at. Do not inspect the legacy products table for Dutchie.
  13. Fetch troubleshooting

    • If 403 or empty data: log status + first GraphQL error; include cf_clearance/session cookie from Puppeteer; ensure headers match a real Chrome request; ensure variables use productsFilter.dispensaryId.
    • If DNS fails locally, do NOT debug DNS—run the fetch from an environment that resolves (K8s/remote) or via Puppeteer-captured headers/cookies. No browser/CORS attempts.
  14. Views and metrics

    • Keep v_brands/v_categories/v_brand_history based on dutchie_products and preserve brand_count metrics. Do not drop brand_count.
  15. Batch DB writes to avoid OOM

    • Do NOT build one giant upsert/insert payload for products/snapshots/missing marks.
    • Chunk arrays (e.g., 100–200 items) and upsert/insert in a loop; drop references after each chunk.
    • Apply to products, product snapshots, and any "mark missing" logic to keep memory low during crawls.
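    • A minimal chunking sketch: split a large array into fixed-size slices and write each slice in its own statement, so only one chunk's payload is alive at a time (the insertSnapshots call in the usage comment is hypothetical):

```typescript
// Split items into slices of at most `size` elements (default inside the
// 100-200 range the rule prescribes).
function chunk<T>(items: T[], size = 150): T[][] {
  const out: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    out.push(items.slice(i, i + size));
  }
  return out;
}

// Usage sketch:
//   for (const batch of chunk(snapshots)) await insertSnapshots(batch);
```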
  16. Use dual-mode crawls by default

    • Always run with useBothModes:true to combine:
      • Mode A (active feed with pricing/stock)
      • Mode B (max coverage including OOS/inactive)
    • Union/dedupe by product ID so you keep full coverage and pricing in one run.
    • If you only run Mode B, prices will be null; dual-mode fills pricing while retaining OOS items.
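    • The union/dedupe step can be sketched as follows: key by external_product_id + dispensary_id and let Mode A rows (which carry pricing) win over Mode B duplicates. Field names are assumptions based on the column names above:

```typescript
// Minimal product shape for the union; the real rows carry more fields.
interface FeedProduct {
  externalProductId: string;
  dispensaryId: number;
  price: number | null;
}

// Union Mode A and Mode B results, deduping on the composite key.
function unionModes(modeA: FeedProduct[], modeB: FeedProduct[]): FeedProduct[] {
  const byKey = new Map<string, FeedProduct>();
  // Insert Mode B first, then Mode A, so priced Mode A rows overwrite
  // unpriced Mode B duplicates while Mode B-only (OOS/inactive) rows survive.
  for (const p of [...modeB, ...modeA]) {
    byKey.set(`${p.externalProductId}:${p.dispensaryId}`, p);
  }
  return [...byKey.values()];
}
```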
  17. Capture OOS and missing items

    • GraphQL variables must include inactive/OOS (Status: All / activeOnly:false). Mode B already returns OOS/inactive; union with Mode A to keep pricing.
    • After unioning Mode A/B, upsert products and insert snapshots with stock_status from the feed. If an existing product is absent from both Mode A and Mode B for the run, insert a snapshot with is_present_in_feed=false and stock_status='missing_from_feed'.
    • Do not filter out OOS/missing in the API; only filter when the user requests (e.g., stockStatus=in_stock). Expose stock_status/in_stock from the latest snapshot (fallback to product).
    • Verify with /api/az/stores/:id/products?stockStatus=out_of_stock and ?stockStatus=missing_from_feed.
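    • The missing-from-feed check reduces to a set difference. A sketch, assuming product IDs are compared as strings: given the IDs seen in this run's A+B union and the store's known product IDs, return the IDs that need a snapshot with is_present_in_feed=false and stock_status='missing_from_feed':

```typescript
// IDs of known products that appeared in neither Mode A nor Mode B this run.
function findMissingIds(knownIds: string[], seenThisRun: Set<string>): string[] {
  return knownIds.filter((id) => !seenThisRun.has(id));
}
```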
  18. Menu discovery must crawl the website when menu_url is null

    • For dispensaries with no menu_url or unknown menu_type, crawl the dispensary.website (if present) to find provider links (dutchie, treez, jane, weedmaps, leafly, etc.). Follow “menu/order/shop” links up to a shallow depth with timeouts/rate limits.
    • If a provider link is found, set menu_url, set menu_type, and store detection metadata; if dutchie, derive cName from menu_url and resolve platform_dispensary_id; store resolved_at and detection details.
    • Do NOT mark a dispensary not_crawlable solely because menu_url is null; only mark not_crawlable if the website crawl fails to find a menu or returns 403/404/invalid. Log the reason in provider_detection_data and crawl_status_reason.
    • Keep this as the menu discovery job (separate from product crawls); log successes/errors to job_run_logs. Only schedule product crawls for stores with menu_type='dutchie' AND platform_dispensary_id IS NOT NULL.
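    • An illustrative provider check for the discovery crawl; the hostname list covers the providers named above (the jane domain is an assumption) and is easy to extend:

```typescript
// Map a discovered link to a known menu provider, or null if it's not one.
function detectProvider(linkUrl: string): string | null {
  const providers: Array<[string, string]> = [
    ["dutchie.com", "dutchie"],
    ["treez.io", "treez"],
    ["iheartjane.com", "jane"], // assumed domain for Jane menus
    ["weedmaps.com", "weedmaps"],
    ["leafly.com", "leafly"],
  ];
  let host: string;
  try {
    host = new URL(linkUrl).hostname;
  } catch {
    return null; // not a parseable absolute URL
  }
  const hit = providers.find(
    ([domain]) => host === domain || host.endsWith(`.${domain}`)
  );
  return hit ? hit[1] : null;
}
```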
  19. Preserve all stock statuses (including unknown)

    • Do not filter or drop stock_status values in API/UI; pass through whatever is stored on the latest snapshot/product. Expected values include: in_stock, out_of_stock (if exposed), missing_from_feed, unknown. Only apply filters when explicitly requested by the user.
  20. Never delete or overwrite historical data

    • Do not delete products/snapshots or overwrite historical records. Always append snapshots for changes (price/stock/qty), and mark missing_from_feed instead of removing records. Historical data must remain intact for analytics.
  21. Per-location cName and platform_dispensary_id resolution

    • For each dispensary, menu_url and cName must be valid for that exact location; no hardcoded defaults and no sharing platform_dispensary_id across locations.
    • Derive cName from menu_url per store: /embedded-menu/<cName> or /dispensary/<cName>.
    • Resolve platform_dispensary_id from that cName using GraphQL GetAddressBasedDispensaryData.
    • If the slug is invalid/missing, mark the store not crawlable and log it; do not crawl with a mismatched cName/ID. Store the error in provider_detection_data.resolution_error.
    • Before crawling, validate that the cName from menu_url matches the resolved platform ID; if mismatched, re-resolve before proceeding.
  22. API endpoints (AZ pipeline)

    • Use /api/az/... endpoints: stores, products, brands, categories, summary, dashboard
    • Rebuild frontend with VITE_API_URL pointing to the backend
    • Dispensary Detail and analytics must use AZ endpoints
  23. Monitoring and logging

    • /scraper-monitor (and /az-schedule) should show active/recent jobs from job_run_logs/crawl_jobs
    • Auto-refresh every 30 seconds
    • System Logs page should show real log data, not just startup messages
  24. Dashboard Architecture - CRITICAL

    • Frontend: If you see old labels like "Active Proxies" or "Active Stores", it means the old dashboard bundle is being served. Rebuild the frontend with VITE_API_URL pointing to the correct backend and redeploy. Clear browser cache. Confirm new labels show up.
    • Backend: /api/dashboard/stats MUST use the consolidated DB (same pool as dutchie-az module). Use the correct tables: dutchie_products, dispensaries, and views like v_dashboard_stats, v_latest_snapshots. Do NOT use a separate legacy connection. Do NOT query az_products (doesn't exist) or legacy stores/products tables.
    • DB Connectivity: Use the proper DB host/role. Errors like role "dutchie" does not exist mean you're exec'ing into the wrong Postgres pod or using wrong credentials. Confirm the correct DATABASE_URL and test: kubectl exec deployment/scraper -n dispensary-scraper -- psql $DATABASE_URL -c '\dt'
    • After fixing: Dashboard should show real data (e.g., 777 products) instead of zeros. Do NOT revert to legacy tables; point dashboard queries to the consolidated DB/views.
    • Checklist:
      1. Rebuild/redeploy frontend with correct API URL, clear cache
      2. Fix /api/dashboard/* to use the consolidated DB pool and dutchie views/tables
      3. Test /api/dashboard/stats from the scraper pod; then reload the UI
  25. Deployment (Gitea + Kubernetes)

    • Registry: Gitea at code.cannabrands.app/creationshop/dispensary-scraper
    • Build and push (from backend directory):
      # Login to Gitea container registry
      docker login code.cannabrands.app
      
      # Build the image
      cd backend
      docker build -t code.cannabrands.app/creationshop/dispensary-scraper:latest .
      
      # Push to registry
      docker push code.cannabrands.app/creationshop/dispensary-scraper:latest
      
    • Deploy to Kubernetes:
      # Restart deployments to pull new image
      kubectl rollout restart deployment/scraper -n dispensary-scraper
      kubectl rollout restart deployment/scraper-worker -n dispensary-scraper
      
      # Watch rollout status
      kubectl rollout status deployment/scraper -n dispensary-scraper
      kubectl rollout status deployment/scraper-worker -n dispensary-scraper
      
    • Check pods:
      kubectl get pods -n dispensary-scraper
      kubectl logs -f deployment/scraper -n dispensary-scraper
      kubectl logs -f deployment/scraper-worker -n dispensary-scraper
      
    • K8s manifests are in /k8s/ folder (scraper.yaml, scraper-worker.yaml, etc.)
    • imagePullSecrets use regcred secret for Gitea registry auth
  26. Crawler Architecture

    • Scraper pod (1 replica): Runs the Express API server + scheduler. The scheduler enqueues detection and crawl jobs to the database queue (crawl_jobs table).
    • Scraper-worker pods (5 replicas): Each worker runs dist/dutchie-az/services/worker.js, polling the job queue and processing jobs.
    • Job types processed by workers:
      • menu_detection / menu_detection_single: Detect menu provider type and resolve platform_dispensary_id from menu_url
      • dutchie_product_crawl: Crawl products from Dutchie GraphQL API for dispensaries with valid platform IDs
    • Job schedules (managed in job_schedules table):
      • dutchie_az_menu_detection: Runs daily with 60-min jitter, detects menu type for dispensaries with unknown menu_type
      • dutchie_az_product_crawl: Runs every 4 hours with 30-min jitter, crawls products from all detected Dutchie dispensaries
    • Trigger schedules manually: curl -X POST /api/az/admin/schedules/{id}/trigger
    • Check schedule status: curl /api/az/admin/schedules
    • Worker logs: kubectl logs -f deployment/scraper-worker -n dispensary-scraper