# Claude Guidelines for this Project

## Core Rules Summary
- DB: Use the single consolidated DB (CRAWLSY_DATABASE_URL → DATABASE_URL); no dual pools; schema_migrations must exist; apply migrations 031/032/033.
- Images: No MinIO. Save to local `/images/products/<dispensary_id>/<product_id>-<hash>.webp` (and brands); preserve original URL; serve via backend static.
- Dutchie GraphQL: Endpoint https://dutchie.com/api-3/graphql. Variables must use productsFilter.dispensaryId (platform_dispensary_id). Mode A: Status="Active". Mode B: Status=null/activeOnly:false. No dispensaryFilter.cNameOrID.
- cName/slug: Derive cName from each store's menu_url (/embedded-menu/ or /dispensary/). No hardcoded defaults. Each location must have its own valid menu_url and platform_dispensary_id; do not reuse IDs across locations. If slug is invalid/missing, mark not crawlable and log; resolve ID before crawling.
- Dual-mode always: useBothModes:true to get pricing (Mode A) + full coverage (Mode B).
- Batch DB writes: Chunk products/snapshots/missing (100–200) to avoid OOM.
- OOS/missing: Include inactive/OOS in Mode B. Union A+B, dedupe by external_product_id+dispensary_id. Insert snapshots with stock_status; if absent from both modes, insert missing_from_feed. Do not filter OOS by default.
- API/Frontend: Use /api/az/... endpoints (stores/products/brands/categories/summary/dashboard). Rebuild frontend with VITE_API_URL pointing to the backend.
- Scheduling: Crawl only menu_type='dutchie' AND platform_dispensary_id IS NOT NULL. 4-hour crawl with jitter; detection job to set menu_type and resolve platform IDs.
- Monitor: /scraper-monitor (and /az-schedule) should show active/recent jobs from job_run_logs/crawl_jobs, with auto-refresh.
- No slug guessing: Never use defaults like "AZ-Deeply-Rooted." Always derive per store from menu_url and resolve platform IDs per location.
## Detailed Rules
- Use the consolidated DB everywhere
  - Preferred env: `CRAWLSY_DATABASE_URL` (fallback `DATABASE_URL`).
  - Do NOT create dutchie tables in the legacy DB. Apply migrations 031/032/033 to the consolidated DB and restart.
- Dispensary vs Store
  - The Dutchie pipeline uses `dispensaries` (not the legacy `stores`). For dutchie crawls, always work with the dispensary ID.
  - Ignore legacy fields like `dutchie_plus_id` and slug guessing. Use the record's `menu_url` and `platform_dispensary_id`.
- Menu detection and platform IDs
  - Set `menu_type` from `menu_url` detection; resolve `platform_dispensary_id` for `menu_type='dutchie'`.
  - Admin should have "refresh detection" and "resolve ID" actions; schedule/crawl only when `menu_type='dutchie'` AND `platform_dispensary_id` is set.
- Queries and mapping
  - The DB returns snake_case; code expects camelCase. Always alias/map: `platform_dispensary_id AS "platformDispensaryId"`.
  - Map via `mapDbRowToDispensary` when loading dispensaries (scheduler, crawler, admin crawl).
  - Avoid `SELECT *`; explicitly select and/or map fields.
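The mapping rule above can be sketched as follows. The field list here is illustrative only; the project's actual `mapDbRowToDispensary` presumably carries more fields.

```typescript
// Illustrative Dispensary shape (camelCase, as code expects).
interface Dispensary {
  id: number;
  name: string;
  menuUrl: string | null;
  menuType: string | null;
  platformDispensaryId: string | null;
}

// Explicit field-by-field mapping instead of a SELECT * passthrough.
function mapDbRowToDispensary(row: Record<string, unknown>): Dispensary {
  return {
    id: row.id as number,
    name: row.name as string,
    menuUrl: (row.menu_url as string) ?? null,
    menuType: (row.menu_type as string) ?? null,
    platformDispensaryId: (row.platform_dispensary_id as string) ?? null,
  };
}

const d = mapDbRowToDispensary({
  id: 112,
  name: 'Deeply Rooted',
  menu_url: 'https://dutchie.com/embedded-menu/AZ-Deeply-Rooted',
  menu_type: 'dutchie',
  platform_dispensary_id: '6405ef617056e8014d79101b',
});
console.log(d.platformDispensaryId); // '6405ef617056e8014d79101b'
```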
- Scheduling
  - `/scraper-schedule` should accept filters/search (All vs AZ-only, name).
  - "Run Now"/scheduler must skip or warn if `menu_type != 'dutchie'` or `platform_dispensary_id` is missing.
  - Use the `dispensary_crawl_status` view; show the reason when a store is not crawlable.
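A minimal sketch of the "skip or warn" gate. The function name and reason strings are illustrative; the two conditions are the ones stated above.

```typescript
interface CrawlCandidate {
  menuType: string | null;
  platformDispensaryId: string | null;
}

// Returns null when crawlable, else a human-readable reason to surface in the UI/logs.
function crawlBlockReason(d: CrawlCandidate): string | null {
  if (d.menuType !== 'dutchie') return `menu_type is '${d.menuType}', not 'dutchie'`;
  if (!d.platformDispensaryId) return 'platform_dispensary_id is missing';
  return null;
}

console.log(crawlBlockReason({ menuType: 'dutchie', platformDispensaryId: '6405ef617056e8014d79101b' })); // null
console.log(crawlBlockReason({ menuType: 'treez', platformDispensaryId: null }));
```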
- Crawling
  - Trigger dutchie crawls by dispensary ID (e.g., `/api/az/admin/crawl/:id` or `runDispensaryOrchestrator(id)`).
  - Update existing products (by stable product ID), append snapshots for history (every 4h cadence), download images locally (`/images/...`), store local URLs.
  - Use the dutchie GraphQL pipeline only for `menu_type='dutchie'`.
- Frontend
  - Forward-facing URLs: `/api/az`, `/az`, `/az-schedule`; no vendor names.
  - `/scraper-schedule`: add filters/search, keep as the master view for all schedules; reflect platform ID/menu_type status and controls (resolve ID, run now, enable/disable/delete).
- No slug guessing
  - Do not guess slugs; use the DB record's `menu_url` and ID. Resolve the platform ID from the URL/cName; if set, crawl directly by ID.
- Verify locally before pushing
  - Apply migrations, restart the backend, ensure auth (`users` table) exists, run a dutchie crawl for a known dispensary (e.g., Deeply Rooted), then check `/api/az/dashboard`, `/api/az/stores/:id/products`, `/az`, and `/scraper-schedule`.
- Image storage (no MinIO)
  - Save images to the local filesystem only. Do not create or use MinIO in Docker.
  - Product images: `/images/products/<dispensary_id>/<product_id>-<hash>.webp` (+medium/+thumb).
  - Brand images: `/images/brands/<brand_slug_or_sku>-<hash>.webp`.
  - Store local URLs in DB fields (keep original URLs as fallback only).
  - Serve `/images` via backend static middleware.
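A sketch of the path convention above. The hash derivation (first 8 hex chars of a SHA-1 of the original URL) is an assumption for illustration; the real pipeline may compute it differently.

```typescript
import { createHash } from 'node:crypto';

type ImageSize = '' | 'medium' | 'thumb';

// Builds the local storage path for a product image per the convention:
// /images/products/<dispensary_id>/<product_id>-<hash>.webp (+medium/+thumb)
function productImagePath(
  dispensaryId: number,
  productId: string,
  originalUrl: string, // kept in the DB as fallback; hashed here (assumption)
  size: ImageSize = '',
): string {
  const hash = createHash('sha1').update(originalUrl).digest('hex').slice(0, 8);
  const suffix = size ? `-${size}` : '';
  return `/images/products/${dispensaryId}/${productId}-${hash}${suffix}.webp`;
}

const p = productImagePath(112, 'abc123', 'https://images.example.com/some-product.png');
console.log(p); // /images/products/112/abc123-<hash>.webp
```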
- Dutchie GraphQL fetch rules
  - Endpoint: `https://dutchie.com/api-3/graphql` (NOT `api-gw.dutchie.com`, which no longer exists).
  - Variables: use `productsFilter.dispensaryId` = `platform_dispensary_id` (MongoDB ObjectId, e.g., `6405ef617056e8014d79101b`).
  - Do NOT use `dispensaryFilter.cNameOrID` - that's outdated. cName (e.g., `AZ-Deeply-Rooted`) is only for Referer/Origin headers and Puppeteer session bootstrapping.
  - Mode A: `Status: "Active"` - returns active products with pricing.
  - Mode B: `Status: null` / `activeOnly: false` - returns all products including OOS/inactive.
  - Example payload:
    ```json
    {
      "operationName": "FilteredProducts",
      "variables": {
        "productsFilter": {
          "dispensaryId": "6405ef617056e8014d79101b",
          "pricingType": "rec",
          "Status": "Active"
        }
      },
      "extensions": { "persistedQuery": { "version": 1, "sha256Hash": "<hash>" } }
    }
    ```
  - Headers (server-side axios only): Chrome UA, `Origin: https://dutchie.com`, `Referer: https://dutchie.com/embedded-menu/<cName>`, `accept: application/json`, `content-type: application/json`.
  - If local DNS can't resolve, run the fetch from an environment that can (K8s pod/remote host), not from a browser.
  - Use server-side axios with embedded-menu headers; include the CF/session cookie from Puppeteer if needed.
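The Mode A / Mode B variable shapes above can be sketched as a small payload builder. The function name is hypothetical and the persisted-query hash stays a placeholder; only the variables shape follows the rules stated here.

```typescript
// Builds the FilteredProducts request body for Mode A (active + pricing)
// or Mode B (full coverage incl. OOS/inactive).
function buildFilteredProductsPayload(
  platformDispensaryId: string, // MongoDB ObjectId, NOT a cName
  mode: 'A' | 'B',
  sha256Hash: string, // persisted query hash; '<hash>' placeholder in examples
) {
  const productsFilter: Record<string, unknown> = {
    dispensaryId: platformDispensaryId, // never dispensaryFilter.cNameOrID
    pricingType: 'rec',
  };
  if (mode === 'A') {
    productsFilter.Status = 'Active';
  } else {
    productsFilter.Status = null;
    productsFilter.activeOnly = false;
  }
  return {
    operationName: 'FilteredProducts',
    variables: { productsFilter },
    extensions: { persistedQuery: { version: 1, sha256Hash } },
  };
}

const payload = buildFilteredProductsPayload('6405ef617056e8014d79101b', 'A', '<hash>');
console.log(JSON.stringify(payload.variables));
```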
- Stop over-prep; run the crawl
  - To seed/refresh a store, run a one-off crawl by dispensary ID (example for Deeply Rooted):
    ```bash
    DATABASE_URL="postgresql://dutchie:dutchie_local_pass@localhost:54320/dutchie_menus" \
    npx tsx -e "const { crawlDispensaryProducts } = require('./src/dutchie-az/services/product-crawler'); const d={id:112,name:'Deeply Rooted',platform:'dutchie',platformDispensaryId:'6405ef617056e8014d79101b',menuType:'dutchie'}; crawlDispensaryProducts(d,'rec',{useBothModes:true}).then(r=>{console.log(r);process.exit(0);}).catch(e=>{console.error(e);process.exit(1);});"
    ```
  - If local DNS is blocked, run the same command inside the scraper pod via `kubectl exec ... -- bash -lc '...'`.
  - After the crawl, verify counts via `dutchie_products`, `dutchie_product_snapshots`, and `dispensaries.last_crawl_at`. Do not inspect the legacy `products` table for Dutchie.
- Fetch troubleshooting
  - If 403 or empty data: log the status + first GraphQL error; include the cf_clearance/session cookie from Puppeteer; ensure headers match a real Chrome request; ensure variables use `productsFilter.dispensaryId`.
  - If DNS fails locally, do NOT debug DNS - run the fetch from an environment that resolves (K8s/remote) or via Puppeteer-captured headers/cookies. No browser/CORS attempts.
- Views and metrics
  - Keep v_brands/v_categories/v_brand_history based on `dutchie_products` and preserve brand_count metrics. Do not drop brand_count.
- Batch DB writes to avoid OOM
  - Do NOT build one giant upsert/insert payload for products/snapshots/missing marks.
  - Chunk arrays (e.g., 100-200 items) and upsert/insert in a loop; drop references after each chunk.
  - Apply this to products, product snapshots, and any "mark missing" logic to keep memory low during crawls.
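A minimal chunking sketch for the rule above; `upsertBatch` is a stand-in for whatever DB call the crawler actually uses.

```typescript
// Split a large array into fixed-size chunks (150 sits inside the 100-200 range above).
function chunk<T>(items: T[], size = 150): T[][] {
  const out: T[][] = [];
  for (let i = 0; i < items.length; i += size) out.push(items.slice(i, i + size));
  return out;
}

// Upsert chunk by chunk; each batch goes out of scope after its round-trip,
// so memory stays flat even on large crawls.
async function batchedUpsert<T>(rows: T[], upsertBatch: (batch: T[]) => Promise<void>): Promise<void> {
  for (const batch of chunk(rows, 150)) {
    await upsertBatch(batch);
  }
}

// Example: 500 rows → 4 batches of at most 150.
const sizes: number[] = [];
batchedUpsert(Array.from({ length: 500 }, (_, i) => i), async (b) => { sizes.push(b.length); })
  .then(() => console.log(sizes)); // [150, 150, 150, 50]
```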
- Use dual-mode crawls by default
  - Always run with `useBothModes: true` to combine:
    - Mode A (active feed with pricing/stock)
    - Mode B (max coverage including OOS/inactive)
  - Union/dedupe by product ID so you keep full coverage and pricing in one run.
  - If you only run Mode B, prices will be null; dual-mode fills pricing while retaining OOS items.
- Capture OOS and missing items
  - GraphQL variables must include inactive/OOS (`Status: All` / `activeOnly: false`). Mode B already returns OOS/inactive; union with Mode A to keep pricing.
  - After unioning Mode A/B, upsert products and insert snapshots with stock_status from the feed. If an existing product is absent from both Mode A and Mode B for the run, insert a snapshot with `is_present_in_feed=false` and `stock_status='missing_from_feed'`.
  - Do not filter out OOS/missing in the API; only filter when the user requests it (e.g., `stockStatus=in_stock`). Expose stock_status/in_stock from the latest snapshot (fallback to product).
  - Verify with `/api/az/stores/:id/products?stockStatus=out_of_stock` and `?stockStatus=missing_from_feed`.
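The union/dedupe step can be sketched as below, keyed by external_product_id + dispensary_id as the summary rules require. The `FeedProduct` shape is illustrative; inserting Mode B first and letting Mode A overwrite is one simple way to keep Mode A pricing on conflict.

```typescript
interface FeedProduct {
  externalProductId: string;
  dispensaryId: number;
  price: number | null; // null in Mode B rows
  stockStatus: string;
}

// Union Mode A + Mode B, deduped by external_product_id + dispensary_id.
// Mode A rows win on conflict so pricing is preserved; Mode B contributes
// the OOS/inactive items that Mode A never returns.
function unionModes(modeA: FeedProduct[], modeB: FeedProduct[]): FeedProduct[] {
  const byKey = new Map<string, FeedProduct>();
  for (const p of [...modeB, ...modeA]) {
    byKey.set(`${p.externalProductId}:${p.dispensaryId}`, p);
  }
  return [...byKey.values()];
}

const merged = unionModes(
  [{ externalProductId: 'x1', dispensaryId: 112, price: 25, stockStatus: 'in_stock' }],
  [
    { externalProductId: 'x1', dispensaryId: 112, price: null, stockStatus: 'in_stock' },
    { externalProductId: 'x2', dispensaryId: 112, price: null, stockStatus: 'out_of_stock' },
  ],
);
console.log(merged.length); // 2 - x1 keeps Mode A pricing, x2 is kept from Mode B
```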
- Menu discovery must crawl the website when menu_url is null
  - For dispensaries with no menu_url or unknown menu_type, crawl the dispensary.website (if present) to find provider links (dutchie, treez, jane, weedmaps, leafly, etc.). Follow "menu/order/shop" links up to a shallow depth with timeouts/rate limits.
  - If a provider link is found, set menu_url, set menu_type, and store detection metadata; if dutchie, derive cName from menu_url and resolve platform_dispensary_id; store resolved_at and detection details.
  - Do NOT mark a dispensary not_crawlable solely because menu_url is null; only mark not_crawlable if the website crawl fails to find a menu or returns 403/404/invalid. Log the reason in provider_detection_data and crawl_status_reason.
  - Keep this as the menu discovery job (separate from product crawls); log successes/errors to job_run_logs. Only schedule product crawls for stores with menu_type='dutchie' AND platform_dispensary_id IS NOT NULL.
- Preserve all stock statuses (including unknown)
  - Do not filter or drop stock_status values in the API/UI; pass through whatever is stored on the latest snapshot/product. Expected values include: in_stock, out_of_stock (if exposed), missing_from_feed, unknown. Only apply filters when explicitly requested by the user.
- Never delete or overwrite historical data
  - Do not delete products/snapshots or overwrite historical records. Always append snapshots for changes (price/stock/qty), and mark missing_from_feed instead of removing records. Historical data must remain intact for analytics.
- Deployment via CI/CD only
  - Test locally, commit clean changes, and let CI/CD build and deploy to Kubernetes at code.cannabrands.app. Do NOT manually build/push images or tweak prod pods. Deploy the backend first, smoke-test APIs, then the frontend; roll back via CI/CD if needed.
- Per-location cName and platform_dispensary_id resolution
  - For each dispensary, menu_url and cName must be valid for that exact location; no hardcoded defaults and no sharing platform_dispensary_id across locations.
  - Derive cName from menu_url per store: `/embedded-menu/<cName>` or `/dispensary/<cName>`.
  - Resolve platform_dispensary_id from that cName using GraphQL GetAddressBasedDispensaryData.
  - If the slug is invalid/missing, mark the store not crawlable and log it; do not crawl with a mismatched cName/ID. Store the error in `provider_detection_data.resolution_error`.
  - Before crawling, validate that the cName from menu_url matches the resolved platform ID; if mismatched, re-resolve before proceeding.
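The derivation rule above can be sketched as a small helper. The function name is illustrative; the key property is that it returns null rather than any hardcoded default, so callers mark the store not crawlable instead of guessing.

```typescript
// Derive cName from a store's menu_url: /embedded-menu/<cName> or /dispensary/<cName>.
// Returns null when neither pattern matches - never a default slug.
function deriveCName(menuUrl: string | null): string | null {
  if (!menuUrl) return null;
  const match = menuUrl.match(/\/(?:embedded-menu|dispensary)\/([^/?#]+)/);
  return match ? match[1] : null;
}

console.log(deriveCName('https://dutchie.com/embedded-menu/AZ-Deeply-Rooted')); // 'AZ-Deeply-Rooted'
console.log(deriveCName('https://dutchie.com/dispensary/some-store?menuType=rec')); // 'some-store'
console.log(deriveCName('https://example.com/menu')); // null
```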
- API endpoints (AZ pipeline)
  - Use /api/az/... endpoints: stores, products, brands, categories, summary, dashboard
  - Rebuild the frontend with VITE_API_URL pointing to the backend
  - Dispensary Detail and analytics must use AZ endpoints
- Monitoring and logging
  - /scraper-monitor (and /az-schedule) should show active/recent jobs from job_run_logs/crawl_jobs
  - Auto-refresh every 30 seconds
  - The System Logs page should show real log data, not just startup messages
- Dashboard Architecture - CRITICAL
  - Frontend: If you see old labels like "Active Proxies" or "Active Stores", the old dashboard bundle is being served. Rebuild the frontend with `VITE_API_URL` pointing to the correct backend and redeploy. Clear the browser cache. Confirm the new labels show up.
  - Backend: `/api/dashboard/stats` MUST use the consolidated DB (same pool as the dutchie-az module). Use the correct tables: `dutchie_products`, `dispensaries`, and views like `v_dashboard_stats`, `v_latest_snapshots`. Do NOT use a separate legacy connection. Do NOT query `az_products` (doesn't exist) or the legacy `stores`/`products` tables.
  - DB Connectivity: Use the proper DB host/role. Errors like `role "dutchie" does not exist` mean you're exec'ing into the wrong Postgres pod or using the wrong credentials. Confirm the correct `DATABASE_URL` and test: `kubectl exec deployment/scraper -n dispensary-scraper -- psql $DATABASE_URL -c '\dt'`
  - After fixing: the dashboard should show real data (e.g., 777 products) instead of zeros. Do NOT revert to legacy tables; point dashboard queries at the consolidated DB/views.
  - Checklist:
    - Rebuild/redeploy the frontend with the correct API URL, clear cache
    - Fix `/api/dashboard/*` to use the consolidated DB pool and dutchie views/tables
    - Test `/api/dashboard/stats` from the scraper pod; then reload the UI
- Deployment (Gitea + Kubernetes)
  - Registry: Gitea at `code.cannabrands.app/creationshop/dispensary-scraper`
  - Build and push (from the backend directory):
    ```bash
    # Login to Gitea container registry
    docker login code.cannabrands.app
    # Build the image
    cd backend
    docker build -t code.cannabrands.app/creationshop/dispensary-scraper:latest .
    # Push to registry
    docker push code.cannabrands.app/creationshop/dispensary-scraper:latest
    ```
  - Deploy to Kubernetes:
    ```bash
    # Restart deployments to pull new image
    kubectl rollout restart deployment/scraper -n dispensary-scraper
    kubectl rollout restart deployment/scraper-worker -n dispensary-scraper
    # Watch rollout status
    kubectl rollout status deployment/scraper -n dispensary-scraper
    kubectl rollout status deployment/scraper-worker -n dispensary-scraper
    ```
  - Check pods:
    ```bash
    kubectl get pods -n dispensary-scraper
    kubectl logs -f deployment/scraper -n dispensary-scraper
    kubectl logs -f deployment/scraper-worker -n dispensary-scraper
    ```
  - K8s manifests are in the `/k8s/` folder (scraper.yaml, scraper-worker.yaml, etc.)
  - imagePullSecrets use the `regcred` secret for Gitea registry auth
- Crawler Architecture
  - Scraper pod (1 replica): Runs the Express API server + scheduler. The scheduler enqueues detection and crawl jobs to the database queue (`crawl_jobs` table).
  - Scraper-worker pods (5 replicas): Each worker runs `dist/dutchie-az/services/worker.js`, polling the job queue and processing jobs.
  - Job types processed by workers:
    - `menu_detection` / `menu_detection_single`: Detect menu provider type and resolve platform_dispensary_id from menu_url
    - `dutchie_product_crawl`: Crawl products from the Dutchie GraphQL API for dispensaries with valid platform IDs
  - Job schedules (managed in the `job_schedules` table):
    - `dutchie_az_menu_detection`: Runs daily with 60-min jitter, detects menu type for dispensaries with unknown menu_type
    - `dutchie_az_product_crawl`: Runs every 4 hours with 30-min jitter, crawls products from all detected Dutchie dispensaries
  - Trigger schedules manually: `curl -X POST /api/az/admin/schedules/{id}/trigger`
  - Check schedule status: `curl /api/az/admin/schedules`
  - Worker logs: `kubectl logs -f deployment/scraper-worker -n dispensary-scraper`
- Crawler Maintenance Procedure (Check Jobs, Requeue, Restart)

  When crawlers are stuck or jobs aren't processing, follow this procedure:

  Step 1: Check Job Status
  ```bash
  # Port-forward to production
  kubectl port-forward -n dispensary-scraper deployment/scraper 3099:3010 &
  # Check active/stuck jobs
  curl -s http://localhost:3099/api/az/monitor/active-jobs | jq .
  # Check recent job history
  curl -s "http://localhost:3099/api/az/monitor/jobs?limit=20" | jq '.jobs[] | {id, job_type, status, dispensary_id, started_at, products_found, duration_min: (.duration_ms/60000 | floor)}'
  # Check schedule status
  curl -s http://localhost:3099/api/az/admin/schedules | jq '.schedules[] | {id, jobName, enabled, lastRunAt, lastStatus, nextRunAt}'
  ```

  Step 2: Reset Stuck Jobs

  Jobs are considered stuck if they have `status='running'` but no heartbeat in >30 minutes:
  ```bash
  # Via API (if endpoint exists)
  curl -s -X POST http://localhost:3099/api/az/admin/reset-stuck-jobs
  # Via direct DB (if API not available)
  kubectl exec -n dispensary-scraper deployment/scraper -- psql $DATABASE_URL -c "
  UPDATE dispensary_crawl_jobs
  SET status = 'failed',
      error_message = 'Job timed out - worker stopped sending heartbeats',
      completed_at = NOW()
  WHERE status = 'running'
    AND (last_heartbeat_at < NOW() - INTERVAL '30 minutes' OR last_heartbeat_at IS NULL);
  "
  ```

  Step 3: Requeue Jobs (Trigger Fresh Crawl)
  ```bash
  # Trigger product crawl schedule (typically ID 1)
  curl -s -X POST http://localhost:3099/api/az/admin/schedules/1/trigger
  # Trigger menu detection schedule (typically ID 2)
  curl -s -X POST http://localhost:3099/api/az/admin/schedules/2/trigger
  # Or crawl a specific dispensary
  curl -s -X POST http://localhost:3099/api/az/admin/crawl/112
  ```

  Step 4: Restart Crawler Workers
  ```bash
  # Restart scraper-worker pods (clears any stuck processes)
  kubectl rollout restart deployment/scraper-worker -n dispensary-scraper
  # Watch rollout progress
  kubectl rollout status deployment/scraper-worker -n dispensary-scraper
  # Optionally restart main scraper pod too
  kubectl rollout restart deployment/scraper -n dispensary-scraper
  ```

  Step 5: Monitor Recovery
  ```bash
  # Watch worker logs
  kubectl logs -f deployment/scraper-worker -n dispensary-scraper --tail=50
  # Check dashboard for product counts
  curl -s http://localhost:3099/api/az/dashboard | jq '{totalStores, totalProducts, storesByType}'
  # Verify jobs are processing
  curl -s http://localhost:3099/api/az/monitor/active-jobs | jq .
  ```

  Quick one-liner for a full reset:
  ```bash
  # Reset stuck jobs and restart workers
  kubectl exec -n dispensary-scraper deployment/scraper -- psql $DATABASE_URL -c "UPDATE dispensary_crawl_jobs SET status='failed', completed_at=NOW() WHERE status='running' AND (last_heartbeat_at < NOW() - INTERVAL '30 minutes' OR last_heartbeat_at IS NULL);" && kubectl rollout restart deployment/scraper-worker -n dispensary-scraper && kubectl rollout status deployment/scraper-worker -n dispensary-scraper
  ```

  Cleanup port-forwards when done:
  ```bash
  pkill -f "port-forward.*dispensary-scraper"
  ```