fix(monitor): remove non-existent worker columns from job_run_logs query
The job_run_logs table tracks scheduled job orchestration, not individual worker jobs. Worker info (worker_id, worker_hostname) belongs on dispensary_crawl_jobs, not job_run_logs.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
This commit is contained in:

CLAUDE.md
@@ -1,16 +1,34 @@
## Claude Guidelines for this Project

### Core Rules Summary

- **DB**: Use the single consolidated DB (CRAWLSY_DATABASE_URL → DATABASE_URL); no dual pools; schema_migrations must exist; apply migrations 031/032/033.
- **Images**: No MinIO. Save to local /images/products/<disp>/<prod>-<hash>.webp (and brands); preserve the original URL; serve via backend static.
- **Dutchie GraphQL**: Endpoint https://dutchie.com/api-3/graphql. Variables must use productsFilter.dispensaryId (platform_dispensary_id). Mode A: Status="Active". Mode B: Status=null/activeOnly:false. No dispensaryFilter.cNameOrID.
- **cName/slug**: Derive cName from each store's menu_url (/embedded-menu/<cName> or /dispensary/<slug>). No hardcoded defaults. Each location must have its own valid menu_url and platform_dispensary_id; do not reuse IDs across locations. If a slug is invalid or missing, mark the store not crawlable and log it; resolve the ID before crawling.
- **Dual-mode always**: useBothModes:true to get pricing (Mode A) + full coverage (Mode B).
- **Batch DB writes**: Chunk products/snapshots/missing (100–200) to avoid OOM.
- **OOS/missing**: Include inactive/OOS in Mode B. Union A+B, dedupe by external_product_id+dispensary_id. Insert snapshots with stock_status; if a product is absent from both modes, insert missing_from_feed. Do not filter OOS by default.
- **API/Frontend**: Use /api/az/... endpoints (stores/products/brands/categories/summary/dashboard). Rebuild the frontend with VITE_API_URL pointing to the backend.
- **Scheduling**: Crawl only menu_type='dutchie' AND platform_dispensary_id IS NOT NULL. 4-hour crawl with jitter; detection job to set menu_type and resolve platform IDs.
- **Monitor**: /scraper-monitor (and /az-schedule) should show active/recent jobs from job_run_logs/crawl_jobs, with auto-refresh.
- **No slug guessing**: Never use defaults like "AZ-Deeply-Rooted." Always derive per store from menu_url and resolve platform IDs per location.

---

### Detailed Rules

1) **Use the consolidated DB everywhere**

- Preferred env: `CRAWLSY_DATABASE_URL` (fallback `DATABASE_URL`).
- Do NOT create dutchie tables in the legacy DB. Apply migrations 031/032/033 to the consolidated DB and restart.

2) **Dispensary vs Store**

- Dutchie pipeline uses `dispensaries` (not legacy `stores`). For dutchie crawls, always work with the dispensary ID.
- Ignore legacy fields like `dutchie_plus_id` and slug guessing. Use the record's `menu_url` and `platform_dispensary_id`.

3) **Menu detection and platform IDs**

- Set `menu_type` from `menu_url` detection; resolve `platform_dispensary_id` for `menu_type='dutchie'`.
- Admin should have "refresh detection" and "resolve ID" actions; schedule/crawl only when `menu_type='dutchie'` AND `platform_dispensary_id` is set.

4) **Queries and mapping**

- The DB returns snake_case; code expects camelCase. Always alias/map:

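A minimal TypeScript sketch of the alias/map rule; the row and entity shapes here are illustrative, not the project's actual types:

```typescript
// Hypothetical snake_case row shape as returned by the consolidated DB.
interface DispensaryRow {
  menu_url: string | null;
  platform_dispensary_id: string | null;
  menu_type: string | null;
}

// The camelCase shape the application code expects.
interface Dispensary {
  menuUrl: string | null;
  platformDispensaryId: string | null;
  menuType: string | null;
}

// Map one DB row to the camelCase shape; do this at the query boundary.
function mapDispensaryRow(row: DispensaryRow): Dispensary {
  return {
    menuUrl: row.menu_url,
    platformDispensaryId: row.platform_dispensary_id,
    menuType: row.menu_type,
  };
}
```

Alternatively, alias directly in SQL (`SELECT menu_url AS "menuUrl" ...`); either way, pick one boundary and map consistently.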
@@ -20,7 +38,7 @@

5) **Scheduling**

- `/scraper-schedule` should accept filters/search (All vs AZ-only, name).
- "Run Now"/scheduler must skip or warn if `menu_type!='dutchie'` or `platform_dispensary_id` is missing.
- Use the `dispensary_crawl_status` view; show a reason when a store is not crawlable.

6) **Crawling**
@@ -33,7 +51,7 @@

- `/scraper-schedule`: add filters/search, keep as master view for all schedules; reflect platform ID/menu_type status and controls (resolve ID, run now, enable/disable/delete).

8) **No slug guessing**

- Do not guess slugs; use the DB record's `menu_url` and ID. Resolve the platform ID from the URL/cName; if it is already set, crawl directly by ID.

9) **Verify locally before pushing**

- Apply migrations, restart the backend, ensure auth (`users` table) exists, run a dutchie crawl for a known dispensary (e.g., Deeply Rooted), and check `/api/az/dashboard`, `/api/az/stores/:id/products`, `/az`, `/scraper-schedule`.

@@ -44,3 +62,143 @@

- Brand images: `/images/brands/<brand_slug_or_sku>-<hash>.webp`.
- Store local URLs in DB fields (keep original URLs as fallback only).
- Serve `/images` via backend static middleware.

11) **Dutchie GraphQL fetch rules**

- **Endpoint**: `https://dutchie.com/api-3/graphql` (NOT `api-gw.dutchie.com`, which no longer exists).
- **Variables**: Use `productsFilter.dispensaryId` = `platform_dispensary_id` (a MongoDB ObjectId, e.g., `6405ef617056e8014d79101b`).
- Do NOT use `dispensaryFilter.cNameOrID`; that is outdated.
- `cName` (e.g., `AZ-Deeply-Rooted`) is only for Referer/Origin headers and Puppeteer session bootstrapping.
- **Mode A**: `Status: "Active"` returns active products with pricing.
- **Mode B**: `Status: null` / `activeOnly: false` returns all products, including OOS/inactive.
- **Example payload**:

```json
{
  "operationName": "FilteredProducts",
  "variables": {
    "productsFilter": {
      "dispensaryId": "6405ef617056e8014d79101b",
      "pricingType": "rec",
      "Status": "Active"
    }
  },
  "extensions": {
    "persistedQuery": { "version": 1, "sha256Hash": "<hash>" }
  }
}
```

- **Headers** (server-side axios only): Chrome UA, `Origin: https://dutchie.com`, `Referer: https://dutchie.com/embedded-menu/<cName>`, `accept: application/json`, `content-type: application/json`.
- If local DNS can't resolve, run the fetch from an environment that can (K8s pod/remote host), not from a browser.
- Use server-side axios with embedded-menu headers; include the CF/session cookie from Puppeteer if needed.
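The payload and header rules above can be sketched as pure builders that a server-side axios call would consume; the `CrawlMode` type is an illustrative assumption, and the persisted-query hash must come from a real captured session:

```typescript
type CrawlMode = "A" | "B";

// Build the GraphQL request body for the given crawl mode.
function buildFilteredProductsPayload(dispensaryId: string, sha256Hash: string, mode: CrawlMode) {
  return {
    operationName: "FilteredProducts",
    variables: {
      productsFilter: {
        dispensaryId, // platform_dispensary_id (MongoDB ObjectId)
        pricingType: "rec",
        // Mode A: active products with pricing. Mode B: full coverage incl. OOS/inactive.
        Status: mode === "A" ? "Active" : null,
        ...(mode === "B" ? { activeOnly: false } : {}),
      },
    },
    extensions: { persistedQuery: { version: 1, sha256Hash } },
  };
}

// Headers for the server-side request; the Referer carries the store's cName.
function buildHeaders(cName: string) {
  return {
    origin: "https://dutchie.com",
    referer: `https://dutchie.com/embedded-menu/${cName}`,
    accept: "application/json",
    "content-type": "application/json",
  };
}
```

Keeping the builders pure makes the Mode A/B variable rules testable without any network access.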
12) **Stop over-prep; run the crawl**

- To seed/refresh a store, run a one-off crawl by dispensary ID (example for Deeply Rooted):

```bash
DATABASE_URL="postgresql://dutchie:dutchie_local_pass@localhost:54320/dutchie_menus" \
npx tsx -e "const { crawlDispensaryProducts } = require('./src/dutchie-az/services/product-crawler'); const d={id:112,name:'Deeply Rooted',platform:'dutchie',platformDispensaryId:'6405ef617056e8014d79101b',menuType:'dutchie'}; crawlDispensaryProducts(d,'rec',{useBothModes:true}).then(r=>{console.log(r);process.exit(0);}).catch(e=>{console.error(e);process.exit(1);});"
```

  If local DNS is blocked, run the same command inside the scraper pod via `kubectl exec ... -- bash -lc '...'`.

- After the crawl, verify counts via `dutchie_products`, `dutchie_product_snapshots`, and `dispensaries.last_crawl_at`. Do not inspect the legacy `products` table for Dutchie.

13) **Fetch troubleshooting**

- If 403 or empty data: log the status + first GraphQL error; include the cf_clearance/session cookie from Puppeteer; ensure headers match a real Chrome request; ensure variables use `productsFilter.dispensaryId`.
- If DNS fails locally, do NOT debug DNS; run the fetch from an environment that resolves (K8s/remote) or via Puppeteer-captured headers/cookies. No browser/CORS attempts.

14) **Views and metrics**

- Keep v_brands/v_categories/v_brand_history based on `dutchie_products` and preserve brand_count metrics. Do not drop brand_count.

15) **Batch DB writes to avoid OOM**

- Do NOT build one giant upsert/insert payload for products/snapshots/missing marks.
- Chunk arrays (e.g., 100–200 items) and upsert/insert in a loop; drop references after each chunk.
- Apply this to products, product snapshots, and any "mark missing" logic to keep memory low during crawls.

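The chunking rule above can be sketched with a generic helper; the `upsertProducts` call in the usage comment is hypothetical:

```typescript
// Split a large array into slices of `size` so each DB write stays small
// and each batch's references can be dropped before the next one.
function chunk<T>(items: T[], size: number): T[][] {
  const out: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    out.push(items.slice(i, i + size));
  }
  return out;
}

// Usage sketch (upsertProducts is a hypothetical stand-in):
// for (const batch of chunk(products, 150)) {
//   await upsertProducts(batch);
// }
```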
16) **Use dual-mode crawls by default**

- Always run with `useBothModes:true` to combine:
  - Mode A (active feed with pricing/stock)
  - Mode B (max coverage including OOS/inactive)
- Union/dedupe by product ID so you keep full coverage and pricing in one run.
- If you only run Mode B, prices will be null; dual-mode fills pricing while retaining OOS items.

17) **Capture OOS and missing items**

- GraphQL variables must include inactive/OOS (`Status: All` / `activeOnly:false`). Mode B already returns OOS/inactive; union with Mode A to keep pricing.
- After unioning Mode A/B, upsert products and insert snapshots with stock_status from the feed. If an existing product is absent from both Mode A and Mode B for the run, insert a snapshot with is_present_in_feed=false and stock_status='missing_from_feed'.
- Do not filter out OOS/missing in the API; only filter when the user requests it (e.g., stockStatus=in_stock). Expose stock_status/in_stock from the latest snapshot (fall back to the product).
- Verify with `/api/az/stores/:id/products?stockStatus=out_of_stock` and `?stockStatus=missing_from_feed`.

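The union/dedupe and missing-detection rules in sections 16 and 17 can be sketched as follows; the `FeedProduct` shape is illustrative, while the dedupe key follows the stated rule (external_product_id + dispensary_id):

```typescript
interface FeedProduct {
  externalProductId: string;
  dispensaryId: string;
  price: number | null;
  stockStatus: string;
}

const key = (p: FeedProduct) => `${p.externalProductId}:${p.dispensaryId}`;

// Union Mode A and Mode B; Mode A rows win because they carry pricing,
// while Mode B contributes the OOS/inactive items Mode A omits.
function unionModes(modeA: FeedProduct[], modeB: FeedProduct[]): FeedProduct[] {
  const byKey = new Map<string, FeedProduct>();
  for (const p of modeB) byKey.set(key(p), p);
  for (const p of modeA) byKey.set(key(p), p); // overwrite with priced rows
  return [...byKey.values()];
}

// Known products absent from both modes this run should get a snapshot
// with is_present_in_feed=false and stock_status='missing_from_feed'.
function findMissing(known: FeedProduct[], union: FeedProduct[]): FeedProduct[] {
  const seen = new Set(union.map(key));
  return known.filter((p) => !seen.has(key(p)));
}
```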
18) **Menu discovery must crawl the website when menu_url is null**

- For dispensaries with no menu_url or unknown menu_type, crawl the dispensary's `website` (if present) to find provider links (dutchie, treez, jane, weedmaps, leafly, etc.). Follow "menu/order/shop" links to a shallow depth with timeouts/rate limits.
- If a provider link is found, set menu_url and menu_type and store detection metadata; if dutchie, derive cName from menu_url and resolve platform_dispensary_id; store resolved_at and detection details.
- Do NOT mark a dispensary not_crawlable solely because menu_url is null; only mark not_crawlable if the website crawl fails to find a menu or returns 403/404/invalid. Log the reason in provider_detection_data and crawl_status_reason.
- Keep this as the menu discovery job (separate from product crawls); log successes/errors to job_run_logs. Only schedule product crawls for stores with menu_type='dutchie' AND platform_dispensary_id IS NOT NULL.

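Provider-link detection can be sketched as a pattern scan over the links discovered on a dispensary's website; the provider list follows the rule above, but the URL patterns are assumptions to verify per provider:

```typescript
// Ordered provider patterns; first matching link wins.
const PROVIDER_PATTERNS: Array<[string, RegExp]> = [
  ["dutchie", /dutchie\.com/i],
  ["treez", /treez\.io/i],
  ["jane", /iheartjane\.com/i],
  ["weedmaps", /weedmaps\.com/i],
  ["leafly", /leafly\.com/i],
];

// Scan discovered links for a known menu provider; null means the
// website crawl found no menu (a candidate for not_crawlable + logging).
function detectProvider(links: string[]): { provider: string; menuUrl: string } | null {
  for (const link of links) {
    for (const [provider, pattern] of PROVIDER_PATTERNS) {
      if (pattern.test(link)) return { provider, menuUrl: link };
    }
  }
  return null;
}
```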
19) **Per-location cName and platform_dispensary_id resolution**

- For each dispensary, menu_url and cName must be valid for that exact location; no hardcoded defaults and no sharing platform_dispensary_id across locations.
- Derive cName from menu_url per store: `/embedded-menu/<cName>` or `/dispensary/<cName>`.
- Resolve platform_dispensary_id from that cName using the GraphQL GetAddressBasedDispensaryData query.
- If the slug is invalid/missing, mark the store not crawlable and log it; do not crawl with a mismatched cName/ID. Store the error in `provider_detection_data.resolution_error`.
- Before crawling, validate that the cName from menu_url matches the resolved platform ID; if mismatched, re-resolve before proceeding.

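Deriving cName from menu_url can be sketched as below; this is a hypothetical helper, not the project's actual resolver:

```typescript
// Extract cName from a Dutchie menu_url. Handles both known path shapes:
//   /embedded-menu/<cName>  and  /dispensary/<cName>
// Returns null for anything else, so the caller can mark the store
// not crawlable and log the reason instead of guessing a slug.
function deriveCName(menuUrl: string | null): string | null {
  if (!menuUrl) return null;
  const match = menuUrl.match(/\/(?:embedded-menu|dispensary)\/([^/?#]+)/);
  return match ? match[1] : null;
}
```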
20) **API endpoints (AZ pipeline)**

- Use the /api/az/... endpoints: stores, products, brands, categories, summary, dashboard.
- Rebuild the frontend with VITE_API_URL pointing to the backend.
- Dispensary Detail and analytics must use the AZ endpoints.

21) **Monitoring and logging**

- /scraper-monitor (and /az-schedule) should show active/recent jobs from job_run_logs/crawl_jobs.
- Auto-refresh every 30 seconds.
- The System Logs page should show real log data, not just startup messages.

22) **Dashboard Architecture - CRITICAL**

- **Frontend**: If you see old labels like "Active Proxies" or "Active Stores", the old dashboard bundle is being served. Rebuild the frontend with `VITE_API_URL` pointing to the correct backend and redeploy. Clear the browser cache. Confirm the new labels show up.
- **Backend**: `/api/dashboard/stats` MUST use the consolidated DB (same pool as the dutchie-az module). Use the correct tables: `dutchie_products`, `dispensaries`, and views like `v_dashboard_stats`, `v_latest_snapshots`. Do NOT use a separate legacy connection. Do NOT query `az_products` (it doesn't exist) or the legacy `stores`/`products` tables.
- **DB Connectivity**: Use the proper DB host/role. Errors like `role "dutchie" does not exist` mean you're exec'ing into the wrong Postgres pod or using the wrong credentials. Confirm the correct `DATABASE_URL` and test: `kubectl exec deployment/scraper -n dispensary-scraper -- psql $DATABASE_URL -c '\dt'`
- **After fixing**: The dashboard should show real data (e.g., 777 products) instead of zeros. Do NOT revert to legacy tables; point dashboard queries to the consolidated DB/views.
- **Checklist**:
  1. Rebuild/redeploy the frontend with the correct API URL; clear the cache.
  2. Fix `/api/dashboard/*` to use the consolidated DB pool and the dutchie views/tables.
  3. Test `/api/dashboard/stats` from the scraper pod; then reload the UI.

23) **Deployment (Gitea + Kubernetes)**

- **Registry**: Gitea at `code.cannabrands.app/creationshop/dispensary-scraper`
- **Build and push** (from the backend directory):

```bash
# Login to the Gitea container registry
docker login code.cannabrands.app

# Build the image
cd backend
docker build -t code.cannabrands.app/creationshop/dispensary-scraper:latest .

# Push to the registry
docker push code.cannabrands.app/creationshop/dispensary-scraper:latest
```

- **Deploy to Kubernetes**:

```bash
# Restart deployments to pull the new image
kubectl rollout restart deployment/scraper -n dispensary-scraper
kubectl rollout restart deployment/scraper-worker -n dispensary-scraper

# Watch rollout status
kubectl rollout status deployment/scraper -n dispensary-scraper
kubectl rollout status deployment/scraper-worker -n dispensary-scraper
```

- **Check pods**:

```bash
kubectl get pods -n dispensary-scraper
kubectl logs -f deployment/scraper -n dispensary-scraper
kubectl logs -f deployment/scraper-worker -n dispensary-scraper
```

- K8s manifests live in the `/k8s/` folder (scraper.yaml, scraper-worker.yaml, etc.)
- imagePullSecrets use the `regcred` secret for Gitea registry auth

24) **Crawler Architecture**

- **Scraper pod (1 replica)**: Runs the Express API server + scheduler. The scheduler enqueues detection and crawl jobs to the database queue (`crawl_jobs` table).
- **Scraper-worker pods (5 replicas)**: Each worker runs `dist/dutchie-az/services/worker.js`, polling the job queue and processing jobs.
- **Job types processed by workers**:
  - `menu_detection` / `menu_detection_single`: Detect the menu provider type and resolve platform_dispensary_id from menu_url.
  - `dutchie_product_crawl`: Crawl products from the Dutchie GraphQL API for dispensaries with valid platform IDs.
- **Job schedules** (managed in the `job_schedules` table):
  - `dutchie_az_menu_detection`: Runs daily with 60-min jitter; detects the menu type for dispensaries with unknown menu_type.
  - `dutchie_az_product_crawl`: Runs every 4 hours with 30-min jitter; crawls products from all detected Dutchie dispensaries.
- **Trigger schedules manually**: `curl -X POST /api/az/admin/schedules/{id}/trigger`
- **Check schedule status**: `curl /api/az/admin/schedules`
- **Worker logs**: `kubectl logs -f deployment/scraper-worker -n dispensary-scraper`

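The worker's poll-and-dispatch behavior can be sketched as a queue-draining loop; `claimNextJob` and the handlers are hypothetical stand-ins for the real code in `dist/dutchie-az/services/worker.js` (which would sleep and poll again when the queue is empty):

```typescript
type JobType = "menu_detection" | "menu_detection_single" | "dutchie_product_crawl";

interface Job {
  id: number;
  type: JobType;
  payload: Record<string, unknown>;
}

// Claim jobs one at a time from the crawl_jobs queue and dispatch by type.
// Returns the number of jobs processed; a failed job is logged, not fatal,
// so one bad dispensary cannot stall the whole worker.
async function drainQueue(
  claimNextJob: () => Promise<Job | null>,
  handlers: Record<JobType, (job: Job) => Promise<void>>
): Promise<number> {
  let processed = 0;
  for (;;) {
    const job = await claimNextJob();
    if (job === null) return processed; // queue empty; real worker sleeps then polls again
    try {
      await handlers[job.type](job);
    } catch (err) {
      console.error(`job ${job.id} failed:`, err);
    }
    processed++;
  }
}
```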