## Claude Guidelines for this Project
### Core Rules Summary
- **DB**: Use the single consolidated DB (CRAWLSY_DATABASE_URL → DATABASE_URL); no dual pools; schema_migrations must exist; apply migrations 031/032/033.
- **Images**: No MinIO. Save to local /images/products/<disp>/<prod>-<hash>.webp (and brands); preserve original URL; serve via backend static.
- **Dutchie GraphQL**: Endpoint https://dutchie.com/api-3/graphql. Variables must use productsFilter.dispensaryId (platform_dispensary_id). Mode A: Status="Active". Mode B: Status=null/activeOnly:false. No dispensaryFilter.cNameOrID.
- **cName/slug**: Derive cName from each store's menu_url (/embedded-menu/<cName> or /dispensary/<slug>). No hardcoded defaults. Each location must have its own valid menu_url and platform_dispensary_id; do not reuse IDs across locations. If slug is invalid/missing, mark not crawlable and log; resolve ID before crawling.
- **Dual-mode always**: useBothModes:true to get pricing (Mode A) + full coverage (Mode B).
- **Batch DB writes**: Chunk products/snapshots/missing (100-200 items per chunk) to avoid OOM.
- **OOS/missing**: Include inactive/OOS in Mode B. Union A+B, dedupe by external_product_id+dispensary_id. Insert snapshots with stock_status; if absent from both modes, insert missing_from_feed. Do not filter OOS by default.
- **API/Frontend**: Use /api/az/... endpoints (stores/products/brands/categories/summary/dashboard). Rebuild frontend with VITE_API_URL pointing to the backend.
- **Scheduling**: Crawl only menu_type='dutchie' AND platform_dispensary_id IS NOT NULL. 4-hour crawl with jitter; detection job to set menu_type and resolve platform IDs.
- **Monitor**: /scraper-monitor (and /az-schedule) should show active/recent jobs from job_run_logs/crawl_jobs, with auto-refresh.
- **No slug guessing**: Never use defaults like `AZ-Deeply-Rooted`. Always derive per store from menu_url and resolve platform IDs per location.
---
### Detailed Rules
1) **Use the consolidated DB everywhere**
- Preferred env: `CRAWLSY_DATABASE_URL` (fallback `DATABASE_URL`).
- Do NOT create dutchie tables in the legacy DB. Apply migrations 031/032/033 to the consolidated DB and restart.
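The env fallback in rule 1 can be sketched as a small helper. This is a minimal illustration, not the project's actual code: `Env` and `resolveDatabaseUrl` are hypothetical names, and only the two variable names come from the rule above.

```typescript
// Hypothetical helper: pick the consolidated DB connection string.
type Env = Record<string, string | undefined>;

function resolveDatabaseUrl(env: Env): string {
  // Prefer CRAWLSY_DATABASE_URL, fall back to DATABASE_URL.
  const url = env.CRAWLSY_DATABASE_URL || env.DATABASE_URL;
  if (!url) {
    // Fail fast rather than silently opening a second (legacy) pool.
    throw new Error("Set CRAWLSY_DATABASE_URL (or DATABASE_URL) for the consolidated DB");
  }
  return url;
}
```

In a Node process this would typically be called once as `resolveDatabaseUrl(process.env)` and the result shared by every module, so that only one pool exists.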
2) **Dispensary vs Store**
- Dutchie pipeline uses `dispensaries` (not legacy `stores`). For dutchie crawls, always work with dispensary ID.
- Ignore legacy fields like `dutchie_plus_id` and slug guessing. Use the record's `menu_url` and `platform_dispensary_id`.
3) **Menu detection and platform IDs**
- Set `menu_type` from `menu_url` detection; resolve `platform_dispensary_id` for `menu_type='dutchie'`.
- Admin should have "refresh detection" and "resolve ID" actions; schedule/crawl only when `menu_type='dutchie'` AND `platform_dispensary_id` is set.
4) **Queries and mapping**
- The DB returns snake_case; code expects camelCase. Always alias/map:
- `platform_dispensary_id AS "platformDispensaryId"`
- Map via `mapDbRowToDispensary` when loading dispensaries (scheduler, crawler, admin crawl).
- Avoid `SELECT *`; explicitly select and/or map fields.
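As a rough illustration of the snake_case to camelCase mapping in rule 4, the sketch below maps a row by hand. The row and model shapes are assumptions (the real `dispensaries` table has more columns), and `mapRowSketch` is a hypothetical stand-in, not the project's `mapDbRowToDispensary`.

```typescript
// Assumed subset of columns returned by an explicit SELECT (no SELECT *).
interface DispensaryRow {
  id: number;
  name: string;
  menu_type: string | null;
  menu_url: string | null;
  platform_dispensary_id: string | null;
}

// Assumed camelCase shape that the crawler and scheduler code expects.
interface Dispensary {
  id: number;
  name: string;
  menuType: string | null;
  menuUrl: string | null;
  platformDispensaryId: string | null;
}

// Hypothetical mapper: every field is named explicitly, so a missing
// alias shows up as a type error instead of an undefined value at runtime.
function mapRowSketch(row: DispensaryRow): Dispensary {
  return {
    id: row.id,
    name: row.name,
    menuType: row.menu_type,
    menuUrl: row.menu_url,
    platformDispensaryId: row.platform_dispensary_id,
  };
}
```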
5) **Scheduling**
- `/scraper-schedule` should accept filters/search (All vs AZ-only, name).
- "Run Now"/scheduler must skip or warn if `menu_type!='dutchie'` or `platform_dispensary_id` missing.
- Use `dispensary_crawl_status` view; show reason when not crawlable.
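The skip-or-warn check for "Run Now" and the scheduler could look roughly like this. It is a sketch under assumptions: `CrawlTarget`, `Crawlability`, and `checkCrawlable` are hypothetical names, and the reason strings are illustrative rather than what the `dispensary_crawl_status` view returns.

```typescript
interface CrawlTarget {
  menuType: string | null;
  platformDispensaryId: string | null;
}

type Crawlability =
  | { crawlable: true }
  | { crawlable: false; reason: string };

// Hypothetical guard: returns a human-readable reason when a crawl must be skipped.
function checkCrawlable(d: CrawlTarget): Crawlability {
  if (d.menuType !== "dutchie") {
    return { crawlable: false, reason: `menu_type is ${d.menuType ?? "unknown"}, not dutchie` };
  }
  if (!d.platformDispensaryId) {
    return { crawlable: false, reason: "platform_dispensary_id is not resolved" };
  }
  return { crawlable: true };
}
```

Returning the reason alongside the flag lets the schedule UI show why a store was skipped instead of failing silently.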
6) **Crawling**
- Trigger dutchie crawls by dispensary ID (e.g., `/api/az/admin/crawl/:id` or `runDispensaryOrchestrator(id)`).
- Update existing products (by stable product ID), append snapshots for history (every 4h cadence), download images locally (`/images/...`), store local URLs.
- Use dutchie GraphQL pipeline only for `menu_type='dutchie'`.
7) **Frontend**
- Forward-facing URLs: `/api/az`, `/az`, `/az-schedule`; no vendor names.
- `/scraper-schedule`: add filters/search, keep as master view for all schedules; reflect platform ID/menu_type status and controls (resolve ID, run now, enable/disable/delete).
8) **No slug guessing**
- Do not guess slugs; use the DB record's `menu_url` and ID. Resolve platform ID from the URL/cName; if set, crawl directly by ID.
9) **Verify locally before pushing**
- Apply migrations, restart backend, ensure auth (`users` table) exists, run dutchie crawl for a known dispensary (e.g., Deeply Rooted), check `/api/az/dashboard`, `/api/az/stores/:id/products`, `/az`, `/scraper-schedule`.
10) **Image storage (no MinIO)**
- Save images to local filesystem only. Do not create or use MinIO in Docker.
- Product images: `/images/products/<dispensary_id>/<product_id>-<hash>.webp` (+medium/+thumb).
- Brand images: `/images/brands/<brand_slug_or_sku>-<hash>.webp`.
- Store local URLs in DB fields (keep original URLs as fallback only).
- Serve `/images` via backend static middleware.
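The path conventions in rule 10 can be expressed as two small builders. This is only a sketch: the function names are hypothetical, the hash is taken as an input because the rule does not say what is hashed, and the `-medium` / `-thumb` suffix format is an assumption about how the size variants are named.

```typescript
type ImageSize = "full" | "medium" | "thumb";

// Hypothetical builder for /images/products/<dispensary_id>/<product_id>-<hash>.webp
function productImagePath(
  dispensaryId: number,
  productId: string,
  hash: string,
  size: ImageSize = "full",
): string {
  // Assumption: size variants append "-medium" or "-thumb" before the extension.
  const suffix = size === "full" ? "" : `-${size}`;
  return `/images/products/${dispensaryId}/${productId}-${hash}${suffix}.webp`;
}

// Hypothetical builder for /images/brands/<brand_slug_or_sku>-<hash>.webp
function brandImagePath(brandSlugOrSku: string, hash: string): string {
  return `/images/brands/${brandSlugOrSku}-${hash}.webp`;
}
```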
11) **Dutchie GraphQL fetch rules**
- **Endpoint**: `https://dutchie.com/api-3/graphql` (NOT `api-gw.dutchie.com` which no longer exists).
- **Variables**: Use `productsFilter.dispensaryId` = `platform_dispensary_id` (MongoDB ObjectId, e.g., `6405ef617056e8014d79101b`).
- Do NOT use `dispensaryFilter.cNameOrID` - that's outdated.
- `cName` (e.g., `AZ-Deeply-Rooted`) is only for Referer/Origin headers and Puppeteer session bootstrapping.
- **Mode A**: `Status: "Active"` - returns active products with pricing
- **Mode B**: `Status: null` / `activeOnly: false` - returns all products including OOS/inactive
- **Example payload**:
```json
{
"operationName": "FilteredProducts",
"variables": {
"productsFilter": {
"dispensaryId": "6405ef617056e8014d79101b",
"pricingType": "rec",
"Status": "Active"
}
},
"extensions": {
"persistedQuery": { "version": 1, "sha256Hash": "<hash>" }
}
}
```
- **Headers** (server-side axios only): Chrome UA, `Origin: https://dutchie.com`, `Referer: https://dutchie.com/embedded-menu/<cName>`, `accept: application/json`, `content-type: application/json`.
- If local DNS can't resolve, run fetch from an environment that can (K8s pod/remote host), not from browser.
- Use server-side axios with embedded-menu headers; include CF/session cookie from Puppeteer if needed.
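To make the Mode A / Mode B difference concrete, the sketch below assembles the URL, headers, and payload described in rule 11 without sending anything. `buildFilteredProductsRequest` and its option names are hypothetical. The persisted-query hash and the Chrome user-agent string are passed in because this document does not specify them. Where `activeOnly: false` belongs in the variables is not stated above, so the sketch leaves it out.

```typescript
type CrawlMode = "A" | "B";

interface FilteredProductsRequest {
  url: string;
  headers: Record<string, string>;
  body: {
    operationName: string;
    variables: {
      productsFilter: { dispensaryId: string; pricingType: string; Status: string | null };
    };
    extensions: { persistedQuery: { version: number; sha256Hash: string } };
  };
}

// Hypothetical builder; the result can be handed to a server-side HTTP client.
function buildFilteredProductsRequest(opts: {
  platformDispensaryId: string; // platform_dispensary_id from the DB
  cName: string;                // used only for the Referer header
  mode: CrawlMode;
  pricingType: string;          // e.g. "rec"
  persistedQueryHash: string;   // not given in this document; supply the real hash
  userAgent: string;            // a real Chrome UA string
}): FilteredProductsRequest {
  return {
    url: "https://dutchie.com/api-3/graphql",
    headers: {
      "user-agent": opts.userAgent,
      origin: "https://dutchie.com",
      referer: `https://dutchie.com/embedded-menu/${opts.cName}`,
      accept: "application/json",
      "content-type": "application/json",
    },
    body: {
      operationName: "FilteredProducts",
      variables: {
        productsFilter: {
          dispensaryId: opts.platformDispensaryId,
          pricingType: opts.pricingType,
          // Mode A: active products with pricing. Mode B: everything, including OOS/inactive.
          Status: opts.mode === "A" ? "Active" : null,
        },
      },
      extensions: { persistedQuery: { version: 1, sha256Hash: opts.persistedQueryHash } },
    },
  };
}
```

Keeping the builder pure makes it easy to log or diff the exact request when a fetch returns 403 or empty data.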
12) **Stop over-prep; run the crawl**
- To seed/refresh a store, run a one-off crawl by dispensary ID (example for Deeply Rooted):
```bash
DATABASE_URL="postgresql://dutchie:dutchie_local_pass@localhost:54320/dutchie_menus" \
npx tsx -e "const { crawlDispensaryProducts } = require('./src/dutchie-az/services/product-crawler'); const d={id:112,name:'Deeply Rooted',platform:'dutchie',platformDispensaryId:'6405ef617056e8014d79101b',menuType:'dutchie'}; crawlDispensaryProducts(d,'rec',{useBothModes:true}).then(r=>{console.log(r);process.exit(0);}).catch(e=>{console.error(e);process.exit(1);});"
```
If local DNS is blocked, run the same command inside the scraper pod via `kubectl exec ... -- bash -lc '...'`.
- After crawl, verify counts via `dutchie_products`, `dutchie_product_snapshots`, and `dispensaries.last_crawl_at`. Do not inspect the legacy `products` table for Dutchie.
13) **Fetch troubleshooting**
- If 403 or empty data: log status + first GraphQL error; include cf_clearance/session cookie from Puppeteer; ensure headers match a real Chrome request; ensure variables use `productsFilter.dispensaryId`.
- If DNS fails locally, do NOT debug DNS—run the fetch from an environment that resolves (K8s/remote) or via Puppeteer-captured headers/cookies. No browser/CORS attempts.
14) **Views and metrics**
- Keep v_brands/v_categories/v_brand_history based on `dutchie_products` and preserve brand_count metrics. Do not drop brand_count.
15) **Batch DB writes to avoid OOM**
- Do NOT build one giant upsert/insert payload for products/snapshots/missing marks.
- Chunk arrays (e.g., 100-200 items) and upsert/insert in a loop; drop references after each chunk.
- Apply to products, product snapshots, and any "mark missing" logic to keep memory low during crawls.
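A minimal sketch of the chunked-write loop follows. `chunk` and `writeInChunks` are hypothetical helpers, and `writeBatch` stands in for whatever upsert/insert function the crawler actually uses.

```typescript
// Split an array into consecutive slices of at most `size` items.
function chunk<T>(items: readonly T[], size: number): T[][] {
  if (!Number.isInteger(size) || size <= 0) {
    throw new Error("chunk size must be a positive integer");
  }
  const out: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    out.push(items.slice(i, i + size));
  }
  return out;
}

// Write batches one at a time so only one statement's parameters are in flight.
async function writeInChunks<T>(
  items: readonly T[],
  size: number,
  writeBatch: (batch: T[]) => Promise<void>, // placeholder for the real upsert/insert
): Promise<number> {
  let written = 0;
  for (const batch of chunk(items, size)) {
    await writeBatch(batch);
    written += batch.length;
  }
  return written;
}
```

Awaiting each batch before starting the next is what keeps memory bounded; firing all batches with `Promise.all` would bring back the original problem.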
16) **Use dual-mode crawls by default**
- Always run with `useBothModes:true` to combine:
- Mode A (active feed with pricing/stock)
- Mode B (max coverage including OOS/inactive)
- Union/dedupe by product ID so you keep full coverage and pricing in one run.
- If you only run Mode B, prices will be null; dual-mode fills pricing while retaining OOS items.
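The union/dedupe step can be sketched as below for a single dispensary's crawl, where the product ID alone is a sufficient key. `FeedProduct` and `mergeModes` are hypothetical, and letting the Mode A record win on overlap is an assumption based on Mode A being the feed that carries pricing.

```typescript
// Assumed minimal product shape; the real feed has many more fields.
interface FeedProduct {
  externalProductId: string;
  name: string;
  price: number | null;
}

// Hypothetical merge for one dispensary: Mode B gives coverage, Mode A gives pricing.
function mergeModes(modeA: FeedProduct[], modeB: FeedProduct[]): FeedProduct[] {
  const byId = new Map<string, FeedProduct>();
  // Load Mode B first so OOS/inactive items are present.
  for (const p of modeB) byId.set(p.externalProductId, p);
  // Mode A overwrites overlapping IDs so priced records are kept.
  for (const p of modeA) byId.set(p.externalProductId, p);
  return Array.from(byId.values());
}
```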
17) **Capture OOS and missing items**
- GraphQL variables must include inactive/OOS (Mode B: `Status: null` / `activeOnly: false`). Mode B already returns OOS/inactive; union with Mode A to keep pricing.
- After unioning Mode A/B, upsert products and insert snapshots with stock_status from the feed. If an existing product is absent from both Mode A and Mode B for the run, insert a snapshot with is_present_in_feed=false and stock_status='missing_from_feed'.
- Do not filter out OOS/missing in the API; only filter when the user requests (e.g., stockStatus=in_stock). Expose stock_status/in_stock from the latest snapshot (fallback to product).
- Verify with `/api/az/stores/:id/products?stockStatus=out_of_stock` and `?stockStatus=missing_from_feed`.
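The "absent from both modes" rule amounts to a set difference between products already stored for the dispensary and products seen in this run. The sketch below assumes camelCase field names and a hypothetical `buildMissingSnapshots` helper; the two values it writes come from the rule above.

```typescript
interface MissingSnapshot {
  externalProductId: string;
  stockStatus: "missing_from_feed";
  isPresentInFeed: false;
}

// Hypothetical helper: one snapshot per known product that neither mode returned.
function buildMissingSnapshots(
  existingIds: readonly string[], // product IDs already stored for this dispensary
  seenIds: readonly string[],     // union of IDs returned by Mode A and Mode B
): MissingSnapshot[] {
  const seen = new Set(seenIds);
  return existingIds
    .filter((id) => !seen.has(id))
    .map((id) => ({
      externalProductId: id,
      stockStatus: "missing_from_feed" as const,
      isPresentInFeed: false as const,
    }));
}
```

The resulting rows are appended as new snapshots; the product records themselves are left in place.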
18) **Menu discovery must crawl the website when menu_url is null**
- For dispensaries with no menu_url or unknown menu_type, crawl the dispensary.website (if present) to find provider links (dutchie, treez, jane, weedmaps, leafly, etc.). Follow “menu/order/shop” links up to a shallow depth with timeouts/rate limits.
- If a provider link is found, set menu_url, set menu_type, and store detection metadata; if dutchie, derive cName from menu_url and resolve platform_dispensary_id; store resolved_at and detection details.
- Do NOT mark a dispensary not_crawlable solely because menu_url is null; only mark not_crawlable if the website crawl fails to find a menu or returns 403/404/invalid. Log the reason in provider_detection_data and crawl_status_reason.
- Keep this as the menu discovery job (separate from product crawls); log successes/errors to job_run_logs. Only schedule product crawls for stores with menu_type='dutchie' AND platform_dispensary_id IS NOT NULL.
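Classifying a link found on a dispensary website might look like the following sketch. The hostname list is an assumption about where each provider's menus are hosted and should be checked against real menu URLs; `detectProvider` is a hypothetical name.

```typescript
type MenuProvider = "dutchie" | "treez" | "jane" | "weedmaps" | "leafly";

// Assumed provider domains; extend or correct as real menu URLs are observed.
const PROVIDER_HOSTS: Array<[MenuProvider, string]> = [
  ["dutchie", "dutchie.com"],
  ["treez", "treez.io"],
  ["jane", "iheartjane.com"],
  ["weedmaps", "weedmaps.com"],
  ["leafly", "leafly.com"],
];

// Hypothetical classifier: match on the hostname only, so a path or query
// string that merely mentions a provider does not count as a detection.
function detectProvider(link: string): MenuProvider | null {
  const match = /^https?:\/\/([^/?#:]+)/i.exec(link);
  if (!match) return null;
  const host = match[1].toLowerCase();
  for (const [provider, domain] of PROVIDER_HOSTS) {
    if (host === domain || host.endsWith("." + domain)) return provider;
  }
  return null;
}
```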
19) **Preserve all stock statuses (including unknown)**
- Do not filter or drop stock_status values in API/UI; pass through whatever is stored on the latest snapshot/product. Expected values include: in_stock, out_of_stock (if exposed), missing_from_feed, unknown. Only apply filters when explicitly requested by the user.
20) **Never delete or overwrite historical data**
- Do not delete products/snapshots or overwrite historical records. Always append snapshots for changes (price/stock/qty), and mark missing_from_feed instead of removing records. Historical data must remain intact for analytics.
21) **Deployment via CI/CD only**
- Test locally, commit clean changes, and let CI/CD build and deploy to Kubernetes at code.cannabrands.app. Do NOT manually build/push images or tweak prod pods. Deploy backend first, smoke-test APIs, then frontend; roll back via CI/CD if needed.
18) **Per-location cName and platform_dispensary_id resolution**
- For each dispensary, menu_url and cName must be valid for that exact location; no hardcoded defaults and no sharing platform_dispensary_id across locations.
- Derive cName from menu_url per store: `/embedded-menu/<cName>` or `/dispensary/<cName>`.
- Resolve platform_dispensary_id from that cName using GraphQL GetAddressBasedDispensaryData.
- If the slug is invalid/missing, mark the store not crawlable and log it; do not crawl with a mismatched cName/ID. Store the error in `provider_detection_data.resolution_error`.
- Before crawling, validate that the cName from menu_url matches the resolved platform ID; if mismatched, re-resolve before proceeding.
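Deriving the cName from `menu_url` can be sketched with a single pattern covering both URL forms named above. `deriveCName` is a hypothetical helper; it returns `null` rather than a default so that callers are forced to mark the store not crawlable.

```typescript
// Hypothetical helper: extract <cName> from /embedded-menu/<cName> or /dispensary/<cName>.
function deriveCName(menuUrl: string | null): string | null {
  if (!menuUrl) return null;
  const match = /\/(?:embedded-menu|dispensary)\/([^/?#]+)/.exec(menuUrl);
  // No fallback value: an unrecognized URL must not turn into a guessed slug.
  return match ? match[1] : null;
}
```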
19) **API endpoints (AZ pipeline)**
- Use /api/az/... endpoints: stores, products, brands, categories, summary, dashboard
- Rebuild frontend with VITE_API_URL pointing to the backend
- Dispensary Detail and analytics must use AZ endpoints
20) **Monitoring and logging**
- /scraper-monitor (and /az-schedule) should show active/recent jobs from job_run_logs/crawl_jobs
- Auto-refresh every 30 seconds
- System Logs page should show real log data, not just startup messages
21) **Dashboard Architecture - CRITICAL**
- **Frontend**: If you see old labels like "Active Proxies" or "Active Stores", it means the old dashboard bundle is being served. Rebuild the frontend with `VITE_API_URL` pointing to the correct backend and redeploy. Clear browser cache. Confirm new labels show up.
- **Backend**: `/api/dashboard/stats` MUST use the consolidated DB (same pool as dutchie-az module). Use the correct tables: `dutchie_products`, `dispensaries`, and views like `v_dashboard_stats`, `v_latest_snapshots`. Do NOT use a separate legacy connection. Do NOT query `az_products` (doesn't exist) or legacy `stores`/`products` tables.
- **DB Connectivity**: Use the proper DB host/role. Errors like `role "dutchie" does not exist` mean you're exec'ing into the wrong Postgres pod or using wrong credentials. Confirm the correct `DATABASE_URL` and test with the pod's own value (the inner `sh -c` keeps your local shell from expanding the variable): `kubectl exec deployment/scraper -n dispensary-scraper -- sh -c 'psql "$DATABASE_URL" -c "\dt"'`
- **After fixing**: Dashboard should show real data (e.g., 777 products) instead of zeros. Do NOT revert to legacy tables; point dashboard queries to the consolidated DB/views.
- **Checklist**:
1. Rebuild/redeploy frontend with correct API URL, clear cache
2. Fix `/api/dashboard/*` to use the consolidated DB pool and dutchie views/tables
3. Test `/api/dashboard/stats` from the scraper pod; then reload the UI
22) **Deployment (Gitea + Kubernetes)**
- **Registry**: Gitea at `code.cannabrands.app/creationshop/dispensary-scraper`
- **Build and push** (from backend directory):
```bash
# Login to Gitea container registry
docker login code.cannabrands.app
# Build the image
cd backend
docker build -t code.cannabrands.app/creationshop/dispensary-scraper:latest .
# Push to registry
docker push code.cannabrands.app/creationshop/dispensary-scraper:latest
```
- **Deploy to Kubernetes**:
```bash
# Restart deployments to pull new image
kubectl rollout restart deployment/scraper -n dispensary-scraper
kubectl rollout restart deployment/scraper-worker -n dispensary-scraper
# Watch rollout status
kubectl rollout status deployment/scraper -n dispensary-scraper
kubectl rollout status deployment/scraper-worker -n dispensary-scraper
```
- **Check pods**:
```bash
kubectl get pods -n dispensary-scraper
kubectl logs -f deployment/scraper -n dispensary-scraper
kubectl logs -f deployment/scraper-worker -n dispensary-scraper
```
- K8s manifests are in `/k8s/` folder (scraper.yaml, scraper-worker.yaml, etc.)
- imagePullSecrets use `regcred` secret for Gitea registry auth
23) **Crawler Architecture**
- **Scraper pod (1 replica)**: Runs the Express API server + scheduler. The scheduler enqueues detection and crawl jobs to the database queue (`crawl_jobs` table).
- **Scraper-worker pods (5 replicas)**: Each worker runs `dist/dutchie-az/services/worker.js`, polling the job queue and processing jobs.
- **Job types processed by workers**:
- `menu_detection` / `menu_detection_single`: Detect menu provider type and resolve platform_dispensary_id from menu_url
- `dutchie_product_crawl`: Crawl products from Dutchie GraphQL API for dispensaries with valid platform IDs
- **Job schedules** (managed in `job_schedules` table):
- `dutchie_az_menu_detection`: Runs daily with 60-min jitter, detects menu type for dispensaries with unknown menu_type
- `dutchie_az_product_crawl`: Runs every 4 hours with 30-min jitter, crawls products from all detected Dutchie dispensaries
- **Trigger schedules manually**: `curl -X POST <backend-url>/api/az/admin/schedules/{id}/trigger`
- **Check schedule status**: `curl <backend-url>/api/az/admin/schedules`
- **Worker logs**: `kubectl logs -f deployment/scraper-worker -n dispensary-scraper`
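The interval-plus-jitter scheduling described above can be illustrated as follows. This is a sketch under a stated assumption: jitter is treated as a uniform extra delay between 0 and the jitter window, added on top of the base interval. The actual scheduler may apply jitter differently, and `nextRunAt` is a hypothetical name.

```typescript
// Hypothetical helper: compute the next run time for a schedule.
function nextRunAt(
  lastRun: Date,
  intervalMinutes: number,              // e.g. 240 for the 4-hour product crawl
  jitterMinutes: number,                // e.g. 30
  random: () => number = Math.random,   // injectable for deterministic tests
): Date {
  // Assumption: jitter only delays the run, by a uniform amount in [0, jitterMinutes).
  const delayMinutes = intervalMinutes + random() * jitterMinutes;
  return new Date(lastRun.getTime() + delayMinutes * 60000);
}
```

Spreading start times this way keeps the five workers from hitting every store at the same instant after each 4-hour tick.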