## Claude Guidelines for this Project

---

## PERMANENT RULES (NEVER VIOLATE)

### 1. NO DELETION OF DATA — EVER

CannaiQ is a **historical analytics system**. Data retention is **permanent by design**.

**NEVER delete:**

- Product records
- Crawled snapshots
- Images
- Directories
- Logs
- Orchestrator traces
- Profiles
- Selector configs
- Crawl outcomes
- Store data
- Brand data

**NEVER automate cleanup:**

- No cron or scheduled job may `rm`, `unlink`, `delete`, `purge`, `prune`, `clean`, or `reset` any storage directory or DB row
- No migration may DELETE data — only add/update/alter columns
- If cleanup is required, ONLY the user may issue a manual command

**Code enforcement:**

- `local-storage.ts` must only: write files, create directories, read files
- No `deleteImage`, `deleteProductImages`, or similar functions

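To illustrate the write-only contract, here is a minimal sketch of the kind of surface `local-storage.ts` is allowed to expose. The function names and signatures below are illustrative assumptions, not the module's actual API:

```typescript
// Hypothetical sketch of the allowed surface: write, mkdir, read. No deletes.
import { promises as fs } from 'fs';
import path from 'path';

const BASE = process.env.STORAGE_BASE_PATH ?? './storage';

export async function writeImageFile(relPath: string, data: Buffer): Promise<string> {
  const absPath = path.join(BASE, relPath);
  await fs.mkdir(path.dirname(absPath), { recursive: true }); // create directories
  await fs.writeFile(absPath, data);                          // write files
  return absPath;
}

export async function readImageFile(relPath: string): Promise<Buffer> {
  return fs.readFile(path.join(BASE, relPath));               // read files
}

// Deliberately absent: deleteImage, deleteProductImages, unlink, rm of any kind.
```
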
### 2. DEPLOYMENT AUTHORIZATION REQUIRED

**NEVER deploy to production unless the user explicitly says:**

> "CLAUDE — DEPLOYMENT IS NOW AUTHORIZED."

Until then:

- All work is LOCAL ONLY
- No `kubectl apply`, `docker push`, or remote operations
- No port-forwarding to production
- No connecting to Kubernetes clusters

### 3. LOCAL DEVELOPMENT BY DEFAULT

**In local mode:**

- Use `docker-compose.local.yml` (NO MinIO)
- Use local filesystem storage at `./storage`
- Connect to local PostgreSQL at `localhost:54320`
- Backend runs at `localhost:3010`
- NO remote connections, NO Kubernetes, NO MinIO

**Environment:**

```bash
STORAGE_DRIVER=local
STORAGE_BASE_PATH=./storage
DATABASE_URL=postgresql://dutchie:dutchie_local_pass@localhost:54320/dutchie_menus
# MINIO_ENDPOINT is NOT set (forces local storage)
```

### 4. MANDATORY LOCAL MODE FOR ALL CRAWLS AND TESTS

**Before running ANY of the following, CONFIRM local mode is active:**

- Crawler execution
- Orchestrator flows
- Sandbox tests
- Image scrape tests
- Module import tests

**Pre-execution checklist:**

1. ✅ `./start-local.sh` or `docker-compose -f docker-compose.local.yml up` running
2. ✅ `STORAGE_DRIVER=local`
3. ✅ `STORAGE_BASE_PATH=./storage`
4. ✅ NO MinIO, NO S3
5. ✅ NO port-forward
6. ✅ NO Kubernetes connection
7. ✅ Storage writes go to `/storage/products/{brand}/{state}/{product_id}/`

**If any condition is not met, DO NOT proceed with the crawl or test.**

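To make this checklist enforceable in code, a guard like the sketch below could run at every crawl/test entry point. `assertLocalMode` is a hypothetical helper written against the environment rules above, not existing project code:

```typescript
// Hypothetical pre-flight guard; call before any crawler, orchestrator, or sandbox run.
export function assertLocalMode(): void {
  const problems: string[] = [];
  if (process.env.STORAGE_DRIVER !== 'local') problems.push('STORAGE_DRIVER must be "local"');
  if (process.env.STORAGE_BASE_PATH !== './storage') problems.push('STORAGE_BASE_PATH must be "./storage"');
  if (process.env.MINIO_ENDPOINT) problems.push('MINIO_ENDPOINT must NOT be set');
  if (!(process.env.DATABASE_URL ?? '').includes('localhost:54320')) {
    problems.push('DATABASE_URL must point at local PostgreSQL (localhost:54320)');
  }
  if (problems.length > 0) {
    throw new Error(`Local mode check failed:\n- ${problems.join('\n- ')}`);
  }
}
```
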
---

## STORAGE BEHAVIOR

### Local Storage Structure

```
/storage/products/{brand}/{state}/{product_id}/
  image-{hash}.webp
  image-{hash}-medium.webp
  image-{hash}-thumb.webp

/storage/brands/{brand}/
  logo-{hash}.webp
```

### Storage Adapter

```typescript
import { saveImage, getImageUrl } from '../utils/storage-adapter';

// Automatically uses local storage when STORAGE_DRIVER=local
```

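A usage sketch, assuming `saveImage` accepts a storage key plus an image buffer and `getImageUrl` returns the served URL for that key; verify the actual signatures in `backend/src/utils/storage-adapter.ts` before relying on them:

```typescript
import { saveImage, getImageUrl } from '../utils/storage-adapter';

// Assumed signature: saveImage(key: string, data: Buffer) => Promise<void>.
// The key mirrors the local storage structure above; the hash segment is illustrative.
async function storeProductImage(brand: string, state: string, productId: string, img: Buffer) {
  const key = `products/${brand}/${state}/${productId}/image-abc123.webp`;
  await saveImage(key, img);   // lands under ./storage when STORAGE_DRIVER=local
  return getImageUrl(key);     // URL served by the backend static middleware
}
```
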
### Files

| File | Purpose |
|------|---------|
| `backend/src/utils/local-storage.ts` | Local filesystem adapter |
| `backend/src/utils/storage-adapter.ts` | Unified storage abstraction |
| `docker-compose.local.yml` | Local stack without MinIO |
| `start-local.sh` | Convenience startup script |

---

## FORBIDDEN ACTIONS

1. **Deleting any data** (products, snapshots, images, logs, traces)
2. **Deploying without explicit authorization**
3. **Connecting to Kubernetes** without authorization
4. **Port-forwarding to production** without authorization
5. **Starting MinIO** in local development
6. **Using S3/MinIO SDKs** when `STORAGE_DRIVER=local`
7. **Automating cleanup** of any kind
8. **Dropping database tables or columns**
9. **Overwriting historical records** (always append snapshots)

---

## UI ANONYMIZATION RULES

- No vendor names in forward-facing URLs: use `/api/az/...`, `/az`, `/az-schedule`
- No "dutchie", "treez", "jane", "weedmaps", "leafly" visible in consumer UIs
- Internal admin tools may show provider names for debugging

---

## FUTURE TODO / PENDING FEATURES

- [ ] Orchestrator observability dashboard
- [ ] Crawl profile management UI
- [ ] State machine sandbox (disabled until authorized)
- [ ] Multi-state expansion beyond AZ

---

### Multi-Site Architecture (CRITICAL)

This project has **5 working locations** - always clarify which one before making changes:

| Folder | Domain | Type | Purpose |
|--------|--------|------|---------|
| `backend/` | (shared) | Express API | Single backend serving all frontends |
| `frontend/` | dispos.crawlsy.com | React SPA (Vite) | Legacy admin dashboard (internal use) |
| `cannaiq/` | cannaiq.co | React SPA + PWA | NEW admin dashboard / B2B analytics |
| `findadispo/` | findadispo.com | React SPA + PWA | Consumer dispensary finder |
| `findagram/` | findagram.co | React SPA + PWA | Consumer delivery marketplace |

**IMPORTANT: `frontend/` vs `cannaiq/` confusion:**

- `frontend/` = OLD/legacy dashboard design, deployed to `dispos.crawlsy.com` (internal admin)
- `cannaiq/` = NEW dashboard design, deployed to `cannaiq.co` (customer-facing B2B)
- These are DIFFERENT codebases - do NOT confuse them!

**Before any frontend work, ASK: "Which site? cannaiq, findadispo, findagram, or legacy (frontend/)?"**

All four frontends share:

- Same backend API (port 3010)
- Same PostgreSQL database
- Same Kubernetes deployment for backend

Each frontend has:

- Its own folder, package.json, Dockerfile
- Its own domain and branding
- Its own PWA manifest and service worker (cannaiq, findadispo, findagram)
- Separate Docker containers in production

---

### Multi-Domain Hosting Architecture

The three public production frontends (cannaiq.co, findadispo.com, findagram.co) are served from the **same IP** using **host-based routing**:

**Kubernetes Ingress (Production):**

```yaml
# Each domain routes to its own frontend service
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: multi-site-ingress
spec:
  rules:
  - host: cannaiq.co
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: cannaiq-frontend
            port:
              number: 80
      - path: /api
        pathType: Prefix
        backend:
          service:
            name: scraper  # shared backend
            port:
              number: 3010
  - host: findadispo.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: findadispo-frontend
            port:
              number: 80
      - path: /api
        pathType: Prefix
        backend:
          service:
            name: scraper
            port:
              number: 3010
  - host: findagram.co
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: findagram-frontend
            port:
              number: 80
      - path: /api
        pathType: Prefix
        backend:
          service:
            name: scraper
            port:
              number: 3010
```

**Key Points:**

- DNS A records for all 3 domains point to same IP
- Ingress controller routes based on `Host` header
- Each frontend is a separate Docker container (nginx serving static files)
- All frontends share the same backend API at `/api/*`
- SSL/TLS handled at ingress level (cert-manager)

---

### PWA Setup Requirements

Each frontend is a **Progressive Web App (PWA)**. Required files in each `public/` folder:

1. **manifest.json** - App metadata, icons, theme colors
2. **service-worker.js** - Offline caching, background sync
3. **Icons** - 192x192 and 512x512 PNG icons

**Vite PWA Plugin Setup** (in each frontend's vite.config.ts):

```typescript
import { defineConfig } from 'vite'
import react from '@vitejs/plugin-react'
import { VitePWA } from 'vite-plugin-pwa'

export default defineConfig({
  plugins: [
    react(),
    VitePWA({
      registerType: 'autoUpdate',
      manifest: {
        name: 'Site Name',
        short_name: 'Short',
        theme_color: '#10b981',
        icons: [
          { src: '/icon-192.png', sizes: '192x192', type: 'image/png' },
          { src: '/icon-512.png', sizes: '512x512', type: 'image/png' }
        ]
      },
      workbox: {
        globPatterns: ['**/*.{js,css,html,ico,png,svg,woff2}']
      }
    })
  ]
})
```

---

### Core Rules Summary

- **DB**: Use the single consolidated DB (CRAWLSY_DATABASE_URL → DATABASE_URL); no dual pools; schema_migrations must exist; apply migrations 031/032/033.
- **Images**: No MinIO. Save to local /images/products/<disp>/<prod>-<hash>.webp (and brands); preserve the original URL; serve via backend static.
- **Dutchie GraphQL**: Endpoint https://dutchie.com/api-3/graphql. Variables must use productsFilter.dispensaryId (platform_dispensary_id). Mode A: Status="Active". Mode B: Status=null / activeOnly:false. No dispensaryFilter.cNameOrID.
- **cName/slug**: Derive cName from each store's menu_url (/embedded-menu/<cName> or /dispensary/<slug>). No hardcoded defaults. Each location must have its own valid menu_url and platform_dispensary_id; do not reuse IDs across locations. If the slug is invalid/missing, mark the store not crawlable and log it; resolve the ID before crawling.
- **Dual-mode always**: useBothModes:true to get pricing (Mode A) + full coverage (Mode B).
- **Batch DB writes**: Chunk products/snapshots/missing marks (100–200 items) to avoid OOM.
- **OOS/missing**: Include inactive/OOS in Mode B. Union A+B, dedupe by external_product_id+dispensary_id. Insert snapshots with stock_status; if a product is absent from both modes, insert missing_from_feed. Do not filter OOS by default.
- **API/Frontend**: Use /api/az/... endpoints (stores/products/brands/categories/summary/dashboard). Rebuild the frontend with VITE_API_URL pointing to the backend.
- **Scheduling**: Crawl only menu_type='dutchie' AND platform_dispensary_id IS NOT NULL. 4-hour crawl cadence with jitter; a detection job sets menu_type and resolves platform IDs.
- **Monitor**: /scraper-monitor (and /az-schedule) should show active/recent jobs from job_run_logs/crawl_jobs, with auto-refresh.
- **No slug guessing**: Never use defaults like "AZ-Deeply-Rooted." Always derive per store from menu_url and resolve platform IDs per location.

---

### Detailed Rules

1) **Use the consolidated DB everywhere**
- Preferred env: `CRAWLSY_DATABASE_URL` (fallback `DATABASE_URL`).
- Do NOT create dutchie tables in the legacy DB. Apply migrations 031/032/033 to the consolidated DB and restart.

2) **Dispensary vs Store**
- The Dutchie pipeline uses `dispensaries` (not legacy `stores`). For dutchie crawls, always work with the dispensary ID.
- Ignore legacy fields like `dutchie_plus_id` and slug guessing. Use the record's `menu_url` and `platform_dispensary_id`.

3) **Menu detection and platform IDs**
- Set `menu_type` from `menu_url` detection; resolve `platform_dispensary_id` for `menu_type='dutchie'`.
- Admin should have "refresh detection" and "resolve ID" actions; schedule/crawl only when `menu_type='dutchie'` AND `platform_dispensary_id` is set.

4) **Queries and mapping**
- The DB returns snake_case; code expects camelCase. Always alias/map:
  - `platform_dispensary_id AS "platformDispensaryId"`
- Map via `mapDbRowToDispensary` when loading dispensaries (scheduler, crawler, admin crawl).
- Avoid `SELECT *`; explicitly select and/or map fields. A sketch of the mapping pattern follows below.

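A minimal sketch of this mapping pattern; the field list is illustrative and the real `mapDbRowToDispensary` may cover more columns:

```typescript
// Illustrative sketch only; check the actual mapDbRowToDispensary for the full field set.
interface Dispensary {
  id: number;
  name: string;
  menuType: string | null;
  menuUrl: string | null;
  platformDispensaryId: string | null;
}

function mapDbRowToDispensary(row: Record<string, unknown>): Dispensary {
  return {
    id: row.id as number,
    name: row.name as string,
    menuType: (row.menu_type as string) ?? null,                           // snake_case -> camelCase
    menuUrl: (row.menu_url as string) ?? null,
    platformDispensaryId: (row.platform_dispensary_id as string) ?? null,
  };
}
```
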
5) **Scheduling**
- `/scraper-schedule` should accept filters/search (All vs AZ-only, name).
- "Run Now"/scheduler must skip or warn if `menu_type!='dutchie'` or `platform_dispensary_id` is missing.
- Use the `dispensary_crawl_status` view; show the reason when a store is not crawlable.

6) **Crawling**
- Trigger dutchie crawls by dispensary ID (e.g., `/api/az/admin/crawl/:id` or `runDispensaryOrchestrator(id)`).
- Update existing products (by stable product ID), append snapshots for history (every 4h cadence), download images locally (`/images/...`), store local URLs.
- Use the dutchie GraphQL pipeline only for `menu_type='dutchie'`.

7) **Frontend**
- Forward-facing URLs: `/api/az`, `/az`, `/az-schedule`; no vendor names.
- `/scraper-schedule`: add filters/search, keep as the master view for all schedules; reflect platform ID/menu_type status and controls (resolve ID, run now, enable/disable/delete).

8) **No slug guessing**
- Do not guess slugs; use the DB record's `menu_url` and ID. Resolve the platform ID from the URL/cName; if it is already set, crawl directly by ID.

9) **Verify locally before pushing**
- Apply migrations, restart the backend, ensure auth (`users` table) exists, run a dutchie crawl for a known dispensary (e.g., Deeply Rooted), check `/api/az/dashboard`, `/api/az/stores/:id/products`, `/az`, `/scraper-schedule`.

10) **Image storage (no MinIO)**
- Save images to the local filesystem only. Do not create or use MinIO in Docker.
- Product images: `/images/products/<dispensary_id>/<product_id>-<hash>.webp` (+medium/+thumb); a path sketch follows below.
- Brand images: `/images/brands/<brand_slug_or_sku>-<hash>.webp`.
- Store local URLs in DB fields (keep original URLs as fallback only).
- Serve `/images` via backend static middleware.

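A small sketch of the product-image path convention; hashing the source URL is an assumption here (the real code may hash the image bytes instead):

```typescript
import { createHash } from 'crypto';

// Builds the local product-image path per the convention above.
// The hash input (source URL) and length (8 hex chars) are illustrative choices.
function productImagePath(dispensaryId: number, productId: string, sourceUrl: string): string {
  const hash = createHash('sha1').update(sourceUrl).digest('hex').slice(0, 8);
  return `/images/products/${dispensaryId}/${productId}-${hash}.webp`;
}
```
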
11) **Dutchie GraphQL fetch rules**
- **Endpoint**: `https://dutchie.com/api-3/graphql` (NOT `api-gw.dutchie.com`, which no longer exists).
- **Variables**: Use `productsFilter.dispensaryId` = `platform_dispensary_id` (MongoDB ObjectId, e.g., `6405ef617056e8014d79101b`).
- Do NOT use `dispensaryFilter.cNameOrID` - that's outdated.
- `cName` (e.g., `AZ-Deeply-Rooted`) is only for Referer/Origin headers and Puppeteer session bootstrapping.
- **Mode A**: `Status: "Active"` - returns active products with pricing
- **Mode B**: `Status: null` / `activeOnly: false` - returns all products including OOS/inactive
- **Example payload**:

```json
{
  "operationName": "FilteredProducts",
  "variables": {
    "productsFilter": {
      "dispensaryId": "6405ef617056e8014d79101b",
      "pricingType": "rec",
      "Status": "Active"
    }
  },
  "extensions": {
    "persistedQuery": { "version": 1, "sha256Hash": "<hash>" }
  }
}
```

- **Headers** (server-side axios only): Chrome UA, `Origin: https://dutchie.com`, `Referer: https://dutchie.com/embedded-menu/<cName>`, `accept: application/json`, `content-type: application/json`.
- If local DNS can't resolve, run the fetch from an environment that can (K8s pod/remote host), not from the browser.
- Use server-side axios with embedded-menu headers; include the CF/session cookie from Puppeteer if needed (a fetch sketch follows below).

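Putting the endpoint, variables, and headers together, a hedged sketch of the server-side fetch (the persisted-query hash stays elided as `<hash>`, and `cookieHeader` stands in for a Puppeteer-captured session cookie):

```typescript
import axios from 'axios';

// Sketch of a Mode A fetch; for Mode B, pass Status: null / activeOnly: false instead.
async function fetchActiveProducts(platformDispensaryId: string, cName: string, cookieHeader?: string) {
  const res = await axios.post(
    'https://dutchie.com/api-3/graphql',
    {
      operationName: 'FilteredProducts',
      variables: {
        productsFilter: {
          dispensaryId: platformDispensaryId, // platform_dispensary_id, NOT cNameOrID
          pricingType: 'rec',
          Status: 'Active',
        },
      },
      extensions: { persistedQuery: { version: 1, sha256Hash: '<hash>' } },
    },
    {
      headers: {
        'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
        origin: 'https://dutchie.com',
        referer: `https://dutchie.com/embedded-menu/${cName}`, // cName used only here
        accept: 'application/json',
        'content-type': 'application/json',
        ...(cookieHeader ? { cookie: cookieHeader } : {}),
      },
    },
  );
  return res.data;
}
```
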
12) **Stop over-prep; run the crawl**
- To seed/refresh a store, run a one-off crawl by dispensary ID (example for Deeply Rooted):

```bash
DATABASE_URL="postgresql://dutchie:dutchie_local_pass@localhost:54320/dutchie_menus" \
npx tsx -e "const { crawlDispensaryProducts } = require('./src/dutchie-az/services/product-crawler'); const d={id:112,name:'Deeply Rooted',platform:'dutchie',platformDispensaryId:'6405ef617056e8014d79101b',menuType:'dutchie'}; crawlDispensaryProducts(d,'rec',{useBothModes:true}).then(r=>{console.log(r);process.exit(0);}).catch(e=>{console.error(e);process.exit(1);});"
```

  If local DNS is blocked, run the same command inside the scraper pod via `kubectl exec ... -- bash -lc '...'`.
- After the crawl, verify counts via `dutchie_products`, `dutchie_product_snapshots`, and `dispensaries.last_crawl_at`. Do not inspect the legacy `products` table for Dutchie.

13) **Fetch troubleshooting**
- If 403 or empty data: log status + first GraphQL error; include cf_clearance/session cookie from Puppeteer; ensure headers match a real Chrome request; ensure variables use `productsFilter.dispensaryId`.
- If DNS fails locally, do NOT debug DNS—run the fetch from an environment that resolves (K8s/remote) or via Puppeteer-captured headers/cookies. No browser/CORS attempts.

14) **Views and metrics**
- Keep v_brands/v_categories/v_brand_history based on `dutchie_products` and preserve brand_count metrics. Do not drop brand_count.

15) **Batch DB writes to avoid OOM**
- Do NOT build one giant upsert/insert payload for products/snapshots/missing marks.
- Chunk arrays (e.g., 100–200 items) and upsert/insert in a loop; drop references after each chunk.
- Apply to products, product snapshots, and any "mark missing" logic to keep memory low during crawls. A chunking sketch follows below.

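A minimal sketch of the chunking pattern; `upsertBatch` is a stand-in for whatever batched writer the crawler actually uses:

```typescript
// Generic chunking helper: splits a large array into bounded batches.
function chunk<T>(items: T[], size = 150): T[][] {
  const out: T[][] = [];
  for (let i = 0; i < items.length; i += size) out.push(items.slice(i, i + size));
  return out;
}

// Writes in bounded batches so no single query payload holds the full crawl in memory.
async function upsertInChunks<T>(items: T[], upsertBatch: (batch: T[]) => Promise<void>): Promise<void> {
  for (const batch of chunk(items)) {
    await upsertBatch(batch); // one bounded INSERT ... ON CONFLICT per batch
  }
}
```
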
16) **Use dual-mode crawls by default**
- Always run with `useBothModes:true` to combine:
  - Mode A (active feed with pricing/stock)
  - Mode B (max coverage including OOS/inactive)
- Union/dedupe by product ID so you keep full coverage and pricing in one run.
- If you only run Mode B, prices will be null; dual-mode fills pricing while retaining OOS items.

17) **Capture OOS and missing items**
- GraphQL variables must include inactive/OOS (Status: All / activeOnly:false). Mode B already returns OOS/inactive; union with Mode A to keep pricing.
- After unioning Mode A/B, upsert products and insert snapshots with stock_status from the feed. If an existing product is absent from both Mode A and Mode B for the run, insert a snapshot with is_present_in_feed=false and stock_status='missing_from_feed'.
- Do not filter out OOS/missing in the API; only filter when the user requests it (e.g., stockStatus=in_stock). Expose stock_status/in_stock from the latest snapshot (fallback to product).
- Verify with `/api/az/stores/:id/products?stockStatus=out_of_stock` and `?stockStatus=missing_from_feed`. A union/dedupe sketch follows below.

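A sketch of the union/dedupe step, keyed by external_product_id + dispensary_id as the rules require; the `FeedProduct` shape is illustrative, not the real crawler type:

```typescript
// Illustrative product shape; the real feed type carries many more fields.
interface FeedProduct {
  externalProductId: string;
  dispensaryId: number;
  price?: number | null;
  stockStatus?: string;
}

function unionModes(modeA: FeedProduct[], modeB: FeedProduct[]): FeedProduct[] {
  const byKey = new Map<string, FeedProduct>();
  const keyOf = (p: FeedProduct) => `${p.externalProductId}:${p.dispensaryId}`;
  for (const p of modeB) byKey.set(keyOf(p), p);   // coverage first (includes OOS/inactive)
  for (const p of modeA) {
    const existing = byKey.get(keyOf(p));
    byKey.set(keyOf(p), { ...existing, ...p });    // Mode A pricing/stock wins where present
  }
  return [...byKey.values()];
}
```
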
18) **Menu discovery must crawl the website when menu_url is null**
- For dispensaries with no menu_url or unknown menu_type, crawl the dispensary.website (if present) to find provider links (dutchie, treez, jane, weedmaps, leafly, etc.). Follow "menu/order/shop" links up to a shallow depth with timeouts/rate limits.
- If a provider link is found, set menu_url, set menu_type, and store detection metadata; if dutchie, derive cName from menu_url and resolve platform_dispensary_id; store resolved_at and detection details.
- Do NOT mark a dispensary not_crawlable solely because menu_url is null; only mark not_crawlable if the website crawl fails to find a menu or returns 403/404/invalid. Log the reason in provider_detection_data and crawl_status_reason.
- Keep this as the menu discovery job (separate from product crawls); log successes/errors to job_run_logs. Only schedule product crawls for stores with menu_type='dutchie' AND platform_dispensary_id IS NOT NULL. A provider-link detection sketch follows below.

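A sketch of the provider-link check run against anchor hrefs collected from a dispensary website; the domain patterns are assumptions to verify against real provider URLs:

```typescript
// Maps a matched href to a menu_type value; patterns are illustrative assumptions.
const PROVIDER_PATTERNS: Array<[string, RegExp]> = [
  ['dutchie', /dutchie\.com/i],
  ['treez', /treez\.io/i],
  ['jane', /iheartjane\.com/i],
  ['weedmaps', /weedmaps\.com/i],
  ['leafly', /leafly\.com/i],
];

function detectProvider(hrefs: string[]): { menuType: string; menuUrl: string } | null {
  for (const href of hrefs) {
    for (const [menuType, pattern] of PROVIDER_PATTERNS) {
      if (pattern.test(href)) return { menuType, menuUrl: href };
    }
  }
  return null; // caller records the failure reason instead of marking not_crawlable outright
}
```
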
19) **Preserve all stock statuses (including unknown)**
- Do not filter or drop stock_status values in API/UI; pass through whatever is stored on the latest snapshot/product. Expected values include: in_stock, out_of_stock (if exposed), missing_from_feed, unknown. Only apply filters when explicitly requested by the user.

20) **Never delete or overwrite historical data**
- Do not delete products/snapshots or overwrite historical records. Always append snapshots for changes (price/stock/qty), and mark missing_from_feed instead of removing records. Historical data must remain intact for analytics.

21) **Deployment via CI/CD only**
- Test locally, commit clean changes, and let CI/CD build and deploy to Kubernetes at code.cannabrands.app. Do NOT manually build/push images or tweak prod pods. Deploy backend first, smoke-test APIs, then frontend; roll back via CI/CD if needed.

22) **Per-location cName and platform_dispensary_id resolution**
- For each dispensary, menu_url and cName must be valid for that exact location; no hardcoded defaults and no sharing platform_dispensary_id across locations.
- Derive cName from menu_url per store: `/embedded-menu/<cName>` or `/dispensary/<cName>`.
- Resolve platform_dispensary_id from that cName using GraphQL GetAddressBasedDispensaryData.
- If the slug is invalid/missing, mark the store not crawlable and log it; do not crawl with a mismatched cName/ID. Store the error in `provider_detection_data.resolution_error`.
- Before crawling, validate that the cName from menu_url matches the resolved platform ID; if mismatched, re-resolve before proceeding. A cName derivation sketch follows below.

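A sketch of the per-store cName derivation from `menu_url`, covering both URL shapes named above:

```typescript
// Extracts the cName/slug from a menu_url; returns null when no slug is present,
// in which case the store is marked not crawlable per the rule above.
function deriveCName(menuUrl: string): string | null {
  const match = menuUrl.match(/\/(?:embedded-menu|dispensary)\/([^/?#]+)/);
  return match ? match[1] : null;
}

// deriveCName('https://dutchie.com/embedded-menu/AZ-Deeply-Rooted') === 'AZ-Deeply-Rooted'
```
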
23) **API endpoints (AZ pipeline)**
- Use /api/az/... endpoints: stores, products, brands, categories, summary, dashboard
- Rebuild the frontend with VITE_API_URL pointing to the backend
- Dispensary Detail and analytics must use the AZ endpoints

24) **Monitoring and logging**
- /scraper-monitor (and /az-schedule) should show active/recent jobs from job_run_logs/crawl_jobs
- Auto-refresh every 30 seconds
- The System Logs page should show real log data, not just startup messages

25) **Dashboard Architecture - CRITICAL**
- **Frontend**: If you see old labels like "Active Proxies" or "Active Stores", the old dashboard bundle is being served. Rebuild the frontend with `VITE_API_URL` pointing to the correct backend and redeploy. Clear the browser cache. Confirm the new labels show up.
- **Backend**: `/api/dashboard/stats` MUST use the consolidated DB (same pool as the dutchie-az module). Use the correct tables: `dutchie_products`, `dispensaries`, and views like `v_dashboard_stats`, `v_latest_snapshots`. Do NOT use a separate legacy connection. Do NOT query `az_products` (doesn't exist) or the legacy `stores`/`products` tables.
- **DB Connectivity**: Use the proper DB host/role. Errors like `role "dutchie" does not exist` mean you're exec'ing into the wrong Postgres pod or using the wrong credentials. Confirm the correct `DATABASE_URL` and test: `kubectl exec deployment/scraper -n dispensary-scraper -- psql $DATABASE_URL -c '\dt'`
- **After fixing**: The dashboard should show real data (e.g., 777 products) instead of zeros. Do NOT revert to legacy tables; point dashboard queries at the consolidated DB/views.
- **Checklist**:
  1. Rebuild/redeploy the frontend with the correct API URL, clear cache
  2. Fix `/api/dashboard/*` to use the consolidated DB pool and dutchie views/tables
  3. Test `/api/dashboard/stats` from the scraper pod; then reload the UI

26) **Deployment (Gitea + Kubernetes)**
- **Registry**: Gitea at `code.cannabrands.app/creationshop/dispensary-scraper`
- **Build and push** (from the repo root; prefer CI/CD per rule 21, use this only as a manual fallback):

```bash
# Login to Gitea container registry
docker login code.cannabrands.app

# Build the image
cd backend
docker build -t code.cannabrands.app/creationshop/dispensary-scraper:latest .

# Push to registry
docker push code.cannabrands.app/creationshop/dispensary-scraper:latest
```

- **Deploy to Kubernetes**:

```bash
# Restart deployments to pull new image
kubectl rollout restart deployment/scraper -n dispensary-scraper
kubectl rollout restart deployment/scraper-worker -n dispensary-scraper

# Watch rollout status
kubectl rollout status deployment/scraper -n dispensary-scraper
kubectl rollout status deployment/scraper-worker -n dispensary-scraper
```

- **Check pods**:

```bash
kubectl get pods -n dispensary-scraper
kubectl logs -f deployment/scraper -n dispensary-scraper
kubectl logs -f deployment/scraper-worker -n dispensary-scraper
```

- K8s manifests are in the `/k8s/` folder (scraper.yaml, scraper-worker.yaml, etc.)
- imagePullSecrets use the `regcred` secret for Gitea registry auth

27) **Crawler Architecture**
- **Scraper pod (1 replica)**: Runs the Express API server + scheduler. The scheduler enqueues detection and crawl jobs to the database queue (`crawl_jobs` table).
- **Scraper-worker pods (5 replicas)**: Each worker runs `dist/dutchie-az/services/worker.js`, polling the job queue and processing jobs (a polling-loop sketch follows below).
- **Job types processed by workers**:
  - `menu_detection` / `menu_detection_single`: Detect menu provider type and resolve platform_dispensary_id from menu_url
  - `dutchie_product_crawl`: Crawl products from the Dutchie GraphQL API for dispensaries with valid platform IDs
- **Job schedules** (managed in the `job_schedules` table):
  - `dutchie_az_menu_detection`: Runs daily with 60-min jitter; detects menu type for dispensaries with unknown menu_type
  - `dutchie_az_product_crawl`: Runs every 4 hours with 30-min jitter; crawls products from all detected Dutchie dispensaries
- **Trigger schedules manually**: `curl -X POST /api/az/admin/schedules/{id}/trigger`
- **Check schedule status**: `curl /api/az/admin/schedules`
- **Worker logs**: `kubectl logs -f deployment/scraper-worker -n dispensary-scraper`

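As a reference, a sketch of the polling pattern a worker follows; `claimJob`, `runJob`, and `markFailed` are illustrative stand-ins for the real queue functions, not the actual worker API:

```typescript
// Illustrative stand-ins; the real worker implements these against the job queue.
type Job = { id: number; job_type: string };
declare function claimJob(): Promise<Job | null>;
declare function runJob(job: Job): Promise<void>;
declare function markFailed(job: Job, err: unknown): Promise<void>;

// Poll the queue: process a job when one is claimed, otherwise wait and retry.
async function workerLoop(pollMs = 5000): Promise<never> {
  for (;;) {
    const job = await claimJob(); // e.g., an UPDATE ... RETURNING claim with row locking
    if (job) {
      try {
        await runJob(job); // menu_detection | menu_detection_single | dutchie_product_crawl
      } catch (err) {
        await markFailed(job, err); // record the error; never delete the job row
      }
    } else {
      await new Promise((resolve) => setTimeout(resolve, pollMs)); // idle backoff
    }
  }
}
```
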
28) **Crawler Maintenance Procedure (Check Jobs, Requeue, Restart)**

When crawlers are stuck or jobs aren't processing, follow this procedure:

**Step 1: Check Job Status**
```bash
# Port-forward to production
kubectl port-forward -n dispensary-scraper deployment/scraper 3099:3010 &

# Check active/stuck jobs
curl -s http://localhost:3099/api/az/monitor/active-jobs | jq .

# Check recent job history
curl -s "http://localhost:3099/api/az/monitor/jobs?limit=20" | jq '.jobs[] | {id, job_type, status, dispensary_id, started_at, products_found, duration_min: (.duration_ms/60000 | floor)}'

# Check schedule status
curl -s http://localhost:3099/api/az/admin/schedules | jq '.schedules[] | {id, jobName, enabled, lastRunAt, lastStatus, nextRunAt}'
```

**Step 2: Reset Stuck Jobs**

Jobs are considered stuck if they have `status='running'` but no heartbeat in >30 minutes:
```bash
# Via API (if endpoint exists)
curl -s -X POST http://localhost:3099/api/az/admin/reset-stuck-jobs

# Via direct DB (if API not available)
kubectl exec -n dispensary-scraper deployment/scraper -- psql $DATABASE_URL -c "
UPDATE dispensary_crawl_jobs
SET status = 'failed',
    error_message = 'Job timed out - worker stopped sending heartbeats',
    completed_at = NOW()
WHERE status = 'running'
  AND (last_heartbeat_at < NOW() - INTERVAL '30 minutes' OR last_heartbeat_at IS NULL);
"
```

**Step 3: Requeue Jobs (Trigger Fresh Crawl)**
```bash
# Trigger product crawl schedule (typically ID 1)
curl -s -X POST http://localhost:3099/api/az/admin/schedules/1/trigger

# Trigger menu detection schedule (typically ID 2)
curl -s -X POST http://localhost:3099/api/az/admin/schedules/2/trigger

# Or crawl a specific dispensary
curl -s -X POST http://localhost:3099/api/az/admin/crawl/112
```

**Step 4: Restart Crawler Workers**
```bash
# Restart scraper-worker pods (clears any stuck processes)
kubectl rollout restart deployment/scraper-worker -n dispensary-scraper

# Watch rollout progress
kubectl rollout status deployment/scraper-worker -n dispensary-scraper

# Optionally restart main scraper pod too
kubectl rollout restart deployment/scraper -n dispensary-scraper
```

**Step 5: Monitor Recovery**
```bash
# Watch worker logs
kubectl logs -f deployment/scraper-worker -n dispensary-scraper --tail=50

# Check dashboard for product counts
curl -s http://localhost:3099/api/az/dashboard | jq '{totalStores, totalProducts, storesByType}'

# Verify jobs are processing
curl -s http://localhost:3099/api/az/monitor/active-jobs | jq .
```

**Quick One-Liner for Full Reset:**
```bash
# Reset stuck jobs and restart workers
kubectl exec -n dispensary-scraper deployment/scraper -- psql $DATABASE_URL -c "UPDATE dispensary_crawl_jobs SET status='failed', completed_at=NOW() WHERE status='running' AND (last_heartbeat_at < NOW() - INTERVAL '30 minutes' OR last_heartbeat_at IS NULL);" && kubectl rollout restart deployment/scraper-worker -n dispensary-scraper && kubectl rollout status deployment/scraper-worker -n dispensary-scraper
```

**Cleanup port-forwards when done:**
```bash
pkill -f "port-forward.*dispensary-scraper"
```

29) **Frontend Architecture - AVOID OVER-ENGINEERING**

**Key Principles:**
- **ONE BACKEND** serves ALL domains (cannaiq.co, findadispo.com, findagram.co)
- Do NOT create separate backend services for each domain
- The existing `dispensary-scraper` backend handles everything

**Frontend Build Differences:**
- `frontend/` uses **Vite** (outputs to `dist/`, uses `VITE_` env vars) → dispos.crawlsy.com (legacy)
- `cannaiq/` uses **Vite** (outputs to `dist/`, uses `VITE_` env vars) → cannaiq.co (NEW)
- `findadispo/` uses **Create React App** (outputs to `build/`, uses `REACT_APP_` env vars) → findadispo.com
- `findagram/` uses **Create React App** (outputs to `build/`, uses `REACT_APP_` env vars) → findagram.co

**CRA vs Vite Dockerfile Differences:**
```dockerfile
# Vite (frontend, cannaiq)
ENV VITE_API_URL=https://api.domain.com
RUN npm run build
COPY --from=builder /app/dist /usr/share/nginx/html

# CRA (findadispo, findagram)
ENV REACT_APP_API_URL=https://api.domain.com
RUN npm run build
COPY --from=builder /app/build /usr/share/nginx/html
```

**lucide-react Icon Gotchas:**
- Not all icons exist in older versions (e.g., `Cannabis` doesn't exist)
- Use `Leaf` as a substitute for cannabis-related icons
- When doing search/replace for icon names, be careful not to replace text content
- Example: "Cannabis-infused food" should NOT become "Leaf-infused food"

**Deployment Options:**
1. **Separate containers** (current): Each frontend in its own nginx container
2. **Single container** (better): One nginx with multi-domain config serving all frontends

**Single Container Multi-Domain Approach:**
```dockerfile
# Build all frontends
FROM node:20-slim AS builder-cannaiq
WORKDIR /app/cannaiq
COPY cannaiq/package*.json ./
RUN npm install
COPY cannaiq/ ./
RUN npm run build

FROM node:20-slim AS builder-findadispo
WORKDIR /app/findadispo
COPY findadispo/package*.json ./
RUN npm install
COPY findadispo/ ./
RUN npm run build

FROM node:20-slim AS builder-findagram
WORKDIR /app/findagram
COPY findagram/package*.json ./
RUN npm install
COPY findagram/ ./
RUN npm run build

# Production nginx with multi-domain routing
# Note: Vite outputs dist/, CRA outputs build/
FROM nginx:alpine
COPY --from=builder-cannaiq /app/cannaiq/dist /var/www/cannaiq
COPY --from=builder-findadispo /app/findadispo/build /var/www/findadispo
COPY --from=builder-findagram /app/findagram/build /var/www/findagram
COPY nginx-multi-domain.conf /etc/nginx/conf.d/default.conf
```

**nginx-multi-domain.conf:**
```nginx
server {
    listen 80;
    server_name cannaiq.co www.cannaiq.co;
    root /var/www/cannaiq;
    location / { try_files $uri $uri/ /index.html; }
}

server {
    listen 80;
    server_name findadispo.com www.findadispo.com;
    root /var/www/findadispo;
    location / { try_files $uri $uri/ /index.html; }
}

server {
    listen 80;
    server_name findagram.co www.findagram.co;
    root /var/www/findagram;
    location / { try_files $uri $uri/ /index.html; }
}
```

**Common Mistakes to AVOID:**
- Creating a FastAPI/Express backend just for findagram or findadispo
- Creating separate Docker images per domain when one would work
- Replacing icon names with sed without checking for text content collisions
- Using `npm ci` in Dockerfiles when package-lock.json doesn't exist (use `npm install`)