# Dispensary Scraper - Getting Started Guide

## Architecture Overview

```
┌──────────────────────────────────────────────────────────────┐
│                   K8s Cluster (Production)                   │
│                                                              │
│  ┌────────────────┐  ┌────────────────┐  ┌────────────────┐  │
│  │    frontend    │  │    scraper     │  │    postgres    │  │
│  │ (React Admin)  │  │  (Node + API)  │  │   (Database)   │  │
│  │    port: 80    │  │   port: 3010   │  │   port: 5432   │  │
│  └────────────────┘  └────────────────┘  └────────────────┘  │
│          │                   │                   │           │
│          └───────────────────┴───────────────────┘           │
│                                                              │
│  Ingress: dispos.crawlsy.com                                 │
│    /    → frontend                                           │
│    /api → scraper                                            │
└──────────────────────────────────────────────────────────────┘
```

### Components

| Component | Description | Image |
|-----------|-------------|-------|
| **frontend** | React admin panel for monitoring crawls | `code.cannabrands.app/creationshop/dispensary-scraper-frontend:latest` |
| **scraper** | Node.js backend with API + Puppeteer crawlers | `code.cannabrands.app/creationshop/dispensary-scraper:latest` |
| **postgres** | PostgreSQL database | `postgres:15` |

### Key Directories

```
dispensary-scraper/
├── backend/              # Node.js backend
│   ├── src/
│   │   ├── routes/       # API endpoints
│   │   ├── services/     # Business logic
│   │   ├── scraper-v2/   # Scraper engine (Scrapy-inspired)
│   │   └── scripts/      # CLI scripts
│   └── migrations/       # SQL migrations
├── frontend/             # React admin panel
├── k8s/                  # Kubernetes manifests
├── wordpress-plugin/     # WP plugin for dispensary sites
└── docs/                 # Architecture documentation
```

---

## Local Development

### Prerequisites

- Node.js 20+
- Docker & Docker Compose
- kubectl configured for the k8s cluster
- SSH tunnel to production database (port 54320)

### 1. Start Local Services

```bash
cd /home/kelly/git/dispensary-scraper

# Start local postgres and minio
docker start dutchie-postgres dutchie-minio

# Or use docker compose for a fresh setup
docker compose up -d
```

### 2. Environment Variables

```bash
# Production database (via SSH tunnel)
export DATABASE_URL="postgresql://dutchie:dutchie_local_pass@localhost:54320/dutchie_menus"
```

### 3. Run Backend

```bash
cd backend
npm install
npm run dev
# API available at http://localhost:3010
```

### 4. Run Frontend

```bash
cd frontend
npm install
npm run dev
# UI available at http://localhost:5173
```

### 5. Run Migrations

```bash
cd backend
DATABASE_URL="postgresql://dutchie:dutchie_local_pass@localhost:54320/dutchie_menus" npx tsx src/db/migrate.ts
```

---

## Database Architecture

### Two Databases

| Database | Port | User | Purpose |
|----------|------|------|---------|
| `dutchie_menus` | 54320 | dutchie | **Production** - SSH tunnel to remote k8s postgres |
| `dispensary_scraper` | 5432 | scraper | **Remote only** - Accessed via k8s pods |

**Local development uses port 54320** (SSH tunnel to production).

### Key Tables

| Table | Purpose |
|-------|---------|
| `dispensaries` | Master list of dispensaries (from AZDHS) with crawl metadata |
| `stores` | Legacy table - being migrated to `dispensaries` |
| `products` | Scraped product data |
| `proxies` | Proxy pool for bot-detection avoidance |
| `crawl_jobs` | Crawl job history and status |

### Dispensary Crawl Fields

The `dispensaries` table now contains crawl metadata (see the query sketch after this list):

- `provider_type` - Menu provider (dutchie, jane, etc.)
- `menu_provider` - Detected provider for crawling
- `crawler_mode` - 'production' or 'sandbox'
- `crawl_status` - Current status (idle, running, ok, error)
- `menu_url` - URL to crawl
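Because these fields live on the main table, crawl coverage can be checked with plain SQL. A minimal sketch against `dutchie_menus`, using only the columns listed above (group or filter however suits the question):

```sql
-- Crawl coverage per provider: how many dispensaries sit in each status.
SELECT provider_type,
       crawl_status,
       COUNT(*) AS dispensary_count
FROM dispensaries
GROUP BY provider_type, crawl_status
ORDER BY provider_type, crawl_status;
```

Run it through the SSH tunnel (`psql -h localhost -p 54320 -U dutchie dutchie_menus`) or from a pod.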
---

## Kubernetes Deployment

### Namespace

```bash
kubectl get all -n dispensary-scraper
```

### Image Registry

Images are stored at `code.cannabrands.app/creationshop/`:

- `dispensary-scraper:latest` - Backend
- `dispensary-scraper-frontend:latest` - Frontend

### Deploy New Code

The CI/CD pipeline builds and deploys automatically on push to master. To deploy manually:

```bash
# Restart deployment to pull latest image
kubectl rollout restart deployment/scraper -n dispensary-scraper

# Check status
kubectl rollout status deployment/scraper -n dispensary-scraper

# View logs
kubectl logs -f deployment/scraper -n dispensary-scraper
```

### Run Migrations on Remote

```bash
# Copy migration file to pod
kubectl cp backend/migrations/029_link_dutchie_stores_to_dispensaries.sql \
  dispensary-scraper/scraper-POD-NAME:/tmp/migration.sql

# Execute via psql in the pod (this must resolve to the same pod the file was copied to)
kubectl exec -n dispensary-scraper deployment/scraper -- \
  sh -c 'PGPASSWORD=$POSTGRES_PASSWORD psql -h $POSTGRES_HOST -U $POSTGRES_USER -d $POSTGRES_DB -f /tmp/migration.sql'
```

---

## Running Crawlers

### Remote (Production)

Crawlers should run on the k8s cluster, not locally.

```bash
# Run crawl script inside pod
kubectl exec -n dispensary-scraper deployment/scraper -- \
  node dist/scripts/queue-dispensaries.js --production

# Or for a specific dispensary
kubectl exec -n dispensary-scraper deployment/scraper -- \
  node -e "require('./dist/services/crawler-jobs').runDutchieMenuCrawlJob(112)"
```

### Trigger via API

```bash
# Queue all Dutchie dispensaries
curl -X POST "https://dispos.crawlsy.com/api/crawl/queue-all?type=dutchie"
```

### Local Testing

For code development/testing only:

```bash
cd backend
DATABASE_URL="postgresql://dutchie:dutchie_local_pass@localhost:54320/dutchie_menus" \
  npx tsx src/scripts/queue-dispensaries.ts --production --limit=1
```

---

## Proxy System

### How It Works

1. Proxies are stored in the `proxies` table with an `active` flag
2. `getActiveProxy()` selects a random active proxy
3. On bot detection, the proxy gets a 35-second in-memory cooldown
4. Permanent failures mark the proxy `active=false` in the DB (see the sketch after this list)
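A minimal TypeScript sketch of that lifecycle, assuming the `proxies` table has an `id` primary key alongside the columns shown in this guide; `COOLDOWN_MS`, the cooldown map, and the helper names (`reportBotDetection`, `reportPermanentFailure`) are illustrative, not the backend's actual API:

```typescript
import { Pool } from "pg";

const pool = new Pool({ connectionString: process.env.DATABASE_URL });

// In-memory cooldowns: proxy id -> timestamp when it becomes usable again.
const cooldownUntil = new Map<number, number>();
const COOLDOWN_MS = 35_000; // 35-second cooldown after bot detection

// Pick a random active proxy that is not cooling down.
async function getActiveProxy(): Promise<{ id: number; ip: string; port: number } | null> {
  const { rows } = await pool.query("SELECT id, ip, port FROM proxies WHERE active = true");
  const now = Date.now();
  const usable = rows.filter((p) => (cooldownUntil.get(p.id) ?? 0) <= now);
  if (usable.length === 0) return null;
  return usable[Math.floor(Math.random() * usable.length)];
}

// Bot detection is transient: cooldown lives in process memory only.
function reportBotDetection(proxyId: number): void {
  cooldownUntil.set(proxyId, Date.now() + COOLDOWN_MS);
}

// Permanent failure is persisted to the database.
async function reportPermanentFailure(proxyId: number): Promise<void> {
  await pool.query("UPDATE proxies SET active = false WHERE id = $1", [proxyId]);
}
```

The point of the split: bot detections expire on their own from memory, so a briefly-burned proxy rejoins the pool automatically; only hard failures touch the database.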
### Adding New Proxies

```sql
INSERT INTO proxies (ip, port, protocol, active)
VALUES ('123.45.67.89', 3128, 'http', true);
```

---

## Common Tasks

### Check Crawl Status

```bash
# Via kubectl
kubectl exec -n dispensary-scraper deployment/scraper -- \
  node -e "new (require('pg').Pool)({connectionString: process.env.DATABASE_URL}).query('SELECT id, name, crawl_status FROM dispensaries WHERE provider_type = \\'dutchie\\' LIMIT 10').then(r => console.log(r.rows))"

# Via admin UI
# https://dispos.crawlsy.com/schedule
```

### View Recent Products

```bash
kubectl exec -n dispensary-scraper deployment/scraper -- \
  node -e "new (require('pg').Pool)({connectionString: process.env.DATABASE_URL}).query('SELECT COUNT(*), MAX(updated_at) FROM products').then(r => console.log(r.rows))"
```

### Restart Stuck Crawl

```sql
-- Resets every dispensary stuck in 'running' back to 'idle'
UPDATE dispensaries SET crawl_status = 'idle' WHERE crawl_status = 'running';
```

---

## Troubleshooting

### Pod Not Starting

```bash
kubectl describe pod -n dispensary-scraper -l app=scraper
kubectl logs -n dispensary-scraper -l app=scraper --previous
```

### Database Connection Issues

```bash
# Test from pod
kubectl exec -n dispensary-scraper deployment/scraper -- \
  node -e "new (require('pg').Pool)({connectionString: process.env.DATABASE_URL}).query('SELECT 1').then(() => console.log('OK'))"
```

### Proxy Issues

```bash
# Check active proxy count
kubectl exec -n dispensary-scraper deployment/scraper -- \
  node -e "new (require('pg').Pool)({connectionString: process.env.DATABASE_URL}).query('SELECT COUNT(*) FROM proxies WHERE active = true').then(r => console.log('Active proxies:', r.rows[0].count))"
```

---

## Related Documentation

- `docs/CRAWL_OPERATIONS.md` - Crawl scheduling and data philosophy
- `docs/PRODUCT_BRAND_INTELLIGENCE_ARCHITECTURE.md` - Data processing
- `DATABASE-GUIDE.md` - Database usage details