# Dispensary Scraper - Getting Started Guide
## Architecture Overview

```
┌──────────────────────────────────────────────────────────────────┐
│                     K8s Cluster (Production)                     │
│                                                                  │
│  ┌────────────────┐  ┌────────────────┐  ┌────────────────┐      │
│  │    frontend    │  │    scraper     │  │    postgres    │      │
│  │ (React Admin)  │  │  (Node + API)  │  │   (Database)   │      │
│  │    port: 80    │  │   port: 3010   │  │   port: 5432   │      │
│  └────────────────┘  └────────────────┘  └────────────────┘      │
│                                                                  │
│  Ingress: dispos.crawlsy.com                                     │
│    /    → frontend                                               │
│    /api → scraper                                                │
└──────────────────────────────────────────────────────────────────┘
```
### Components

| Component | Description | Image |
|---|---|---|
| frontend | React admin panel for monitoring crawls | `code.cannabrands.app/creationshop/dispensary-scraper-frontend:latest` |
| scraper | Node.js backend with API + Puppeteer crawlers | `code.cannabrands.app/creationshop/dispensary-scraper:latest` |
| postgres | PostgreSQL database | `postgres:15` |
### Key Directories

```
dispensary-scraper/
├── backend/              # Node.js backend
│   ├── src/
│   │   ├── routes/       # API endpoints
│   │   ├── services/     # Business logic
│   │   ├── scraper-v2/   # Scraper engine (Scrapy-inspired)
│   │   └── scripts/      # CLI scripts
│   └── migrations/       # SQL migrations
├── frontend/             # React admin panel
├── k8s/                  # Kubernetes manifests
├── wordpress-plugin/     # WP plugin for dispensary sites
└── docs/                 # Architecture documentation
```
## Local Development

### Prerequisites

- Node.js 20+
- Docker & Docker Compose
- `kubectl` configured for the k8s cluster
- SSH tunnel to the production database (port 54320)
### 1. Start Local Services

```bash
cd /home/kelly/git/dispensary-scraper

# Start local postgres and minio
docker start dutchie-postgres dutchie-minio

# Or use docker-compose for fresh setup
docker compose up -d
```
### 2. Environment Variables

```bash
# Production database (via SSH tunnel)
export DATABASE_URL="postgresql://dutchie:dutchie_local_pass@localhost:54320/dutchie_menus"
```
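When switching between the tunneled production port (54320) and a plain local postgres (5432), it is easy to point at the wrong database. A small sanity check with Node's built-in WHATWG `URL` parser, using the exact connection string from this guide, confirms which host, port, and database a given `DATABASE_URL` targets:

```typescript
// Parse a postgres connection string with Node's built-in URL class.
// The credentials below are the local-tunnel values from this guide.
const DATABASE_URL =
  "postgresql://dutchie:dutchie_local_pass@localhost:54320/dutchie_menus";

const parsed = new URL(DATABASE_URL);

console.log(parsed.username);          // dutchie
console.log(parsed.port);              // 54320 -> the SSH tunnel, not local 5432
console.log(parsed.pathname.slice(1)); // dutchie_menus
```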
### 3. Run Backend

```bash
cd backend
npm install
npm run dev
# API available at http://localhost:3010
```
### 4. Run Frontend

```bash
cd frontend
npm install
npm run dev
# UI available at http://localhost:5173
```
### 5. Run Migrations

```bash
cd backend
DATABASE_URL="postgresql://dutchie:dutchie_local_pass@localhost:54320/dutchie_menus" npx tsx src/db/migrate.ts
```
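Migration files in `backend/migrations/` use numeric prefixes (e.g. `029_link_dutchie_stores_to_dispensaries.sql`), so a runner like `src/db/migrate.ts` typically applies them in prefix order. The helper below is an illustrative sketch of that ordering step only, not the project's actual runner; the filenames other than the `029` one are made up for the example:

```typescript
// Hypothetical helper: order migration filenames by numeric prefix,
// the way a runner such as src/db/migrate.ts typically would.
function orderMigrations(files: string[]): string[] {
  return files
    .filter((f) => /^\d+_.*\.sql$/.test(f)) // keep only numbered .sql files
    .sort((a, b) => parseInt(a, 10) - parseInt(b, 10)); // parseInt reads the prefix
}

const ordered = orderMigrations([
  "029_link_dutchie_stores_to_dispensaries.sql",
  "002_add_proxies.sql", // hypothetical filename
  "README.md",           // ignored: not a numbered migration
  "010_create_crawl_jobs.sql", // hypothetical filename
]);
// ordered: 002_..., 010_..., 029_...
```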
## Database Architecture

### Two Databases

| Database | Port | User | Purpose |
|---|---|---|---|
| `dutchie_menus` | 54320 | `dutchie` | Production - SSH tunnel to remote k8s postgres |
| `dispensary_scraper` | 5432 | `scraper` | Remote only - accessed via k8s pods |

Local development uses port 54320 (SSH tunnel to production).
### Key Tables

| Table | Purpose |
|---|---|
| `dispensaries` | Master list of dispensaries (from AZDHS) with crawl metadata |
| `stores` | Legacy table - being migrated to `dispensaries` |
| `products` | Scraped product data |
| `proxies` | Proxy pool for bot-detection avoidance |
| `crawl_jobs` | Crawl job history and status |
### Dispensary Crawl Fields

The `dispensaries` table now contains crawl metadata:

- `provider_type` - Menu provider (dutchie, jane, etc.)
- `menu_provider` - Detected provider for crawling
- `crawler_mode` - `'production'` or `'sandbox'`
- `crawl_status` - Current status (idle, running, ok, error)
- `menu_url` - URL to crawl
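The crawl-metadata columns above can be modeled as a TypeScript shape. This is purely illustrative: the field names come from this guide, but the real row type lives in the backend and its exact types (nullability in particular) may differ:

```typescript
// Illustrative shape for the crawl-metadata columns on `dispensaries`.
// Field names come from this guide; actual backend types may differ.
type CrawlStatus = "idle" | "running" | "ok" | "error";

interface DispensaryCrawlMeta {
  provider_type: string | null; // e.g. "dutchie", "jane"
  menu_provider: string | null; // provider detected for crawling
  crawler_mode: "production" | "sandbox";
  crawl_status: CrawlStatus;
  menu_url: string | null;
}

const example: DispensaryCrawlMeta = {
  provider_type: "dutchie",
  menu_provider: "dutchie",
  crawler_mode: "production",
  crawl_status: "idle",
  menu_url: "https://example.com/menu", // placeholder URL
};
```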
## Kubernetes Deployment

### Namespace

```bash
kubectl get all -n dispensary-scraper
```
### Image Registry

Images are stored at `code.cannabrands.app/creationshop/`:

- `dispensary-scraper:latest` - Backend
- `dispensary-scraper-frontend:latest` - Frontend
### Deploy New Code

The CI/CD pipeline builds and deploys automatically on push to master. To manually deploy:

```bash
# Restart deployment to pull latest image
kubectl rollout restart deployment/scraper -n dispensary-scraper

# Check status
kubectl rollout status deployment/scraper -n dispensary-scraper

# View logs
kubectl logs -f deployment/scraper -n dispensary-scraper
```
### Run Migrations on Remote

```bash
# Copy migration file to pod
kubectl cp backend/migrations/029_link_dutchie_stores_to_dispensaries.sql \
  dispensary-scraper/scraper-POD-NAME:/tmp/migration.sql

# Execute via psql in pod
kubectl exec -n dispensary-scraper deployment/scraper -- \
  sh -c 'PGPASSWORD=$POSTGRES_PASSWORD psql -h $POSTGRES_HOST -U $POSTGRES_USER -d $POSTGRES_DB -f /tmp/migration.sql'
```
## Running Crawlers

### Remote (Production)

Crawlers should run on the k8s cluster, not locally.

```bash
# Run crawl script inside pod
kubectl exec -n dispensary-scraper deployment/scraper -- \
  node dist/scripts/queue-dispensaries.js --production

# Or for a specific dispensary
kubectl exec -n dispensary-scraper deployment/scraper -- \
  node -e "require('./dist/services/crawler-jobs').runDutchieMenuCrawlJob(112)"
```
### Trigger via API

```bash
# Queue all Dutchie dispensaries (quote the URL so the shell
# doesn't interpret the ? in the query string)
curl -X POST "https://dispos.crawlsy.com/api/crawl/queue-all?type=dutchie"
```
### Local Testing

For code development/testing only:

```bash
cd backend
DATABASE_URL="postgresql://dutchie:dutchie_local_pass@localhost:54320/dutchie_menus" \
  npx tsx src/scripts/queue-dispensaries.ts --production --limit=1
```
## Proxy System

### How It Works

- Proxies are stored in the `proxies` table with `active=true/false`
- `getActiveProxy()` selects a random active proxy
- On bot detection, the proxy gets a 35-second in-memory cooldown
- Permanent failures mark `active=false` in the DB
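The rotation behavior above can be sketched as a small in-memory model. This is a minimal illustration of the described policy (random pick among active proxies, 35-second in-memory cooldown on bot detection, permanent deactivation on hard failure); the function names, the `Proxy` shape, and the map keying are assumptions, not the actual backend API:

```typescript
// Sketch of the proxy rotation described above. Names are illustrative.
interface Proxy {
  ip: string;
  port: number;
  protocol: string;
  active: boolean;
}

const COOLDOWN_MS = 35_000; // 35-second cooldown per the docs
const cooldownUntil = new Map<string, number>(); // in-memory only, keyed "ip:port"

const key = (p: Proxy) => `${p.ip}:${p.port}`;

// Pick a random proxy that is active and not cooling down.
function getActiveProxy(pool: Proxy[], now = Date.now()): Proxy | null {
  const usable = pool.filter(
    (p) => p.active && (cooldownUntil.get(key(p)) ?? 0) <= now
  );
  if (usable.length === 0) return null;
  return usable[Math.floor(Math.random() * usable.length)];
}

// Bot detection: temporary in-memory cooldown, not persisted.
function onBotDetected(p: Proxy, now = Date.now()): void {
  cooldownUntil.set(key(p), now + COOLDOWN_MS);
}

// Permanent failure: in the real system this flips active=false in the DB.
function onPermanentFailure(p: Proxy): void {
  p.active = false;
}
```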
### Adding New Proxies

```sql
INSERT INTO proxies (ip, port, protocol, active)
VALUES ('123.45.67.89', 3128, 'http', true);
```
## Common Tasks

### Check Crawl Status

```bash
# Via kubectl (note: pg's Pool is a class, so it needs `new`;
# process.exit ends the one-liner instead of leaving the pool open)
kubectl exec -n dispensary-scraper deployment/scraper -- \
  node -e "new (require('pg').Pool)({connectionString: process.env.DATABASE_URL}).query('SELECT id, name, crawl_status FROM dispensaries WHERE provider_type = \\'dutchie\\' LIMIT 10').then(r => { console.log(r.rows); process.exit(0); })"

# Via admin UI
# https://dispos.crawlsy.com/schedule
```
### View Recent Products

```bash
kubectl exec -n dispensary-scraper deployment/scraper -- \
  node -e "new (require('pg').Pool)({connectionString: process.env.DATABASE_URL}).query('SELECT COUNT(*), MAX(updated_at) FROM products').then(r => { console.log(r.rows); process.exit(0); })"
```
### Restart Stuck Crawl

```sql
UPDATE dispensaries SET crawl_status = 'idle' WHERE crawl_status = 'running';
```
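For clarity, the same reset can be expressed as application logic. This is an illustrative in-memory mirror of the SQL above, not code that exists in the backend:

```typescript
// Illustrative mirror of the SQL reset: flip any dispensary stuck
// in "running" back to "idle", returning how many were reset.
interface DispensaryStatusRow {
  id: number;
  crawl_status: "idle" | "running" | "ok" | "error";
}

function resetStuckCrawls(rows: DispensaryStatusRow[]): number {
  let reset = 0;
  for (const row of rows) {
    if (row.crawl_status === "running") {
      row.crawl_status = "idle";
      reset++;
    }
  }
  return reset;
}
```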
## Troubleshooting

### Pod Not Starting

```bash
kubectl describe pod -n dispensary-scraper -l app=scraper
kubectl logs -n dispensary-scraper -l app=scraper --previous
```
### Database Connection Issues

```bash
# Test from pod
kubectl exec -n dispensary-scraper deployment/scraper -- \
  node -e "new (require('pg').Pool)({connectionString: process.env.DATABASE_URL}).query('SELECT 1').then(() => { console.log('OK'); process.exit(0); })"
```
### Proxy Issues

```bash
# Check active proxy count
kubectl exec -n dispensary-scraper deployment/scraper -- \
  node -e "new (require('pg').Pool)({connectionString: process.env.DATABASE_URL}).query('SELECT COUNT(*) FROM proxies WHERE active = true').then(r => { console.log('Active proxies:', r.rows[0].count); process.exit(0); })"
```
## Related Documentation

- `docs/CRAWL_OPERATIONS.md` - Crawl scheduling and data philosophy
- `docs/PRODUCT_BRAND_INTELLIGENCE_ARCHITECTURE.md` - Data processing
- `DATABASE-GUIDE.md` - Database usage details