From e5b88b093cc878ca2be76a86bebe8f686e85ed46 Mon Sep 17 00:00:00 2001 From: Kelly Date: Mon, 1 Dec 2025 08:41:29 -0700 Subject: [PATCH] Add Getting Started guide for development and deployment MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Includes: - Architecture overview with k8s diagram - Local development setup - Database information (local vs remote) - K8s deployment instructions - Running crawlers remotely - Proxy system documentation - Common tasks and troubleshooting 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude --- GETTING-STARTED.md | 291 +++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 291 insertions(+) create mode 100644 GETTING-STARTED.md diff --git a/GETTING-STARTED.md b/GETTING-STARTED.md new file mode 100644 index 00000000..73256d2a --- /dev/null +++ b/GETTING-STARTED.md @@ -0,0 +1,291 @@ +# Dispensary Scraper - Getting Started Guide + +## Architecture Overview + +``` +┌─────────────────────────────────────────────────────────────────────────┐ +│ K8s Cluster (Production) │ +│ ┌────────────────┐ ┌────────────────┐ ┌────────────────┐ │ +│ │ frontend │ │ scraper │ │ postgres │ │ +│ │ (React Admin) │ │ (Node + API) │ │ (Database) │ │ +│ │ port: 80 │ │ port: 3010 │ │ port: 5432 │ │ +│ └────────────────┘ └────────────────┘ └────────────────┘ │ +│ │ │ │ +│ └─────────────────────┘ │ +│ │ +│ Ingress: dispos.crawlsy.com │ +│ / → frontend │ +│ /api → scraper │ +└─────────────────────────────────────────────────────────────────────────┘ +``` + +### Components + +| Component | Description | Image | +|-----------|-------------|-------| +| **frontend** | React admin panel for monitoring crawls | `code.cannabrands.app/creationshop/dispensary-scraper-frontend:latest` | +| **scraper** | Node.js backend with API + Puppeteer crawlers | `code.cannabrands.app/creationshop/dispensary-scraper:latest` | +| **postgres** | PostgreSQL database | `postgres:15` | + +### Key 
Directories + +``` +dispensary-scraper/ +├── backend/ # Node.js backend +│ ├── src/ +│ │ ├── routes/ # API endpoints +│ │ ├── services/ # Business logic +│ │ ├── scraper-v2/ # Scraper engine (Scrapy-inspired) +│ │ └── scripts/ # CLI scripts +│ └── migrations/ # SQL migrations +├── frontend/ # React admin panel +├── k8s/ # Kubernetes manifests +├── wordpress-plugin/ # WP plugin for dispensary sites +└── docs/ # Architecture documentation +``` + +--- + +## Local Development + +### Prerequisites + +- Node.js 20+ +- Docker & Docker Compose +- kubectl configured for the k8s cluster +- SSH tunnel to production database (port 54320) + +### 1. Start Local Services + +```bash +cd /home/kelly/git/dispensary-scraper + +# Start local postgres and minio +docker start dutchie-postgres dutchie-minio + +# Or use docker-compose for fresh setup +docker compose up -d +``` + +### 2. Environment Variables + +```bash +# Production database (via SSH tunnel) +export DATABASE_URL="postgresql://dutchie:dutchie_local_pass@localhost:54320/dutchie_menus" +``` + +### 3. Run Backend + +```bash +cd backend +npm install +npm run dev +# API available at http://localhost:3010 +``` + +### 4. Run Frontend + +```bash +cd frontend +npm install +npm run dev +# UI available at http://localhost:5173 +``` + +### 5. Run Migrations + +```bash +cd backend +DATABASE_URL="postgresql://dutchie:dutchie_local_pass@localhost:54320/dutchie_menus" npx tsx src/db/migrate.ts +``` + +--- + +## Database Architecture + +### Two Databases + +| Database | Port | User | Purpose | +|----------|------|------|---------| +| `dutchie_menus` | 54320 | dutchie | **Production** - SSH tunnel to remote k8s postgres | +| `dispensary_scraper` | 5432 | scraper | **Remote only** - Accessed via k8s pods | + +**Local development uses port 54320** (SSH tunnel to production). 
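Because local development points at the production database through the tunnel, it can be worth sanity-checking `DATABASE_URL` before running anything destructive. A minimal sketch in Node (the port and database names come from the table above; the `describeDbTarget` helper is illustrative and not part of the codebase):

```javascript
// Illustrative helper: report which database a connection string targets.
// Port 54320 + dutchie_menus is the SSH tunnel to production (see table above).
function describeDbTarget(databaseUrl) {
  const url = new URL(databaseUrl);
  const port = url.port || '5432';
  const db = url.pathname.replace(/^\//, '');
  const isProductionTunnel = port === '54320' && db === 'dutchie_menus';
  return { port, db, isProductionTunnel };
}

const exampleUrl =
  'postgresql://dutchie:dutchie_local_pass@localhost:54320/dutchie_menus';
console.log(describeDbTarget(exampleUrl));
// → { port: '54320', db: 'dutchie_menus', isProductionTunnel: true }
```

A guard like this can be dropped into scripts that should refuse to run against production.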

### Key Tables

| Table | Purpose |
|-------|---------|
| `dispensaries` | Master list of dispensaries (from AZDHS) with crawl metadata |
| `stores` | Legacy table - being migrated to dispensaries |
| `products` | Scraped product data |
| `proxies` | Proxy pool for bot detection avoidance |
| `crawl_jobs` | Crawl job history and status |

### Dispensary Crawl Fields

The `dispensaries` table now contains crawl metadata:
- `provider_type` - Menu provider (dutchie, jane, etc.)
- `menu_provider` - Detected provider for crawling
- `crawler_mode` - 'production' or 'sandbox'
- `crawl_status` - Current status (idle, running, ok, error)
- `menu_url` - URL to crawl

---

## Kubernetes Deployment

### Namespace

```bash
kubectl get all -n dispensary-scraper
```

### Image Registry

Images are stored at `code.cannabrands.app/creationshop/`:
- `dispensary-scraper:latest` - Backend
- `dispensary-scraper-frontend:latest` - Frontend

### Deploy New Code

The CI/CD pipeline builds and deploys automatically on push to master. To deploy manually:

```bash
# Restart deployment to pull latest image
kubectl rollout restart deployment/scraper -n dispensary-scraper

# Check status
kubectl rollout status deployment/scraper -n dispensary-scraper

# View logs
kubectl logs -f deployment/scraper -n dispensary-scraper
```

### Run Migrations on Remote

```bash
# Copy migration file to pod (kubectl cp requires a concrete pod name)
kubectl cp backend/migrations/029_link_dutchie_stores_to_dispensaries.sql \
  dispensary-scraper/scraper-POD-NAME:/tmp/migration.sql

# Execute via psql in the SAME pod the file was copied to
# (exec-ing the deployment may land on a different pod)
kubectl exec -n dispensary-scraper scraper-POD-NAME -- \
  sh -c 'PGPASSWORD=$POSTGRES_PASSWORD psql -h $POSTGRES_HOST -U $POSTGRES_USER -d $POSTGRES_DB -f /tmp/migration.sql'
```

---

## Running Crawlers

### Remote (Production)

Crawlers should run on the k8s cluster, not locally.
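Conceptually, queueing selects dispensaries using the crawl fields listed earlier (`provider_type`, `crawler_mode`, `crawl_status`). The real selection logic lives in `src/scripts/queue-dispensaries.ts`; the filter below is a hedged sketch of that idea, and `selectCrawlable` is a hypothetical name, not an actual function in the codebase:

```javascript
// Hypothetical sketch of the kind of filter the queue script applies.
// Column names come from the dispensaries table documented above.
function selectCrawlable(dispensaries, { production = false, limit = Infinity } = {}) {
  return dispensaries
    .filter((d) => d.provider_type === 'dutchie')            // Dutchie-backed menus
    .filter((d) => !production || d.crawler_mode === 'production')
    .filter((d) => d.crawl_status === 'idle')                // skip running crawls
    .slice(0, limit);
}

const queued = selectCrawlable(
  [
    { id: 1, provider_type: 'dutchie', crawler_mode: 'production', crawl_status: 'idle' },
    { id: 2, provider_type: 'jane', crawler_mode: 'production', crawl_status: 'idle' },
    { id: 3, provider_type: 'dutchie', crawler_mode: 'sandbox', crawl_status: 'running' },
  ],
  { production: true, limit: 1 }
);
console.log(queued.map((d) => d.id)); // → [ 1 ]
```

The `--production` and `--limit` flags on the commands below map onto the same kind of filtering.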
```bash
# Run crawl script inside pod
kubectl exec -n dispensary-scraper deployment/scraper -- \
  node dist/scripts/queue-dispensaries.js --production

# Or for a specific dispensary
kubectl exec -n dispensary-scraper deployment/scraper -- \
  node -e "require('./dist/services/crawler-jobs').runDutchieMenuCrawlJob(112)"
```

### Trigger via API

```bash
# Queue all Dutchie dispensaries (quote the URL so the shell ignores '?')
curl -X POST "https://dispos.crawlsy.com/api/crawl/queue-all?type=dutchie"
```

### Local Testing

For code development/testing only:

```bash
cd backend
DATABASE_URL="postgresql://dutchie:dutchie_local_pass@localhost:54320/dutchie_menus" \
  npx tsx src/scripts/queue-dispensaries.ts --production --limit=1
```

---

## Proxy System

### How It Works

1. Proxies are stored in the `proxies` table with `active=true/false`
2. `getActiveProxy()` selects a random active proxy
3. On bot detection, the proxy gets a 35-second in-memory cooldown
4. Permanent failures mark `active=false` in the DB

### Adding New Proxies

```sql
INSERT INTO proxies (ip, port, protocol, active)
VALUES ('123.45.67.89', 3128, 'http', true);
```

---

## Common Tasks

### Check Crawl Status

```bash
# Via kubectl (note: pg's Pool is a class, so it must be called with `new`)
kubectl exec -n dispensary-scraper deployment/scraper -- \
  node -e "new (require('pg').Pool)({connectionString: process.env.DATABASE_URL}).query('SELECT id, name, crawl_status FROM dispensaries WHERE provider_type = \\'dutchie\\' LIMIT 10').then(r => console.log(r.rows))"

# Via admin UI
# https://dispos.crawlsy.com/schedule
```

### View Recent Products

```bash
kubectl exec -n dispensary-scraper deployment/scraper -- \
  node -e "new (require('pg').Pool)({connectionString: process.env.DATABASE_URL}).query('SELECT COUNT(*), MAX(updated_at) FROM products').then(r => console.log(r.rows))"
```

### Restart Stuck Crawl

```sql
UPDATE dispensaries SET crawl_status = 'idle' WHERE crawl_status = 'running';
```

---

## Troubleshooting

### Pod Not Starting

```bash
kubectl describe pod -n dispensary-scraper -l app=scraper
kubectl logs -n dispensary-scraper -l app=scraper --previous
```

### Database Connection Issues

```bash
# Test from pod (Pool must be constructed with `new`)
kubectl exec -n dispensary-scraper deployment/scraper -- \
  node -e "new (require('pg').Pool)({connectionString: process.env.DATABASE_URL}).query('SELECT 1').then(() => console.log('OK')).catch(err => console.error('FAILED:', err.message))"
```

### Proxy Issues

```bash
# Check active proxy count
kubectl exec -n dispensary-scraper deployment/scraper -- \
  node -e "new (require('pg').Pool)({connectionString: process.env.DATABASE_URL}).query('SELECT COUNT(*) FROM proxies WHERE active = true').then(r => console.log('Active proxies:', r.rows[0].count))"
```

---

## Related Documentation

- `docs/CRAWL_OPERATIONS.md` - Crawl scheduling and data philosophy
- `docs/PRODUCT_BRAND_INTELLIGENCE_ARCHITECTURE.md` - Data processing
- `DATABASE-GUIDE.md` - Database usage details
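
---

## Appendix: Proxy Cooldown Sketch

The proxy rotation described in the Proxy System section (random active proxy, 35-second in-memory cooldown on bot detection, permanent failures deactivated in the DB) can be modelled as follows. This is an illustrative in-memory sketch only, not the actual backend implementation; the `ProxyPool` class and its method names (other than `getActiveProxy`, which the doc mentions) are hypothetical:

```javascript
// Hypothetical in-memory model of the proxy behaviour described above.
class ProxyPool {
  constructor(proxies, cooldownMs = 35_000, now = Date.now) {
    this.proxies = proxies;      // rows from the `proxies` table
    this.cooldowns = new Map();  // proxy id -> timestamp cooldown expires
    this.cooldownMs = cooldownMs;
    this.now = now;              // injectable clock, for testing
  }

  // Random active proxy that is not cooling down (cf. getActiveProxy()).
  getActiveProxy() {
    const usable = this.proxies.filter(
      (p) => p.active && (this.cooldowns.get(p.id) ?? 0) <= this.now()
    );
    if (usable.length === 0) return null;
    return usable[Math.floor(Math.random() * usable.length)];
  }

  // Bot detection: 35-second in-memory cooldown, DB untouched.
  markBotDetected(proxy) {
    this.cooldowns.set(proxy.id, this.now() + this.cooldownMs);
  }

  // Permanent failure: would also set active=false in the proxies table.
  markDead(proxy) {
    proxy.active = false;
  }
}

// Usage: with proxy 1 cooling down and proxy 2 dead, only proxy 3 is usable.
let t = 0;
const pool = new ProxyPool(
  [
    { id: 1, ip: '10.0.0.1', active: true },
    { id: 2, ip: '10.0.0.2', active: true },
    { id: 3, ip: '10.0.0.3', active: true },
  ],
  35_000,
  () => t
);
pool.markBotDetected(pool.proxies[0]); // id 1 cools down until t = 35000
pool.markDead(pool.proxies[1]);        // id 2 deactivated
console.log(pool.getActiveProxy().id); // → 3
t = 35_001;                            // cooldown expired; ids 1 and 3 usable again
```

Because the cooldown is in-memory, it resets on pod restart, whereas `active=false` persists in the database.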