Dispensary Scraper - Getting Started Guide

Architecture Overview

┌────────────────────────────────────────────────────────────────┐
│                    K8s Cluster (Production)                    │
│  ┌────────────────┐   ┌────────────────┐   ┌────────────────┐  │
│  │    frontend    │   │    scraper     │   │   postgres     │  │
│  │  (React Admin) │   │  (Node + API)  │   │   (Database)   │  │
│  │   port: 80     │   │  port: 3010    │   │  port: 5432    │  │
│  └────────────────┘   └────────────────┘   └────────────────┘  │
│                               │                     │          │
│                               └─────────────────────┘          │
│                                                                │
│  Ingress: dispos.crawlsy.com                                   │
│    /      → frontend                                           │
│    /api   → scraper                                            │
└────────────────────────────────────────────────────────────────┘

Components

| Component | Description | Image |
|-----------|-------------|-------|
| frontend | React admin panel for monitoring crawls | code.cannabrands.app/creationshop/dispensary-scraper-frontend:latest |
| scraper | Node.js backend with API + Puppeteer crawlers | code.cannabrands.app/creationshop/dispensary-scraper:latest |
| postgres | PostgreSQL database | postgres:15 |

Key Directories

dispensary-scraper/
├── backend/                   # Node.js backend
│   ├── src/
│   │   ├── routes/            # API endpoints
│   │   ├── services/          # Business logic
│   │   ├── scraper-v2/        # Scraper engine (Scrapy-inspired)
│   │   └── scripts/           # CLI scripts
│   └── migrations/            # SQL migrations
├── frontend/                  # React admin panel
├── k8s/                       # Kubernetes manifests
├── wordpress-plugin/          # WP plugin for dispensary sites
└── docs/                      # Architecture documentation

Local Development

Prerequisites

  • Node.js 20+
  • Docker & Docker Compose
  • kubectl configured for the k8s cluster
  • SSH tunnel to production database (port 54320)
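The SSH tunnel from the prerequisites can be declared once in ~/.ssh/config; the alias, HostName, and User below are placeholders, not the real cluster values:

```
# ~/.ssh/config — placeholder host values, substitute your own
Host dispensary-db-tunnel
    HostName k8s-node.example.com
    User kelly
    # Forward local 54320 to the remote postgres on 5432
    LocalForward 54320 localhost:5432
```

With this entry, "ssh -N dispensary-db-tunnel" opens the tunnel that the port-54320 connection strings in this guide rely on.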

1. Start Local Services

cd /home/kelly/git/dispensary-scraper

# Start local postgres and minio
docker start dutchie-postgres dutchie-minio

# Or use docker-compose for fresh setup
docker compose up -d

2. Environment Variables

# Production database (via SSH tunnel)
export DATABASE_URL="postgresql://dutchie:dutchie_local_pass@localhost:54320/dutchie_menus"

3. Run Backend

cd backend
npm install
npm run dev
# API available at http://localhost:3010

4. Run Frontend

cd frontend
npm install
npm run dev
# UI available at http://localhost:5173

5. Run Migrations

cd backend
DATABASE_URL="postgresql://dutchie:dutchie_local_pass@localhost:54320/dutchie_menus" npx tsx src/db/migrate.ts

Database Architecture

Two Databases

| Database | Port | User | Purpose |
|----------|------|------|---------|
| dutchie_menus | 54320 | dutchie | Production - SSH tunnel to remote k8s postgres |
| dispensary_scraper | 5432 | scraper | Remote only - accessed via k8s pods |

Local development uses port 54320 (SSH tunnel to production).
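To avoid retyping the tunnel connection string, it can be assembled once. This helper is purely illustrative (not a utility in the codebase); the default values mirror the credentials shown in this guide:

```typescript
// Illustrative helper: builds the tunnel DATABASE_URL used throughout
// this guide. Defaults mirror the examples in this document.
function tunnelDatabaseUrl(
  user = 'dutchie',
  pass = 'dutchie_local_pass',
  host = 'localhost',
  port = 54320,
  db = 'dutchie_menus'
): string {
  return `postgresql://${user}:${pass}@${host}:${port}/${db}`;
}

console.log(tunnelDatabaseUrl());
```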

Key Tables

| Table | Purpose |
|-------|---------|
| dispensaries | Master list of dispensaries (from AZDHS) with crawl metadata |
| stores | Legacy table - being migrated to dispensaries |
| products | Scraped product data |
| proxies | Proxy pool for bot-detection avoidance |
| crawl_jobs | Crawl job history and status |

Dispensary Crawl Fields

The dispensaries table now contains crawl metadata:

  • provider_type - Menu provider (dutchie, jane, etc.)
  • menu_provider - Detected provider for crawling
  • crawler_mode - 'production' or 'sandbox'
  • crawl_status - Current status (idle, running, ok, error)
  • menu_url - URL to crawl
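For orientation, the same fields expressed as a TypeScript shape — a sketch only; this type does not exist in the codebase under this name, and nullability is an assumption:

```typescript
// Sketch of the crawl-metadata columns listed above. The status/mode
// unions are the values this guide mentions; everything else is assumed.
type CrawlerMode = 'production' | 'sandbox';
type CrawlStatus = 'idle' | 'running' | 'ok' | 'error';

interface DispensaryCrawlFields {
  provider_type: string | null; // menu provider (dutchie, jane, etc.)
  menu_provider: string | null; // detected provider for crawling
  crawler_mode: CrawlerMode;
  crawl_status: CrawlStatus;
  menu_url: string | null;      // URL to crawl
}

const example: DispensaryCrawlFields = {
  provider_type: 'dutchie',
  menu_provider: 'dutchie',
  crawler_mode: 'production',
  crawl_status: 'idle',
  menu_url: 'https://example.com/menu',
};
console.log(example.crawl_status);
```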

Kubernetes Deployment

Namespace

kubectl get all -n dispensary-scraper

Image Registry

Images are stored at code.cannabrands.app/creationshop/:

  • dispensary-scraper:latest - Backend
  • dispensary-scraper-frontend:latest - Frontend

Deploy New Code

The CI/CD pipeline builds and deploys automatically on push to master. To manually deploy:

# Restart deployment to pull latest image
kubectl rollout restart deployment/scraper -n dispensary-scraper

# Check status
kubectl rollout status deployment/scraper -n dispensary-scraper

# View logs
kubectl logs -f deployment/scraper -n dispensary-scraper

Run Migrations on Remote

# Copy migration file to pod
kubectl cp backend/migrations/029_link_dutchie_stores_to_dispensaries.sql \
  dispensary-scraper/scraper-POD-NAME:/tmp/migration.sql

# Execute via psql in pod
kubectl exec -n dispensary-scraper deployment/scraper -- \
  sh -c 'PGPASSWORD=$POSTGRES_PASSWORD psql -h $POSTGRES_HOST -U $POSTGRES_USER -d $POSTGRES_DB -f /tmp/migration.sql'

Running Crawlers

Remote (Production)

Crawlers should run on the k8s cluster, not locally.

# Run crawl script inside pod
kubectl exec -n dispensary-scraper deployment/scraper -- \
  node dist/scripts/queue-dispensaries.js --production

# Or for a specific dispensary
kubectl exec -n dispensary-scraper deployment/scraper -- \
  node -e "require('./dist/services/crawler-jobs').runDutchieMenuCrawlJob(112)"

Trigger via API

# Queue all Dutchie dispensaries (quote the URL so the shell doesn't expand '?')
curl -X POST "https://dispos.crawlsy.com/api/crawl/queue-all?type=dutchie"

Local Testing

For code development/testing only:

cd backend
DATABASE_URL="postgresql://dutchie:dutchie_local_pass@localhost:54320/dutchie_menus" \
  npx tsx src/scripts/queue-dispensaries.ts --production --limit=1

Proxy System

How It Works

  1. Proxies are stored in the proxies table with active=true/false
  2. getActiveProxy() selects a random active proxy
  3. On bot detection, the proxy gets a 35-second in-memory cooldown
  4. Permanent failures mark the proxy active=false in the DB
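The rotation logic above can be sketched as follows. This is illustrative only — the names besides getActiveProxy(), the shapes, and the in-memory Map are assumptions, and the real implementation lives in the backend services:

```typescript
// Minimal sketch of the proxy rotation described above.
interface Proxy { id: number; ip: string; port: number; active: boolean }

const COOLDOWN_MS = 35_000; // 35-second in-memory cooldown on bot detection
const cooldowns = new Map<number, number>(); // proxy id -> cooldown expiry

function getActiveProxy(pool: Proxy[], now = Date.now()): Proxy | undefined {
  // Only proxies that are active and not cooling down are eligible
  const usable = pool.filter(
    (p) => p.active && (cooldowns.get(p.id) ?? 0) <= now
  );
  if (usable.length === 0) return undefined;
  return usable[Math.floor(Math.random() * usable.length)];
}

function reportBotDetection(proxy: Proxy, now = Date.now()): void {
  cooldowns.set(proxy.id, now + COOLDOWN_MS); // temporary, in-memory only
}

function reportPermanentFailure(proxy: Proxy): void {
  proxy.active = false; // in the real system: UPDATE on the proxies table
}
```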

Adding New Proxies

INSERT INTO proxies (ip, port, protocol, active)
VALUES ('123.45.67.89', 3128, 'http', true);

Common Tasks

Check Crawl Status

# Via kubectl
kubectl exec -n dispensary-scraper deployment/scraper -- \
  node -e "new (require('pg').Pool)({connectionString: process.env.DATABASE_URL}).query('SELECT id, name, crawl_status FROM dispensaries WHERE provider_type = \\'dutchie\\' LIMIT 10').then(r => { console.log(r.rows); process.exit(0); })"

# Via admin UI
# https://dispos.crawlsy.com/schedule

View Recent Products

kubectl exec -n dispensary-scraper deployment/scraper -- \
  node -e "new (require('pg').Pool)({connectionString: process.env.DATABASE_URL}).query('SELECT COUNT(*), MAX(updated_at) FROM products').then(r => { console.log(r.rows); process.exit(0); })"

Restart Stuck Crawl

UPDATE dispensaries SET crawl_status = 'idle' WHERE crawl_status = 'running';

Troubleshooting

Pod Not Starting

kubectl describe pod -n dispensary-scraper -l app=scraper
kubectl logs -n dispensary-scraper -l app=scraper --previous

Database Connection Issues

# Test from pod
kubectl exec -n dispensary-scraper deployment/scraper -- \
  node -e "new (require('pg').Pool)({connectionString: process.env.DATABASE_URL}).query('SELECT 1').then(() => { console.log('OK'); process.exit(0); })"

Proxy Issues

# Check active proxy count
kubectl exec -n dispensary-scraper deployment/scraper -- \
  node -e "new (require('pg').Pool)({connectionString: process.env.DATABASE_URL}).query('SELECT COUNT(*) FROM proxies WHERE active = true').then(r => { console.log('Active proxies:', r.rows[0].count); process.exit(0); })"

Related Documentation

  • docs/CRAWL_OPERATIONS.md - Crawl scheduling and data philosophy
  • docs/PRODUCT_BRAND_INTELLIGENCE_ARCHITECTURE.md - Data processing
  • DATABASE-GUIDE.md - Database usage details