Add Getting Started guide for development and deployment

Includes:
- Architecture overview with k8s diagram
- Local development setup
- Database information (local vs remote)
- K8s deployment instructions
- Running crawlers remotely
- Proxy system documentation
- Common tasks and troubleshooting

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
This commit is contained in:

- GETTING-STARTED.md (new file, 291 lines)
# Dispensary Scraper - Getting Started Guide

## Architecture Overview

```
┌──────────────────────────────────────────────────────────────┐
│                   K8s Cluster (Production)                   │
│                                                              │
│  ┌────────────────┐  ┌────────────────┐  ┌────────────────┐  │
│  │    frontend    │  │    scraper     │  │    postgres    │  │
│  │ (React Admin)  │  │  (Node + API)  │  │   (Database)   │  │
│  │    port: 80    │  │   port: 3010   │  │   port: 5432   │  │
│  └────────────────┘  └────────────────┘  └────────────────┘  │
│                               │                   │          │
│                               └───────────────────┘          │
│                                                              │
│  Ingress: dispos.crawlsy.com                                 │
│    /      → frontend                                         │
│    /api   → scraper                                          │
└──────────────────────────────────────────────────────────────┘
```
### Components

| Component | Description | Image |
|-----------|-------------|-------|
| **frontend** | React admin panel for monitoring crawls | `code.cannabrands.app/creationshop/dispensary-scraper-frontend:latest` |
| **scraper** | Node.js backend with API + Puppeteer crawlers | `code.cannabrands.app/creationshop/dispensary-scraper:latest` |
| **postgres** | PostgreSQL database | `postgres:15` |

### Key Directories

```
dispensary-scraper/
├── backend/              # Node.js backend
│   ├── src/
│   │   ├── routes/       # API endpoints
│   │   ├── services/     # Business logic
│   │   ├── scraper-v2/   # Scraper engine (Scrapy-inspired)
│   │   └── scripts/      # CLI scripts
│   └── migrations/       # SQL migrations
├── frontend/             # React admin panel
├── k8s/                  # Kubernetes manifests
├── wordpress-plugin/     # WP plugin for dispensary sites
└── docs/                 # Architecture documentation
```
---

## Local Development

### Prerequisites

- Node.js 20+
- Docker & Docker Compose
- kubectl configured for the k8s cluster
- SSH tunnel to production database (port 54320)
### 1. Start Local Services

```bash
cd /home/kelly/git/dispensary-scraper

# Start local postgres and minio
docker start dutchie-postgres dutchie-minio

# Or use docker-compose for fresh setup
docker compose up -d
```

### 2. Environment Variables

```bash
# Production database (via SSH tunnel)
export DATABASE_URL="postgresql://dutchie:dutchie_local_pass@localhost:54320/dutchie_menus"
```

### 3. Run Backend

```bash
cd backend
npm install
npm run dev
# API available at http://localhost:3010
```

### 4. Run Frontend

```bash
cd frontend
npm install
npm run dev
# UI available at http://localhost:5173
```

### 5. Run Migrations

```bash
cd backend
DATABASE_URL="postgresql://dutchie:dutchie_local_pass@localhost:54320/dutchie_menus" npx tsx src/db/migrate.ts
```
---

## Database Architecture

### Two Databases

| Database | Port | User | Purpose |
|----------|------|------|---------|
| `dutchie_menus` | 54320 | dutchie | **Production** - SSH tunnel to remote k8s postgres |
| `dispensary_scraper` | 5432 | scraper | **Remote only** - Accessed via k8s pods |

**Local development uses port 54320** (SSH tunnel to production).
### Key Tables

| Table | Purpose |
|-------|---------|
| `dispensaries` | Master list of dispensaries (from AZDHS) with crawl metadata |
| `stores` | Legacy table - being migrated to dispensaries |
| `products` | Scraped product data |
| `proxies` | Proxy pool for bot detection avoidance |
| `crawl_jobs` | Crawl job history and status |

### Dispensary Crawl Fields

The `dispensaries` table now contains crawl metadata:

- `provider_type` - Menu provider (dutchie, jane, etc.)
- `menu_provider` - Detected provider for crawling
- `crawler_mode` - 'production' or 'sandbox'
- `crawl_status` - Current status (idle, running, ok, error)
- `menu_url` - URL to crawl
---

## Kubernetes Deployment

### Namespace

```bash
kubectl get all -n dispensary-scraper
```

### Image Registry

Images are stored at `code.cannabrands.app/creationshop/`:

- `dispensary-scraper:latest` - Backend
- `dispensary-scraper-frontend:latest` - Frontend
### Deploy New Code

The CI/CD pipeline builds and deploys automatically on push to master. To manually deploy:

```bash
# Restart deployment to pull latest image
kubectl rollout restart deployment/scraper -n dispensary-scraper

# Check status
kubectl rollout status deployment/scraper -n dispensary-scraper

# View logs
kubectl logs -f deployment/scraper -n dispensary-scraper
```

### Run Migrations on Remote

```bash
# Copy migration file to pod
kubectl cp backend/migrations/029_link_dutchie_stores_to_dispensaries.sql \
  dispensary-scraper/scraper-POD-NAME:/tmp/migration.sql

# Execute via psql in pod
kubectl exec -n dispensary-scraper deployment/scraper -- \
  sh -c 'PGPASSWORD=$POSTGRES_PASSWORD psql -h $POSTGRES_HOST -U $POSTGRES_USER -d $POSTGRES_DB -f /tmp/migration.sql'
```
---

## Running Crawlers

### Remote (Production)

Crawlers should run on the k8s cluster, not locally.

```bash
# Run crawl script inside pod
kubectl exec -n dispensary-scraper deployment/scraper -- \
  node dist/scripts/queue-dispensaries.js --production

# Or for a specific dispensary
kubectl exec -n dispensary-scraper deployment/scraper -- \
  node -e "require('./dist/services/crawler-jobs').runDutchieMenuCrawlJob(112)"
```

### Trigger via API

```bash
# Queue all Dutchie dispensaries (quote the URL so the shell
# doesn't interpret the query string)
curl -X POST "https://dispos.crawlsy.com/api/crawl/queue-all?type=dutchie"
```
### Local Testing

For code development/testing only:

```bash
cd backend
DATABASE_URL="postgresql://dutchie:dutchie_local_pass@localhost:54320/dutchie_menus" \
  npx tsx src/scripts/queue-dispensaries.ts --production --limit=1
```
---

## Proxy System

### How It Works

1. Proxies stored in `proxies` table with `active=true/false`
2. `getActiveProxy()` selects random active proxy
3. On bot detection, proxy gets 35-second in-memory cooldown
4. Permanent failures mark `active=false` in DB
### Adding New Proxies

```sql
INSERT INTO proxies (ip, port, protocol, active)
VALUES ('123.45.67.89', 3128, 'http', true);
```
---

## Common Tasks

### Check Crawl Status

```bash
# Via kubectl (pg's Pool must be constructed with `new`, and the pool
# keeps the process alive, so exit explicitly)
kubectl exec -n dispensary-scraper deployment/scraper -- \
  node -e "new (require('pg').Pool)({connectionString: process.env.DATABASE_URL}).query('SELECT id, name, crawl_status FROM dispensaries WHERE provider_type = \\'dutchie\\' LIMIT 10').then(r => { console.log(r.rows); process.exit(0); })"

# Via admin UI
# https://dispos.crawlsy.com/schedule
```

### View Recent Products

```bash
kubectl exec -n dispensary-scraper deployment/scraper -- \
  node -e "new (require('pg').Pool)({connectionString: process.env.DATABASE_URL}).query('SELECT COUNT(*), MAX(updated_at) FROM products').then(r => { console.log(r.rows); process.exit(0); })"
```

### Restart Stuck Crawl

```sql
UPDATE dispensaries SET crawl_status = 'idle' WHERE crawl_status = 'running';
```
---

## Troubleshooting

### Pod Not Starting

```bash
kubectl describe pod -n dispensary-scraper -l app=scraper
kubectl logs -n dispensary-scraper -l app=scraper --previous
```

### Database Connection Issues

```bash
# Test from pod
kubectl exec -n dispensary-scraper deployment/scraper -- \
  node -e "new (require('pg').Pool)({connectionString: process.env.DATABASE_URL}).query('SELECT 1').then(() => { console.log('OK'); process.exit(0); })"
```

### Proxy Issues

```bash
# Check active proxy count
kubectl exec -n dispensary-scraper deployment/scraper -- \
  node -e "new (require('pg').Pool)({connectionString: process.env.DATABASE_URL}).query('SELECT COUNT(*) FROM proxies WHERE active = true').then(r => { console.log('Active proxies:', r.rows[0].count); process.exit(0); })"
```
---

## Related Documentation

- `docs/CRAWL_OPERATIONS.md` - Crawl scheduling and data philosophy
- `docs/PRODUCT_BRAND_INTELLIGENCE_ARCHITECTURE.md` - Data processing
- `DATABASE-GUIDE.md` - Database usage details