Compare commits


8 Commits

Author SHA1 Message Date
Kelly
80f048ad57 feat(k8s): Add StatefulSet for persistent workers
- Add scraper-worker-statefulset.yaml with 8 persistent pods
- updateStrategy: OnDelete prevents automatic restarts
- Workers maintain stable identity across restarts
- Document worker architecture in CLAUDE.md
- Add worker registry API endpoint documentation

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-11 23:37:28 -07:00
kelly
2ed088b4d8 Merge pull request 'feat(api): Add preflight columns to worker registry API response' (#50) from feat/preflight-api-fields into master
2025-12-12 06:22:06 +00:00
Kelly
d3c49fa246 feat(api): Add preflight columns to worker registry API response
Exposes curl_ip, http_ip, preflight_status, preflight_at, and fingerprint_data
in the /api/worker-registry/workers response.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-11 23:12:55 -07:00
kelly
52cb5014fd Merge pull request 'feat: Dual-transport preflight system for worker fingerprinting' (#49) from fix/ci-and-preflight-enforcement into master
2025-12-12 06:08:13 +00:00
kelly
c68210c485 Merge pull request 'fix(ci): Remove buildx cache and add preflight enforcement' (#48) from fix/ci-and-preflight-enforcement into master
Reviewed-on: https://code.cannabrands.app/Creationshop/dispensary-scraper/pulls/48
2025-12-12 04:53:20 +00:00
kelly
eca9e85242 Merge pull request 'fix(ci): Fix buildx cache syntax and add proxy_test task' (#47) from feat/ui-polish-and-ci-caching into master
Reviewed-on: https://code.cannabrands.app/Creationshop/dispensary-scraper/pulls/47
2025-12-12 04:30:21 +00:00
kelly
e1c67dcee5 Merge pull request 'feat: UI polish and CI caching improvements' (#46) from feat/ui-polish-and-ci-caching into master
Reviewed-on: https://code.cannabrands.app/Creationshop/dispensary-scraper/pulls/46
2025-12-12 03:47:46 +00:00
kelly
34c8a8cc67 Merge pull request 'feat(cannaiq): Add clickable logo, favicon, and remove state selector' (#45) from feat/cannaiq-ui-polish into master
Reviewed-on: https://code.cannabrands.app/Creationshop/dispensary-scraper/pulls/45
2025-12-12 03:36:04 +00:00
3 changed files with 135 additions and 0 deletions


@@ -205,6 +205,58 @@ These binaries mimic real browser TLS fingerprints to avoid detection.
---
## Worker Architecture (Kubernetes)
### Persistent Workers (StatefulSet)
Workers run as a **StatefulSet** with 8 persistent pods. Each pod keeps the same name, and therefore the same worker identity, across restarts.
**Pod Names**: `scraper-worker-0` through `scraper-worker-7`
**Key Properties**:
- `updateStrategy: OnDelete` - Pods pick up a new pod template only when they are manually deleted; the controller never rolls them automatically (crashed pods are still recreated)
- `podManagementPolicy: Parallel` - All pods start simultaneously
- Workers register with their pod name as identity
**K8s Manifest**: `backend/k8s/scraper-worker-statefulset.yaml`
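
A minimal sketch of how the pod names above follow from the StatefulSet name and replica count. Kubernetes names StatefulSet pods `<name>-<ordinal>`, with ordinals starting at 0 by default. The helper names `statefulSetPodNames` and `podOrdinal` are hypothetical and do not appear in this repository.

```typescript
// Sketch only: derive the stable pod names a StatefulSet assigns.
// Assumes the default ordinal range, which starts at 0.
function statefulSetPodNames(name: string, replicas: number): string[] {
  return Array.from({ length: replicas }, (_, ordinal) => `${name}-${ordinal}`);
}

// Recover the ordinal from a pod name, or null if the name does not match.
function podOrdinal(podName: string, name: string): number | null {
  const prefix = `${name}-`;
  if (!podName.startsWith(prefix)) return null;
  const suffix = podName.slice(prefix.length);
  return /^\d+$/.test(suffix) ? Number(suffix) : null;
}

// The values below come from the manifest: name "scraper-worker", 8 replicas.
const workerPods = statefulSetPodNames("scraper-worker", 8);
```

Because the names are deterministic, a registry row keyed by pod name refers to the same logical worker before and after a pod is recreated.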
### Worker Lifecycle
1. **Startup**: Worker registers in `worker_registry` table with pod name
2. **Preflight**: Runs dual-transport preflights (curl + http), reports IPs and fingerprint
3. **Task Loop**: Polls for tasks, executes them, reports status
4. **Shutdown**: Graceful 60-second termination period
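
The four lifecycle steps can be read as a small state machine. The sketch below is illustrative only: the phase names, the event names, and the rule that polling starts only after a passed preflight are assumptions, since the worker source is not part of this diff.

```typescript
// Hypothetical phases and events; names are not taken from the worker code.
type Phase = "starting" | "registered" | "polling" | "draining" | "stopped";

type WorkerEvent =
  | { kind: "registered" }
  | { kind: "preflight"; passed: boolean }
  | { kind: "sigterm" }
  | { kind: "drained" };

// Pure transition function: given the current phase and an event, return the next phase.
function nextPhase(phase: Phase, event: WorkerEvent): Phase {
  switch (event.kind) {
    case "registered":
      // Step 1: the worker has written its row to worker_registry.
      return phase === "starting" ? "registered" : phase;
    case "preflight":
      // Step 2. Assumption: a worker only begins polling after its preflight passes.
      if (phase !== "registered") return phase;
      return event.passed ? "polling" : "registered";
    case "sigterm":
      // Step 4: Kubernetes sends SIGTERM, then waits terminationGracePeriodSeconds (60 here).
      return phase === "stopped" ? "stopped" : "draining";
    case "drained":
      // In-flight tasks finished (or were abandoned) within the grace period.
      return phase === "draining" ? "stopped" : phase;
  }
}

// Walk one worker through a full, uneventful lifecycle.
const lifecycleEvents: WorkerEvent[] = [
  { kind: "registered" },
  { kind: "preflight", passed: true },
  { kind: "sigterm" },
  { kind: "drained" },
];
const finalPhase = lifecycleEvents.reduce<Phase>(nextPhase, "starting");
```

Keeping the transitions in a pure function makes it easy to check that a failed preflight never reaches the task loop, independent of any network or database code.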
### NEVER Restart Workers Unnecessarily
**Claude must NOT**:
- Restart workers unless explicitly requested
- Use `kubectl rollout restart` on workers
- Use `kubectl set image` on workers (this changes the pod template; with `OnDelete` the new image takes effect the next time a pod is deleted or recreated)
**To update worker code** (only when user authorizes):
1. Build and push new image with version tag
2. Update StatefulSet image reference
3. Manually delete pods one at a time when ready: `kubectl delete pod scraper-worker-0 -n dispensary-scraper`
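
As a rough illustration of step 3, the sketch below builds the per-pod delete commands as strings so they can be reviewed before anyone runs them. It does not execute anything. The helper name `podDeleteCommands` and the ascending order are assumptions; the document only requires that pods be deleted one at a time, with user authorization.

```typescript
// Sketch only: list the delete commands for a manual, one-at-a-time rollout.
// Nothing is executed here; an operator runs each command after checking
// that the previously deleted pod has come back and re-registered.
function podDeleteCommands(name: string, replicas: number, namespace: string): string[] {
  const commands: string[] = [];
  for (let ordinal = 0; ordinal < replicas; ordinal++) {
    commands.push(`kubectl delete pod ${name}-${ordinal} -n ${namespace}`);
  }
  return commands;
}

// Values taken from the manifest: 8 replicas in the dispensary-scraper namespace.
const rolloutCommands = podDeleteCommands("scraper-worker", 8, "dispensary-scraper");
```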
### Worker Registry API
**Endpoint**: `GET /api/worker-registry/workers`
**Response Fields**:
| Field | Description |
|-------|-------------|
| `pod_name` | Kubernetes pod name |
| `worker_id` | Internal worker UUID |
| `status` | active, idle, offline |
| `curl_ip` | IP from curl preflight |
| `http_ip` | IP from http (Puppeteer) preflight |
| `preflight_status` | pending, passed, failed |
| `preflight_at` | Timestamp of last preflight |
| `fingerprint_data` | Browser fingerprint JSON |
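
A possible client-side typing of these fields, as a sketch. The field names and status values come from the table above; the nullability, the string encoding of the timestamp, and the `readyWorkers` helper are assumptions, and the real response may carry more fields or wrap the rows in an envelope.

```typescript
type WorkerStatus = "active" | "idle" | "offline";
type PreflightStatus = "pending" | "passed" | "failed";

// Assumed shape of one row; nullability is a guess for workers that have not run a preflight yet.
interface WorkerRegistryRow {
  pod_name: string;
  worker_id: string;
  status: WorkerStatus;
  curl_ip: string | null;
  http_ip: string | null;
  preflight_status: PreflightStatus;
  preflight_at: string | null; // assumed to be serialized as an ISO 8601 string
  fingerprint_data: unknown; // JSON blob; structure not documented here
}

// Hypothetical helper: workers that are online and have passed preflight.
function readyWorkers(rows: WorkerRegistryRow[]): WorkerRegistryRow[] {
  return rows.filter((w) => w.status !== "offline" && w.preflight_status === "passed");
}

// Made-up sample rows (IPs are from the documentation-only range 203.0.113.0/24).
const sampleRows: WorkerRegistryRow[] = [
  {
    pod_name: "scraper-worker-0", worker_id: "00000000-0000-0000-0000-000000000000",
    status: "active", curl_ip: "203.0.113.10", http_ip: "203.0.113.10",
    preflight_status: "passed", preflight_at: "2025-12-12T06:00:00Z", fingerprint_data: {},
  },
  {
    pod_name: "scraper-worker-1", worker_id: "00000000-0000-0000-0000-000000000001",
    status: "idle", curl_ip: null, http_ip: null,
    preflight_status: "pending", preflight_at: null, fingerprint_data: null,
  },
];

const readyPodNames = readyWorkers(sampleRows).map((w) => w.pod_name);
```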
---
## Documentation
| Doc | Purpose |


@@ -0,0 +1,77 @@
apiVersion: v1
kind: Service
metadata:
  name: scraper-worker
  namespace: dispensary-scraper
  labels:
    app: scraper-worker
spec:
  clusterIP: None  # Headless service required for StatefulSet
  selector:
    app: scraper-worker
  ports:
  - port: 3010
    name: http
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: scraper-worker
  namespace: dispensary-scraper
spec:
  serviceName: scraper-worker
  replicas: 8
  podManagementPolicy: Parallel  # Start all pods at once
  updateStrategy:
    type: OnDelete  # Pods only update when manually deleted - no automatic restarts
  selector:
    matchLabels:
      app: scraper-worker
  template:
    metadata:
      labels:
        app: scraper-worker
    spec:
      terminationGracePeriodSeconds: 60
      imagePullSecrets:
      - name: regcred
      containers:
      - name: worker
        image: code.cannabrands.app/creationshop/dispensary-scraper:2ed088b4
        imagePullPolicy: Always
        command: ["node"]
        args: ["dist/tasks/task-worker.js"]
        env:
        - name: WORKER_MODE
          value: "true"
        - name: POD_NAME
          valueFrom:
            fieldRef:
              fieldPath: metadata.name
        - name: MAX_CONCURRENT_TASKS
          value: "50"
        - name: API_BASE_URL
          value: http://scraper
        - name: NODE_OPTIONS
          value: --max-old-space-size=1500
        envFrom:
        - configMapRef:
            name: scraper-config
        - secretRef:
            name: scraper-secrets
        resources:
          requests:
            cpu: 100m
            memory: 1Gi
          limits:
            cpu: 500m
            memory: 2Gi
        livenessProbe:
          exec:
            command:
            - /bin/sh
            - -c
            - pgrep -f 'task-worker' > /dev/null
          initialDelaySeconds: 10
          periodSeconds: 30
          failureThreshold: 3
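
The manifest injects `POD_NAME` through the downward API (`fieldRef: metadata.name`) alongside `MAX_CONCURRENT_TASKS` and `API_BASE_URL`. A worker could read and validate those variables roughly as follows. This is a sketch under assumptions: `readWorkerEnv` and `WorkerEnv` are hypothetical names, and the actual `task-worker.js` may handle its configuration differently.

```typescript
// Assumed shape of the settings a worker needs from its environment.
interface WorkerEnv {
  podName: string;
  maxConcurrentTasks: number;
  apiBaseUrl: string;
}

// Takes the environment as a plain map so it can be exercised without a real process.
function readWorkerEnv(env: Record<string, string | undefined>): WorkerEnv {
  const podName = env["POD_NAME"];
  if (!podName) throw new Error("POD_NAME is not set");

  // Environment values are always strings, which is why the manifest quotes "50".
  const maxConcurrentTasks = Number.parseInt(env["MAX_CONCURRENT_TASKS"] ?? "", 10);
  if (!Number.isInteger(maxConcurrentTasks) || maxConcurrentTasks <= 0) {
    throw new Error("MAX_CONCURRENT_TASKS must be a positive integer");
  }

  const apiBaseUrl = env["API_BASE_URL"];
  if (!apiBaseUrl) throw new Error("API_BASE_URL is not set");

  return { podName, maxConcurrentTasks, apiBaseUrl };
}

// Same values the manifest would supply to the pod named scraper-worker-3.
const workerEnv = readWorkerEnv({
  POD_NAME: "scraper-worker-3",
  MAX_CONCURRENT_TASKS: "50",
  API_BASE_URL: "http://scraper",
});
```

In a running pod the same function would be called with `process.env`.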


@@ -355,6 +355,12 @@ router.get('/workers', async (req: Request, res: Response) => {
-- Decommission fields
COALESCE(decommission_requested, false) as decommission_requested,
decommission_reason,
-- Preflight fields (dual-transport verification)
curl_ip,
http_ip,
preflight_status,
preflight_at,
fingerprint_data,
-- Full metadata for resources
metadata,
EXTRACT(EPOCH FROM (NOW() - last_heartbeat_at)) as seconds_since_heartbeat,