# CannaiQ Backend Codebase Map

**Last Updated:** 2025-12-12
**Purpose:** Help Claude and developers understand which code is current vs deprecated

---

## Quick Reference: What to Use

### For Crawling/Scraping
| Task | Use This | NOT This |
|------|----------|----------|
| Fetch products | `src/tasks/handlers/payload-fetch.ts` | `src/hydration/*` |
| Process products | `src/tasks/handlers/product-refresh.ts` | `src/scraper-v2/*` |
| GraphQL client | `src/platforms/dutchie/client.ts` | `src/dutchie-az/services/graphql-client.ts` |
| Worker system | `src/tasks/task-worker.ts` | `src/dutchie-az/services/worker.ts` |

### For Database
| Task | Use This | NOT This |
|------|----------|----------|
| Get DB pool | `src/db/pool.ts` | `src/dutchie-az/db/connection.ts` |
| Run migrations | `src/db/migrate.ts` (CLI only) | Never import at runtime |
| Query products | `store_products` table | `products`, `dutchie_products` |
| Query stores | `dispensaries` table | `stores` table |

### For Discovery
| Task | Use This |
|------|----------|
| Discover stores | `src/discovery/*.ts` |
| Run discovery | `npx tsx src/scripts/run-discovery.ts` |

---

## Directory Status

### ACTIVE DIRECTORIES (Use These)

```
src/
├── auth/           # JWT/session auth, middleware
├── db/             # Database pool, migrations
├── discovery/      # Dutchie store discovery pipeline
├── middleware/     # Express middleware
├── multi-state/    # Multi-state query support
├── platforms/      # Platform-specific clients (Dutchie, Jane, etc)
│   └── dutchie/    # THE Dutchie client - use this one
├── routes/         # Express API routes
├── services/       # Core services (logger, scheduler, etc)
├── tasks/          # Task system (workers, handlers, scheduler)
│   └── handlers/   # Task handlers (payload_fetch, product_refresh, etc)
├── types/          # TypeScript types
└── utils/          # Utilities (storage, image processing)
```

### DEPRECATED DIRECTORIES (DO NOT USE)

```
src/
├── hydration/           # DEPRECATED - Old pipeline approach
├── scraper-v2/          # DEPRECATED - Old scraper engine
├── canonical-hydration/ # DEPRECATED - Merged into tasks/handlers
├── dutchie-az/          # PARTIAL - Some parts deprecated, some active
│   ├── db/              # DEPRECATED - Use src/db/pool.ts
│   └── services/        # PARTIAL - worker.ts still runs, graphql-client.ts deprecated
├── portals/             # FUTURE - Not yet implemented
├── seo/                 # PARTIAL - Settings work, templates WIP
└── system/              # DEPRECATED - Old orchestration system
```

### DEPRECATED FILES (DO NOT USE)

```
src/dutchie-az/db/connection.ts            # Use src/db/pool.ts instead
src/dutchie-az/services/graphql-client.ts  # Use src/platforms/dutchie/client.ts
src/hydration/*.ts                         # Entire directory deprecated
src/scraper-v2/*.ts                        # Entire directory deprecated
```

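The deprecated paths listed above could be checked mechanically, for example in a lint step or a pre-commit hook. The following is a minimal sketch, not existing project code: `DEPRECATED_PATHS` and `checkDeprecated` are hypothetical names, and the replacements are taken from the Quick Reference tables in this document.

```typescript
// Hypothetical lookup table built from the deprecated paths listed in this document.
// Entries ending in '/' match every file under that directory.
const DEPRECATED_PATHS: ReadonlyArray<{ prefix: string; replacement: string }> = [
  { prefix: 'src/dutchie-az/db/connection.ts', replacement: 'src/db/pool.ts' },
  { prefix: 'src/dutchie-az/services/graphql-client.ts', replacement: 'src/platforms/dutchie/client.ts' },
  { prefix: 'src/hydration/', replacement: 'src/tasks/handlers/payload-fetch.ts' },
  { prefix: 'src/scraper-v2/', replacement: 'src/tasks/handlers/product-refresh.ts' },
];

interface DeprecationHit {
  path: string;        // normalized path that matched
  replacement: string; // what this document says to use instead
}

// Returns a hit when `path` points at deprecated code, otherwise null.
function checkDeprecated(path: string): DeprecationHit | null {
  // Normalize Windows separators and a leading "./" so prefixes compare cleanly.
  const normalized = path.replace(/\\/g, '/').replace(/^\.\//, '');
  for (const entry of DEPRECATED_PATHS) {
    if (normalized.startsWith(entry.prefix)) {
      return { path: normalized, replacement: entry.replacement };
    }
  }
  return null;
}
```

Calling `checkDeprecated('./src/hydration/worker.ts')` would return a hit pointing at `src/tasks/handlers/payload-fetch.ts`, while an active path such as `src/db/pool.ts` returns `null`.
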
---

## Key Files Reference

### Entry Points
| File | Purpose | Status |
|------|---------|--------|
| `src/index.ts` | Main Express server | ACTIVE |
| `src/dutchie-az/services/worker.ts` | Worker process entry | ACTIVE |
| `src/tasks/task-worker.ts` | Task worker (new system) | ACTIVE |

### Dutchie Integration
| File | Purpose | Status |
|------|---------|--------|
| `src/platforms/dutchie/client.ts` | GraphQL client, hashes, curl | **PRIMARY** |
| `src/platforms/dutchie/queries.ts` | High-level query functions | ACTIVE |
| `src/platforms/dutchie/index.ts` | Re-exports | ACTIVE |

### Task Handlers
| File | Purpose | Status |
|------|---------|--------|
| `src/tasks/handlers/payload-fetch.ts` | Fetch products from Dutchie | **PRIMARY** |
| `src/tasks/handlers/product-refresh.ts` | Process payload into DB | **PRIMARY** |
| `src/tasks/handlers/entry-point-discovery.ts` | Resolve platform IDs (auto-healing) | **PRIMARY** |
| `src/tasks/handlers/menu-detection.ts` | Detect menu type | ACTIVE |
| `src/tasks/handlers/id-resolution.ts` | Resolve platform IDs (legacy) | LEGACY |
| `src/tasks/handlers/image-download.ts` | Download product images | ACTIVE |

---

## Transport Rules (CRITICAL)

**Browser-based (Puppeteer) is the DEFAULT transport. curl is ONLY allowed when explicitly specified.**

### Transport Selection
| `task.method` | Transport Used | Notes |
|---------------|----------------|-------|
| `null` | Browser (Puppeteer) | DEFAULT - use this for most tasks |
| `'http'` | Browser (Puppeteer) | Explicit browser request |
| `'curl'` | curl-impersonate | ONLY when explicitly needed |

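The selection rule in the table reduces to a single comparison. Here is a minimal sketch of it, assuming a hypothetical `selectTransport` helper and `Transport` type; the actual handler code may structure this differently.

```typescript
// Values of task.method as listed in the table above.
type TaskMethod = 'http' | 'curl' | null;

// Hypothetical labels for the two transports.
type Transport = 'browser' | 'curl';

// Browser (Puppeteer) is the default; curl is used only when the task asks for it by name.
// `undefined` is accepted as well, on the assumption that a missing method behaves like null.
function selectTransport(method: TaskMethod | undefined): Transport {
  return method === 'curl' ? 'curl' : 'browser';
}
```

Comparing against `'curl'` only, rather than listing the browser values, means any unset method falls through to the browser, which matches the browser-first rule.
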
### Why Browser-First?
1. **Anti-detection**: Puppeteer with StealthPlugin evades bot detection
2. **Session cookies**: Browser maintains session state automatically
3. **Fingerprinting**: Real browser fingerprint (TLS, headers, etc.)
4. **Age gates**: Browser can click through age verification

### Entry Point Discovery Auto-Healing
The `entry_point_discovery` handler uses a healing strategy:

```
1. FIRST: Check dutchie_discovery_locations for existing platform_location_id
   - By linked dutchie_discovery_id
   - By slug match in discovery data
   → If found, NO network call needed

2. SECOND: Browser-based GraphQL (Puppeteer)
   - 5x retries for network/proxy failures
   - On HTTP 403: rotate proxy and retry
   - On HTTP 404 after 2 attempts: mark as 'removed'

3. HARD FAILURE: After exhausting options → 'needs_investigation'
```

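The control flow of this strategy can be sketched as follows. This is an illustration under stated assumptions, not the handler's actual code: `HealingDeps`, `resolveEntryPoint`, and the `DiscoveryOutcome` shape are hypothetical, "5x retries" is read as five attempts in total, and 403/404 responses are assumed to consume attempts from the same budget. The database lookup and the Puppeteer call are injected as functions so the sketch runs on its own.

```typescript
// 'removed' and 'needs_investigation' are the statuses named in the strategy above;
// the rest of this shape is illustrative.
type DiscoveryOutcome =
  | { status: 'resolved'; platformLocationId: string; source: 'discovery_table' | 'browser' }
  | { status: 'removed' }
  | { status: 'needs_investigation'; reason: string };

// Hypothetical dependencies, injected so the sketch needs neither a DB nor Puppeteer.
interface HealingDeps {
  // Step 1: look up dutchie_discovery_locations (by linked id, then by slug).
  lookupExistingId(): Promise<string | null>;
  // Step 2: browser-based GraphQL call; rejects on network/proxy failure.
  fetchViaBrowser(): Promise<{ httpStatus: number; platformLocationId?: string }>;
  rotateProxy(): Promise<void>;
}

async function resolveEntryPoint(deps: HealingDeps, maxAttempts = 5): Promise<DiscoveryOutcome> {
  // Step 1: an existing ID means no network call is needed.
  const existing = await deps.lookupExistingId();
  if (existing !== null) {
    return { status: 'resolved', platformLocationId: existing, source: 'discovery_table' };
  }

  // Step 2: browser transport with a bounded number of attempts.
  let notFoundCount = 0;
  let lastReason = 'no attempts made';
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      const res = await deps.fetchViaBrowser();
      if (res.httpStatus === 200 && res.platformLocationId) {
        return { status: 'resolved', platformLocationId: res.platformLocationId, source: 'browser' };
      }
      lastReason = `HTTP ${res.httpStatus}`;
      if (res.httpStatus === 403) {
        await deps.rotateProxy(); // blocked: switch proxy before the next attempt
      } else if (res.httpStatus === 404 && ++notFoundCount >= 2) {
        return { status: 'removed' }; // second 404: treat the store as gone
      }
    } catch (err) {
      // Network/proxy failure: remember why, then retry.
      lastReason = err instanceof Error ? err.message : String(err);
    }
  }

  // Step 3: every option exhausted.
  return { status: 'needs_investigation', reason: lastReason };
}
```

Returning a tagged outcome instead of throwing lets the caller write the resulting status to the database in one place.
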
### DO NOT Use curl Unless:
- Task explicitly has `method = 'curl'`
- You're testing curl-impersonate binaries
- The API explicitly requires curl fingerprinting

### Files
| File | Transport | Purpose |
|------|-----------|---------|
| `src/services/puppeteer-preflight.ts` | Browser | Preflight check |
| `src/services/curl-preflight.ts` | curl | Preflight check |
| `src/tasks/handlers/entry-point-discovery.ts` | Browser | Platform ID resolution |
| `src/tasks/handlers/payload-fetch.ts` | Both | Product fetching |

### Database
| File | Purpose | Status |
|------|---------|--------|
| `src/db/pool.ts` | Canonical DB pool | **PRIMARY** |
| `src/db/migrate.ts` | Migration runner (CLI only) | CLI ONLY |
| `src/db/auto-migrate.ts` | Auto-run migrations on startup | ACTIVE |

### Configuration
| File | Purpose | Status |
|------|---------|--------|
| `.env` | Environment variables | ACTIVE |
| `package.json` | Dependencies | ACTIVE |
| `tsconfig.json` | TypeScript config | ACTIVE |

---

## GraphQL Hashes (CRITICAL)

The correct hashes are in `src/platforms/dutchie/client.ts`:

```typescript
export const GRAPHQL_HASHES = {
  FilteredProducts: 'ee29c060826dc41c527e470e9ae502c9b2c169720faa0a9f5d25e1b9a530a4a0',
  GetAddressBasedDispensaryData: '13461f73abf7268770dfd05fe7e10c523084b2bb916a929c08efe3d87531977b',
  ConsumerDispensaries: '0a5bfa6ca1d64ae47bcccb7c8077c87147cbc4e6982c17ceec97a2a4948b311b',
  GetAllCitiesByState: 'ae547a0466ace5a48f91e55bf6699eacd87e3a42841560f0c0eabed5a0a920e6',
};
```

**ALWAYS** use `Status: 'Active'` for FilteredProducts (not `null` or `'All'`).

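To show how the hash and the `Status` rule fit together, here is a minimal sketch of a request body, assuming the endpoint follows the Apollo-style persisted query convention (`extensions.persistedQuery.sha256Hash`). The `buildFilteredProductsRequest` helper, the `productsFilter` wrapper, and the `dispensaryId` variable name are hypothetical; only the hash value and `Status: 'Active'` come from this document.

```typescript
// Copied from the GRAPHQL_HASHES listing in this section. Real code should import
// GRAPHQL_HASHES from src/platforms/dutchie/client.ts rather than redeclare the value.
const FILTERED_PRODUCTS_HASH = 'ee29c060826dc41c527e470e9ae502c9b2c169720faa0a9f5d25e1b9a530a4a0';

interface PersistedQueryRequest {
  operationName: string;
  variables: Record<string, unknown>;
  extensions: { persistedQuery: { version: 1; sha256Hash: string } };
}

// Builds a FilteredProducts request body. The variable layout is an assumption
// made for illustration; check client.ts for the shape actually sent.
function buildFilteredProductsRequest(dispensaryId: string): PersistedQueryRequest {
  return {
    operationName: 'FilteredProducts',
    variables: {
      productsFilter: {
        dispensaryId,
        Status: 'Active', // never null or 'All' (see the rule above)
      },
    },
    extensions: {
      persistedQuery: { version: 1, sha256Hash: FILTERED_PRODUCTS_HASH },
    },
  };
}
```

Hard-coding `Status: 'Active'` inside the builder, instead of accepting it as a parameter, removes the chance of a caller passing `null` by accident.
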
---

## Scripts Reference

### Useful Scripts (in `src/scripts/`)
| Script | Purpose |
|--------|---------|
| `run-discovery.ts` | Run Dutchie discovery |
| `crawl-single-store.ts` | Test crawl a single store |
| `test-dutchie-graphql.ts` | Test GraphQL queries |

### One-Off Scripts (probably don't need)
| Script | Purpose |
|--------|---------|
| `harmonize-az-dispensaries.ts` | One-time data cleanup |
| `bootstrap-stores-for-dispensaries.ts` | One-time migration |
| `backfill-*.ts` | Historical backfill scripts |

---

## API Routes

### Active Routes (in `src/routes/`)
| Route File | Mount Point | Purpose |
|------------|-------------|---------|
| `auth.ts` | `/api/auth` | Login/logout/session |
| `stores.ts` | `/api/stores` | Store CRUD |
| `dashboard.ts` | `/api/dashboard` | Dashboard stats |
| `workers.ts` | `/api/workers` | Worker monitoring |
| `pipeline.ts` | `/api/pipeline` | Crawl triggers |
| `discovery.ts` | `/api/discovery` | Discovery management |
| `analytics.ts` | `/api/analytics` | Analytics queries |
| `wordpress.ts` | `/api/v1/wordpress` | WordPress plugin API |

---

## Documentation Files

### Current Docs (in `backend/docs/`)
| Doc | Purpose | Currency |
|-----|---------|----------|
| `TASK_WORKFLOW_2024-12-10.md` | Task system architecture | CURRENT |
| `WORKER_TASK_ARCHITECTURE.md` | Worker/task design | CURRENT |
| `CRAWL_PIPELINE.md` | Crawl pipeline overview | CURRENT |
| `ORGANIC_SCRAPING_GUIDE.md` | Browser-based scraping | CURRENT |
| `CODEBASE_MAP.md` | This file | CURRENT |
| `ANALYTICS_V2_EXAMPLES.md` | Analytics API examples | CURRENT |
| `BRAND_INTELLIGENCE_API.md` | Brand API docs | CURRENT |

### Root Docs
| Doc | Purpose | Currency |
|-----|---------|----------|
| `CLAUDE.md` | Claude instructions | **PRIMARY** |
| `README.md` | Project overview | NEEDS UPDATE |

---

## Common Mistakes to Avoid

1. **Don't use `src/hydration/`** - It's an old approach that was superseded by the task system

2. **Don't use `src/dutchie-az/db/connection.ts`** - Use `src/db/pool.ts` instead

3. **Don't import `src/db/migrate.ts` at runtime** - It will crash. Only use for CLI migrations.

4. **Don't query the `stores` table** - It's empty. Use `dispensaries`.

5. **Don't query the `products` table** - It's empty. Use `store_products`.

6. **Don't use the wrong GraphQL hash** - Always get the hash from `GRAPHQL_HASHES` in `src/platforms/dutchie/client.ts`

7. **Don't use `Status: null`** - It returns 0 products. Use `Status: 'Active'`.

---

## When in Doubt

1. Check if the file is imported in `src/index.ts` - if not, it may be deprecated
2. Check the last modified date - older files may be stale
3. Look for `DEPRECATED` comments in the code
4. Ask: "Is there a newer version of this in `src/tasks/` or `src/platforms/`?"
5. Read the relevant doc in `docs/` before modifying code