# Horizontal Scaling Strategies

## Overview

Changemaker Lite V2 can scale horizontally to handle increased traffic and data volume. This guide covers strategies for scaling each component.

**When to Scale:**

- API response time >500ms (P95)
- CPU usage >70% sustained
- Memory usage >80% sustained
- Database connection pool exhausted
- Job queue backing up (>100 jobs waiting)
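
These thresholds come from the monitoring stack; for a quick one-off check without opening Grafana, a couple of shell commands are enough (a sketch, using the `api.cmlite.org` host from the load-testing example later in this guide):

```bash
# CPU / memory per container (sustained trends live in Prometheus)
docker stats --no-stream

# Single-request latency sample against the public campaigns endpoint
curl -o /dev/null -s -w '%{time_total}s\n' http://api.cmlite.org/api/campaigns
```
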
---

## Database Scaling

### Read Replicas

**PostgreSQL streaming replication** for read-heavy workloads.

**Setup** (docker-compose.yml):
```yaml
v2-postgres-replica:
  image: postgres:16-alpine
  container_name: changemaker-v2-postgres-replica
  environment:
    POSTGRES_USER: replicator
    POSTGRES_PASSWORD: ${REPLICA_PASSWORD}
  command: |
    postgres -c wal_level=replica
      -c hot_standby=on
      -c max_wal_senders=3
      -c hot_standby_feedback=on
  volumes:
    - v2-postgres-replica-data:/var/lib/postgresql/data
```

**Primary config** (postgresql.conf):
```ini
wal_level = replica
max_wal_senders = 3
wal_keep_size = 64MB
```

**Replication user**:
```sql
CREATE ROLE replicator WITH REPLICATION LOGIN PASSWORD 'replica-password';
```
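
The compose entry above only starts a standby-configured server; the replica's data directory still has to be seeded from the primary once. A minimal sketch using `pg_basebackup` (assumes the replica volume starts out empty and the primary is reachable as `changemaker-v2-postgres`):

```bash
# One-time bootstrap: copy the primary's data directory and write standby settings (-R)
docker compose run --rm -e PGPASSWORD=${REPLICA_PASSWORD} v2-postgres-replica \
  pg_basebackup -h changemaker-v2-postgres -U replicator \
  -D /var/lib/postgresql/data -Fp -Xs -P -R
```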

**Prisma read replica** (planned feature):
```typescript
// Planned: route reads to the replica via Prisma's read-replicas extension
// (@prisma/extension-read-replicas); writes continue to use DATABASE_URL.
import { PrismaClient } from '@prisma/client';
import { readReplicas } from '@prisma/extension-read-replicas';

const prisma = new PrismaClient().$extends(
  readReplicas({
    url: process.env.REPLICA_URL, // Replica (reads)
  }),
);
```

---

### Connection Pooling

**PgBouncer** for connection pooling.

**docker-compose.yml**:
```yaml
pgbouncer:
  image: pgbouncer/pgbouncer:latest
  container_name: pgbouncer-changemaker
  environment:
    DATABASES_HOST: changemaker-v2-postgres
    DATABASES_PORT: 5432
    DATABASES_USER: changemaker
    DATABASES_PASSWORD: ${V2_POSTGRES_PASSWORD}
    DATABASES_DBNAME: changemaker_v2
    POOL_MODE: transaction
    MAX_CLIENT_CONN: 1000
    DEFAULT_POOL_SIZE: 20
  ports:
    - "6432:6432"
```

**Update DATABASE_URL**:
```bash
# Before (direct)
DATABASE_URL=postgresql://changemaker:pass@changemaker-v2-postgres:5432/changemaker_v2

# After (pooled): Prisma needs ?pgbouncer=true when PgBouncer runs in transaction mode
DATABASE_URL=postgresql://changemaker:pass@pgbouncer:6432/changemaker_v2?pgbouncer=true
```

**Benefits**:

- Handles 1000+ client connections with only 20 PostgreSQL connections
- Reduces connection overhead
- Prevents "too many connections" errors
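
Pool behaviour can be inspected through PgBouncer's admin console. A sketch (assumes the `changemaker` user is listed in PgBouncer's `stats_users`; `pgbouncer` is the built-in admin database, and `v2-postgres` is the Postgres service used elsewhere in this guide):

```bash
# Show per-database pool usage (cl_active, sv_active, sv_idle, ...)
docker compose exec v2-postgres \
  psql -h pgbouncer -p 6432 -U changemaker pgbouncer -c "SHOW POOLS;"
```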

---

## API Scaling

### Multiple API Containers

**docker-compose.yml**:
```yaml
api:
  # ... existing config
  deploy:
    replicas: 3  # Run 3 API containers
```

**Or manual scaling**:
```bash
docker compose up -d --scale api=3
```
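
Compose suffixes the scaled containers with `-1`, `-2`, `-3`; a quick way to confirm all replicas are up:

```bash
# List the API replicas and their state
docker compose ps api
```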

**Load balancer** (Nginx upstream):
```nginx
upstream api_backend {
    least_conn;  # Load balancing algorithm
    server changemaker-v2-api-1:4000;
    server changemaker-v2-api-2:4000;
    server changemaker-v2-api-3:4000;
}

server {
    location /api/ {
        proxy_pass http://api_backend;
    }
}
```

**Session affinity** (sticky sessions):
```nginx
upstream api_backend {
    ip_hash;  # Route same IP to same backend
    server changemaker-v2-api-1:4000;
    server changemaker-v2-api-2:4000;
}
```

---

### Vertical Scaling (Resource Limits)

**Increase container resources**:
```yaml
api:
  deploy:
    resources:
      limits:
        cpus: '4'    # 4 CPU cores
        memory: 4G   # 4GB RAM
      reservations:
        cpus: '1'
        memory: 1G
```

**Node.js memory limit**:
```yaml
api:
  environment:
    - NODE_OPTIONS=--max-old-space-size=3072  # 3GB heap
```

---

## Redis Scaling

### Redis Cluster (Sharding)

**For >100GB datasets** or high throughput.

**docker-compose.yml** (6-node cluster):
```yaml
redis-cluster-1:
  image: redis:7-alpine
  command: redis-server --cluster-enabled yes --cluster-config-file nodes.conf

# ... repeat for redis-cluster-2 through redis-cluster-6
```

**Create cluster**:
```bash
docker compose exec redis-cluster-1 redis-cli --cluster create \
  redis-cluster-1:6379 \
  redis-cluster-2:6379 \
  redis-cluster-3:6379 \
  redis-cluster-4:6379 \
  redis-cluster-5:6379 \
  redis-cluster-6:6379 \
  --cluster-replicas 1
```
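
Once created, cluster health can be verified before pointing clients at it (a sketch):

```bash
# All 16384 slots should be covered and cluster_state should report "ok"
docker compose exec redis-cluster-1 redis-cli --cluster check redis-cluster-1:6379
docker compose exec redis-cluster-1 redis-cli cluster info | grep cluster_state
```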

---

### Redis Sentinel (High Availability)

**Automatic failover** for Redis.

**docker-compose.yml**:
```yaml
redis-sentinel-1:
  image: redis:7-alpine
  command: redis-sentinel /etc/redis/sentinel.conf
  volumes:
    - ./configs/redis/sentinel.conf:/etc/redis/sentinel.conf

# ... repeat for sentinel-2, sentinel-3
```

**sentinel.conf**:
```ini
sentinel monitor mymaster redis-primary 6379 2
sentinel down-after-milliseconds mymaster 5000
sentinel parallel-syncs mymaster 1
sentinel failover-timeout mymaster 10000
```
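
Clients discover the current master through Sentinel rather than a fixed hostname. To check which node the Sentinels currently report as master (assumes Sentinel's default port 26379):

```bash
# Returns the IP and port of the node Sentinel considers the master
docker compose exec redis-sentinel-1 \
  redis-cli -p 26379 sentinel get-master-addr-by-name mymaster
```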

---

## Media API Scaling

### Separate Media Containers

**docker-compose.yml**:
```yaml
media-api:
  deploy:
    replicas: 2  # Run 2 media API containers
```

**Nginx load balancer**:
```nginx
upstream media_backend {
    server changemaker-media-api-1:4100;
    server changemaker-media-api-2:4100;
}

location /api/media/ {
    proxy_pass http://media_backend;
}
```

**Shared volume** (read-only):
```yaml
media-api:
  volumes:
    - ${MEDIA_ROOT}:/media:ro  # All replicas read same library
```

---

### CDN for Static Media

**Cloudflare CDN** (or similar):

**Setup**:
1. Enable Cloudflare proxy (orange cloud)
2. Configure cache rules:
   - Cache `/media/library/*.mp4` for 30 days
   - Bypass cache for `/api/media/` (dynamic)

**Benefits**:
- Offload video bandwidth
- Global edge caching
- DDoS protection
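
After the cache rules are in place, edge caching can be confirmed from the command line. A sketch (the hostname and file are placeholders for your own media URL):

```bash
# Cloudflare reports HIT/MISS/BYPASS in the cf-cache-status response header
curl -sI https://cmlite.org/media/library/sample.mp4 | grep -i cf-cache-status
```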

---

## Frontend Scaling

### CDN for Static Assets

**Vite production build** → static files → CDN.

**Build**:
```bash
cd admin && npm run build
```

**Upload to CDN** (S3 + CloudFront):
```bash
aws s3 sync dist/ s3://changemaker-static/ --delete
aws cloudfront create-invalidation --distribution-id XYZ --paths "/*"
```

**Benefits**:
- Global edge caching
- Reduced origin load
- Faster page loads

---

### Nginx Caching

**Proxy cache for API responses**:
```nginx
proxy_cache_path /var/cache/nginx levels=1:2 keys_zone=api_cache:10m max_size=1g inactive=60m;

location /api/campaigns {
    proxy_cache api_cache;
    proxy_cache_valid 200 10m;
    proxy_cache_key "$scheme$request_method$host$request_uri";
    add_header X-Cache-Status $upstream_cache_status;  # expose HIT/MISS for debugging
    proxy_pass http://changemaker-v2-api:4000;
}
```

**Cacheable endpoints**:
- `/api/campaigns` (public listing, 10 minutes)
- `/api/representatives` (lookup cache, 1 hour)
- `/api/locations/public` (map data, 5 minutes)

**Never cache**:
- POST/PUT/DELETE requests
- Authenticated endpoints
- Real-time data (canvass sessions)
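
With the `X-Cache-Status` header added above, cache hits are easy to verify from the command line:

```bash
# The second request should report HIT once the entry is cached
curl -sI http://api.cmlite.org/api/campaigns | grep -i x-cache-status
```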

---

## Job Queue Scaling

### Multiple BullMQ Workers

**API container scaling** also scales workers (each API container runs a worker).

**Alternative**: Dedicated worker containers.

**docker-compose.yml**:
```yaml
email-worker:
  build:
    context: ./api
  container_name: email-worker
  command: node dist/workers/email-worker.js
  environment:
    - REDIS_URL=${REDIS_URL}
    - SMTP_HOST=${SMTP_HOST}
    # ... other env vars
  depends_on:
    - redis
```

**Worker script** (api/src/workers/email-worker.ts):
```typescript
// BullMQ worker for the email queue (the queue name and connection options
// here are assumptions; adjust to match the email-queue service module).
import { Worker } from 'bullmq';
import IORedis from 'ioredis';

const connection = new IORedis(process.env.REDIS_URL!, { maxRetriesPerRequest: null });

new Worker('email', async (job) => {
  // Process email job
}, { connection, concurrency: 10 });

console.log('Email worker started');
```

**Scale workers** (remove `container_name` from the service first, since a fixed name cannot be replicated):
```bash
docker compose up -d --scale email-worker=5
```

---

## Monitoring Under Load

### Load Testing

**k6 script** (load-test.js):
```javascript
import http from 'k6/http';
import { check } from 'k6';

export let options = {
  stages: [
    { duration: '1m', target: 50 },   // Ramp to 50 users
    { duration: '3m', target: 50 },   // Stay at 50 users
    { duration: '1m', target: 100 },  // Ramp to 100 users
    { duration: '3m', target: 100 },  // Stay at 100 users
    { duration: '1m', target: 0 },    // Ramp down
  ],
};

export default function () {
  let res = http.get('http://api.cmlite.org/api/campaigns');
  check(res, {
    'status 200': (r) => r.status === 200,
    'response time < 500ms': (r) => r.timings.duration < 500,
  });
}
```

**Run test**:
```bash
k6 run load-test.js
```

---

### Prometheus Metrics

**Monitor scaling indicators**:
- `rate(http_requests_total[5m])` — Request rate
- `histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))` — P95 latency
- `container_cpu_usage_seconds_total` — CPU usage per container
- `container_memory_usage_bytes` — Memory usage per container

**Prometheus alert rule**:
```yaml
- alert: HighAPILatency
  expr: histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m])) > 0.5
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "P95 latency >500ms, consider scaling"
```

---

## Troubleshooting

### High CPU Usage

**Diagnosis**:
```bash
# Top processes
docker stats

# API CPU usage
docker stats changemaker-v2-api

# Profile Node.js
docker compose exec api node --prof dist/server.js
```

**Solutions**:
- Scale API containers (3-5 replicas)
- Increase CPU limit (2-4 cores)
- Optimize slow queries (add indexes)
- Enable caching (Nginx proxy cache)

---

### Memory Leaks

**Diagnosis**:
```bash
# Memory usage over time
docker stats --no-stream changemaker-v2-api

# Heap snapshot (Node.js)
docker compose exec api node --inspect dist/server.js
# Chrome DevTools → Memory → Take snapshot
```

**Solutions**:
- Restart containers daily (cron job; see the sketch below)
- Increase memory limit (4-8GB)
- Fix code leaks (event listeners, circular refs)
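
A minimal host crontab entry for the daily restart (the install path is a placeholder for wherever the compose project lives):

```bash
# Restart the API containers at 04:00 every day and log the output
0 4 * * * cd /opt/changemaker && docker compose restart api >> /var/log/changemaker-restart.log 2>&1
```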

---

### Database Connection Exhaustion

**Symptoms**: `Error: too many connections for role "changemaker"`

**Diagnosis**:
```bash
# Check connection count
docker compose exec v2-postgres psql -U changemaker -d changemaker_v2 -c \
  "SELECT COUNT(*) FROM pg_stat_activity WHERE usename='changemaker'"

# Check max connections
docker compose exec v2-postgres psql -U changemaker -d changemaker_v2 -c \
  "SHOW max_connections"
```

**Solutions**:
- Add PgBouncer (connection pooling)
- Increase `max_connections` (PostgreSQL config)
- Fix connection leaks (always close Prisma clients)
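
If PgBouncer is not in place yet, Prisma's own pool can also be capped per container so that several API replicas stay under the Postgres limit. A sketch using the `connection_limit` URL parameter:

```bash
# Each API container opens at most 10 connections to Postgres
DATABASE_URL=postgresql://changemaker:pass@changemaker-v2-postgres:5432/changemaker_v2?connection_limit=10
```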

---

## Cost Optimization

### Resource Allocation

**Right-sizing** (don't over-provision):
- Start with 1 CPU, 1GB RAM per container
- Monitor actual usage (Prometheus)
- Scale based on metrics (not guesses)

**Example** (production workload):
- API: 2 CPUs, 2GB RAM (3 replicas)
- PostgreSQL: 2 CPUs, 4GB RAM
- Redis: 1 CPU, 512MB RAM
- Media API: 2 CPUs, 2GB RAM (2 replicas)

---

### Scaling with Docker Swarm

**Docker Swarm mode** (alternative to Compose) adds rolling updates and service-level scaling. Note that Swarm has no built-in autoscaler, so replica counts are adjusted manually or by an external tool:
```bash
# Initialize swarm
docker swarm init

# Deploy stack
docker stack deploy -c docker-compose.yml changemaker

# Scale the API service to 3 replicas
docker service scale changemaker_api=3

# Update with zero downtime
docker service update --image api:v2.1 changemaker_api
```

**Rolling update and restart policy**:
```yaml
api:
  deploy:
    replicas: 3
    update_config:
      parallelism: 1
      delay: 10s
    restart_policy:
      condition: on-failure
```

---

## Related Documentation

- **[Docker Compose](docker-compose.md)** — Container orchestration
- **[Monitoring Stack](monitoring-stack.md)** — Performance metrics
- **[Nginx Configuration](nginx.md)** — Load balancing
- **[Backup & Restore](backup-restore.md)** — Data protection at scale