Problem
Your production application experiences intermittent 502 Bad Gateway errors during traffic spikes. The errors affect roughly 2-5% of requests when traffic exceeds 500 requests per second. During normal traffic (200 req/s), the application works perfectly. The 502 errors appear randomly across all endpoints and are not reproducible with individual requests.
The Scenario
Your infrastructure consists of:
- NGINX load balancer (or AWS ALB) in front of the application
- 4 application server instances (Node.js/Express) running in Kubernetes pods
- PostgreSQL database with connection pooling via PgBouncer
- Redis for session storage
Investigation has revealed these clues:
- NGINX access logs show 502 errors with upstream response time of 0.000s (upstream never responded).
- Application server logs show no errors at the timestamps of the 502s.
- Kubernetes reports pods are "Ready" and passing health checks.
- During traffic spikes, CPU on app servers reaches 85%, memory stays at 60%.
- Some 502 errors cluster around pod restart events.
- The health check endpoint (/health) returns 200 instantly even when the application is under heavy load.
- The application has a graceful shutdown handler, but it calls process.exit(0) after only 1 second.
Your Task
- Identify all potential causes of the 502 errors based on the clues.
- Explain the interaction between load balancers, health checks, and application lifecycle that creates this problem.
- Fix each identified cause with specific configuration changes and code updates.
- Verify with a load test: describe the load-testing strategy (traffic shape, duration, and pass/fail criteria) that would confirm the fix.
Constraints
- You cannot significantly change the infrastructure topology.
- The fix should handle both gradual scaling and sudden traffic spikes.
- The application must maintain sub-200ms p99 response time under normal load.
- Zero-downtime deployments must continue to work.
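Since some 502s cluster around pod restarts and zero-downtime deployments must keep working, part of any fix typically lives in the pod spec: a preStop pause so the endpoint is removed from load-balancer rotation before SIGTERM arrives, and a termination grace period long enough to drain. A hedged sketch of the relevant fields (names, port, and durations are illustrative, not taken from the scenario):

```yaml
# Illustrative fragment of a Deployment's pod template; values are examples.
spec:
  terminationGracePeriodSeconds: 45   # must exceed preStop sleep + drain time
  containers:
    - name: app
      lifecycle:
        preStop:
          exec:
            # Pause before SIGTERM so endpoints/LB stop routing to this pod.
            command: ["sh", "-c", "sleep 10"]
      readinessProbe:
        httpGet:
          path: /health
          port: 3000
        periodSeconds: 5
        failureThreshold: 2
```

The ordering this buys you: the pod is marked terminating and removed from endpoints, the preStop sleep absorbs the propagation delay, and only then does the application receive SIGTERM and begin draining.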