Problem
Your production application experiences intermittent 502 Bad Gateway errors during traffic spikes. The errors affect roughly 2-5% of requests when traffic exceeds 500 requests per second. During normal traffic (200 req/s), the application works perfectly. The 502 errors appear randomly across all endpoints and are not reproducible with individual requests.
The Scenario
Your infrastructure consists of:
- NGINX load balancer (or AWS ALB) in front of the application
- 4 application server instances (Node.js/Express) running in Kubernetes pods
- PostgreSQL database with connection pooling via PgBouncer
- Redis for session storage
Investigation has revealed these clues:
- NGINX access logs show 502 errors with upstream response time of 0.000s (upstream never responded).
- Application server logs show no errors at the timestamps of the 502s.
- Kubernetes reports pods are "Ready" and passing health checks.
- During traffic spikes, CPU on app servers reaches 85%, memory stays at 60%.
- Some 502 errors cluster around pod restart events.
- The health check endpoint (/health) returns 200 instantly even when the application is under heavy load.
- The application has a graceful shutdown handler, but it calls process.exit(0) after only 1 second.
Your Task
- Identify all potential causes of the 502 errors based on the clues.
- Explain the interaction between load balancers, health checks, and application lifecycle that creates this problem.
- Fix each identified cause with specific configuration changes and code updates.
- Verify with a load test: describe the load-testing strategy (traffic shape, duration, and pass/fail criteria) that would confirm the fix.
Constraints
- You cannot significantly change the infrastructure topology.
- The fix should handle both gradual scaling and sudden traffic spikes.
- The application must maintain sub-200ms p99 response time under normal load.
- Zero-downtime deployments must continue to work.
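Since some 502s cluster around pod restarts and zero-downtime deployments must keep working, part of any fix typically lives in the pod spec: a preStop pause so the endpoint is removed from load-balancer rotation before SIGTERM arrives, and a termination grace period long enough to drain. A hedged sketch of the relevant fields (names, port, and durations are illustrative, not taken from the scenario):

```yaml
# Illustrative fragment of a Deployment's pod template; values are examples.
spec:
  terminationGracePeriodSeconds: 45   # must exceed preStop sleep + drain time
  containers:
    - name: app
      lifecycle:
        preStop:
          exec:
            # Pause before SIGTERM so endpoints/LB stop routing to this pod.
            command: ["sh", "-c", "sleep 10"]
      readinessProbe:
        httpGet:
          path: /health
          port: 3000
        periodSeconds: 5
        failureThreshold: 2
```

The ordering this buys you: the pod is marked terminating and removed from endpoints, the preStop sleep absorbs the propagation delay, and only then does the application receive SIGTERM and begin draining.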