05/12/2026
2AM alert. Container restarting every 30 seconds.
First thing I do?
Check if it’s actually a crash loop. Then straight to the logs, before any guessing, because they usually tell the story faster than assumptions.
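To make "is it actually a crash loop" concrete, here is a minimal Python sketch. It assumes you have already pulled the restart timestamps from wherever your platform records them. The function name, the 3-restart threshold, and the 5-minute window are illustrative choices of mine, not a standard.

```python
from datetime import datetime, timedelta

def looks_like_crash_loop(restarts, min_restarts=3, window=timedelta(minutes=5)):
    """Return True if at least `min_restarts` restarts fall within `window`
    of the most recent restart. Thresholds are hypothetical defaults."""
    if len(restarts) < min_restarts:
        return False
    ordered = sorted(restarts)
    latest = ordered[-1]
    # Count only the restarts close to the latest one, so an old,
    # unrelated restart from hours ago does not count as part of a loop.
    recent = [t for t in ordered if latest - t <= window]
    return len(recent) >= min_restarts

# Example: a restart every 30 seconds, as in the alert above.
start = datetime(2026, 1, 1, 2, 0, 0)
every_30s = [start + timedelta(seconds=30 * i) for i in range(5)]
print(looks_like_crash_loop(every_30s))
```

A handful of restarts spread over hours would come back False here, which is the case where the alert is probably noise rather than a loop.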
Exit code helps narrow it down:
137 → killed with SIGKILL (128 + 9), most often an out-of-memory kill
1 → app crash or bad config
0 → container exiting normally when it shouldn’t
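The exit code list above can be sketched as a small lookup. This is a hypothetical helper for triage notes, not part of any container tooling. It relies on the common convention that codes above 128 mean "killed by signal (code - 128)", and the signal table is deliberately partial.

```python
# Partial signal table; only the ones that show up most in restarts.
SIGNAL_NAMES = {6: "SIGABRT", 9: "SIGKILL", 11: "SIGSEGV", 15: "SIGTERM"}

def describe_exit_code(code: int) -> str:
    """Map a container exit code to a first triage hint (hypothetical helper)."""
    if code == 0:
        return "exited cleanly; check why the main process finished"
    if code == 1:
        return "application error; check logs for a crash or bad config"
    if 128 < code < 256:
        sig = code - 128
        name = SIGNAL_NAMES.get(sig, f"signal {sig}")
        # SIGKILL is what the OOM killer sends, but it is not the only sender.
        hint = " (often the OOM killer; check memory limits)" if sig == 9 else ""
        return f"killed by {name}{hint}"
    return "app-specific exit code; check logs"

for code in (137, 1, 0):
    print(code, "->", describe_exit_code(code))
```

Treat the output as a starting point only. A 137 still needs confirming against the platform's own OOM indicator before blaming memory.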
I also check health checks because sometimes the app is fine, but a failing health check gets it restarted before it has finished starting up.
If production is still down and the root cause isn’t clear fast enough, I roll back first. Service recovery comes before perfect debugging.
And throughout the incident, I keep the team updated. During outages, silence creates more problems.
That’s usually my approach.