Wednesday, October 14, 2020

Using Grafana to monitor pod health, but restarts destroy the pod, making it appear down


My company has recently moved to K8S from PCF, where app instances were simply numbered. If instance 4 is in a bad state and needs a restart, that's fine. But now in K8S, if instance "abcde-fghi" gets messed up, and our "get it running smoothly again" process involves restarting (deleting the pod), we get a new pod with a new name, and the previous one appears to have stopped sending metrics altogether.

This might be a silly question, and I feel like the answer is "rethink it entirely", but I'm a little hung up on how my company does things now compared to how we need to start approaching them.

Discussing with another member of my team, we considered deleting the data from Influx so Grafana doesn't think the pod has gone down, but that takes away our ability to look back at historical data.

As a stop-gap measure, I've switched all our metrics from fill(0) to fill(null) (query sketch below), to at least stop things from alerting nonstop after performing a rolling restart. Naturally, after an issue that's taken some time to get to the root of, we've got some serious junk to clean up on dashboards that break things down by pod: https://ift.tt/3nQ7wh9

Has anybody else got experience with this sort of scenario, and maybe some thoughts on a more sensible approach? I'm certain we'll have to do away with many of our old alerts from our PCF days, but I'm currently a little stumped on what to replace them with.

via /r/kubernetes https://ift.tt/3j3I9Vo
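For reference, the fill change mentioned above is just the last argument of the GROUP BY time() clause in each InfluxQL panel query. A minimal sketch follows; the measurement "app_health", field "up", and tag "pod" are made-up names for illustration, and $timeFilter is the usual Grafana template variable:

    -- Before: gaps after a pod is deleted are filled with 0, which reads as "down" and fires alerts
    SELECT mean("up") FROM "app_health" WHERE $timeFilter GROUP BY time(1m), "pod" fill(0)

    -- After: gaps stay null, so a deleted pod simply stops reporting instead of flatlining at zero
    SELECT mean("up") FROM "app_health" WHERE $timeFilter GROUP BY time(1m), "pod" fill(null)

The trade-off is that fill(null) silences the restart noise but also means a genuinely dead pod just goes quiet rather than showing an explicit zero.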
